What is rate limiting?
Rate limiting is a technique used to control how frequently a user, application, or system can interact with a service. It sets boundaries on the number of requests that can be made within a defined time period (per second, per minute, or per hour, for example). The goal of rate limiting is to prevent any single source of traffic from overwhelming a system.
When a system (a website or application, for example) is overwhelmed, any additional requests to it can be delayed or rejected. Think of someone accessing a webpage - when the page is overloaded with requests, additional requests may either be significantly slowed (the user has to wait for the page to load) or the page may become completely inaccessible. Either way, it is a very poor user experience.
Implementing rate limiting helps ensure that services (like the webpage in the example above) remain stable, responsive, and accessible, even when they are under heavy traffic load.
Why is rate limiting important?
Modern applications and APIs are designed to serve many users at once, globally. Without safeguards in place, excessive or unexpected traffic (from things like viral events, or malicious attacks) can degrade overall performance or cause outages.
Rate limiting helps protect systems from abuse like scraping, brute-force attacks, or denial-of-service attempts. It also promotes fair usage by ensuring that one client cannot monopolize resources at the expense of others. Beyond security, rate limiting plays a key role in maintaining predictable performance and protecting backend infrastructure from sudden spikes.
Think of it as a traffic cop - by controlling the flow of cars, it keeps the streets accessible and safe for all legitimate drivers.
Who uses rate limiting?
Rate limiting is widely used across the internet and within internal systems. Public APIs rely on it to manage traffic from third-party developers, while websites use it to protect login pages and sensitive endpoints. CDNs, edge platforms, and load balancers frequently apply rate limits before traffic ever reaches an origin server.
Even internal services benefit from rate limiting, as it can prevent cascading failures caused by bugs, retries, or unexpected usage patterns within distributed systems.
How does rate limiting work?
At a high level, rate limiting works by tracking incoming requests and comparing them against predefined ‘rules’. These ‘rules’ typically specify how many requests are allowed, over what time window, and how requests are identified (by things like IP address, API key, user account, or authentication token).
When a request arrives, the system checks whether the client is still within its allowed limit. If so, the request proceeds as normal. If not, the system applies the appropriate (pre-defined) policy, rejecting the request, delaying it, or slowing the response. This decision is usually made automatically and in real time.
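To make this concrete, here is a minimal sketch of the check described above, written in Python as a fixed-window counter. Everything here (the class name, the limits, the client identifier) is an illustrative assumption rather than any particular product's implementation:

```python
import time

# Minimal fixed-window rate limiter sketch (illustrative, not a
# production implementation).
class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit            # max requests allowed per window
        self.window = window_seconds  # window length in seconds
        self.counters = {}            # client_id -> (window_start, count)

    def allow(self, client_id: str) -> bool:
        now = time.time()
        window_start, count = self.counters.get(client_id, (now, 0))
        # Start a fresh window once the current one has expired.
        if now - window_start >= self.window:
            window_start, count = now, 0
        if count < self.limit:
            self.counters[client_id] = (window_start, count + 1)
            return True   # within the limit: let the request proceed
        return False      # over the limit: apply the predefined policy

# Example rule: 100 requests per minute, keyed by API key or IP address.
limiter = FixedWindowLimiter(limit=100, window_seconds=60)
if not limiter.allow("client-123"):
    pass  # reject, delay, or slow the response
```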
What is the difference between rate limiting and throttling?
Rate limiting and throttling are closely related concepts, but they are not exactly the same. Rate limiting defines the rules (how many requests are allowed within a given timeframe). Throttling describes how the system behaves once those limits are approached or exceeded.
A service might enforce a strict rate limit but also throttle traffic gradually as usage increases, smoothing out spikes rather than immediately blocking requests. In practice, many systems use both techniques together to balance protection and user experience.
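As a rough sketch of how the two can work together, the function below applies gradual throttling as usage approaches a hard limit. The thresholds and delay are made-up assumptions for illustration:

```python
import time

# Illustrative sketch: rate limiting defines the hard limit, while
# throttling smooths behavior as the limit is approached.
def handle_request(request_count: int, limit: int) -> str:
    usage = request_count / limit
    if usage >= 1.0:
        return "reject"     # hard rate limit exceeded: block outright
    if usage >= 0.8:
        time.sleep(0.5)     # nearing the limit: deliberately slow the response
        return "throttled"
    return "allow"          # well within the limit: proceed normally
```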
What are common rate limiting strategies?
There are several strategies for implementing rate limits, each with different trade-offs. Some approaches use fixed time windows, where requests are counted within clearly defined intervals. Others use sliding windows or token-based models that distribute requests more evenly over time and allow for short bursts of traffic.
More advanced strategies may limit the number of concurrent requests being processed at once, rather than focusing solely on request frequency. The choice of strategy depends on factors like traffic patterns, system performance requirements, and how tolerant the service is of brief spikes.
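For example, here is a minimal token-bucket sketch in Python, one of the token-based models mentioned above (the rate and capacity values are illustrative assumptions). Tokens refill steadily over time and each request spends one token, which is what allows short bursts up to the bucket's capacity:

```python
import time

# Minimal token-bucket sketch (parameters are illustrative).
class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second (sustained rate)
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False

# Example: sustain 10 requests per second, with bursts of up to 20.
bucket = TokenBucket(rate=10, capacity=20)
```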
What happens when a rate limit is exceeded?
When a client exceeds a rate limit, the system typically stops processing additional requests from that client for a period of time. In HTTP-based systems, this is often communicated using a ‘too many requests’ response (HTTP status code 429).
The response may also include information that helps the client recover, like how long to wait before retrying or how much quota remains. Importantly, rate-limited requests are usually not processed at all, which protects the system from unnecessary work.
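From the client's side, well-behaved code can use that information to back off and retry. Below is a sketch using Python's third-party requests library; the URL is a placeholder, and it assumes the Retry-After header, when present, is given in seconds (the header can also carry an HTTP date):

```python
import time
import requests  # third-party HTTP library

# Sketch of a client that honors HTTP 429 responses and the
# Retry-After header (URL and retry counts are placeholders).
def fetch_with_backoff(url: str, max_retries: int = 3):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Wait the server-suggested time, or fall back to exponential backoff.
        retry_after = response.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else 2 ** attempt
        time.sleep(wait)
    raise RuntimeError("rate limit still exceeded after retries")
```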
Can rate limiting block legitimate users?
Yes. If configured poorly, rate limiting can impact legitimate traffic. This is why effective rate limiting is designed with real-world usage patterns in mind. Reasonable limits, allowance for short bursts, and clear feedback to clients all help minimize disruption.
Monitoring and adjusting limits over time is also essential, as traffic patterns evolve and applications grow.
How CDNs use rate limiting
Content Delivery Networks (CDNs) use rate limiting to control how much traffic individual clients can send to applications and origin servers. By enforcing limits at the edge, closer to users, CDNs can stop abusive or excessive requests before they consume backend resources. This helps protect origin infrastructure from overload while keeping applications fast and available for legitimate users.
Rate limiting at the CDN layer is especially effective because it operates at global scale and in real time. CDNs can apply limits based on signals like IP address, geographic region, request path, headers, or authentication tokens. This allows teams to tailor policies for sensitive endpoints like login pages, APIs, or checkout flows without affecting the rest of the site.
In addition to improving reliability, CDN-based rate limiting strengthens security. It helps mitigate brute-force attacks, bot-driven abuse, and traffic spikes caused by misconfigured clients. Because decisions are made at the edge, rate-limited requests are blocked or throttled before they reach the origin, reducing latency for legitimate traffic and lowering infrastructure costs.
How Fastly can help
Fastly enables advanced rate limiting at the edge, allowing organizations to control traffic before it reaches their applications or origin infrastructure. By enforcing limits across Fastly’s global network, customers can stop excessive or abusive requests close to the source, reducing load, latency, and operational risk.
With Fastly, rate limiting rules can be applied using flexible request attributes such as client IP address, request path, headers, geographic location, or authentication signals. This makes it possible to protect high-risk or high-value endpoints (APIs, login pages, or checkout flows) without impacting normal traffic elsewhere on the site.
Fastly’s edge-based approach means rate limit decisions are made in real time and at scale. When a client exceeds a defined threshold, Fastly can block, throttle, or respond immediately, preventing unnecessary requests from consuming origin resources. This not only improves application reliability but also helps mitigate abuse scenarios like brute-force attacks, bot activity, and traffic spikes caused by misconfigured clients.