How To Scale Your Microservices To Millions Of Requests Per Second
Building microservices is the easy part; scaling them to handle serious traffic is the challenging part.
So you’ve built a microservices architecture on the cloud.
It works well, and it's performant and reliable.
But what happens when your user base scales?
Can your application handle large traffic spikes or consistently high request volumes day after day?
Or do your users experience downtime and degraded performance, causing your application's reputation to suffer?
Oftentimes, with high traffic or sudden spikes of requests, your users will get throttled or your application's response times will climb.
The question is: how can we prevent this and allow our application to scale to thousands, hundreds of thousands, and even millions of concurrent requests?
That is no easy feat, but AWS provides you with the necessary services and tools to accomplish this.
Let’s take a look at a few practices we can implement to scale up our applications on the cloud with AWS and allow them to be reliable and performant.
1. Provisioned Concurrency
Serverless functions like AWS Lambda are designed to scale high by default.
However, when your application experiences sudden spikes in traffic or consistently high numbers of requests, latency can often be impacted by cold starts.
Cold starts are common with Lambda functions when a function hasn't run in a while. The underlying execution environment is shut down after some idle time and has to be spun up again when your function receives a new request.
That extra spin-up time is what causes the cold start.
For small-scale applications this is often not an issue, but at larger scales it starts to hurt.
Provisioned concurrency is an effective way to avoid these cold starts. With provisioned concurrency, Lambda allocates execution environments in advance, pre-warming your serverless functions and ensuring consistent performance at high scale.
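As a minimal sketch of what this looks like in practice, here's how you might enable provisioned concurrency on a Lambda alias with boto3. The function name, alias, and concurrency value are placeholders for illustration:

```python
# A minimal sketch: enabling provisioned concurrency on a Lambda alias with boto3.
# The function name, alias, and concurrency value are illustrative placeholders.
import boto3

lambda_client = boto3.client("lambda")

# Keep 100 pre-warmed execution environments ready on the "prod" alias,
# so requests up to that level never hit a cold start.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="checkout-service",
    Qualifier="prod",  # an alias or published version (not $LATEST)
    ProvisionedConcurrentExecutions=100,
)

# Poll the status until it reaches "READY".
response = lambda_client.get_provisioned_concurrency_config(
    FunctionName="checkout-service",
    Qualifier="prod",
)
print(response["Status"])
```

Keep in mind that provisioned concurrency is billed whether or not the pre-warmed environments are used, so size it for your baseline traffic, not your peak.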
With non-serverless computing, like EC2 instances, you can also accommodate high concurrency by using a combination of resources like an elastic load balancer and auto-scaling groups.
The Elastic Load Balancer distributes incoming requests across multiple servers, either evenly or based on each target's current load, routing traffic dynamically to healthy instances.
The Auto Scaling group dynamically adds more server instances as demand grows, ensuring you always have the capacity you need.
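As a rough sketch, assuming you already have an Auto Scaling group (here called web-tier-asg) behind a load balancer, a target-tracking policy like the one below tells AWS to add or remove instances to hold a metric at a target value:

```python
# A rough sketch: attaching a target-tracking scaling policy to an existing
# Auto Scaling group with boto3. The group name, policy name, and target value
# are placeholders for illustration.
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU utilization around 50% by adding or removing EC2 instances.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="keep-cpu-at-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```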
2. Throttling & Caching
Throttling is the practice of limiting requests to your API from an individual client.
Just like traffic lights control the flow of vehicles on a road, throttling limits the number of requests per second from individual clients, preventing overload on Lambda functions.
API Gateway, AWS's managed REST API service, can enforce these request limits so that a sudden surge in traffic doesn't overwhelm your backend infrastructure, avoiding performance degradation or downtime.
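One way to enforce per-client limits on an API Gateway REST API is a usage plan tied to API keys. Here's a rough sketch with boto3; the API id, stage name, key id, and limits are all placeholders for illustration:

```python
# A rough sketch: per-client throttling on an API Gateway REST API via a usage plan.
# The API id, stage name, key id, and limits below are placeholders.
import boto3

apigw = boto3.client("apigateway")

# Allow each client roughly 100 requests/second with bursts up to 200.
plan = apigw.create_usage_plan(
    name="standard-clients",
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],
    throttle={"rateLimit": 100.0, "burstLimit": 200},
)

# Attach an existing API key so the limits apply to that client.
apigw.create_usage_plan_key(
    usagePlanId=plan["id"],
    keyId="example-api-key-id",
    keyType="API_KEY",
)
```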
Finally, caching mechanisms are also very powerful for scaling up. You can use ElastiCache or CloudFront to cache frequently accessed data and content close to end users. This will greatly reduce the load on your servers.
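To illustrate the caching idea, here's a minimal cache-aside sketch against a Redis endpoint, such as the one an ElastiCache cluster gives you. The endpoint, key naming, TTL, and the database lookup are all assumptions for the example:

```python
# A minimal cache-aside sketch against a Redis endpoint (e.g. an ElastiCache
# cluster endpoint). Endpoint, key naming, and TTL are illustrative assumptions.
import json
import redis

cache = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

def load_product_from_database(product_id: str) -> dict:
    # Stand-in for your real database query.
    return {"id": product_id, "name": "example product"}

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"

    # 1. Try the cache first.
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # 2. On a miss, load from the database...
    product = load_product_from_database(product_id)

    # 3. ...and cache it for 5 minutes so repeat reads skip the database.
    cache.setex(key, 300, json.dumps(product))
    return product
```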
3. Load Testing & Monitoring
Before deploying your application to production, it’s important to conduct load testing to understand how your system behaves in a high traffic environment or sudden spike events.
Use load testing tools to simulate millions of concurrent requests and monitor performance metrics such as latency, error rates, and resource utilization.
This will help you identify potential bottlenecks and optimize your architecture accordingly.
AWS CloudWatch can help you a lot with this.
CloudWatch Synthetics allows you to create canaries that monitor your endpoints and APIs by simulating user interactions.
While canaries are primarily intended for synthetic monitoring, you can also leverage them to generate simulated traffic and test how your application performs under load.
You can also use third-party tools such as Loader.io or BlazeMeter to generate high levels of traffic and simulate real-world traffic scenarios.
These tools provide features for distributed load generation, performance monitoring, and reporting.
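If you just want a quick smoke test before reaching for those tools, a small script that fires concurrent requests and records latencies can already reveal obvious bottlenecks. A minimal sketch with asyncio and aiohttp follows; the URL, concurrency, and request count are placeholders, and this is no substitute for a proper load-testing tool:

```python
# A rough load-generation sketch using asyncio + aiohttp.
# The URL, concurrency, and request count are placeholders; dedicated tools
# give far more realistic traffic patterns and reporting.
import asyncio
import time
import aiohttp

URL = "https://api.example.com/health"
CONCURRENCY = 100
TOTAL_REQUESTS = 10_000

async def worker(session: aiohttp.ClientSession, latencies: list, errors: list) -> None:
    start = time.perf_counter()
    try:
        async with session.get(URL) as resp:
            await resp.read()
            if resp.status >= 500:
                errors.append(resp.status)
    except aiohttp.ClientError as exc:
        errors.append(str(exc))
    latencies.append(time.perf_counter() - start)

async def main() -> None:
    latencies, errors = [], []
    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def limited(session: aiohttp.ClientSession) -> None:
        async with semaphore:
            await worker(session, latencies, errors)

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(limited(session) for _ in range(TOTAL_REQUESTS)))

    latencies.sort()
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"requests={len(latencies)} errors={len(errors)} p99={p99:.3f}s")

if __name__ == "__main__":
    asyncio.run(main())
```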
4. Regional & Multi-AZ Deployments
Deploying your cloud resources (databases, servers, and APIs) across multiple regions in the AWS network will provide lower latency and higher availability for your end users.
Replicating your data across multiple availability zones in the same region will help mitigate downtime or failures, letting your application be highly available.
You can configure your Elastic Load Balancer's health checks to automatically route traffic away from failed instances and toward healthy ones.
With DynamoDB you can use Global Tables to replicate your data to multiple regions around the world, providing lower latency data access to your users worldwide.
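For example, with the current version of Global Tables you can turn an existing table into a global one by adding a replica region from boto3. A hedged sketch, where the table name and regions are placeholders and the table must meet the Global Tables requirements (such as having DynamoDB Streams enabled):

```python
# A hedged sketch: adding a replica region to an existing DynamoDB table so it
# becomes a Global Table. Table name and regions are placeholders; the table
# must meet Global Tables requirements (e.g. DynamoDB Streams enabled).
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "eu-west-1"}},
    ],
)
```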
For static assets in S3, you can enable Cross Region Replication (CRR) to replicate your static files across multiple regions as well.
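A rough sketch of enabling CRR on a bucket with boto3 follows. The bucket names, IAM role ARN, and account id are placeholders, and both the source and destination buckets must have versioning enabled:

```python
# A rough sketch: enabling Cross-Region Replication on an S3 bucket.
# Bucket names, the IAM role ARN, and the account id are placeholders; both
# source and destination buckets must have versioning enabled.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="assets-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-all-assets",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},
                "Destination": {"Bucket": "arn:aws:s3:::assets-eu-west-1"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
            }
        ],
    },
)
```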
With Amazon Route 53 you can also make your application more globally resilient, using routing policies such as latency-based routing and DNS failover, while its DNS layer is protected from DDoS attacks by AWS Shield Standard.
5. Advanced Architectural Patterns
Lastly, you can implement advanced architectural patterns such as event-driven architectures and distributed caching to further scale and optimize your application.
For example, services like AWS SQS can be used to send asynchronous messages between components.
Instead of having services perform transactions synchronously, which slows everything down, you can use SQS to process those transactions asynchronously, making them far more scalable.
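For instance, instead of calling a downstream service directly, a producer can drop the work onto a queue and return immediately. A minimal sketch, where the queue URL and payload are placeholders:

```python
# A minimal sketch: publishing work to SQS instead of calling a downstream
# service synchronously. The queue URL and payload are placeholders.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/order-processing"

def submit_order(order: dict) -> None:
    # Enqueue the order and return right away; a consumer (e.g. a Lambda
    # function) processes it asynchronously at its own pace.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(order),
    )

submit_order({"orderId": "1234", "items": ["book", "pen"]})
```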
You can also use Amazon ElastiCache for distributed caching to handle large volumes of data and improve performance.
Event-driven architectures can be very effective for extreme-scale applications. You can use AWS Lambda to react to events, such as new messages arriving on a queue, which decouples your services while still processing every transaction.
With Lambda, you have the added benefits of serverless functions which can already scale high by default.
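Putting the two together, a Lambda function subscribed to the queue through an event source mapping receives batches of messages as events. A minimal handler sketch, where the processing step is a placeholder:

```python
# A minimal sketch of a Lambda handler consuming SQS messages via an
# event source mapping. The processing step is a placeholder.
import json

def process_order(order: dict) -> None:
    # Placeholder for your business logic.
    print(f"Processing order {order.get('orderId')}")

def handler(event, context):
    # SQS delivers messages in batches under event["Records"].
    for record in event["Records"]:
        order = json.loads(record["body"])
        process_order(order)
```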
Finally, don’t forget to consider costs at every layer of your solutions architecture, especially at such high scales.
The more your application scales, the higher your costs will be.
Even though Lambda remains cheap, databases like DynamoDB can raise costs significantly at high concurrency.
To learn how to keep costs low while scaling high, I recommend you read this article.
Conclusion
Scaling an application to handle millions of requests per second requires a combination of architectural best practices and AWS services.
By leveraging AWS tools such as provisioned concurrency, throttling, caching, and load testing, you can optimize performance and reliability under high traffic conditions.
Additionally, deploying your application across multiple AWS regions, implementing advanced architectural patterns like event-driven processing, and using distributed caching can also enhance scalability and resilience.
With these practices in place, your application can scale to millions of concurrent requests while maintaining the reliability and performance your end users expect.
👋 My name is Uriel Bitton and I hope you learned something of value in this article of The Serverless Spotlight.
🔗 Please consider sharing it with your network to help others learn as well.
😍 You can also explore my social media links here:
✍️ *my blog website is coming soon, so stay tuned for that.
🙌 I hope to see you in next week's edition!
Uriel