Backend Performance Testing & Capacity Planning

Why Should I?

Imagine for a moment that you’ve been working tirelessly for months on a shiny new product with killer features. This one product alone will drive a 20% revenue bump in the year following launch. You’ve written maintainable code, great unit tests, and have an automated deploy pipeline that makes rapid iteration a joy. The product launches. People are installing the app. Active user count is climbing fast. 1-star reviews are flooding in…wait, what? “I can’t log in,” “all I see are 504 timeouts,” “trash doesn’t work”…oh dear. You didn’t do your capacity planning.

Disambiguation of Terms

The term “performance testing” tends to be used as a sort of catch-all; but really, it should be broken down into a few distinct categories. You can name the categories whatever you want, but for this article we’ll be using the definitions below:

  1. Performance testing
    • You are measuring ‘how does this thing perform, under various specific circumstances.’ Can be thought of as “metric sampling.” 
  2. Stress testing
    • You are measuring ‘how does this thing respond to more load than it can handle,’ and ‘how does it recover.’ Can be thought of as “resiliency testing” or “failure testing.” 
  3. Load testing
    • You are measuring ‘given a specified amount of load, is the response acceptable?’ Can be thought of as a “unit test.”  

Given the above definitions, you may note that a “stress test” is in fact a “performance test,” simply executed at unreasonable load levels. Similarly, a “load test” can be a composition of the other two, with the additional step of asserting a pass or fail.


What do you need in order to perform these kinds of tests? The answer will of course vary (significantly) from case to case. We can, however, make generalizations about what is needed for the vast majority, especially given our end goal of capacity planning.

Production-Equivalent Infrastructure

The first step is to ensure that the environment you are performance testing against is provisioned the same as production: same instance class, same storage amounts, same everything. This includes any dependencies of your application, such as databases or other services. This is the most important prerequisite. Testing on your laptop or an under-provisioned QA environment may give you directional information, but it is NOT sufficient for making capacity planning decisions.

Testing Tools

It is usually a good idea to use tools that make it easy to record and version control your test cases. For instance, Locust test cases are simply Python classes; this is perfect for code review, and allows for rapid customization of the tests (e.g., randomizing a string without reuse). In contrast, JMeter uses an XML format which is significantly more difficult to code review, and generally must be edited from within its own GUI tool. Whatever tool you choose should be quick and easy to set up and provision. We want our effort to go into the testing, not into fighting the tools.

Distributed Tracing

Distributed tracing allows you to ‘follow’ an event throughout your system, even across different services, and view metrics about each step along the way. Distributed tracing is strongly recommended, as it lets you review the data after your test and rapidly zero in on problem areas without needing additional diagnostic steps. It may even help you uncover defects that you otherwise would not have noticed, such as repeated or accidental calls to dependencies.

Achievable, Concrete Goal

As a group, you must come up with a goal for the system to adhere to. This goal should be expressed as a number of requests per unit of time. You may also wish to include an acceptable failure rate; for example, 5000 requests per second with a failure rate of at most 2%. The business MUST be able to articulate what success looks like. That 20% revenue projection? It came from something tangible: active users performing specific tasks over time. Non-technical stakeholders may balk at being asked to provide concrete non-revenue numbers. It’s important to partner with these individuals and help them understand where the revenue goal comes from, and how these concrete numbers impact the likelihood of success.
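
A goal phrased as a request rate plus a failure budget can be checked mechanically. The snippet below is only a sketch of that idea: `LoadGoal` and its field names are hypothetical, throughput is assumed to count every request (failed or not), and the run figures are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadGoal:
    """Hypothetical container for a throughput goal; names are illustrative."""
    requests_per_second: float
    max_failure_rate: float  # fraction of requests, e.g. 0.02 for 2%

    def is_met(self, total_requests: int, failed_requests: int, duration_s: float) -> bool:
        """Return True when observed throughput and failure rate satisfy the goal."""
        if total_requests <= 0 or duration_s <= 0:
            return False
        throughput = total_requests / duration_s          # counts all requests
        failure_rate = failed_requests / total_requests
        return (throughput >= self.requests_per_second
                and failure_rate <= self.max_failure_rate)


goal = LoadGoal(requests_per_second=5000, max_failure_rate=0.02)

# A 60-second run that served 303,000 requests with 4,500 failures (made-up numbers).
print(goal.is_met(total_requests=303_000, failed_requests=4_500, duration_s=60))
```

Writing the goal down in this form also makes it obvious when one half of it is missing, such as a request rate with no failure budget.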

Revenue is effectively the ANSWER to a complex math problem with multiple input variables; we must reverse-engineer the equation to solve for one of those inputs. If that 20% revenue projection is based on nothing but hopes and dreams then just do something reasonable. More on this in the Partner section below.


So, how do we go about doing all of this? Instrument, Provision, Explore, Extrapolate, Partner, Assess, Correct. IPEEPAC! This acronym is my contribution to our industry’s collection of awful buzzwords. You are quite welcome.

Instrument

We can begin with instrumenting the application. The specifics of how to do this with any given framework/vendor could be an entire article by itself, so for our purposes we’ll stick with some fairly agnostic terminology and general approaches. You’ll need to identify a vendor/product that supports your tech stack and follow their instructions for actually getting it online. They will likely use the terms below, or something similar.

  • Span
    • An ‘individual operation’ to be measured
      • Spans can, well, span multiple physical operations; for example a span could be a function call, which actually makes several API calls
      • Can be automatically generated or user-defined
    • May have many attributes, but typically we care about the elapsed time
    • Can contain other spans
  • Trace
    • An end-to-end collection of spans
    • The complete picture of everything that went on from the moment a request enters the application until the response is transmitted and the trace is closed
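
To make the vocabulary concrete, here is a toy model of spans and traces using only the Python standard library. It is not any vendor’s SDK; `Span`, `Trace`, and the attribute names are invented for illustration, and the sleeps stand in for real dependency calls.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Span:
    """One measured operation; may carry attributes and contain child spans."""
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)
    elapsed_s: float = 0.0


class Trace:
    """End-to-end collection of spans for a single request (toy model)."""

    def __init__(self, name: str) -> None:
        self.root = Span(name)
        self._stack: List[Span] = [self.root]

    @contextmanager
    def span(self, name: str, **attributes: object) -> Iterator[Span]:
        """Open a child span under the currently active span and time it."""
        child = Span(name, dict(attributes))
        self._stack[-1].children.append(child)
        self._stack.append(child)
        start = time.perf_counter()
        try:
            yield child
        finally:
            child.elapsed_s = time.perf_counter() - start
            self._stack.pop()


trace = Trace("GET /profiles/42")
with trace.span("load_profile", user_id=42):       # user-defined span
    with trace.span("sql.query", table="profiles"):
        time.sleep(0.01)                            # stand-in for a database call
    with trace.span("http.get", service="avatars"):
        time.sleep(0.01)                            # stand-in for a downstream API call

for child in trace.root.children[0].children:
    print(child.name, f"{child.elapsed_s * 1000:.1f} ms")
```

The outer `load_profile` span contains the two dependency spans, so its elapsed time covers both calls plus whatever application code runs around them.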

Here’s an example visualization of a trace and its associated spans, taken from the Jaeger documentation:

Trace Visualization Example

It’s evident why this is useful. We see every operation performed for the given request, how long it took, and even if things are happening in series or in parallel. We could even see what queries are being executed against the database, as long as the tooling supports it.

Most instrumentation tools will provide you with some form of auto-instrumentation out of the box, whether it’s an agent which hijacks the JVM or a regular SDK that you bootstrap into your application at startup. Many times this will be sufficient for our purposes, but always verify that this is the case. At a minimum, we need to ensure that we are getting the overall trace and spans for every individual dependency, be it HTTP, SQL, SMTP, or what have you. This is enough to give us a rough idea of where problems lie. Mostly, it will help us to identify if the source of slowdown is our application code, some dependency call, or a combination thereof.

Ideally we would want some finer-grained detail. It’s typically good to add spans, or enrich existing ones with contextual information. For instance, if you have a nested loop making API calls, it would be good to have a span encompassing just that loop so that you can easily see how much execution time it takes up without needing to sum up the dependencies. You could also add metadata about the request to that span; perhaps certain parameters are yielding a much larger list to iterate over.

Provision

Provisioning is straightforward: create your compute and/or other resources, be they cloud or on-prem. Provision your production analog and your test tooling. Verify that the tooling works as expected and is able to reach the application under test.

Importantly, DISABLE any WAF, bot detection, rate limiting, etc. These are vital for an actual production environment, but will make our testing difficult or impossible. Remember what we are testing here – it’s our application and architecture, not your cloud vendor’s fancy AI intrusion detection.

Explore

This is where the party starts. For right now, disable autoscaling and lock yourself to a single instance of the application. We’ll get back to this, but in the meantime, let’s start crafting some test cases. Unfortunately, there is a lot of art mixed into the science here. We want to be as thorough as possible. Ideally, we would like to exercise every operation the application may perform; however, this isn’t always feasible. There may be too many permutations, or some operations may require proper sequencing in order to execute at all. So what to do?

Start by separating the application into its different resources, and their possible operations. Be sure to include prerequisites such as user profiles. For example, you can fetch or update a single user’s profile or search for multiple profiles…but you can’t really do any of that until a user has been registered, because the registration process creates the profile.


  • Register user <randomized guid + searchable tag>
    • Fetch profile <id>
    • Search profiles matching <tag>
    • Update profile <id>

We could have a test that does exactly that sequence of events. While this does exercise the code, it conflates separate concerns and also misses some points:

  • This exercises a large number of users registering/viewing/updating their profiles at once
  • The longer the test runs, the larger the list of profiles that will be found/returned for the tag search

Let’s try again:

  • Register user
    • Fetch profile
    • Update profile
  • Search profiles
    • Fetch a subset of these individually

This more closely resembles what actual traffic would look like. People generally register, go to their profile, then update it. Other users will generally search for something, and then view it. Taken together, this gives us a better picture of how our user experience will progress as we add more users to the system.

However, we’re still missing something. While we are exercising both read and update, we are doing so more or less sequentially. So we might want to add an additional, separate test:

  • Register n users
  • In parallel
    • Fetch individual profiles
    • Update individual profiles
    • Search by tag

Now, taken at face value this may not seem like a relevant test. Users do not generally update their profiles in a massively parallel manner. However, what this test does tell us is how our application responds to multiple, potentially conflicting, potentially LOCKING operations happening simultaneously.
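
The shape of that parallel test might look something like the sketch below. Everything in it is a stand-in: `FakeProfileApi` is a hypothetical in-memory substitute for the application under test (a real test would make HTTP calls through your load tool), and the single lock is there only to mimic contention between operations.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


class FakeProfileApi:
    """In-memory stand-in for the application under test (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._profiles: Dict[int, Dict[str, str]] = {}

    def register(self, user_id: int, tag: str) -> None:
        with self._lock:
            self._profiles[user_id] = {"tag": tag, "bio": ""}

    def fetch(self, user_id: int) -> Dict[str, str]:
        with self._lock:
            return dict(self._profiles[user_id])

    def update(self, user_id: int, bio: str) -> None:
        with self._lock:
            self._profiles[user_id]["bio"] = bio

    def search(self, tag: str) -> List[int]:
        with self._lock:
            return [uid for uid, p in self._profiles.items() if p["tag"] == tag]


def run_parallel_mix(api: FakeProfileApi, users: int, operations: int, workers: int) -> int:
    """Register users up front, then run a random mix of operations in parallel."""
    for user_id in range(users):
        api.register(user_id, tag="load-test")

    rng = random.Random(7)  # fixed seed so runs are repeatable
    plan: List[Tuple[str, int]] = [
        (rng.choice(("fetch", "update", "search")), rng.randrange(users))
        for _ in range(operations)
    ]

    def execute(step: Tuple[str, int]) -> str:
        op, user_id = step
        if op == "fetch":
            api.fetch(user_id)
        elif op == "update":
            api.update(user_id, bio=f"bio for user {user_id}")
        else:
            api.search("load-test")
        return op

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return len(list(pool.map(execute, plan)))


completed = run_parallel_mix(FakeProfileApi(), users=50, operations=500, workers=16)
print(f"{completed} operations completed")
```

Registering all users before the parallel phase keeps the data set a fixed size, so any slowdown we observe comes from contention rather than from a growing list of profiles.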

If we apply this thinking to all the operations the application may perform, we could end up with an overwhelming number of test cases, but we can pare things down. We should identify the so-called ‘golden path’ of the application; the sequence of operations that most people perform most of the time. This should be a relatively small slice of functionality, spread across a few areas. We can be very thorough about exercising this functionality. Then, we use our knowledge of the application to identify other areas which we expect may be problematic, or areas that we simply don’t have any coverage on at all. This is where the art vs science really comes into play.

Once we have the test cases, we can start executing them. It is generally advisable to start small and increase relatively rapidly until you start seeing issues. For example, begin at 10 operations/second and keep doubling until you see excessive slowdown or errors. Once you do, dial it back to the last checkpoint and make small adjustments until you find the breaking point; this is your single-instance capacity. Note this down, it’s important. Also note down CPU and memory consumption at this point, as well as average and max response times for any dependency calls (you did implement tracing, right?).
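
The ramp-then-refine procedure can be sketched as a small search routine. A few assumptions to flag: `is_healthy` is a hypothetical callback that would run a test at the given rate and judge the result, the refinement step uses bisection as one possible way of making “small adjustments,” and the system is assumed to stay unhealthy at every rate above its breaking point.

```python
from typing import Callable


def find_capacity(is_healthy: Callable[[int], bool], start: int = 10, tolerance: int = 5) -> int:
    """Double the rate until the system degrades, then narrow in on the limit.

    `is_healthy(rate)` is a hypothetical callback that runs a test at `rate`
    operations/second and reports whether slowdown and errors stayed acceptable.
    Returns the highest rate observed healthy, within `tolerance` ops/second.
    """
    if not is_healthy(start):
        raise ValueError("system is unhealthy even at the starting rate")

    last_good = start
    rate = start * 2
    while is_healthy(rate):                     # ramp phase: keep doubling
        last_good = rate
        rate *= 2

    first_bad = rate
    while first_bad - last_good > tolerance:    # refine phase: bisect the gap
        mid = (last_good + first_bad) // 2
        if is_healthy(mid):
            last_good = mid
        else:
            first_bad = mid
    return last_good


# Simulated system that copes with up to 730 ops/second (made-up figure).
capacity = find_capacity(lambda rate: rate <= 730)
print(capacity)
```

In practice each call to `is_healthy` is a full test run, so the doubling phase matters: it reaches the neighborhood of the breaking point in a handful of runs rather than dozens.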

A common question at this point is “what counts as excessive slowdown or errors?” Unfortunately, it’s really case-by-case. You could assume as a starting point that your 10 operations/second performance is ‘acceptable,’ and that once your response time or error rate doubles, it no longer is. You could also assume that it’s fine until the ratio of failed requests to successful ones is greater than 50%. At this stage in the process, any of this is fine; we are exploring, after all. We’ll return to the topic later.
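
As a sketch, the “doubling from baseline” rule of thumb might be written like this. `Sample` and its fields are hypothetical names, and the 1% floor on the error limit is an extra assumption added here so that a baseline with zero errors does not make a single failed request count as unacceptable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Summary of one test run; field names are illustrative."""
    avg_response_ms: float
    error_rate: float  # failed requests / total requests


def is_acceptable(baseline: Sample, current: Sample, factor: float = 2.0) -> bool:
    """Exploration-phase rule of thumb: unacceptable once response time or
    error rate reaches `factor` times the baseline measurement."""
    if current.avg_response_ms >= baseline.avg_response_ms * factor:
        return False
    # Assumed floor: with a near-zero baseline, tolerate up to 1% errors.
    error_limit = max(baseline.error_rate * factor, 0.01)
    return current.error_rate < error_limit


baseline = Sample(avg_response_ms=120, error_rate=0.001)   # measured at 10 ops/second
print(is_acceptable(baseline, Sample(avg_response_ms=200, error_rate=0.005)))
print(is_acceptable(baseline, Sample(avg_response_ms=260, error_rate=0.005)))
```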

Extrapolate

Now that we’ve done some exploration, we can make some educated extrapolations. We have our single-instance capacity. We know that a high-availability application generally wants a MINIMUM of 3 instances, spread across availability zones, so we can extrapolate that this setup should handle 3x our single-instance capacity. So go and set that up. Re-run the exploration and record the results. Do they align with our extrapolation? If yes, great success. If not, we need to understand why and potentially make changes so that they DO align.
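
Comparing the extrapolation against the measurement is simple arithmetic. The function name and the figures below are made up for illustration.

```python
def scaling_efficiency(single_instance_capacity: float, instances: int,
                       measured_capacity: float) -> float:
    """Ratio of measured cluster capacity to the linear extrapolation."""
    expected = single_instance_capacity * instances
    return measured_capacity / expected


# Made-up numbers: one instance handled 730 ops/second, three handled 1,900.
efficiency = scaling_efficiency(730, 3, 1900)
print(f"expected {730 * 3} ops/s, measured 1900 ops/s, efficiency {efficiency:.0%}")
```

A ratio well below 1.0 suggests something other than the application instances is limiting throughput, which is exactly the kind of discrepancy worth investigating before going further.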

This is also a good time to do a stress test. Push those 3 instances until they become unresponsive, then back the traffic down to ‘reasonable’ levels and see how long they take to recover. Again, note it down. 
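
One way to put a number on “how long they take to recover” is to look for the start of the final healthy streak after the load was reduced. This is a sketch under assumptions: the samples are (timestamp in seconds, response time in milliseconds) pairs in time order, and the 200 ms threshold is arbitrary.

```python
from typing import List, Optional, Tuple


def recovery_time_s(samples: List[Tuple[float, float]], load_reduced_at: float,
                    acceptable_ms: float) -> Optional[float]:
    """Seconds from load reduction until response times return to, and stay
    at, acceptable levels. Returns None if the system never recovers."""
    recovered_at: Optional[float] = None
    for timestamp, response_ms in samples:
        if timestamp < load_reduced_at:
            continue
        if response_ms > acceptable_ms:
            recovered_at = None            # still (or again) unhealthy
        elif recovered_at is None:
            recovered_at = timestamp       # first sample of a healthy streak
    return None if recovered_at is None else recovered_at - load_reduced_at


# Made-up samples: load is pushed from t=10 and reduced at t=20.
samples = [(0, 100), (10, 5000), (20, 4800), (30, 900), (40, 180), (50, 150)]
print(recovery_time_s(samples, load_reduced_at=20, acceptable_ms=200))
```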

It would be wise to take a look at the tracing data here. It will likely help you identify the reason for any discrepancy between our prediction and the actual results. For example, more traffic may be going to particular instances, or database queries may show increased latency because the database is in a different availability zone than two of the instances.

Partner

Now that we have some baseline data, we need to figure out what to do with it. The business has revenue goals, and our product owner almost certainly knows what an acceptable user experience “feels” like even if they can’t (yet) articulate numbers. This is a good time to get the product owner (or other relevant stakeholders) involved in the process.

  • Start by sharing the baseline, 10 requests/second numbers. Show how the application responds here; people will usually have opinions on whether this “feels okay” or not.
    • If not, we have a major problem
  • Next we need to determine how slow is too slow, and/or how many errors is too many
    • We can start by sharing our exploration assumptions – everyone may be fine with it
    • This can be challenging – if possible it would be good to mock up the application so you can configure a delay
    • Can also relate it to other activities: “it loads before you finish reaching for the coffee,” or “you can take a long swig and put the cup down.”
  • Once we know what is acceptable vs not, we can configure autoscaling accordingly
    • It needs to kick in well BEFORE we hit the “unacceptable” mark
    • Remember that it can take tens of seconds to a few minutes for new instances to come online, depending on your hosting choices
    • A fair starting point is to set your scaling to about 65% of “unacceptable”; you can tweak this higher or lower as needed
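
The 65% starting point, and the time it buys for new instances to come online, can be worked out as below. The function names, the 700 ops/second “unacceptable” mark, and the growth rate are all assumptions for illustration.

```python
def scale_out_threshold(unacceptable_load: float, fraction: float = 0.65) -> float:
    """Per-instance load at which autoscaling should start adding capacity."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return unacceptable_load * fraction


def seconds_of_headroom(unacceptable_load: float, threshold: float,
                        growth_per_second: float) -> float:
    """Time between the scaling trigger and the unacceptable mark, assuming
    load climbs at a steady `growth_per_second`."""
    return (unacceptable_load - threshold) / growth_per_second


threshold = scale_out_threshold(700)     # "unacceptable" assumed at 700 ops/second
headroom = seconds_of_headroom(700, threshold, growth_per_second=2)
print(f"scale out at {threshold:.0f} ops/s, leaving about {headroom:.0f} s of headroom")
```

If the headroom is shorter than the time a new instance needs to come online, lower the fraction.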

Now the truly difficult part begins: working with the stakeholders to determine how many requests per second we need to be able to handle. Quite frequently the first answer is “all of them.” We know this is not possible; the cloud is not magic, and even if it were, the price tag would far exceed the revenue goal.

We can start with what we do know. Our “golden path” defines certain operations, and we know how many requests those need, and how long they take. We can get a ballpark requests/second for a single user based on this count, over some amount of time.

Next we need to know how many active users to expect. This is going to be entirely case-by-case. We may be able to do relatively simple math in the vein of “x transactions at y, average user makes z transactions in a week”…or perhaps we have traffic levels for an existing product, and the business projects “n” times that much traffic. Reality is seldom so simple. 

Most likely we’ll need to really get deep into the business case for the product, how the revenue goal is determined, and derive some way to associate this to a number of simultaneous users. It’s very important to do this in collaboration with the stakeholders so that everyone shares ownership of this estimate. The goal is not to point fingers at so-and-so making bad estimates, but to make a good estimate and to learn from any mistakes together. Going through this exercise may even help the business to make better, more informed goals in the future.

Given an average user’s requests/second and our estimated active user count, the simplest solution is to multiply them. In theory this should give us our projected “sustained load.” Depending on the application, we may also need to identify the acute load, or the load when experiencing peak traffic; for instance, we expect a restaurant’s traffic to spike hugely during the lunchtime hour. Experience has shown that it is often prudent to double (or more than double) this number for launch, and then hopefully rein it in after the initial surge has died down. Once you obtain real-world average traffic you can set your minimum instance count to support that, and then set the maximum to the greater of 2x the new minimum or 2x our original projection.
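
Here is that arithmetic worked through end to end. Every figure is invented for illustration (requests per user, active users, usable capacity per instance, observed traffic), and the helper names are hypothetical.

```python
import math


def sustained_load(requests_per_user_per_second: float, active_users: int) -> float:
    """Projected steady-state requests/second."""
    return requests_per_user_per_second * active_users


def instances_needed(load_rps: float, usable_capacity_per_instance: float,
                     minimum: int = 3) -> int:
    """Instances required for a load, never fewer than the HA minimum."""
    return max(minimum, math.ceil(load_rps / usable_capacity_per_instance))


# All figures below are made up for illustration.
per_user = 12 / 60                # golden path: 12 requests over roughly one minute
projected = sustained_load(per_user, active_users=20_000)
launch = projected * 2            # doubled for the launch surge
usable = 455                      # assumed usable requests/second per instance

launch_instances = instances_needed(launch, usable)

# After launch: floor sized to observed traffic, ceiling at the greater of
# 2x that floor or the instances needed for 2x the original projection.
observed = 2_500                  # hypothetical real-world average requests/second
min_instances = instances_needed(observed, usable)
max_instances = max(2 * min_instances, instances_needed(2 * projected, usable))

print(launch_instances, min_instances, max_instances)
```

Dividing by the usable capacity (the scale-out point) rather than the breaking point keeps each instance below the level where users start to notice.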

It is worth noting that the projected numbers may be very/too expensive to operate. Options are limited here:

  • Attempt optimization of the application
    • May or may not be possible/make a noticeable difference
  • Accept the cost
  • Don’t launch
  • Re-estimate – for instance, consider that not all time zones will be active at the same time
    • This is a huge risk, and should be avoided if at all possible. Any scrum team will tell you that “you need to size this smaller” rarely ends well
  • Phased rollout – launch to one geography or subset of users, make adjustments and projections based on the traffic seen there

Assess

Assessment is the easy part: execute tests at our expected load, and work all the way up to maximum capacity. We want to see the autoscaling trigger and maintain acceptable performance. Document the results, and once again poke through the tracing looking for any anomalous behavior. We want to be sure to identify and document any potential issues at this point. For instance, a third-party dependency may rate-limit us before we hit maximum capacity, or perhaps our database doesn’t have enough connection slots.

Correct

Based on the assessment, correct the issues and re-assess. If there’s anything that cannot be resolved, document that issue, and have the stakeholders sign off on it before launch.

Infrastructural Notes

It is important to recognize that your choice of hosting has a huge impact on both how scalable your application is, and how you must configure that scalability. For instance, a basic deployment into an autoscaling group may allow you to scale based on CPU consumption. An advanced Kubernetes deployment with custom metrics may allow you to scale based on the rate of inbound HTTP requests. Deploying into some cloud PaaS may be anywhere in between, depending on the vendor. Why does this matter?

Capacity is more than CPU+RAM

It cannot be assumed that high load ALWAYS corresponds to significant processor or memory utilization. It is a reasonable assumption for perhaps the majority of use cases, however it is certainly not a universal truth. If an application is primarily I/O bound, then it is possible to completely overwhelm it without significant processor or memory spikes. Consider the increasingly common middle-tier API. It simply makes calls to other APIs, and amalgamates them into a single response for the consumer. Most of its time is actually spent idle, waiting for responses from other applications. If its dependencies are slow, we could consume the entire threadpool with requests, and return 503/504 status codes without seeing a CPU spike. Ironically, if many of those responses were to complete all at once we might see a massive CPU spike, which could render our service entirely unresponsive, but also would be so brief as to not trigger a CPU scaling rule which typically requires a sustained load over a period of minutes.
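
The thread pool scenario follows from Little’s law (requests in flight = arrival rate × time each request spends in the system). The pool size and latencies below are made-up numbers for a hypothetical middle-tier API.

```python
def max_throughput(pool_size: int, avg_latency_s: float) -> float:
    """Little's law rearranged: the most requests/second a fixed pool of
    blocking threads can complete when each request takes `avg_latency_s`."""
    return pool_size / avg_latency_s


def threads_in_use(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Average number of threads busy at a given arrival rate and latency."""
    return arrival_rate_rps * avg_latency_s


pool = 200                          # hypothetical request thread pool size
print(max_throughput(pool, 0.1))    # dependencies answer in 100 ms
print(max_throughput(pool, 2.0))    # dependencies slow down to 2 s
print(threads_in_use(500, 2.0))     # threads needed for 500 requests/second at 2 s
```

When dependency latency rises from 100 ms to 2 s, the same pool can complete only a twentieth of the requests, and the threads spend that time waiting rather than burning CPU. A CPU-based scaling rule would never notice.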

Bandwidth must also be considered. Our CPU and RAM may fit nicely within the most basic instance class, but if the virtual NIC is insufficient for the amount of data we are dealing with, we will find ourselves bottlenecked and once again not scaling. Ditto for disk IOPS. The long and short of it is we must look holistically at the application’s workload, and EVERYTHING it needs to accomplish that workload when capacity planning.
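
A quick back-of-the-envelope check for the network case, counting response payloads only. The request rate, payload size, and NIC capacity are illustrative assumptions, not figures for any particular instance class.

```python
def required_bandwidth_mbps(requests_per_second: float, avg_payload_bytes: float) -> float:
    """Network throughput, in megabits/second, needed to serve the payloads."""
    return requests_per_second * avg_payload_bytes * 8 / 1_000_000


def is_network_bound(requests_per_second: float, avg_payload_bytes: float,
                     nic_mbps: float) -> bool:
    """True when payload traffic alone exceeds the NIC's capacity."""
    return required_bandwidth_mbps(requests_per_second, avg_payload_bytes) > nic_mbps


# 2,000 requests/second returning 250 KB each, against a 1,000 Mbps virtual NIC.
print(required_bandwidth_mbps(2_000, 250_000))
print(is_network_bound(2_000, 250_000, nic_mbps=1_000))
```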

A caution about serverless

There is nothing inherently wrong with serverless/lambda/etc as a hosting choice. There is, however, a fair bit wrong with assuming that it is always a GOOD solution to your scalability problems. Like any and all other tools, we need to understand how it works, what it is good at doing, and what problems it may actually introduce into the system. Serverless is perhaps one of the most frequently misunderstood and/or misused hosting options, especially when it comes to REST APIs.

Serverless typically operates by creating a new instance of your application for an inbound request whenever all current instances are busy processing their own requests. This gives the application massive horizontal scalability and parallel processing capacity, but has some limitations.

  • Cold starts
    • Booting a new instance of your application introduces latency for the request that triggered it; for instance, Java/Spring applications may take multiple seconds to start up
    • Many providers offer a service tier to reduce cold starts – at additional cost
  • Massive is not infinite
    • There IS an upper bound to the number of instances; AWS, for example, by default limits you to 1000 concurrent Lambda executions per Region
    • So yes, if they share an account and Region, your dev environment may be consuming execution ‘slots,’ preventing production from operating at capacity
  • Larger serverless instances tend to be costly, and they may accrue cost both per invocation and for execution time
  • It can be difficult to get your application into a serverless platform
    • Depending on the vendor, there are often size limits for your executable or other restrictions which need to be worked around

The parallel nature of serverless can also introduce new problems.

  • Your database’s connection slots can rapidly be exhausted because every serverless instance will consume at least one slot; connection pooling code only applies within each instance
  • Internal caching of dependency requests only happens within each instance, increasing the number of calls to your dependencies
  • Backoff/retry logic will also only happen within each instance
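
The connection slot problem is easy to see with a little arithmetic. In the sketch below, the instance count matches the default concurrency limit mentioned earlier, while the database connection limit and the pool sizes are hypothetical.

```python
def connections_required(concurrent_instances: int, pool_size_per_instance: int = 1) -> int:
    """Database connections held when every instance keeps its own pool."""
    return concurrent_instances * pool_size_per_instance


def exceeds_database_limit(concurrent_instances: int, pool_size_per_instance: int,
                           max_connections: int) -> bool:
    """True when the instances together want more connections than the database allows."""
    return connections_required(concurrent_instances, pool_size_per_instance) > max_connections


# 1,000 concurrent instances with one connection each, against a database
# that accepts 500 connections (hypothetical limit).
print(connections_required(1_000, 1))
print(exceeds_database_limit(1_000, 1, max_connections=500))
```

A per-instance pool cannot help here, because no instance can see the connections held by the others.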

The thing to note here is that simply hosting on serverless does not necessarily increase your capacity to handle inbound requests. In fact, the parallel nature of serverless may reduce capacity overall. To reap the benefits, not only must your application be designed with serverless in mind, but also your dependencies must be able to handle it as well.

Wrapping Up

Application performance and scalability is a huge and broad topic, and this article is really just hitting some of the highlights. While we’ve outlined a sort of methodology and a whole lot of steps and things to watch out for, you don’t have to do everything at once. Start small: if you can only do one thing, add the instrumentation. You might be surprised how much you learn. If you can do two things, instrument and partner with the stakeholders to work out what the goal should be.