How to benchmark LLMs deployed on Amazon SageMaker

In this tutorial we will learn how to benchmark LLMs deployed on Amazon SageMaker. We will run the benchmark from our own PC, although there are other options, which are discussed at the end of this tutorial. This tutorial assumes you have already deployed a Llama 2 13B Chat model on an Amazon SageMaker endpoint.


So let’s proceed with creating an environment to run a Jupyter notebook on our local PC.

Jupyter runs on http://localhost:8888
From the server, open JupyterLab by choosing the menu View -> Open JupyterLab.

Create a new notebook by choosing the menu File -> New -> Notebook.
Choose the kernel. I’ll be choosing sagemaker as the kernel, but you can choose whichever one is available to you.

Once the new notebook opens, perform the following steps in its code cells.

Configure AWS Access

Obtain the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and add the values in a code cell of the new notebook. Ensure the region is the same as the region where you want to perform SageMaker-related interactions.
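As a minimal sketch of this step, the credentials can be exposed to the notebook through the standard AWS environment variables. The placeholder values and the us-east-1 region below are assumptions; substitute your own.

```python
import os

# Placeholder values (assumptions): replace with your own credentials.
# Avoid saving real keys in a notebook that is shared or version-controlled.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"

# Assumed region; it should match the region of your SageMaker resources.
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

region = os.environ["AWS_DEFAULT_REGION"]
print(f"AWS region set to {region}")
```

Both boto3 and the AWS CLI read these variables, so anything started from the same notebook process picks them up.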

Initialize the SageMaker session

Optionally, in case you want to create endpoints, obtain the SageMaker execution role from the AWS Console and create the role variable with the name of the SageMaker execution role as follows. You can skip this step for the scope of this tutorial.
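One possible way to initialize the session is sketched below, assuming the boto3 and sagemaker packages are installed. The helper name init_sagemaker_session and the role placeholder are hypothetical and only for illustration.

```python
def init_sagemaker_session(region_name):
    """Return a (sagemaker.Session, sagemaker-runtime client) pair for a region."""
    # Imported inside the function so this cell can be defined even before
    # the packages are installed; the imports run only when it is called.
    import boto3
    import sagemaker

    boto_session = boto3.Session(region_name=region_name)
    sm_session = sagemaker.Session(boto_session=boto_session)
    smr_client = boto_session.client("sagemaker-runtime")
    return sm_session, smr_client


# Hypothetical placeholder; only needed if you plan to create endpoints.
role = "<your-sagemaker-execution-role>"

# Example call (requires valid AWS credentials):
# sm_session, smr_client = init_sagemaker_session("us-east-1")
```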

Obtain the SageMaker endpoint name and region. Ensure that region is set to the region where the SageMaker endpoint is deployed.
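With the endpoint name and region in hand, the URL that a benchmarking tool needs to call can be derived from them. The sketch below assumes a hypothetical endpoint name; the helper invocation_url is also just an illustrative name.

```python
def invocation_url(endpoint_name, region, streaming=False):
    """Build the SageMaker runtime URL for invoking an endpoint."""
    # Streaming responses use a separate runtime action from regular invocations.
    action = "invocations-response-stream" if streaming else "invocations"
    return (
        f"https://runtime.sagemaker.{region}.amazonaws.com"
        f"/endpoints/{endpoint_name}/{action}"
    )


endpoint_name = "llama-2-13b-chat-endpoint"  # hypothetical endpoint name
region = "us-east-1"                         # assumed region

url = invocation_url(endpoint_name, region, streaming=True)
print(url)
```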

You need to have Java installed, so run the following command in a notebook code cell.
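Before going further, it can help to confirm that Java is actually reachable from the notebook. This is only a check, not an installer, and the helper name java_available is an assumption made for this sketch.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is on PATH and runs successfully."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        result = subprocess.run(
            [java_path, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `java -version` usually prints its version banner to stderr.
    print(result.stderr.strip() or result.stdout.strip())
    return result.returncode == 0


if java_available():
    print("Java found")
else:
    print("Java not found; install a JDK or JRE before continuing")
```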

Download awscurl to benchmark LLM

In a code cell of the notebook, run the following commands to download awscurl.

Prepare payload for endpoint

In this case I will be benchmarking the Llama 2 13B Chat LLM deployed on a SageMaker endpoint.

So let’s prepare the inferencing parameters and input prompts and save them to a file. It’s important to note that the inferencing parameters differ for each LLM. The following work only for Llama 2 models with response streaming.
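A payload file along these lines might be prepared as follows. This is a sketch under assumptions: the request schema ("inputs", "parameters", "stream"), the parameter values, the prompts, and the prompts.jsonl file name are all illustrative, and your serving container may expect a different shape.

```python
import json
from pathlib import Path

# Assumed inference parameters; names and ranges depend on the serving container.
parameters = {"max_new_tokens": 256, "temperature": 0.6, "top_p": 0.9}

# Illustrative prompts, not taken from the original tutorial.
prompts = [
    "Explain what Amazon SageMaker is in two sentences.",
    "Write a haiku about benchmarking.",
    "List three factors that affect inference latency.",
]


def write_payload_file(prompts, parameters, path="prompts.jsonl", stream=True):
    """Write one JSON request payload per line and return the file path."""
    lines = []
    for prompt in prompts:
        payload = {"inputs": prompt, "parameters": parameters}
        if stream:
            payload["stream"] = True  # serialized as "stream": true
        lines.append(json.dumps(payload))
    file_path = Path(path)
    file_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return file_path


payload_file = write_payload_file(prompts, parameters)
print(f"Wrote {len(prompts)} payloads to {payload_file}")
```

Passing stream=False leaves the "stream" key out of every payload.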

If your model doesn’t support streaming responses, remove the "stream":true in each prompt from the preceding script.

Benchmarking your LLM with awscurl

We will define parameters such as CONCURRENCY, the number of concurrent threads to start, and NUM_REQUEST, the number of requests made by each concurrent thread.

We will set TOKENIZER to TheBloke/Llama-2-13B-Chat-fp16 as the tokenizer for batch inference against the Llama 2 13B Chat model. Ensure you use the tokenizer that matches your model.
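To make the roles of CONCURRENCY and NUM_REQUEST concrete, here is a small pure-Python stand-in for the load pattern. It is not awscurl and does not reproduce its behaviour or options: it uses a thread pool and a hypothetical fake_send_request stub in place of a real endpoint call, and it counts whitespace-separated words rather than real tokens.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 2  # number of concurrent worker threads
NUM_REQUEST = 3  # requests sent by each worker thread


def fake_send_request(payload):
    """Hypothetical stand-in for an endpoint call; returns (status, token_count)."""
    time.sleep(0.01)  # simulate network and inference time
    return 200, len(payload.split())


def worker(send_request, payload, num_requests):
    """Send requests one after another and record latency, status, and tokens."""
    records = []
    for _ in range(num_requests):
        start = time.perf_counter()
        status, tokens = send_request(payload)
        records.append(
            {
                "latency_s": time.perf_counter() - start,
                "status": status,
                "tokens": tokens,
            }
        )
    return records


def run_benchmark(send_request, payload, concurrency, num_requests):
    """Run `concurrency` workers in parallel, each sending `num_requests` requests."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(worker, send_request, payload, num_requests)
            for _ in range(concurrency)
        ]
        return [record for future in futures for record in future.result()]


records = run_benchmark(
    fake_send_request, "Hello from the benchmark", CONCURRENCY, NUM_REQUEST
)
print(f"Collected {len(records)} records")
```

The total number of requests is CONCURRENCY multiplied by NUM_REQUEST. To hit a real endpoint, the stub could be replaced with a function that calls invoke_endpoint on a boto3 sagemaker-runtime client.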

Finally, we tail the generated output files. To do that, navigate to the Jupyter Server menu option File -> New -> Terminal and run the following command.

Once the batch inferencing is done, you will find the results in the output of the notebook cell where you initiated the call to awscurl.

The results include various metrics such as tokens per second (TPS), total tokens, error rate, non-200 responses, and average latency, among others.
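To show how metrics of this kind relate to each other, the sketch below derives them from a list of per-request records. The record fields and the exact definitions are assumptions for illustration; awscurl may compute its figures differently.

```python
def summarize(records, wall_time_s):
    """Aggregate per-request records into benchmark-style summary metrics."""
    total = len(records)
    ok = [r for r in records if r["status"] == 200]
    total_tokens = sum(r["tokens"] for r in ok)
    return {
        "total_requests": total,
        "non_200_responses": total - len(ok),
        "error_rate": (total - len(ok)) / total if total else 0.0,
        "total_tokens": total_tokens,
        # Throughput over the whole run, not per request.
        "tokens_per_second": total_tokens / wall_time_s if wall_time_s > 0 else 0.0,
        "average_latency_s": (
            sum(r["latency_s"] for r in records) / total if total else 0.0
        ),
    }


# Made-up sample records, purely to exercise the function.
sample = [
    {"latency_s": 1.0, "status": 200, "tokens": 50},
    {"latency_s": 2.0, "status": 200, "tokens": 150},
    {"latency_s": 3.0, "status": 500, "tokens": 0},
    {"latency_s": 2.0, "status": 200, "tokens": 100},
]

summary = summarize(sample, wall_time_s=4.0)
for name, value in summary.items():
    print(f"{name}: {value}")
```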

Considerations for benchmarking LLMs and making calculated decisions

Factors affecting the performance

The speed of a SageMaker endpoint can depend on a lot of things. Some are related to how the endpoint itself is set up – like the instance type or model serving options. But other things can impact performance too, even if they have nothing to do with SageMaker directly.

For example, let’s say you created an endpoint in the US East region and you want to test how fast it is. If you benchmark it by sending requests from your laptop while sitting in a café in India, that geographic distance will affect the results. The benchmark will include some lag time from traffic traveling across the world! Not to mention if you’ve got an old laptop, that could slow things down too.

Instead, try testing response times from an Amazon SageMaker notebook instance or EC2 instance that’s running in the same AWS region. That gives you a more “apples to apples” comparison by eliminating physical distance and internet traffic as factors. It isolates the performance of just the SageMaker endpoint itself.

The point is that where you test from, along with network factors, can significantly affect benchmark results. For the fairest and most direct comparisons, test from an environment that mimics real-world conditions as closely as possible. By controlling external variables, you get a better sense of your endpoint’s true underlying performance.

Benchmarking with varying Inferencing Configurations

Once you’ve decided on the AWS region closest to your end users for geographic fairness, next comes configuring the performance tests themselves. The goal is to benchmark the SageMaker endpoint under different realistic conditions that match what it will encounter in production.

For example, you’ll want to test response times for varying input prompt lengths: does a longer text query cause more latency? What about different combinations of inference parameters? A higher temperature or max_token_length may take more time, while lower values of temperature and max_token_length may lead to faster responses at the cost of lower accuracy.

The point is to benchmark with a range of variance in the actual inferencing configurations. You want to simulate the different types of requests the model will get when it’s serving real users. This will reveal how both simple and complex queries impact response time from the user’s experience.
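One simple way to organize such a sweep is to enumerate every combination up front and run one benchmark round per combination. The values below are assumptions, and the max_token_length key follows the wording above; the real parameter name depends on your serving container.

```python
from itertools import product

# Assumed sweep values; choose ranges that reflect your production traffic.
temperatures = [0.2, 0.8]
max_token_lengths = [128, 512]
prompt_lengths = ["short", "long"]  # labels for hypothetical prompt sets


def build_configs(temperatures, max_token_lengths, prompt_lengths):
    """Return one configuration dict per combination of the sweep values."""
    return [
        {"temperature": t, "max_token_length": m, "prompt_length": p}
        for t, m, p in product(temperatures, max_token_lengths, prompt_lengths)
    ]


configs = build_configs(temperatures, max_token_lengths, prompt_lengths)
print(f"{len(configs)} configurations to benchmark")
for config in configs:
    print(config)
```

Keeping each configuration as a plain dict makes it easy to store it next to the metrics of the corresponding round.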

By testing under different parameterized configurations, you can better optimize the endpoint. Then it’s ready to provide snappy and reliable inferencing times even as user inputs and requests fluctuate. This real-world testing helps tune performance before going live.

Driving the decision with metrics

It’s important to recognize that benchmarking provides approximate estimates of how the SageMaker endpoint performs under different loads. The test results include metrics like response times and error rates for various concurrent requests across different request volumes.

These insights allow informed decisions around iterative optimization. You can tweak internal configurations like the endpoint instance type and model serving containers based on the bottlenecks and constraints observed. Running repeated benchmarking rounds with configuration changes helps home in on the best setup for your needs.

The goal is to leverage actual data to guide tuning that gets you as close as possible to an optimal configuration for real-world conditions. Benchmarking helps provide targets for incremental improvements until the endpoint performance reaches acceptable levels for concurrent users and production request volumes. Let the test findings steer efforts towards technical configurations that best serve end user needs.


Benchmarking a SageMaker endpoint with close-to-realistic data variability and usage patterns provides key performance insights. Metrics such as TPS, error rate, and average latency should directly guide incremental configuration improvements until you reach your optimization targets. An iterative process of comprehensive benchmarking followed by targeted enhancements results in well-optimized, responsive SageMaker deployments that are ready for production and keep your end users happy.

If you found this tutorial insightful, please do bookmark 🔖 it! Also please do share it with your friends and colleagues!

Navule Pavan Kumar Rao

I am a Full Stack Software Engineer with Product Development experience in the Banking, Finance, Corporate Tax and Automobile domains. I use SOLID Programming Principles and Design Patterns and architect Software Solutions that scale using C#, .NET, Python, PHP and TDD. I am an expert in deploying Software Applications to Cloud Platforms such as Azure and GCP and to non-cloud On-Premise Infrastructures using shell scripts that become a part of CI/CD. I pursued an Executive M.Tech in Data Science from IIT, Hyderabad (Indian Institute of Technology, Hyderabad) and hold a B.Tech in Electronics and Communications Engineering from Vaagdevi Institute of Technology & Science.
