In this tutorial we will learn how to benchmark LLMs deployed on Amazon SageMaker. We will be performing the benchmark from our own PC, but there are other approaches, which are discussed at the end of this tutorial. This tutorial assumes you have already deployed a Llama 2 13B Chat model on an Amazon SageMaker endpoint.
Pre-requisites
So let’s proceed with creating an environment to run a Jupyter notebook on our local PC.
mkdir sagemakerfrompc
cd sagemakerfrompc
python3 -m venv env
source env/bin/activate
pip install -U pip
pip install -U boto3 sagemaker
pip install -U ipykernel notebook
python -Xfrozen_modules=off -m ipykernel install --user --name=sagemaker
jupyter notebook
Jupyter runs on http://localhost:8888
From the Jupyter server, open JupyterLab by choosing the menu View -> Open JupyterLab.
Create a new notebook by choosing the menu File -> New -> Notebook. Choose a kernel; I’ll be choosing sagemaker as the kernel, but you can choose whichever one is available to you. The new notebook opens. Perform the following steps in the code cells of the notebook.
Configure AWS Access
Obtain the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and add the values in a code cell of the newly created notebook. Ensure the region is the same as the region where you want to perform your SageMaker-related interactions.
AWS_ACCESS_KEY_ID="YOUR AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY="YOUR AWS_SECRET_ACCESS_KEY"
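boto3 and the SageMaker SDK read credentials from environment variables, so one optional way to make the keys above visible to them is to export them from the same notebook. This is a minimal sketch; the region value shown is an assumption, so replace it with the region of your endpoint.
import os

# Make the credentials visible to boto3 and the SageMaker SDK via environment variables.
os.environ["AWS_ACCESS_KEY_ID"] = AWS_ACCESS_KEY_ID
os.environ["AWS_SECRET_ACCESS_KEY"] = AWS_SECRET_ACCESS_KEY
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # assumed region; use the region of your endpoint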
Initialize the SageMaker session
import sagemaker
sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket() # bucket to house artifacts
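Optionally, you can print the session’s region and default bucket to confirm the session picked up the configuration you expect.
# Quick sanity check of the SageMaker session configuration.
print("Region:", sess.boto_region_name)
print("Default bucket:", bucket)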
Optionally, in case you want to create endpoints, obtain the SageMaker execution role from the AWS Console and create the role variable with the name of that role, as follows. You can skip this step for the scope of this tutorial.
role='AmazonSageMaker-ExecutionRole-YYYYMMDDTHHMMSS'
Obtain the SageMaker endpoint name and region. Ensure that the region where the SageMaker endpoint is deployed is the same as the value of region.
endpoint_name='Your SageMaker Endpoint Name'
region = sess.boto_region_name # region of the current SageMaker session
ENDPOINT_URL = f'https://runtime.sagemaker.{region}.amazonaws.com/endpoints/{endpoint_name}/invocations'
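Before benchmarking, it can be worth confirming that the endpoint responds at all. The following is a minimal smoke-test sketch using boto3’s SageMaker runtime client; the payload shape mirrors the Llama 2 prompts prepared later and is an assumption about your container’s input format.
import json
import boto3

# One-off invocation to verify the endpoint is reachable before running the benchmark.
smr = boto3.client("sagemaker-runtime", region_name=region)
test_payload = {
    "inputs": "Say hello in one short sentence.",
    "parameters": {"max_new_tokens": 32},
}
response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(test_payload),
)
print(response["Body"].read().decode("utf-8"))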
You are required to have Java installed, so run the following commands in a notebook code cell. They use yum; if your local machine is not an Amazon Linux or other RPM-based system, install Java and wget with your OS’s package manager instead.
!sudo yum -y update
!sudo yum -y install java wget
Download awscurl to benchmark the LLM
In a code cell of the notebook, run the following commands to download awscurl and make it executable.
!wget https://github.com/frankfliu/junkyard/releases/download/v0.3.1/awscurl
!chmod +x awscurl
Prepare payload for endpoint
In this case I will be benchmarking the Llama 2 13B Chat model deployed on a SageMaker endpoint. So let’s prepare the inferencing parameters and input prompts, and save them to files. It’s important to note that the inferencing parameters differ for each LLM. The following work only for Llama 2 models with response streaming.
!mkdir prompts
!echo '{"inputs":"The story of great new Bharata Mandapam from India to demonstrate India'\''s vibrant culture.","parameters":{"min_new_tokens":256, "max_new_tokens":512, "do_sample":true},"stream":true}' > prompts/prompt1.txt
!echo '{"inputs":"What are the latest models and their features for iPhone and Mac notebook.","parameters":{"min_new_tokens":512, "max_new_tokens":1024, "do_sample":true},"stream":true}' > prompts/prompt2.txt
!echo '{"inputs":"Give ideas for naming a .com web domain that sells toys and books.","parameters":{"min_new_tokens":128, "max_new_tokens":256, "do_sample":true},"stream":true}' > prompts/prompt3.txt
If your model doesn’t support streaming responses, remove the "stream":true field (along with the comma before it) from each prompt in the preceding commands.
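Shell quoting around apostrophes and JSON braces is easy to get wrong, so as an alternative you can write the same three payload files from Python with the json module. This is an equivalent sketch, not a required step.
import json
from pathlib import Path

# Write the same payloads from Python to avoid shell-quoting pitfalls.
Path("prompts").mkdir(exist_ok=True)
payloads = [
    ("prompt1.txt",
     "The story of great new Bharata Mandapam from India to demonstrate India's vibrant culture.",
     {"min_new_tokens": 256, "max_new_tokens": 512, "do_sample": True}),
    ("prompt2.txt",
     "What are the latest models and their features for iPhone and Mac notebook.",
     {"min_new_tokens": 512, "max_new_tokens": 1024, "do_sample": True}),
    ("prompt3.txt",
     "Give ideas for naming a .com web domain that sells toys and books.",
     {"min_new_tokens": 128, "max_new_tokens": 256, "do_sample": True}),
]
for filename, text, params in payloads:
    payload = {"inputs": text, "parameters": params, "stream": True}  # drop "stream" if your model doesn't support it
    (Path("prompts") / filename).write_text(json.dumps(payload))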
Benchmarking your LLM with awscurl
We will define parameters such as CONCURRENCY, to indicate the number of concurrent threads to initiate, and NUM_REQUEST, to indicate the number of requests to be made by each concurrent thread.
CONCURRENCY = 5
NUM_REQUEST = 10
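Assuming NUM_REQUEST is the per-thread request count as described above, the total number of requests sent to the endpoint is the product of the two values.
# Total requests for this run, assuming NUM_REQUEST is per concurrent thread: 5 * 10 = 50.
print("Total requests:", CONCURRENCY * NUM_REQUEST)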
We will set TOKENIZER to TheBloke/Llama-2-13B-Chat-fp16 as the tokenizer used while benchmarking the Llama 2 13B Chat model, so that token-based metrics can be computed. Ensure you use the tokenizer that matches your model.
!TOKENIZER=TheBloke/Llama-2-13B-Chat-fp16 \
AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
./awscurl -c $CONCURRENCY -N $NUM_REQUEST -X POST $ENDPOINT_URL \
--connect-timeout 60 -H "Content-Type: application/json" --dataset prompts -t -n sagemaker -o output
Finally, we tail the output files as they are generated. To do that, navigate to the JupyterLab menu File -> New -> Terminal and run the following command in the terminal.
tail -f output.*
Once the batch inferencing is done, you will find the results in the output of the notebook cell where you initiated the call to awscurl. The results include various metrics such as tokens per second (TPS), total tokens, error rate, the number of non-200 responses, and average latency, among others.
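If you prefer to inspect the raw response files from inside the notebook instead of the terminal, a small optional sketch is to glob the output.* files that awscurl wrote.
from pathlib import Path

# List the response files written by awscurl (named after the -o output prefix).
for path in sorted(Path(".").glob("output.*")):
    print(path, "-", path.stat().st_size, "bytes")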
Considerations for Benchmarking LLMs and Making Calculated Decisions
Factors affecting the performance
The speed of a SageMaker endpoint can depend on a lot of things. Some are related to how the endpoint itself is set up – like the instance type or model serving options. But other things can impact performance too, even if they have nothing to do with SageMaker directly.
For example, let’s say you created an endpoint in the US East region and you want to test how fast it is. If you benchmark it by sending requests from your laptop while sitting in a café in India, that geographic distance will affect the results. The benchmark will include some lag time from traffic traveling across the world! Not to mention if you’ve got an old laptop, that could slow things down too.
Instead, try testing response times from an Amazon SageMaker notebook instance or EC2 instance that’s running in the same AWS region. That gives you a more “apples to apples” comparison by eliminating physical distance and internet traffic as factors. It isolates the performance of just the SageMaker endpoint itself.
The point is, where you test from and network factors can significantly affect benchmark results. For the most fair and direct comparisons, test from an environment that mimics real-world conditions as close as possible. By controlling external variables, you get a better sense of your endpoint’s true underlying performance.
Benchmarking with varying Inferencing Configurations
Once you’ve decided on the AWS region closest to your end users for geographic fairness, next comes configuring the performance tests themselves. The goal is to benchmark the SageMaker endpoint under different realistic conditions that match what it will encounter in production.
For example, you’ll want to test response times for varying input prompt lengths – does a longer text query cause more latency? What about different combinations of inference parameters? Maybe a higher temperature or a larger max_new_tokens takes more time, while lower values of temperature and max_new_tokens lead to faster responses but may also lower the output quality.
The point is to benchmark with a range of variance in the actual inferencing configurations. You want to simulate the different types of requests the model will get when it’s serving real users. This will reveal how both simple and complex queries impact response time from the user’s experience.
By testing under different parameterized configurations, you can better optimize the endpoint. Then it’s ready to provide snappy and reliable inferencing times even as user inputs and requests fluctuate. This real-world testing helps tune performance before going live.
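To make this concrete, here is one possible sketch of such a sweep. The prompt texts and parameter values are illustrative assumptions; it writes one prompts directory per configuration so each can be benchmarked in its own awscurl run by pointing --dataset at that directory.
import json
from pathlib import Path

# Illustrative parameter sweep: one prompts directory per inferencing configuration.
configs = [
    {"temperature": 0.2, "max_new_tokens": 128, "do_sample": True},
    {"temperature": 0.7, "max_new_tokens": 512, "do_sample": True},
    {"temperature": 1.0, "max_new_tokens": 1024, "do_sample": True},
]
sample_inputs = [
    "Summarize the benefits of serverless architectures.",
    "Write a short product description for a reusable water bottle.",
]
for i, params in enumerate(configs, start=1):
    out_dir = Path(f"prompts_config{i}")
    out_dir.mkdir(exist_ok=True)
    for j, text in enumerate(sample_inputs, start=1):
        payload = {"inputs": text, "parameters": params, "stream": True}
        (out_dir / f"prompt{j}.txt").write_text(json.dumps(payload))
You can then run awscurl once per directory and compare the resulting TPS and latency numbers side by side.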
Driving the decision with metrics
It’s important to recognize that benchmarking provides approximate estimates of how the SageMaker endpoint performs under different loads. The test results include metrics like response times and error rates for various concurrent requests across different request volumes.
These insights allow informed decisions around iterative optimization. You can tweak internal configurations like the endpoint instance type and model serving containers based on the bottlenecks and constraints observed. Running repeated benchmarking rounds with configuration changes helps hone in on the best setup for your needs.
The goal is to leverage actual data to guide tuning that gets you as close as possible to an optimal configuration for real-world conditions. Benchmarking helps provide targets for incremental improvements until the endpoint performance reaches acceptable levels for concurrent users and production request volumes. Let the test findings steer efforts towards technical configurations that best serve end user needs.
Conclusion
Benchmarking a SageMaker endpoint with close-to-realistic data variability and usage patterns provides key performance insights. These actionable metrics of TPS, error rates, and average latency should directly guide incremental configuration improvements until you reach your optimization targets. An iterative process of comprehensive benchmarking followed by targeted enhancements results in a well-optimized, responsive SageMaker deployment that is ready for production environments and keeps your end users happy.
If you found this tutorial insightful, please do bookmark 🔖 it! Also please do share it with your friends and colleagues!