I want to improve Amazon Bedrock performance and response times when I process and retrieve large-scale data.
Resolution
You might experience latency issues for one of the following reasons:
- Distance between your application and the Amazon Bedrock endpoint
- Larger models, which typically require more time to process requests
- Length and complexity of your prompts
- Large volume of simultaneous API calls
To improve performance and response times, take the following actions.
Choose the right model
Review your specific requirements, and then choose the model that best fits your needs in terms of speed and output quality.
Improve your input and system prompts
Reduce the number of tokens in both your input prompts and system prompts. If your model has fewer tokens to process and generate, then the model returns a response faster.
It's a best practice to use clear and concise prompts, structured templates, and prompt engineering techniques.
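The following sketch illustrates the idea of trimming a prompt before you send it. Both prompts request the same output, but the concise version has far fewer input tokens for the model to process. The wording and text are illustrative only.

```python
# Verbose prompt: extra filler words add input tokens without improving the output.
verbose_prompt = (
    "I would really like you to please take the following customer review and, "
    "if at all possible, provide me with a short summary of what the customer "
    "is saying, ideally in no more than two sentences: "
)

# Concise prompt: same instruction, far fewer tokens to process.
concise_prompt = "Summarize this customer review in two sentences: "

review = "The delivery was late, but the product quality exceeded my expectations."
prompt = concise_prompt + review  # fewer input tokens -> faster response
```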
Use prompt caching
Prompt caching is an optional feature that you can use to reduce model inference response latency and costs in Amazon Bedrock. Add parts of your conversation to a cache so that the model can reuse that context in later requests instead of reprocessing it.
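The following is a minimal sketch of prompt caching with the Converse API in boto3. The model ID, Region, and prompt text are placeholders, and prompt caching is supported only by specific models in specific Regions, so confirm availability for your model before you rely on this approach.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # placeholder model ID
    system=[
        # Long, reusable context goes before the cache checkpoint.
        {"text": "You are a support assistant. <long product documentation>"},
        # The cachePoint block marks everything above it for caching so that
        # later requests can reuse the context instead of reprocessing it.
        {"cachePoint": {"type": "default"}},
    ],
    messages=[
        {"role": "user", "content": [{"text": "How do I reset my password?"}]},
    ],
)
print(response["output"]["message"]["content"][0]["text"])
```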
Use inference parameters
Use model-specific inference parameters, such as temperature and maximum output tokens, to tune response generation. These parameters help you control the length and variability of the output.
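The following is a minimal sketch of setting inference parameters with the Converse API. The model ID and values are placeholders; supported parameters and their ranges vary by model.

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    messages=[
        {"role": "user", "content": [{"text": "Summarize AWS Lambda in two sentences."}]}
    ],
    inferenceConfig={
        "maxTokens": 200,    # cap the response length so generation finishes sooner
        "temperature": 0.2,  # lower values produce more focused, deterministic text
        "topP": 0.9,
    },
)
print(response["output"]["message"]["content"][0]["text"])
```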
Use latency-optimized inference
Latency-optimized inference for foundation models in Amazon Bedrock provides faster response times and improved responsiveness for AI applications. There is no additional setup required to access the latency optimization capability. Set the Latency parameter to Optimized.
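The following is a minimal sketch of requesting latency-optimized inference through the Converse API. The performanceConfig latency setting is honored only by models and Regions that support latency optimization, and the model ID and Region shown here are placeholders.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-2")  # placeholder Region

response = client.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # placeholder model ID
    messages=[
        {"role": "user", "content": [{"text": "List three uses for Amazon S3."}]}
    ],
    # Request the latency-optimized version of the model.
    performanceConfig={"latency": "optimized"},
)
print(response["output"]["message"]["content"][0]["text"])
```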
Use smaller models
Larger models, such as Anthropic Claude 2, typically produce higher-quality output but have higher latency. If speed matters more than maximum quality, use a smaller model that responds faster with reduced capabilities.
Select a closer Region
Choose the Amazon Bedrock Region that's closest to your application, as long as the model that you want to use is available in that Region.
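The Region is set when you create the runtime client, so point the client at the supported Region with the lowest network latency to your application. The Region in this sketch is only an example.

```python
import boto3

# Create the Bedrock runtime client in the Region closest to your application.
client = boto3.client("bedrock-runtime", region_name="eu-central-1")  # example Region
```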
Use streaming APIs
The InvokeModel and Converse APIs wait until all response tokens are generated before they return the response to you. Use the InvokeModelWithResponseStream and ConverseStream APIs instead, because these APIs return tokens in a stream as the model generates them.
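The following is a minimal sketch of streaming with the ConverseStream API. Tokens are printed as they arrive, which reduces the time to the first visible output even though total generation time is similar. The model ID and prompt are placeholders.

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    messages=[
        {"role": "user", "content": [{"text": "Explain Amazon EBS in one paragraph."}]}
    ],
)

# Iterate over the event stream and print text chunks as they arrive.
for event in response["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)
print()
```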