Prompt caching in Amazon Bedrock does not impact the tokens-per-minute (TPM) service quota in the way you've described. When you use prompt caching, only the newly processed input tokens count against your token quota; tokens served from the cache do not.
In your example scenario:
- You have a limit of 100k tokens per minute
- You make 5 API calls
- Each call has 20k tokens in cache hits and 5k new tokens
In this case, only 25k tokens count against your quota (5k new tokens × 5 calls), not 125k. The 20k cached tokens per call (100k in total) are not counted against your token quota.
In this scenario you would be well within your 100k tokens-per-minute quota, using only 25k of it, so the service quota would not throttle these API calls.
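As a rough illustration, here is a minimal Python sketch of the accounting above. The numbers simply mirror your scenario; this is not a Bedrock API call, just the arithmetic behind the quota check.

```python
# Hypothetical illustration of how cached vs. new tokens count toward a TPM quota.
TPM_QUOTA = 100_000               # tokens-per-minute service quota
CALLS_PER_MINUTE = 5
CACHED_TOKENS_PER_CALL = 20_000   # cache hits: not counted against the quota
NEW_TOKENS_PER_CALL = 5_000       # newly processed tokens: counted

counted = CALLS_PER_MINUTE * NEW_TOKENS_PER_CALL                                      # 25,000
total_processed = CALLS_PER_MINUTE * (CACHED_TOKENS_PER_CALL + NEW_TOKENS_PER_CALL)   # 125,000

print(f"Tokens counted against the {TPM_QUOTA:,} TPM quota: {counted:,}")
print(f"Total tokens the model effectively saw: {total_processed:,}")
print("Within quota" if counted <= TPM_QUOTA else "Would be throttled")
```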
Prompt caching is designed to reduce both cost and latency by avoiding reprocessing of repeated content. Tokens read from the cache are billed at a reduced rate and do not count toward your regular token quotas, so you can make more efficient use of your allocated throughput and potentially fit more API calls within the same quota.
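If you want to verify the accounting in your own account, a sketch along these lines can help. It uses boto3's Converse API; the `cachePoint` content block and the `cacheReadInputTokens`/`cacheWriteInputTokens` usage fields follow the Bedrock prompt caching documentation but should be checked against the current API reference, and the model ID is only an example, so confirm caching support for your model and Region.

```python
import boto3

# Assumptions: prompt caching is enabled for the chosen model/Region, and the
# usage field names below match the current Converse API response shape.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

LONG_SHARED_CONTEXT = "..."  # the large, repeated portion of your prompt

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # example model ID only
    system=[
        {"text": LONG_SHARED_CONTEXT},
        {"cachePoint": {"type": "default"}},  # cache everything up to this marker
    ],
    messages=[
        {"role": "user", "content": [{"text": "Question that changes per call"}]},
    ],
)

usage = response["usage"]
new_input = usage["inputTokens"]                    # new (non-cached) input tokens
cache_read = usage.get("cacheReadInputTokens", 0)   # tokens served from the cache
cache_write = usage.get("cacheWriteInputTokens", 0) # tokens written to the cache on the first call
print(f"new input: {new_input}, cache read: {cache_read}, cache write: {cache_write}")
```

Comparing `inputTokens` with `cacheReadInputTokens` across repeated calls is a practical way to confirm how much of each request is being served from the cache rather than processed as new tokens.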
Note that the exact behavior can vary by model and Region, as prompt caching support differs between them. Always check the current AWS documentation for your specific use case.
Sources
- Cache Prompts Between Requests - Amazon Bedrock Prompt Caching - AWS
- Effectively use prompt caching on Amazon Bedrock | AWS Machine Learning Blog
- Prompt caching for faster model inference - Amazon Bedrock
- Reduce costs and latency with Amazon Bedrock Intelligent Prompt Routing and prompt caching (preview) | AWS News Blog