Hi,
The runtime quotas for queries (requests per minute and token throughput) are not adjustable as of now: see https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html
For example, it is 1,000 requests per minute for Claude Haiku. If your budget allows, the way around it is Provisioned Throughput: see the Provisioned Quotas section at the bottom of the same page.
Your problem may be a snowball effect: too many initial requests, plus their retries running in parallel, only generate more throttling exceptions and a worse user experience.
What you can try, to see if it helps in your specific use case, is to create a request manager: build a shared context via Redis where each Lambda having a query under way stores its id, up to a maximum that you control. When a new request comes in, the Lambda checks the context: if the maximum is not reached, it sends the request to the LLM; if the maximum is reached, it polls the context until another Lambda removes its entry because its request has completed.
That way, you can incrementally learn the maximum parallelism allowed for your account and region, and minimize throttling exceptions and their cascading effect. It will not increase your allowed throughput, but it at least ensures that you make the best use of it.
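The request manager described above is essentially a distributed counting semaphore. Here is a minimal sketch, assuming a redis-py-style client with atomic `incr`/`decr`; the key name, cap, poll interval, and timeout are illustrative values you would tune for your account, and `FakeRedis` is only an in-memory stand-in for local testing (in a real Lambda you would pass a `redis.Redis` instance connected to ElastiCache or similar):

```python
import time


class RequestGate:
    """Caps concurrent Bedrock calls across Lambdas via a shared counter.

    `client` is any Redis-compatible object exposing atomic incr/decr
    (e.g. redis.Redis in production). `max_inflight` is the parallelism
    limit you discover empirically for your account and region.
    """

    def __init__(self, client, key="bedrock:inflight", max_inflight=5,
                 poll_interval=0.25, timeout=30.0):
        self.client = client
        self.key = key
        self.max_inflight = max_inflight
        self.poll_interval = poll_interval
        self.timeout = timeout

    def acquire(self):
        """Claim a slot, polling until one frees up or the timeout expires."""
        deadline = time.monotonic() + self.timeout
        while True:
            # INCR is atomic: optimistically claim a slot,
            # then back off if we went over the cap.
            if int(self.client.incr(self.key)) <= self.max_inflight:
                return True
            self.client.decr(self.key)  # undo the over-claim
            if time.monotonic() >= deadline:
                return False
            time.sleep(self.poll_interval)  # wait for another Lambda to finish

    def release(self):
        """Call after the LLM request completes (success or failure)."""
        self.client.decr(self.key)


# In-memory stand-in for redis.Redis, for local testing only.
class FakeRedis:
    def __init__(self):
        self.values = {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def decr(self, key):
        self.values[key] = self.values.get(key, 0) - 1
        return self.values[key]


if __name__ == "__main__":
    gate = RequestGate(FakeRedis(), max_inflight=2, timeout=0.5)
    assert gate.acquire()        # slot 1
    assert gate.acquire()        # slot 2
    assert not gate.acquire()    # cap reached, times out
    gate.release()               # one request completed
    assert gate.acquire()        # slot is free again
```

One design caveat: if a Lambda crashes between `acquire` and `release`, the counter leaks a slot, so in production you would also put a TTL on the key (or store per-Lambda entries with expirations) so stuck slots eventually free themselves.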
There is NO guarantee that this mechanism will work in your specific use case; it just helped in one of mine, which is why I suggest it here for a trial on your side.
Best,
Didier
