
Scalability Assessment: Amazon Bedrock Agents as Backend Service for Production Web Application - Hitting Quota Limits in Development


I'm evaluating Amazon Bedrock Agents for a production web application and have concerns about scalability and quota limitations. Specifically:

  • Current setup: I've built a Bedrock Agent that performs text analysis (assessment, rating, and enhancement suggestions)
  • Current issue: I'm hitting quota limits during development/testing with just a single user
  • Usage pattern: Each request involves processing a text passage through the Agent

Questions:

  • What are the recommended architectural patterns for using Bedrock Agents in a customer-facing web application?
  • Are there specific best practices for implementing rate limiting or request batching with Agents?
  • Would direct API calls to Bedrock foundation models be more appropriate for this use case?

Comment: Hi, can you detail which quota limits you're bumping into (LLM inferences, agent instances, etc.)? The solution may differ based on the quota type.

asked a year ago · 747 views
1 Answer

In general, yes: we certainly have customers using Bedrock Agents for (very) large-scale use-cases. There are some important (and some non-adjustable) quotas to be aware of around the building of agents (e.g. rates on agent build/update APIs, the number of aliases and action groups an agent can have, and so on), but it sounds like you're struggling more with usage quotas. As far as I've seen, these are usually constraints on the usage of the underlying foundation model your agent is calling, rather than being agent-specific. So as far as question (3) goes, you'd probably still need to plan for and negotiate them even if you used other agent orchestration tools with Bedrock FMs.
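If you're not sure which quota you're actually hitting, the Service Quotas API can list the limits applied to Bedrock in your account and region. A minimal sketch (the "per minute" filter string is just illustrative; inspect the actual quota names returned for your account):

```python
import boto3

# List the Bedrock quotas currently applied to this account/region.
# Requires servicequotas:ListServiceQuotas permission.
quotas = boto3.client("service-quotas")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for q in page["Quotas"]:
        # Filter for rate-style limits, e.g. requests/tokens per minute.
        if "per minute" in q["QuotaName"].lower():
            print(f'{q["QuotaName"]}: {q["Value"]} (adjustable: {q["Adjustable"]})')
```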

The underlying issue with LLM inference quotas is that the biggest models are very compute-intensive, and accelerated compute is in high global demand: I've seen customers get very high quotas approved, but those high quotas represent potentially significant bills for them and infrastructure reservations on the AWS side... So you'll likely find it easier to get very high quota requests approved if you:

  1. Have forecast the actual expected cost of your production workload based on your prototype app, validated that it seems viable, and maybe done some optimization if needed.
  2. Have sustained usage over a period from testing to go-live, and ideally are ramping smoothly into higher usage. (If the workload represents a proportionally very big increase over your previous overall AWS bills, or the AWS account is brand new, that could of course raise some extra questions before quota approval.)
  3. Similar to point 2, have built some confidence that the solution actually works as you want it to, rather than just throwing something live at a huge user volume and then shutting it off next week because it doesn't deliver results.
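In the meantime, while developing under default quotas, it also helps to configure your client to back off and retry on throttling rather than failing immediately. A minimal sketch using boto3's built-in adaptive retry mode (the agent ID, alias ID, and session ID are placeholders):

```python
import boto3
from botocore.config import Config

# Adaptive retry mode adds client-side rate limiting on top of
# exponential backoff when the service returns ThrottlingException.
retry_config = Config(retries={"max_attempts": 8, "mode": "adaptive"})

agents = boto3.client("bedrock-agent-runtime", config=retry_config)

response = agents.invoke_agent(
    agentId="AGENT_ID",          # placeholder
    agentAliasId="ALIAS_ID",     # placeholder
    sessionId="user-session-1",  # placeholder
    inputText="Assess and rate this passage: ...",
)

# invoke_agent returns an event stream; collect the completion chunks.
completion = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(completion)
```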

On evaluation

I'd suggest checking out AWSLabs' agent-evaluation if you haven't already found a framework for automated functional testing of your agent prototypes. For latency & performance testing, LLMeter could help, and will hopefully soon be launching tools to link its performance test results to estimated costs. This guided workshop has more discussion & tools for evaluating LLM-based applications too. Bedrock also recently launched some new tools at re:Invent, like Knowledge Base evaluation jobs, that aren't captured there yet but we hope to add soon.

On architectural patterns

  • Double-check whether Agents are needed for your use-case: if you're performing a generally fixed flow of analysis, a fixed prompt flow might meet the need with less LLM throughput and fewer calls than the agentic "plan and act" cycle (see the sketch after this list). It's often a trade-off between flexibility and optimization.
  • Note that you can also disable steps in the Bedrock Agent prompt sequence if they're not needed for your use-case - again, to help reduce the number of LLM calls/tokens needed to fulfil your standard workflow.
  • Latency can be a challenge with agentic systems if they perform tasks that require a large number of iterations before returning a final response to the user. Consider whether your app really needs to interact with users in real time (which might mean some trade-offs, like using faster/lower-quality models or fewer processing steps), or whether it's better to frame your use-case around a more asynchronous user experience (email, messaging, etc.).
  • Fairness may be complex to implement, but consider whether it's important enough to your use-case to do it:
    • If your app has a concept of multiple users or tenants, you might be concerned about one or a few of them consuming too much capacity and leaving the others under-served. Solutions to this depend quite a lot on how you authenticate users and what other components you have in your stack, so they're likely to need some custom build. There's this example solution, but it's targeted more towards separating business units than end users.
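To illustrate the first bullet: if your analysis flow really is fixed (assess, rate, suggest enhancements), a single direct call to the foundation model via the Converse API may be enough, with none of the agent's plan/act orchestration overhead. A minimal sketch; the model ID and prompt wording are illustrative, not a recommendation:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

passage = "The quick brown fox jumps over the lazy dog."

# One direct model call replaces the agent's multi-step plan/act loop
# when the workflow is fixed. Use whichever FM you have access to in
# your region; this model ID is just an example.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{
        "role": "user",
        "content": [{
            "text": (
                "Assess the following passage, rate it from 1-10, "
                f"and suggest enhancements:\n\n{passage}"
            ),
        }],
    }],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```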

I haven't seen much yet around specific request batching/queuing for agents - but it'll depend heavily on whether your agent is actually interactive (with users), or whether it's mostly just talking to itself to execute a one-shot task and then returning the result. If it's appropriate for the use-case, presenting your agent to users in an asynchronous-by-default channel and minimizing their expectations of an immediate reply will give you much more flexibility to do things like queuing than if your app is set up around real-time conversations. A sketch of one such queuing pattern follows.
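For example (an assumed architecture, not a prescribed one): the web frontend drops analysis jobs onto an SQS queue, and a worker drains it at a controlled rate so agent invocations stay under your quota. A minimal sketch; the queue URL, agent identifiers, and message fields are all placeholders:

```python
import json
import time

import boto3

sqs = boto3.client("sqs")
agents = boto3.client("bedrock-agent-runtime")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/analysis-jobs"  # placeholder

def run_agent(text: str, session_id: str) -> str:
    """Invoke the agent and collect its streamed completion."""
    response = agents.invoke_agent(
        agentId="AGENT_ID",        # placeholder
        agentAliasId="ALIAS_ID",   # placeholder
        sessionId=session_id,
        inputText=text,
    )
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

while True:
    # Long-poll for one job at a time to keep the invocation rate low.
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    ).get("Messages", [])
    for msg in messages:
        job = json.loads(msg["Body"])  # assumed shape: {"job_id": ..., "text": ...}
        result = run_agent(job["text"], job["job_id"])
        # Persist `result` wherever your app expects it (DynamoDB, S3, etc.)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    time.sleep(1)  # crude pacing; tune to stay under your invocation quota
```

Scaling the number of workers (or the pacing delay) then becomes your throttle, independent of how fast users submit jobs.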

AWS EXPERT · answered a year ago
