
How to get/set Bedrock granular cost controls/limits for Anthropic model usage for 100-1000 employees


For the past 6 months I have run a Claude Code POC in my company. We now want to roll it out to the entire firm, and my question is whether it's possible to easily set up a granular cost control framework, specifically:

  1. Per-division limits (cost and token budgets by team/business unit)
  2. Per-developer limits (cost and token budgets by individual user)

I checked https://docs.aws.amazon.com/bedrock/latest/userguide/capacity-limits-cost-optimization.html and posts here, but it's still not clear how to go about this, and whether Bedrock provides these controls natively or we have to build them ourselves.

Thanks for any advice, Steve

  • If my answer helped solve your problem, I would appreciate it if you click on “accepted answer”

2 Answers
2
Accepted Answer

While Application Inference Profiles (AIPs) are excellent for cost allocation, managing 1,000 individual profiles is an administrative nightmare in my view. Here is the recommended architecture for a firm-wide rollout:

1. Grouping via AIPs (The "Division" Level):

  • Create AIPs at the Business Unit or Team level (e.g., AIP-Engineering-Prod, AIP-Marketing).
  • Tag these with CostCenter keys to automate your AWS Billing reports.
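
A minimal sketch of the division-level step, assuming boto3's Bedrock control-plane client and its create_inference_profile call. The helper names (build_profile_request, create_division_profile), the "Division" tag key and the AIP- naming scheme are illustrative assumptions, not something prescribed by Bedrock:

```python
from typing import Any, Dict


def build_profile_request(division: str, cost_center: str, model_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for Bedrock's CreateInferenceProfile call."""
    return {
        "inferenceProfileName": f"AIP-{division}",
        "description": f"Claude usage for the {division} division",
        # copyFrom is the foundation model (or system inference profile) ARN to wrap.
        "modelSource": {"copyFrom": model_arn},
        # Tag keys are an assumption; they only show up in billing reports once
        # they are activated as cost allocation tags.
        "tags": [
            {"key": "CostCenter", "value": cost_center},
            {"key": "Division", "value": division},
        ],
    }


def create_division_profile(
    bedrock_client: Any, division: str, cost_center: str, model_arn: str
) -> str:
    """Create the profile and return its ARN.

    bedrock_client is expected to be boto3.client("bedrock").
    """
    response = bedrock_client.create_inference_profile(
        **build_profile_request(division, cost_center, model_arn)
    )
    return response["inferenceProfileArn"]
```

Callers would then pass the returned profile ARN as the modelId in their runtime requests, so that usage is attributed to the division's tags.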

2. Granular Tracking (The "Developer" Level):

3. Real-Time Enforcement (The "Limit" Level):

  • Bedrock does NOT have a "Stop at $X" button per user !!!
  • Gateway Pattern: Route all employee requests through a central Lambda "Proxy" or API Gateway.
  • Logic: Before calling Bedrock, the Lambda checks a DynamoDB table (User_ID | Daily_Token_Count). If the limit is reached, it returns a "429 Too Many Requests". After the call, it updates the count based on the metadata returned by Bedrock.
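
The pre-flight check and post-call update described above could look roughly like this. It is a sketch under stated assumptions: UsageStore is an in-memory stand-in for the DynamoDB table, call_model is a hypothetical callable that wraps the Bedrock request and returns a dict with "text" and a Converse-style "usage" block, and DAILY_TOKEN_LIMIT is an arbitrary example value:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

DAILY_TOKEN_LIMIT = 200_000  # hypothetical per-user daily budget


@dataclass
class UsageStore:
    """In-memory stand-in for the DynamoDB table (User_ID | Daily_Token_Count)."""

    counts: Dict[str, int] = field(default_factory=dict)

    def get(self, user_id: str) -> int:
        return self.counts.get(user_id, 0)

    def add(self, user_id: str, tokens: int) -> int:
        self.counts[user_id] = self.get(user_id) + tokens
        return self.counts[user_id]


def handle_request(
    user_id: str,
    prompt: str,
    store: UsageStore,
    call_model: Callable[[str], Dict[str, Any]],
    limit: int = DAILY_TOKEN_LIMIT,
) -> Dict[str, Any]:
    """Proxy handler: reject over-budget users, otherwise call and record usage."""
    # Pre-flight check: refuse before any Bedrock cost is incurred.
    if store.get(user_id) >= limit:
        return {"statusCode": 429, "body": "Too Many Requests"}

    response = call_model(prompt)

    # Post-call update from the token counts reported in the response.
    usage = response["usage"]
    store.add(user_id, usage["inputTokens"] + usage["outputTokens"])
    return {"statusCode": 200, "body": response["text"]}
```

Because the count is only updated after the call, a single request can overshoot the limit. With a real DynamoDB table, the add step would typically be an atomic UpdateItem with an ADD expression so that concurrent requests do not lose updates.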

see:

4. Summary of Native vs. Custom:

  • Native: Cost tracking and tagging (AIPs).
  • Custom: Hard limits and per-user quotas (Lambda + DynamoDB).

PS: I would start by enforcing limits at the API Key / Application level first. Individual developer limits are usually only necessary if you provide a "Playground" UI where users can run arbitrary, expensive prompts (like long-form Claude 3.5 Sonnet tasks).

EXPERT
answered a month ago
EXPERT
reviewed a month ago
  • Thanks Florian for guiding me here, creating 1000 AIPs is indeed an admin nightmare. The tiered approach (AIPs at division level, requestMetadata for per-developer tracking) sounds like the pragmatic answer here. I was hoping for a ready-made-here-you-are solution, but Lambda proxy + DynamoDB for real-time enforcement is exactly the pattern I probably need here.

    What are still open questions for me:

    • Claude Code-specific solution -- your answer assumes generic Bedrock API calls. Claude Code's traffic pattern (many small requests, tool use loops, context window stuffing) means token budgets matter more than request counts.
    • Caching -- Bedrock prompt caching can dramatically change cost profiles. My cost tracking needs to account for cached vs. uncached tokens.
    • Model routing -- if I enforce Sonnet-by-default and gate Opus access per team, that's most likely a cheaper lever than per-user token limits.

    What do you think?

  • You are right to focus on Real-Time Enforcement - it’s the hardest part because Bedrock doesn't do it natively. To solve this for 1,000 users without an "admin disaster", I think the following is a pragmatic approach:

    • Circuit Breaker: Use an API Gateway + Lambda Proxy. The Lambda performs a "Pre-Flight" check against DynamoDB before forwarding to Bedrock. If the limit is reached, it returns 429 Too Many Requests immediately. This should give you the real-time (or near-real-time) gate you need.
    • Claude Code & Caching: A raw 'Token Limit' is too blunt for Claude Code's context-heavy loops. Your Proxy must be 'Cache-Aware': parse the cache-read token count from the response (cacheReadInputTokens in the Converse API usage block) and update the budget based on Actual Cost ($). This ensures developers aren't penalized for using efficient cached contexts.
    • Loop Protection: Since Claude Code can get stuck in tool-use loops, implement a 'Request Rate Limit' (e.g., 20 calls/min) in the Proxy. This prevents a runaway loop from burning a daily budget in seconds.
    • Model Gating: Use IAM to restrict Opus access to specific roles. For everyone else, Sonnet 3.5 via the Proxy-enforced limits is the most pragmatic lever.
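
For the loop protection point, a sliding-window limiter is one simple way to cap calls per user. This is only a sketch: SlidingWindowLimiter is a hypothetical helper, the 20 calls/min default mirrors the example figure above, and the in-process state shown here would need to live in a shared store (for example the same DynamoDB table) once the proxy runs as multiple Lambda instances:

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_calls per user within the trailing window_seconds."""

    def __init__(
        self,
        max_calls: int = 20,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        calls = self._calls[user_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False  # caller should answer with 429
        calls.append(now)
        return True
```

The proxy would call allow(user_id) before the budget check and return 429 when it yields False, which stops a runaway tool-use loop without touching the daily budget logic.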

    So, native Bedrock only provides visibility. For Enforcement, the Proxy pattern is unavoidable. Tracking USD instead of raw tokens is the only way to fairly account for Caching discounts.
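
To make the USD-versus-tokens point concrete, here is a small sketch of a cache-aware cost calculation. It assumes a Converse-style usage block (inputTokens, outputTokens, cacheReadInputTokens, cacheWriteInputTokens) in which the input count excludes cached tokens. The Pricing class and the numbers in EXAMPLE_PRICING are placeholders, not actual Bedrock prices, so substitute the current rates for your model and region:

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Mapping


@dataclass(frozen=True)
class Pricing:
    """USD per 1,000 tokens for each token category."""

    input_per_1k: Decimal
    output_per_1k: Decimal
    cache_read_per_1k: Decimal
    cache_write_per_1k: Decimal


# Placeholder values for illustration only.
EXAMPLE_PRICING = Pricing(
    input_per_1k=Decimal("0.003"),
    output_per_1k=Decimal("0.015"),
    cache_read_per_1k=Decimal("0.0003"),
    cache_write_per_1k=Decimal("0.00375"),
)


def request_cost_usd(usage: Mapping[str, int], pricing: Pricing) -> Decimal:
    """Price one request, charging cached tokens at their own (lower) rate."""
    total = (
        usage.get("inputTokens", 0) * pricing.input_per_1k
        + usage.get("outputTokens", 0) * pricing.output_per_1k
        + usage.get("cacheReadInputTokens", 0) * pricing.cache_read_per_1k
        + usage.get("cacheWriteInputTokens", 0) * pricing.cache_write_per_1k
    )
    return total / Decimal(1000)
```

With these placeholder rates, a request that reads 10,000 tokens from cache is charged far less than one that sends the same 10,000 tokens uncached, which a flat token counter would treat identically.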

  • Perhaps you could try it with Kinesis Data Streams to handle the log-streaming for 1,000 users to avoid latency, but remember: Kinesis is asynchronous. It’s great for near real-time dashboards, but for a hard circuit breaker that stops a request before it costs money, you still need that synchronous check.

1

Hello.

The following AWS blog post may also be helpful.
This document describes how to track costs using application inference profiles.
https://aws.amazon.com/jp/blogs/machine-learning/track-allocate-and-manage-your-generative-ai-cost-and-usage-with-amazon-bedrock/

The journey begins with the insurance provider creating application inference profiles that are tailored to their diverse business units. By assigning AWS cost allocation tags, the organization can effectively monitor and track their Bedrock spend patterns. For example, the claims processing team established an application inference profile with tags such as dept:claims, team:automation, and app:claims_chatbot. This tagging structure categorizes costs and allows assessment of usage against budgets.

EXPERT
answered a month ago
EXPERT
reviewed a month ago
