Skip to content

How do I troubleshoot the "ThrottlingException" error when I use Amazon Bedrock on-demand resources?

5 minute read
3

I want to troubleshoot the "ThrottlingException" (429 HTTP status code) error that I receive when I use Amazon Bedrock on-demand resources.

Short description

When you exceed service quotas, Amazon Bedrock denies your requests.

Amazon Bedrock returns a "ThrottlingException" (HTTP Status Code: 429) error, and you receive one of the following error messages on the client side:

  • "Too many requests, please wait before trying again. You have sent too many requests. Wait before trying again."
  • "Your request rate is too high. Reduce the frequency of requests."
  • "Too many tokens, please wait before trying again."

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Monitor your AWS service quotas

Review your Amazon Bedrock service quotas to make sure that you're not exceeding them. Check Amazon CloudWatch metrics at 1-minute increments to identify throttling patterns. When your usage exceeds the quotas at peak times, throttling might occur even with previously successful batches. To make sure that your application's request volume doesn't exceed the quotas, monitor the InputTokenCount and Invocations Amazon Bedrock runtime metrics.

Some models have separate quotas for Requests Per Minute (RPM) and Tokens Per Minute (TPM) that Amazon Bedrock enforces concurrently.

Newer model versions might have different quotas than previous versions.

Note: The Service Quota dashboard shows only configured quotas, not real-time usage. To monitor real-time usage, use CloudWatch.

Use cross-Region inference profiles

Use cross-Region inference profiles to dynamically route traffic across multiple AWS Regions for optimal availability for each request and better performance for high-usage periods. Each Region maintains independent capacity pools. To avoid throttling in one Region's capacity pool, distribute requests across multiple Regions.

Some models, such as Anthropic Claude 3.5 Sonnet, require cross-Region inference profiles in certain Regions.

For more information, see the code sample for cross-Region interference in the amazon-bedrock-workshop on the GitHub website.

Note: To use an inference profile, you must use a Region and model that Amazon Bedrock supports.

Request a quota increase

New accounts might have lower initial quotas than the default quotas. Some models have non-adjustable fixed quotas. If your workload traffic exceeds your account's on-demand quotas, then contact AWS Support or your account manager to request a quota increase. AWS might adjust default quotas based on usage patterns or service requirements.

Include the following information in your request:

  • The name of the quota that you want to increase
  • The model ID
  • The Region for the quota increase
  • A brief explanation of your use case
  • Your projected usage, including steady and peak tokens and requests per minute, and average input and output tokens per request.

Use Provisioned Throughput

If you have high throughput requirements, then purchase Provisioned Throughput.

Note: You incur an additional cost when you use Provisioned Throughput. For information about Provisioned Throughput pricing, see the Pricing models section in Amazon Bedrock pricing.

For more information about how you can use Provisioned Throughput, see Use a Provisioned Throughput with an Amazon Bedrock resource. To use the AWS CLI or Python SDK to create Provisioned Throughput, see Code examples for Provisioned Throughput.

Note: Before you purchase Provisioned Throughput, make sure that you're using a Region and model that Amazon Bedrock supports.

Add retries with exponential backoff

When you use on-demand mode, Amazon Bedrock uses a shared capacity pool across multiple customers. During periods of high service demand you might experience throttling even when your requests are within your account's quotas. Also, the service automatically manages capacity allocation across all users.

It's best practice to use retries with exponential backoff and random jitter. If you use AWS SDKs, then see Retry behavior.

Make sure that the retry backoff lasts 1 full minute when it reaches per-minute quotas. Sync your retry attempts with the 60-second quota refresh cycle. Also, distribute your requests across multiple seconds within a 1-minute period.

Example Python configuration with adaptive retry mode:

from botocore.config import Config

config = Config(
    retries={
        'max_attempts': 10,  # Default is 3
        'mode': 'adaptive'
    }
)
bedrock_runtime = boto3.client('bedrock-runtime', config=config)

Resolve the "ServiceUnavailable" error

A "ServiceUnavailableException" (HTTP status code 503) error occurs because of temporary capacity constraints, not exceeded quotas. This error usually automatically resolves, but might require architecture adjustments for critical workloads.

Review your AWS CloudTrail logs to check whether you receive a "ServiceUnavailable" error with the "ThrottlingException" error.

If you receive both errors, then take the following actions to resolve the "ServiceUnavailableException" error:

Resolve unexpected throttling when you didn't exceed quotas

If you experienced throttling and you didn't exceed service quotas, then take the following actions: