When you monitor Amazon MSK clusters with Prometheus through the JMX exporter, scrapes of port 11001 can fail with "context deadline exceeded" timeout errors. This article explains how to identify and resolve the issue by checking broker partition counts against the recommended limits.
Issue
Prometheus reports errors such as "context deadline exceeded" or "EOF" when it scrapes the JMX exporter endpoint (port 11001) on your MSK brokers. This typically occurs when the JMX agent takes longer to collect and return metrics than the scrape timeout allows.
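To confirm that the endpoint is slow rather than unreachable, you can time a single scrape by hand. The following is a minimal sketch using only the Python standard library. The function name `time_scrape` is hypothetical, and the `/metrics` path is an assumption based on the default Prometheus metrics path, so adjust it to match your scrape configuration.

```python
import time
import urllib.request
from typing import Tuple


def time_scrape(url: str, timeout_s: float = 10.0) -> Tuple[float, int]:
    """Fetch a metrics endpoint once and return (elapsed seconds, bytes received).

    timeout_s bounds each blocking socket operation, not the whole request,
    so it only approximates the Prometheus scrape timeout. Any urllib or
    socket error is propagated to the caller.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        body = response.read()
    return time.monotonic() - start, len(body)
```

For example, calling `time_scrape("http://<broker-host>:11001/metrics")` from a host that can reach the broker returns the scrape duration and payload size. A duration close to or above your configured scrape timeout is consistent with the errors described above.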
Root Cause
The primary cause is exceeding the recommended number of partitions per broker for your instance type. When partition counts are too high, the JMX exporter must process significantly more metrics, leading to scrape timeouts.
Resolution
Step 1: Check Current Partition Count
Monitor the PartitionCount metric in Amazon CloudWatch:
- Open the CloudWatch console
- Navigate to Metrics → AWS/Kafka
- Select your cluster and broker dimensions
- View the PartitionCount metric (includes leader and follower replicas)
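The same check can be scripted. The sketch below assumes boto3 is installed and AWS credentials are configured. The helper names are hypothetical, and the dimension names `Cluster Name` and `Broker ID` are assumptions that you should confirm against what the CloudWatch console shows for your cluster.

```python
from datetime import datetime, timedelta, timezone


def partition_count_query(cluster_name: str, broker_id: str, hours: int = 1) -> dict:
    """Build get_metric_statistics arguments for one broker's PartitionCount."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kafka",
        "MetricName": "PartitionCount",
        "Dimensions": [
            # Assumed dimension names; verify them in the CloudWatch console.
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": broker_id},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Maximum"],
    }


def latest_partition_count(cluster_name: str, broker_id: str):
    """Return the most recent PartitionCount value, or None if no data exists."""
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **partition_count_query(cluster_name, broker_id)
    )
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Maximum"] if datapoints else None
```

Run `latest_partition_count` once per broker ID, because the recommended limits apply to each broker individually.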
Step 2: Compare Against Recommended Limits
Verify your partition count against the recommended limits for your broker instance type:
| Broker Instance Type | Maximum Partitions per Broker |
|---|---|
| kafka.t3.small | 300 |
| kafka.m5.large or kafka.m5.xlarge | 1,000 |
| kafka.m5.2xlarge | 2,000 |
| kafka.m5.4xlarge and larger | 4,000 |
Note: These limits include both leader and follower replicas.
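The comparison can be expressed as a small lookup. This sketch encodes only the limits from the table above. The function names are hypothetical, and treating "4xlarge and larger" as any `kafka.m5.<N>xlarge` type with N of 4 or more is an assumption about how to read that row.

```python
import re
from typing import Optional


def recommended_partition_limit(instance_type: str) -> Optional[int]:
    """Return the recommended per-broker partition limit, or None if not listed."""
    fixed_limits = {
        "kafka.t3.small": 300,
        "kafka.m5.large": 1000,
        "kafka.m5.xlarge": 1000,
        "kafka.m5.2xlarge": 2000,
    }
    if instance_type in fixed_limits:
        return fixed_limits[instance_type]
    # "kafka.m5.4xlarge and larger": any m5 size of 4xlarge or above.
    match = re.fullmatch(r"kafka\.m5\.(\d+)xlarge", instance_type)
    if match and int(match.group(1)) >= 4:
        return 4000
    return None


def exceeds_limit(instance_type: str, partition_count: int) -> bool:
    """True if the count (leader plus follower replicas) is above the limit."""
    limit = recommended_partition_limit(instance_type)
    if limit is None:
        raise ValueError("No limit listed for instance type: " + instance_type)
    return partition_count > limit
```

Instance types that the table does not cover return `None` from `recommended_partition_limit`, so the caller has to look up the limit elsewhere rather than rely on a guessed value.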
Step 3: Resolve the Issue
If your partition count exceeds the recommended limit, choose one of the following options:
Option 1 (Recommended): Scale your MSK cluster to a larger broker instance type that can accommodate the increased partition count.
Option 2: Reduce the number of partitions on each broker until it is within the recommended limit, for example by removing unused topics or consolidating topics.
Additional Considerations
- Empty or stale consumer groups can increase JMX scrape latency. Consider cleaning up unused consumer groups regularly.
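- Set the Prometheus scrape interval to 60 seconds or higher to avoid overwhelming the JMX exporter. The scrape timeout (scrape_timeout, 10 seconds by default) cannot exceed the scrape interval, so a longer interval also leaves room for a longer timeout.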
- Monitor CPU utilization and network metrics in CloudWatch to identify performance bottlenecks.