Amazon MSK Prometheus JMX metric endpoint 429 errors

I am running into an issue with my MSK cluster's broker Prometheus metrics. The JMX metrics endpoint constantly returns 429 (Too Many Requests) errors when Prometheus attempts to scrape the /metrics endpoint on port 11001 (JMX).

This does not seem to be related to the broker instance type (three m5.large brokers), as neither the Node metrics endpoint on port 11002 nor my consumers ever run into any throttling.

This is problematic, as I wish to monitor OffsetLag and other broker-specific metrics; the inconsistency of the JMX metrics scrapes makes this nearly impossible. I have found no information anywhere else about anyone running into this particular error. As I mentioned, I only see 429 errors on this JMX metrics endpoint, not anywhere else.

I have even increased the scrape interval to 2+ minutes, and this does not solve the problem.
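To measure how often the endpoint throttles independently of Prometheus, a small probe along these lines might help. This is only a sketch: the broker hostname in the usage comment is a made-up placeholder, and `fetch_status` and `probe` are hypothetical helper names, not part of any MSK or Prometheus tooling. It fetches /metrics on port 11001 at a fixed interval and tallies the HTTP status codes it sees.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for a GET on url (4xx/5xx included)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code (e.g. 429) is on the exception.
        return err.code


def probe(url, attempts=10, interval=30.0, fetch=fetch_status, sleep=time.sleep):
    """Request url `attempts` times, `interval` seconds apart, and tally status codes."""
    counts = Counter()
    for i in range(attempts):
        counts[fetch(url)] += 1
        if i < attempts - 1:
            sleep(interval)
    return counts


# Example usage (placeholder hostname, replace with a real broker address):
# print(probe("http://b-1.example.kafka.us-east-1.amazonaws.com:11001/metrics"))
```

Comparing the tally at different intervals, or against port 11002, would show whether the 429s depend on request rate at all. Connection failures (`urllib.error.URLError`) are deliberately not caught here, so they surface as exceptions rather than being counted.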

  • I'm having the same issue - the throttling on the endpoint is extreme, and I'm not sure how to get metrics reliably. If the metrics are too expensive to calculate, the endpoint should serve cached metrics, not a 429.

Steven
asked 5 months ago, 458 views
1 Answer
@Frank,

The only fix that worked for me was deleting and recreating the MSK cluster itself; the new one does not seem to throttle.

Not the best solution, but just another AWS managed-service quirk. I spent a lot of time and energy trying to find a solution that did not involve recreating the cluster...

Steven
answered 3 months ago
