MSK Connect on t3.small fails with a non-retryable SaslAuthenticationException - the reconnect.backoff.ms worker configuration will not help - can AWS remove the connection limit?

4

Hello,

We are encountering the same issue as, e.g., https://github.com/aws/aws-msk-iam-auth/issues/28: a SaslAuthenticationException when using MSK Connect with a kafka.t3.small instance.

Setting reconnect.backoff.ms to e.g. 10000 ms will not resolve the issue, since the exception being thrown (SaslAuthenticationException) is not retryable (see https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java#L808) and ultimately leads to the creation of a new client, not a reconnect.
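To make the "not retryable" part concrete, here is a minimal check (the exception classes are the real Kafka ones; the wrapper class `RetryabilityCheck` is ours, purely for illustration). SaslAuthenticationException is not a RetriableException, so the retry/backoff machinery treats it as fatal:

```java
import org.apache.kafka.common.errors.RetriableException;
import org.apache.kafka.common.errors.SaslAuthenticationException;

public class RetryabilityCheck {
    public static void main(String[] args) {
        // SaslAuthenticationException extends AuthenticationException, which is
        // an ApiException, not a RetriableException. The admin client therefore
        // fails the pending call instead of retrying with backoff.
        Throwable t = new SaslAuthenticationException("Too many connects");
        System.out.println(t instanceof RetriableException); // prints: false
    }
}
```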

When would a reconnect take place at all? Going through the implementation, this is what I see:

  1. startConnect() in ConnectDistributed calls the constructor of Worker
  2. the constructor of Worker calls ConnectUtils.lookupKafkaClusterId(config)
  3. that method calls Admin.create(config.originals()) - which opens up a new connection
  4. following the calls from there, you end up not retrying upon a SaslAuthenticationException (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java#L808) - see the sketch after this list
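For illustration, a condensed paraphrase of steps 2-4 (this is our sketch, not the actual Kafka Connect source; the Admin API calls are the real ones, and they match the stack trace in the logs below):

```java
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;

public class LookupSketch {
    // Paraphrase of ConnectUtils.lookupKafkaClusterId: every worker start
    // creates a brand-new AdminClient, so the broker always sees a fresh
    // connection -- reconnect.backoff.ms never comes into play here.
    static String lookupKafkaClusterId(Map<String, Object> adminConfig) throws Exception {
        try (Admin admin = Admin.create(adminConfig)) {   // new client, new connection
            // "Too many connects" surfaces here as an ExecutionException wrapping
            // SaslAuthenticationException; since that is not retryable, the worker
            // stops instead of backing off and reconnecting.
            return admin.describeCluster().clusterId().get();
        }
    }
}
```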

Even if the retry worked, several AdminClients are created, all of which connect to the MSK cluster. Since this is not a reconnect, the reconnect.backoff.ms setting cannot remediate it, and there is no mechanism in the Kafka code that globally restricts these connection attempts to at most one every x seconds (a sketch of what such a mechanism would look like follows below). Unless I am overlooking something, MSK Connect can only work with t3.small instances by chance.
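Purely hypothetical sketch of the missing mechanism - a backoff wrapper around the cluster-ID lookup. Nothing equivalent exists in Kafka Connect, and on managed MSK Connect we have no hook to inject code like this into worker startup, which is exactly the problem:

```java
import java.util.Map;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.errors.SaslAuthenticationException;

public class ThrottledLookup {
    // Hypothetical wrapper: retry the cluster-ID lookup with a fixed delay when
    // the broker rejects the connect with "Too many connects".
    static String lookupWithBackoff(Map<String, Object> adminConfig,
                                    int maxAttempts, long backoffMs) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try (Admin admin = Admin.create(adminConfig)) {
                return admin.describeCluster().clusterId().get();
            } catch (ExecutionException e) {
                boolean authFailure = e.getCause() instanceof SaslAuthenticationException;
                if (!authFailure || attempt >= maxAttempts) {
                    throw e;                 // unrelated failure, or out of attempts
                }
                Thread.sleep(backoffMs);     // throttle the next connection attempt
            }
        }
    }
}
```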

This forces us to either:

  • not use IAM and go for SASL/SCRAM
  • use a kafka.m5.large instance and go from about 32 USD/month to 151 USD/month per instance - meaning 90 USD vs. 450 USD in our case

The connection limit on the t3.small instance severely restricts what we want to achieve. The workaround presented here does not work and thus forces us to buy the larger instance. We have no need for a large instance, and we don't want to incur additional costs simply for using IAM with MSK Connect.

**Can AWS remove the limit on the t3.small instance or present a different workaround? That would be great :)**

I cannot open a support case for this, since we don't have the required subscription, and I believe this could be of general interest.

Here are excerpts from our AWS MSK Connect logs:

[Worker-05ea3408948fa0a4c] [2022-01-01 22:41:53,059] INFO Creating Kafka admin client (org.apache.kafka.connect.util.ConnectUtils:49)
[Worker-05ea3408948fa0a4c] [2022-01-01 22:41:53,061] INFO AdminClientConfig values:
...
[Worker-05ea3408948fa0a4c] 	reconnect.backoff.max.ms = 10000
[Worker-05ea3408948fa0a4c] 	reconnect.backoff.ms = 10000
[Worker-05ea3408948fa0a4c] 	request.timeout.ms = 30000
[Worker-05ea3408948fa0a4c] 	retries = 2147483647
[Worker-05ea3408948fa0a4c] 	retry.backoff.ms = 10000
...
[Worker-05ea3408948fa0a4c] [2022-01-01 22:41:54,269] ERROR Stopping due to error (org.apache.kafka.connect.cli.ConnectDistributed:86)
[Worker-05ea3408948fa0a4c] org.apache.kafka.connect.errors.ConnectException: Failed to connect to and describe Kafka cluster. Check worker's broker connection and security properties.
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.connect.util.ConnectUtils.lookupKafkaClusterId(ConnectUtils.java:70)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.connect.util.ConnectUtils.lookupKafkaClusterId(ConnectUtils.java:51)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.connect.runtime.Worker.<init>(Worker.java:140)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.connect.runtime.Worker.<init>(Worker.java:127)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.connect.cli.ConnectDistributed.startConnect(ConnectDistributed.java:118)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:80)
[Worker-05ea3408948fa0a4c] Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.SaslAuthenticationException: [e4afe53f-73b5-4b94-9ac3-30d737071e56]: Too many connects
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
[Worker-05ea3408948fa0a4c] 	at org.apache.kafka.connect.util.ConnectUtils.lookupKafkaClusterId(ConnectUtils.java:64)
[Worker-05ea3408948fa0a4c] 	... 5 more
[Worker-05ea3408948fa0a4c] Caused by: org.apache.kafka.common.errors.SaslAuthenticationException: [e4afe53f-73b5-4b94-9ac3-30d737071e56]: Too many connects
[Worker-05ea3408948fa0a4c] [2022-01-01 22:41:54,281] INFO Stopped http_0.0.0.08083@68631b1d{HTTP/1.1, (http/1.1)}{0.0.0.0:8083} (org.eclipse.jetty.server.AbstractConnector:381)
[Worker-05ea3408948fa0a4c] [2022-01-01 22:41:54,283] INFO Stopped https_0.0.0.08443@611d0763{SSL, (ssl, http/1.1)}{0.0.0.0:8443} (org.eclipse.jetty.server.AbstractConnector:381)
[Worker-05ea3408948fa0a4c] MSK Connect encountered errors and failed.
2 Answers
1

Hi there, thank you for providing your valuable feedback regarding the service. Currently we advise that customers do not use IAM-authenticated t3 instances with MSK Connect, due to the connection limits currently in place on t3 instances. The Amazon MSK team is aware of this issue and is working on solutions to increase the connection limit for t3 instances, which will allow this combination of connectors, instance type, and authentication to function reliably. There is no ETA at the moment, but please keep an eye on https://aws.amazon.com/about-aws/whats-new/ for updates on changes to current limits. Hope this helps!

AWS
SUPPORT ENGINEER
answered 2 years ago
0

Hi team, it has been two years. Is this issue still applicable, or has it been fixed? Thanks.

Mounick
answered 3 months ago
