MSK cluster timing out

I reported this before. We are trying to migrate to MSK, but even under low load, which represents the idle state of our application (a few messages per minute in total from about 30 clients), the MSK brokers start failing with timeouts.

ARN of the cluster: arn:aws:kafka:eu-west-1:499577160181:cluster/eks-20190111-production-201902-20190708/6e3e6674-9c9b-40bb-98b4-506edf76a6b2-4

Output from kafka-broker-api-versions:

b-1.....kafka.eu-west-1.amazonaws.com:9092 (id: 1 rack: subnet-0e2b1a418025f9b10) -> ERROR: org.apache.kafka.common.errors.DisconnectException
b-3.....kafka.eu-west-1.amazonaws.com:9092 (id: 3 rack: subnet-000880b697986a0d8) -> ERROR: org.apache.kafka.common.errors.DisconnectException
b-2.....kafka.eu-west-1.amazonaws.com:9092 (id: 2 rack: subnet-0f50b40038ca2bf62) -> (
	Produce(0): 0 to 7 [usable: 7],
	Fetch(1): 0 to 10 [usable: 10],
	ListOffsets(2): 0 to 4 [usable: 4],
	Metadata(3): 0 to 7 [usable: 7],
	LeaderAndIsr(4): 0 to 1 [usable: 1],
	StopReplica(5): 0 [usable: 0],
	UpdateMetadata(6): 0 to 4 [usable: 4],
	ControlledShutdown(7): 0 to 1 [usable: 1],
	OffsetCommit(8): 0 to 6 [usable: 6],
	OffsetFetch(9): 0 to 5 [usable: 5],
	FindCoordinator(10): 0 to 2 [usable: 2],
	JoinGroup(11): 0 to 3 [usable: 3],
	Heartbeat(12): 0 to 2 [usable: 2],
	LeaveGroup(13): 0 to 2 [usable: 2],
	SyncGroup(14): 0 to 2 [usable: 2],
	DescribeGroups(15): 0 to 2 [usable: 2],
	ListGroups(16): 0 to 2 [usable: 2],
	SaslHandshake(17): 0 to 1 [usable: 1],
	ApiVersions(18): 0 to 2 [usable: 2],
	CreateTopics(19): 0 to 3 [usable: 3],
	DeleteTopics(20): 0 to 3 [usable: 3],
	DeleteRecords(21): 0 to 1 [usable: 1],
	InitProducerId(22): 0 to 1 [usable: 1],
	OffsetForLeaderEpoch(23): 0 to 2 [usable: 2],
	AddPartitionsToTxn(24): 0 to 1 [usable: 1],
	AddOffsetsToTxn(25): 0 to 1 [usable: 1],
	EndTxn(26): 0 to 1 [usable: 1],
	WriteTxnMarkers(27): 0 [usable: 0],
	TxnOffsetCommit(28): 0 to 2 [usable: 2],
	DescribeAcls(29): 0 to 1 [usable: 1],
	CreateAcls(30): 0 to 1 [usable: 1],
	DeleteAcls(31): 0 to 1 [usable: 1],
	DescribeConfigs(32): 0 to 2 [usable: 2],
	AlterConfigs(33): 0 to 1 [usable: 1],
	AlterReplicaLogDirs(34): 0 to 1 [usable: 1],
	DescribeLogDirs(35): 0 to 1 [usable: 1],
	SaslAuthenticate(36): 0 [usable: 0],
	CreatePartitions(37): 0 to 1 [usable: 1],
	CreateDelegationToken(38): 0 to 1 [usable: 1],
	RenewDelegationToken(39): 0 to 1 [usable: 1],
	ExpireDelegationToken(40): 0 to 1 [usable: 1],
	DescribeDelegationToken(41): 0 to 1 [usable: 1],
	DeleteGroups(42): 0 to 1 [usable: 1]
)
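
For anyone checking a cluster the same way, here is a minimal sketch that condenses output like the above into a per-broker status. It assumes the line layout shown in this post (a header line per broker ending in either `-> ERROR: <exception>` or `-> (`); the function name `summarize_brokers` and the sample hostnames are made up for illustration.

```python
import re

# Header line printed per broker, in the layout shown in this post, e.g.
# "host:9092 (id: 1 rack: subnet-...) -> ERROR: some.Exception"
BROKER_LINE = re.compile(
    r"^(?P<host>\S+) \(id: (?P<id>\d+) rack: (?P<rack>[^)\s]+)\) -> (?P<rest>.*)$"
)


def summarize_brokers(output: str) -> dict:
    """Map broker id to "ok" or to the error class reported for that broker."""
    status = {}
    for line in output.splitlines():
        match = BROKER_LINE.match(line.strip())
        if not match:
            continue  # API version rows and the closing ")" are skipped
        rest = match.group("rest")
        broker_id = int(match.group("id"))
        if rest.startswith("ERROR:"):
            status[broker_id] = rest[len("ERROR:"):].strip()
        else:
            status[broker_id] = "ok"
    return status


# Hypothetical sample in the same layout; the hostnames are placeholders.
sample = (
    "b-1.example.invalid:9092 (id: 1 rack: subnet-a) -> "
    "ERROR: org.apache.kafka.common.errors.DisconnectException\n"
    "b-2.example.invalid:9092 (id: 2 rack: subnet-b) -> (\n"
    "\tProduce(0): 0 to 7 [usable: 7]\n"
    ")\n"
)

for broker_id, state in sorted(summarize_brokers(sample).items()):
    print(broker_id, state)
```

Feeding it the real tool output (for example, captured to a file) should make it easier to spot which brokers are failing when the check is repeated over time.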

I do not see anything unusual in the CloudWatch metrics, except that network traffic went up when I started my services and went down when, presumably, the timeouts started.

ankon
asked 5 years ago, 1101 views
7 Answers

+1

answered 5 years ago

I restored our application to use our own self-hosted Kafka cluster (on shared t3.medium instances).

It is now 12 hours later, and the MSK cluster was left to itself overnight. It is still timing out on two of the three brokers.

@AWS: I would have expected by now that at least your monitoring would have noticed an issue, and would have restarted the failing brokers. Please advise on how to proceed.

b-2.….kafka.eu-west-1.amazonaws.com:9092 (id: 2 rack: subnet-0f50b40038ca2bf62) -> (
	Produce(0): 0 to 7 [usable: 7],
	Fetch(1): 0 to 10 [usable: 10],
	ListOffsets(2): 0 to 4 [usable: 4],
	Metadata(3): 0 to 7 [usable: 7],
	LeaderAndIsr(4): 0 to 1 [usable: 1],
	StopReplica(5): 0 [usable: 0],
	UpdateMetadata(6): 0 to 4 [usable: 4],
	ControlledShutdown(7): 0 to 1 [usable: 1],
	OffsetCommit(8): 0 to 6 [usable: 6],
	OffsetFetch(9): 0 to 5 [usable: 5],
	FindCoordinator(10): 0 to 2 [usable: 2],
	JoinGroup(11): 0 to 3 [usable: 3],
	Heartbeat(12): 0 to 2 [usable: 2],
	LeaveGroup(13): 0 to 2 [usable: 2],
	SyncGroup(14): 0 to 2 [usable: 2],
	DescribeGroups(15): 0 to 2 [usable: 2],
	ListGroups(16): 0 to 2 [usable: 2],
	SaslHandshake(17): 0 to 1 [usable: 1],
	ApiVersions(18): 0 to 2 [usable: 2],
	CreateTopics(19): 0 to 3 [usable: 3],
	DeleteTopics(20): 0 to 3 [usable: 3],
	DeleteRecords(21): 0 to 1 [usable: 1],
	InitProducerId(22): 0 to 1 [usable: 1],
	OffsetForLeaderEpoch(23): 0 to 2 [usable: 2],
	AddPartitionsToTxn(24): 0 to 1 [usable: 1],
	AddOffsetsToTxn(25): 0 to 1 [usable: 1],
	EndTxn(26): 0 to 1 [usable: 1],
	WriteTxnMarkers(27): 0 [usable: 0],
	TxnOffsetCommit(28): 0 to 2 [usable: 2],
	DescribeAcls(29): 0 to 1 [usable: 1],
	CreateAcls(30): 0 to 1 [usable: 1],
	DeleteAcls(31): 0 to 1 [usable: 1],
	DescribeConfigs(32): 0 to 2 [usable: 2],
	AlterConfigs(33): 0 to 1 [usable: 1],
	AlterReplicaLogDirs(34): 0 to 1 [usable: 1],
	DescribeLogDirs(35): 0 to 1 [usable: 1],
	SaslAuthenticate(36): 0 [usable: 0],
	CreatePartitions(37): 0 to 1 [usable: 1],
	CreateDelegationToken(38): 0 to 1 [usable: 1],
	RenewDelegationToken(39): 0 to 1 [usable: 1],
	ExpireDelegationToken(40): 0 to 1 [usable: 1],
	DescribeDelegationToken(41): 0 to 1 [usable: 1],
	DeleteGroups(42): 0 to 1 [usable: 1]
)
b-1.….kafka.eu-west-1.amazonaws.com:9092 (id: 1 rack: subnet-0e2b1a418025f9b10) -> ERROR: org.apache.kafka.common.errors.DisconnectException
b-3.….kafka.eu-west-1.amazonaws.com:9092 (id: 3 rack: subnet-000880b697986a0d8) -> ERROR: org.apache.kafka.common.errors.DisconnectException
ankon
answered 5 years ago

Hi Ankon, we are looking into this issue. Did you cut a trouble ticket with our support team?

answered 5 years ago

Yes, we have changed our support plan, and opened a support request.

ankon
answered 5 years ago

I have now received a reply from AWS support, essentially saying:

  1. This is a known issue with Kafka 2.1.0: a deadlock described in https://issues.apache.org/jira/browse/KAFKA-7697
  2. They claim that they work around the issue by restarting affected brokers (see https://forums.aws.amazon.com/thread.jspa?messageID=895763#895763)
  3. They suggest using 1.1.1 for production, and advise against using 2.1.0 in production.
  4. While they are investigating fix options including providing newer Kafka versions, there are no ETAs.
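
As a rough illustration of how that advice could be encoded in a pre-deployment check, here is a minimal sketch. The version table contains only what the support reply above states and is not a complete compatibility matrix; the names `KNOWN_STATUS` and `version_advice` are made up for this example.

```python
# Statuses taken only from the support reply quoted in this post.
# This is an assumption-laden lookup, not an exhaustive compatibility matrix.
KNOWN_STATUS = {
    "2.1.0": "affected by KAFKA-7697 (deadlock); advised against for production",
    "1.1.1": "suggested by AWS support for production",
}


def version_advice(kafka_version: str) -> str:
    """Return the advice quoted above for a Kafka version, or a fallback note."""
    return KNOWN_STATUS.get(
        kafka_version.strip(),
        "not covered by the support reply; check the KAFKA-7697 ticket",
    )


for version in ("2.1.0", "1.1.1", "2.0.0"):
    print(version, "->", version_advice(version))
```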

I hope this helps anyone else reading through this forum.

ankon
answered 5 years ago

Thanks for the follow-up. We are working on supporting 2.2.1, which does not have this known issue.

answered 5 years ago

Kafka 2.2.1 support is live for new clusters.

answered 5 years ago
