r6i instances cause ENA driver issues


In the past weeks we have switched a number of instances over to the new r6i instance types. We have used r6i.xlarge, r6i.2xlarge and r6i.4xlarge instances. These instance types seem to be prone to hangs in the ENA driver. Network load on the instances ranges from low to high, so the amount of network traffic seems to be unrelated to the issue. The instances don't seem to recover from this on their own.

All of these instances have similar messages in their logs:

Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 668. 5412000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 340. 5424000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 779. 5436000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 780. 5444000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 782. 5456000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 783. 5468000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Keep alive watchdog timeout.
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Trigger reset is on
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: tx_timeout: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: suspend: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: resume: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: wd_expired: 1
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: interface_up: 1
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: interface_down: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: admin_q_pause: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: queue_0_tx_cnt: 56154872
....
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_aborted_cmd: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_submitted_cmd: 53
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_completed_cmd: 53
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_out_of_space: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_no_completion: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[10] offset[88] actual: req id[57015] offset[88]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[11] offset[8] actual: req id[57016] offset[88]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reg read32 timeout occurred
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[1] offset[88] actual: req id[57006] offset[0]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[2] offset[8] actual: req id[57007] offset[0]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reg read32 timeout occurred
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0: Can not reset device
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0: Can not initialize device
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0: Reset attempt failed. Can not reset the device
LeonB
asked 2 years ago, 281 views
1 Answer

Hello,

I understand you are having problems with the ENA driver on your r6i instances.

I did some research on this topic in the Amazon Docs and found a migration guide for r6i instances [1]. The article specifically mentions that driver updates may be required for these 6th generation instances. Alternatively, if you plan to launch the instance from a new AMI, make sure that you select an AMI version that already includes compatible drivers.

The article also includes instructions on how to verify that your ena drivers are up to date [1]. If your driver version is lower than the ones listed in the table in the article, I would recommend updating the drivers.
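As a rough sketch of that comparison, the Python snippet below parses the text printed by `ethtool -i eth0` and checks the reported ENA version against a minimum that you supply. The function names and the sample output are my own illustrations, and the minimum version shown in the usage lines is only a placeholder: take the real value from the table in [1]. Note that on some kernels the `version` field may report the kernel release rather than the ENA driver release, so treat the result as a hint, not a verdict.

```python
import re


def parse_driver_version(ethtool_output):
    """Return (driver_name, version_tuple) parsed from `ethtool -i <iface>` text."""
    fields = {}
    for line in ethtool_output.splitlines():
        # Split on the first colon only, so values such as a PCI address survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # Keep the leading numeric part, e.g. "2.2.10g" -> (2, 2, 10).
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", fields.get("version", ""))
    version = tuple(int(n) for n in match.groups()) if match else None
    return fields.get("driver"), version


def needs_update(ethtool_output, minimum):
    """True if the reported ENA version is lower than `minimum` (a 3-tuple)."""
    driver, version = parse_driver_version(ethtool_output)
    if driver != "ena" or version is None:
        raise ValueError("could not read an ENA driver version")
    return version < minimum


# Illustrative output only; run `ethtool -i eth0` on the instance for real data.
sample = "driver: ena\nversion: 2.0.3K\nbus-info: 0000:00:05.0\n"
# (2, 2, 10) is a placeholder minimum, not a value taken from the article.
print(needs_update(sample, minimum=(2, 2, 10)))
```

Comparing tuples of integers avoids the string-comparison trap where "2.10.0" would sort before "2.9.0".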

I also noticed some "Can not reset device" errors in the logs you provided. This can be an indication of an underlying hardware failure. In situations like this, it is best to open a support case so that our Support Engineers can investigate your resources directly.
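If you want to check several instances for the same pattern before opening a case, a small script along these lines could summarize the kernel log. It is only a sketch: the function name and the returned dictionary keys are hypothetical, and the regular expressions are based solely on the messages quoted in the question, so other driver versions may word them differently.

```python
import re
from collections import Counter

# Patterns taken from the log excerpt in the question.
TX_STALL = re.compile(r"Found a Tx that wasn't completed on time, qid (\d+)")
RESET_FAILED = re.compile(r"Reset attempt failed|Can not reset device")


def summarize_ena_log(lines):
    """Count stalled Tx messages per queue and flag failed device resets."""
    stalls = Counter()
    reset_failed = False
    for line in lines:
        stall = TX_STALL.search(line)
        if stall:
            stalls[int(stall.group(1))] += 1
        if RESET_FAILED.search(line):
            reset_failed = True
    return {"stalled_tx_by_queue": dict(stalls), "reset_failed": reset_failed}


# Example with a few lines shortened from the question's log.
log = [
    "kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 668.",
    "kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 779.",
    "kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 780.",
    "kernel: ena 0000:00:05.0: Can not reset device",
]
print(summarize_ena_log(log))
```

A `reset_failed` value of `True` corresponds to the situation described above, where the driver could not bring the device back and a support case is the appropriate next step.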

Articles: [1] What do I need to do before migrating my EC2 instance to a sixth generation instance to make sure that I get maximum network performance? - Amazon Docs (https://aws.amazon.com/premiumsupport/knowledge-center/migrate-to-gen6-ec2-instance/)

Please let me know if this response helps or if you have any other questions. If the above steps do not resolve your issue, I strongly encourage you to contact an AWS support engineer who will be able to assist you.

answered 2 years ago
AWS
SUPPORT ENGINEER
reviewed 2 years ago
