In the past weeks we have switched a number of instances over to the new r6i instance types. We have used r6i.xlarge, r6i.2xlarge and r6i.4xlarge instances. These instance types seem to be prone to hangs in the ena driver. Network load on the instances ranges from low to high, so the actual amount of network traffic seems to be unrelated to the issue. The instance doesn't seem to recover from this on its own.
All these instances have similar messages in the logs:
```
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 668. 5412000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 1, index 340. 5424000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 779. 5436000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 780. 5444000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 782. 5456000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 783. 5468000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Keep alive watchdog timeout.
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Trigger reset is on
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: tx_timeout: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: suspend: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: resume: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: wd_expired: 1
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: interface_up: 1
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: interface_down: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: admin_q_pause: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: queue_0_tx_cnt: 56154872
....
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_aborted_cmd: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_submitted_cmd: 53
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_completed_cmd: 53
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_out_of_space: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: ena_admin_q_no_completion: 0
Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[10] offset[88] actual: req id[57015] offset[88]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[11] offset[8] actual: req id[57016] offset[88]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reg read32 timeout occurred
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[1] offset[88] actual: req id[57006] offset[0]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reading reg failed for timeout. expected: req id[2] offset[8] actual: req id[57007] offset[0]
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Reg read32 timeout occurred
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0: Can not reset device
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0: Can not initialize device
Dec 27 01:38:23 bc-prod-053 kernel: ena 0000:00:05.0: Reset attempt failed. Can not reset the device
```
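
In case it helps anyone comparing their own logs against ours, here is a minimal sketch of how the "Found a Tx that wasn't completed on time" lines could be pulled out of a log and compared with the reported timeout value. The function names (`parse_tx_timeouts`, `affected_queues`) are hypothetical helpers, not part of any ena tooling, and the regex only assumes the message format shown in the log above.

```python
import re

# Matches the ena "missing Tx completion" message as it appears in the log above.
TX_TIMEOUT_RE = re.compile(
    r"Found a Tx that wasn't completed on time, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<usecs>\d+) usecs have passed since last napi execution\. "
    r"Missing Tx timeout value (?P<timeout_ms>\d+) msecs"
)


def parse_tx_timeouts(lines):
    """Return one dict per missing-Tx message found in the given log lines."""
    events = []
    for line in lines:
        match = TX_TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a missing-Tx line
        elapsed_ms = int(match["usecs"]) / 1000  # the log reports microseconds
        timeout_ms = int(match["timeout_ms"])
        events.append({
            "qid": int(match["qid"]),
            "index": int(match["index"]),
            "elapsed_ms": elapsed_ms,
            "timeout_ms": timeout_ms,
            "over_by_ms": elapsed_ms - timeout_ms,
        })
    return events


def affected_queues(events):
    """Return the sorted list of distinct queue ids that reported a missing Tx."""
    return sorted({event["qid"] for event in events})


# Two lines copied from the log above, plus one unrelated line.
sample = [
    "Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 668. 5412000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs",
    "Dec 27 01:38:20 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 3, index 779. 5436000 usecs have passed since last napi execution. Missing Tx timeout value 5000 msecs",
    "Dec 27 01:38:22 bc-prod-053 kernel: ena 0000:00:05.0 eth0: Keep alive watchdog timeout.",
]

events = parse_tx_timeouts(sample)
for event in events:
    print(event)
print("queues affected:", affected_queues(events))
```

On the lines above, each Tx is a few hundred milliseconds past the 5000 msecs timeout (5412000 usecs is 5412 ms), and several queues are hit within the same second.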