AXI4 bvalid not asserted sometimes after wready && wvalid && wlast

Question

I have an 8-kernel awsxclbin that sometimes hangs when multiple kernels are active. The kernels are identical RTL-kernels and the awsxclbin is generated using Vitis 2020.1. Each kernel has one AXI-MM interface to access the FPGA DDR.  
  
Using hw_emu mode I found that sometimes bvalid does not get asserted after _wvalid && wready && wlast_ and causes our RTL Kernel to hang. This behavior (bvalid not being asserted) is sporadic and only happens when multiple kernels are accessing FPGA DDR. Also, not all the kernels hang and different kernels hang at different times.  
  
Any idea on what could be causing this and how to get around the issue?  
  
Also, is there any documentation on when bvalid is/isn't asserted by the shell on f1?

Answer

Hi hsharma,  
  
Please let us know whether recompiling from a 2019.2 xo using 2020.1 helped. If not, we can pursue the next debug steps.  
  
-Deep

Answer

Hello,  
  
The warning suggests that you have a 2020.1 xo that is being linked for the platform using 2019.2. So let's first confirm that it is a 2019.2 xo being linked using 2020.1.   
  
The hangs might also suggest changes in register addressing between kernels generated on 2019.2 vs 2020.1. Have you had a chance to look at those in case that needs a host code change?   
  
-Deep

Answer

Hi Deep,  
  
You're right. I fixed our scripts to compile with a 2019.2 xo and use 2020.1 Vitis for xclbin generation. We still see the hang when running 8 kernels at the same time.  
  
Can you elaborate on the register addressing changes? Our host code uses OpenCL so I assume we won't have to make changes in host code.  
We're using FPGA 1.7.1 AMI for our tests. Do we need to switch to a different AMI?

Answer

We get this exception when trying to use a 2019.2 xo with 2020.1 Vitis:  
  
ERROR: [v++ 17-70] Application Exception: The XO file './sdx_kernel_Gorilla_256bus.xo' was created in version 2020.1, which is later than the software version you are currently running: 2019.2. Forward compatibility is not supported

Can you suggest something else?

Answer

Hi,  
  
Are you using 2019.2 Vitis with the AR73068 patch? That is one difference between 2019.2 and 2020.1 that we know of. With our developer kit, we apply the patch automatically for 2019.2 and 2020.1 tools have that fix built in.  
  
Another thing to check is to take the 2019.2 xo and recompile using 2020.1 tools and create an AFI. This is to check if there is a problem in synthesis between the two different tool versions.   
  
-Deep

Answer

We use our local installations to generate xclbin and awsxclbin. We do not have the AR73068 patch applied to 2019.2 Vitis locally.  
  
I will try generating xclbin/awsxclbin using 2020.1 Vitis with 2019.2 xo.

Answer

The following workaround fixed the hang issue for us:  
We added FIFOs in our RTL kernel for reads and writes to off-chip memory. Throttling the read requests (AXI AR channel) based on the available space in the on-chip FIFOs, and the write requests (AXI AW channel) based on the data available in the on-chip FIFOs, did the trick. It seems that when there are a lot of outstanding AXI read requests, the AXI write requests do not complete, and that hangs our kernel.  
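
For anyone who wants to see the throttling rule in isolation, here is a minimal behavioral sketch in Python (not our RTL). The class name `ThrottledMaster`, the FIFO depth, and the burst length are all hypothetical and chosen only for illustration. The idea is that an AR request is issued only when the read FIFO has room for the whole burst (counting beats already requested but not yet returned), and an AW request is issued only when a full burst of write data is already buffered.

```python
from collections import deque


class ThrottledMaster:
    """Toy model of FIFO-based AR/AW throttling (illustrative, not RTL)."""

    def __init__(self, read_fifo_depth, burst_beats):
        self.read_fifo_depth = read_fifo_depth  # capacity in beats (assumed)
        self.burst_beats = burst_beats          # beats per burst (assumed)
        self.read_fifo = deque()    # beats returned on R, not yet consumed
        self.reserved_beats = 0     # beats requested on AR, not yet returned
        self.write_fifo = deque()   # beats produced, not yet sent on W

    def can_issue_read(self):
        # Space must cover data already buffered plus data still in flight.
        free = self.read_fifo_depth - len(self.read_fifo) - self.reserved_beats
        return free >= self.burst_beats

    def issue_read(self):
        """Issue one AR request if allowed; return True when issued."""
        if not self.can_issue_read():
            return False
        self.reserved_beats += self.burst_beats
        return True

    def accept_read_beat(self, data):
        """One R-channel beat arrives and moves from 'reserved' to the FIFO."""
        assert self.reserved_beats > 0, "beat arrived without a request"
        self.reserved_beats -= 1
        self.read_fifo.append(data)

    def can_issue_write(self):
        # Only start a write burst when all of its data is already buffered,
        # so the W channel never stalls in the middle of a burst.
        return len(self.write_fifo) >= self.burst_beats

    def issue_write(self):
        """Issue one AW request; return the burst data, or None if throttled."""
        if not self.can_issue_write():
            return None
        return [self.write_fifo.popleft() for _ in range(self.burst_beats)]
```

With a 32-beat read FIFO and 16-beat bursts, this model allows two reads in flight and holds back the third until the kernel drains the FIFO, which caps the number of outstanding read requests.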
  
Thanks Deep!

Answer

Hello,  
  
The DDR controllers provided in the AWS Dev Kit follow the AXI4 protocol and return a write response (BRESP) upon successful completion of a write request from the master logic. The following simulation example may help you understand the behavior of the controller's AXI4 interface:  
https://github.com/aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma/verif#dram-dma-cl-example-simulation  
  
Following are quick items to check based on your description of the issue:  
1. Check the AXI4 request size and ensure that no burst crosses a 4 KB address boundary.   
2. Ensure that there are no other AXI4 Protocol Violations.  
3. Check AWID/BID from each RTL Kernel.   
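
As a quick aid for item 1, the 4 KB rule can be checked offline against a list of logged burst addresses. The following is a minimal Python sketch; the function name `crosses_4k_boundary` and its arguments are hypothetical, and it assumes incrementing bursts described by a start byte address and a total size in bytes.

```python
def crosses_4k_boundary(start_addr, burst_bytes):
    """Return True if an incrementing burst spans two 4 KB address pages.

    start_addr  -- byte address of the first beat (assumed input)
    burst_bytes -- total number of bytes transferred by the burst
    """
    if burst_bytes <= 0:
        raise ValueError("burst_bytes must be positive")
    last_addr = start_addr + burst_bytes - 1
    # Bits above [11:0] select the 4 KB page; they must match for the
    # first and last byte of the burst.
    return (start_addr >> 12) != (last_addr >> 12)
```

For example, a 1 KB burst starting at `0xC00` ends at `0xFFF` and stays inside one page, while the same burst starting at `0xE00` ends at `0x11FF` and crosses the boundary.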
  
Here's some more info on DDR4 AXI4 interface:  
https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md#ddr4-axi  
  
Please contact us if you seek any additional info.   
  
Thanks!  
Chakra

Answer

Thanks for the link to simulation example.  
  
Answering your questions:  
1. Both read/write requests don't cross 4K boundary. Write/Read request sizes are 1KB.  
2. There don't seem to be any protocol violations. What would be the simplest way to confirm AXI4 protocol compliance? Should we add a protocol checker to each kernel?  
3. The awid/bid signals are always tied to 0.  
  
Some more relevant information:  
1. The 8 kernels are attached to DDR banks 0, 0, 1, 3, 3, 0, 0, 2. Bank 1 and bank 2 each have only one kernel attached, and these kernels also hang.  
2. I modified our RTL kernel to not wait for bvalid. However, wready was not asserted for the next transaction and the kernel would still hang. I figured this out from hw_emu mode. I have not yet tried adding ILAs and testing on the FPGA to verify this.  
**3. The same RTL code when compiled with 2019.2 does not hang and passes our tests.**  
  
Things I can try next:  
1. attach ILAs to the kernel and verify if bvalid/wready not being asserted is the only issue.  
2. add protocol checker to the kernels.

Answer

Thanks for the details.  
  
Please try adding protocol checkers to the RTL kernels.   
Also, if you could build a debug counter that tracks how many write requests were made and how many responses were received, it would help us identify when it starts to break.   
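
To make the counter idea concrete, here is a small behavioral sketch in Python rather than RTL. The class name `WriteTracker` is hypothetical. It assumes a request is counted on an AW handshake (awvalid && awready) and a response on a B handshake (bvalid && bready), sampled once per clock.

```python
class WriteTracker:
    """Counts AW requests and B responses, one sample per clock cycle."""

    def __init__(self):
        self.aw_count = 0  # accepted write address requests
        self.b_count = 0   # accepted write responses

    def clock(self, awvalid, awready, bvalid, bready):
        """Sample the handshake signals for one cycle."""
        if awvalid and awready:
            self.aw_count += 1
        if bvalid and bready:
            self.b_count += 1

    @property
    def outstanding(self):
        """Write requests that have not yet received a response."""
        return self.aw_count - self.b_count
```

If the outstanding count stops returning to zero after the kernel goes idle, the two raw counts show how many requests were accepted before responses stopped arriving.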
  
Does this issue happen when only one Kernel is attached to a single DDR?  
  
Thanks!  
Chakra

Answer

I added counters to the AXI interfaces in our RTL kernels. With four kernels connected to bank 0,0,1,3:  
  
For read channels (AR/R): I see about 129K read requests (arvalid && arready) after which rready goes low. The ILA waveform shows that the last two read requests do not finish (rvalid && rready && rlast).  
  
For write channels (AW/W/B): I see about 156K write requests (awvalid && awready). bvalid is not asserted for the last request and the kernel gets stuck waiting for bvalid.

When each of the four kernels is connected to a separate bank, all our tests pass.

Answer

Hello,  
  
I sent you a PM about further debug. Another thing is that XRT is quite different for 1.8.1 AMI onwards. For 2020.1, I'd definitely suggest using 1.9.x AMI.  
  
-Deep

Answer

We reduced the number of kernels to 4 (connected to banks 0, 0, 1, 3) and attached a protocol checker and an ILA to each kernel.  
We see these errors:  
kernel 0 and 1 (bank 0): XILINX_RECS_WRITE_TO_BVALID_MAX_WAIT followed by AXI_RECM_RREADY_MAX_WAIT, XILINX_RECS_CONTINUOUS_RTRANSFERS_MAX_WAIT, and XILINX_RECM_CONTINUOUS_WTRANSFERS_MAX_WAIT  
kernel 2 and 3 see no protocol violations and do not hang  
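
For readers unfamiliar with this kind of check, the sketch below shows in Python what a write-to-bvalid wait watchdog does conceptually, going only by the name of the first error above. It is not the protocol checker's implementation. The class name and the `max_wait` threshold are hypothetical, and the model assumes a single ID with in-order responses (our awid/bid are tied to 0).

```python
from collections import deque


class WriteToBvalidWatchdog:
    """Flags a write whose response takes more than max_wait cycles."""

    def __init__(self, max_wait):
        self.max_wait = max_wait  # hypothetical threshold, in clock cycles
        self.ages = deque()       # cycles waited by each unanswered write
        self.tripped = False

    def clock(self, w_last_handshake, b_handshake):
        """Advance one cycle.

        w_last_handshake -- wvalid && wready && wlast seen this cycle
        b_handshake      -- bvalid && bready seen this cycle
        Returns True once any write has waited longer than max_wait.
        """
        # Every write still waiting for its response gets one cycle older.
        self.ages = deque(age + 1 for age in self.ages)
        if b_handshake and self.ages:
            self.ages.popleft()   # in-order: response retires oldest write
        if w_last_handshake:
            self.ages.append(0)   # a new write starts waiting now
        if self.ages and self.ages[0] > self.max_wait:
            self.tripped = True
        return self.tripped
```

In this model the watchdog trips only when a response is late or missing, so on its own it does not say whether the missing bvalid is the cause or a symptom of something upstream.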
  
Does this help narrow down the issue?  
  
Again, the same code works when built with 2019.2 Vitis, but hangs with the above protocol violations when built with 2020.1 Vitis.  
  
I will add counters for AXI requests/responses and also try one kernel per DDR bank.
