AXI4 bvalid sometimes not asserted after wready && wvalid && wlast


I have an 8-kernel awsxclbin that sometimes hangs when multiple kernels are active. The kernels are identical RTL kernels, and the awsxclbin is generated with Vitis 2020.1. Each kernel has one AXI-MM interface to access the FPGA DDR.

Using hw_emu mode, I found that sometimes bvalid does not get asserted after wvalid && wready && wlast, which causes our RTL kernel to hang. This behavior (bvalid not being asserted) is sporadic and only happens when multiple kernels are accessing the FPGA DDR. Also, not all the kernels hang, and different kernels hang at different times.
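
For reference, here is roughly the condition we wait on, written as a SystemVerilog assertion (a sketch only; the 256-cycle bound is arbitrary, and the signal names are from our kernel's AXI4 master port):

    // Sketch: a hang detector for the missing write response.
    // If the last write data beat is accepted, BVALID must follow
    // within 256 cycles (an arbitrary bound chosen for debug).
    property p_bvalid_follows_wlast;
      @(posedge aclk) disable iff (!aresetn)
        (wvalid && wready && wlast) |-> ##[1:256] bvalid;
    endproperty
    a_bvalid: assert property (p_bvalid_follows_wlast)
      else $error("BVALID not seen within 256 cycles of last W beat");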

Any idea on what could be causing this and how to get around the issue?

Also, is there any documentation on when bvalid is/isn't asserted by the shell on F1?

hsharma
asked 3 years ago · 456 views
13 Answers

Hello,

The DDR controllers provided in the AWS Dev Kit follow the AXI4 protocol and return a write response (BRESP) upon successful completion of a write request from the master logic. The following simulation example may help you understand the behavior of the controller's AXI4 interface:
https://github.com/aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma/verif#dram-dma-cl-example-simulation

Here are some quick items to check based on your description of the issue:

  1. Check the AXI4 request size and ensure it does not cross a 4 KiB boundary (see the sketch after this list).
  2. Ensure that there are no other AXI4 protocol violations.
  3. Check AWID/BID from each RTL kernel.
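
On item 1, a minimal simulation-time check for the 4 KiB rule could look like the following sketch (assuming an INCR write burst; the signal names match a standard AXI4 AW channel and the 64-bit widths are illustrative):

    // Sketch: flag any INCR write burst that crosses a 4 KiB page.
    wire [63:0] burst_bytes = (awlen + 64'd1) << awsize;    // bytes in the burst
    wire [63:0] last_byte   = awaddr + burst_bytes - 64'd1; // last byte addressed
    always @(posedge aclk) begin
      if (aresetn && awvalid && awready && (awaddr[63:12] != last_byte[63:12]))
        $error("AW burst crosses 4 KiB boundary: addr=%h len=%0d", awaddr, awlen);
    end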

Here's some more info on the DDR4 AXI4 interface:
https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md#ddr4-axi

Please contact us if you need any additional info.

Thanks!
Chakra

AWS
answered 3 years ago

Thanks for the link to simulation example.

Answering your questions:

  1. Neither read nor write requests cross a 4K boundary; both read and write request sizes are 1 KB.
  2. There don't seem to be any protocol violations. What would be the simplest way to confirm AXI4 protocol compliance? Should we add a protocol checker to each kernel?
  3. The awid/bid signals are always tied to 0.

Some more relevant information:

  1. The 8 kernels are attached to DDR banks 0,0,1,3,3,0,0,2. Only one kernel is attached to bank 1 and one to bank 2, and those kernels also hang.
  2. I modified our RTL kernel to not wait for bvalid. However, wready was not asserted for the next transaction and the kernel would still hang. I figured this out from hw_emu mode; I have not yet tried adding ILAs and testing on the FPGA to verify it.
  3. The same RTL code, when compiled with 2019.2, does not hang and passes our tests.

Things I can try next:

  1. Attach ILAs to the kernels and verify whether bvalid/wready not being asserted is the only issue.
  2. Add a protocol checker to each kernel.
hsharma
answered 3 years ago

Thanks for the details.

Please try adding protocol checkers to the RTL kernels.
Also, if you could build debug counters to track how many write requests were made and how many responses were received, that will help us identify when it starts to break.
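
For example, something along these lines (a sketch; the counter width and exact signal names are up to you):

    // Sketch: count accepted write requests vs. received responses.
    // Read aw_cnt/b_cnt out over an ILA; a persistent gap of one
    // pinpoints the request whose BVALID never arrived.
    reg [31:0] aw_cnt, b_cnt;
    always @(posedge aclk) begin
      if (!aresetn) begin
        aw_cnt <= 32'd0;
        b_cnt  <= 32'd0;
      end else begin
        if (awvalid && awready) aw_cnt <= aw_cnt + 32'd1;
        if (bvalid && bready)   b_cnt  <= b_cnt + 32'd1;
      end
    end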

Does this issue happen when only one kernel is attached to a single DDR?

Thanks!
Chakra

AWS
answered 3 years ago

We reduced the number of kernels to 4 (connected to banks 0,0,1,3) and attached a protocol checker and an ILA to each kernel.
We see these errors:
Kernels 0 and 1 (bank 0): XILINX_RECS_WRITE_TO_BVALID_MAX_WAIT, followed by AXI_RECM_RREADY_MAX_WAIT, XILINX_RECS_CONTINUOUS_RTRANSFERS_MAX_WAIT, and XILINX_RECM_CONTINUOUS_WTRANSFERS_MAX_WAIT.
Kernels 2 and 3 see no protocol violations and do not hang.

Does this help narrow down the issue?

Again, the same code works with Vitis 2019.2 but hangs with the above protocol violations when using Vitis 2020.1.

I will add counters for AXI requests/responses and also try one kernel per DDR bank.

hsharma
answered 3 years ago

I added counters to the AXI interfaces in our RTL kernels. With four kernels connected to banks 0,0,1,3:

For the read channels (AR/R): I see about 129K read requests (arvalid && arready), after which rready goes low. The ILA waveform shows that the last two read requests do not finish (rvalid && rready && rlast).

For the write channels (AW/W/B): I see about 156K write requests (awvalid && awready). bvalid is not asserted for the last request, and the kernel gets stuck waiting for bvalid.

When each of the four kernels is connected to a separate bank, all our tests pass.

hsharma
answered 3 years ago

Hi,

Are you using 2019.2 Vitis with the AR73068 patch? That is one difference between 2019.2 and 2020.1 that we know of. With our developer kit, we apply the patch automatically for 2019.2, and the 2020.1 tools have that fix built in.

Another thing to check is to take the 2019.2 xo, recompile it using the 2020.1 tools, and create an AFI. This will tell us whether there is a problem in synthesis between the two tool versions.

-Deep

Deep_P
answered 3 years ago

We use our local installations to generate the xclbin and awsxclbin. We do not have the AR73068 patch applied to 2019.2 Vitis locally.

I will try generating the xclbin/awsxclbin using 2020.1 Vitis with the 2019.2 xo.

hsharma
answered 3 years ago

Hi hsharma,

Please let us know whether recompiling the 2019.2 xo using 2020.1 helped. If not, we can pursue the next debug steps.

-Deep

Deep_P
answered 3 years ago

We get this exception when trying to use the 2019.2 xo with 2020.1 Vitis:

ERROR: [v++ 17-70] Application Exception: The XO file './sdx_kernel_Gorilla_256bus.xo' was created in version 2020.1, which is later than the software version you are currently running: 2019.2. Forward compatibility is not supported

Can you suggest something else?

hsharma
answered 3 years ago

Hello,

The error message suggests the opposite: a 2020.1 xo is being linked for the platform using the 2019.2 tools. So let's first confirm that it is a 2019.2 xo being linked using 2020.1.

The hangs might also suggest changes in register addressing between kernels generated with 2019.2 vs. 2020.1. Have you had a chance to look at those, in case a host code change is needed?

-Deep

Deep_P
answered 3 years ago

Hi Deep,

You're right. I fixed our scripts to compile the 2019.2 xo with 2020.1 Vitis for xclbin generation. We still have the hang issue when running all 8 kernels at the same time.

Can you elaborate on the register addressing changes? Our host code uses OpenCL, so I assume we won't have to make changes there.
We're using the FPGA 1.7.1 AMI for our tests. Do we need to switch to a different AMI?

hsharma
answered 3 years ago

Hello,

I sent you a PM about further debug. One other thing: XRT is quite different from the 1.8.1 AMI onwards. For 2020.1, I'd definitely suggest using a 1.9.x AMI.

-Deep

Deep_P
answered 3 years ago

The following workaround fixed the hang issue for us:
We added FIFOs in our RTL kernel for reads/writes to off-chip memory. Throttling the read requests (AXI AR channel) based on the available space in the on-chip FIFOs, and the write requests (AXI AW channel) based on the data already available in the on-chip FIFOs, did the trick. It seems that when there are many outstanding AXI read requests, the AXI write requests do not get completed, and that hangs our kernel.
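
Roughly, the gating looks like this (a simplified sketch of our change; the names, depths, and burst size are illustrative, and the pending/hold logic that keeps VALID asserted until READY accepts it is not shown):

    // Sketch: throttle AXI requests on on-chip FIFO occupancy.
    localparam BURST_BEATS = 16;  // e.g. a 1 KiB burst at 512-bit data width

    // Issue a read request only if the read-data FIFO can absorb the
    // whole burst, so RREADY never has to drop mid-burst.
    assign arvalid = ar_pending && (rd_fifo_space >= BURST_BEATS);

    // Issue a write request only once the full burst's data is buffered,
    // so W beats stream without stalling and B responses drain promptly.
    assign awvalid = aw_pending && (wr_fifo_count >= BURST_BEATS);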

Thanks Deep!

hsharma
answered 3 years ago
