Limited number of in-flight DDR requests on SDAccel shell


Hello,

I am developing an RTL kernel for SDAccel 2019.1 that performs single read requests (no bursts) to all DDR4 channels. Since the bandwidth I was getting was significantly lower than expected (only about 0.7 memory requests per cycle across all four channels in aggregate), I developed a simple kernel that performs single sequential accesses to benchmark the ideal performance of the four memory channels. To my surprise, it appears that channel 0 performs differently from channels 1, 2, and 3, which contradicts the specification: https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md#ddr4axi "NOTE: There is no performance or frequency difference between the four DRAM controllers regardless whether they resides in the CL or the Shell logic"

In particular:

  • channel 0 has a latency of 80-140 cycles at 250 MHz, vs. 10-20 cycles for the other channels
  • channel 0 supports up to 18 outstanding memory requests, while the other channels are limited to 3

Because the maximum number of outstanding memory requests is, in both cases, lower than the latency, the maximum bandwidth that can be achieved with single requests is very low, about 19% of the ideal bandwidth on channel 0 and 17% on the other channels.
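To spell out the arithmetic (a back-of-the-envelope bound from Little's law; the average latencies of roughly 95 and 17.5 cycles are my assumptions within the measured ranges above):

$$
\text{requests/cycle} \;\le\; \frac{N_{\text{outstanding}}}{\text{latency}}
\qquad\Rightarrow\qquad
\text{channel 0: } \frac{18}{\approx 95} \approx 0.19,
\qquad
\text{channels 1--3: } \frac{3}{\approx 17.5} \approx 0.17
$$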

Considering that the specs say that "The DRAM interface uses the Xilinx DDR-4 Interface controller", I was expecting the interfaces to behave similarly to the MIG DDR3 controller, where accesses can be fully pipelined and it is possible to achieve almost 100% of the ideal bandwidth even with single requests.

If this also holds for the DDR4 controller, then I believe that some of the blocks along the path between the kernel and the MIG may have a buffering capacity lower than the memory latency. I tried editing the Vivado block diagram generated by SDAccel: I could see there are some AXI SmartConnects between the kernel and the memory port inside a block called "SDx Memory Subsystem", but that block is flagged as read-only. Is there a way to edit those blocks, or to configure SDAccel so that they are generated without buffers that limit the number of in-flight memory requests?

Thanks,

Mikhail

Edited by: mikhai on Apr 22, 2020 1:21 AM

Edited by: mikhai on Apr 22, 2020 5:10 PM

mikhai
asked 4 years ago · 208 views
3 Answers
Accepted Answer

SDAccel platforms are optimized for larger block transfers into and out of DDR memories. Any time a kernel or the host requests very short (1-4 data-beat) accesses, we expect a drop in memory bandwidth. As you pointed out, this is a direct result of the relationship between round-trip latency and command pipelining. That is why Xilinx recommends memory accesses close to the AXI protocol limit of 4 KB per access, as described in this documentation link: https://www.xilinx.com/html_docs/xilinx2019_2/vitis_doc/Chunk2020182740.html#ghz1504034325224 . The best path forward would be to increase your kernel's burst length and add internal BRAM buffering, as needed, to support non-sequential data access patterns. Increasing command pipelining might help to a limited extent, but accesses as short as single-beat or 4-beat bursts will not use the AXI data path efficiently and will therefore still limit overall bandwidth.
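For illustration, here is a minimal HLS-style sketch of the pattern described above: read a long, sequential burst into a local BRAM buffer, then work out of that buffer. This is a hypothetical example (the kernel and port names are made up), not the poster's RTL kernel; with a 512-bit data path, 64 beats per burst corresponds to the 4 KB AXI limit mentioned above.

```cpp
// Hypothetical HLS kernel (names made up) illustrating the "long burst into
// local BRAM" pattern. A 512-bit (64-byte) data path with 64-beat bursts
// moves 4 KB per AXI read command.
#include <ap_int.h>

typedef ap_uint<512> word_t;    // one 64-byte AXI beat
static const int BURST = 64;    // 64 beats x 64 B = 4 KB per burst

extern "C" void copy_kernel(word_t *in, word_t *out, int num_words) {
#pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem1 max_write_burst_length=64
#pragma HLS INTERFACE s_axilite port=in        bundle=control
#pragma HLS INTERFACE s_axilite port=out       bundle=control
#pragma HLS INTERFACE s_axilite port=num_words bundle=control
#pragma HLS INTERFACE s_axilite port=return    bundle=control

    word_t buf[BURST];          // local BRAM buffer
    for (int base = 0; base < num_words; base += BURST) {
        int chunk = (num_words - base < BURST) ? (num_words - base) : BURST;

        // A pipelined loop over consecutive addresses is inferred as one
        // long AXI read burst, hiding the DDR round-trip latency.
    read_loop:
        for (int i = 0; i < chunk; ++i) {
#pragma HLS PIPELINE II=1
            buf[i] = in[base + i];
        }

        // Non-sequential processing would operate on buf[] here.

    write_loop:
        for (int i = 0; i < chunk; ++i) {
#pragma HLS PIPELINE II=1
            out[base + i] = buf[i];
        }
    }
}
```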

Also, the results posted indicate a significant difference in latency between the SDAccel and HDK design methodologies. Please keep in mind that SDAccel solutions always provide pathways from the host PCIe bridge to each of the DDR memories, so kernel-to-memory traffic also passes through AXI interconnect switches. Some of those latency cycles are consumed in the clock domain crossing that is present, by default, in the kernel-to-memory pathways of SDAccel solutions. You have control over your kernel clock and may be able to eliminate those CDC cycles.

Please let us know if you are still seeing issues or have additional questions.

AWS
answered 4 years ago

A small follow-up for other developers who may be interested in maximizing bandwidth with short requests.

I benchmarked the four DDR4 channels using a simple accelerator that sends out read requests with arbitrary stride and burst length, both as an RTL kernel in SDAccel and using the HDK flow (a small software model of the request pattern is sketched after the TL;DR below). Here are the main findings:

Achieved bandwidth in GB/s, per channel, with 2^20 sequential reads as a function of burst length: https://pastebin.com/05JNMd00 (the ideal bandwidth per channel is 64 bytes/cycle * 250 MHz = 16 GB/s)

Bandwidth in GB/s per channel with burst length = 1 (ARLEN=0) as a function of the stride (in 64 byte words): https://pastebin.com/2uE42E8u

With burst length = 1 (ARLEN=0), stride = 1:
Latency in cycles:
HDK, DDRA, B, D: 39-170, avg. 64.1
HDK, DDRC: 43-177, avg. 74.9
SDAccel, bank1, 2, 3: 70-208, avg. 84.2
SDAccel, bank0: 74-191, avg. 91.8

Average number of in-flight requests:
HDK, DDRA, B, D: 32.4
HDK, DDRC: 35.8
SDAccel, bank1, 2, 3: 17.8
SDAccel, bank0: 17.8
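
As a sanity check, applying Little's law to the averages above (and assuming the full 64-byte data width per request at 250 MHz, i.e. the 16 GB/s ideal figure) gives bandwidths that should roughly match the burst-length-1 points in the tables linked above:

$$
\text{BW} \approx \frac{\overline{N}_{\text{in-flight}}}{\overline{L}} \times 64\,\text{B} \times 250\,\text{MHz}
$$
$$
\text{HDK, DDRA/B/D: } \tfrac{32.4}{64.1} \times 16\,\text{GB/s} \approx 8.1\,\text{GB/s},
\qquad
\text{HDK, DDRC: } \tfrac{35.8}{74.9} \times 16\,\text{GB/s} \approx 7.6\,\text{GB/s}
$$
$$
\text{SDAccel, bank1/2/3: } \tfrac{17.8}{84.2} \times 16\,\text{GB/s} \approx 3.4\,\text{GB/s},
\qquad
\text{SDAccel, bank0: } \tfrac{17.8}{91.8} \times 16\,\text{GB/s} \approx 3.1\,\text{GB/s}
$$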

TL;DR

  • From the accelerator's perspective, the DDR4 channels on SDAccel have slightly longer latency and sustain about half as many in-flight requests as on HDK.
  • The minimum burst length that saturates the bandwidth is 2 on HDK and 8 on SDAccel.
  • bank0 / DDRC (the controller implemented in the shell) has 5-6% lower bandwidth than the others.
  • Bandwidth as a function of the stride is consistent with the address mapping specified here: https://forums.aws.amazon.com/message.jspa?messageID=897304
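
The sketch below is the small software model of the request pattern mentioned above (stride expressed in 64-byte words, ARLEN = burst length - 1). It is an illustration only, not the RTL kernel itself, and the names are hypothetical:

```cpp
#include <cstdint>
#include <cstdio>

// One AXI read command as seen on the AR channel.
struct ReadCmd {
    uint64_t araddr;  // byte address
    uint8_t  arlen;   // beats per burst minus 1 (ARLEN)
};

// Generate `num_requests` read commands with the given burst length (in
// 64-byte beats) and stride (in 64-byte words), as in the benchmark above.
void generate_reads(uint64_t base, uint32_t num_requests,
                    uint32_t burst_len, uint32_t stride_words,
                    ReadCmd *out) {
    for (uint32_t i = 0; i < num_requests; ++i) {
        out[i].araddr = base + (uint64_t)i * stride_words * 64;  // 64 B per word
        out[i].arlen  = (uint8_t)(burst_len - 1);                // burst_len=1 -> ARLEN=0
    }
}

int main() {
    ReadCmd cmds[4];
    // Example: single-beat reads (ARLEN=0) with a stride of 2 words (128 B).
    generate_reads(/*base=*/0, /*num_requests=*/4, /*burst_len=*/1, /*stride_words=*/2, cmds);
    for (const ReadCmd &c : cmds)
        std::printf("araddr=0x%llx arlen=%u\n",
                    (unsigned long long)c.araddr, (unsigned)c.arlen);
    return 0;
}
```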

Edited by: mikhai on May 4, 2020 8:54 PM

mikhai
answered 4 years ago

Hi mikhai

I have sent you a PM requesting additional information, in order to help resolve this.

Regards
amrxilinx

answered 4 years ago
