SDAccel platforms are optimized for larger block transfers into and out of DDR memories. Any time a kernel or host requests very short (1-4 data-beat) accesses, we expect a drop in memory bandwidth. As you pointed out, this is a direct result of the relationship between round-trip latency and command pipelining. That is why Xilinx recommends memory accesses that are close to the AXI protocol limit of 4kB per access, as described in the documentation (https://www.xilinx.com/html_docs/xilinx2019_2/vitis_doc/Chunk2020182740.html#ghz1504034325224). The best path forward would be to increase your kernel's burst length and add internal BRAM buffering, as needed, to support non-sequential data access patterns. Increasing command pipelining might help to a limited extent, but accesses as short as singles or 4-beat bursts will not use the AXI bandwidth efficiently, which limits the overall bandwidth.
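To make the burst-length advice concrete, here is a minimal sketch of how a transfer could be cut into long bursts that stay within the 4kB limit. It assumes a 64-byte (512-bit) data beat, beat-aligned transfers, and the AXI4 cap of 256 beats per INCR burst; the function name `split_into_bursts` is made up for illustration and is not part of any Xilinx or AWS API.

```python
AXI_BOUNDARY = 4096   # an AXI burst must not cross a 4 KB address boundary
MAX_BEATS = 256       # AXI4 INCR burst limit (ARLEN/AWLEN is 8 bits)

def split_into_bursts(addr, nbytes, beat_bytes=64):
    """Return a list of (start_address, beats) covering [addr, addr + nbytes).

    Illustrative helper only: assumes addr and nbytes are multiples of
    beat_bytes (64 bytes = one 512-bit beat in this sketch).
    """
    if addr % beat_bytes or nbytes % beat_bytes:
        raise ValueError("this sketch only handles beat-aligned transfers")
    bursts = []
    end = addr + nbytes
    while addr < end:
        # Bytes left before the next 4 KB boundary.
        to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY)
        chunk = min(end - addr, to_boundary, MAX_BEATS * beat_bytes)
        bursts.append((addr, chunk // beat_bytes))  # ARLEN would be beats - 1
        addr += chunk
    return bursts

if __name__ == "__main__":
    for start, beats in split_into_bursts(0x0FC0, 256):
        print(hex(start), beats)
```

With 64-byte beats, a 4kB-aligned region maps to at most 64 beats per burst, so an 8 KB aligned transfer becomes two 64-beat bursts rather than 128 single-beat requests.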
Also, the results posted indicate a significant difference in latency between the SDAccel and HDK design methodologies. Please keep in mind that SDAccel solutions always provide pathways from the host PCIe bridge to each of the DDR memories, so kernel-to-memory traffic also has to pass through AXI interconnect switches. Some of those latency cycles are consumed by the clock domain crossing (CDC) that is present, by default, in the kernel-to-memory pathways of SDAccel solutions. Since you have control over your kernel clock, you may be able to eliminate those CDC cycles.
Please let us know if you are still seeing issues or have additional questions.
A small follow-up for other developers who may be interested in maximizing bandwidth with short requests.
I benchmarked the four DDR4 channels using a simple accelerator that sends out read requests with arbitrary stride and burst length, both as an RTL kernel in SDAccel and using the HDK flow. Here are the main findings:
Achieved bandwidth in GB/s, per channel, with 2^20 sequential reads as a function of burst length: https://pastebin.com/05JNMd00 (the ideal bandwidth per channel is 64 bytes/cycle * 250 MHz = 16 GB/s)
Bandwidth in GB/s per channel with burst length = 1 (ARLEN=0) as a function of the stride (in 64 byte words): https://pastebin.com/2uE42E8u
With burst length = 1 (ARLEN=0), stride = 1:
Latency in cycles:
HDK, DDRA, B, D: 39-170, avg. 64.1
HDK, DDRC: 43-177, avg. 74.9
SDAccel, bank1, 2, 3: 70-208, avg. 84.2
SDAccel, bank0: 74-191, avg. 91.8
Average number of in-flight requests:
HDK, DDRA, B, D: 32.4
HDK, DDRC: 35.8
SDAccel, bank1, 2, 3: 17.8
SDAccel, bank0: 17.8
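As a rough sanity check on the numbers above, Little's law ties them together: throughput in requests per cycle is roughly the average number of in-flight requests divided by the average latency. The sketch below applies that to the measured averages, assuming the 64 bytes/cycle at 250 MHz figure quoted earlier and, as a simplifying assumption that was not measured, that latency and the in-flight count stay the same for longer bursts. The function name is made up for illustration.

```python
CLOCK_HZ = 250e6    # interface clock quoted above
BEAT_BYTES = 64     # bytes transferred per cycle at full rate

def estimated_bandwidth_gbps(in_flight, latency_cycles, burst_len=1):
    """Little's law estimate of bandwidth per channel, in GB/s.

    Assumption (not measured): in_flight and latency_cycles do not change
    with burst_len. The result is capped at one beat per cycle.
    """
    requests_per_cycle = in_flight / latency_cycles
    beats_per_cycle = min(1.0, requests_per_cycle * burst_len)
    return beats_per_cycle * BEAT_BYTES * CLOCK_HZ / 1e9

if __name__ == "__main__":
    # Averages for HDK DDRA/B/D and SDAccel bank1/2/3 from the lists above.
    for name, in_flight, latency in [("HDK", 32.4, 64.1), ("SDAccel", 17.8, 84.2)]:
        for burst_len in (1, 2, 4, 8):
            bw = estimated_bandwidth_gbps(in_flight, latency, burst_len)
            print(f"{name:8s} burst={burst_len}: {bw:5.2f} GB/s")
```

Under this model, HDK sustains about 0.5 requests per cycle and SDAccel about 0.21, so the estimate reaches the 16 GB/s ceiling at burst length 2 on HDK and only somewhere between 4 and 8 on SDAccel. It is a back-of-the-envelope model, not a replacement for the measured curves in the pastebin links.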
TL;DR
- From the accelerator's perspective, the DDR4 channels on SDAccel have slightly longer latency and sustain about half the number of in-flight requests compared to HDK.
- The minimum burst length that saturates the bandwidth is 2 on HDK and 8 on SDAccel.
- bank0 (SDAccel) or DDRC (HDK), i.e. the controller implemented in the shell, has 5-6% lower bandwidth than the other channels.
- Bandwidth as a function of the stride is consistent with the address mapping specified here: https://forums.aws.amazon.com/message.jspa?messageID=897304
Edited by: mikhai on May 4, 2020 8:54 PM
Hi mikhai
I have sent you a PM requesting additional information, in order to help resolve this.
Regards
amrxilinx