OpenCL + RTL Kernel, number of arguments and how to write kernel.xml?

0

Hello everyone,

Is there a limit on the number of arguments (scalars and pointers) in the OpenCL + RTL kernel flow?

Currently, SDAccel_Examples has an example with only three pointers (two inputs, one output) and one scalar argument that defines the size of the buffers. Is there a way to have, say, six pointers (five inputs and one output) and one scalar argument?

How would you write kernel.xml for that case? Here is the example for rtl_vadd:

https://github.com/Xilinx/SDAccel_Examples/blob/master/getting_started/rtl_kernel/rtl_vadd/src/kernel.xml

I tried running the RTL Kernel Wizard on AWS F1 with several different AMIs, without success; it always crashes.

I also tried to get four pointers working in hardware emulation by adding 0xC to the offset of each successive argument:

id="0" offset="0x10" ---> input-pointer-0
id="1" offset="0x1C" ---> input-pointer-1
id="2" offset="0x28" ---> input-pointer-2
id="3" offset="0x34" ---> output-pointer
id="4" offset="0x40" ---> scalar-arg

This works only in hardware emulation and fails on real hardware. I have already tested the 3-pointer, 1-scalar case successfully. I also made all the necessary changes in the hardware (Verilog source code) and in the host code, not only in the XML file.

Any help will be greatly appreciated.

Thanks!

xor
asked 5 years ago, 408 views
8 Answers
0

If you use the RTL Kernel Wizard to create a kernel with a single scalar input and 6 pointers all mapped to the same AXI MM interface, the generated XML file will look like this:

<?xml version="1.0" encoding="UTF-8"?>
<root versionMajor="1" versionMinor="6">
  <kernel name="kernel6ptr" language="ip_c" vlnv="mycompany.com:kernel:kernel6ptr:1.0" attributes="" preferredWorkGroupSizeMultiple="0" workGroupSize="1" interrupt="true">
    <ports>
      <port name="s_axi_control" mode="slave" range="0x1000" dataWidth="32" portType="addressable" base="0x0"/>
      <port name="m00_axi" mode="master" range="0xFFFFFFFFFFFFFFFF" dataWidth="512" portType="addressable" base="0x0"/>
    </ports>
    <args>
      <arg name="scalar00" addressQualifier="0" id="0" port="s_axi_control" size="0x4" offset="0x010" type="uint" hostOffset="0x0" hostSize="0x4"/> 
      <arg name="axi00_ptr0" addressQualifier="1" id="1" port="m00_axi" size="0x8" offset="0x018" type="int*" hostOffset="0x0" hostSize="0x8"/> 
      <arg name="axi00_ptr1" addressQualifier="1" id="2" port="m00_axi" size="0x8" offset="0x020" type="int*" hostOffset="0x0" hostSize="0x8"/> 
      <arg name="axi00_ptr2" addressQualifier="1" id="3" port="m00_axi" size="0x8" offset="0x028" type="int*" hostOffset="0x0" hostSize="0x8"/> 
      <arg name="axi00_ptr3" addressQualifier="1" id="4" port="m00_axi" size="0x8" offset="0x030" type="int*" hostOffset="0x0" hostSize="0x8"/> 
      <arg name="axi00_ptr4" addressQualifier="1" id="5" port="m00_axi" size="0x8" offset="0x038" type="int*" hostOffset="0x0" hostSize="0x8"/> 
      <arg name="axi00_ptr5" addressQualifier="1" id="6" port="m00_axi" size="0x8" offset="0x040" type="int*" hostOffset="0x0" hostSize="0x8"/> 
    </args>
  </kernel>
</root>

The example generated by the RTL Kernel Wizard passed HW emulation when targeting the AWS F1 platform.
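
For a different number of pointers, the argument lines follow a regular pattern. Below is a minimal C++ sketch that prints them, assuming the same layout as the listing above (scalar at 0x010, then 64-bit pointers every 8 bytes starting at 0x018, all on m00_axi). The helper names `pointer_offset` and `pointer_arg_xml` are made up for illustration and are not part of any Xilinx tool.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: pointer i lives at 0x018 + 8*i, as in the wizard-generated listing.
std::uint32_t pointer_offset(unsigned index) {
    return 0x018u + 8u * index;
}

// Builds one <arg> line for pointer `index`; id is index + 1 because the
// scalar argument takes id 0 in that listing.
std::string pointer_arg_xml(unsigned index) {
    char buf[256];
    std::snprintf(buf, sizeof buf,
                  "<arg name=\"axi00_ptr%u\" addressQualifier=\"1\" id=\"%u\" "
                  "port=\"m00_axi\" size=\"0x8\" offset=\"0x%03X\" type=\"int*\" "
                  "hostOffset=\"0x0\" hostSize=\"0x8\"/>",
                  index, index + 1u, static_cast<unsigned>(pointer_offset(index)));
    return std::string(buf);
}
```

Calling `pointer_arg_xml(i)` for i = 0..5 should reproduce the six pointer lines shown above. The names, port, and types still have to match what the RTL actually implements.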

answered 5 years ago
0

Thanks for your reply, but when I try it, the output buffer comes back all zeros. Here is the log:

INFO: [ConfigUtil 60-895]   Target platform: /home/centos/src/project_data/aws-fpga/SDAccel/aws_platform/xilinx_aws-vu9p-f1-04261818_dynamic_5_0/xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xpfm
emulation configuration file `emconfig.json` is created in current working directory
XCL_EMULATION_MODE=hw_emu ./host
ERROR: xclProbe-scan failed at fpga_pci_get_all_slot_specs
xclProbe found 0 FPGA slots with xocl driver running
Found Platform
Platform Name: Xilinx
XCLBIN File Name: vadd
INFO: Importing xclbin/vadd.hw_emu.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin
Loading: 'xclbin/vadd.hw_emu.xilinx_aws-vu9p-f1-04261818_dynamic_5_0.xclbin'
INFO: [SDx-EM 01] Hardware emulation runs simulation underneath. Using a large data set will result in long simulation times. It is recommended that a small dataset is used for faster execution. This flow does not use cycle accurate models and hence the performance data generated is approximate.
WARNING: unaligned host pointer '0x23cac20' detected, this leads to extra memcpy
WARNING: unaligned host pointer '0x23cac70' detected, this leads to extra memcpy
WARNING: unaligned host pointer '0x23cacc0' detected, this leads to extra memcpy
WARNING: unaligned host pointer '0x23cad10' detected, this leads to extra memcpy
WARNING: unaligned host pointer '0x23cad60' detected, this leads to extra memcpy
WARNING: unaligned host pointer '0x23cadb0' detected, this leads to extra memcpy
out[0]: 0
out[1]: 0
out[2]: 0
out[3]: 0
out[4]: 0
out[5]: 0
out[6]: 0
out[7]: 0
out[8]: 0
out[9]: 0
out[10]: 0
out[11]: 0
out[12]: 0
out[13]: 0
out[14]: 0
out[15]: 0
INFO: [SDx-EM 22] [Wall clock time: 22:55, Emulation time: 0.00332367 ms] Data transfer between kernel(s) and global memory(s)
BANK0          RD = 0.125 KB               WR = 0.062 KB
BANK1          RD = 0.000 KB               WR = 0.000 KB
BANK2          RD = 0.000 KB               WR = 0.000 KB
BANK3          RD = 0.000 KB               WR = 0.000 KB
BANKkrnl_vadd_rtl_1/m_axi_gmem          RD = 0.000 KB               WR = 0.000 KB
krnl_vadd_rtl_1:m_axi_gmem          RD = 0.125 KB               WR = 0.062 KB

Here is my host code

#include "xcl2.hpp"
#include <cstdio>   // printf
#include <cstdlib>  // malloc, free
#include <vector>

int main(int argc, char** argv)
{
    int size = 16;

    size_t size_bytes = sizeof(int) * size;

    int *ibuf_0 = static_cast<int *>(malloc(size_bytes));
    int *ibuf_1 = static_cast<int *>(malloc(size_bytes));
    int *ibuf_2 = static_cast<int *>(malloc(size_bytes));
    int *ibuf_3 = static_cast<int *>(malloc(size_bytes));
    int *ibuf_4 = static_cast<int *>(malloc(size_bytes));
    int *obuf_0 = static_cast<int *>(malloc(size_bytes));

    // Create the test data and Software Result
    for(int i = 0 ; i < size ; i++){
        ibuf_0[i] = 0xa;
        ibuf_1[i] = 0x2;
        ibuf_2[i] = 0x9;
        ibuf_3[i] = 0x4;
        ibuf_4[i] = 0x5;
        obuf_0[i] = 0x6;
    }

//OPENCL HOST CODE AREA START
    //Create Program and Kernel
    std::vector<cl::Device> devices = xcl::get_xil_devices();
    cl::Device device = devices[0];

    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    std::string device_name = device.getInfo<CL_DEVICE_NAME>();

    std::string binaryFile = xcl::find_binary_file(device_name,"vadd");
    cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
    devices.resize(1);
    cl::Program program(context, devices, bins);
    cl::Kernel krnl_vadd(program,"krnl_vadd_rtl");

    //Allocate Buffer in Global Memory
    std::vector<cl::Memory> ibuf_vec, obuf_vec;
    cl::Buffer ocl_ibuf_0(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, size_bytes, ibuf_0);
    cl::Buffer ocl_ibuf_1(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, size_bytes, ibuf_1);
    cl::Buffer ocl_ibuf_2(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, size_bytes, ibuf_2);
    cl::Buffer ocl_ibuf_3(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, size_bytes, ibuf_3);
    cl::Buffer ocl_ibuf_4(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, size_bytes, ibuf_4);
    cl::Buffer ocl_obuf_0(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, size_bytes, obuf_0);

    ibuf_vec.push_back(ocl_ibuf_0);
    ibuf_vec.push_back(ocl_ibuf_1);
    ibuf_vec.push_back(ocl_ibuf_2);
    ibuf_vec.push_back(ocl_ibuf_3);
    ibuf_vec.push_back(ocl_ibuf_4);
    obuf_vec.push_back(ocl_obuf_0);

    //Copy input data to device global memory
    q.enqueueMigrateMemObjects(ibuf_vec, 0/* 0 means from host*/);

    //Set the Kernel Arguments
    int nargs = 0;
    krnl_vadd.setArg(nargs++, size);
    krnl_vadd.setArg(nargs++, ocl_ibuf_0);
    krnl_vadd.setArg(nargs++, ocl_ibuf_1);
    krnl_vadd.setArg(nargs++, ocl_ibuf_2);
    krnl_vadd.setArg(nargs++, ocl_ibuf_3);
    krnl_vadd.setArg(nargs++, ocl_ibuf_4);
    krnl_vadd.setArg(nargs++, ocl_obuf_0);

    //Launch the Kernel
    q.enqueueTask(krnl_vadd);

    //Copy Result from Device Global Memory to Host Local Memory
    q.enqueueMigrateMemObjects(obuf_vec, CL_MIGRATE_MEM_OBJECT_HOST);
    q.finish();

//OPENCL HOST CODE AREA END

    for (int i = 0 ; i < size ; i++){
        printf("out[%d]: %x\n", i, obuf_0[i]);
    }

    free(ibuf_0);
    free(ibuf_1);
    free(ibuf_2);
    free(ibuf_3);
    free(ibuf_4);
    free(obuf_0);

    return 0;
}

Is there a problem with my host code? It works with 1 scalar and 5 pointers, but it fails with 1 scalar and 6 pointers, or whenever I have to use an offset of 0x40 in the kernel.xml file.

Thanks!

Edited by: xor on Jan 8, 2019 3:29 PM

xor
answered 5 years ago
0

There doesn't seem to be anything wrong with your host code (at least nothing obvious).
Could the issue be with your RTL code? It is odd that it would work with 5 pointers but not with 6.
As mentioned earlier, the simple example generated by the wizard works fine.

Are you able to run HW emulation in debug mode to look at the RTL waveforms?
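
One thing that does show up in the log is the "unaligned host pointer" warnings, which come from passing malloc'd buffers with CL_MEM_USE_HOST_PTR. They cost an extra memcpy and are probably not what zeroes the output, but they are easy to avoid. A minimal sketch, assuming a POSIX system and a 4096-byte alignment requirement (that value is an assumption here, so check it against your platform documentation); `alloc_aligned_ints` and `is_aligned` are hypothetical helper names:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocates `count` ints on an `alignment`-byte boundary using posix_memalign.
// Returns nullptr on failure; release the memory with free().
int* alloc_aligned_ints(std::size_t count, std::size_t alignment = 4096) {
    void* p = nullptr;
    if (posix_memalign(&p, alignment, count * sizeof(int)) != 0) {
        return nullptr;
    }
    return static_cast<int*>(p);
}

// True if the pointer value is a multiple of `alignment`.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

In the host code, each `static_cast<int *>(malloc(size_bytes))` could then be replaced with `alloc_aligned_ints(size)`, keeping the existing `free` calls.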

answered 5 years ago
0

Hi,

Yes, I have run these cases successfully:

3-pointers and 1-scalar
4-pointers and 1-scalar
5-pointers and 1-scalar

But once I go to 6 pointers and 1 scalar, it breaks.

How do I view RTL waveforms in hardware emulation? Is there a tutorial or documentation for that?

Thanks!

xor
answered 5 years ago
0

Enabling RTL waveforms during emulation is covered in the SDAccel documentation:
https://www.xilinx.com/html_docs/xilinx2018_2/sdaccel_doc/device-hardware-transaction-view-nng1504034335037.html

In short:
o When using the SDx GUI, you can enable the waveforms from the Run Configurations settings.
o When working from the command line, you need to add the two lines below to the sdaccel.ini file:

[Emulation]
launch_waveform=gui

The documentation referenced above explains both approaches in greater detail.

Edited by: ThomasXilinx on Jan 9, 2019 11:24 AM

answered 5 years ago
0

Thanks for the pointer. However, this waveform view is more about device-level transactions than RTL waveform debugging; it is more like a kernel events timeline. In the documentation's own words: "The details include data transfers between the kernel and global memory, data flow via inter-kernel pipes as well as data flow via intra-kernel pipes."

I am checking it to see if I can find anything there. Otherwise, I will probably go from the RTL Kernel Wizard into Vivado to create a testbench for this particular case and see if I can find something.

xor
answered 5 years ago
0

The default waveform setup will indeed trace data transfers between the kernel and global memory, data flow via inter-kernel pipes as well as data flow via intra-kernel pipes. But in interactive mode, you can also access all the signals in the RTL kernel and add them to the waveform. So you are not limited to the default trace configuration.

That said, in your case the default setup would let you check whether there are any data transfers related to the 6th pointer.

Please confirm the outcome of using the RTL Kernel Wizard. As mentioned earlier, I tried this yesterday and the generated example worked for me.

answered 5 years ago
0

You were right; I did not know I could add RTL signals to the waveform in hardware emulation. I finally found my issue: it was the bit width of the control address. This is the example I am using:

https://github.com/Xilinx/SDAccel_Examples/blob/master/getting_started/rtl_kernel/rtl_vadd/src/hdl/krnl_vadd_rtl.v#L45

This example uses 6 bits for the AXI4-Lite control addresses, which covers offsets 0x00 to 0x3F only. The 6th pointer sits at offset 0x40, so I need 7 bits.
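
The arithmetic can be checked with a few lines of C++. This is only a sketch: `min_addr_bits` and `addr_bits_for_reg` are illustrative names, and the register sizes (4 bytes for the scalar, 8 bytes for a pointer) are taken from the kernel.xml listing earlier in this thread.

```cpp
#include <cstdint>

// Smallest address width, in bits, that can address every byte from 0
// up to and including `last_byte`.
unsigned min_addr_bits(std::uint32_t last_byte) {
    unsigned bits = 0;
    while ((std::uint64_t{1} << bits) <= last_byte) {
        ++bits;
    }
    return bits;
}

// Address width needed for a register of `size_bytes` starting at `offset`.
unsigned addr_bits_for_reg(std::uint32_t offset, std::uint32_t size_bytes) {
    return min_addr_bits(offset + size_bytes - 1);
}
```

With 8-byte pointers, the register at 0x038 ends at byte 0x3F and still fits in 6 bits, while the one at 0x040 needs 7. The same applies to the 0xC-stride layout from my first post, where the scalar ended up at 0x40.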

Thanks for all your help!

xor
answered 5 years ago
