"creating kernel" step took more than 36 hours in hardware system build


Hello!

I'm doing a hardware system build with Vitis on a z1d.2xlarge instance. The goal is to accelerate an algorithm on an f1.2xlarge instance, and I am using the latest FPGA Developer AMI on the build instance.

My application has one kernel whose computation-intensive core is two fully unrolled matrix-vector multiplications, each of which looks like the following:

#include "ap_fixed.h"

#define MAX_ROWS #a number#
#define MAX_COLUMNS #a number#

// ap_fixed<W, I>: W total bits, I of them integer bits
typedef ap_fixed<2, 2, AP_TRN, AP_WRAP> fixed_point_2t;
typedef ap_fixed<18, 6, AP_TRN, AP_WRAP> fixed_point_16t;
typedef ap_fixed<32, 16, AP_TRN, AP_WRAP> fixed_point_32t;

fixed_point_2t matrix[MAX_ROWS][MAX_COLUMNS] = {0};
fixed_point_32t vector[MAX_COLUMNS] = {0};
fixed_point_32t product[MAX_ROWS] = {0};

// NUM_COLUMNS is a runtime variable (see below); pipelining the outer
// loop fully unrolls the inner loop across MAX_ROWS.
for (int i = 0; i < NUM_COLUMNS; i++)
{
#pragma HLS PIPELINE
    for (int j = 0; j < MAX_ROWS; j++)
    {
        product[j] += fixed_point_16t(matrix[j][i]) * fixed_point_16t(vector[i]);
    }
}

NUM_COLUMNS is a runtime variable in the kernel function, while MAX_ROWS and MAX_COLUMNS are hardcoded constants. The goal is for the kernel to support matrix-vector multiplication for any N_ROWS <= MAX_ROWS and N_COLUMNS <= MAX_COLUMNS. All the arrays are partitioned to support the corresponding unrolling. I have omitted the memory-transfer part from the code above.
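For concreteness, here is a minimal, host-compilable sketch of the pattern (float stands in for the ap_fixed types so it compiles anywhere, and a host compiler simply ignores the "#pragma HLS" lines; the small sizes and the LOOP_TRIPCOUNT bounds are illustrative, not my real values). The LOOP_TRIPCOUNT pragma is only a reporting hint that keeps latency estimates meaningful for a loop with a variable bound:

```cpp
// Host-compilable sketch of the bounded-loop pattern above.
#define MAX_ROWS    4   // small sizes for illustration; the real design uses 2000
#define MAX_COLUMNS 4

void matvec(const float matrix[MAX_ROWS][MAX_COLUMNS],
            const float vec[MAX_COLUMNS],
            float product[MAX_ROWS],
            int num_columns)   // runtime column count, bounded by MAX_COLUMNS
{
    // Clear the accumulators before the multiply-accumulate loop.
    for (int j = 0; j < MAX_ROWS; j++)
        product[j] = 0;

    for (int i = 0; i < num_columns; i++) {
// LOOP_TRIPCOUNT tells the scheduler the expected trip-count range of a
// variable-bound loop; it affects only reports, not the generated hardware.
#pragma HLS LOOP_TRIPCOUNT min=1 max=MAX_COLUMNS
#pragma HLS PIPELINE
        for (int j = 0; j < MAX_ROWS; j++) {
            product[j] += matrix[j][i] * vec[i];
        }
    }
}
```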

Currently, I am compiling a version with MAX_ROWS = 2000 and MAX_COLUMNS = 2000, and it has been stuck on the "creating kernel" step for 36 hours now. I can see that it is constantly using 100% of one CPU core, and the RAM usage on the z1d instance slowly increases over time.

From some back-of-the-envelope calculations, I am fairly confident that the FPGA resources (DSPs, BRAMs) can support what the application requires. However, this long build time has made me very worried.

Here are my questions:

1. Is this the correct way to do large matrix-vector multiplication (N >= 2000)? If not, what would be a better approach?
2. Is there anywhere I can check whether the synthesis is stuck? It has been running for a fairly long time now, and I am not sure it will ever finish.

I can share my source code if needed, but I would prefer not to post it on the forum (sorry about that). Thank you for reading through my post; I am open to any opinions or help on this issue. Please let me know if any part of my question is unclear. Any suggestion is greatly appreciated!

Thank you,

Owen

Asked 4 years ago · Viewed 297 times

4 Answers

Hi,

It looks like either your design is running into congestion issues, or you are running out of memory on the host machine. Does the build succeed if you reduce the number of multiplications in your design?

Alternatively, you can try running the build on an instance with more memory.

The Vivado tool writes its progress to vivado.log, which you can check to monitor the build.
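For example, something like the following can locate the log while a build is running (the directory layout under the build area varies across Vitis releases, so treat the search root as an assumption and adjust as needed):

```shell
# Find the most recently modified vivado.log under the current build tree
# and show its tail; use "tail -f" instead to follow progress live.
LOG=$(find . -name vivado.log -printf '%T@ %p\n' 2>/dev/null \
      | sort -n | tail -1 | cut -d' ' -f2-)
if [ -n "$LOG" ]; then
    tail -n 20 "$LOG"
else
    echo "no vivado.log found yet"
fi
```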

Thanks!
Chakra

AWS
Answered 4 years ago

Hi Chakra,

Thank you very much for your reply!

1. Possibility of running out of memory on host machine

I've looked at the RAM usage on the host machine. It uses at most 24.7 GB of the 60 GB available, so I suppose memory is not the critical issue here?

2. Does the build run successfully if you reduce the number of multiplications in your design?

Yes, the build does run successfully if the number of multiplications is reduced. Please see https://forums.xilinx.com/t5/Vitis-Acceleration-SDAccel-SDSoC/Vitis-hardware-emulation-cannot-finish-scheduling/td-p/1131555 for my Vivado HLS reports for matrix sizes of 100 and 1000. When compiling with a matrix size of 1000, it seems Vivado HLS is stuck on scheduling. Do you have any insight into why this happens? Any suggestion is greatly appreciated!

In my Xilinx forum post (the link above), they also recommend using the latest release of Vitis, which is not officially supported by AWS. I am wondering if that is worth trying?

Thank you,
Owen

Answered 4 years ago

Hello,

Thanks for the feedback. Please stay tuned for availability of the latest Vitis version for F1.

-Chakra

AWS
Answered 4 years ago

It turns out the problem was caused by a large array-partition factor (500). If anyone runs into the same problem, try a smaller partition factor.
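Concretely, the fix amounts to replacing the huge partition factor with a much smaller cyclic one, trading unroll width for a build that finishes. A host-compilable sketch (the names, the factor of 8, and the float element type are illustrative assumptions, not the original source; the pragmas are ignored by a host compiler):

```cpp
// Partition with a small cyclic factor instead of 500. The UNROLL
// factor matches the partition factor so each replicated
// multiply-accumulate unit gets its own memory bank.
#define MAX_ROWS    2000
#define MAX_COLUMNS 2000
#define PAR         8   // illustrative; a factor of 500 caused the runaway build

void matvec(const float matrix[MAX_ROWS][MAX_COLUMNS],
            const float vec[MAX_COLUMNS],
            float product[MAX_ROWS],   // accumulated in place; caller zeroes it
            int num_columns)
{
#pragma HLS ARRAY_PARTITION variable=matrix cyclic factor=PAR dim=1
#pragma HLS ARRAY_PARTITION variable=product cyclic factor=PAR
    for (int i = 0; i < num_columns; i++) {
#pragma HLS PIPELINE
        for (int j = 0; j < MAX_ROWS; j++) {
#pragma HLS UNROLL factor=PAR
            product[j] += matrix[j][i] * vec[i];
        }
    }
}
```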

Answered 4 years ago
