Vitis AI on EC2 F1 instance: xbutil shows only device as not ready

Hi,

I'm currently trying to follow this guide to set up Vitis AI on an Amazon EC2 F1 instance: https://github.com/Xilinx/Vitis-AI/tree/master/docs/aws
I think I might be missing something, though: when I get to the end and try one of the linked tutorials (the resnet50 one), it does not work. Running the compiled example fails with a complaint that it cannot acquire a CU.

After looking around a bit, I tried running xbutil scan, which found one device, as expected, but reported that this device was not ready. I cannot for the life of me figure out what exactly is wrong, although I have a few ideas:

  1. While following the Vitis AI AWS guide, after building XRT, I try to install it with the command sudo apt install ./Release/*-xrt.deb, but at the end this fails to install xclmgmt because of some unknown symbols. I found some threads which said this driver shouldn't be needed for Vitis AI. I also then managed to install it by installing linux-modules-extra-5.4.0-1054-aws, as mentioned in https://forums.aws.amazon.com/thread.jspa?messageID=950464, but this did not help. (I did not try a different kernel as mentioned there, since I was unsure whether this was a good idea, especially if said driver ends up being unnecessary.)

  2. I've seen AFIs mentioned a lot while reading through the documentation, but I don't really know how they fit in with using Vitis AI. From following the examples, I was under the impression that I wouldn't need to deal with them, but maybe this is the problem? I would appreciate any resources that would help me understand their role in all this.

  3. I am also wondering whether it is even possible to use Vitis AI with the AWS FPGA devices. It's my understanding that EC2 F1 instances have access to VU9P FPGAs, but when I look at the Vitis AI page, I see no specific mention of this model, which leaves me a bit confused.

Hopefully my problem makes sense. I am more than happy to post some logs/command outputs; I'm just not sure what might be helpful to see right now.
Thanks in advance for any help.

asked 3 years ago · 509 views
7 Answers

I have now managed to fix the problem. After looking through the available AFIs with the aws ec2 describe-fpga-images command, I found one under the name "dpdpuv3_wrapper.hw.xilinx_aws-vu9p-f1_shell-v04261818_201920_2" with global id agfi-0e168992b12da45f9 which seems to work well.

So I suppose the main issue was that no AFI was loaded on startup. I don't know whether that's something that should be expected or not.

Something else worth noting: I did find an older AFI with the same name but a different global id, which did not return reasonable results when running some of the Vitis-AI examples. That's worth keeping in mind for people getting silly results.
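
For anyone hitting the same situation of several AFIs sharing one name, here is a minimal sketch of how the lookup could be scripted. It assumes the AWS CLI is installed and configured; the helper names newest_afi and list_afis are made up for illustration, and picking the most recently updated AFI is only a heuristic, not a guarantee that it is the right one.

```shell
#!/usr/bin/env bash
# Sketch: list AFIs that share a name and pick the most recently updated one.

# Hypothetical helper: reads "NAME<TAB>GLOBAL_ID<TAB>UPDATE_TIME" rows on stdin
# and prints the global id of the row with the latest timestamp. Timestamps are
# assumed to be ISO-8601 strings in one format, so a plain text sort orders them.
newest_afi() {
  LC_ALL=C sort -t "$(printf '\t')" -k3,3 | tail -n 1 | cut -f2
}

# Hypothetical helper: asks AWS for all AFIs with the given name and prints
# one tab-separated row per AFI (name, global id, last update time).
list_afis() {
  aws ec2 describe-fpga-images \
    --filters "Name=name,Values=$1" \
    --query 'FpgaImages[].[Name,FpgaImageGlobalId,UpdateTime]' \
    --output text
}

# Only talk to AWS when a name was passed in and the CLI is available.
if [ "$#" -ge 1 ] && command -v aws >/dev/null 2>&1; then
  list_afis "$1" | newest_afi
fi
```

Saved as a script, this would be called with the AFI name as its only argument. If the AWS FPGA management tools from the aws-fpga SDK are installed, sudo fpga-describe-local-image -S 0 -H should show which AFI is currently loaded in slot 0.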

Thanks for the help Deep!

answered 3 years ago
  • Hi brunopaiva,

    I also intend to run Vitis-AI examples on AWS F1 instances. So far we have followed the vector addition example (https://github.com/aws/aws-fpga/tree/master/Vitis) and also separately tried to quantize and compile an example model using Vitis-AI. However, I am unsure how to deploy the compiled model on an F1 instance.

    Can you please provide some guidance regarding this? Thanks in advance!

Hello,

The Alveo DDR supported DPUs should work on the VU9P on F1.

With AWS F1, there is a concept of AFIs and awsxclbins, as opposed to the bitstreams/xclbins you are used to on other platforms. awsxclbins also carry AFI metadata that lets xclmgmt load the AFI onto the FPGA. So to answer your question, you need XRT and XRT-AWS built, installed, and working in order to use F1 instances with Vitis AI.

Now, on to debugging your issue: did you build XRT on the same Ubuntu host? Can you share the region in which you launched the F1 instance? What does journalctl -u mpd say? The answers would help me get things running for you.

-Deep

Deep_P
answered 3 years ago

Hi Deep, thanks for the help!

Yes, I built XRT on the same Ubuntu host. After building and installing the relevant .deb package, I get the following:

ubuntu@host:~/XRT/build$ sudo apt install ./Release/*-xrt.deb  
Reading package lists... Done  
Building dependency tree         
Reading state information... Done  
Note, selecting 'xrt' instead of './Release/xrt_202110.2.11.0_18.04-amd64-xrt.deb'  
The following NEW packages will be installed:  
  xrt  
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.  
Need to get 0 B/12.4 MB of archives.  
After this operation, 64.3 MB of additional disk space will be used.  
Get:1 /home/ubuntu/XRT/build/Release/xrt_202110.2.11.0_18.04-amd64-xrt.deb xrt amd64 2.11.0 [12.4 MB]
Selecting previously unselected package xrt.  
(Reading database ... 103036 files and directories currently installed.)  
Preparing to unpack .../xrt_202110.2.11.0_18.04-amd64-xrt.deb ...  
Unpacking xrt (2.11.0) ...  
Setting up xrt (2.11.0) ...  
Unloading old XRT Linux kernel modules  
rmmod: ERROR: Module xocl is not currently loaded  
rmmod: ERROR: Module xclmgmt is not currently loaded  
Invoking DKMS common.postinst for xrt  
Loading new xrt-2.11.0 DKMS files...  
Building for 5.4.0-1054-aws  
Building initial module for 5.4.0-1054-aws  
Done.  
  
xocl:  
Running module version sanity check.  
 - Original module  
   - No original module exists within this kernel  
 - Installation  
   - Installing to /lib/modules/5.4.0-1054-aws/updates/dkms/  
  
xclmgmt.ko:  
Running module version sanity check.  
 - Original module  
   - No original module exists within this kernel  
 - Installation  
   - Installing to /lib/modules/5.4.0-1054-aws/updates/dkms/  
  
depmod....  
  
DKMS: install completed.  
Finished DKMS common.postinst  
Loading new XRT Linux kernel modules  
modprobe: ERROR: could not insert 'xclmgmt': Unknown symbol in module, or unknown parameter (see dmesg)  
Installing MSD / MPD daemons  
ubuntu@host:~/XRT/build$ dmesg | grep -e xclmgmt  
[ 1491.933240] xclmgmt: loading out-of-tree module taints kernel.
[ 1491.933862] xclmgmt: module verification failed: signature and/or required key missing - tainting kernel
[ 1491.933980] xclmgmt: Unknown symbol fpga_mgr_create (err -2)
[ 1491.934125] xclmgmt: Unknown symbol fpga_mgr_unregister (err -2)
[ 1491.934195] xclmgmt: Unknown symbol fpga_mgr_register (err -2)
[ 1491.934218] xclmgmt: Unknown symbol fpga_mgr_free (err -2)
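
For readers debugging the same failure, a small sketch for pulling the unresolved symbols out of the kernel log. The helper name unknown_symbols is made up for illustration. The fpga_mgr_* functions belong to the kernel's FPGA manager framework; that the fpga-mgr module comes from linux-modules-extra on this kernel is an assumption based on the first post, not something verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads kernel log text on stdin and prints each symbol
# that the given module failed to resolve, one per line, without duplicates.
unknown_symbols() {
  sed -n "s/.*$1: Unknown symbol \([A-Za-z0-9_]*\).*/\1/p" | sort -u
}

# Example against the live kernel log (dmesg may need root on some systems):
#   dmesg | unknown_symbols xclmgmt
# Check whether the running kernel ships the FPGA manager module at all
# (assumption: the symbols above are exported by this module):
#   modinfo fpga_mgr
```

If modinfo cannot find the module, the running kernel most likely lacks the FPGA manager support that xclmgmt links against, which would match the modprobe error above.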

The region the instance was launched in is eu-west-2.

After going through the rest of the guide I mentioned before, running journalctl -u mpd outputs:

ubuntu@host:~$ journalctl -u mpd  
-- Logs begin at Wed 2021-08-11 17:54:31 UTC, end at Mon 2021-08-16 10:00:36 UTC. --  
Aug 16 09:56:12 ip-172-31-154-75 systemd[1]: Started Xilinx Management Proxy Daemon (MPD).
Aug 16 09:56:12 ip-172-31-154-75 mpd[19312]: started
Aug 16 09:56:12 ip-172-31-154-75 mpd[19312]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Aug 16 09:56:12 ip-172-31-154-75 mpd[19312]: aws: load default afi to 0000:00:1d.0
answered 3 years ago

Hi, so one obvious thing I don't see here is the installation of xrt-aws.deb. Could you try installing both xrt and xrt-aws debs and restarting mpd after?
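
A rough sketch of that check, in case it helps. The package names xrt and xrt-aws are the ones shown in the apt output in this thread; the helper names missing_pkgs and check_and_restart are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints every named Debian package that dpkg-query does
# not report as fully installed (also when dpkg-query itself is unavailable).
missing_pkgs() {
  local pkg status
  for pkg in "$@"; do
    status="$(dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null)"
    [ "$status" = "install ok installed" ] || printf '%s\n' "$pkg"
  done
}

# Hypothetical helper: reports missing packages, and restarts mpd only when
# both packages are present.
check_and_restart() {
  local missing
  missing="$(missing_pkgs xrt xrt-aws)"
  if [ -n "$missing" ]; then
    printf '%s\n' "$missing" | sed 's/^/not installed: /'
    return 1
  fi
  sudo systemctl restart mpd && systemctl --no-pager status mpd
}
```

Calling check_and_restart on the instance either lists what is missing or restarts the daemon and prints its status.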

-Deep

Deep_P
answered 3 years ago

Sorry I didn't mention it, but I have also installed xrt-aws, as per the guide. See:

ubuntu@hostname:~/XRT/build/Release$ sudo apt install ./xrt_202110.2.11.0_18.04-amd64-aws.deb   
Reading package lists... Done  
Building dependency tree         
Reading state information... Done  
Note, selecting 'xrt-aws' instead of './xrt_202110.2.11.0_18.04-amd64-aws.deb'  
xrt-aws is already the newest version (2.11.0).  
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.  

I don't seem to be able to start the mpd service again though. I turned off the instance while I was away and now that I've come back to it, the mpd service is not starting, see the end of the output of journalctl -u mpd (I have a file with the rest, but it's around 380 lines total):

Aug 16 11:10:10 hostname mpd[19312]: udev msg arrived on fd 4
Aug 16 11:10:10 hostname mpd[19312]: udev msg arrived on fd 4
Aug 16 11:10:10 hostname mpd[19312]: udev msg arrived on fd 4
Aug 16 11:10:10 hostname mpd[19312]: mpd caught signal 15
Aug 16 11:10:10 hostname systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Aug 16 11:10:10 hostname mpd[19312]: failed to select: Interrupted system call
Aug 16 11:10:10 hostname mpd[19312]: aws mpd plugin fini called
Aug 16 11:10:10 hostname mpd[19312]: ended
Aug 16 11:10:10 hostname systemd[1]: Stopped Xilinx Management Proxy Daemon (MPD).
-- Reboot --
Aug 16 20:44:08 hostname systemd[1]: Started Xilinx Management Proxy Daemon (MPD).
Aug 16 20:44:08 hostname mpd[932]: started
Aug 16 20:44:08 hostname mpd[932]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Aug 16 20:44:08 hostname mpd[932]: aws: load default afi to 0000:00:1d.0
Aug 16 20:48:34 hostname mpd[932]: mpd caught signal 15
Aug 16 20:48:34 hostname systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Aug 16 20:50:04 hostname systemd[1]: mpd.service: State 'stop-sigterm' timed out. Killing.
Aug 16 20:50:04 hostname systemd[1]: mpd.service: Killing process 932 (mpd) with signal SIGKILL.
Aug 16 20:50:04 hostname systemd[1]: mpd.service: Main process exited, code=killed, status=9/KILL
Aug 16 20:50:04 hostname systemd[1]: mpd.service: Failed with result 'timeout'.
Aug 16 20:50:04 hostname systemd[1]: Stopped Xilinx Management Proxy Daemon (MPD).
answered 3 years ago

Ok I think I found the issue. The default AFI was unavailable in eu-west-2.
Can you retry mpd:

sudo systemctl restart mpd

You shouldn't see the signal 15 this time around.

-Deep

Deep_P
answered 3 years ago

I tried restarting mpd after turning my instance back on and this was the result from systemctl status mpd:

Loaded: loaded (/etc/systemd/system/mpd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Wed 2021-08-18 09:17:50 UTC; 3min 19s ago
Condition: start condition failed at Wed 2021-08-18 09:17:50 UTC; 3min 19s ago
└─ ConditionDirectoryNotEmpty=/dev/xfpga was not met
Process: 989 ExecStart=/opt/xilinx/xrt/bin/mpd (code=killed, signal=KILL)
Main PID: 989 (code=killed, signal=KILL)

Aug 18 09:14:35 hostname mpd[989]: started
Aug 18 09:14:35 hostname mpd[989]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Aug 18 09:14:35 hostname mpd[989]: aws: load default afi to 0000:00:1d.0
Aug 18 09:16:20 hostname mpd[989]: mpd caught signal 15
Aug 18 09:16:20 hostname systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Aug 18 09:17:50 hostname systemd[1]: mpd.service: State 'stop-sigterm' timed out. Killing.
Aug 18 09:17:50 hostname systemd[1]: mpd.service: Killing process 989 (mpd) with signal SIGKILL.
Aug 18 09:17:50 hostname systemd[1]: mpd.service: Main process exited, code=killed, status=9/KILL
Aug 18 09:17:50 hostname systemd[1]: mpd.service: Failed with result 'timeout'.
Aug 18 09:17:50 hostname systemd[1]: Stopped Xilinx Management Proxy Daemon (MPD).

If I reboot the instance though, then mpd starts up fine (and if I try to restart it, then it breaks again).
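
To make the two symptoms in that status output easier to check repeatedly, here is a small sketch. The helper names are made up for illustration. dir_not_empty mirrors the ConditionDirectoryNotEmpty=/dev/xfpga start condition shown above, and count_stop_timeouts counts how often systemd had to kill mpd after the stop timeout. Neither explains why the restart hangs; they only make the state quicker to inspect.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the path is a directory that contains
# at least one entry, like systemd's ConditionDirectoryNotEmpty= check.
dir_not_empty() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Hypothetical helper: reads journal text on stdin and prints how many times
# the mpd unit ran into the stop timeout.
count_stop_timeouts() {
  grep -c "mpd.service: State 'stop-sigterm' timed out"
}

# Example usage on the instance:
#   dir_not_empty /dev/xfpga && echo "start condition met" || echo "start condition not met"
#   journalctl -u mpd --no-pager | count_stop_timeouts
```

Running both right after a reboot and again after a failed restart would show whether /dev/xfpga becomes empty once mpd has been killed.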

answered 3 years ago
