I have now managed to fix the problem. After looking through the available AFIs with the aws ec2 describe-fpga-images command, I found one named "dpdpuv3_wrapper.hw.xilinx_aws-vu9p-f1_shell-v04261818_201920_2" with global ID agfi-0e168992b12da45f9, which seems to work well.
So I suppose the main issue was that no AFI was loaded on startup. I don't know whether that should be expected or not.
Also worth noting: I found an older AFI with the same name but a different global ID, and it did not return reasonable results when running some of the Vitis-AI examples. That is worth keeping in mind for anyone getting nonsensical results.
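Since two AFIs can share a name, it may help to list every match and compare creation times. The following is a minimal sketch, assuming the JSON printed by aws ec2 describe-fpga-images --output json has been loaded into a dict. The helper name find_afis is hypothetical, and the older entry and both timestamps in the sample are placeholders rather than real data.

```python
def find_afis(describe_output: dict, name: str) -> list:
    """Return the available AFIs with the given Name, newest first."""
    matches = [
        image
        for image in describe_output.get("FpgaImages", [])
        if image.get("Name") == name
        and image.get("State", {}).get("Code") == "available"
    ]
    # CreateTime is an ISO 8601 string in the CLI output, so sorting the
    # strings sorts chronologically as long as the format is uniform.
    return sorted(matches, key=lambda image: image.get("CreateTime", ""), reverse=True)


NAME = "dpdpuv3_wrapper.hw.xilinx_aws-vu9p-f1_shell-v04261818_201920_2"

# Placeholder sample: only the first global ID comes from this thread.
# The second ID and both timestamps are invented for illustration.
SAMPLE = {
    "FpgaImages": [
        {
            "Name": NAME,
            "FpgaImageGlobalId": "agfi-00000000000000000",
            "State": {"Code": "available"},
            "CreateTime": "2020-01-01T00:00:00.000Z",
        },
        {
            "Name": NAME,
            "FpgaImageGlobalId": "agfi-0e168992b12da45f9",
            "State": {"Code": "available"},
            "CreateTime": "2021-01-01T00:00:00.000Z",
        },
    ]
}

for afi in find_afis(SAMPLE, NAME):
    print(afi["FpgaImageGlobalId"], afi["CreateTime"])
```

With real output, the dict would come from json.loads on the command's JSON. Picking the newest match is only a heuristic; it is not guaranteed to be the working image.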
Thanks for the help Deep!
Hello,
The Alveo DDR-supported DPUs should work on the VU9P on F1.
With AWS F1, there is a concept of AFIs and awsxclbins, as opposed to the bitstreams/xclbins you are used to on other platforms. An awsxclbin also carries AFI metadata that lets xclmgmt load the AFI onto the FPGA. So, to answer your question, you need XRT and XRT-AWS built, installed, and working to use F1 instances with Vitis AI.
Now, on to debugging your issue: did you build XRT on the same Ubuntu host? Can you share the region in which you launched the F1 instance? What does journalctl -u mpd say? The answers would let me help you further in getting things running.
-Deep
Hi Deep, thanks for the help!
Yes, I built XRT on the same Ubuntu host. After building and installing the relevant .deb package, I get the following:
ubuntu@host:~/XRT/build$ sudo apt install ./Release/*-xrt.deb
Reading package lists... Done
Building dependency tree
Reading state information... Done
Note, selecting 'xrt' instead of './Release/xrt_202110.2.11.0_18.04-amd64-xrt.deb'
The following NEW packages will be installed:
xrt
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/12.4 MB of archives.
After this operation, 64.3 MB of additional disk space will be used.
Get:1 /home/ubuntu/XRT/build/Release/xrt_202110.2.11.0_18.04-amd64-xrt.deb xrt amd64 2.11.0 [12.4 MB]
Selecting previously unselected package xrt.
(Reading database ... 103036 files and directories currently installed.)
Preparing to unpack .../xrt_202110.2.11.0_18.04-amd64-xrt.deb ...
Unpacking xrt (2.11.0) ...
Setting up xrt (2.11.0) ...
Unloading old XRT Linux kernel modules
rmmod: ERROR: Module xocl is not currently loaded
rmmod: ERROR: Module xclmgmt is not currently loaded
Invoking DKMS common.postinst for xrt
Loading new xrt-2.11.0 DKMS files...
Building for 5.4.0-1054-aws
Building initial module for 5.4.0-1054-aws
Done.
xocl:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.4.0-1054-aws/updates/dkms/
xclmgmt.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.4.0-1054-aws/updates/dkms/
depmod....
DKMS: install completed.
Finished DKMS common.postinst
Loading new XRT Linux kernel modules
modprobe: ERROR: could not insert 'xclmgmt': Unknown symbol in module, or unknown parameter (see dmesg)
Installing MSD / MPD daemons
ubuntu@host:~/XRT/build$ dmesg | grep -e xclmgmt
[ 1491.933240] xclmgmt: loading out-of-tree module taints kernel.
[ 1491.933862] xclmgmt: module verification failed: signature and/or required key missing - tainting kernel
[ 1491.933980] xclmgmt: Unknown symbol fpga_mgr_create (err -2)
[ 1491.934125] xclmgmt: Unknown symbol fpga_mgr_unregister (err -2)
[ 1491.934195] xclmgmt: Unknown symbol fpga_mgr_register (err -2)
[ 1491.934218] xclmgmt: Unknown symbol fpga_mgr_free (err -2)
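All four unresolved symbols carry the fpga_mgr_ prefix, which belongs to the Linux kernel's FPGA manager framework. That suggests, though it does not prove, that the framework's module was not available to this kernel when xclmgmt was inserted. As a minimal sketch, the missing symbols can be pulled out of dmesg text per module. The helper name missing_symbols is hypothetical, and the regular expression assumes the default dmesg timestamp format shown above.

```python
import re

# Matches lines such as:
# [ 1491.933980] xclmgmt: Unknown symbol fpga_mgr_create (err -2)
UNKNOWN_SYMBOL = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\w+): Unknown symbol (\w+) \(err (-?\d+)\)"
)


def missing_symbols(dmesg_text: str) -> dict:
    """Map each module name to the symbols it failed to resolve, in log order."""
    missing = {}
    for line in dmesg_text.splitlines():
        match = UNKNOWN_SYMBOL.match(line.strip())
        if match:
            module, symbol, _err = match.groups()
            missing.setdefault(module, []).append(symbol)
    return missing


DMESG_SAMPLE = """\
[ 1491.933240] xclmgmt: loading out-of-tree module taints kernel.
[ 1491.933980] xclmgmt: Unknown symbol fpga_mgr_create (err -2)
[ 1491.934125] xclmgmt: Unknown symbol fpga_mgr_unregister (err -2)
[ 1491.934195] xclmgmt: Unknown symbol fpga_mgr_register (err -2)
[ 1491.934218] xclmgmt: Unknown symbol fpga_mgr_free (err -2)
"""

for module, symbols in missing_symbols(DMESG_SAMPLE).items():
    print(module, "is missing:", ", ".join(symbols))
```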
The region the instance was launched in is eu-west-2.
After finishing going through the guide I mentioned before, running journalctl -u mpd outputs:
ubuntu@host:~$ journalctl -u mpd
-- Logs begin at Wed 2021-08-11 17:54:31 UTC, end at Mon 2021-08-16 10:00:36 UTC. --
Aug 16 09:56:12 ip-172-31-154-75 systemd[1]: Started Xilinx Management Proxy Daemon (MPD).
Aug 16 09:56:12 ip-172-31-154-75 mpd[19312]: started
Aug 16 09:56:12 ip-172-31-154-75 mpd[19312]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Aug 16 09:56:12 ip-172-31-154-75 mpd[19312]: aws: load default afi to 0000:00:1d.0
Hi, one obvious thing I don't see here is the installation of xrt-aws.deb. Could you try installing both the xrt and xrt-aws debs and then restarting mpd?
-Deep
Sorry I didn't mention it, but I have also installed xrt-aws, as per the guide:
ubuntu@hostname:~/XRT/build/Release$ sudo apt install ./xrt_202110.2.11.0_18.04-amd64-aws.deb
Reading package lists... Done
Building dependency tree
Reading state information... Done
Note, selecting 'xrt-aws' instead of './xrt_202110.2.11.0_18.04-amd64-aws.deb'
xrt-aws is already the newest version (2.11.0).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
I don't seem to be able to start the mpd service again, though. I turned off the instance while I was away, and now that I've come back to it the mpd service is not starting. Here is the end of the output of journalctl -u mpd (I have a file with the rest, but it's around 380 lines in total):
Aug 16 11:10:10 hostname mpd[19312]: udev msg arrived on fd 4
Aug 16 11:10:10 hostname mpd[19312]: udev msg arrived on fd 4
Aug 16 11:10:10 hostname mpd[19312]: udev msg arrived on fd 4
Aug 16 11:10:10 hostname mpd[19312]: mpd caught signal 15
Aug 16 11:10:10 hostname systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Aug 16 11:10:10 hostname mpd[19312]: failed to select: Interrupted system call
Aug 16 11:10:10 hostname mpd[19312]: aws mpd plugin fini called
Aug 16 11:10:10 hostname mpd[19312]: ended
Aug 16 11:10:10 hostname systemd[1]: Stopped Xilinx Management Proxy Daemon (MPD).
-- Reboot --
Aug 16 20:44:08 hostname systemd[1]: Started Xilinx Management Proxy Daemon (MPD).
Aug 16 20:44:08 hostname mpd[932]: started
Aug 16 20:44:08 hostname mpd[932]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Aug 16 20:44:08 hostname mpd[932]: aws: load default afi to 0000:00:1d.0
Aug 16 20:48:34 hostname mpd[932]: mpd caught signal 15
Aug 16 20:48:34 hostname systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Aug 16 20:50:04 hostname systemd[1]: mpd.service: State 'stop-sigterm' timed out. Killing.
Aug 16 20:50:04 hostname systemd[1]: mpd.service: Killing process 932 (mpd) with signal SIGKILL.
Aug 16 20:50:04 hostname systemd[1]: mpd.service: Main process exited, code=killed, status=9/KILL
Aug 16 20:50:04 hostname systemd[1]: mpd.service: Failed with result 'timeout'.
Aug 16 20:50:04 hostname systemd[1]: Stopped Xilinx Management Proxy Daemon (MPD).
Ok I think I found the issue. The default AFI was unavailable in eu-west-2.
Can you retry mpd:
sudo systemctl restart mpd
You shouldn't see the signal 15 this time around.
-Deep
I tried restarting mpd after turning my instance back on and this was the result from systemctl status mpd:
Loaded: loaded (/etc/systemd/system/mpd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Wed 2021-08-18 09:17:50 UTC; 3min 19s ago
Condition: start condition failed at Wed 2021-08-18 09:17:50 UTC; 3min 19s ago
└─ ConditionDirectoryNotEmpty=/dev/xfpga was not met
Process: 989 ExecStart=/opt/xilinx/xrt/bin/mpd (code=killed, signal=KILL)
Main PID: 989 (code=killed, signal=KILL)
Aug 18 09:14:35 hostname mpd[989]: started
Aug 18 09:14:35 hostname mpd[989]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Aug 18 09:14:35 hostname mpd[989]: aws: load default afi to 0000:00:1d.0
Aug 18 09:16:20 hostname mpd[989]: mpd caught signal 15
Aug 18 09:16:20 hostname systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Aug 18 09:17:50 hostname systemd[1]: mpd.service: State 'stop-sigterm' timed out. Killing.
Aug 18 09:17:50 hostname systemd[1]: mpd.service: Killing process 989 (mpd) with signal SIGKILL.
Aug 18 09:17:50 hostname systemd[1]: mpd.service: Main process exited, code=killed, status=9/KILL
Aug 18 09:17:50 hostname systemd[1]: mpd.service: Failed with result 'timeout'.
Aug 18 09:17:50 hostname systemd[1]: Stopped Xilinx Management Proxy Daemon (MPD).
If I reboot the instance though, then mpd starts up fine (and if I try to restart it, then it breaks again).
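One detail in these logs may be worth checking: the gap between "mpd caught signal 15" and the SIGKILL. The sketch below measures that gap from journal text. It assumes the default journalctl short timestamp format with English month names, and the helper name stop_duration_seconds is hypothetical. The year is passed in separately because the short format omits it.

```python
import re
from datetime import datetime

# Leading timestamp in journalctl's short format, e.g. "Aug 18 09:16:20".
STAMP = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) ")


def stop_duration_seconds(journal_text: str, year: int = 2021):
    """Seconds between mpd receiving SIGTERM and systemd sending SIGKILL.

    Returns None if either event is missing from the text.
    """
    sigterm_at = None
    killed_at = None
    for line in journal_text.splitlines():
        match = STAMP.match(line)
        if not match:
            continue
        when = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if "caught signal 15" in line:
            sigterm_at = when
        elif "Killing process" in line and "SIGKILL" in line:
            killed_at = when
    if sigterm_at is None or killed_at is None:
        return None
    return (killed_at - sigterm_at).total_seconds()


JOURNAL_SAMPLE = """\
Aug 18 09:16:20 hostname mpd[989]: mpd caught signal 15
Aug 18 09:16:20 hostname systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Aug 18 09:17:50 hostname systemd[1]: mpd.service: State 'stop-sigterm' timed out. Killing.
Aug 18 09:17:50 hostname systemd[1]: mpd.service: Killing process 989 (mpd) with signal SIGKILL.
"""

print(stop_duration_seconds(JOURNAL_SAMPLE))
```

A gap of 90 seconds would match systemd's default stop timeout, which would be consistent with mpd hanging on shutdown while the default AFI load is still pending rather than crashing outright. That reading is an inference from the logs, not something confirmed in this thread.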
Hi brunopaiva,
I also intend to run Vitis-AI examples on AWS F1 instances. So far we have followed the vector addition example (https://github.com/aws/aws-fpga/tree/master/Vitis) and, separately, tried to quantize and compile an example model using Vitis-AI. However, I am unsure how to deploy the compiled model on an F1 instance.
Can you please provide some guidance regarding this? Thanks in advance!