FPGA appears as not ready after first boot of AMI on AWS F1 instance

Hi there, I am having an issue with the latest GitHub release (RC_v1_4_12) of the aws-fpga software.
My goal is to create an AMI that includes Xilinx XRT from the GitHub branch 2019.2 and aws-fpga version RC_v1_4_12, so that
I can run instances on AWS F1.
To do that (using f1.2xlarge):

  1. I base my new AMI on ami-03746875d916becc0 (ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190628)
  2. I install all the requirements for the Xilinx XRT and aws-fpga .deb packages
  3. Install the Xilinx XRT and aws-fpga .deb packages

*** At that point, the tool "xbutil scan" lists the FPGA as ready to be used ***

  4. I create the AMI using the AWS CLI: "aws ec2 create-image"

So far, so good, everything works and I can see my new AMI in the list of 'Owned by me' AMIs.

The problem starts when I use this new AMI to launch an f1.2xlarge instance:
The instance boots with no issues, but the tool "xbutil scan" lists the FPGA *** as not ready ***.
At this point I cannot program/use the FPGA.

After trying various tricks, I found that the solution was to restart the Xilinx MPD service, which comes
with the aws-fpga package (re-installing the aws-fpga .deb or simply rebooting the F1 instance also works).
After that the FPGA is ready to be used.
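
The restart workaround described above could be automated along these lines. This is only a minimal sketch, not an official fix. It assumes that `xbutil` is on the PATH, that `xbutil scan` prints the "are not ready" warning quoted later in this post whenever a card is unusable, and that the systemd unit is called `mpd` (as the journal output suggests). The function names and the `run` parameter are hypothetical and exist only for illustration.

```python
import subprocess
from typing import Callable, List

# Assumption: this substring of the xbutil warning quoted in this post
# is a usable marker for a card that is not ready.
NOT_READY_MARKER = "are not ready"


def run_command(argv: List[str]) -> str:
    """Run a command and return its combined stdout/stderr as text."""
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout


def fpga_not_ready(scan_output: str) -> bool:
    """Return True if the scan output carries the not-ready warning."""
    return NOT_READY_MARKER in scan_output


def restart_mpd_if_needed(run: Callable[[List[str]], str] = run_command) -> bool:
    """Restart mpd once if `xbutil scan` reports a card as not ready.

    Returns True if a restart was issued, False otherwise.
    """
    if not fpga_not_ready(run(["xbutil", "scan"])):
        return False
    run(["sudo", "systemctl", "restart", "mpd"])
    return True
```

Calling `restart_mpd_if_needed()` once after boot would issue the restart only when the warning is present. Passing a custom `run` callable makes it possible to dry-run the logic without touching the system.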

Is this MPD behavior a known issue, and is a fix planned for the next versions of aws-fpga?
Is there something wrong with the steps I follow to create my new AMI?
This is the info that I get when I launch the new AMI:

System INFO:

System Configuration
OS name: Linux
Release: 4.4.0-1098-aws
Version: #109-Ubuntu SMP Fri Nov 8 09:30:18 UTC 2019
Machine: x86_64
Glibc: 2.23
Distribution: Ubuntu 16.04.6 LTS
XRT Information
Version: 2.3.0
Git Hash: ed096afaa193de4c8b0176e9245992cc3a3e7296
Git Branch: 2019.2
Build Date: 2019-12-20 10:20:49
XOCL: 2.3.0,ed096afaa193de4c8b0176e9245992cc3a3e7296
XCLMGMT: unknown

MPD log as reported by the journalctl:

a) During the AMI launch:

Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: started
Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: aws mpd plugin init called: 0
Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] write 56 bytes out of 56 bytes to fd 4
Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] msg arrived on mailbox fd 4
Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] retrieved msg size from mailbox: 40 bytes
Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] read 72 bytes out of 72 bytes from fd 4, valid: 1
Jan 08 07:18:20 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] mpd daemon: request 11 received(reqSize: 24)

--> At this point xbutil scan reports: <----
[0] 0000:00:1d.0 xilinx_aws-vu9p-f1_dynamic_5_0(ts=0xabcd) user(inst=128)
WARNING: card(s) marked by '*' are not ready, run xbmgmt flash --scan --verbose to further check the details.

b) Restart the mpd.service - Xilinx Management Proxy Daemon (MPD):

Jan 08 07:18:24 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] msg arrived on mailbox fd 4
Jan 08 07:18:24 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] retrieved msg size from mailbox: 32 bytes
Jan 08 07:18:24 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] read 64 bytes out of 64 bytes from fd 4, valid: 1
Jan 08 07:24:35 ip-172-31-32-66 mpd[1247]: mpd caught signal 15
Jan 08 07:24:35 ip-172-31-32-66 systemd[1]: Stopping Xilinx Management Proxy Daemon (MPD)...
Jan 08 07:24:37 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] write 56 bytes out of 56 bytes to fd 4
Jan 08 07:24:37 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] mpd_getMsg thread 0 exit!!
Jan 08 07:24:43 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] write 0 bytes out of 2104 bytes to fd 4
Jan 08 07:24:43 ip-172-31-32-66 mpd[1247]: [0:0:1d.0] mpd_handleMsg thread 0 exit!!
Jan 08 07:24:43 ip-172-31-32-66 mpd[1247]: aws mpd plugin fini called
Jan 08 07:24:43 ip-172-31-32-66 mpd[1247]: ended
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: started
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: found mpd plugin: /opt/xilinx/xrt/lib/libmpd_plugin.so
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: aws mpd plugin init called: 0
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] write 56 bytes out of 56 bytes to fd 4
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] msg arrived on mailbox fd 4
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] retrieved msg size from mailbox: 40 bytes
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] read 72 bytes out of 72 bytes from fd 4, valid: 1
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] mpd daemon: request 11 received(reqSize: 24)
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] write 2104 bytes out of 2104 bytes to fd 4
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] msg arrived on mailbox fd 4
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] retrieved msg size from mailbox: 48 bytes
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] read 80 bytes out of 80 bytes from fd 4, valid: 1
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] mpd daemon: request 10 received(reqSize: 32)
Jan 08 07:24:43 ip-172-31-32-66 mpd[2526]: [0:0:1d.0] write 524360 bytes out of 524360 bytes to fd 4

--> At this point xbutil scan reports: <----
[0] 0000:00:1d.0 xilinx_aws-vu9p-f1_dynamic_5_0(ts=0xabcd) user(inst=128)
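
In the two journal excerpts above, the only transfer that did not complete is the `write 0 bytes out of 2104 bytes` entry logged while the first mpd process (PID 1247) was shutting down. After the restart, the same 2104-byte write succeeds. A small sketch for spotting such lines in `journalctl -u mpd` output follows. It assumes the log format is exactly as quoted here; the `Transfer` type and the function name are hypothetical.

```python
import re
from typing import List, NamedTuple


class Transfer(NamedTuple):
    pid: int        # mpd process ID from the journal prefix
    op: str         # "read" or "write"
    done: int       # bytes actually transferred
    expected: int   # bytes that should have been transferred


# Matches e.g. "mpd[1247]: [0:0:1d.0] write 0 bytes out of 2104 bytes to fd 4"
_TRANSFER_RE = re.compile(
    r"mpd\[(\d+)\]: .*?\b(read|write) (\d+) bytes out of (\d+) bytes"
)


def incomplete_transfers(journal_text: str) -> List[Transfer]:
    """Return every mailbox read/write that moved fewer bytes than expected."""
    found = []
    for line in journal_text.splitlines():
        match = _TRANSFER_RE.search(line)
        if match is None:
            continue
        transfer = Transfer(
            pid=int(match.group(1)),
            op=match.group(2),
            done=int(match.group(3)),
            expected=int(match.group(4)),
        )
        if transfer.done < transfer.expected:
            found.append(transfer)
    return found
```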

asked 4 years ago, 391 views
1 Answer
Accepted Answer

Hello,

We are currently working towards officially supporting the Xilinx 2019.2 toolset (XRT, Vivado, Vitis).
We are aware of the issue with mpd and are working with Xilinx on a workaround.

You do not need to remove the packages in order to get it to work.

To use the AWS FPGA management library with 2019.2 XRT, you need to run the following steps:

  1. Stop mpd first:
       sudo systemctl stop mpd
  2. Load your image using the AWS F1 SDK:
       fpga-load-local-image -S<slot number> -I <AGFI>
  3. Start mpd:
       sudo systemctl start mpd
  4. The card should be usable now until you have to clear the FPGA image. You should be able to use xbutil to program any other AFI now.
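
The sequence above can be wrapped in a short script. The following is a sketch under a few assumptions rather than an official tool: the function names are hypothetical, the `try`/`finally` that restarts mpd even when the load fails is a design choice that is not part of the steps above, and `fpga-load-local-image` is invoked without `sudo` as written above (prefix it if your setup requires root).

```python
import subprocess
from typing import Callable, List


def workaround_commands(slot: int, agfi: str) -> List[List[str]]:
    """Build the three commands of the workaround, in order."""
    return [
        ["sudo", "systemctl", "stop", "mpd"],
        # Shown without sudo, as in the steps above.
        ["fpga-load-local-image", "-S", str(slot), "-I", agfi],
        ["sudo", "systemctl", "start", "mpd"],
    ]


def load_afi_with_mpd_stopped(
    slot: int,
    agfi: str,
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> None:
    """Stop mpd, load the AFI with the AWS tool, then start mpd again."""
    stop, load, start = workaround_commands(slot, agfi)
    run(stop)
    try:
        run(load)
    finally:
        # Assumption: mpd should come back even if the load fails.
        run(start)
```

For example, `load_afi_with_mpd_stopped(0, "<AGFI>")` would run the three commands for slot 0, with `<AGFI>` replaced by your own image ID.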

AWS uses different PCI Device IDs to differentiate between a loaded AFI and a cleared slot. The bug is contention between mpd and the AWS management library when the Device IDs have to be refreshed.
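
To observe the Device ID change described here, one option is to read the IDs from Linux sysfs before and after loading or clearing an AFI. This is only an illustrative sketch: it relies on the standard `vendor` and `device` files under `/sys/bus/pci/devices/<BDF>/`, the function names are hypothetical, and it makes no claim about which specific IDs AWS uses. The BDF seen in the question (`0000:00:1d.0`) is just one example and may differ on other instances.

```python
import os
from typing import Tuple


def read_pci_ids(
    bdf: str, sysfs_root: str = "/sys/bus/pci/devices"
) -> Tuple[int, int]:
    """Return (vendor_id, device_id) for a PCI function, read from sysfs."""
    dev_dir = os.path.join(sysfs_root, bdf)
    ids = []
    for name in ("vendor", "device"):
        # Each file holds one hex value such as "0x1234" plus a newline.
        with open(os.path.join(dev_dir, name)) as handle:
            ids.append(int(handle.read().strip(), 16))
    return ids[0], ids[1]


def device_id_changed(before: Tuple[int, int], after: Tuple[int, int]) -> bool:
    """Return True if the device ID differs between two snapshots."""
    return before[1] != after[1]
```

Taking one snapshot with `read_pci_ids("0000:00:1d.0")` before a load and another afterwards, then comparing them with `device_id_changed`, shows whether the ID was refreshed.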

I apologize for this churn and we will update here once we have a fix for this. Until then, please let us know if this workaround does not work for you.

-Deep

Deep_P
answered 4 years ago
