[EC2 FPGA] XDMA transfers fail on f1.16xlarge

Hello, I've been using EC2 FPGAs for about 15 months now. I've always used f1.2xlarge instances running Ubuntu, and they worked as expected. Now, because of the amount of CPU-intensive work I need to do, I've decided to try the larger f1.16xlarge. However, I ran into problems. I followed all the usual steps: loaded the AGFI, checked via lspci that the device is available, and then ran some simple XDMA read/write tests to make sure the connection is still there. Sadly, I get no communication with the PCIe FPGA board. Below is the dmesg output, which reports a "magic" error in the descriptor. Again, I'm using the same driver, the same AGFI, and the same Python wrappers around the C code that calls into the kernel driver.

[ 1683.589120] xdma:engine_service_final_transfer: engine 0-H2C0-MM, status error 0x80010.
[ 1683.589123] xdma:engine_status_dump: SG engine 0-H2C0-MM status: 0x00080010: MAGIC_STOPPED,DESC_ERR:UNSUPP_REQ
[ 1683.589126] 0-H2C0-MM, s 0x80010, aborted xfer 0x00000000e19d64e9, cmpl 0/1
[ 1683.589136] xdma:xdma_xfer_submit: xfer 0x00000000e19d64e9,1024, failed, ep 0x0.
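
For readers who want to check the status word by hand, here is a minimal sketch of how 0x00080010 breaks down into the two flags the driver prints. The bit positions (4 and 19) are an assumption inferred from the names in the engine_status_dump line above; only those two bits are listed, and `decode_status` is a hypothetical helper, not part of the XDMA driver.

```python
# Bit positions are an assumption inferred from the engine_status_dump line
# in the log; only the two flags that appear there are listed.
STATUS_BITS = {
    4: "MAGIC_STOPPED",
    19: "DESC_ERR:UNSUPP_REQ",
}


def decode_status(status):
    """Return the names of the known flags set in an engine status word."""
    names = [name for bit, name in sorted(STATUS_BITS.items())
             if status & (1 << bit)]
    # Report any bits this sketch does not know about instead of hiding them.
    known_mask = sum(1 << bit for bit in STATUS_BITS)
    unknown = status & ~known_mask
    if unknown:
        names.append(f"UNKNOWN:0x{unknown:x}")
    return names


print(decode_status(0x00080010))
```

Any other set bit is reported as UNKNOWN rather than guessed at.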

EDIT: I've figured out the problem. After looking more closely at the dmesg output, I saw that the AGFI had been loaded on a different FPGA slot. I loaded the AGFI as I always do: sudo fpga-load-local-image -S 0 -I $MY_AGFI_ID -H. I don't see how it could have ended up on Slot 8. When I adjust my test to run on Slot 8, everything works as expected! To be honest, dmesg shows this quite plainly:

[  183.021774] xdma:remove_one: pdev 0x00000000999de0ae, xdev 0x0000000076ff6236, 0x00000000ca7f10f1.
[  183.021777] xdma:xpdev_free: xpdev 0x0000000076ff6236, destroy_interfaces, xdev 0x00000000ca7f10f1.
[  183.024133] xdma:xpdev_free: xpdev 0x0000000076ff6236, xdev 0x00000000ca7f10f1 xdma_device_close.
[  186.066065] pci 0000:00:0f.0: [1d0f:f000] type 00 class 0x058000
[  186.066817] pci 0000:00:0f.0: reg 0x10: [mem 0x86000000-0x87ffffff]
[  186.067206] pci 0000:00:0f.0: reg 0x14: [mem 0x85200000-0x853fffff]
[  186.067855] pci 0000:00:0f.0: reg 0x18: [mem 0x5e000410000-0x5e00041ffff 64bit pref]
[  186.068493] pci 0000:00:0f.0: reg 0x20: [mem 0x5c000000000-0x5dfffffffff 64bit pref]
[  186.083784] pci 0000:00:0f.0: BAR 4: assigned [mem 0x5c000000000-0x5dfffffffff 64bit pref]
[  186.084214] pci 0000:00:0f.0: BAR 0: assigned [mem 0x86000000-0x87ffffff]
[  186.084317] pci 0000:00:0f.0: BAR 1: assigned [mem 0x85200000-0x853fffff]
[  186.084421] pci 0000:00:0f.0: BAR 2: assigned [mem 0x5e000410000-0x5e00041ffff 64bit pref]
[  186.084996] xdma:xdma_device_open: xdma device 0000:00:0f.0, 0x000000006c4610d7.
[  186.086074] xdma:map_single_bar: BAR0 at 0x86000000 mapped at 0x00000000f2cc5fa3, length=33554432(/33554432)
[  186.086088] xdma:map_single_bar: BAR1 at 0x85200000 mapped at 0x00000000caa22a31, length=2097152(/2097152)
[  186.086106] xdma:map_single_bar: BAR2 at 0x5e000410000 mapped at 0x000000003049f8eb, length=65536(/65536)
[  186.086109] xdma:map_bars: config bar 2, pos 2.
[  186.086110] xdma:map_single_bar: Limit BAR 4 mapping from 137438953472 to 2147483647 bytes
[  186.086115] xdma:map_single_bar: BAR4 at 0x5c000000000 mapped at 0x00000000bfd17001, length=2147483647(/137438953472)
[  186.086116] xdma:identify_bars: 4 BARs: config 2, user 0, bypass 4.
[  186.095983] xdma:pci_keep_intx_enabled: 0000:00:0f.0: clear INTX_DISABLE, 0x406 -> 0x6.
[  186.096158] xdma:irq_msix_channel_setup: engine 8-H2C0-MM, irq#572.
[  186.096193] xdma:irq_msix_channel_setup: engine 8-H2C1-MM, irq#573.
[  186.096225] xdma:irq_msix_channel_setup: engine 8-H2C2-MM, irq#574.
[  186.096270] xdma:irq_msix_channel_setup: engine 8-H2C3-MM, irq#575.
[  186.096301] xdma:irq_msix_channel_setup: engine 8-C2H0-MM, irq#576.
[  186.096334] xdma:irq_msix_channel_setup: engine 8-C2H1-MM, irq#577.
[  186.096366] xdma:irq_msix_channel_setup: engine 8-C2H2-MM, irq#578.
[  186.096397] xdma:irq_msix_channel_setup: engine 8-C2H3-MM, irq#579.
[  186.096431] xdma:irq_msix_user_setup: 8-USR-0, IRQ#580 with 0x000000000932b671
[  186.096463] xdma:irq_msix_user_setup: 8-USR-1, IRQ#581 with 0x000000005edcc121
[  186.096511] xdma:irq_msix_user_setup: 8-USR-2, IRQ#582 with 0x00000000249674d9
[  186.096560] xdma:irq_msix_user_setup: 8-USR-3, IRQ#583 with 0x00000000d26d07c5
[  186.096594] xdma:irq_msix_user_setup: 8-USR-4, IRQ#584 with 0x00000000c940ac79
[  186.096627] xdma:irq_msix_user_setup: 8-USR-5, IRQ#585 with 0x000000001fccab2f
[  186.096666] xdma:irq_msix_user_setup: 8-USR-6, IRQ#586 with 0x0000000009c457eb
[  186.096699] xdma:irq_msix_user_setup: 8-USR-7, IRQ#587 with 0x000000002bedefd1
[  186.096732] xdma:irq_msix_user_setup: 8-USR-8, IRQ#588 with 0x000000004ca712de
[  186.096765] xdma:irq_msix_user_setup: 8-USR-9, IRQ#589 with 0x00000000e191ad7b
[  186.096799] xdma:irq_msix_user_setup: 8-USR-10, IRQ#590 with 0x00000000026a9f8b
[  186.096833] xdma:irq_msix_user_setup: 8-USR-11, IRQ#591 with 0x00000000a7138ee8
[  186.096868] xdma:irq_msix_user_setup: 8-USR-12, IRQ#592 with 0x00000000b0c4b138
[  186.096902] xdma:irq_msix_user_setup: 8-USR-13, IRQ#593 with 0x000000007f7aa664
[  186.096934] xdma:irq_msix_user_setup: 8-USR-14, IRQ#594 with 0x0000000070f6c0f6
[  186.096970] xdma:irq_msix_user_setup: 8-USR-15, IRQ#595 with 0x000000009aed6be9
[  186.096978] xdma:probe_one: 0000:00:0f.0 xdma8, pdev 0x000000006c4610d7, xdev 0x0000000044888d47, 0x0000000058d876e9, usr 16, ch 4,4.

Is this a bug of some kind? How can I protect myself against it?

Thank you in advance.

asked 2 years ago, 308 views
1 Answer
This seems odd, since the slots are numbered 0 through 7 to account for 8 slots, so the xdma8 that you are pointing to might be a red herring. Have you looked at how we get the slot-to-xdma mapping in our cl_dram_dma example? https://github.com/aws/aws-fpga/blob/master/hdk/cl/examples/cl_dram_dma/software/runtime/test_dram_dma.c#L185

That should basically let you use slots instead of trying to attach to xdma devices directly.

To prove that the xdma8 in your dmesg is a red herring, here is what I tried:

  1. I started with cleared slots:
sudo fpga-describe-local-image-slots
AFIDEVICE    0       0x1d0f      0x1042      0000:00:0f.0
AFIDEVICE    1       0x1d0f      0x1042      0000:00:11.0
AFIDEVICE    2       0x1d0f      0x1042      0000:00:13.0
AFIDEVICE    3       0x1d0f      0x1042      0000:00:15.0
AFIDEVICE    4       0x1d0f      0x1042      0000:00:17.0
AFIDEVICE    5       0x1d0f      0x1042      0000:00:19.0
AFIDEVICE    6       0x1d0f      0x1042      0000:00:1b.0
AFIDEVICE    7       0x1d0f      0x1042      0000:00:1d.0
  2. Installed XDMA
  3. Loaded the prebuilt cl_dram_dma AFI on Slot 0:
sudo fpga-load-local-image -S0 -I agfi-0b5c35827af676702
AFI          0       agfi-0b5c35827af676702  loaded            0        ok               0       0x04261818
AFIDEVICE    0       0x1d0f      0xf001      0000:00:0f.0
  4. Slot 0 shows the AFI's new device ID:
sudo fpga-describe-local-image-slots
AFIDEVICE    0       0x1d0f      0xf001      0000:00:0f.0
AFIDEVICE    1       0x1d0f      0x1042      0000:00:11.0
AFIDEVICE    2       0x1d0f      0x1042      0000:00:13.0
AFIDEVICE    3       0x1d0f      0x1042      0000:00:15.0
AFIDEVICE    4       0x1d0f      0x1042      0000:00:17.0
AFIDEVICE    5       0x1d0f      0x1042      0000:00:19.0
AFIDEVICE    6       0x1d0f      0x1042      0000:00:1b.0
AFIDEVICE    7       0x1d0f      0x1042      0000:00:1d.0
  5. Dmesg shows xdma7:
[  532.867229] xdma:probe_one: 0000:00:0f.0 xdma7, pdev 0xffff897295a85000, xdev 0xffff88fafa894000, 0xffff88fafa896000, usr 16, ch 4,4.

The xdma index will probably differ depending on the order in which the xdma driver sequences through the slots, but if you use the method we provide through our library, you won't need to worry about this.
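
To illustrate why the slot number and the xdma index are independent, here is a rough Python sketch that joins the two outputs shown above on the PCI address. It is not the method used in test_dram_dma.c or the aws-fpga library; `slot_to_xdma` and the regular expressions are hypothetical helpers that assume the output formats look exactly like the samples in this thread.

```python
import re

# Matches slot lines such as:
#   AFIDEVICE    0       0x1d0f      0xf001      0000:00:0f.0
SLOT_RE = re.compile(
    r"^AFIDEVICE\s+(\d+)\s+0x[0-9a-fA-F]+\s+0x[0-9a-fA-F]+\s+(\S+)\s*$"
)
# Matches driver probe lines such as:
#   [  532.867229] xdma:probe_one: 0000:00:0f.0 xdma7, pdev ...
PROBE_RE = re.compile(r"xdma:probe_one: (\S+) xdma(\d+),")


def slot_to_xdma(slots_text, dmesg_text):
    """Map FPGA slot number -> xdma index by joining on the PCI address."""
    bdf_to_xdma = {}
    for line in dmesg_text.splitlines():
        match = PROBE_RE.search(line)
        if match:
            # A later probe of the same address replaces an earlier one.
            bdf_to_xdma[match.group(1)] = int(match.group(2))

    mapping = {}
    for line in slots_text.splitlines():
        match = SLOT_RE.match(line.strip())
        if match and match.group(2) in bdf_to_xdma:
            mapping[int(match.group(1))] = bdf_to_xdma[match.group(2)]
    return mapping


# Sample text copied from the outputs in this thread.
SLOTS = """\
AFIDEVICE    0       0x1d0f      0xf001      0000:00:0f.0
AFIDEVICE    1       0x1d0f      0x1042      0000:00:11.0
"""
DMESG = (
    "[  532.867229] xdma:probe_one: 0000:00:0f.0 xdma7, "
    "pdev 0xffff897295a85000, xdev 0xffff88fafa894000, "
    "0xffff88fafa896000, usr 16, ch 4,4.\n"
)

print(slot_to_xdma(SLOTS, DMESG))
```

With the sample text, slot 0 maps to xdma index 7. The sketch does not track remove_one events, so on a live system the text would need to come from a fresh dmesg capture taken after the last AFI load.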

Hope this helps, and let us know if you still need help after this.

Thanks,

Deep

Deep_P
answered 2 years ago
