We have an FSx for Lustre file system on AWS (FSx Persistent SSD, 1.2 TB, 250 MB/s/TiB; FSx for Lustre server version 2.12). The file system is mounted from an EC2 instance running Ubuntu, using the Lustre client modules. Which Lustre client version can be installed depends on the Linux kernel.
On the file system we have compressed data tables (80,000 rows x 1,440 columns, plus a transposed copy of each). These are stored using the fst library/package for R (https://www.fstpackage.org/). The data are stored column-wise in serialised form. The benefit is that one can read a set of columns (or rows) without having to read the whole file (similar to Parquet files). It is one of the fastest ways to read and write data from R.
Recently we discovered that on a newly configured EC2 reading data from these files on the Lustre file system is a lot slower than on an older EC2.
After some debugging we found that an EC2 instance with Ubuntu 18.04.6 LTS and kernel 5.4.0-1083-aws using Lustre Client 2.10.8 performs as fast as expected (the same as the older EC2). However, upgrading the Lustre Client to 2.12.8 on that same machine, with nothing else changed, results in poor performance. The benchmark reads 4 columns from 180 files, each containing a data table as described above. This takes about 5 seconds in the fast case and about 20 seconds in the slow case.
In addition, reading one whole table (one file) with read_fst takes about:
1-2 seconds with Lustre Client 2.10.8
20-22 seconds with Lustre Client 2.12.8
1-2 seconds with Lustre Client 2.15.4 (on Ubuntu 22, on another EC2)
With Lustre Client 2.12.8, reading the same files again immediately afterwards brings the time back down to 1-2 seconds. So once the files are cached (somewhere), performance is fine, but a cold read is very slow. In contrast, the other two clients (2.10.8 and 2.15.4) are already very fast on the first read.
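To check whether the cold/warm difference comes from the file system rather than from R or fst, the same comparison could be made with plain file I/O. The following is a minimal sketch, not what we actually ran. The helper names are made up for illustration, and it assumes that the Lustre client honours the POSIX_FADV_DONTNEED hint, which we have not verified.

```python
import os
import time


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file sequentially; return (seconds, bytes_read)."""
    start = time.perf_counter()
    total = 0
    # buffering=0 avoids Python's own buffer, so only the OS cache is involved.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return time.perf_counter() - start, total


def evict_from_page_cache(path):
    """Ask the kernel to drop cached pages for one file (Linux only).

    Assumption: the Lustre client honours this hint. If it does not,
    the "cold" timing below will still look warm.
    """
    if not hasattr(os, "posix_fadvise"):
        return
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def cold_and_warm(path):
    """Return (cold_seconds, warm_seconds) for one file."""
    evict_from_page_cache(path)
    cold, _ = timed_read(path)
    warm, _ = timed_read(path)
    return cold, warm
```

Calling cold_and_warm on one of the table files (for example a hypothetical /fsx/tables/table_001.fst) with each client version should show whether the 20-second cold read reproduces outside R.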
I would prefer to use 2.15.4, which is the latest supported client version on the highest supported Ubuntu release (22). Unfortunately, in the first test the performance of 2.15.4 is similar to that of 2.12.8: reading 4 columns from 180 files takes about 20 seconds instead of around 5. As a result, we are stuck with Ubuntu 18 and kernel 5.4.0, which is the latest combination that still supports Lustre Client 2.10.8 (the only client that is fast in all cases).
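Since 2.15.4 is fast for a whole-file read but slow for the 4-column read, the access pattern seems to matter. A rough way to test that without R would be to time a few scattered range reads against one sequential read of the same file. This is only a sketch: the function names are hypothetical, and it assumes (without checking the fst format) that reading a few columns amounts to a handful of range reads at different offsets.

```python
import os
import time


def read_ranges(path, ranges):
    """Read each (offset, length) range with pread; return (seconds, bytes_read)."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    start = time.perf_counter()
    try:
        for offset, length in ranges:
            total += len(os.pread(fd, length, offset))
    finally:
        os.close(fd)
    return time.perf_counter() - start, total


def scattered_ranges(file_size, n_ranges, length):
    """Evenly spaced ranges: a stand-in (assumption) for reading a few columns."""
    stride = file_size // n_ranges
    return [
        (i * stride, min(length, file_size - i * stride))
        for i in range(n_ranges)
    ]


def sequential_range(file_size):
    """A single range covering the whole file."""
    return [(0, file_size)]
```

For a given file, comparing read_ranges(path, scattered_ranges(size, 4, 1 << 20)) with read_ranges(path, sequential_range(size)) on each client version, with a cold cache, would show whether the slowdown is tied to non-sequential reads.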
The tests were repeated many times, at different times, to rule out caching effects.
What could be the reason for these large performance differences (4x to 10x slower)? Do some parameter settings differ between the Lustre Client versions? If so, can they be adjusted?
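One way to look for such differences would be to dump the client tunables on each machine with lctl get_param (for example the read-ahead settings llite.*.max_read_ahead_mb, llite.*.max_read_ahead_per_file_mb and llite.*.max_read_ahead_whole_mb, and the RPC settings osc.*.max_rpcs_in_flight and osc.*.max_pages_per_rpc) and compare the dumps. Whether any of these actually explains the slowdown is a guess on our part. Below is a minimal sketch of the comparison step, assuming each dump is saved as plain name=value lines; parse_params and diff_params are hypothetical helper names.

```python
def parse_params(text):
    """Parse name=value lines into a dict, skipping lines without '='."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep:
            params[name] = value
    return params


def diff_params(old, new):
    """Return {name: (old_value, new_value)} for every parameter that differs.

    A parameter missing on one side is reported with None for that side.
    """
    names = sorted(set(old) | set(new))
    return {
        name: (old.get(name), new.get(name))
        for name in names
        if old.get(name) != new.get(name)
    }
```

Parameter names that embed a per-mount identifier will not match across machines, so those would probably need to be normalised before diffing. Any parameter that turns out to differ could then be tried on the slow client with lctl set_param.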