Strange file consistency issue with EFS


Issue description

Under certain workloads, I am having issues with EFS consistency (files written by client A not seen by client B). This is a minimal reproducible example with the following setup:

  • 2 VMs (server and client) running on AWS EC2
  • EFS mounted on both VMs (via a simple access point)
  • client requests server to compile a project (in this case, glfw), the build directory is under the EFS share point
  • client checks if the libglfw3.a exists under the EFS share point

Expectation

According to the close-to-open consistency rule, the client should be able to see the libglfw3.a file after the server finishes the compilation (even earlier, actually: as soon as the server finishes writing and closes that specific file).
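The close-to-open (CTO) contract can be illustrated with a minimal sketch. The file name below is illustrative; on NFS the writer and reader would be on different clients, and the guarantee under test is that a fresh open() after the writer's close() observes the written data:

```python
import os

path = "libglfw3.a"  # illustrative name, not the real build output path

# Writer side (server): write and close.
# close() is the point at which CTO requires dirty data to be
# flushed back to the NFS server.
with open(path, "wb") as f:
    f.write(b"archive contents")

# Reader side (client): a fresh open() is the point at which CTO
# requires the client to revalidate its caches against the server,
# so the read below must see the data written above.
with open(path, "rb") as f:
    data = f.read()

assert data == b"archive contents"
os.remove(path)
```

Run on a single machine this trivially passes; across an NFS/EFS mount it is exactly the guarantee whose violation is reported here.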

Observed behavior

The client sees the libglfw3.a file with a significant delay (tens of seconds), way after the server finishes the compilation. Importantly, if the NFS caches are disabled (actimeo=0), the client sees the libglfw3.a file immediately. If the NFS is mounted with looser caches (actimeo=600), it takes even longer for the client to see the libglfw3.a file.

This suggests that the caches are simply not invalidated by the open() syscall. Unless I misunderstand something, this does not respect the close-to-open consistency rule.

It is worth noting that the issue is quite hard to reproduce (I also tried writing simple files in various ways from one client, and the other client always sees them right away).
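The client-side check is essentially a polling loop that measures how long the file stays invisible. A hypothetical simplified version (the actual script is py_scripts/simple_timing_check.py in the repo; the path below is illustrative) might look like:

```python
import os
import time

def wait_for_file(path, timeout=120.0, interval=0.5):
    """Poll until `path` becomes visible, returning the delay in
    seconds, or None on timeout. os.path.exists() goes through the
    NFS client's attribute/dentry caches, which is exactly where
    the staleness shows up."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if os.path.exists(path):
            return time.monotonic() - start
        time.sleep(interval)
    return None

# Illustrative usage on the share point:
# delay = wait_for_file("/mnt/offloading_share_point/glfw/build/src/libglfw3.a")
```

With actimeo=0 the delay measured this way drops to roughly zero, which is what points at caching rather than write propagation.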

Steps to reproduce

Please refer to this repo for the source code.

aws cloudformation create-stack --stack-name efs-consistency-issue \
 --template-body file://config.yaml \
 --capabilities CAPABILITY_NAMED_IAM \
 --parameters ParameterKey=SSHKeyName,ParameterValue=keyname

Server VM setup

rsync -avz -e "ssh -i ~/.ssh/keyname.pem" . ubuntu@server_vm_public_ip:~/issue_report/

Build server container image and run it

ssh -i ~/.ssh/keyname.pem "ubuntu@server_vm_public_ip"

sudo docker build -t server -f Dockerfile.server .
sudo docker run -v ./share_point:/mnt/offloading_share_point/ --network host server

Client VM setup

Identical to server setup

Build client container image and run

Just change the server IP in the command below

sudo docker build -t client -f Dockerfile.client .

sudo docker run -v ./share_point:/mnt/offloading_share_point/ --network host client "python3 -u py_scripts/simple_timing_check.py http://server_vm_private_ip:5031/compile"
Ronald
asked a year ago · 349 views
1 Answer
Accepted Answer

I found the issue: it has to do with the way NFS caches directory entries. Close-to-open consistency only applies once a file is already visible to the client; until then, the NFS client may query the directory, find it empty, and cache that negative result for a while. The nfs(5) man page explains this in the "Directory entry caching" section.
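For anyone hitting the same thing: the directory-entry cache can be tuned at mount time. A sketch of the relevant options (names from nfs(5); the filesystem DNS name and mount point below are placeholders, and the exact values are illustrative):

```shell
# Cache only positive lookups: a "file not found" result is not
# cached, so a newly created file becomes visible on the next lookup.
sudo mount -t nfs4 -o lookupcache=pos,nfsvers=4.1 \
  fs-xxxx.efs.region.amazonaws.com:/ /mnt/share

# Alternatively, shrink the directory attribute cache window (seconds),
# which bounds how long a stale directory listing can persist.
sudo mount -t nfs4 -o acdirmin=1,acdirmax=3,nfsvers=4.1 \
  fs-xxxx.efs.region.amazonaws.com:/ /mnt/share
```

Either option trades some metadata-lookup performance for faster visibility of new files; actimeo=0 is the extreme end of the same trade-off.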

Ronald
answered a year ago
