ECS Fargate writing and reading files is failing in a very odd way

I have an ECS service written in Go that is running on Fargate, and I am getting some very strange filesystem behavior that I cannot explain.

I have ephemeral storage mounted on /scratch, and my containers all run as root (apparently you have to, since storage other than EFS seems to be mounted root-owned with mode 755).

When I read and write with my stress test, as well as with my actual app, files I just wrote seem to vanish. I can write a file, and when I go to read it, it's gone...

The included test code writes 1,000 files, reads them back, and then deletes them. This loop runs 1,000 times or until it hits an error. The test works fine on my laptop and in Kubernetes on my home lab, EKS, and GKE. No issues.

On my Fargate task:

2025/08/05 13:43:29 Iteration 99...
  loop sample path: /scratch/lakerunner/stress-test-3307732947
2025/08/05 13:43:29 Iteration 100...
  loop sample path: /scratch/lakerunner/stress-test-2039664141
Error: failed to read back file #668 (/scratch/lakerunner/stress-test-1072025920): open /scratch/lakerunner/stress-test-1072025920: no such file or directory

It fails randomly, sometimes as early as the first file.

I am stuck, and I'm not sure where to go from here. I know I could use EFS, but I really just use the disk as scratch space, and that seems like a really rotten change. Local disk is very fast; EFS is not, and it is also costly.

Test code:

package cmd

import (
	"fmt"
	"log"
	"os"

	"github.com/spf13/cobra"
)

func init() {
	cmd := &cobra.Command{
		Use:   "diskstress",
		Short: "Run disk stress tests",
		RunE: func(_ *cobra.Command, _ []string) error {
			return stressTestTempFiles()
		},
	}

	rootCmd.AddCommand(cmd)
}

func stressTestTempFiles() error {
	fmt.Printf("Running disk stress test, tmpdir is %q\n", os.TempDir())

	for loop := range 1000 {
		log.Printf("Iteration %d...", loop+1)
		var filePaths []string

		for i := range 1000 {
			f, err := os.CreateTemp("", "stress-test-*")
			if err != nil {
				return fmt.Errorf("failed to create temp file #%d: %w", i, err)
			}

			n, err := f.WriteString("hello world")
			if err != nil {
				if err2 := f.Close(); err2 != nil {
					log.Printf("warning: failed to close file %s: %v", f.Name(), err2)
				}
				return fmt.Errorf("failed to write to file %s: %w", f.Name(), err)
			}
			if n != len("hello world") {
				if err2 := f.Close(); err2 != nil {
					log.Printf("warning: failed to close file %s: %v", f.Name(), err2)
				}
				return fmt.Errorf("short write to file %s: wrote %d bytes, expected %d", f.Name(), n, len("hello world"))
			}

			filePaths = append(filePaths, f.Name())
			if err2 := f.Close(); err2 != nil {
				log.Printf("warning: failed to close file %s: %v", f.Name(), err2)
			}
		}

		for i, path := range filePaths {
			if i == 0 {
				fmt.Printf("  loop sample path: %s\n", path)
			}
			data, err := os.ReadFile(path)
			if err != nil {
				return fmt.Errorf("failed to read back file #%d (%s): %w", i, path, err)
			}
			if string(data) != "hello world" {
				return fmt.Errorf("unexpected content in file %s: %q", path, data)
			}
		}

		for _, path := range filePaths {
			if err := os.Remove(path); err != nil {
				log.Printf("warning: failed to remove file %s: %v", path, err)
			}
		}
	}
	return nil
}
asked 10 months ago, 384 views
1 Answer

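This issue with files seemingly vanishing in your Fargate task may be related to how file operations interact with the mounted storage in Fargate. Note that the test shown runs sequentially in a single goroutine, so concurrency would only be a factor if another process or container is using the same directory.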

When you're experiencing files that disappear after writing them, this could be due to one of several issues:

  1. Uninterruptible Sleep State: When multiple processes are writing to the same filesystem in Fargate, one process might enter an "Uninterruptible sleep" (D) state while writing, which can cause other operations to hang or fail. This is particularly common with network-attached storage like EFS, but can also happen with ephemeral storage under high concurrency.

  2. Resource Limitations: Fargate tasks have specific CPU and memory allocations. If your task is running with limited resources, it might struggle with high-volume file operations, especially when creating and accessing 1,000 files in rapid succession.

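  3. Filesystem Caching Issues: The ephemeral storage in Fargate might not be synchronizing file operations as expected. When you write a file and immediately try to read it, the filesystem might not have completed flushing the data to disk. On a local Linux filesystem, however, a file is normally visible to the process that created it whether or not its data has reached disk, so a pending flush would not by itself explain a "no such file or directory" error.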

To address this issue, you could try:

  1. Add Synchronization: Force data flushes using fsync (in Go, call f.Sync() on the *os.File before closing it) to ensure writes are completed before attempting reads.

  2. Increase Task Resources: Allocate more CPU and memory to your Fargate task to handle the file operations more efficiently.

  3. Reduce Concurrency: Modify your test to create fewer files at once or add delays between operations.

  4. Consider EFS with Provisioned Throughput: While you mentioned wanting to avoid EFS due to performance concerns, EFS with Provisioned Throughput mode could provide better performance for your use case if the ephemeral storage continues to be problematic.

  5. Check Disk Space: Ensure your task isn't running out of ephemeral storage space during the test.

If you need to continue using ephemeral storage, you might want to implement a retry mechanism for file operations that fail with "no such file or directory" errors, as this could help mitigate the issue while you investigate the root cause.
Sources
Can not download files in Fargate task with mounted EFS volume | AWS re:Post
Troubleshooting mount issues - Amazon Elastic File System

answered 10 months ago
