Choosing an S3 connector for ML training with S3 Express One Zone

A field guide for PyTorch, TensorFlow, Spark, and Kubernetes workloads reading training data from Amazon S3 Express One Zone directory buckets.

The problem

You're training an ML model on AWS and you've chosen Amazon S3 Express One Zone for its single-digit millisecond latency and high request rates. Your data is in a directory bucket. Now you need to connect your training framework to it.

This guide covers which connector to use, what Express-specific configuration each one requires, and the mistakes that will cost you time.

One universal requirement: every connector needs s3express:CreateSession in its IAM policy. Standard managed policies (AmazonSageMakerFullAccess, AmazonElasticMapReduceforEC2Role, etc.) do not include it. The IAM section has the full policy template.

Decision matrix

| Scenario | Connector | Notes |
| --- | --- | --- |
| PyTorch training (EC2 or SageMaker) | S3 Connector for PyTorch | CRT handles CreateSession transparently |
| TensorFlow / JAX on EC2 | Mountpoint for S3 | Mount the directory bucket directly |
| Any framework on EKS | Mountpoint CSI driver v2+ | Set bucketName to directory bucket |
| Spark / Hive / Flink on EMR | S3A (not EMRFS) | Requires Express-specific cluster config; EMR 6.15+ Magic Committer enables writes |
| SageMaker training job | Native S3Uri | Role needs s3express:CreateSession |
| Custom pipelines | boto3 | Transparent, no code changes |
| Distributed checkpoints | S3 Connector for PyTorch or Mountpoint | Express sub-10ms writes reduce checkpoint overhead |
| Shared read cache across nodes | Mountpoint --cache-xz | Uses a separate directory bucket as cache |

The connectors

S3 Connector for PyTorch

The Amazon S3 Connector for PyTorch is a PyTorch-native library backed by the AWS Common Runtime (CRT). The CRT handles CreateSession for Express transparently: point s3_uri at the directory bucket and go.

This is the best throughput option for PyTorch. If you need the same data accessible to non-PyTorch tools, use Mountpoint instead.

Mountpoint for Amazon S3

Mountpoint for Amazon S3 is an open-source file client that mounts an S3 bucket as a Linux filesystem. It supports directory buckets directly: use the full bucket name, including the --<az>--x-s3 suffix. It is framework-agnostic: TensorFlow tf.data, JAX, HuggingFace datasets, NumPy, and OpenCV all work.

Two caching modes with Express:

  • --cache /mnt/local-cache: local disk cache on the instance
  • --cache-xz <cache-directory-bucket>: shared cache backed by a separate Express directory bucket, useful for multi-node clusters reading the same data (objects ≤1 MB only)

Linux-only. Does not provide full POSIX semantics (no file locking, atomic rename, or hard links).

Mountpoint CSI driver (Amazon EKS)

The Mountpoint CSI driver is the Kubernetes version of Mountpoint, available as an Amazon EKS add-on. Set bucketName in volumeAttributes to the directory bucket name. The v2 driver uses EKS Pod Identity for IAM.

It's not supported on Fargate, EKS Hybrid Nodes, or Windows containers.

S3A (Apache Hadoop connector) on Amazon EMR

S3A is the only Hadoop-ecosystem connector that supports S3 Express. EMRFS does not support Express. If you're on Amazon EMR and want Express, you must use S3A with s3a:// URIs, even though EMRFS is the default.

Requires specific cluster configuration (see code snippet below) because Express ETags aren't MD5 checksums, region resolution doesn't auto-detect, and S3 Select isn't supported.

Writes: EMR 6.15+ uses Spark's MagicCommitProtocol (backed by MagicV2Committer) as the default FileCommitProtocol for s3a://. Magic Committer is rename-free (it publishes outputs via S3 multipart upload completion, which Express supports), so reads and writes to a directory bucket both work. Don't override the default committer. Older EMR releases, or self-managed Spark on Apache Hadoop older than 3.4.0, default to FileOutputCommitter, which can fail on Express with InvalidStorageClass.

Requires EMR 6.15+ (Spark) or 7.2+ (HBase/Flink/Hive). EMR 7.10+ ships the Amazon EMR S3A connector with native Express support by default.

Amazon SageMaker training jobs

Amazon SageMaker File mode, FastFile mode, and Pipe mode all support Express directory buckets as input. Point S3Uri at the directory bucket. The training role needs s3express:CreateSession because AmazonSageMakerFullAccess alone is not sufficient.

Encryption note: SageMaker output written to a directory bucket can only use SSE-S3. SSE-KMS is not supported for SageMaker output to Express. Input staging is unaffected.

AWS SDK (boto3)

Current SDK versions handle CreateSession transparently for directory buckets. Use the directory bucket name as the Bucket parameter; no other changes are needed. Good for custom data pipelines and preprocessing/embedding jobs.
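As a minimal sketch of the boto3 path, assuming a boto3 release recent enough to support directory buckets: fetch_bytes is a hypothetical helper (not part of boto3), and the bucket and key names are placeholders. The helper takes the client as an argument, so it can be exercised without AWS access.

```python
def fetch_bytes(s3_client, bucket: str, key: str) -> bytes:
    """Read one object fully into memory via the client's get_object call."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


# Usage against a real directory bucket (placeholder names, needs AWS
# credentials with s3express:CreateSession on the bucket):
#
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   data = fetch_bytes(s3, "my-bucket--use1-az4--x-s3", "dataset/sample-000001.jpg")
```

The directory bucket name is passed unchanged as Bucket, which is the only Express-specific part of the call.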

Code snippets

Every snippet below was tested against a live S3 Express One Zone directory bucket. They include fixes for API changes (pytorch_lightning 2.x) and required configuration not present in default setups (S3A on EMR, SageMaker IAM).

S3 Connector for PyTorch

from s3torchconnector import S3MapDataset
from torch.utils.data import DataLoader

dataset = S3MapDataset.from_prefix(
    s3_uri="s3://my-bucket--use1-az4--x-s3/imagenet/train/",
    region="us-east-1",
)

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    shuffle=True,
    collate_fn=lambda batch: batch,  # S3Reader objects; decode to tensors in your transform
)

for batch in loader:
    images = [item.read() for item in batch]  # bytes; decode with PIL, cv2, etc.
    ...

Checkpoint saving with PyTorch Lightning 2.x:

import pytorch_lightning as pl  # Lightning 2.x
from s3torchconnector.lightning import S3LightningCheckpoint

# Lightning 2.x API: do NOT use plugins=[S3LightningCheckpoint(...)]
# or default_root_dir="s3://..." (both fail in 2.x).
trainer = pl.Trainer(
    default_root_dir="/tmp/checkpoints",  # local path for trainer internals
    enable_checkpointing=False,           # disable auto ModelCheckpoint
    max_epochs=10,
)
trainer.strategy.checkpoint_io = S3LightningCheckpoint(region="us-east-1")
trainer.fit(model)  # model: your LightningModule, defined elsewhere
# Save explicitly to the directory bucket
trainer.save_checkpoint("s3://my-bucket--use1-az4--x-s3/run-42/model.ckpt")

Mountpoint for S3

Mount a directory bucket:

sudo mount-s3 my-bucket--use1-az4--x-s3 /mnt/s3 \
    --allow-delete \
    --allow-overwrite \
    --region us-east-1

Mount a general-purpose bucket with a shared Express cache:

# --cache-xz requires a SEPARATE directory bucket as the cache target.
# Only objects <=1 MB are cached in the shared cache.
sudo mount-s3 my-training-bucket /mnt/s3 \
    --cache-xz my-cache-bucket--use1-az4--x-s3 \
    --read-only \
    --region us-east-1

For read-only training data, omit --allow-delete and --allow-overwrite. For write workloads (checkpointing), both flags are required—without them, writes return Operation not permitted even as root.

Mountpoint CSI driver on EKS (PersistentVolume)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-express-training-data
spec:
  capacity:
    storage: 1200Gi
  accessModes:
    - ReadOnlyMany          # use ReadWriteMany for checkpoint writes
  mountOptions:
    - region us-east-1
    # add allow-delete and allow-overwrite here for write workloads
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-express-training-data
    volumeAttributes:
      bucketName: my-bucket--use1-az4--x-s3   # directory bucket name

The CSI driver's IAM role needs s3express:CreateSession on the directory bucket ARN. Use EKS Pod Identity (v2 driver) rather than IRSA.

S3A on EMR for S3 Express

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider",
      "fs.s3a.change.detection.mode": "none",
      "fs.s3a.endpoint.region": "us-east-1",
      "fs.s3a.select.enabled": "false"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false"
    }
  }
]

Then use s3a:// URIs in your Spark jobs:

df = spark.read.format("binaryFile").load("s3a://my-bucket--use1-az4--x-s3/dataset/shards/")

Why each setting matters:

  • change.detection.mode=none: Express ETags are not MD5 checksums; default change detection breaks
  • endpoint.region: region resolution doesn't auto-detect for Express
  • select.enabled=false: S3 Select is not supported on Express
  • fastS3PartitionDiscovery.enabled=false: uses an S3 API parameter not supported by Express
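The four settings above can be checked before a job starts. The following is a hedged sketch, not an EMR or Hadoop API: REQUIRED_EXPRESS_SETTINGS and missing_express_settings are hypothetical names, and the input is assumed to be a plain dict mapping property names (as written in the cluster configuration) to string values.

```python
# Settings with a fixed required value for S3 Express, taken from the list above.
REQUIRED_EXPRESS_SETTINGS = {
    "fs.s3a.change.detection.mode": "none",
    "fs.s3a.select.enabled": "false",
    "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false",
}


def missing_express_settings(props: dict) -> list:
    """Return human-readable problems; an empty list means nothing is missing."""
    problems = []
    for key, expected in REQUIRED_EXPRESS_SETTINGS.items():
        actual = props.get(key)
        if actual is None:
            problems.append(f"{key} is not set (expected {expected})")
        elif actual.strip().lower() != expected:
            problems.append(f"{key}={actual} (expected {expected})")
    # The region has no fixed value; it only has to be set explicitly.
    if not props.get("fs.s3a.endpoint.region"):
        problems.append("fs.s3a.endpoint.region is not set")
    return problems


cluster_props = {
    "fs.s3a.change.detection.mode": "none",
    "fs.s3a.endpoint.region": "us-east-1",
    "fs.s3a.select.enabled": "false",
    "spark.sql.sources.fastS3PartitionDiscovery.enabled": "false",
}
print(missing_express_settings(cluster_props))  # empty list for a complete config
```

Failing fast on an incomplete configuration is usually cheaper than debugging a read error halfway through a job.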

SageMaker training job

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
sm.create_training_job(
    TrainingJobName="my-express-training-job",
    AlgorithmSpecification={
        "TrainingImage": "...",
        "TrainingInputMode": "File",  # or FastFile
    },
    RoleArn="arn:aws:iam::123456789012:role/my-sagemaker-role",
    InputDataConfig=[{
        "ChannelName": "training",
        "DataSource": {
            "S3DataSource": {
                "S3Uri": "s3://my-bucket--use1-az4--x-s3/dataset/",
                "S3DataType": "S3Prefix",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }],
    ...
)

File mode copies the full dataset to local storage before training starts. Use it when your algorithm needs random access to the entire dataset or when the dataset fits comfortably on the instance volume. FastFile mode streams from S3 on demand with a file system interface. Use this for large datasets or when you want training to start immediately. FastFile has replaced Pipe mode, which is still supported but no longer recommended.

IAM requirements

Every connector needs s3express:CreateSession on the directory bucket ARN. The ARN format is:

arn:aws:s3express:<region>:<account-id>:bucket/<bucket-name>

Object-level operations (s3:GetObject, s3:PutObject, etc.) use standard IAM actions against the same ARN.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3express:CreateSession",
      "Resource": "arn:aws:s3express:us-east-1:123456789012:bucket/my-bucket--use1-az4--x-s3"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3express:us-east-1:123456789012:bucket/my-bucket--use1-az4--x-s3/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3express:us-east-1:123456789012:bucket/my-bucket--use1-az4--x-s3"
    }
  ]
}
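When several directory buckets are involved, the same template can be generated rather than hand-edited. This is a minimal sketch that reproduces the statements of the template above and sets the policy Version explicitly; express_bucket_arn and express_policy are hypothetical helper names, and the account ID and bucket name are placeholders.

```python
import json


def express_bucket_arn(region: str, account_id: str, bucket: str) -> str:
    """Build a directory bucket ARN in the format shown above."""
    return f"arn:aws:s3express:{region}:{account_id}:bucket/{bucket}"


def express_policy(region: str, account_id: str, bucket: str) -> dict:
    """Return the policy document from the template above as a dict."""
    arn = express_bucket_arn(region, account_id, bucket)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3express:CreateSession", "Resource": arn},
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{arn}/*",
            },
            {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": arn},
        ],
    }


policy = express_policy("us-east-1", "123456789012", "my-bucket--use1-az4--x-s3")
print(json.dumps(policy, indent=2))
```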

Common mistakes

Using EMRFS with Express. EMRFS does not support Express. On EMR, s3:// URIs without the S3A config route through EMRFS and fail or silently fall back to general-purpose S3 behavior.

Missing s3express:CreateSession. Standard managed policies don't include it. The IAM template above covers what you need.

Wrong AZ. Express is single-AZ. If your training instances are in a different AZ than the directory bucket, you reduce the latency benefit (~4ms in-AZ vs ~10ms cross-AZ). Cross-AZ data transfer within the same region is free for Express, so it's a latency cost but not a dollar cost. Check AZ IDs, not AZ names (names are account-specific, IDs are not).
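Because the AZ ID is embedded in the directory bucket name, the comparison can be automated. The sketch below is an illustration under stated assumptions: directory_bucket_az_id and same_az are hypothetical helpers, and the regular expression assumes the name--<az-id>--x-s3 pattern used throughout this guide, with AZ IDs shaped like use1-az4.

```python
import re

# <base-name>--<az-id>--x-s3, for example my-bucket--use1-az4--x-s3
_DIRECTORY_BUCKET = re.compile(
    r"(?P<base>[a-z0-9][a-z0-9-]*)--(?P<az_id>[a-z0-9]+-az\d+)--x-s3"
)


def directory_bucket_az_id(bucket: str) -> str:
    """Extract the AZ ID from a directory bucket name, or raise ValueError."""
    match = _DIRECTORY_BUCKET.fullmatch(bucket)
    if match is None:
        raise ValueError(f"not a directory bucket name: {bucket!r}")
    return match.group("az_id")


def same_az(bucket: str, instance_az_id: str) -> bool:
    """True when the instance's AZ ID matches the one in the bucket name."""
    return directory_bucket_az_id(bucket) == instance_az_id


# The instance's AZ ID is supplied by the caller here; on EC2 it is available
# from instance metadata under placement/availability-zone-id.
print(same_az("my-bucket--use1-az4--x-s3", "use1-az4"))
print(same_az("my-bucket--use1-az4--x-s3", "use1-az6"))
```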

--cache-xz with the same bucket. The cache target must be a separate directory bucket. Same bucket for data and cache is not supported.

--cache-xz size limit. Only objects ≤1 MB are cached in the shared Express cache. For large shards, use local disk caching (--cache) instead.

S3A writes returning InvalidStorageClass on older EMR or self-managed Spark. FileOutputCommitter performs a directory rename at job commit that Express rejects. On EMR 6.15+ this is handled automatically: Spark's default is MagicCommitProtocol (backed by MagicV2Committer), which is rename-free. On pre-6.15 EMR or Apache Hadoop older than 3.4.0, upgrade; the newer releases add Express-aware committer support.

Lightning 2.x checkpoint API. The plugins=[S3LightningCheckpoint(...)] pattern and default_root_dir="s3://..." both fail in Lightning 2.x. See the Lightning snippet for the working 2.x pattern.

aws s3 sync with directory buckets. Not supported. Use aws s3api copy-object per object, or script a list-and-copy loop.

num_workers=0 in the PyTorch DataLoader. The S3 Connector benefits from multiple workers. Single-threaded loading bottlenecks regardless of connector throughput.

In conclusion: Access patterns matter more than connector choice

The biggest lever for ML training throughput isn't the connector; it's whether you read sequentially or randomly.

  • Sequential reads (large sharded files, many samples per GET) are high throughput and latency-tolerant. All connectors perform well; Express's latency advantage is less pronounced.
  • Random reads (millions of small files, one GET per sample) are latency-bound. Express's sub-10ms latency even across AZs helps significantly vs. general-purpose S3.
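A back-of-envelope calculation makes the latency-bound case concrete. This sketch assumes a deliberately simple model: every request takes the ~4 ms in-AZ latency quoted in this guide, each of the concurrent readers issues requests back to back, and bandwidth and decode time are ignored. The dataset size, concurrency, and samples-per-shard figures are hypothetical.

```python
def request_bound_seconds(num_requests: int, latency_ms: float, concurrency: int) -> float:
    """Lower bound on wall time spent waiting on request latency alone."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return num_requests * latency_ms / 1000 / concurrency


samples = 10_000_000   # hypothetical dataset: one small object per sample
readers = 64           # hypothetical number of concurrent readers
latency_ms = 4.0       # in-AZ figure quoted in this guide

per_sample = request_bound_seconds(samples, latency_ms, readers)
# Same data packed 1,000 samples per shard: 1,000x fewer requests.
sharded = request_bound_seconds(samples // 1_000, latency_ms, readers)

print(f"one GET per sample: {per_sample:.1f} s of request latency per epoch")
print(f"1,000 samples per shard: {sharded:.3f} s of request latency per epoch")
```

Under these assumptions the per-sample layout spends about ten minutes per epoch on request latency alone, while the sharded layout spends well under a second and becomes throughput-bound instead.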

Two distinct patterns to keep in mind:

  • Training data loading (sequential batch reads): shard your dataset into larger files (WebDataset tar shards, TFRecords, Parquet) before picking a connector. Millions of individual GETs, even at 4ms each, add up. Fewer, larger files mean fewer requests and higher throughput on Express.
  • High-TPS random access (checkpoint writes, metadata lookups, small-object key-value patterns): this is where Express shines. Each request completes in ~4ms and a single directory bucket can handle hundreds of thousands of TPS without prefix partitioning. Don't shard these datasets; take advantage of the fast access to individual objects instead.
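For the first pattern, sharding can be done with the standard library before upload. The following is a minimal sketch, not the WebDataset tooling itself: pack_shards is a hypothetical helper that packs files into plain tar shards in sorted order, the 256 MiB default is an arbitrary assumption, and it does not guarantee that files belonging to one sample land in the same shard. The output directory is assumed to be outside the source directory.

```python
import tarfile
from pathlib import Path


def pack_shards(src_dir: str, out_dir: str, max_shard_bytes: int = 256 * 1024 * 1024) -> list:
    """Pack files under src_dir into numbered tar shards of roughly max_shard_bytes."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.rglob("*") if p.is_file())

    shards = []
    tar = None
    current_bytes = 0
    try:
        for path in files:
            size = path.stat().st_size
            # Start a new shard when adding this file would exceed the target size.
            if tar is None or (current_bytes > 0 and current_bytes + size > max_shard_bytes):
                if tar is not None:
                    tar.close()
                shard_path = out / f"shard-{len(shards):06d}.tar"
                tar = tarfile.open(shard_path, "w")
                shards.append(shard_path)
                current_bytes = 0
            tar.add(path, arcname=path.relative_to(src).as_posix())
            current_bytes += size
    finally:
        if tar is not None:
            tar.close()
    return shards
```

Calling pack_shards("raw-images", "shards") would write shard-000000.tar, shard-000001.tar, and so on, which can then be uploaded to the directory bucket and read sequentially by any of the connectors above.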

Happy training.