Accelerating Container Startup with SOCI Snapshotter on Amazon EKS
Large container images in AI/ML workloads can take a long time to start, with most of that time spent pulling the image. This post demonstrates how the SOCI (Seekable OCI) Snapshotter reduces container startup times with lazy loading.
The Problem
Traditional image pulling is sequential: layers download one at a time over a single connection, then unpack serially. For large images, this becomes a major bottleneck even with high bandwidth.
Test Scenario
We'll create a 6GB Docker image simulating an ML workload to demonstrate the problem and solution.
Step 1: Create the Docker Image
Create a Dockerfile:
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-pip curl wget
# Simulate large ML models and datasets
RUN dd if=/dev/urandom of=/data1.bin bs=1M count=1500 && \
    dd if=/dev/urandom of=/data2.bin bs=1M count=1500 && \
    dd if=/dev/urandom of=/data3.bin bs=1M count=1500 && \
    dd if=/dev/urandom of=/data4.bin bs=1M count=1500
# Startup script with initialization delays
RUN echo '#!/bin/bash\n\
echo "Loading large dataset into memory..."\n\
sleep 60\n\
echo "Initializing ML model..."\n\
sleep 60\n\
echo "Warming up cache..."\n\
sleep 60\n\
echo "Performing health checks..."\n\
sleep 60\n\
echo "Application ready!"\n\
python3 -m http.server 8080' > /start.sh && chmod +x /start.sh
EXPOSE 8080
CMD ["/start.sh"]
Step 2: Build and Push to ECR
Set your AWS account ID and region:
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=us-west-2
export ECR_REPO=soci-demo
export IMAGE_TAG=slow
Create ECR repository:
aws ecr create-repository --repository-name $ECR_REPO --region $AWS_REGION
Authenticate Docker to ECR:
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
Build and push the image:
docker build -t $ECR_REPO:$IMAGE_TAG .
docker tag $ECR_REPO:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
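The full image reference is repeated in most commands in this walkthrough. As a convenience, it can be assembled once by a small helper. This is a minimal sketch: the function name ecr_image_uri, the IMAGE_URI variable, and the account ID 111122223333 are illustrative placeholders, not part of the original steps, and the hostname pattern is the one used by the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the ECR image reference from its parts.
# Arguments: account ID, region, repository name, tag.
ecr_image_uri() {
  local account_id="$1" region="$2" repo="$3" tag="$4"
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$account_id" "$region" "$repo" "$tag"
}

# Example with a placeholder account ID (not a real account).
IMAGE_URI=$(ecr_image_uri "111122223333" "us-west-2" "soci-demo" "slow")
echo "$IMAGE_URI"
# prints 111122223333.dkr.ecr.us-west-2.amazonaws.com/soci-demo:slow
```

In a real session you would call it as ecr_image_uri "$AWS_ACCOUNT_ID" "$AWS_REGION" "$ECR_REPO" "$IMAGE_TAG" and reuse $IMAGE_URI in the docker tag and docker push commands.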
Step 3: Create SOCI Index
Install SOCI CLI:
cd /tmp
wget https://github.com/awslabs/soci-snapshotter/releases/download/v0.11.0/soci-snapshotter-0.11.0-linux-amd64.tar.gz
tar -xzf soci-snapshotter-0.11.0-linux-amd64.tar.gz
chmod +x soci
sudo mv soci /usr/local/bin/
Create and push SOCI index to ECR:
soci create $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
soci push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
The soci create command analyzes the image layers and generates an index. The soci push command uploads this index to ECR alongside your image.
Understanding SOCI
SOCI (Seekable OCI) creates an index mapping image layer contents, enabling:
- Lazy loading: Start containers with essential files, pull the rest in background
- Parallel pulling: Multiple concurrent connections download chunks simultaneously
- On-demand fetching: Files are downloaded only when needed
EKS Cluster Setup
Step 4: Configure kubectl
aws eks update-kubeconfig --region $AWS_REGION --name eks-workshop
Step 5: Verify SOCI Installation
Check if SOCI is installed on nodes (pre-installed on recent EKS Optimized AMIs):
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE_NAME -it --profile=general --image=busybox -- chroot /host which soci
If /usr/bin/soci appears, you're ready. Otherwise, install SOCI on nodes using the DaemonSet approach below.
Using DaemonSet (Recommended for existing clusters)
Create soci-installer-daemonset.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: soci-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: soci-config
  template:
    metadata:
      labels:
        app: soci-config
    spec:
      hostPID: true
      hostNetwork: true
      containers:
      - name: config
        image: amazonlinux:2023
        command:
        - /bin/bash
        - -c
        - |
          # Install nsenter
          yum install -y util-linux
          # Create SOCI config directory
          mkdir -p /host/etc/soci-snapshotter-grpc
          # Create SOCI configuration
          cat > /host/etc/soci-snapshotter-grpc/config.toml <<'EOF'
          [blob]
          max_concurrent_downloads = -1
          max_concurrent_downloads_per_image = 10
          concurrent_download_chunk_size = "16mb"
          max_concurrent_unpacks = -1
          max_concurrent_unpacks_per_image = 10
          discard_unpacked_layers = true
          EOF
          echo "SOCI configuration created"
          cat /host/etc/soci-snapshotter-grpc/config.toml
          # Enable and start SOCI snapshotter service
          echo "Enabling SOCI snapshotter service..."
          nsenter -t 1 -m -u -i -n systemctl enable soci-snapshotter.service
          nsenter -t 1 -m -u -i -n systemctl start soci-snapshotter.service
          # Wait for service to be ready
          sleep 5
          # Verify service is running
          nsenter -t 1 -m -u -i -n systemctl status soci-snapshotter.service | head -10
          # Restart containerd to apply changes
          echo "Restarting containerd..."
          nsenter -t 1 -m -u -i -n systemctl restart containerd
          echo "Configuration complete. Containerd restarted."
          sleep infinity
        securityContext:
          privileged: true
        volumeMounts:
        - name: host
          mountPath: /host
      volumes:
      - name: host
        hostPath:
          path: /
      tolerations:
      - operator: Exists
Deploy the DaemonSet:
kubectl apply -f soci-installer-daemonset.yaml
kubectl rollout status daemonset/soci-config -n kube-system
Verify SOCI service is running:
kubectl logs -n kube-system -l app=soci-config --tail=20
You should see:
SOCI configuration created
Enabling SOCI snapshotter service...
● soci-snapshotter.service - SOCI Snapshotter
Loaded: loaded (/etc/systemd/system/soci-snapshotter.service; enabled; preset: disabled)
Active: active (running)
Restarting containerd...
Configuration complete. Containerd restarted.
Baseline Test Without SOCI
Step 6: Deploy Without SOCI
Create deployment-without-soci.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-app-without-soci
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slow-app-without-soci
  template:
    metadata:
      labels:
        app: slow-app-without-soci
    spec:
      containers:
      - name: app
        image: <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/soci-demo:slow
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
Replace <AWS_ACCOUNT_ID> and <AWS_REGION> with your values, then deploy:
kubectl apply -f deployment-without-soci.yaml
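The placeholder replacement can also be scripted instead of edited by hand. The following is a minimal sketch using sed; the function name render_manifest and the sample account ID are assumptions made for illustration, and the demonstration runs on a temporary one-line file rather than the real manifest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: substitute the <AWS_ACCOUNT_ID> and <AWS_REGION>
# placeholders in a manifest and print the result to stdout.
render_manifest() {
  local file="$1" account_id="$2" region="$3"
  sed -e "s/<AWS_ACCOUNT_ID>/${account_id}/g" \
      -e "s/<AWS_REGION>/${region}/g" "$file"
}

# Self-contained demonstration on a one-line sample file.
sample=$(mktemp)
echo 'image: <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/soci-demo:slow' > "$sample"
rendered=$(render_manifest "$sample" "111122223333" "us-west-2")
echo "$rendered"
rm -f "$sample"
```

Against the real file, the output can be piped straight into kubectl: render_manifest deployment-without-soci.yaml "$AWS_ACCOUNT_ID" "$AWS_REGION" | kubectl apply -f -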
Monitor the deployment:
kubectl get pods -l app=slow-app-without-soci -w
Check pull time:
kubectl describe pod -l app=slow-app-without-soci | grep -A 5 "Events:"
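To compare runs numerically, the pull duration can be extracted from the event text and converted to seconds. This sketch assumes the kubelet event has the form Successfully pulled image "..." in <duration>, with a Go-style duration made of hours, minutes, and seconds; the exact wording can differ between Kubernetes versions. The function names and the sample event line are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read event text on stdin and print the duration that
# follows 'Successfully pulled image "<ref>" in' (first match only).
pull_duration_from_event() {
  sed -n 's/.*Successfully pulled image "[^"]*" in \([0-9hms.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: convert a duration such as 2m38s or 1h0m5s to whole
# seconds. Only h, m, and s units are handled; anything else is rejected.
duration_to_seconds() {
  local d="$1"
  local re='^(([0-9]+)h)?(([0-9]+)m)?(([0-9]+(\.[0-9]+)?)s)?$'
  if [[ -n "$d" && "$d" =~ $re ]]; then
    awk -v h="${BASH_REMATCH[2]:-0}" -v m="${BASH_REMATCH[4]:-0}" \
        -v s="${BASH_REMATCH[6]:-0}" \
        'BEGIN { printf "%.0f\n", h * 3600 + m * 60 + s }'
  else
    echo "unrecognized duration: $d" >&2
    return 1
  fi
}

# Sample event line (assumed format, placeholder account ID).
sample_event='Successfully pulled image "111122223333.dkr.ecr.us-west-2.amazonaws.com/soci-demo:slow" in 2m38s (2m38s including waiting)'
pull=$(printf '%s\n' "$sample_event" | pull_duration_from_event)
echo "pull duration: $pull = $(duration_to_seconds "$pull") seconds"
# prints: pull duration: 2m38s = 158 seconds
```

With a live cluster, the same filter can be fed from kubectl describe pod -l app=slow-app-without-soci | pull_duration_from_event.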
Note: If testing multiple times, the image may be cached on nodes. To clear the cache for accurate testing:
# Get the node name where pod is scheduled
NODE_NAME=$(kubectl get pods -l app=slow-app-without-soci -o jsonpath='{.items[0].spec.nodeName}')
# Clear the image from that node
kubectl debug node/$NODE_NAME --profile=general --image=busybox -- \
  chroot /host ctr -n k8s.io images rm $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
Enable SOCI on EKS Nodes
Step 7: Configure SOCI Snapshotter
If using the DaemonSet approach, the configuration is already applied. Verify it's working:
kubectl logs -n kube-system -l app=soci-config --tail=20
Configuration File: /etc/soci-snapshotter-grpc/config.toml
[blob]
max_concurrent_downloads = -1
max_concurrent_downloads_per_image = 10
concurrent_download_chunk_size = "16mb"
max_concurrent_unpacks = -1
max_concurrent_unpacks_per_image = 10
discard_unpacked_layers = true
Key Configuration Settings:
- max_concurrent_downloads = -1: Unlimited total concurrent downloads across all images. Setting to -1 removes global restrictions, allowing per-image limits to control behavior.
- max_concurrent_downloads_per_image = 10: Number of parallel HTTP connections used to download a single image. AWS recommends 10-20 for ECR. Higher values increase throughput but add overhead. Start with 10 and increase if the network isn't saturated.
- concurrent_download_chunk_size = "16mb": Size of chunks when splitting large layers for parallel download. 16MB balances parallelism (more chunks = more concurrent work) with overhead (smaller chunks = more HTTP requests). Larger chunks (32MB) work better on high-latency networks.
- max_concurrent_unpacks = -1: Unlimited total concurrent layer unpacking operations across all images. Similar to downloads, -1 removes global limits.
- max_concurrent_unpacks_per_image = 10: Number of layers that can be decompressed and extracted simultaneously for one image. Utilizes multi-core CPUs effectively. Set based on available CPU cores; 10 works well for typical nodes with 8+ cores.
- discard_unpacked_layers = true: Deletes compressed layer data after extraction to save disk space. Recommended for production to prevent disk exhaustion, especially with large images. Set to false only if you need to preserve original layers for debugging.
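The trade-off between chunk size and request count can be made concrete with a little arithmetic. The sketch below is a simplification under stated assumptions: it treats "16mb" as 16 MiB, assumes a single 6000 MiB layer (roughly what the four 1500 MiB files in the demo Dockerfile add up to), and models downloads as equal-sized "waves" of parallel chunks. The function names are illustrative, and the model ignores how the snapshotter actually schedules requests.

```shell
#!/usr/bin/env bash
# Number of chunks needed to cover a layer: ceil(layer_mib / chunk_mib).
chunk_count() {
  local layer_mib="$1" chunk_mib="$2"
  echo $(( (layer_mib + chunk_mib - 1) / chunk_mib ))
}

# Number of download "waves" if at most `parallel` chunks are in flight at once:
# ceil(chunks / parallel).
download_waves() {
  local chunks="$1" parallel="$2"
  echo $(( (chunks + parallel - 1) / parallel ))
}

chunks=$(chunk_count 6000 16)
echo "16 MiB chunks: $chunks"                                   # prints 375
echo "waves at 10 connections: $(download_waves "$chunks" 10)"  # prints 38
echo "32 MiB chunks: $(chunk_count 6000 32)"                    # prints 188
```

Doubling the chunk size roughly halves the number of HTTP requests, at the cost of coarser parallelism.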
Note: The DaemonSet approach automatically configures all nodes and restarts containerd. The configuration persists across node restarts.
Step 8: Deploy With SOCI
Create deployment-with-soci.yaml, replacing <AWS_ACCOUNT_ID> and <AWS_REGION> with your values:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-app-with-soci
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slow-app-with-soci
  template:
    metadata:
      labels:
        app: slow-app-with-soci
    spec:
      containers:
      - name: app
        image: <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/soci-demo:slow
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
Deploy and monitor:
kubectl apply -f deployment-with-soci.yaml
kubectl get pods -l app=slow-app-with-soci -w
kubectl describe pod -l app=slow-app-with-soci | grep -A 5 "Events:"
Results
SOCI reduces container startup time through lazy loading:
- Downloads small SOCI index (~few KB)
- Fetches only essential files needed to start (~50-100MB)
- Starts container immediately
- Downloads remaining 6GB in background while container runs
How SOCI Lazy Loading Works
Traditional Approach:
- Download all 6GB of layers
- Extract all layers to disk
- Wait for completion (2m38s)
- Start container
SOCI Approach:
- Download SOCI index (metadata about files)
- Identify minimum files needed (e.g., /start.sh, /bin/bash, Python)
- Download only those files via HTTP range requests
- Continue downloading remaining files in background
- Fetch additional files on-demand if accessed
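The HTTP range requests mentioned above ask the registry for a byte span of a layer blob rather than the whole blob. The sketch below only illustrates the standard Range header syntax; the function name range_header and the offsets are hypothetical, and how the snapshotter chooses spans from its index is internal to SOCI and not shown here.

```shell
#!/usr/bin/env bash
# Build an HTTP Range header for `length` bytes starting at `offset`.
# Byte ranges are inclusive on both ends, so the last byte is offset+length-1.
range_header() {
  local offset="$1" length="$2"
  if (( length <= 0 )); then
    echo "length must be positive" >&2
    return 1
  fi
  echo "Range: bytes=${offset}-$(( offset + length - 1 ))"
}

# Hypothetical span: 4 MiB starting 1 GiB into a layer blob.
range_header $(( 1024 * 1024 * 1024 )) $(( 4 * 1024 * 1024 ))
# prints Range: bytes=1073741824-1077936127
```

The resulting header can be passed to curl with -H against any server that supports range requests, which then returns only the requested bytes.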
The container is fully functional while the bulk of the image downloads in the background!
Troubleshooting
If SOCI doesn't seem to work:
- Verify SOCI snapshotter service is running:
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE_NAME --profile=general --image=busybox -- chroot /host systemctl status soci-snapshotter.service
- Check if service is enabled:
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE_NAME --profile=general --image=busybox -- chroot /host systemctl list-unit-files | grep soci
Should show: soci-snapshotter.service enabled
- Verify SOCI index exists in ECR:
soci index list $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
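When checking several nodes, the "enabled" check can be turned into a pass/fail test instead of reading the output by eye. This is a minimal sketch that inspects systemctl list-unit-files output supplied on stdin; the function name soci_unit_enabled and the sample line are illustrative, and the column layout (unit name first, state second) is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if stdin contains a line whose first
# column is soci-snapshotter.service and whose second column is "enabled".
soci_unit_enabled() {
  awk '$1 == "soci-snapshotter.service" && $2 == "enabled" { found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Sample line in the assumed list-unit-files layout: unit, state, preset.
sample='soci-snapshotter.service enabled disabled'
if printf '%s\n' "$sample" | soci_unit_enabled; then
  echo "SOCI unit is enabled"
else
  echo "SOCI unit is NOT enabled"
fi
```

On a cluster, the output of the kubectl debug ... systemctl list-unit-files command shown above can be piped into soci_unit_enabled in place of grep.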
Conclusion
SOCI Snapshotter reduces image pull times for large containers through lazy loading, without requiring code changes. For AI/ML workloads on EKS, this means near-instant deployments, responsive scaling, and better resource utilization. The key to success is ensuring the SOCI snapshotter service is properly enabled and running on your EKS nodes.
Cleanup
To remove the test resources:
# Delete deployments
kubectl delete deployment slow-app-without-soci slow-app-with-soci
# Delete the SOCI DaemonSet (this alone does not stop SOCI on the nodes; see the note below)
kubectl delete daemonset soci-config -n kube-system
# Delete ECR repository
aws ecr delete-repository --repository-name $ECR_REPO --region $AWS_REGION --force
# Remove local Docker images
docker rmi $ECR_REPO:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
Note: Deleting the DaemonSet will not automatically disable SOCI on nodes. The service will continue running until nodes are recycled or the service is manually stopped.