
Accelerating Container Startup with SOCI Snapshotter on Amazon EKS

8 minute read
Content level: Advanced

Large container images in AI/ML workloads can take several minutes to start, with most of that time spent pulling the image. This post demonstrates how the SOCI (Seekable OCI) Snapshotter with lazy loading reduces container startup times.

The Problem

Traditional image pulling is sequential: layers download one at a time over a single connection, then unpack serially. For large images, this becomes a major bottleneck even with high bandwidth.

Test Scenario

We'll create a 6GB Docker image simulating an ML workload to demonstrate the problem and solution.

Step 1: Create the Docker Image

Create a Dockerfile:

FROM ubuntu:22.04

RUN apt-get update && apt-get install -y python3 python3-pip curl wget

# Simulate large ML models and datasets
RUN dd if=/dev/urandom of=/data1.bin bs=1M count=1500 && \
    dd if=/dev/urandom of=/data2.bin bs=1M count=1500 && \
    dd if=/dev/urandom of=/data3.bin bs=1M count=1500 && \
    dd if=/dev/urandom of=/data4.bin bs=1M count=1500

# Startup script with initialization delays
RUN echo '#!/bin/bash\n\
echo "Loading large dataset into memory..."\n\
sleep 60\n\
echo "Initializing ML model..."\n\
sleep 60\n\
echo "Warming up cache..."\n\
sleep 60\n\
echo "Performing health checks..."\n\
sleep 60\n\
echo "Application ready!"\n\
python3 -m http.server 8080' > /start.sh && chmod +x /start.sh

EXPOSE 8080
CMD ["/start.sh"]

Step 2: Build and Push to ECR

Set your AWS account ID and region:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=us-west-2
export ECR_REPO=soci-demo
export IMAGE_TAG=slow

Create ECR repository:

aws ecr create-repository --repository-name $ECR_REPO --region $AWS_REGION

Authenticate Docker to ECR:

aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

Build and push the image:

docker build -t $ECR_REPO:$IMAGE_TAG .
docker tag $ECR_REPO:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG

Step 3: Create SOCI Index

Install SOCI CLI:

cd /tmp
wget https://github.com/awslabs/soci-snapshotter/releases/download/v0.11.0/soci-snapshotter-0.11.0-linux-amd64.tar.gz
tar -xzf soci-snapshotter-0.11.0-linux-amd64.tar.gz
chmod +x soci
sudo mv soci /usr/local/bin/

Create and push SOCI index to ECR:

soci create $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG
soci push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG

The soci create command analyzes image layers and generates an index. The soci push uploads this index to ECR alongside your image.

Understanding SOCI

SOCI (Seekable OCI) creates an index mapping image layer contents, enabling:

  • Lazy loading: Start containers with essential files, pull the rest in background
  • Parallel pulling: Multiple concurrent connections download chunks simultaneously
  • On-demand fetching: Files are downloaded only when needed
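As a rough back-of-envelope sketch (assuming the 6GB test image and the 16MB chunk size and 10-connection limit configured later in this post), the parallel pulling works out like this:

```shell
# Back-of-envelope: how a 6GB image divides into 16MB chunks for parallel
# download. The numbers mirror the test image and config used in this post.
IMAGE_MB=6144   # ~6GB test image
CHUNK_MB=16     # concurrent_download_chunk_size
CONNS=10        # max_concurrent_downloads_per_image
CHUNKS=$((IMAGE_MB / CHUNK_MB))
ROUNDS=$(( (CHUNKS + CONNS - 1) / CONNS ))   # ceiling division
echo "chunks: $CHUNKS, rounds of $CONNS parallel fetches: $ROUNDS"
```

In practice chunks from different layers overlap and connections are reused, so this is an upper bound on sequential rounds, not a timing prediction.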

EKS Cluster Setup

Step 4: Configure kubectl

aws eks update-kubeconfig --region $AWS_REGION --name eks-workshop

Step 5: Verify SOCI Installation

Check if SOCI is installed on nodes (pre-installed on recent EKS Optimized AMIs):

NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE_NAME -it --profile=general --image=busybox -- chroot /host which soci

If /usr/bin/soci appears, you're ready. Otherwise, install SOCI on nodes using one of the options below.

Using DaemonSet (Recommended for existing clusters)

Create soci-installer-daemonset.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: soci-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: soci-config
  template:
    metadata:
      labels:
        app: soci-config
    spec:
      hostPID: true
      hostNetwork: true
      containers:
      - name: config
        image: amazonlinux:2023
        command:
        - /bin/bash
        - -c
        - |
          # Install nsenter
          yum install -y util-linux

          # Create SOCI config directory
          mkdir -p /host/etc/soci-snapshotter-grpc

          # Create SOCI configuration
          cat > /host/etc/soci-snapshotter-grpc/config.toml <<'EOF'
          [blob]
          max_concurrent_downloads = -1
          max_concurrent_downloads_per_image = 10
          concurrent_download_chunk_size = "16mb"
          max_concurrent_unpacks = -1
          max_concurrent_unpacks_per_image = 10
          discard_unpacked_layers = true
          EOF

          echo "SOCI configuration created"
          cat /host/etc/soci-snapshotter-grpc/config.toml

          # Enable and start SOCI snapshotter service
          echo "Enabling SOCI snapshotter service..."
          nsenter -t 1 -m -u -i -n systemctl enable soci-snapshotter.service
          nsenter -t 1 -m -u -i -n systemctl start soci-snapshotter.service

          # Wait for service to be ready
          sleep 5

          # Verify service is running
          nsenter -t 1 -m -u -i -n systemctl status soci-snapshotter.service | head -10

          # Restart containerd to apply changes
          echo "Restarting containerd..."
          nsenter -t 1 -m -u -i -n systemctl restart containerd

          echo "Configuration complete. Containerd restarted."
          sleep infinity
        securityContext:
          privileged: true
        volumeMounts:
        - name: host
          mountPath: /host
      volumes:
      - name: host
        hostPath:
          path: /
      tolerations:
      - operator: Exists

Deploy the DaemonSet:

kubectl apply -f soci-installer-daemonset.yaml
kubectl rollout status daemonset/soci-config -n kube-system

Verify SOCI service is running:

kubectl logs -n kube-system -l app=soci-config --tail=20

You should see:

SOCI configuration created
Enabling SOCI snapshotter service...
● soci-snapshotter.service - SOCI Snapshotter
     Loaded: loaded (/etc/systemd/system/soci-snapshotter.service; enabled; preset: disabled)
     Active: active (running)
Restarting containerd...
Configuration complete. Containerd restarted.

Baseline Test Without SOCI

Step 6: Deploy Without SOCI

Create deployment-without-soci.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-app-without-soci
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slow-app-without-soci
  template:
    metadata:
      labels:
        app: slow-app-without-soci
    spec:
      containers:
      - name: app
        image: <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/soci-demo:slow
        imagePullPolicy: Always
        ports:
        - containerPort: 8080

Replace <AWS_ACCOUNT_ID> and <AWS_REGION> with your values, then deploy:

kubectl apply -f deployment-without-soci.yaml

Monitor the deployment:

kubectl get pods -l app=slow-app-without-soci -w

Check pull time:

kubectl describe pod -l app=slow-app-without-soci | grep -A 5 "Events:"
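To pull just the duration out of the kubelet's "Successfully pulled image" event, a small sed expression works. The sample event line below is illustrative (the exact wording can vary by Kubernetes version); in practice you would pipe the output of kubectl describe pod instead:

```shell
# Extract the pull duration from a kubelet "Pulled" event line.
# Sample line is illustrative; pipe real `kubectl describe pod` output instead.
EVENT='Successfully pulled image "soci-demo:slow" in 2m38.4s (2m38.4s including waiting)'
DURATION=$(printf '%s\n' "$EVENT" | sed -n 's/.* in \([^ ]*\) (.*/\1/p')
echo "pull took: $DURATION"
# → pull took: 2m38.4s
```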

Note: If testing multiple times, the image may be cached on nodes. To clear the cache for accurate testing:

# Get the node name where pod is scheduled
NODE_NAME=$(kubectl get pods -l app=slow-app-without-soci -o jsonpath='{.items[0].spec.nodeName}')

# Clear the image from that node
kubectl debug node/$NODE_NAME --profile=general --image=busybox -- \
  chroot /host ctr -n k8s.io images rm $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/soci-demo:slow

Enable SOCI on EKS Nodes

Step 7: Configure SOCI Snapshotter

If using the DaemonSet approach, the configuration is already applied. Verify it's working:

kubectl logs -n kube-system -l app=soci-config --tail=20

Configuration File: /etc/soci-snapshotter-grpc/config.toml

[blob]
max_concurrent_downloads = -1
max_concurrent_downloads_per_image = 10
concurrent_download_chunk_size = "16mb"
max_concurrent_unpacks = -1
max_concurrent_unpacks_per_image = 10
discard_unpacked_layers = true

Key Configuration Settings:

  • max_concurrent_downloads = -1: Unlimited total concurrent downloads across all images. Setting to -1 removes global restrictions, allowing per-image limits to control behavior.

  • max_concurrent_downloads_per_image = 10: Number of parallel HTTP connections used to download a single image. AWS recommends 10-20 for ECR. Higher values increase throughput but add overhead. Start with 10 and increase if the network isn't saturated.

  • concurrent_download_chunk_size = "16mb": Size of chunks when splitting large layers for parallel download. 16MB balances parallelism (more chunks = more concurrent work) with overhead (smaller chunks = more HTTP requests). Larger chunks (32MB) work better on high-latency networks.

  • max_concurrent_unpacks = -1: Unlimited total concurrent layer unpacking operations across all images. Similar to downloads, -1 removes global limits.

  • max_concurrent_unpacks_per_image = 10: Number of layers that can be decompressed and extracted simultaneously for one image. Utilizes multi-core CPUs effectively. Set based on available CPU cores; 10 works well for typical nodes with 8+ cores.

  • discard_unpacked_layers = true: Deletes compressed layer data after extraction to save disk space. Recommended for production to prevent disk exhaustion, especially with large images. Set to false only if you need to preserve original layers for debugging.

Note: The DaemonSet approach automatically configures all nodes and restarts containerd. The configuration persists across node restarts.

Step 8: Deploy With SOCI

Create deployment-with-soci.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-app-with-soci
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slow-app-with-soci
  template:
    metadata:
      labels:
        app: slow-app-with-soci
    spec:
      containers:
      - name: app
        image: <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/soci-demo:slow
        imagePullPolicy: Always
        ports:
        - containerPort: 8080

Replace <AWS_ACCOUNT_ID> and <AWS_REGION> with your values, then deploy and monitor:

kubectl apply -f deployment-with-soci.yaml
kubectl get pods -l app=slow-app-with-soci -w
kubectl describe pod -l app=slow-app-with-soci | grep -A 5 "Events:"

Results

SOCI takes the bulk of the image pull off the startup critical path through lazy loading:

  1. Downloads small SOCI index (~few KB)
  2. Fetches only essential files needed to start (~50-100MB)
  3. Starts container immediately
  4. Downloads remaining 6GB in background while container runs
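To put a number on the difference, compare the time from scheduling to readiness. The timestamps below are hypothetical placeholders; real ones come from the pod's status conditions (e.g. kubectl get pod -o jsonpath='{.status.conditions}'). This sketch assumes GNU date, as found on most Linux hosts:

```shell
# Compute startup latency from pod condition timestamps (GNU date assumed).
# Hypothetical values below; substitute the lastTransitionTime of the
# PodScheduled and Ready conditions from your pod's status.
SCHEDULED="2024-01-01T00:00:05Z"   # PodScheduled lastTransitionTime
READY="2024-01-01T00:00:47Z"      # Ready lastTransitionTime
echo "startup: $(( $(date -u -d "$READY" +%s) - $(date -u -d "$SCHEDULED" +%s) ))s"
# → startup: 42s
```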

How SOCI Lazy Loading Works

Traditional Approach:

  • Download all 6GB of layers
  • Extract all layers to disk
  • Wait for completion (2m38s)
  • Start container

SOCI Approach:

  • Download SOCI index (metadata about files)
  • Identify minimum files needed (e.g., /start.sh, /bin/bash, Python)
  • Download only those files via HTTP range requests
  • Continue downloading remaining files in background
  • Fetch additional files on-demand if accessed

The container is fully functional while the bulk of the image downloads in the background!
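The on-demand fetch can be pictured with a local analogy (this is a sketch of the idea, not SOCI's actual code path): a seekable read grabs only the bytes it needs from a large blob, much as SOCI issues HTTP range requests against a layer in the registry:

```shell
# Analogy only: read 1KB at a 2MB offset out of a 4MB "layer", the way an
# HTTP range request reads one slice of a remote blob without fetching it all.
dd if=/dev/zero of=/tmp/layer.bin bs=1M count=4 2>/dev/null           # stand-in layer blob
dd if=/tmp/layer.bin of=/tmp/chunk.bin bs=1K skip=2048 count=1 2>/dev/null  # 1KB at offset 2MB
wc -c < /tmp/chunk.bin   # 1024 bytes transferred, not 4MB
```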

Troubleshooting

If SOCI doesn't seem to work:

  1. Verify the SOCI snapshotter service is running:
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE_NAME --profile=general --image=busybox -- chroot /host systemctl status soci-snapshotter.service
  2. Check if the service is enabled:
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE_NAME --profile=general --image=busybox -- chroot /host systemctl list-unit-files | grep soci

Should show: soci-snapshotter.service enabled

  3. Verify the SOCI index exists in ECR:
soci index list $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG

Conclusion

SOCI Snapshotter reduces image pull times for large containers through lazy loading, without requiring code changes. For AI/ML workloads on EKS, this means near-instant deployments, responsive scaling, and better resource utilization. The key to success is ensuring the SOCI snapshotter service is properly enabled and running on your EKS nodes.

Cleanup

To remove the test resources:

# Delete deployments
kubectl delete deployment slow-app-without-soci slow-app-with-soci

# Delete SOCI DaemonSet (if you want to disable SOCI)
kubectl delete daemonset soci-config -n kube-system

# Delete ECR repository
aws ecr delete-repository --repository-name soci-demo --region us-west-2 --force

# Remove local Docker images
docker rmi $AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/soci-demo:slow

Note: Deleting the DaemonSet will not automatically disable SOCI on nodes. The service will continue running until nodes are recycled or the service is manually stopped.
