Deploy DeepSeek-R1-0528 (671B) on Amazon EKS using vLLM


Quick guide on how to deploy DeepSeek R1 (full model) on Amazon EKS with distributed inferencing using vLLM

This post provides a walkthrough to deploy the latest DeepSeek-R1-0528 (671B full model) on Amazon EKS using vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs.

We’ll leverage the AWS Deep Learning Containers (DLCs), which are pre-configured with the necessary libraries and dependencies, including Elastic Fabric Adapter (EFA) drivers optimized for high-throughput, low-latency inter-node communication and Remote Direct Memory Access (RDMA) support for running distributed inferencing.

For this demo, we’ll be using 2x p5e.48xlarge instances, as the full DeepSeek R1 model has extremely high GPU memory requirements. You could alternatively use P4d or G6e instances, but you might need 4 or more nodes to run it effectively.

Additionally, the solution uses the following services and components:

  • Amazon EKS Cluster: Fully managed Kubernetes service for orchestrating vLLM container workloads for distributed inferencing
  • AWS vLLM Deep Learning Container (DLC): A pre-packaged container image that simplifies vLLM deployment on EKS
  • Amazon FSx for Lustre: High-performance file storage for downloading and storing the model weights
  • AWS Load Balancer Controller: Streamlines management of Kubernetes LoadBalancer Services and Ingress resources by automatically provisioning AWS NLBs or ALBs
  • LeaderWorkerSet (LWS) API: Kubernetes-native orchestration that handles the multi-node vLLM deployment, scaling and load balancing
  • Open WebUI: An open-source AI platform that provides a ChatGPT style interface to interact with self-hosted LLMs

This guide assumes you have intermediate Kubernetes experience and are familiar with Amazon EKS and the AWS CLI.

All K8s deployment YAML files are available here.

 

Prerequisites
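
Before starting, make sure the CLI tools used throughout this walkthrough are installed and authenticated against your AWS account. A quick sanity check (standard version commands for each tool):

% aws --version              # AWS CLI, used for FSx, EC2 and IAM steps
% eksctl version             # cluster and node group provisioning
% kubectl version --client   # interacting with the EKS cluster
% helm version               # FSx CSI driver, Load Balancer Controller and LWS installs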

 

Deploy an EKS cluster with a GPU node group

First, let’s inspect the provided cluster configuration file vllm-cluster-config.yaml. Update the config based on your own environment, such as the Region, EKS version, EKS GPU AMI, and your capacity reservation ID.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: vllm-cluster
  region: us-east-2 # update to match your EKS region
  version: "1.32" # update to your preferred EKS version

managedNodeGroups:
  - name: vllm-p5e-nodes-efa
    instanceType: p5e.48xlarge 
    minSize: 0
    maxSize: 2
    desiredCapacity: 2
    availabilityZones: [us-east-2c] # ensure the P5e instances are available in the selected AZ
    volumeSize: 100
    privateNetworking: true
    # Use the EKS-optimized GPU AMI
    ami: ami-1234abcd # replace with desired EKS GPU AMI in the selected Region
[…]
    # Capacity Reservations for AI/ML nodes
    capacityReservation:
      capacityReservationTarget:
        capacityReservationID: "cr-1234abcd" # replace with your own capacity reservation id
    instanceMarketOptions:
      marketType: capacity-block
[…]

Deploy an EKS cluster within a new dedicated VPC using eksctl.

eksctl create cluster -f vllm-cluster-config.yaml

Verify the node status after the cluster deployment is complete.

% eksctl get cluster 
NAME    REGION    EKSCTL CREATED
vllm-cluster  us-east-2 True

% kubectl get nodes
NAME                                            STATUS   ROLES    AGE     VERSION
ip-192-168-168-138.us-east-2.compute.internal   Ready    <none>   5m5s    v1.32.9-eks-113cf36
ip-192-168-181-246.us-east-2.compute.internal   Ready    <none>   5m16s   v1.32.9-eks-113cf36

Use the following commands to verify that the NVIDIA device plugin is installed (it should already be included with the GPU-optimized AMI) and that the nodes correctly advertise GPU and EFA capacity.

% kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-2sdhp   1/1     Running   1 (8m46s ago)   11m
nvidia-device-plugin-daemonset-pctkc   1/1     Running   1 (8m48s ago)   11m

% kubectl get nodes -o json | jq -r '
  ["NODE", "NVIDIA_GPU", "EFA_CAPACITY"],
  (.items[] |
    [
      .metadata.name,
      (.status.capacity."nvidia.com/gpu" // "0"),
      (.status.capacity."vpc.amazonaws.com/efa" // "0")
    ]
  ) | @tsv' | column -t -s $'\t'

NODE                                           NVIDIA_GPU  EFA_CAPACITY
ip-192-168-168-138.us-east-2.compute.internal  8           32
ip-192-168-181-246.us-east-2.compute.internal  8           32

 

Deploy FSx for Lustre file system and configure FSx CSI

Next, we’ll deploy an FSx for Lustre file system, configure the FSx CSI driver, and prepare PV/PVC persistent storage for the vLLM deployment. Follow the user guide (step 1) to create an FSx for Lustre file system with the following configuration:

  • Ensure the file system is deployed within the same VPC and AZ as the EKS nodes (us-east-2c in my example) to minimize access latency
  • Capacity of 2.4 TiB
  • Deployment type set to SCRATCH_2 (optimized for high burst throughput)
  • Configure the FSx Security Group to allow inbound connections from the EKS node Security Group and from itself on TCP 988 and 1018-1023, as described here

You can use the following command to find the SG for the EKS node group.

aws eks describe-cluster --name vllm-cluster --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text
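
If you prefer the CLI over the console for this step, the security group rules and the file system can be created roughly as follows. This is a sketch only; the security group IDs, subnet ID and the returned file system ID are placeholders you need to substitute with your own values.

# Allow Lustre traffic (TCP 988 and 1018-1023) from the EKS cluster SG and from the FSx SG itself
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> --protocol tcp --port 988 --source-group <eks-cluster-sg-id>
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> --protocol tcp --port 1018-1023 --source-group <eks-cluster-sg-id>
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> --protocol tcp --port 988 --source-group <fsx-sg-id>
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> --protocol tcp --port 1018-1023 --source-group <fsx-sg-id>

# Create a 2.4 TiB SCRATCH_2 file system in the same subnet/AZ as the GPU nodes
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 2400 \
  --subnet-ids <fsx-subnet-id> \
  --security-group-ids <fsx-sg-id> \
  --lustre-configuration DeploymentType=SCRATCH_2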

Your updated FSx security group should look like below (inbound rules).

[Screenshot: FSx security group inbound rules]

Install AWS FSx for Lustre CSI Driver.

% helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver/
% helm repo update
% helm install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver --namespace kube-system
NAME: aws-fsx-csi-driver
LAST DEPLOYED: Tue Oct 28 00:29:00 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Verify the CSI add-ons are installed and running correctly.

% kubectl get pods -n kube-system | grep fsx

fsx-csi-controller-76477f6879-kk7xx    4/4     Running   0             68s
fsx-csi-controller-76477f6879-t79nm    4/4     Running   0             68s
fsx-csi-node-w7ddc                     3/3     Running   0             68s
fsx-csi-node-xp6bh                     3/3     Running   0             68s

Inspect the provided FSx storage class and PV/PVC config files, and update the marked sections such as the FSx subnet, security group, file system volume handle, DNS name, and mount name.

% cat fsx-lustre-sc.yaml     
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-1234abcd # update to the FSx FS subnet
  securityGroupIds: sg-1234abcd # update to the FSx SG
[…]
% cat fsx-lustre-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-lustre-pv
spec:
  capacity:
    storage: 2400Gi  # Update to your FSxL FS size
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fsx-sc
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-1234abcd  # replace with your FSxL FS ID
    volumeAttributes:
      dnsname: fs-1234abcd.fsx.us-east-2.amazonaws.com  # replace with your FSxL DNS name
      mountname: 1234abcd  # replace with your FSxL FS mount name
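
The matching fsx-lustre-pvc.yaml (applied below) is a standard claim that binds to this statically provisioned PV; a minimal sketch, assuming the names and capacity used above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-lustre-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 2400Gi  # match the PV capacity
  volumeName: fsx-lustre-pv  # bind directly to the pre-created PV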

Provision the FSx for Lustre-backed storage class and deploy the PV/PVC as the vLLM external storage for downloading and storing the model.

% kubectl apply -f fsx-lustre-sc.yaml  
% kubectl apply -f fsx-lustre-pv.yaml        
% kubectl apply -f fsx-lustre-pvc.yaml

Verify the Kubernetes storage resources just deployed, making sure the PV/PVC status is “Bound”.

% kubectl get storageclasses.storage.k8s.io 
NAME     PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
fsx-sc   fsx.csi.aws.com         Retain          Immediate              false                  5m48s

% kubectl get pv fsx-lustre-pv                     
NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
fsx-lustre-pv   2400Gi     RWX            Retain           Bound    default/fsx-lustre-pvc   fsx-sc         <unset>                          5m58s

% kubectl get pvc fsx-lustre-pvc
NAME             STATUS   VOLUME          CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
fsx-lustre-pvc   Bound    fsx-lustre-pv   2400Gi     RWX            fsx-sc         <unset>                 6m6s

 

Install AWS Load Balancer controller

To interact with the DeepSeek model from outside the cluster, we’ll expose the frontend Open WebUI service via a Kubernetes Ingress using the AWS Load Balancer Controller.

First, create an IAM OIDC provider for the cluster.

eksctl utils associate-iam-oidc-provider --region=us-east-2 --cluster vllm-cluster --approve

Second, create the IAM policy for the AWS Load Balancer Controller.

curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json
aws iam create-policy  --policy-name AWSLoadBalancerControllerIAMPolicy  --policy-document file://iam_policy.json

Next, create an IAM service account for the AWS Load Balancer Controller.

eksctl create iamserviceaccount \
  --cluster=vllm-cluster \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::<your-account-id>:policy/AWSLoadBalancerControllerIAMPolicy \
  --override-existing-serviceaccounts \
  --region <region-id> \
  --approve

Install the AWS Load Balancer Controller using Helm.

helm repo add eks https://aws.github.io/eks-charts
helm repo update eks
kubectl apply -f https://raw.githubusercontent.com/aws/eks-charts/master/stable/aws-load-balancer-controller/crds/crds.yaml

helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=vllm-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller 

Make sure the AWS Load Balancer Controller pods are running.

% kubectl get pods -n kube-system | grep aws-load-balancer-controller
aws-load-balancer-controller-bf747d4d5-5tflv   1/1     Running   0             29s
aws-load-balancer-controller-bf747d4d5-g7sf4   1/1     Running   0             29s

Create a Security Group for the ALB.

aws ec2 create-security-group \
  --group-name vllm-alb-sg \
  --description "Security group for vLLM ALB" \
  --vpc-id <vpc-id>

Find your workstation’s public IP address and use it as the source address when adding the ALB SG ingress rule.

aws ec2 authorize-security-group-ingress \
  --group-id <alb-sg-id> \
  --protocol tcp \
  --port 80 \
  --cidr <user-pub-ip>/32

Next, add an ingress rule to the security group associated with the EKS node EFAs (used by the Pods) to allow TCP port 8080 from the ALB security group. This allows the ALB to reach the Open WebUI pods, which will be deployed in a moment.

aws ec2 authorize-security-group-ingress \
  --group-id <node-efa-sg-id> \
  --protocol tcp \
  --port 8080 \
  --source-group <alb-sg-id>

 

Install LWS and deploy the vLLM server workloads

Now we’ll install the LeaderWorkerSet controller to handle the vLLM deployment.

CHART_VERSION=0.7.0
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=$CHART_VERSION \
  --namespace lws-system \
  --create-namespace \
  --wait --timeout 300s
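
As with the other add-ons, you can confirm the LWS controller pods are Running in the lws-system namespace before proceeding:

% kubectl get pods -n lws-system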

Before we deploy the vLLM server workloads, inspect the provided vllm-ds-r1-lws.yaml, take note of the following sections, and make adjustments based on your own environment (see the sketch after this list).

  • spec.leaderWorkerTemplate.size: update this to the total number of nodes (e.g. change to 4 if you are running 4x P4d nodes)
  • .leaderTemplate/workerTemplate.spec.containers.image: DLC container image for deploying vLLM inferencing. For example, I’m using vllm:0.10.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.1 from the public ECR gallery and it comes with vLLM 0.10.2 and CUDA 12.9 pre-installed.
  • --model deepseek-ai/DeepSeek-R1-0528: official model name/path from Hugging Face.
  • --pipeline-parallel-size: the model weights are split into multiple pipeline stages; this value should match your total number of nodes.
  • --tensor-parallel-size: within each pipeline stage, the model is further sharded across multiple GPUs; this number should match the number of GPUs on each node
  • --max-model-len: this limits the maximum context (prompt) length; you might need to lower this value if using a lower-tier GPU with less vRAM, as it directly affects KV cache consumption.
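
For orientation, the leader’s serve command for the 2x P5e setup ends up looking roughly like the sketch below. The exact flags and values come from the provided vllm-ds-r1-lws.yaml and your hardware; in particular the --max-model-len value shown here is illustrative, not prescriptive.

# Illustrative only: 2 nodes x 8 GPUs each. --pipeline-parallel-size matches the
# node count, --tensor-parallel-size matches GPUs per node, and --max-model-len
# should be lowered on GPUs with less vRAM.
vllm serve deepseek-ai/DeepSeek-R1-0528 \
  --port 8000 \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 8 \
  --max-model-len 16384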

Also, adjust the following values to fit within your node’s hardware specifications.

              limits:
                nvidia.com/gpu: "8"
                cpu: "96"
                memory: "512Gi"
                vpc.amazonaws.com/efa: 8
              requests:
                nvidia.com/gpu: "8"
                cpu: "96"
                memory: "512Gi"
                vpc.amazonaws.com/efa: 8

Deploy the vLLM server using the LWS pattern, which in my case consists of 1x vLLM leader pod and 1x worker pod across the 2x P5e nodes.

% kubectl apply -f vllm-ds-r1-lws.yaml

Since we are deploying a very large model, the whole process will take approximately 1 hour. You can inspect the progress from the leader pod logs.

% kubectl logs -f vllm-ds-r1-lws-0

Note that the vLLM DLCs automatically load pre-configured and optimized libraries and drivers, including CUDA, EFA, NCCL and AWS OFI NCCL, which provide GPUDirect RDMA over the EFA fabric.

(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO cudaDriverVersion 12080
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NCCL version 2.27.3+cuda12.9
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/Plugin: Plugin name set by env to libnccl-net.so
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/Plugin: Loaded net plugin Libfabric (v10)
...
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO Successfully loaded external plugin libnccl-net.so
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.16.3
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Using Libfabric version 2.1
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Using CUDA driver version 12080 with runtime 12080
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Configuring AWS-specific options
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Setting provider_filter to efa
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Internode latency set at 35.0 us
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Using transport protocol RDMA (platform set)
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=978) vllm-ds-r1-lws-0:978:978 [0] NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct (found 16 nics)

The full DeepSeek-R1-0528 model is approximately 700 GB, so it will take some time to download (about 45 minutes in my case).

(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50) INFO 10-19 10:00:32 [gpu_model_runner.py:2338] Starting to load model deepseek-ai/DeepSeek-R1-0528...
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:00:32 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:00:32 [cuda.py:252] Using FlashMLA backend on V1 engine.
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:00:33 [weight_utils.py:348] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:44:56 [weight_utils.py:369] Time spent downloading weights for deepseek-ai/DeepSeek-R1-0528: 2663.422644 seconds

Next, vLLM will start loading the full model including all 163 shards.

…
Loading safetensors checkpoint shards:  99% Completed | 161/163 [14:16<00:09,  4.90s/it]
Loading safetensors checkpoint shards:  99% Completed | 162/163 [14:17<00:03,  3.64s/it]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [14:29<00:00,  6.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [14:29<00:00,  5.33s/it]
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=987) INFO 10-28 17:53:59 [default_loader.py:268] Loading weights took 870.02 seconds

Eventually, you’ll see the vLLM API server become available.

[…]
(APIServer pid=1) INFO 10-28 17:55:47 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 10-28 17:55:47 [launcher.py:36] Available routes are:
[…]
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO:     192.168.168.138:44016 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.168.138:54482 - "GET /health HTTP/1.1" 200 OK

You should also see the leader pod showing 1/1 Ready status.

% kubectl get pods vllm-ds-r1-lws-0        
NAME               READY   STATUS    RESTARTS   AGE
vllm-ds-r1-lws-0   1/1     Running   0          110m
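
Optionally, before deploying the UI, you can sanity-check the OpenAI-compatible API directly from your workstation by port-forwarding the leader service (the same service Open WebUI is pointed at in the next step; the served model name defaults to the --model path unless overridden).

# In one terminal: forward the leader service locally
% kubectl port-forward svc/vllm-ds-r1-lws-leader 8000:8000

# In another terminal: list the served model and send a short chat request
% curl -s http://localhost:8000/v1/models
% curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-0528", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'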

 

Deploy Open WebUI for interacting with the model

Finally, let’s deploy the Open WebUI frontend for interacting with the DeepSeek model via vLLM. Inspect the provided open-webui.yaml and notice that the value of OPENAI_API_BASE_URLS is already set to the K8s DNS address of the vLLM leader service.

        # This is the vllm k8s service URL to which Open-WebUI client connects to.
        - name: OPENAI_API_BASE_URLS
          value: "http://vllm-ds-r1-lws-leader.default.svc.cluster.local:8000/v1"

Here we’ll also deploy a public-facing Application Load Balancer through the K8s Ingress API, so make sure you set the ALB security group annotation to the security group created in the previous steps.

    # Use AWS Load Balancer Controller with ALB
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/security-groups: sg-1234abcd # replace with your ALB Security Group
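
For context, these annotations sit on a standard Ingress resource inside the provided manifest; a rough sketch is shown below. The backend Service name and port are assumptions based on the SG rule configured earlier, so check open-webui.yaml for the actual values.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: open-webui
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/security-groups: sg-1234abcd # replace with your ALB Security Group
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui  # assumed Service name, see open-webui.yaml
                port:
                  number: 8080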

Deploy the YAML file and take note of the Ingress (ALB) URL.

% kubectl apply -f open-webui.yaml
% kubectl get ingress
NAME         CLASS   HOSTS   ADDRESS                                                                 PORTS   AGE
open-webui   alb     *       k8s-default-openwebu-1fce4d24e1-152388083.us-east-2.elb.amazonaws.com   80      11m

You should now be able to access and interact with the DeepSeek R1 model via the Open WebUI URL.

[Screenshot: Open WebUI chat interface connected to the DeepSeek R1 model]

 

Clean up

To remove the demo resources created in the preceding steps, use the following commands.

kubectl delete -f open-webui.yaml
kubectl delete -f vllm-ds-r1-lws.yaml
kubectl delete -f fsx-lustre-pvc.yaml && kubectl delete -f fsx-lustre-pv.yaml && kubectl delete -f fsx-lustre-sc.yaml

If the demo environment is no longer needed, then delete the EKS cluster and FSx for Lustre file system to avoid incurring charges.

eksctl delete cluster --name=<your-cluster-name> --region=<region-id>
aws fsx delete-file-system --file-system-id <fsx-lustre-id>
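
The ALB security group and IAM policy created manually earlier are not managed by eksctl, so remove them separately once the ALB and controller are gone (the security group deletion will fail while it is still referenced).

aws ec2 delete-security-group --group-id <alb-sg-id>
aws iam delete-policy --policy-arn arn:aws:iam::<your-account-id>:policy/AWSLoadBalancerControllerIAMPolicy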

 

Conclusion

In this post, we walked through the process of deploying the full DeepSeek-R1-0528 (671B) model on Amazon EKS using vLLM, with EC2 P5e GPU nodes and FSx for Lustre integration.

 

References