Deploy DeepSeek-R1-0528 (671B) on Amazon EKS using vLLM
Quick guide on how to deploy DeepSeek R1 (full model) on Amazon EKS with distributed inferencing using vLLM
This post provides a walkthrough to deploy the latest DeepSeek-R1-0528 (671B full model) on Amazon EKS using vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs.
We’ll leverage AWS Deep Learning Containers (DLCs), which come pre-configured with the necessary libraries and dependencies, including Elastic Fabric Adapter (EFA) drivers optimized for high-throughput, low-latency inter-node communication and Remote Direct Memory Access (RDMA) support for running distributed inferencing.
For this demo, we’ll be using 2x p5e.48xlarge instances, as the full DeepSeek R1 model demands a very large amount of GPU memory. You could alternatively use P4d or G6e instances, but you might need 4 or more nodes to run it effectively.
Additionally, the solution uses the following services and components:
- Amazon EKS Cluster: Fully managed Kubernetes service for orchestrating vLLM container workloads for distributed inferencing
- AWS vLLM Deep Learning Container (DLC): a pre-packaged container image that simplifies vLLM deployment on EKS
- Amazon FSx for Lustre: HPC file storage for downloading and storing the models
- AWS Load Balancer Controller: streamlines management of Kubernetes LoadBalancer and Ingress resources by automatically provisioning AWS NLBs or ALBs
- LeaderWorkerSet (LWS) API: Kubernetes-native orchestration that handles vLLM deployment, scaling and load balancing
- Open WebUI: An open-source AI platform that provides a ChatGPT style interface to interact with self-hosted LLMs
This guide assumes you have intermediate Kubernetes experience and are familiar with Amazon EKS and the AWS CLI.
All K8s deployment YAML files are available here.
Prerequisites
- A capacity block reservation for 2x p5e.48xlarge instances in the selected AWS Region
- The following tools installed: AWS CLI, eksctl, kubectl, Helm and jq
Deploy an EKS cluster with a GPU node group
First, let’s inspect the provided cluster configuration file vllm-cluster-config.yaml. Update the config to match your own environment, including the Region, EKS version, EKS GPU AMI and your capacity reservation ID.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: vllm-cluster
  region: us-east-2 # update to match your EKS region
  version: "1.32" # update to your preferred EKS version
managedNodeGroups:
  - name: vllm-p5e-nodes-efa
    instanceType: p5e.48xlarge
    minSize: 0
    maxSize: 2
    desiredCapacity: 2
    availabilityZones: [us-east-2c] # ensure the P5e instances are available in the selected AZ
    volumeSize: 100
    privateNetworking: true
    # Use the EKS-optimized GPU AMI
    ami: ami-1234abcd # replace with the desired EKS GPU AMI in the selected Region
    […]
    # Capacity Reservations for AI/ML nodes
    capacityReservation:
      capacityReservationTarget:
        capacityReservationID: "cr-1234abcd" # replace with your own capacity reservation ID
    instanceMarketOptions:
      marketType: capacity-block
[…]
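If you need to look up values for the placeholders above, the EKS-optimized GPU AMI ID can be pulled from the public SSM parameter and your capacity reservation ID from EC2. The commands below are a sketch assuming the Amazon Linux 2 GPU AMI family and Kubernetes 1.32; adjust the parameter path for your AMI family, version and Region.

# Look up the EKS-optimized GPU (Amazon Linux 2) AMI for Kubernetes 1.32 in us-east-2
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.32/amazon-linux-2-gpu/recommended/image_id \
  --region us-east-2 --query "Parameter.Value" --output text

# List capacity reservations to find the ID of your P5e capacity block
aws ec2 describe-capacity-reservations \
  --region us-east-2 \
  --query "CapacityReservations[].{ID:CapacityReservationId,Type:InstanceType,State:State}" \
  --output table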
Deploy an EKS cluster within a new dedicated VPC using eksctl.
eksctl create cluster -f vllm-cluster-config.yaml
Verify the node status after the cluster deployment is complete.
% eksctl get cluster
NAME REGION EKSCTL CREATED
vllm-cluster us-east-2 True
% kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-192-168-168-138.us-east-2.compute.internal Ready <none> 5m5s v1.32.9-eks-113cf36
ip-192-168-181-246.us-east-2.compute.internal Ready <none> 5m16s v1.32.9-eks-113cf36
Use the following commands to verify that the NVIDIA device plugin is installed (it should already be included with the GPU-optimized AMI) and that the nodes correctly advertise GPU and EFA capacity.
% kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-2sdhp 1/1 Running 1 (8m46s ago) 11m
nvidia-device-plugin-daemonset-pctkc 1/1 Running 1 (8m48s ago) 11m
% kubectl get nodes -o json | jq -r '
["NODE", "NVIDIA_GPU", "EFA_CAPACITY"],
(.items[] |
[
.metadata.name,
(.status.capacity."nvidia.com/gpu" // "0"),
(.status.capacity."vpc.amazonaws.com/efa" // "0")
]
) | @tsv' | column -t -s $'\t'
NODE NVIDIA_GPU EFA_CAPACITY
ip-192-168-168-138.us-east-2.compute.internal 8 32
ip-192-168-181-246.us-east-2.compute.internal 8 32
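Optionally, you can schedule a throwaway pod that requests a single GPU and runs nvidia-smi to confirm end-to-end GPU scheduling works. This is a minimal sketch; the pod name and CUDA base image tag are illustrative and not part of the original manifests.

# gpu-smoke-test.yaml - requests one GPU, prints nvidia-smi output, then exits
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

Apply it with kubectl apply -f gpu-smoke-test.yaml, check kubectl logs gpu-smoke-test once the pod completes, and delete the pod afterwards.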
Deploy an FSx for Lustre file system and configure the FSx CSI driver
Next, we’ll deploy an FSx for Lustre file system, configure the FSx CSI driver and prepare PV/PVC persistent storage for the vLLM deployment. Follow the user guide (step 1) to create an FSx for Lustre file system with the following configuration:
- Ensure the file system is deployed within the same VPC and AZ as the EKS nodes (us-east-2c in my example) to minimize access latency
- Capacity of 2.4 TiB
- Deployment type set to SCRATCH_2 (optimized for high burst throughput)
- Configure the FSx security group to allow inbound connections from the EKS node security group and from itself on TCP ports 988 and 1018-1023, as per here
You can use the following command to find the cluster security group used by the EKS nodes.
aws eks describe-cluster --name vllm-cluster --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text
Your updated FSx security group should look like below (inbound rules).
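If you prefer to add those inbound rules from the CLI instead of the console, commands along these lines achieve the same result (a sketch; <fsx-sg-id> is the FSx file system’s security group and <eks-cluster-sg-id> is the cluster security group returned by the previous command).

# Allow Lustre traffic (TCP 988 and 1018-1023) from the EKS cluster security group
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> \
  --protocol tcp --port 988 --source-group <eks-cluster-sg-id>
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> \
  --protocol tcp --port 1018-1023 --source-group <eks-cluster-sg-id>

# Allow the same ports from the FSx security group itself (self-referencing rules)
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> \
  --protocol tcp --port 988 --source-group <fsx-sg-id>
aws ec2 authorize-security-group-ingress --group-id <fsx-sg-id> \
  --protocol tcp --port 1018-1023 --source-group <fsx-sg-id>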
Install AWS FSx for Lustre CSI Driver.
% helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver/
% helm repo update
% helm install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver --namespace kube-system
NAME: aws-fsx-csi-driver
LAST DEPLOYED: Tue Oct 28 00:29:00 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Verify the CSI driver pods are installed and running correctly.
% kubectl get pods -n kube-system | grep fsx
fsx-csi-controller-76477f6879-kk7xx 4/4 Running 0 68s
fsx-csi-controller-76477f6879-t79nm 4/4 Running 0 68s
fsx-csi-node-w7ddc 3/3 Running 0 68s
fsx-csi-node-xp6bh 3/3 Running 0 68s
Inspect the provided FSx storage class and PV/PVC config files, and update the marked sections such as the FSx subnet, security group, file system volume handle, DNS name and mount name.
% cat fsx-lustre-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-1234abcd # update to the FSx FS subnet
  securityGroupIds: sg-1234abcd # update to the FSx SG
[…]
% cat fsx-lustre-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-lustre-pv
spec:
  capacity:
    storage: 2400Gi # update to your FSxL FS size
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fsx-sc
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-1234abcd # replace with your FSxL FS ID
    volumeAttributes:
      dnsname: fs-1234abcd.fsx.us-east-2.amazonaws.com # replace with your FSxL DNS name
      mountname: 1234abcd # replace with your FSxL FS mount name
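The PVC manifest (fsx-lustre-pvc.yaml) isn’t reproduced above; a minimal version that statically binds to the PV defined in this guide would look like the following sketch.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-lustre-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  volumeName: fsx-lustre-pv # bind directly to the static PV above
  resources:
    requests:
      storage: 2400Gi # match the PV capacity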
Provision the FSx for Lustre-backed storage class and deploy the PV/PVC as external storage for the vLLM deployment to download and store models.
% kubectl apply -f fsx-lustre-sc.yaml
% kubectl apply -f fsx-lustre-pv.yaml
% kubectl apply -f fsx-lustre-pvc.yaml
Verify the Kubernetes storage resources just deployed, making sure the PV/PVC status is “Bound”.
% kubectl get storageclasses.storage.k8s.io
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
fsx-sc fsx.csi.aws.com Retain Immediate false 5m48s
% kubectl get pv fsx-lustre-pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
fsx-lustre-pv 2400Gi RWX Retain Bound default/fsx-lustre-pvc fsx-sc <unset> 5m58s
% kubectl get pvc fsx-lustre-pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
fsx-lustre-pvc Bound fsx-lustre-pv 2400Gi RWX fsx-sc <unset> 6m6s
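Optionally, before starting the lengthy model download, you can confirm that pods can actually mount and write to the Lustre volume. The following is an optional sketch; the pod name and busybox image are illustrative.

# fsx-mount-test.yaml - mounts the PVC, writes a test file and lists the mount
apiVersion: v1
kind: Pod
metadata:
  name: fsx-mount-test
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sh", "-c", "echo ok > /models/fsx-test && ls -l /models"]
      volumeMounts:
        - name: model-store
          mountPath: /models
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: fsx-lustre-pvc

Check the output with kubectl logs fsx-mount-test and delete the pod once done.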
Install the AWS Load Balancer Controller
To interact with the DeepSeek model from outside the cluster, we’ll expose the frontend Open WebUI service via a Kubernetes Ingress using the AWS Load Balancer Controller.
First, create an IAM OIDC provider for the cluster.
eksctl utils associate-iam-oidc-provider --region=us-east-2 --cluster vllm-cluster --approve
Second, create the IAM policy for the AWS Load Balancer Controller.
curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json
aws iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy --policy-document file://iam_policy.json
Next, create an IAM service account for the AWS Load Balancer Controller.
eksctl create iamserviceaccount \
--cluster=vllm-cluster \
--namespace=kube-system \
--name=aws-load-balancer-controller \
--attach-policy-arn=arn:aws:iam::<your-account-id>:policy/AWSLoadBalancerControllerIAMPolicy \
--override-existing-serviceaccounts \
--region <region-id> \
--approve
Install the AWS Load Balancer Controller using Helm.
helm repo add eks https://aws.github.io/eks-charts
helm repo update eks
kubectl apply -f https://raw.githubusercontent.com/aws/eks-charts/master/stable/aws-load-balancer-controller/crds/crds.yaml
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=vllm-cluster \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller
Make sure the AWS Load Balancer Controller pods are running.
% kubectl get pods -n kube-system | grep aws-load-balancer-controller
aws-load-balancer-controller-bf747d4d5-5tflv 1/1 Running 0 29s
aws-load-balancer-controller-bf747d4d5-g7sf4 1/1 Running 0 29s
Create a Security Group for the ALB.
aws ec2 create-security-group \
--group-name vllm-alb-sg \
--description "Security group for vLLM ALB" \
--vpc-id <vpc-id>
Find your workstation’s public IP address and use it as the source address for the ALB security group’s ingress rule.
aws ec2 authorize-security-group-ingress \
--group-id <alb-sg-id> \
--protocol tcp \
--port 80 \
--cidr <user-pub-ip>/32
Create another ingress rule on the security group associated with the EKS nodes’ EFAs (used by the pods) to allow TCP port 8080 from the ALB security group. This lets the ALB reach the Open WebUI pods, which will be deployed in a moment.
aws ec2 authorize-security-group-ingress \
--group-id <node-efa-sg-id> \
--protocol tcp \
--port 8080 \
--source-group <alb-sg-id>
Install LWS and deploy the vLLM server workloads
Now we’ll install the LeaderWorkerSet controller to handle the vLLM deployment.
CHART_VERSION=0.7.0
helm install lws oci://registry.k8s.io/lws/charts/lws \
--version=$CHART_VERSION \
--namespace lws-system \
--create-namespace \
--wait --timeout 300s
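You can confirm the LWS controller is up before moving on; the controller manager pod should show a Running status.

kubectl get pods -n lws-system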
Before we deploy the vLLM server workloads, inspect the provided vllm-ds-r1-lws.yaml, take note of the following sections, and make adjustments based on your own environment.
- spec.leaderWorkerTemplate.size: update this to the total number of nodes (e.g. change to 4 if you are running 4x P4d nodes)
- leaderTemplate/workerTemplate.spec.containers.image: the DLC container image for the vLLM inference server. For example, I’m using vllm:0.10.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.1 from the public ECR gallery, which comes with vLLM 0.10.2 and CUDA 12.9 pre-installed
- --model deepseek-ai/DeepSeek-R1-0528: the official model name/path from Hugging Face
- --pipeline-parallel-size: model weights are split into multiple pipeline stages; this value should match your total number of nodes
- --tensor-parallel-size: within each pipeline stage, the model is further sharded across multiple GPUs; this number should match the number of GPUs on each node
- --max-model-len: limits the length of user prompts; you might need to lower this value when using lower-tier GPUs with less vRAM, as it directly affects KV cache consumption
Also, adjust the following values to fit within your node’s hardware specifications (an illustrative sketch of how these settings come together follows below).
limits:
  nvidia.com/gpu: "8"
  cpu: "96"
  memory: "512Gi"
  vpc.amazonaws.com/efa: 8
requests:
  nvidia.com/gpu: "8"
  cpu: "96"
  memory: "512Gi"
  vpc.amazonaws.com/efa: 8
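To illustrate how the flags discussed above fit together, the leader container’s launch command for the 2x p5e.48xlarge setup looks roughly like the sketch below. The --max-model-len value and --download-dir path are illustrative, and the actual manifest also bootstraps the Ray cluster across the leader and worker pods, which is omitted here.

# Illustrative excerpt only: pipeline-parallel across 2 nodes, tensor-parallel across 8 GPUs per node
command:
  - sh
  - -c
  - |
    python3 -m vllm.entrypoints.openai.api_server \
      --model deepseek-ai/DeepSeek-R1-0528 \
      --port 8000 \
      --pipeline-parallel-size 2 \
      --tensor-parallel-size 8 \
      --max-model-len 16384 \
      --download-dir /models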
Deploy the vLLM server using the LWS pattern, which in my case consists of one vLLM leader pod and one worker pod across the 2x P5e nodes.
% kubectl apply -f vllm-ds-r1-lws.yaml
Since we are deploying a very large model, the whole process takes approximately one hour; you can follow the progress by inspecting the logs from the leader pod.
% kubectl logs -f vllm-ds-r1-lws-0
Note that the vLLM DLCs automatically load pre-configured and optimized libraries and drivers, including CUDA, EFA, NCCL and AWS-OFI-NCCL, which provide GPUDirect RDMA over the EFA fabric.
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO cudaDriverVersion 12080
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NCCL version 2.27.3+cuda12.9
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/Plugin: Plugin name set by env to libnccl-net.so
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/Plugin: Loaded net plugin Libfabric (v10)
...
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO Successfully loaded external plugin libnccl-net.so
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.16.3
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Using Libfabric version 2.1
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Using CUDA driver version 12080 with runtime 12080
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Configuring AWS-specific options
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Setting provider_filter to efa
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Internode latency set at 35.0 us
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50)vllm-ds-r1-lws-0-1:257:257 [0] NCCL INFO NET/OFI Using transport protocol RDMA (platform set)
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=978) vllm-ds-r1-lws-0:978:978 [0] NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct (found 16 nics)
The full DeepSeek-R1-0528 model is approx. 700GB and it will take some time to download (45min in my case).
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=257, ip=192.168.160.50) INFO 10-19 10:00:32 [gpu_model_runner.py:2338] Starting to load model deepseek-ai/DeepSeek-R1-0528...
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:00:32 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:00:32 [cuda.py:252] Using FlashMLA backend on V1 engine.
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:00:33 [weight_utils.py:348] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=982) INFO 10-19 10:44:56 [weight_utils.py:369] Time spent downloading weights for deepseek-ai/DeepSeek-R1-0528: 2663.422644 seconds
Next, vLLM will start loading the full model including all 163 shards.
…
Loading safetensors checkpoint shards: 99% Completed | 161/163 [14:16<00:09, 4.90s/it]
Loading safetensors checkpoint shards: 99% Completed | 162/163 [14:17<00:03, 3.64s/it]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [14:29<00:00, 6.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [14:29<00:00, 5.33s/it]
(EngineCore_DP0 pid=806) (RayWorkerWrapper pid=987) INFO 10-28 17:53:59 [default_loader.py:268] Loading weights took 870.02 seconds
Eventually, you’ll see the vLLM API server become available.
[…]
(APIServer pid=1) INFO 10-28 17:55:47 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 10-28 17:55:47 [launcher.py:36] Available routes are:
[…]
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: 192.168.168.138:44016 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.168.138:54482 - "GET /health HTTP/1.1" 200 OK
You should also see the leader pod showing 1/1 Ready status.
% kubectl get pods vllm-ds-r1-lws-0
NAME READY STATUS RESTARTS AGE
vllm-ds-r1-lws-0 1/1 Running 0 110m
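At this point, you can optionally sanity-check the OpenAI-compatible API directly by port-forwarding the leader service (vllm-ds-r1-lws-leader, the same service Open WebUI will use in the next section). The prompt below is just an example.

# Forward the vLLM leader service to your workstation
kubectl port-forward svc/vllm-ds-r1-lws-leader 8000:8000

# In another terminal: list the served model, then send a short chat completion
curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-0528",
        "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
        "max_tokens": 128
      }'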
Deploy Open WebUI for interacting with the model
Finally, let’s deploy the Open WebUI frontend for interacting with the DeepSeek model via vLLM. Inspect the provided open-webui.yaml and notice the value of OPENAI_API_BASE_URLS is already set to the K8s DNS address for the vLLM leader service.
# This is the vLLM K8s service URL that the Open WebUI client connects to.
- name: OPENAI_API_BASE_URLS
  value: "http://vllm-ds-r1-lws-leader.default.svc.cluster.local:8000/v1"
Here we’ll also deploy a public-facing Application Load Balancer through the Kubernetes Ingress API, so make sure you reference the ALB security group configured in the previous steps.
# Use AWS Load Balancer Controller with ALB
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/security-groups: sg-1234abcd # replace with your ALB Security Group
Apply the YAML file and take note of the Ingress (ALB) URL.
% kubectl apply -f open-webui.yaml
% kubectl get ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
open-webui alb * k8s-default-openwebu-1fce4d24e1-152388083.us-east-2.elb.amazonaws.com 80 11m
You should now be able to access and interact with the DeepSeek R1 model via the Open WebUI URL.
Clean up
To remove the demo resources created in the preceding steps, use the following commands.
kubectl delete -f open-webui.yaml
kubectl delete -f vllm-ds-r1-lws.yaml
kubectl delete -f fsx-lustre-pvc.yaml && kubectl delete -f fsx-lustre-pv.yaml && kubectl delete -f fsx-lustre-sc.yaml
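The walkthrough also created a few resources outside of Kubernetes. If you want to remove those as well, the following sketch cleans up the ALB security group and the Load Balancer Controller’s IAM service account and policy (run the eksctl command while the cluster still exists).

# Delete the ALB security group created earlier (after the Ingress/ALB is gone)
aws ec2 delete-security-group --group-id <alb-sg-id>

# Remove the IAM service account and policy used by the AWS Load Balancer Controller
eksctl delete iamserviceaccount \
  --cluster=vllm-cluster \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --region <region-id>

aws iam delete-policy --policy-arn arn:aws:iam::<your-account-id>:policy/AWSLoadBalancerControllerIAMPolicy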
If the demo environment is no longer needed, then delete the EKS cluster and FSx for Lustre file system to avoid incurring charges.
eksctl delete cluster --name=<your-cluster-name> --region=<region-id>
aws fsx delete-file-system --file-system-id <fsx-lustre-id>
Conclusion
In this post, we walked through deploying the full DeepSeek-R1-0528 (671B) model on Amazon EKS using vLLM, with EC2 P5e GPU nodes and FSx for Lustre integration.
References
- https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html
- https://aws.amazon.com/blogs/machine-learning/deploy-llms-on-amazon-eks-using-vllm-deep-learning-containers/
- https://builder.aws.com/content/2w2T9a1HOICvNCVKVRyVXUxuKff/deploying-the-deepseek-v3-model-full-version-in-amazon-eks-using-vllm-and-lws
- https://www.theriseunion.com/en/blog/DeepSeek-V3-R1-671B-GPU-Requirements.html