Step-by-step guide to deploying DeepSeek R1 distilled models
Authored by Pinak Panigrahi
The DeepSeek team recently introduced a new set of reasoning models, including DeepSeek-R1-Distill-Llama-70B and DeepSeek-R1-Distill-Llama-8B. Getting started with them on AWS Inferentia and Trainium takes only a few steps:
- From your AWS console, launch a trn1.32xlarge EC2 instance with the Neuron Multi-Framework DLAMI called Deep Learning AMI Neuron (Ubuntu 22.04); a scripted boto3 alternative is sketched below.
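If you prefer to script the launch instead of clicking through the console, here is a minimal boto3 sketch; the AMI ID and key pair name are placeholders, so look up the current Deep Learning AMI Neuron (Ubuntu 22.04) ID for your region first:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder AMI ID: resolve the current "Deep Learning AMI Neuron
# (Ubuntu 22.04)" ID for your region before running.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder
    InstanceType="trn1.32xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",  # placeholder key pair
)
print(response["Instances"][0]["InstanceId"])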
- Activate the virtual environment:
source /opt/aws_neuronx_venv_pytorch_2_5_transformers/bin/activate
- Install vLLM with the following commands (a quick import check follows them):
git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
cd upstreaming-to-vllm
pip install -r requirements-neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install -e .
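To confirm the Neuron build of vLLM installed correctly, run a quick import check; any printed version string means the editable install is on your path:
python3 -c "import vllm; print(vllm.__version__)"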
- Deploy deepseek-ai/DeepSeek-R1-Distill-Llama-70B using vLLM:
python3 -m vllm.entrypoints.openai.api_server \
--model "deepseek-ai/DeepSeek-R1-Distill-Llama-70B" \
--tensor-parallel-size 32 \
--max-num-seqs 2 \
--max-model-len 8192 \
--block-size 8 \
--device neuron \
--use-v2-block-manager \
--port 8000
- Or, to deploy DeepSeek-R1-Distill-Llama-8B using vLLM, run:
python3 -m vllm.entrypoints.openai.api_server \
--model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
--tensor-parallel-size 8 \
--max-num-seqs 4 \
--max-model-len 8192 \
--block-size 8 \
--device neuron \
--use-v2-block-manager \
--port 8000
- Invoke the model server (a Python client alternative follows the curl example):
curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "prompt": "What is DeepSeek R1?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
- Deploy directly from Hugging Face model cards:
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Get the SageMaker execution role (fall back to a named IAM role when
# running outside a SageMaker notebook)
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# Create the Hugging Face Model class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.25"),
    env=hub,
    role=role,
)

# Deploy the model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# Send a request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
    }
)
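When you are done experimenting, delete the model and endpoint so the inf2 instance stops incurring charges:
# Clean up: remove the deployed model and the endpoint.
predictor.delete_model()
predictor.delete_endpoint()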