Get started with DeepSeek R1 on AWS Inferentia and Trainium

2 minute read
Content level: Intermediate

A step-by-step guide to deploying the DeepSeek R1 distilled models.

Authored by Pinak Panigrahi

The DeepSeek team recently introduced a new set of reasoning models, including DeepSeek-R1-Distill-Llama-70B and DeepSeek-R1-Distill-Llama-8B. Getting started with them on AWS Inferentia and Trainium takes only a few steps:

  1. From your AWS console, launch a trn1.32xlarge EC2 instance with the Neuron Multi Framework DLAMI called Deep Learning AMI Neuron (Ubuntu 22.04)
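
  If you prefer the CLI, a sketch like the following looks up the latest matching AMI ID (the name filter is an assumption based on the AMI's console name; verify it for your region):

    aws ec2 describe-images \
        --owners amazon \
        --filters 'Name=name,Values=Deep Learning AMI Neuron (Ubuntu 22.04)*' \
        --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
        --output text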

  2. Activate the virtual environment:

    source /opt/aws_neuronx_venv_pytorch_2_5_transformers/bin/activate
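
  Once the environment is active, you can confirm that the Neuron devices are visible with the neuron-ls tool included in the DLAMI:

    neuron-ls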

  3. Install vLLM with the following commands:

    git clone -b v0.6.x-neuron https://github.com/aws-neuron/upstreaming-to-vllm.git
    cd upstreaming-to-vllm
    pip install -r requirements-neuron.txt
    VLLM_TARGET_DEVICE="neuron" pip install -e .
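
  As a quick sanity check that the Neuron build of vLLM installed correctly, you can print its version:

    python3 -c "import vllm; print(vllm.__version__)"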
    
  4. Deploy deepseek-ai/DeepSeek-R1-Distill-Llama-70B using vLLM:

    python3 -m vllm.entrypoints.openai.api_server \
        --model "deepseek-ai/DeepSeek-R1-Distill-Llama-70B" \
        --tensor-parallel-size 32 \
        --max-num-seqs 2 \
        --max-model-len 8192 \
        --block-size 8 \
        --device neuron \
        --use-v2-block-manager \
        --port 8000
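
  The --tensor-parallel-size 32 setting shards the model across the 32 NeuronCores of the trn1.32xlarge instance. The first launch compiles the model for Neuron and can take a while; since vLLM's OpenAI-compatible server exposes a /health endpoint, a small loop like this sketch can wait for it to come up (the 30-second interval is an arbitrary choice):

    until curl -sf localhost:8000/health > /dev/null; do
        echo "server not ready yet, retrying..."
        sleep 30
    done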
    
  5. Alternatively, to deploy DeepSeek-R1-Distill-Llama-8B using vLLM, run:

    python3 -m vllm.entrypoints.openai.api_server \
        --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
        --tensor-parallel-size 8 \
        --max-num-seqs 4 \
        --max-model-len 8192 \
        --block-size 8 \
        --device neuron \
        --use-v2-block-manager \
        --port 8000
    
  6. Invoke the model server:

    curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "prompt": "What is DeepSeek R1?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
    
  7. Deploy directly from Hugging Face model cards using the SageMaker Python SDK:

    import sagemaker
    import boto3
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    # Use the notebook's execution role, or fall back to a named IAM role
    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        iam = boto3.client("iam")
        role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

    # Hub model configuration: https://huggingface.co/models
    hub = {
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "HF_NUM_CORES": "2",  # 2 NeuronCores = one Inferentia2 chip (ml.inf2.xlarge)
        "HF_AUTO_CAST_TYPE": "bf16",
        "MAX_BATCH_SIZE": "8",
        "MAX_INPUT_TOKENS": "3686",
        "MAX_TOTAL_TOKENS": "4096",
    }

    # Create the Hugging Face model class
    huggingface_model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.25"),
        env=hub,
        role=role,
    )

    # Deploy the model to a SageMaker inference endpoint
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.inf2.xlarge",
        container_startup_health_check_timeout=1800,
        volume_size=512,
    )

    # Send a test request
    predictor.predict(
        {
            "inputs": "What is the capital of France?",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 128,
                "temperature": 0.7,
                "top_k": 50,
                "top_p": 0.95,
            },
        }
    )
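
  When you are finished, delete the endpoint and model so the ml.inf2.xlarge instance stops incurring charges:

    # Clean up the SageMaker resources created above
    predictor.delete_model()
    predictor.delete_endpoint()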
