Deploying Mistral 3.2-24B Instruct (2506) on AWS: EC2 GPU Instance Selection and Benchmark Insights
This article evaluates instance performance for deploying the Mistral 3.2-24B Instruct (2506) model on Amazon EC2 using vLLM. It aims to identify the best balance of price, performance, and scalability for interactive inference workloads, helping users choose the most cost-effective EC2 instance for running large transformer models efficiently at different concurrency levels.
Introduction
You are looking to deploy a fine-tuned Mistral 3.2-24B Instruct (2506) model for interactive inference (chat-style workloads) on Amazon EC2 instances powered by NVIDIA GPUs. The goal is to identify the best price-performance configuration for vLLM-based inference, focusing on throughput efficiency. You want to provision the right amount of infrastructure to meet service-level expectations while avoiding overprovisioning and unnecessary cost.
Assumptions
Since the goal is to evaluate instance performance before moving to full production, a few practical assumptions are needed. While some parameters are well known — such as model size and precision — others must be estimated to create a realistic but manageable testing setup. These assumptions help define a consistent baseline for benchmarking throughput and resource utilization without overcomplicating early-stage evaluations.
Key model & workload specs:
- Model size: 24B parameters (BF16) — ~48 GB for the weights alone; ~64 GB VRAM budgeted including KV-cache and activation headroom
- Engine: vLLM
- Average output: ~1K tokens
- Target throughput: 30 tokens/sec per session
- KV cache enabled
- Benchmark dataset: ShareGPT
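The ~64 GB figure above can be sanity-checked with a back-of-envelope calculation: BF16 weights take 2 bytes per parameter, and the KV cache adds a per-token cost on top. The layer/head numbers below are illustrative placeholders, not the actual Mistral 3.2-24B config:

```python
def weight_vram_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """VRAM needed for model weights alone (BF16 = 2 bytes/param)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb_per_token(layers: int, kv_heads: int, head_dim: int,
                          bytes_per_elem: int = 2) -> float:
    """KV cache per token: 2 tensors (K and V) * layers * kv_heads * head_dim."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem / 1e9

weights = weight_vram_gb(24)         # ~48 GB for 24B BF16 weights
print(f"weights: {weights:.0f} GB")  # the ~64 GB budget adds KV-cache
                                     # and activation headroom on top
```

Whatever remains after weights is what vLLM can hand to the KV cache, which is why `--gpu-memory-utilization` matters later in the serve command.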
Candidate EC2 Instances
To evaluate performance meaningfully, we focus on Amazon EC2 instances equipped with NVIDIA GPUs, since these accelerators directly impact model inference speed. Rather than testing every possible instance variation, we select a representative set that covers different GPU architectures and performance tiers. The goal is to compare key accelerator characteristics — such as VRAM capacity, FP16 compute throughput, and memory bandwidth — as these factors have the greatest influence on large language model inference workloads like Mistral 3.2-24B.
The following instance families were evaluated.
Important: G instances were tested using all GPUs on the instance, while P instances were tested using only a single GPU, despite having 8 GPUs per instance. This distinction is critical for interpreting the results.
| Instance | Accelerators on instance | Accelerators used in test | VRAM (GB) per accelerator | FP16 TFLOPS per accelerator | Memory bandwidth (GB/s) per accelerator | Notes (see legend) |
|---|---|---|---|---|---|---|
| g5.12xlarge | 4x A10G | 4 | 24 | 121 | 300 | A |
| g6e.12xlarge | 4x L40S | 4 | 48 | 366 | 864 | A, B |
| p5en.48xlarge | 8x H200 | 1 | 141 | 1,071 | 4,917 | A, B, C |
| p6-b200.48xlarge | 8x B200 | 1 | 181 | 2,849 | 7,672 | A, B, C |
| Legend | Description |
|---|---|
| A | Model fits with headroom |
| B | Strong memory bandwidth |
| C | Test reflects single‑GPU performance |
Source: accelerator specifications from NVIDIA data sheets.
Benchmark Methodology
Since the fine-tuned version of the Mistral 3.2-24B Instruct (2506) model is not yet finalized, we use the base model as a proxy for benchmarking. This approach is valid because fine-tuning modifies the model’s weights but not its overall structure — such as layer count, parameter size, or compute requirements — so inference performance remains largely consistent. To ensure comparability across tests, we fix the inference engine as vLLM, which serves as our control variable. Using the same model and inference framework across all GPU types allows us to isolate hardware-level performance differences rather than software or implementation effects.
We also standardize inputs and workload characteristics for reproducibility. Prompts are drawn from the ShareGPT dataset, representing typical conversational input lengths. Instead of testing a single static load, we simulate varying concurrency levels (`--max-num-seqs` = 1, 2, 4, 8, 16, 32) to understand how each instance behaves under different user demand scenarios, from light overnight usage to busy peak hours. The benchmark target is 30 tokens per second per sequence, which provides a practical throughput baseline for interactive workloads. Performance above this threshold indicates that the configuration can maintain a smooth user experience as concurrency scales.
Performance goal: ≥ 30 tokens/sec per sequence.
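The pass/fail criterion reduces to one division: aggregate output throughput over concurrency must stay at or above 30 tokens/sec. A minimal sketch, using made-up throughput numbers (substitute the output token throughput reported by `vllm bench serve` for each run):

```python
TARGET_TOKS_PER_SEQ = 30.0

def meets_target(total_output_toks_per_sec: float, concurrency: int) -> bool:
    """True if each of `concurrency` sequences averages >= 30 tok/s."""
    return total_output_toks_per_sec / concurrency >= TARGET_TOKS_PER_SEQ

# Illustrative numbers only -- not the measured benchmark results.
print(meets_target(130.0, 4))  # True  (32.5 tok/s per sequence)
print(meets_target(180.0, 8))  # False (22.5 tok/s per sequence)
```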
Results Summary
The benchmark results highlight clear performance differences across the tested EC2 instance families as concurrency increases. As shown in the chart, g5.12xlarge (A10G) instances perform adequately for low concurrency, maintaining around 30 tokens per second with one or two active sessions before dropping below the target threshold as load increases. The g6e.12xlarge (L40S) offers stronger scaling, sustaining near-target throughput up to moderate concurrency levels. In contrast, the p5en.48xlarge (H200) and p6-b200.48xlarge (B200) instances deliver consistently high throughput across all concurrency levels, far exceeding the 30 tokens per second benchmark. However, since these larger instances include multiple high-performance GPUs, their capabilities often surpass what’s needed for smaller interactive workloads, introducing a potential trade-off between capacity and utilization.
| Instance \ Concurrency | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| g5.12xlarge (4x A10G) | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| g6e.12xlarge (4x L40S) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| p5en.48xlarge (H200, 1 GPU) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| p6-b200.48xlarge (B200, 1 GPU) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Note: The P-instance results reflect single-GPU performance only, even though each instance contains 8 GPUs.
Observations
- Low concurrency (1–4 sessions):
  - g5.12xlarge (A10G) and g6e.12xlarge (L40S) are the most efficient, balancing price and speed.
  - Ideal for smaller chatbot workloads.
- Scaling gap:
  - As concurrency increases, g5.12xlarge and g6e.12xlarge do not scale as well.
  - Transitioning to the P series brings large jumps in both capability and cost.
- High concurrency (8+ sessions or multi-model):
  - A single H200 or B200 GPU outperforms a full multi-GPU G instance.
  - Optimal when multiple replicas or workloads fully utilise all 8 GPUs per instance.
  - Otherwise, underutilisation may lead to unnecessary cost.
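The capacity-versus-utilisation trade-off can be made concrete by comparing cost per generated token. The hourly prices and throughput figures below are placeholders for illustration only; look up current On-Demand pricing for your Region and substitute your own measured throughput:

```python
def usd_per_million_output_tokens(hourly_usd: float,
                                  toks_per_sec: float) -> float:
    """Cost to generate 1M output tokens at a sustained throughput."""
    tokens_per_hour = toks_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1e6

# (instance, PLACEHOLDER $/hr, illustrative sustained tok/s)
candidates = [("g6e.12xlarge", 10.5, 120.0),
              ("p5en.48xlarge (1-GPU share)", 8.0, 900.0)]
for name, price, tps in candidates:
    cost = usd_per_million_output_tokens(price, tps)
    print(f"{name}: ${cost:.2f} per 1M output tokens")
```

A larger instance can come out cheaper per token, but only if its throughput is actually consumed; an idle P-series GPU still accrues the full hourly rate.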
Recommendations & Next Steps
- Validate concurrency and prompt sizes against production usage patterns.
- Assess latency (Time to First Token, TTFT) for production tuning.
- For low concurrency: standardise on G5 or G6e.
- For high concurrency:
- Use p5en.48xlarge/p6-b200.48xlarge with orchestration (e.g. Ray Serve, Triton) to distribute traffic.
- Explore replica scaling and multi-model deployments.
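For the TTFT assessment, the metrics can be derived from wall-clock timestamps recorded as each streamed chunk arrives (e.g. while iterating a streaming response from the OpenAI-compatible endpoint vLLM exposes). A minimal sketch of the arithmetic, using synthetic timestamps rather than a live server:

```python
def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Time to First Token and mean Inter-Token Latency, in seconds.

    `token_times` are wall-clock timestamps at which each streamed
    token/chunk arrived, in arrival order.
    """
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Synthetic timing: first token 250 ms after the request, then one every 30 ms.
start = 100.0
times = [100.25 + 0.03 * i for i in range(5)]
ttft, itl = ttft_and_itl(start, times)
print(f"TTFT={ttft:.3f}s  ITL={itl:.3f}s")
```

An ITL of ~33 ms corresponds to the ~30 tokens/sec per-sequence target used throughout this benchmark; TTFT is dominated by prefill and queueing, so it degrades first as concurrency rises.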
Example vLLM Commands
Serve
```shell
# Set --tensor-parallel-size to the number of GPUs used in the test
# (4 on the G instances, 1 on the P instances).
vllm serve mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --enable-chunked-prefill \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8 \
  --max-num-seqs [ 1 | 2 | 4 | 8 | 16 | 32 ]
```
Bench
```shell
# --dataset-path    - ensure you've downloaded a ShareGPT dataset
# --num-prompts     - adjust for batching (10 * max_num_seqs)
# --max-concurrency - 2 * max_num_seqs keeps ~50% of requests queued at all times
vllm bench serve \
  --backend vllm \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts [ 10 | 20 | 40 | 80 | 160 | 320 ] \
  --max-concurrency [ 2 | 4 | 8 | 16 | 32 | 64 ] \
  --sharegpt-output-len 1000
```
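The bracketed option lists in the bench command follow a simple rule: for each `max-num-seqs` value n, use `num-prompts` = 10 × n and `max-concurrency` = 2 × n. The sweep can be generated programmatically rather than typed by hand:

```python
MAX_NUM_SEQS = [1, 2, 4, 8, 16, 32]

def sweep_params(max_num_seqs: list[int]) -> list[dict]:
    """Derive one bench configuration per max-num-seqs value.

    num_prompts = 10 * n gives enough requests per batch size;
    max_concurrency = 2 * n keeps ~half the requests queued so the
    running batch stays full for the whole run.
    """
    return [{"max_num_seqs": n,
             "num_prompts": 10 * n,
             "max_concurrency": 2 * n}
            for n in max_num_seqs]

for run in sweep_params(MAX_NUM_SEQS):
    print(run)
```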