Unexpected Performance Degradation After Migrating from c7i.8xlarge to m5zn.12xlarge

We experienced significant performance degradation (up to 154% slower on certain operations) after migrating our ECS container instances from c7i.8xlarge to m5zn.12xlarge. Despite maintaining identical container specifications, our CPU-intensive API workloads now perform substantially worse.

Environment Details

Previous Configuration (Better Performance)

  • Instance Type: c7i.8xlarge
  • vCPUs: 32
  • Processor: 4th Gen Intel Xeon Scalable (Sapphire Rapids)
  • Base Frequency: 3.0 GHz, Turbo up to 3.5 GHz
  • Memory: 64 GiB
  • Network Performance: Up to 12.5 Gbps

Current Configuration (Degraded Performance)

  • Instance Type: m5zn.12xlarge
  • vCPUs: 48
  • Processor: 2nd Gen Intel Xeon Scalable (Cascade Lake)
  • Base Frequency: 4.5 GHz (all-core turbo)
  • Memory: 192 GiB
  • Network Performance: Up to 100 Gbps

Workload Characteristics

  • Platform: Amazon ECS (EC2 launch type)
  • Container Specifications: Unchanged between configurations
  • Workload Type: CPU-intensive APIs
  • Primary Operations: Database interactions, sample/trial creation, material management

Performance Test Results

We conducted identical performance tests on both instance types. Here are the key results:

  • Create New Study: 1.86s → 3.88s (108.90% slower, 2.1x)
  • Delete Phase: 3.05s → 7.78s (154.63% slower, 2.5x)
  • Create Sample: 4.95s → 8.44s (70.55% slower, 1.7x)
  • Create Trial: 6.50s → 10.03s (54.41% slower, 1.5x)
  • Create 25 Trials: 7.20s → 10.07s (39.96% slower, 1.4x)
  • Create All Samples: 10.77s → 13.90s (29.05% slower, 1.3x)
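For anyone re-running the comparison, the slowdown figures can be recomputed with a few lines of Python. This is only a sketch: the timings are copied from the results above, and the `slowdown` helper is a name made up for illustration. Because it works from the rounded timings, the last digits may differ slightly from the posted percentages, which were presumably computed from unrounded measurements.

```python
# Timings in seconds as (c7i.8xlarge, m5zn.12xlarge), copied from the results above.
timings = {
    "Create New Study": (1.86, 3.88),
    "Delete Phase": (3.05, 7.78),
    "Create Sample": (4.95, 8.44),
    "Create Trial": (6.50, 10.03),
    "Create 25 Trials": (7.20, 10.07),
    "Create All Samples": (10.77, 13.90),
}


def slowdown(before: float, after: float) -> tuple:
    """Return (percent slower, ratio) for a before/after timing pair."""
    ratio = after / before
    return (ratio - 1.0) * 100.0, ratio


for name, (before, after) in timings.items():
    pct, ratio = slowdown(before, after)
    print(f"{name:<20} {before:6.2f}s -> {after:6.2f}s  {pct:6.1f}% slower  {ratio:.2f}x")
```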

Summary: 11 out of 12 operations showed significant regression, with most operations being 1.5-2.5x slower.

Questions

  1. Processor Architecture Impact: Given that m5zn uses older Cascade Lake processors (2019) versus c7i's Sapphire Rapids (2023), could this explain the degradation despite higher clock speeds? Are there specific architectural improvements in Sapphire Rapids that benefit CPU-intensive workloads?

  2. CPU Performance Modes: Are there any differences in how CPU performance states, turbo boost, or C-states are configured between these instance families that could impact sustained CPU performance?

  3. ECS Container CPU Allocation: With more vCPUs available on m5zn.12xlarge (48 vs 32), could there be CPU scheduling or allocation issues affecting our containers? Should we adjust container CPU reservations?

  4. Virtualization Overhead: Are there differences in the hypervisor or virtualization technology between these instance types that could impact CPU-intensive workloads?

  5. Memory Bandwidth and Latency: While m5zn has more memory, could there be differences in memory bandwidth, latency, or NUMA configuration affecting our database-heavy operations?

  6. Instance Family Optimization: The c7i family is compute-optimized, while m5zn is a general-purpose family with high clock frequency and high network throughput. For CPU-intensive APIs that don't require 100 Gbps networking, would we be better served staying with compute-optimized instances?

  7. Recommended Instance Type: Based on our workload characteristics (CPU-intensive, ECS containers, database operations), what instance type would you recommend for optimal performance?

Additional Context

  • Container resource specifications remained identical between tests
  • No other infrastructure changes were made during migration
  • Performance degradation is consistent and reproducible
  • We expected improvement with more vCPUs and higher base frequency, not degradation

What We're Looking For

  • Technical explanation for the observed performance degradation
  • Guidance on instance type selection for CPU-intensive ECS workloads
  • Recommendations for optimization if continuing with m5zn family
  • Confirmation whether returning to c7i family or upgrading to c7i.16xlarge would be advisable

Thank you for any insights!

3 Answers

The performance degradation you're experiencing after migrating from c7i.8xlarge to m5zn.12xlarge is surprising but can be explained by several factors.

Processor Architecture Impact

The generational difference between processors is likely a significant factor. The c7i instances use 4th Gen Intel Xeon Scalable processors (Sapphire Rapids), which include architectural improvements over the 2nd Gen Intel Xeon (Cascade Lake) in m5zn instances. Despite the higher clock speed of m5zn, newer processor generations often include:

  • Improved instructions per cycle (IPC)
  • Enhanced cache architecture
  • Better branch prediction
  • More efficient memory access patterns

These improvements can make newer processors perform better even at lower clock speeds for certain workloads.
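One way to confirm what each instance actually exposes is to read `/proc/cpuinfo` on both hosts and compare the model name and feature flags. The sketch below assumes a Linux host; `parse_cpuinfo` is a hypothetical helper written for illustration, not part of any AWS tooling.

```python
from pathlib import Path


def parse_cpuinfo(text: str) -> dict:
    """Extract the model name and feature flags of the first CPU listed."""
    info = {"model": None, "flags": set()}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key, value = key.strip(), value.strip()
        if key == "model name" and info["model"] is None:
            info["model"] = value
        elif key == "flags" and not info["flags"]:
            info["flags"] = set(value.split())
    return info


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # only present on Linux
        info = parse_cpuinfo(cpuinfo.read_text())
        print(info["model"])
        print(" ".join(sorted(info["flags"])))
```

Saving the output from each instance type and taking the set difference of the two flag lists shows which instruction-set extensions are available on one host but not the other. Whether the application or its libraries use any of them is a separate question that only profiling can answer.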

Instance Family Optimization

The instance family selection is critical. The "c" in c7i stands for compute-optimized, making it specifically designed for CPU-intensive workloads like yours. The "m" in m5zn denotes the general-purpose family (the "z" and "n" suffixes indicate high frequency and high network throughput), which may not align with your CPU-intensive needs.

Compute-optimized instances typically have:

  • CPU architectures optimized for compute-intensive workloads
  • Better CPU-to-memory ratio for computational tasks
  • Processor features prioritized for raw computational performance

Memory Bandwidth and Latency

While m5zn has more total memory, the memory subsystem architecture matters significantly. The c7i family likely has memory subsystems better optimized for CPU-intensive operations, potentially with:

  • Better memory bandwidth per core
  • Lower memory access latencies
  • More optimized cache hierarchies
  • Better NUMA configuration for your workload pattern
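The thread does not say how many NUMA nodes either instance exposes, so it is worth checking rather than assuming. A minimal sketch, assuming a Linux host with the usual sysfs layout (`numa_nodes` is a hypothetical helper name):

```python
import re
from pathlib import Path


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> list:
    """Return the sorted NUMA node ids (0, 1, ...) found under sysfs."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs is not mounted in this container
    ids = []
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    nodes = numa_nodes()
    print(f"{len(nodes)} NUMA node(s): {nodes}")
```

If one instance type reports more than one node and the other does not, cross-node memory access becomes a candidate explanation worth measuring; if both report the same count, NUMA layout is less likely to be the cause.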

ECS Container CPU Allocation

Having more vCPUs doesn't necessarily translate to better performance if the underlying cores are less efficient. With 48 vCPUs on m5zn.12xlarge versus 32 on c7i.8xlarge, there could be:

  • Different CPU scheduling behaviors
  • Potential resource contention
  • Less efficient core-to-thread mapping
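To make the allocation point concrete, here is a small sketch of the arithmetic, assuming the standard ECS convention of 1,024 CPU units per vCPU. The 4096-unit reservation is a hypothetical value, since the thread does not state the actual task definition settings.

```python
CPU_UNITS_PER_VCPU = 1024  # ECS convention: 1 vCPU = 1024 CPU units


def instance_cpu_units(vcpus: int) -> int:
    """Total CPU units an instance with `vcpus` vCPUs can offer to ECS."""
    return vcpus * CPU_UNITS_PER_VCPU


def share_of_instance(container_cpu_units: int, vcpus: int) -> float:
    """Fraction of the instance's CPU units that a reservation represents."""
    return container_cpu_units / instance_cpu_units(vcpus)


reservation = 4096  # hypothetical container reservation, not from the thread
for name, vcpus in (("c7i.8xlarge", 32), ("m5zn.12xlarge", 48)):
    share = share_of_instance(reservation, vcpus)
    print(f"{name}: {instance_cpu_units(vcpus)} units total, "
          f"reservation = {reservation / CPU_UNITS_PER_VCPU:.0f} vCPUs ({share:.1%} of host)")
```

Under these assumptions the same reservation maps to the same number of vCPUs on either host, just a smaller fraction of the larger one. Whether a container can use more than its reservation depends on whether a hard task-level CPU limit is set, which the thread does not say.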

Recommendations

  1. Instance Type Selection: Based on your workload characteristics, returning to the compute-optimized c7i family would likely be advisable. Consider c7i.8xlarge or c7i.16xlarge if you need more capacity.

  2. If Continuing with m5zn:

  • Adjust container CPU reservations to better align with the architecture
  • Consider testing with different CPU allocation strategies in your ECS task definitions
  • Monitor CPU utilization patterns to identify potential bottlenecks
  3. Performance Optimization:
  • Cache frequently accessed data to reduce database interactions
  • Consider profiling your application to identify specific operations affected by the architectural differences
  • Optimize container images to improve startup and execution time

The significant performance degradation across nearly all operations (11 of 12 in your tests) strongly suggests that the compute-optimized c7i family is better suited for your CPU-intensive workloads, despite having fewer vCPUs and a lower clock speed on paper. The newer processor architecture and compute-optimized design appear to provide substantial benefits for your specific use case.

answered 8 days ago
EXPERT
reviewed 8 days ago

In terms of CPU utilization pattern: if your workload is CPU-bound and single-threaded, m5zn will be slower despite having more cores. If your workload can use all 48 vCPUs, m5zn might be faster overall.
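A rough way to reason about this is Amdahl's law, which is brought in here purely as an illustrative model rather than anything measured in the thread. The sketch treats every vCPU as a full core, which overstates what hyperthreads deliver, and the parallel fractions are made-up examples.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


# Hypothetical parallel fractions, compared at 32 and 48 vCPUs.
for p in (0.0, 0.5, 0.9, 0.99):
    s32 = amdahl_speedup(p, 32)
    s48 = amdahl_speedup(p, 48)
    print(f"parallel fraction {p:.2f}: {s32:5.2f}x on 32 vCPUs, {s48:5.2f}x on 48 vCPUs")
```

In this model the gap between 32 and 48 vCPUs is small unless almost all of the work runs in parallel, so for a request that is mostly sequential the per-core speed of the processor matters far more than the vCPU count.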

In terms of memory: m5zn.12xlarge has 192 GiB of RAM (3x the 64 GiB of c7i.8xlarge), which makes it better for memory-intensive workloads, while c7i is better for compute-intensive workloads.

EXPERT
answered 8 days ago

Without profiling data of the running code, and assuming that the issue is purely compute-bound (and not slowed down by memory, network, EBS bandwidth, etc.), it's very difficult to guess which processor features the code you're running would benefit from. However, given that the differences are so enormous that they can be readily quantified, there could be a lot to gain by testing different instance types with different processors. This would also help to narrow down whether the issue is likely related to the older Intel processor generation on the m5zn or other factors.
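A cheap first check on the "purely compute-bound" assumption is to compare wall-clock time with CPU time for a single operation. The sketch below uses only the Python standard library; `measure`, `busy_loop`, and `waiting` are hypothetical stand-ins, and the real operations (for example "Create Sample") would be substituted in.

```python
import time


def measure(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def busy_loop(n: int = 2_000_000) -> int:
    """Stand-in for a CPU-bound operation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def waiting(seconds: float = 0.2) -> None:
    """Stand-in for an operation that mostly waits (database, network, disk)."""
    time.sleep(seconds)


for op in (busy_loop, waiting):
    _, wall, cpu = measure(op)
    print(f"{op.__name__:<10} wall={wall:.3f}s  cpu={cpu:.3f}s  cpu/wall={cpu / wall:.0%}")
```

A CPU-to-wall ratio near 100% points to computation inside the process, while a low ratio suggests the operation is mostly waiting on something else. Note that `time.process_time` only counts CPU time of the current process, so work done by a database server or by child processes shows up as waiting.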

Close to the size of a c7i.8xlarge would be r7a.4xlarge (4th Gen AMD EPYC, "Genoa"; 16 cores at 3.7 GHz, 128 GB RAM). If your code runs on or can be compiled for the arm64 platform, m8g.4xlarge (16 cores at 2.8 GHz, 64 GB RAM) and c8g.8xlarge (32 cores at 2.8 GHz, 64 GB RAM) would use AWS's own Graviton4 processors.

Do you know if the compute-intensive code is able to leverage GPU acceleration? While GPU capacity may not always be available the way it used to be before GenAI, and even though the prices have gone up, offloading relevant operations to the GPU can, of course, make a massive difference to certain compute-bound operations. For example, g5.4xlarge would use only 2nd generation AMD EPYC processors (8 cores at 3.3 GHz, 64 GB RAM) but includes one NVIDIA A10G Tensor Core GPU.

Once the bottleneck and performance drivers are more clearly identified, the other optimisation aspects, including costs, would be simpler to figure out.

EXPERT
answered 2 days ago
