Performance Optimization at Scale for Amazon Prime Video by Andrew Steffey

Performance Optimization at Scale for Amazon Prime Video

Andrew Steffey

Software Engineer

Amazon EC2

Java

Kotlin

Video Streaming

Performance Optimization at Scale: Amazon Prime Video

Delivered a 76% reduction in compute costs and a 3x increase in per-core throughput by optimizing instance selection, improving cache utilization, tuning memory allocation, garbage collection, and thread pools, and partnering with upstream teams to reduce oversized responses.

Overview

A critical Amazon Prime Video service faced escalating infrastructure costs as the customer base grew. The service needed to handle increasing traffic while maintaining strict SLA requirements and cost efficiency.

Approach

Hypothesized that larger instance sizes would provide non-linear performance benefits for the service's highly variable workload, where response complexity varies significantly from request to request. Load testing confirmed the hypothesis: larger instances reduced the probability that any single host would receive a disproportionate number of expensive requests at the same time, and they also improved on-host cache utilization. Together, these effects allowed hosts to run at higher CPU utilization while still meeting latency SLAs. Implemented enhanced CPU monitoring to verify that larger instances prevented the sub-second CPU spikes that would otherwise break latency SLAs.
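The statistical intuition behind the instance-size argument can be illustrated with a small calculation. The sketch below is not the service's actual model; it assumes, purely for illustration, that each in-flight request is independently "expensive" with a fixed probability, that the number of in-flight requests scales with core count, and that a host is in trouble when the expensive share exceeds a threshold. The class name, method names, and all numeric parameters are hypothetical.

```java
// Illustrative only: exact binomial tail under simplified, assumed traffic.
class HostSizeTail {

    // P(X > k) for X ~ Binomial(n, p), built up from P(X = 0) term by term.
    static double tailAbove(int n, double p, int k) {
        double pmf = Math.pow(1 - p, n); // P(X = 0)
        double cdf = 0.0;
        for (int i = 0; i <= k; i++) {
            cdf += pmf;
            // Recurrence: P(X = i + 1) = P(X = i) * (n - i) / (i + 1) * p / (1 - p)
            pmf = pmf * (n - i) / (i + 1) * p / (1 - p);
        }
        return Math.max(0.0, 1.0 - cdf);
    }

    // Probability that more than `threshold` of a host's in-flight requests
    // are expensive, assuming in-flight requests scale with core count.
    static double overloadProbability(int cores, int requestsPerCore,
                                      double pExpensive, double threshold) {
        int inFlight = cores * requestsPerCore;
        int limit = (int) Math.floor(threshold * inFlight);
        return tailAbove(inFlight, pExpensive, limit);
    }

    public static void main(String[] args) {
        // Assumed parameters: 2 in-flight requests per core, 10% expensive,
        // host considered overloaded when more than 25% are expensive.
        for (int cores : new int[] {4, 16, 64}) {
            System.out.printf("%d cores: overload probability %.8f%n",
                    cores, overloadProbability(cores, 2, 0.10, 0.25));
        }
    }
}
```

Under these assumptions the overload probability shrinks as hosts grow, because the expensive share on a larger host averages over more requests and drifts less from its mean. Real traffic is rarely independent, so this only motivates the hypothesis; the load testing described above is what verified it.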

Migrated to ARM-based Graviton instances, which offered lower costs and better performance for the workload. Verified latency and reliability through testing with simulated traffic loads.

Additionally, partnered with an upstream client team to reduce response payload sizes. Conducted A/B testing to validate that reduced response sizes had no discernible impact on customer behavior, while significantly improving service performance.

Optimized thread pool architecture by ensuring all code with potential off-host calls executed asynchronously. Consolidated CPU-intensive tasks onto a single thread pool sized just above the number of cores. Leveraged Netty for async off-host calls wherever possible, allocating larger IO thread pools only where async Netty clients could not be used.
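A minimal sketch of this thread pool layout follows, using only the JDK. It is not the service's code: `fetchRemote` is a hypothetical stand-in for a non-blocking client call (in the real service this role was played by async Netty clients), simulated here with a scheduler so the example stays self-contained. The class name, method names, and the choice of cores + 1 as "just above the number of cores" are assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

class ThreadPoolLayout {

    // Daemon threads so this sketch never keeps the JVM alive on its own.
    static ThreadFactory daemon(String name) {
        return r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        };
    }

    static final int CORES = Runtime.getRuntime().availableProcessors();

    // One shared pool for all CPU-intensive work, sized just above core count.
    static final ExecutorService CPU_POOL =
            Executors.newFixedThreadPool(CORES + 1, daemon("cpu"));

    // Hypothetical stand-in for a non-blocking client's event loop.
    static final ScheduledExecutorService FAKE_IO =
            Executors.newSingleThreadScheduledExecutor(daemon("fake-io"));

    // Simulated off-host call: returns immediately, completes about 20 ms later,
    // and never blocks a CPU_POOL thread while waiting.
    static CompletableFuture<String> fetchRemote(String key) {
        CompletableFuture<String> result = new CompletableFuture<>();
        FAKE_IO.schedule(() -> { result.complete("payload-for-" + key); },
                20, TimeUnit.MILLISECONDS);
        return result;
    }

    // Placeholder for CPU-bound work such as parsing or ranking.
    static String render(String payload) {
        return payload.toUpperCase();
    }

    // Off-host call is async; the CPU-bound step hops onto the CPU pool.
    static CompletableFuture<String> handle(String key) {
        return fetchRemote(key).thenApplyAsync(ThreadPoolLayout::render, CPU_POOL);
    }

    public static void main(String[] args) {
        System.out.println(handle("title-123").join());
    }
}
```

The design point is that threads in the CPU pool are always doing CPU work, so a pool only slightly larger than the core count can keep every core busy without excess context switching. A dependency that only offers a blocking client would instead get its own, larger IO pool so that waiting threads do not starve the CPU pool.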