When bulk-loading high-dimensional vectors into Amazon S3 Vectors from Python, the GIL capped PutVectors throughput at ~490 vec/sec in our tests, regardless of thread count. This article diagnoses the bottleneck, walks through switching from ThreadPoolExecutor to ProcessPoolExecutor, and shows measured results: on the same c5.2xlarge, throughput improved from 490 to 1,709 vec/sec simply by switching the concurrency model.
The Problem
When bulk-loading high-dimensional vectors into Amazon S3 Vectors using Python, you may observe that PutVectors throughput plateaus regardless of how many concurrent threads you use. In our testing on a c5.2xlarge, throughput stayed at ~490 vectors/sec whether we used 5 threads or 20. The exact ceiling varies by instance type, but the pattern is the same on any EC2 instance: adding threads does not increase throughput. This is a classic symptom of Python's Global Interpreter Lock (GIL) constraining CPU-bound work in a multithreaded architecture.
Workload Profile
- 1 billion vectors, 1024 dimensions, float32, Euclidean distance
- PutVectors with batch size 500 (maximum per API call), ~2MB payload per request
- Source data: Parquet files in Amazon S3
- Client: Python with boto3, running on Amazon EC2
Why Multithreading Does Not Scale
Each PutVectors call with 500 x 1024 float32 vectors produces a ~2MB payload. Constructing that payload (building Python dicts and serializing them to JSON) is CPU-bound. In CPython, the GIL ensures only one thread executes Python bytecode at a time. Threads release the GIL during I/O waits such as network calls, but with large payloads, the CPU-bound serialization work holds the GIL and prevents the threads from running in parallel.
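To see the effect in isolation, the sketch below times payload construction plus JSON encoding for a few batches, first serially and then on a thread pool. It uses only the standard library. `build_payload` and `build_and_serialize` are illustrative stand-ins for the per-request CPU work, not boto3's actual code path, and the random values are placeholders for real embeddings. On a standard GIL-enabled CPython build, the threaded time usually comes out close to the serial time.

```python
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor


def build_payload(num_vectors=500, dim=1024):
    """Build one PutVectors-style batch as plain Python objects (CPU-bound)."""
    return [
        {"key": f"vec-{i}", "data": {"float32": [random.random() for _ in range(dim)]}}
        for i in range(num_vectors)
    ]


def build_and_serialize(num_vectors=500, dim=1024):
    """Stand-in for the per-request work: build the batch, then encode it as JSON."""
    return len(json.dumps(build_payload(num_vectors, dim)))


def time_serial_vs_threads(n_batches=4, workers=4, num_vectors=500, dim=1024):
    """Return (serial_seconds, threaded_seconds) for the same amount of work."""
    start = time.perf_counter()
    for _ in range(n_batches):
        build_and_serialize(num_vectors, dim)
    serial = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(build_and_serialize, num_vectors, dim)
            for _ in range(n_batches)
        ]
        for future in futures:
            future.result()
    threaded = time.perf_counter() - start
    return serial, threaded
```

Calling `time_serial_vs_threads()` and printing both numbers is enough to check whether threads help on a given machine. No network I/O is involved, so any gap between the two timings reflects the interpreter rather than the service.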
Multithreading Results (c5.2xlarge, 8 vCPUs)
| Workers | Wall time (s) | Effective vec/sec |
|---|---|---|
| 5 | 101.8 | ~491 |
| 20 | 101.9 | ~490 |
Quadrupling the thread count from 5 to 20 produced no improvement: wall-clock time and throughput are effectively identical. The GIL serializes the CPU-bound payload construction, so additional threads simply wait their turn rather than running in parallel.
The Fix: Switch to Multiprocessing
Replace concurrent.futures.ThreadPoolExecutor with concurrent.futures.ProcessPoolExecutor. Each worker process gets its own Python interpreter and GIL, enabling true parallel execution of both payload construction and PutVectors calls.
Key Implementation Details
1. Create one boto3 client per process
boto3 clients are not safe to share across processes. Initialize a client in each worker:
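from concurrent.futures import ProcessPoolExecutor
from botocore.config import Config
import boto3

client = None  # set once per worker process by init_client()

def init_client():
    """Create a process-local S3 Vectors client (runs once in each worker)."""
    global client
    client = boto3.client('s3vectors', config=Config(
        max_pool_connections=50,
        retries={'max_attempts': 3}
    ))

# Pass init_client as the pool initializer so every worker builds its own client,
# e.g. ProcessPoolExecutor(max_workers=8, initializer=init_client)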
2. Move payload construction into the worker
This is critical. The payload construction phase is CPU-bound and GIL-constrained. Each process should receive a file path or raw data and build PutVectors payloads locally:
# download_parquet and build_batches are application helpers (not shown)
def process_file(file_key):
    """Each process handles a complete file end-to-end."""
    # Download from S3
    data = download_parquet(file_key)
    # Build batch payloads (CPU-bound, now parallel across processes)
    batches = build_batches(data, batch_size=500)
    # Sequential PutVectors calls per process, using the process-local client
    for batch in batches:
        client.put_vectors(
            vectorBucketName='my-vector-bucket',
            indexName='my-index',
            vectors=batch
        )
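The article does not show `build_batches`, so the following is only one possible shape for it. It assumes the input is an iterable of `(key, embedding)` pairs, which is a hypothetical layout for the decoded Parquet rows. The entry structure (`key` plus `data` with a `float32` list) follows the PutVectors request format as we understand it and should be checked against the API reference.

```python
from itertools import islice


def build_batches(rows, batch_size=500):
    """Yield lists of PutVectors-style entries, at most batch_size per list.

    rows: iterable of (key, embedding) pairs. This input shape is an assumption.
    """
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Converting to plain Python floats is the CPU-bound step that
        # benefits from running in separate processes.
        yield [
            {"key": str(key), "data": {"float32": [float(x) for x in embedding]}}
            for key, embedding in chunk
        ]
```

Because this is a generator, each worker holds only one batch of converted vectors in memory at a time while it loops over `put_vectors` calls.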
3. Use shared-nothing data passing
Pass file paths or Amazon S3 keys to workers rather than large Python objects. This avoids pickling overhead across the process boundary.
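A minimal sketch of a driver that follows this advice is shown below. `run_bulk_load` is a hypothetical name. It submits one task per key, so only short strings are pickled on the way in, and it expects each task to return something small such as a vector count. The `executor_cls` parameter is an addition for illustration that lets the same code run on threads for comparison.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def run_bulk_load(file_keys, process_file, max_workers,
                  initializer=None, executor_cls=ProcessPoolExecutor):
    """Submit one task per file key and collect results and failures.

    Passing executor_cls=ThreadPoolExecutor runs the same code on threads,
    which is handy for an A/B comparison of the two concurrency models.
    """
    results, failures = {}, {}
    with executor_cls(max_workers=max_workers, initializer=initializer) as pool:
        futures = {pool.submit(process_file, key): key for key in file_keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # keep going; report failed files at the end
                failures[key] = exc
    return results, failures
```

With a process pool, `process_file` has to be a module-level function so it can be pickled, and the call belongs under an `if __name__ == "__main__":` guard on platforms that spawn rather than fork workers. A per-process client setup function can be supplied through `initializer`.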
4. Size your instance to match worker count
Each process needs CPU and memory. Start with workers equal to your vCPU count, then scale up and measure.
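One way to make "scale up and measure" concrete is a small sweep, sketched below. The multipliers in `steps` are arbitrary starting points rather than recommendations from the measurements in this article, and `run_once` is a hypothetical callable that loads a fixed sample of vectors using the given number of workers.

```python
import os
import time


def candidate_worker_counts(vcpus=None, steps=(1.0, 1.5, 2.0)):
    """Worker counts to try, starting at the vCPU count and scaling up."""
    vcpus = vcpus or os.cpu_count() or 1
    return sorted({max(1, int(vcpus * step)) for step in steps})


def sweep(run_once, worker_counts, vectors_per_run):
    """Time run_once(workers) for each count and return {workers: vec_per_sec}."""
    rates = {}
    for workers in worker_counts:
        start = time.perf_counter()
        run_once(workers)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates[workers] = vectors_per_run / elapsed
    return rates
```

If the rate stops improving from one worker count to the next, the extra processes are no longer buying throughput and the smaller count is the better choice.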
Multiprocessing Results
With multiprocessing, throughput scaled with the number of worker processes and available vCPUs:
| Configuration | Vec/sec | Projected time (1B vectors) |
|---|---|---|
| Threading, 5-20 workers, c5.2xlarge (8 vCPUs) | ~490 | 23.6 days |
| Multiprocessing, 5 workers, c5.2xlarge (8 vCPUs) | ~1,709 | 6.8 days |
| Multiprocessing, 15 workers, c5.4xlarge (16 vCPUs) | ~2,500 | 4.6 days |
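The projected times in the table follow directly from the measured rates, assuming the rate holds steady for the full 1 billion vectors. A short check of that arithmetic:

```python
SECONDS_PER_DAY = 86_400


def projected_days(total_vectors, vec_per_sec):
    """Days needed to load total_vectors at a constant vec_per_sec rate."""
    return total_vectors / vec_per_sec / SECONDS_PER_DAY


for rate in (490, 1709, 2500):
    print(f"{rate:>5} vec/sec -> {projected_days(1_000_000_000, rate):.1f} days")
# Prints 23.6, 6.8, and 4.6 days respectively
```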
Reading the Results
Two things to note:
- The throughput improvement. On the same c5.2xlarge, switching from threading to multiprocessing delivered a 3.5x improvement (490 to 1,709 vec/sec). Scaling up to a c5.4xlarge with 15 workers reached 2,500 vec/sec, cutting load time from 24 days to under 5 days.
- The bottleneck shifted from client to service. On the c5.4xlarge with 15 workers, average per-batch put time rose to ~1.54s compared to ~1.15s with 5 workers, while build time stayed flat at ~0.28-0.38s. The increasing put latency indicates the API is now the constraint rather than the client. At 2,500 vec/sec, the client is saturating the per-index write throughput, which is exactly where you want to be.
Recommendations
- Always use multiprocessing (not multithreading) for bulk PutVectors loads with high-dimensional vectors. The GIL makes threading ineffective for payloads that require significant CPU-bound serialization.
- Match your EC2 instance to your target concurrency. A c5.2xlarge (8 vCPUs) is undersized for high-throughput bulk loads. Scale up to c5.4xlarge or larger to support enough worker processes to saturate the API.
- Be aware of API throttling behavior. As you scale concurrency, you may encounter per-index write throughput limits. If you observe increased latency or throttling errors at high concurrency, back off the worker count—this indicates you are approaching the service-side throughput ceiling rather than a client-side limitation.
- Start with workers equal to vCPU count, then increase. Monitor for diminishing returns as the API becomes the bottleneck.
- Configure boto3 connection pools per process. Set max_pool_connections appropriately and add retries to handle transient connection errors.
- Keep batch size at 500 (the PutVectors maximum). The throughput gains come from parallelizing across processes, not from tuning batch size.
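For the throttling recommendation above, botocore's built-in retries (the `retries` setting in the client config) already handle many transient errors. If a worker still needs to slow down when throttled, a generic backoff wrapper like the sketch below is one option. `call_with_backoff` and the `is_retryable` predicate are hypothetical. With boto3, the predicate could inspect `exc.response["Error"]["Code"]` on a `botocore.exceptions.ClientError`, but which error codes S3 Vectors returns when throttling is an assumption to verify against the service documentation.

```python
import random
import time


def call_with_backoff(fn, is_retryable, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call fn(); on a retryable error, wait with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors that should not be retried.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # full jitter spreads out retries
```

Inside a worker, the PutVectors call would be wrapped in a zero-argument callable, for example `call_with_backoff(lambda: client.put_vectors(...), is_retryable)`, where `client` is the process-local client.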
Summary
If your Python-based Amazon S3 Vectors bulk load is stuck at ~400-500 vec/sec regardless of thread count, you are hitting the GIL. Switch to multiprocessing, size your instance appropriately, and you can reach the documented per-index write throughput of 2,500 vec/sec.