
High latency (30–40 seconds) for very short Arabic audio using Amazon Transcribe (Batch and Streaming)


Description

We are experiencing unexpectedly high latency when transcribing very short Arabic audio files using Amazon Transcribe.

Architecture

Our workflow:

  1. A student uploads a short audio clip (2–5 seconds, typically 1–3 spoken words).
  2. The audio is stored in S3.
  3. AWS Lambda is triggered.
  4. Lambda invokes Amazon Transcribe.
  5. After transcription completes, the result is passed to Amazon Bedrock (Nova Pro) for evaluation.

There is no delay from Bedrock. The bottleneck appears to be Amazon Transcribe.

Observed Behavior

For very short audio files (2–3 seconds):

  • Transcription consistently takes 30–40 seconds
  • This occurs even when the audio contains only one short word
  • Latency is significantly higher than expected for such short input

What We Tested

We initially used Batch Transcription (StartTranscriptionJob) with polling.

To reduce latency, we migrated to Streaming Transcription via AWS SDK from Lambda.

However:

  • Streaming transcription produced similar latency (30–40 seconds)
  • No noticeable improvement compared to batch mode

This suggests the delay is not caused by:

  • Batch job scheduling
  • Polling logic
  • Lambda cold start
  • Client-side implementation

Technical Details

  • Language: Arabic (ar-SA)
  • Region: eu-central-1
  • Audio format stored in S3: MP3
  • Audio duration: typically 2–5 seconds
  • Invocation: Python SDK from AWS Lambda
  • Streaming: the full audio is streamed, then we wait for the final transcript event
  • Total observed end-to-end transcription time: ~30–40 seconds

Expected Behavior

For 2–3 second audio clips, we expected transcription time closer to:

  • Audio duration + small processing overhead
  • Ideally under 10 seconds

Questions

We would appreciate clarification on:

  1. Is 30–40 seconds expected latency for very short Arabic audio?
  2. Does Arabic transcription require additional model initialization or inference time?
  3. Are there known latency characteristics in eu-central-1 for Amazon Transcribe?
  4. Is streaming transcription expected to improve latency for short pre-recorded audio?
  5. Is Amazon Transcribe optimized primarily for longer recordings rather than very short clips?
  6. Are there recommended configurations to reduce latency for short Arabic audio?
  7. Would another region provide lower latency for Arabic transcription?

asked 13 days ago · 24 views
1 Answer

Based on the architecture you've described, a few points about your streaming implementation and latency expectations are worth considering:

Streaming vs. Batch for Pre-recorded Audio

Streaming transcription is designed for real-time audio streams, not pre-recorded files. When you stream a complete audio file from S3 through Lambda, you lose the main benefit of streaming, which is receiving partial results while the audio is still being captured. For short pre-recorded audio files, batch transcription is typically the appropriate choice.

Streaming Best Practices You May Be Missing

If you continue with streaming, several factors significantly impact latency:

  • Chunk size and timing: Latency depends heavily on audio chunk size. Chunks of 50 to 200 ms work best. The stream should be as close to real time as possible, meaning you should send audio chunks progressively, not all at once.

  • Audio format: PCM-encoded audio provides the best streaming efficiency. Since you're storing MP3 files in S3, converting to PCM before streaming could improve performance.

  • Uniform chunk delivery: Chunks should be sent at a consistent rate that mimics real-time speech delivery, not as fast as possible.
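
The chunking and pacing points above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the MP3 has already been decoded to 16-bit mono PCM at 16,000 Hz, and the helper names (`chunk_size_bytes`, `paced_chunks`) are made up for this example. Each yielded chunk would then be handed to whatever streaming client you are using.

```python
import time
from typing import Callable, Iterator


def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM that correspond to chunk_ms of audio."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    return samples_per_chunk * bytes_per_sample * channels


def paced_chunks(pcm: bytes, sample_rate_hz: int = 16000, chunk_ms: int = 100,
                 sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield fixed-size PCM chunks, pausing after each one.

    The pause equals the chunk duration, so the data is delivered at
    roughly the speed it would arrive from a live microphone.
    """
    size = chunk_size_bytes(sample_rate_hz, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
        sleep(chunk_ms / 1000)


# Example: 3 seconds of silence as 16-bit mono PCM at 16 kHz (assumed input).
if __name__ == "__main__":
    silence = bytes(16000 * 2 * 3)
    for chunk in paced_chunks(silence, chunk_ms=100):
        pass  # send `chunk` to the streaming client here
```

With 100 ms chunks at 16,000 Hz, each chunk is 3,200 bytes, which sits inside the 50 to 200 ms range mentioned above. The `sleep` parameter is injectable only so the pacing can be disabled in tests.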

Recommendations for Your Use Case

For very short pre-recorded audio (2-5 seconds):

  1. Use batch transcription rather than streaming, as it's designed for complete media files in S3
  2. Optimize audio format: Ensure your MP3 files use appropriate sampling rates (16,000 Hz offers the best balance of quality and processing efficiency)
  3. Consider audio quality: Lossless formats (such as FLAC, or WAV with PCM encoding) generally provide better transcription results

Regarding Your Specific Questions

The 30–40 second latency you're seeing is unlikely to come from the audio duration itself and is more likely related to how the audio is processed or streamed. The streaming service expects audio chunks to arrive in real time. If you send the entire file at once, or do not chunk the audio properly, that could contribute to the unexpected behavior.
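
To narrow down where the time actually goes, it may help to time each stage of the Lambda handler separately. The sketch below is one simple way to do that; the stage names and the `timed` and `slowest_stage` helpers are hypothetical and not part of any AWS SDK.

```python
import time
from contextlib import contextmanager
from typing import Dict


@contextmanager
def timed(stage: str, results: Dict[str, float], clock=time.perf_counter):
    """Record how long the enclosed block takes, in seconds, under `stage`."""
    start = clock()
    try:
        yield
    finally:
        results[stage] = clock() - start


def slowest_stage(results: Dict[str, float]) -> str:
    """Return the name of the stage with the largest recorded duration."""
    return max(results, key=results.get)


# Example usage inside a handler (the stage bodies are placeholders):
if __name__ == "__main__":
    timings: Dict[str, float] = {}
    with timed("s3_download", timings):
        time.sleep(0.01)   # stand-in for fetching the clip from S3
    with timed("transcribe", timings):
        time.sleep(0.05)   # stand-in for the Transcribe call and wait
    with timed("bedrock", timings):
        time.sleep(0.01)   # stand-in for the Bedrock evaluation
    print(timings, "slowest:", slowest_stage(timings))
```

Logging these numbers for a handful of clips should show whether the 30–40 seconds is spent waiting on the service or somewhere in the client-side streaming loop.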

For language-specific or region-specific latency characteristics, including whether Arabic requires additional processing time, I'd recommend reviewing the service quotas and supported-features documentation for your region and language combination, since streaming support and performance can vary by both.

Sources

  • Streaming and partial results - Amazon Transcribe
  • Transcribing streaming audio - Amazon Transcribe

answered 13 days ago
