Based on the architecture you've described, there are several important considerations regarding your streaming transcription implementation and latency expectations:
Streaming vs. Batch for Pre-recorded Audio
Streaming transcription is designed for real-time audio streams, not pre-recorded files. When you stream a complete audio file from S3 through Lambda, you're not gaining the primary benefit of streaming—which is receiving partial results as audio is being captured in real-time. For pre-recorded short audio files, batch transcription is typically the appropriate choice.
Streaming Best Practices You May Be Missing
If you continue with streaming, several factors significantly impact latency:
- Chunk size and timing: Latency depends heavily on audio chunk size, and chunks between 50 ms and 200 ms work best. The stream should be as close to real time as possible, which means sending audio chunks progressively rather than all at once.
- Audio format: PCM-encoded audio provides the best streaming efficiency. Streaming does not accept MP3 directly (its supported encodings are PCM, FLAC, and Ogg Opus), so the MP3 files you store in S3 have to be decoded, ideally to PCM, before you stream them.
- Uniform chunk delivery: Send chunks at a consistent rate that mimics real-time speech, not as fast as possible.
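Here is a rough Python sketch of that pipeline. It decodes the MP3, splits the result into fixed-duration chunks, and yields them on a real-time schedule. The sketch makes several assumptions: 16 kHz mono signed 16-bit PCM, 100 ms chunks, and an `ffmpeg` binary available to the Lambda function (for example, from a layer). The helper names `mp3_to_pcm`, `split_pcm` and `paced_chunks` are hypothetical.

```python
import asyncio
import subprocess
from typing import AsyncIterator, List

SAMPLE_RATE_HZ = 16_000  # assumption: mono audio resampled to 16 kHz
BYTES_PER_SAMPLE = 2     # signed 16-bit little-endian PCM


def mp3_to_pcm(mp3_path: str, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Decode an MP3 file to raw mono s16le PCM with the ffmpeg CLI (must be installed)."""
    cmd = [
        "ffmpeg", "-loglevel", "error",
        "-i", mp3_path,
        "-f", "s16le",           # raw output, no WAV header
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian samples
        "-ac", "1",              # mono
        "-ar", str(sample_rate), # resample
        "pipe:1",                # write to stdout
    ]
    return subprocess.run(cmd, check=True, capture_output=True).stdout


def split_pcm(pcm: bytes, chunk_ms: int = 100, sample_rate: int = SAMPLE_RATE_HZ) -> List[bytes]:
    """Split raw PCM into fixed-duration chunks; the last chunk may be shorter."""
    # e.g. 16000 samples/s * 2 bytes * 100 ms / 1000 = 3200 bytes per chunk
    chunk_bytes = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


async def paced_chunks(chunks: List[bytes], chunk_ms: int = 100) -> AsyncIterator[bytes]:
    """Yield chunks at roughly real-time speed instead of all at once."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    for index, chunk in enumerate(chunks):
        # Wait until this chunk's scheduled send time, measured from the start,
        # so a slow send does not push every later chunk further behind.
        delay = start + index * chunk_ms / 1000 - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        yield chunk
```

Each yielded chunk would then go out as one audio event on the stream. With the `amazon-transcribe` Python SDK, for instance, that is `await stream.input_stream.send_audio_event(audio_chunk=chunk)`, followed by `await stream.input_stream.end_stream()` after the last chunk. Check the SDK's documentation for its current interface. Keep in mind that correct pacing means a 3-second clip takes about 3 seconds just to send.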
Recommendations for Your Use Case
For very short pre-recorded audio (2-5 seconds):
- Use batch transcription rather than streaming, as it's designed for complete media files in S3
- Optimize audio format: Ensure your MP3 files use appropriate sampling rates (16,000 Hz offers the best balance of quality and processing efficiency)
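- Consider audio quality: Lossless formats such as FLAC or WAV generally give better transcription results. This only helps if you control the recording step, because converting an existing MP3 to a lossless format does not restore the detail MP3 compression removed.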
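If you switch to batch, a minimal boto3 sketch might look like the following. It assumes you pass in a `boto3.client("transcribe")`. It also assumes `ar-SA` is the right language code for your audio; Amazon Transcribe lists other Arabic variants such as `ar-AE`, so pick the one that matches your speakers. The polling interval and timeout are illustrative values, and `transcribe_s3_mp3` is a hypothetical helper name.

```python
import time
import uuid
from typing import Any, Optional


def transcribe_s3_mp3(client: Any, s3_uri: str, language_code: str = "ar-SA",
                      sample_rate_hz: Optional[int] = None,
                      poll_seconds: float = 1.0, timeout_seconds: float = 300.0) -> str:
    """Start a batch job for an MP3 already in S3 and return its transcript file URI."""
    job_name = f"short-clip-{uuid.uuid4().hex}"  # job names must be unique
    params = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,  # assumption: adjust to your Arabic variant
        "MediaFormat": "mp3",
        "Media": {"MediaFileUri": s3_uri},
    }
    if sample_rate_hz is not None:
        params["MediaSampleRateHertz"] = sample_rate_hz
    client.start_transcription_job(**params)

    # Poll until the job completes, fails, or the timeout expires.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        job = client.get_transcription_job(TranscriptionJobName=job_name)["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription job failed"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_name} did not finish within {timeout_seconds}s")
```

You would call it as `transcribe_s3_mp3(boto3.client("transcribe"), "s3://your-bucket/clip.mp3", sample_rate_hz=16000)`. A batch job has its own start-up and queuing overhead, so time it against your actual clips rather than assuming it will be faster.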
Regarding Your Specific Questions
A latency of 30-40 seconds is far longer than the 2-5 seconds of audio itself. That points to how the audio is processed or streamed rather than to the transcription work. The streaming service expects audio chunks to arrive in real time, so sending the entire file at once, or not chunking the audio properly, can contribute to unexpected behavior.
To find out whether Arabic needs additional processing time, or how latency differs by language and region, review the service quotas and supported-languages documentation for your specific region and language combination. Streaming support and performance can vary by both.
Sources
Streaming and partial results - Amazon Transcribe
Transcribing streaming audio - Amazon Transcribe
