
Nova Sonic with Text and Voice


I would like to use Nova Sonic in a conversation with the user that could be voice to voice, voice to text, or text to text, depending on what the user needs. For example, if they are in a noisy environment with other people around, it might be useful for the user to mute their microphone and type input instead.

I can't seem to get the Nova Sonic agent to process text input once a voice conversation has started. Is there a way to set up or configure the agent conversation so that text input is also processed by the agent during a voice conversation?

asked a month ago · 157 views
1 Answer

To enable Nova Sonic to handle both voice and text inputs during the same conversation, you'll need to understand how the bidirectional streaming API works and modify your event flow accordingly.

Nova Sonic uses an event-driven architecture with structured input and output events. When you want to switch from voice to text input during an ongoing conversation, you need to follow the same event pattern but change the content type.
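
As a rough sketch of what that event flow looks like over the wire (the field names follow the input-event documentation, but treat this outline as illustrative rather than a complete payload reference):

```python
# Rough shape of one Nova Sonic session over the bidirectional stream.
# Every message is a JSON envelope of the form {"event": {"<eventType>": {...}}}.
#
#   sessionStart                                  -> inference configuration
#   promptStart                                   -> output (voice/text) configuration
#     contentStart / textInput / contentEnd       -> e.g. the SYSTEM prompt
#     contentStart / audioInput ... / contentEnd  -> a spoken user turn
#     contentStart / textInput / contentEnd       -> a typed user turn
#     ... further turns, in either modality ...
#   promptEnd
#   sessionEnd
import json

def envelope(event_type: str, payload: dict) -> str:
    """Wrap a payload in the {"event": {...}} envelope before writing it to the stream."""
    return json.dumps({"event": {event_type: payload}})
```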

Here's how you can implement this functionality:

  1. For voice input, you typically send audio frames as audioInput events within a content container (started with contentStart and ended with contentEnd).

  2. To switch to text input during the same conversation, you would (see the payload sketch after this list):

  • Close any open audio stream with a contentEnd event (if you were previously using voice)
  • Send a new contentStart event with type TEXT and the role set to USER
  • Send a textInput event containing the user's typed message
  • Send a contentEnd event to close this text input segment

  3. You can then switch back to voice input later by following the same pattern but opening an AUDIO content block and sending audioInput events instead.
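
Here's a minimal sketch of that text-input sequence in Python. The field names come from the input-event documentation; promptName must match the value used when the prompt was started, contentName is a fresh identifier for this block, and both values below are placeholders, so verify the exact payloads against the current docs and your SDK:

```python
import json
import uuid

prompt_name = "my-prompt"          # must match the promptName sent in promptStart
content_name = str(uuid.uuid4())   # fresh identifier for this text block

text_turn = [
    # 1. Open a new content block as TEXT, attributed to the USER role.
    {"event": {"contentStart": {
        "promptName": prompt_name,
        "contentName": content_name,
        "type": "TEXT",
        "interactive": True,
        "role": "USER",
        "textInputConfiguration": {"mediaType": "text/plain"},
    }}},
    # 2. Send the typed message.
    {"event": {"textInput": {
        "promptName": prompt_name,
        "contentName": content_name,
        "content": "I've muted my mic, can we continue by text?",
    }}},
    # 3. Close the block so the model treats the turn as complete.
    {"event": {"contentEnd": {
        "promptName": prompt_name,
        "contentName": content_name,
    }}},
]

for event in text_turn:
    print(json.dumps(event))  # in practice, write each event to the open stream
```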

The key is maintaining the same conversation session while changing the input modality. Since Nova Sonic's architecture is designed to process different types of content within the same conversation flow, you can alternate between voice and text inputs as needed.
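
Switching back to voice follows the same pattern with an AUDIO content block. The audioInputConfiguration values below (16 kHz, 16-bit, mono LPCM with base64-encoded chunks) are the values commonly shown in the Nova Sonic examples, so confirm them against your own capture pipeline:

```python
import uuid

prompt_name = "my-prompt"               # same prompt as the rest of the conversation
audio_content_name = str(uuid.uuid4())  # fresh identifier for the new audio block

audio_content_start = {"event": {"contentStart": {
    "promptName": prompt_name,
    "contentName": audio_content_name,
    "type": "AUDIO",
    "interactive": True,
    "role": "USER",
    "audioInputConfiguration": {
        "mediaType": "audio/lpcm",
        "sampleRateHertz": 16000,
        "sampleSizeBits": 16,
        "channelCount": 1,
        "audioType": "SPEECH",
        "encoding": "base64",
    },
}}}

# Microphone frames then go out as repeated audioInput events carrying
# base64-encoded chunks, and the block is closed with a contentEnd.
```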

Make sure your application's UI provides clear options for users to switch between input modes, and that your backend properly handles the transition between these modes by sending the appropriate event sequences to Nova Sonic.
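
As a usage sketch for that backend piece, a mute handler can close any in-progress audio block before submitting the typed turn. send_event below is a stand-in for however your application writes one JSON event to the open bidirectional stream; it is not part of any SDK:

```python
import uuid
from typing import Callable, Optional

def on_mute_and_type(send_event: Callable[[dict], None], prompt_name: str,
                     open_audio_content: Optional[str], typed_text: str) -> None:
    """Close an in-progress audio block (if any), then submit a typed user turn."""
    if open_audio_content is not None:
        # Finish the voice content block that was being streamed.
        send_event({"event": {"contentEnd": {
            "promptName": prompt_name, "contentName": open_audio_content}}})

    # Start a new TEXT content block for the typed message, then close it.
    content_name = str(uuid.uuid4())
    send_event({"event": {"contentStart": {
        "promptName": prompt_name, "contentName": content_name,
        "type": "TEXT", "interactive": True, "role": "USER",
        "textInputConfiguration": {"mediaType": "text/plain"}}}})
    send_event({"event": {"textInput": {
        "promptName": prompt_name, "contentName": content_name,
        "content": typed_text}}})
    send_event({"event": {"contentEnd": {
        "promptName": prompt_name, "contentName": content_name}}})
```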
Sources
Handling input events with the bidirectional API - Amazon Nova
Using the Amazon Nova Sonic Speech-to-Speech model - Amazon Nova

answered a month ago
