AlexaDev Tuesday: SSML vs. Audio Player In Alexa Skills

Today it’s a closer look at using SSML vs. Audio Player in Alexa skills. SSML isn’t just for speechcons or sound effects: this versatile option can also play audio clips up to 90 seconds long.

 


 

SSML: It’s Not Just For Sound Effects

Most Alexa skill developers already know they can use Speech Synthesis Markup Language (SSML) tags in their skills to play short audio snippets like sound effects or speechcons, but this capability brings other options to Alexa skill developers as well.

The SSML audio tag can insert audio of up to 90 seconds in length into your skill’s response. 90 seconds is a pretty long time: a full minute and a half. That’s as long as a medium-length TV or radio advertisement, and there are even complete songs that run 90 seconds or less.
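To make that concrete, here’s a minimal sketch of a skill response that plays one clip through the SSML audio tag. It builds the raw response JSON as a Python dict rather than using an SDK. The function name, the example URL and the HTTPS check are my own assumptions for illustration, not anything Amazon prescribes in this form.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml_audio_response(clip_url, intro_text="", end_session=True):
    """Return an Alexa response dict that plays one clip via the SSML audio tag.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Assumption: clips are hosted on an HTTPS endpoint.
    if not clip_url.startswith("https://"):
        raise ValueError("expected an https:// URL for the audio clip")

    # quoteattr escapes special characters and wraps the URL in quotes,
    # so the result is safe to use as an XML attribute value.
    audio_tag = "<audio src={} />".format(quoteattr(clip_url))
    ssml = "<speak>{}{}</speak>".format(escape(intro_text), audio_tag)

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "SSML", "ssml": ssml},
            "shouldEndSession": end_session,
        },
    }


if __name__ == "__main__":
    import json

    response = build_ssml_audio_response(
        "https://example.com/clips/jingle.mp3", "Here is your clip. "
    )
    print(json.dumps(response, indent=2))
```

The clip rides along inside the normal outputSpeech, so no extra interface or directive is involved.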

If you’d like your skill to play short audio other than speechcons or sound effects, you have a decision to make: will you use the Audio Player or SSML audio tags?

 

Audio Player Pros and Cons

The main pros for using Audio Player in an Alexa skill are:

1. Audio Player can handle audio longer than 90 seconds

2. Audio Player doesn’t require special formatting in the audio file; a standard MP3 will do

3. Audio Player supports standard audio playback commands like pause, stop and resume
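For comparison, here’s a rough sketch of what handing a file to Audio Player looks like: instead of putting the clip in the speech output, the skill returns an AudioPlayer.Play directive. The helper name, URL and token value below are placeholders I made up for the example; check the AudioPlayer Interface Reference for the full set of fields.

```python
def build_audio_player_response(stream_url, token, offset_ms=0):
    """Return an Alexa response dict that starts playback with Audio Player.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Assumption: the stream is hosted on an HTTPS endpoint.
    if not stream_url.startswith("https://"):
        raise ValueError("expected an https:// URL for the audio stream")

    play_directive = {
        "type": "AudioPlayer.Play",
        # REPLACE_ALL starts this stream immediately and clears any queue.
        "playBehavior": "REPLACE_ALL",
        "audioItem": {
            "stream": {
                "url": stream_url,
                # The token is an opaque string the skill chooses to
                # identify this stream in later playback requests.
                "token": token,
                "offsetInMilliseconds": offset_ms,
            }
        },
    }

    return {
        "version": "1.0",
        "response": {
            "directives": [play_directive],
            "shouldEndSession": True,
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(
        build_audio_player_response("https://example.com/audio/episode1.mp3", "episode-1"),
        indent=2,
    ))
```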

 

The main cons for using Audio Player are:

1. Your skill must include and gracefully handle all of these built-in AMAZON.{intent} options for audio, even if your skill doesn’t actually need to use them:

AMAZON.PauseIntent
AMAZON.ResumeIntent
AMAZON.CancelIntent
AMAZON.LoopOffIntent
AMAZON.LoopOnIntent
AMAZON.NextIntent
AMAZON.PreviousIntent
AMAZON.RepeatIntent
AMAZON.ShuffleOffIntent
AMAZON.ShuffleOnIntent
AMAZON.StartOverIntent

Even if you’re using Audio Player only to play short snippets, you will need to write error traps for all of these built-in intents.

2. Audio Player runs kind of like a standalone widget. On Alexa devices with a screen (e.g., Echo Show, Echo Spot), when Audio Player is called the user will be shown the same type of interface as if they were listening to music. However, developers are not yet allowed to customize the information the interface displays. There will be a square for album cover art shown, but it will not have an image and the developer cannot provide an image to display there. Developers also cannot supply other metadata or text to display in the Audio Player interface: you can’t use it to show the user lyrics, instructions, or any other type of information or message.

3. Related to #2 above: if you’re using Audio Player to play a single, short clip, having the Audio Player interface pop up on screen may confuse users. It may appear to the user as if your skill has closed and Amazon Music (or some other music service) is now running.
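To give a feel for the first con, here’s a bare-bones sketch of one way to cover every built-in intent in the list above without writing eleven separate handlers. How the intents are grouped, the wording of the replies, and the resume_directive parameter are all my own assumptions; a real skill would decide for itself what each command should do.

```python
# Intents that should simply halt playback in this sketch.
STOP_INTENTS = {"AMAZON.PauseIntent", "AMAZON.CancelIntent"}

# Intents this hypothetical skill has no use for but must still answer politely.
UNSUPPORTED_AUDIO_INTENTS = {
    "AMAZON.LoopOffIntent",
    "AMAZON.LoopOnIntent",
    "AMAZON.NextIntent",
    "AMAZON.PreviousIntent",
    "AMAZON.RepeatIntent",
    "AMAZON.ShuffleOffIntent",
    "AMAZON.ShuffleOnIntent",
    "AMAZON.StartOverIntent",
}


def _response(speech=None, directives=None):
    """Build a minimal Alexa response dict from optional speech and directives."""
    body = {"shouldEndSession": True}
    if speech is not None:
        body["outputSpeech"] = {"type": "PlainText", "text": speech}
    if directives:
        body["directives"] = directives
    return {"version": "1.0", "response": body}


def handle_audio_intent(intent_name, resume_directive=None):
    """Route a built-in audio intent to a graceful response.

    resume_directive is a previously saved AudioPlayer.Play directive
    (hypothetical: how a skill stores it is left open here).
    """
    if intent_name in STOP_INTENTS:
        return _response(directives=[{"type": "AudioPlayer.Stop"}])
    if intent_name == "AMAZON.ResumeIntent":
        if resume_directive is None:
            return _response(speech="There is nothing to resume right now.")
        return _response(directives=[resume_directive])
    if intent_name in UNSUPPORTED_AUDIO_INTENTS:
        return _response(speech="Sorry, this skill can't do that.")
    raise KeyError("not a built-in audio intent: " + intent_name)
```

Even in this stripped-down form, that’s a fair amount of plumbing for a skill that only wants to play one short clip.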

 

SSML Pros and Cons

The main pros for using SSML audio tags in an Alexa skill are:

1. No need to launch Audio Player, and no requirement for your skill to handle all the built-in intents related to audio playback

2. 90 seconds is a lot of recording time to work with: when you need to play audio that’s longer than a sound effect or speechcon, but still no longer than 90 seconds, SSML is Amazon’s intended solution

3. On Alexa devices with screens, the interface will not change when your SSML clip plays

 

The main cons for using SSML audio tags in an Alexa skill are:

1. SSML audio clips have specific formatting requirements: every clip must be a valid MP3 file (MPEG version 2) encoded at a bit rate of 48kbps with a sample rate of 16000Hz. If you have just a handful of clips to format, you can use the free converter tools offered by Sayspring or Jovotech. For larger batches, the open source Audacity audio editing software is free and offers an Apply Chain command, kind of like a macro creator tool, that can be used on entire folders for batch processing.

2. The 48kbps bit rate of SSML clips means the sound quality SSML delivers is lower than what you get with Audio Player – comparable to what Alexa’s voice sounds like coming out of an Echo Dot

3. If your SSML clip includes speech and you want that speech to display to the user, you will need to transcribe the words yourself and customize your skill cards to display the transcription, because the Alexa service handles SSML audio clips as opaque audio objects, not as outputSpeech text

4. There have been reports of phonemes in SSML clips causing false wakes. If your SSML clip makes Alexa think she heard the wake word, your skill may abruptly shut down and kick the user back out to the main Alexa Voice Service for a response. Remember that Alexa is no longer the only possible wake word, so your testing needs to cover each of the available wake words.

5. Standard Alexa audio commands do not work while SSML audio clips are playing: a user command of ‘stop’ will shut down your skill, and commands like ‘skip’, ‘loop’ and so on will generate an error response. Amazon’s assumption is that the user will not issue any playback commands while an SSML audio clip is playing, so skills that employ SSML audio tags are not required to handle all the standard audio AMAZON.{intent} commands. If your intended clip is longer than one minute, though, you might still want to write handlers for those intents to improve the user experience.
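If you’d rather script the formatting step from the first con than use a converter site or Audacity, something like the sketch below could work. It swaps in ffmpeg, which this post doesn’t otherwise cover, and assumes ffmpeg is installed with the libmp3lame encoder available. The function names and file names are made up for the example, and the settings are just the 48kbps and 16000Hz figures quoted above.

```python
import subprocess


def build_ffmpeg_command(source_path, output_path):
    """Return the ffmpeg argument list for an SSML-ready MP3.

    Hypothetical helper: assumes an ffmpeg build that includes libmp3lame.
    """
    return [
        "ffmpeg",
        "-i", source_path,          # input file in any format ffmpeg can read
        "-codec:a", "libmp3lame",   # encode the audio as MP3
        "-b:a", "48k",              # 48 kbps bit rate
        "-ar", "16000",             # 16000 Hz sample rate
        output_path,
    ]


def convert_clip(source_path, output_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(source_path, output_path), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try
    # on a machine without ffmpeg installed.
    print(" ".join(build_ffmpeg_command("jingle.wav", "jingle.mp3")))
```

Calling convert_clip in a loop over a folder of files would give you the same kind of batch processing as Audacity’s Apply Chain. Either way, test the converted clips in your skill before you rely on them.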

 

For shorter audio clips, consider using SSML audio tags next time.

 

Related Links

From Amazon’s Alexa Skills Kit developer documentation:
Handle Requests Sent by Alexa > Include Short Pre-Recorded Audio in your Response

Speech Synthesis Markup Language (SSML) Reference

Speechcon Reference

Alexa Skills Kit Sound Library

AudioPlayer Interface Reference

 

From the Amazon Developer Blog:

Alexa Skills Kit (ASK) Feature: Audio Streaming in Alexa Skills