Building a Voice Agent with Cisco VXML
In this guide, you'll learn how to build a virtual agent for voice channels using the two main answer templates (audio and text) and the technical text field.
First, add a voice channel. You can do this at two different moments: when you create a virtual agent, or later, in an existing agent. In the latter case, open the "Channels" option in the side menu and then click the "Create channel" tab.
Before continuing, make sure you have read this step-by-step guide until the Welcome Flow item.
Once you're in the Channel's Library, choose the Phone category to open a modal to configure the channel in 5 steps.
Fill in the fields for Name, Description (optional), a unique DNIS for this environment (see how to configure DNIS), and Type. After you select VXML as the voice channel type, the Secret field appears. Make sure the Secret field contains exactly 16 characters.
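As a quick sanity check before pasting a value into the Secret field, the 16-character rule above can be sketched in Python. The function name `is_valid_secret` is hypothetical and is not part of eva; the only rule taken from this guide is the exact length.

```python
SECRET_LENGTH = 16  # the guide requires exactly 16 characters


def is_valid_secret(secret: str) -> bool:
    """Return True when the secret has exactly 16 characters.

    Hypothetical helper: eva may apply further rules that are not
    described in this guide.
    """
    return len(secret) == SECRET_LENGTH


if __name__ == "__main__":
    print(is_valid_secret("0123456789abcdef"))
```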
In step 2 you will fill in the settings for the call properties.
Refer to each property's section and table to understand the configurable fields used in the DNIS configuration and their reference values for the VXML type: the TTS (text-to-speech) properties used in audio and text answer templates, as well as Transfer, Fetch, and Error handling.
In the next step, set the behavior for the first answer, the Conversation Timeout (for the welcome message and for the call duration), and Regional Expressions.
In the fourth step, you'll set the TTS (text-to-speech) and audio configurations. Refer to the text and audio sections to see the default values and how to overwrite them with commands in the technical text field of answer cells.
In the last step, you'll set the values for the DTMF and DTMF VOICE menus and specify the Automatic Speech Recognition (ASR) voice provider.
The following JSON contains all the data and configurable properties you must provide to eva.
This JSON allows you to insert the default DNIS configurations, including setting up a Conversation Property (voice providers).
These properties can be modified individually within the flows by using the "technical text" field of the answer cells, as shown later in this documentation.
Building a voice agent in eva involves a few concepts that differ from a "text first" agent. The flow-building logic is the same; the difference is the consistent use of JSON in the technical text field. We'll call each JSON entry a property; each property has a command that tells the agent what to do.
Before jumping to them, let's see what an answer cell for voice agents looks like in eva.
Don't worry if you don't understand some of the terms in the following example. We'll get to all the concepts later in this chapter. 😉
Now, imagine you have an audio file with a greeting and a menu, and you want the user to choose a numbered option from the menu:
1. Click the + icon to add a cell, in this case, a Welcome flow.
2. Select the channel and choose the audio template.
3. Add the audio URL (WAV or FLAC formats).
4. Use "Add option" to create the buttons that identify the menu options.
5. After that, attach a JSON to the technical text field with the DTMF menu property, as follows:
6. Finally, click Save.
If you don't have an audio file, choose the text template to use the text-to-speech function and proceed to step 4.
The following example is an answer using the audio template. The formats supported are WAV and FLAC.
There are a few properties that you can attach to the technical text field to enrich the experience, such as allowing the user to interrupt the audio playback at any time.
JSON example:
Other audio commands that overwrite the default settings:
bargeIn (boolean): Allows users to interrupt an audio playback with a DTMF keypad input. For example, in a menu audio, the user doesn't have to wait for all the options to play before choosing one.
Text-to-speech technology receives text as input and produces speech as output. To produce audible speech for the IVR, create an answer using the text template. You can fill it either with regular text or with SSML (for this configuration, use the Speech Synthesis Markup Language Version 1.1).
When you insert regular text, the IVR plays it with the default configurations. If you want to change the default rate, pitch, or even voice, use SSML with the new configuration.
The text field has a limit of 2000 characters.
You can also overwrite the default configurations using the following JSON in the technical text field:
JSON used in the example:
In the example above, we used a mask with the variable $TEXT, which is replaced with the content you have written in the text template, so you don't have to repeat it in the XML. If the content of the answer is an XML document starting with "<speak", the default XML won't be used.
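The $TEXT mask behavior described above can be illustrated with a minimal Python sketch. This is not eva's implementation: the function name `render_tts`, the sample default XML, and the escaping of special characters are all assumptions made for illustration. Only the $TEXT substitution, the "<speak" rule, and the 2000-character limit come from this guide.

```python
from xml.sax.saxutils import escape

MAX_TEXT_LENGTH = 2000  # limit of the text field stated in this guide

# Hypothetical default XML; the real default is defined in your DNIS configuration.
DEFAULT_XML = (
    '<speak version="1.1" xml:lang="en-US">'
    '<prosody rate="medium">$TEXT</prosody>'
    "</speak>"
)


def render_tts(answer_text: str, default_xml: str = DEFAULT_XML) -> str:
    """Return the SSML that would be synthesized for a text-template answer."""
    if len(answer_text) > MAX_TEXT_LENGTH:
        raise ValueError("the text field is limited to 2000 characters")
    # An answer that already starts with "<speak" bypasses the default XML.
    if answer_text.startswith("<speak"):
        return answer_text
    # Otherwise the answer replaces the $TEXT mask (escaping is an assumption).
    return default_xml.replace("$TEXT", escape(answer_text))


if __name__ == "__main__":
    print(render_tts("Welcome to our store"))
```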
Other TTS commands that overwrite the default settings:
bargeIn (boolean): Allows users to interrupt an audio playback with a DTMF keypad input.
bargeInOffset (Long): Allows users to interact with the IVR from a specific point in the audio. For example, a value of 300 means that the user can interact with the IVR starting 300 milliseconds before the audio stops playing.
voiceProvider (String): TTS provider name. So far, only the MICROSOFT value is supported.
microsoftTtsConfig (JSON Object): Credentials to access Microsoft.
Now that we know the basics of what an answer cell for IVR looks like in eva using the audio and text templates, let's move on to the technical text field.
To use the eva-evg channel or implement a connector that will be integrated with an IVR, you need to provide some configurations. These are the properties, i.e., regular JSON attached to the technical text field.
If no properties are attached to the technical text field, the system uses the default properties.
Let's break down the properties and learn how to use them to create commands.
The menu property is mostly used when you need input from the user. You can use all available templates: audio, text, and custom.
There are three types of menu:
DTMF: allows the user to interact with the IVR by the telephone keypad
VOICE: allows the user to interact with the IVR by speech
DTMF VOICE: allows the user to interact with the IVR by both the telephone keypad and speech
Let's break down each type.
As mentioned, the DTMF menu allows the user to interact with the IVR through the telephone keypad. See the example below:
JSON used in the example:
It's possible to overwrite some configurations of the DTMF menu:
numOfDigits (int, default: 1): Number of digits to be captured.
timeout (int, default: 5500 ms): Pause timeout in milliseconds for the user to send an input (DTMF or speech).
interDigitTimeout (int, default: 3000 ms): Inter-digit timeout in milliseconds for the user to enter a DTMF input.
termTimeout (int, default: 300 ms): Timeout in milliseconds since the user's last input (DTMF or speech) before terminating the call.
termChar (String, default: #): Users can indicate when the DTMF input has finished by sending a special character. If the user enters only the # character without any digits, # is the value sent to eva; if other input is sent along with it, the # is not sent.
Timeouts: These refer to the pauses between words or phrases when speaking or when entering DTMF inputs. You can control the length of these pauses so the engine can detect when a user has finished speaking or entering the DTMF input.
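The defaults listed above and the termChar rule can be sketched in Python as follows. This is an illustration under stated assumptions: the flat dictionary shape and the helper names `effective_dtmf_settings` and `value_sent_to_eva` are hypothetical, and the exact JSON nesting eva expects in the technical text field is not reproduced here.

```python
# Default DTMF menu settings as listed in this guide (times in milliseconds).
DTMF_DEFAULTS = {
    "numOfDigits": 1,
    "timeout": 5500,
    "interDigitTimeout": 3000,
    "termTimeout": 300,
    "termChar": "#",
}


def effective_dtmf_settings(overrides: dict) -> dict:
    """Merge overrides on top of the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DTMF_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown DTMF settings: {sorted(unknown)}")
    return {**DTMF_DEFAULTS, **overrides}


def value_sent_to_eva(keys_pressed: str, term_char: str = "#") -> str:
    """Apply the termChar rule: a lone termChar is sent as-is,
    otherwise the trailing termChar is dropped from the input."""
    if keys_pressed == term_char:
        return term_char
    if keys_pressed.endswith(term_char):
        return keys_pressed[: -len(term_char)]
    return keys_pressed


if __name__ == "__main__":
    print(effective_dtmf_settings({"numOfDigits": 4}))
    print(value_sent_to_eva("1234#"))
```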
To overwrite the default settings we can enter the following JSON in the technical text.
JSON example:
Usually, a DTMF menu is used with buttons to determine the input that will be sent to eva during the conversation. For example, when the user presses "1" on the phone keypad, eva receives the value mapped to that option; in the example below, the value sent to eva was "Schedule".
As mentioned, the VOICE menu allows the user to interact with the IVR by speech. See the example below.
If you want to use the voice property without overwriting any other configuration, just attach the following JSON to the technical text, as seen in the example above:
In case you want to overwrite some default configurations, use the following commands in the technical text:
voiceProvider (String, no default): Provider name: CISCO_GOOGLE.
sensitivity (double, default: 20): Noise reduction sensitivity. Lower values lower the audio silence threshold, so more noise is recorded. Higher values raise the audio silence threshold, so louder audio is needed to trigger the recording. Valid values go from 1 to 100.
timeout (int, default: 5500 ms): Pause timeout in milliseconds for the user to send an input (DTMF or speech).
maxSpeechTimeout (int, default: 15000 ms): The maximum duration of user speech. If this time elapses before the user stops speaking, the "nomatch" event is triggered.
incompleteTimeout (int, default: 300 ms): Timeout in milliseconds the IVR will wait for a page/json fetch.
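To make the valid range of sensitivity concrete, here is a small Python sketch that merges overrides with the VOICE defaults listed above and checks the 1 to 100 range. The helper name `effective_voice_settings` and the flat dictionary shape are assumptions for illustration; they are not eva's API.

```python
# Default VOICE menu settings as listed in this guide (times in milliseconds).
VOICE_DEFAULTS = {
    "sensitivity": 20,
    "timeout": 5500,
    "maxSpeechTimeout": 15000,
    "incompleteTimeout": 300,
}


def effective_voice_settings(overrides: dict) -> dict:
    """Merge overrides on top of the defaults and validate sensitivity."""
    settings = {**VOICE_DEFAULTS, **overrides}
    # The guide states that valid sensitivity values go from 1 to 100.
    if not 1 <= settings["sensitivity"] <= 100:
        raise ValueError("sensitivity must be between 1 and 100")
    return settings


if __name__ == "__main__":
    print(effective_voice_settings({"sensitivity": 35, "voiceProvider": "CISCO_GOOGLE"}))
```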
This is what it will look like:
JSON used in the example:
As mentioned, the DTMF VOICE menu allows the user to interact with the IVR by the telephone keypad, by speech, or by both.
To overwrite the default settings, enter the following JSON in the technical text.
If you want to use the DTMF VOICE property without overwriting any other configuration, just attach the following JSON to the technical text, as seen in the example above:
Settings for the DTMF VOICE menu are the same as those used for DTMF and VOICE.
This is what it will look like:
When the menu is used with buttons, they determine the input that will be sent to eva during the conversation. For example, when the user presses "1" on the phone keypad, eva receives the associated value, a word or a phrase like "I want to buy".
Let's learn how to use buttons in the context of eva-EVG. All three answer templates for voice channels allow you to add buttons. Click "Add option" to expand the two fields for buttons: Option and Value.
The value saved in the context works as a map, helping eva identify where the user should be led.
When combined with a DTMF or DTMF VOICE menu, it's possible to associate the "Option" field with the digit and send the value to eva. For example:
Option: "1" Value: "Buy clothes"
When the user presses "1", the value actually sent to eva is "Buy clothes", leading the user to the appropriate flow.
Users may also take an alternative approach and say the number instead. So there are three input possibilities:
"1" (phone button)
"Buy clothes" (spoken)
"One" (spoken)
To cover this third option, represented by "One" in this example, you can add a Cardinal System entity (eva NLP pre-built entity for numbers) followed by a Rule cell, as seen below.
On the Rule cell you can create a condition (see example below) to segment the flow and, subsequently, add a Jump cell to said flow. Use this field to handle possible input options and help the STT recognize any variations of the spoken number.
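The three input possibilities above can be pictured with the following Python sketch. In eva, this routing is done with buttons, the Cardinal system entity, and a Rule cell; the code below is only a stand-in to show the mapping, and the names `BUTTONS`, `SPOKEN_NUMBERS`, and `resolve_input` are hypothetical.

```python
from typing import Optional

# Option -> Value pairs as configured with "Add option".
BUTTONS = {"1": "Buy clothes"}

# Stand-in for the Cardinal system entity (spoken number -> digit).
SPOKEN_NUMBERS = {"one": "1", "two": "2", "three": "3"}


def resolve_input(user_input: str, buttons: dict) -> Optional[str]:
    """Return the button value matched by a keypad digit, a spoken
    number, or the spoken value itself; None when nothing matches."""
    text = user_input.strip()
    if text in buttons:  # "1" pressed on the keypad
        return buttons[text]
    digit = SPOKEN_NUMBERS.get(text.lower())  # "One" spoken
    if digit in buttons:
        return buttons[digit]
    for value in buttons.values():  # "Buy clothes" spoken
        if value.lower() == text.lower():
            return value
    return None


if __name__ == "__main__":
    for sample in ("1", "Buy clothes", "One"):
        print(sample, "->", resolve_input(sample, BUTTONS))
```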
The transfer property is used to transfer the call to live agents.
uui (String)
dest (String): Call destination, that is, where the call will be transferred to. You can declare it as sip or tel.
Below are some examples:
To declare that you want a call to be transferred (remember to replace the information inside the quotation marks):
In the example above, the value "48656C6C6F20776F726C64" will be translated as "Hello world" by the agent.
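The hex value mentioned above can be checked with a few lines of Python. The helper names `decode_uui` and `encode_uui` are hypothetical, and UTF-8 is assumed as the character set (the sample string is plain ASCII, so the result is the same either way).

```python
def decode_uui(hex_value: str) -> str:
    """Decode a hex-encoded message into readable text."""
    return bytes.fromhex(hex_value).decode("utf-8")


def encode_uui(message: str) -> str:
    """Encode a message as uppercase hex, matching the style used in this guide."""
    return message.encode("utf-8").hex().upper()


if __name__ == "__main__":
    print(decode_uui("48656C6C6F20776F726C64"))  # prints "Hello world"
```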
The transfer configuration can be combined with other settings, for example to overwrite audio configurations when using TTS (text template). Combining configurations from different items lets you customize a menu as needed, as in the example below:
In the example above, the hex encoding is declared in the default.
Important: The transfer property has priority over menu and play silence. When you attach these commands together with transfer, the other two are ignored.
The hangup property is used to end the flow. In other words, after it, the call is terminated. Simply attach the following JSON to the technical text:
The hangup configuration can be combined with other settings, for example to overwrite audio configurations when using TTS (text template). Combining configurations from different items lets you customize a menu as needed, as in the example below:
Important: The hangup property has priority over transfer, menu, and play silence. When you attach these commands together with hangup, the other three are ignored.
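The two priority rules stated so far (hangup over transfer, menu, and play silence; transfer over menu and play silence) can be summarized in a short Python sketch. The labels "hangup", "transfer", "menu", and "playSilence" and the function `effective_commands` are hypothetical names chosen for this illustration; they are not necessarily the keys eva uses in the JSON.

```python
def effective_commands(attached: set) -> set:
    """Return the commands that remain in effect after applying the
    priority rules described in this guide."""
    if "hangup" in attached:
        # hangup overrides transfer, menu, and play silence
        return attached - {"transfer", "menu", "playSilence"}
    if "transfer" in attached:
        # transfer overrides menu and play silence
        return attached - {"menu", "playSilence"}
    return set(attached)


if __name__ == "__main__":
    print(effective_commands({"hangup", "transfer", "menu"}))
    print(effective_commands({"transfer", "playSilence"}))
```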
The recall property can be used to simulate an asynchronous delivery of the answers and also to send eva a user input that can be used to trigger a flow or validate a service.
This behavior is useful when the system requires lengthy processing and you don't want to leave the user waiting in silence, wondering if the call is still active.
💡 It's good practice to give the user feedback, such as audio with background music or informative messages.
This is how you use a recall. Add a wait-input cell after the answer you want delivered before continuing in the flow.
You can use the same parameters as those in the Conversation API to specify the user input (if it's text, code, context, intent, confidence, or entities).
In the example below, the code "357YVU" is being used as a value to validate a service.
In the following case, the intent "shopping" was triggered without the need to identify utterances. You just have to provide the name of the intent exactly as it's registered in eva.
This next example is a simpler way of using the recall property. In this scenario, eva would be called with an empty input.
The fetch property represents the waiting time for the IVR to make a new request to eva and then continue the flow. You can also overwrite the default setting in the technical text so that the change applies only to a specific execution (audio playback, TTS, etc.).
fetchTimeout (Long): The default amount of time in milliseconds the IVR will wait for a page/json fetch.
fetchAudio (String): The path to the default audio file to be used during IVR platform fetch events.
fetchAudioDelay (Long): The default value for the fetch audio delay. This is the amount of time in milliseconds the IVR will wait while transitioning and fetching resources before it starts playing the fetch audio.
fetchAudioMinimum (Long): The minimum time in milliseconds to play a fetch audio source, once started, even if the fetch result arrives in the meantime. The idea is that once the user does begin to hear a fetch audio, it should not be stopped too quickly.
fetchAudioInterval (Long): Controls the time interval between fetch audio loops. The default value is 0. A value of -1 is valid and will prevent the audio loop.
Below are some examples:
Fetch configuration
The fetch configuration can be combined with other settings, for example to overwrite audio configurations when using TTS (text template). Combining configurations from different items lets you customize a menu as needed, as in the example below:
Important: If none of the properties mentioned above (DTMF, VOICE, DTMF_VOICE, play silence, transfer, hangup, or recall) are attached, a DTMF_VOICE menu with the default configurations will be added.
The Regional Expressions property gives a contextual understanding of expressions and word variations. For example, in English it's common to say O (the letter) instead of zero when giving a phone number.
To help the STT intelligence understand this is the number 0 and not a letter, you can use a JSON file that gathers all “Regional Expressions”, as in the example:
Important: The JSON with regional expressions has to be a public file. To enable it, provide the URL in the JSON with the default configurations.
To enable this property, simply attach the following JSON to the technical text:
This way, the agent will have a better recognition of specific entities such as phone number, credit card number, etc. Bear in mind that each time a new change is made to the file, it can take up to one hour to reflect in the call.
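The idea behind regional expressions can be illustrated with a small Python sketch that maps spoken variations to digits. The format of the real JSON file is not shown in this guide, so the mapping `REGIONAL_EXPRESSIONS` and the function `normalize_spoken_digits` are purely hypothetical and only demonstrate the concept of replacing "O" with 0.

```python
# Hypothetical mapping; the real file format used by eva is not shown here.
REGIONAL_EXPRESSIONS = {"o": "0", "oh": "0"}


def normalize_spoken_digits(transcript: str, expressions: dict = REGIONAL_EXPRESSIONS) -> str:
    """Replace known regional expressions with digits and join the tokens."""
    tokens = transcript.lower().split()
    return "".join(expressions.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(normalize_spoken_digits("5 5 5 O 1 O 2"))  # prints "5550102"
```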
If you want to start the conversation with a different flow, use the following code to set the first interaction when configuring the DNIS:
See here all the properties you can use in this JSON.
You can use this scenario to change the channel, to start on a specific seasonal flow, or to handle outbound calls, for example.
When a call is interrupted unexpectedly, either because the user hung up accidentally or as a result of some system error, it's possible to configure a flow in eva so that the conversation resumes from the same point if this same user calls again in less than 5 minutes.
This setup not only enhances user experience but also refines the abandonment metric by filtering out abandoned calls and excluding those that were resumed.
To create this scenario, you'll have to:
Create a welcome answer with a transactional service to identify the call.
Create a User Journey flow specifically for this use case. Add the utterance "USER_DISCONNECTED" to your intent followed by a service cell (see image below) to identify the call and resume from the same point where it left off.
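The 5-minute rule for resuming a call can be expressed as a short Python sketch, for instance inside the service that identifies the call. The function `should_resume` and its parameters are hypothetical; only the "less than 5 minutes" window comes from this guide.

```python
from datetime import datetime, timedelta

RESUME_WINDOW = timedelta(minutes=5)  # window stated in this guide


def should_resume(disconnected_at: datetime, called_again_at: datetime) -> bool:
    """Return True when the same user calls back in less than 5 minutes."""
    elapsed = called_again_at - disconnected_at
    return timedelta(0) <= elapsed < RESUME_WINDOW


if __name__ == "__main__":
    dropped = datetime(2024, 1, 1, 10, 0, 0)
    print(should_resume(dropped, dropped + timedelta(minutes=3)))
```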
When the user doesn't interact with the virtual agent within the configured timeout, which means there isn't a DTMF or a speech input, the system sends eva the code IVR_NO_INPUT, visible on the User Messages column on Dashboards (see image below).
When it is not possible to identify or transcribe the input, the system sends eva the code IVR_NO_MATCH, visible in the User Messages column on Dashboards (see image above).
Some errors may occur during a call. Possible errors include:
Communication failure with eva due to a misconfiguration.
Failed authentication with eva.
Flow not found (when a Not Expected flow wasn't created, for example).
The use of a template not supported by the IVR channel.
There are two ways of handling them:
Redirect the call to a live agent
End the call
In both cases, we recommend delivering a message that tells the user what will happen next.
audio (String): The field must contain an audio URL in WAV or FLAC format. When this response is delivered to the IVR, it plays the audio content.
tts (String): The field content will be synthesized by the IVR. You can fill it with free text or with SSML.
transfer (boolean): If set to true, the call is transferred after the message is played. If set to false, or when the property is not specified, the call is terminated after the message. The transfer uses the default transfer settings.
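The decision made by the transfer flag can be sketched in Python as follows. The function `after_message_action` and the sample dictionaries are hypothetical; the sketch only encodes the rule stated above (true means transfer, false or missing means terminate).

```python
def after_message_action(error_config: dict) -> str:
    """Return what happens once the error message has been played."""
    # A missing "transfer" key behaves like false: the call is terminated.
    return "transfer" if error_config.get("transfer", False) else "terminate"


if __name__ == "__main__":
    print(after_message_action({"tts": "We will transfer you to an agent.", "transfer": True}))
    print(after_message_action({"tts": "Please call again later."}))
```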
uui (transfer property): A custom message that is transferred along with the call via the User-to-User SIP header. We recommend to use a .