Speech Recognition

Voice Elements makes it easy to build speech recognition applications. It supports two recognition engines: LumenVox and the Microsoft Speech Platform, which supports 18 languages.

Grammar Files

First, you will need to create a grammar file. A grammar file contains the list of words and phrases that you want the engine to recognize. For example, if you wanted to let users identify which state they are calling from, you could create a grammar file that lists all 50 states. Grammar files are easy to create. Here is a link that describes how to do so. Here is a link with a simple grammar file that allows the user to say “Yes”, “Agree”, “Correct”, “OK”, or “Right” to signify a “YES” response, and “No”, “False”, “Wrong”, “Incorrect”, or “Negative” to signify a “NO” response. Please note that this grammar file is only compatible with the Microsoft Speech Engine.
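To make the mapping in that grammar file concrete, here is a minimal C# sketch of the same word-to-response table. It is not the grammar file itself and not part of the Voice Elements API: the YesNoWords class and its Normalize method are hypothetical names used only for illustration. It could be useful in unit tests, or if your application receives the spoken word and you want to reduce it to a “YES” or “NO” response yourself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Voice Elements): mirrors the word list
// described for the sample Yes/No grammar file.
public static class YesNoWords
{
    // Case-insensitive lookup from a spoken word to the response it signifies.
    private static readonly Dictionary<string, string> Responses =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Yes", "YES" }, { "Agree", "YES" }, { "Correct", "YES" },
            { "OK", "YES" }, { "Right", "YES" },
            { "No", "NO" }, { "False", "NO" }, { "Wrong", "NO" },
            { "Incorrect", "NO" }, { "Negative", "NO" },
        };

    // Returns "YES", "NO", or null when the word is not in the list.
    public static string Normalize(string spokenWord)
    {
        if (string.IsNullOrWhiteSpace(spokenWord))
        {
            return null;
        }

        string response;
        return Responses.TryGetValue(spokenWord.Trim(), out response) ? response : null;
    }
}
```

With this in place, YesNoWords.Normalize("agree") yields "YES", and a word outside the list yields null so the caller can be prompted again.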

Writing the code

The best way to learn how to use Speech Recognition is to see actual code that demonstrates it. Below is an example of a simple speech recognition application:

        public void RunScript()
        {
            try
            {
                // The directory where we search for files comes from a setting in the config file.
                // Path.Combine adds the directory separator if the setting does not end with one.
                string gramFile = System.IO.Path.Combine(VoiceApp.Properties.Settings.Default.Directory, "YesNo.Gram");
 
                // We will loop up to 3 times to get an accepted response.
                for (int i = 0; i < 3; i++)
                {
                    VoiceResource.SpeechRecognitionEnabled = true;
                    VoiceResource.SpeechRecognitionGrammarFile = gramFile;
                    VoiceResource.SpeechRecognitionMode = VoiceElements.Interface.SpeechRecognitionMode.MultiplePlays;
 
                    // Set how long we wait before timing out
                    VoiceResource.MaximumTime = 10;
 
                    // This determines if we allow the user to "Interrupt" the play, by saying something early.
                    VoiceResource.SpeechRecognitionPermitBargeIn = true;
 
                    // Determine how many digits we should accept
                    VoiceResource.MaximumDigits = 1;
 
                    VoiceResource.PlayTTS("Please say Yes or No, or press 0 to be transferred to an operator");
 
                    TerminationCode tc = VoiceResource.GetResponse();
 
                    // We should disable Speech Recognition now, otherwise it could be inadvertently used on subsequent plays
                    VoiceResource.SpeechRecognitionEnabled = false;
 
                    Log.WriteWithId(DeviceName, "TC: {0} DigitBuffer: {1} Score: {2} Word: {3}", tc, VoiceResource.DigitBuffer, VoiceResource.SpeechRecognitionScore, VoiceResource.SpeechRecognitionReturnedWord);
                    if (tc == TerminationCode.Speech)
                    {
                        // Scores range between 0 and 1000. Generally, anything above 700 could be considered a strong positive detection.
                        if (VoiceResource.SpeechRecognitionScore >= 700)
                        {
                            string word = VoiceResource.SpeechRecognitionReturnedWord;
                            VoiceResource.PlayTTS("You said: " + word + ". Goodbye");
                            break;
                        }
                    }
                    else if (VoiceResource.TerminationCodeFlag(TerminationCode.Digit) || VoiceResource.TerminationCodeFlag(TerminationCode.MaximumDTMF))
                    {
                        if (VoiceResource.DigitBuffer == "0")
                        {
                            // Transfer to operator
                            break;
                        }
                    }
 
                    VoiceResource.PlayTTS("I'm sorry, I did not understand your response");
                }
            }
            catch (HangupException)
            {
                Log.WriteWithId(DeviceName, "Caller hung up.");
            }
            catch (Exception ex)
            {
                Log.WriteException(ex, "Unexpected exception: {0}", DeviceName);
            }
            finally
            {
                try
                {
                    ChannelResource.Disconnect();
                }
                catch{}
 
                try
                {
                    ChannelResource.Dispose();
                }
                catch{}
            }
        }

Key Notes

There are a few key pieces in this application:

SpeechRecognitionPermitBargeIn

You can allow a user to “barge in” while a prompt is playing, so that the user doesn’t have to wait until the audio has finished before responding. However, this can sometimes hurt Speech Recognition performance. For example, if a user is on a speakerphone, the engine may pick up trailing audio from the prompt that you are playing back.

Determining What Score to Use

The score returned by the Speech Recognition engine ranges from 0 to 1000. Generally, anything above 700 can be considered a strong positive match. However, you will want to do some testing of your own to determine an appropriate threshold, because the score is influenced by the number of words in the grammar file, and it can drop when a word sounds similar to other words or phrases in the same grammar file.
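One way to keep the threshold easy to tune is to put the decision in a small helper rather than hard-coding 700 throughout your scripts. The sketch below is only an illustration: ScorePolicy, ScoreDecision, and the middle “Confirm” band with its value of 400 are assumptions of ours, not part of the Voice Elements API, and the numbers should be adjusted based on your own testing.

```csharp
using System;

// Hypothetical names, not part of Voice Elements.
public enum ScoreDecision { Accept, Confirm, Reject }

public static class ScorePolicy
{
    // 700 follows the guidance above; 400 is an assumed starting point for
    // "ask the caller to confirm" and should be tuned through testing.
    public const int AcceptThreshold = 700;
    public const int ConfirmThreshold = 400;

    public static ScoreDecision Decide(
        int score,
        int acceptThreshold = AcceptThreshold,
        int confirmThreshold = ConfirmThreshold)
    {
        if (score < 0 || score > 1000)
        {
            throw new ArgumentOutOfRangeException(nameof(score), "Scores range from 0 to 1000.");
        }
        if (confirmThreshold > acceptThreshold)
        {
            throw new ArgumentException("confirmThreshold must not exceed acceptThreshold.");
        }

        if (score >= acceptThreshold) return ScoreDecision.Accept;   // strong match
        if (score >= confirmThreshold) return ScoreDecision.Confirm; // plausible; ask again
        return ScoreDecision.Reject;                                  // treat as not understood
    }
}
```

In the script above, you could call ScorePolicy.Decide with VoiceResource.SpeechRecognitionScore (assuming the score is an integer) and only accept the word when the result is Accept.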

Detecting Digits and Speech

Often, you will want to allow users to either enter digits or speak. The code example above shows how to do so. When you accept both Speech and Digits, get the termination code and handle the response according to its value. In the example above, if the user presses a digit, Voice Elements returns a termination code of “Digit” or “MaximumDTMF”. Alternatively, if the user speaks, the termination code will be “Speech”.
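The branching described above can also be pulled out into a small function, which keeps the decision logic separate from the telephony calls and makes it testable without a live call. This is a minimal sketch under stated assumptions: ResponseRouter and CallerAction are hypothetical names, and the two boolean parameters stand in for the termination code checks made in the script.

```csharp
using System;

// Hypothetical names, not part of Voice Elements.
public enum CallerAction { Confirmed, TransferToOperator, Retry }

public static class ResponseRouter
{
    // isSpeech: the termination code was "Speech".
    // isDigit:  the termination code was "Digit" or "MaximumDTMF".
    public static CallerAction Route(
        bool isSpeech,
        bool isDigit,
        int score,
        string digitBuffer,
        int minimumScore = 700)
    {
        if (isSpeech)
        {
            // Only accept speech when the engine is confident enough.
            return score >= minimumScore ? CallerAction.Confirmed : CallerAction.Retry;
        }

        if (isDigit && digitBuffer == "0")
        {
            // The prompt offers 0 for an operator.
            return CallerAction.TransferToOperator;
        }

        // Timeouts, other digits, and anything else: prompt again.
        return CallerAction.Retry;
    }
}
```

Inside the loop, you would pass tc == TerminationCode.Speech for isSpeech and the combined TerminationCodeFlag checks for isDigit, then break out of the loop on anything other than Retry.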

Next Steps

This is a very simple example of using Speech Recognition technology. If there is something that you would like to develop but you aren’t sure how to start, or if you have questions, feel free to contact us at support@inventivelabs.com.