Improving Speech Recognition Results

End User Frustration Factor

My previous DSL provider used a Speech Recognition system to direct calls, and it was aggravating every time I had to use it. Its prompt said something like, “Please say what you would like to do.” I would say “Pay my bill”, and it would then respond with something like:

“I understood ‘upgrade my service’…I’ll transfer you to our sales department”, or “I understood ‘technical support’…I’ll transfer you to our support department”.

After this, I’d generally just pound on the “0” key in an attempt to get routed to an operator who could manually take care of what I needed.

Voice Elements improves your Speech Recognition Results

Speech Recognition works well, but there are a few rules to keep in mind:

Better Instructions

When using Speech Recognition, you need to give clear instructions to your users. Open-ended questions can work well, but there needs to be some guidance.

For example, instead of asking users what they would like to do with little instruction, I would recommend saying something like:

“Please say what you would like to do. For example, if you would like to pay your bill, say ‘Pay my bill’. If you would like to speak to an operator, say ‘Operator’.”

This gives the customer a little more guidance as to what is expected. You can then program a grammar file with the various words or phrases that you expect the user to say.
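As a rough illustration of "various phrases or words," the sketch below keeps several phrasings per action, lists every phrase that would go into the grammar, and maps whatever text the recognizer returns back to an action. The names `PHRASES_BY_ACTION`, `grammar_phrases`, and `action_for` are hypothetical and are not part of Voice Elements. How the recognized text is actually obtained depends on your engine.

```python
# Hypothetical mapping from an action to the phrases a caller might say for it.
# None of these names come from Voice Elements; they are illustrative only.
PHRASES_BY_ACTION = {
    "pay_bill": ["pay my bill", "make a payment", "billing"],
    "operator": ["operator", "speak to an operator", "representative"],
}


def grammar_phrases(phrases_by_action):
    """Return every phrase that should be listed in the grammar file."""
    return [p for phrases in phrases_by_action.values() for p in phrases]


def action_for(recognized_text, phrases_by_action):
    """Map the text returned by the recognizer back to an action, or None."""
    normalized = recognized_text.strip().lower()
    for action, phrases in phrases_by_action.items():
        if normalized in phrases:
            return action
    return None
```

Keeping the phrases grouped by action means the prompt's examples ("Pay my bill", "Operator") and the grammar stay in sync, and an unrecognized phrase can fall through to a reprompt instead of a wrong transfer.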

Longer Words or Phrases

Another common use for Speech Recognition systems is a user directory. Often these systems will ask the customer to say the name of the person they would like to speak to. However, in English there are several names that sound a lot alike, for example:

Mary, Barry, Jerry, Harry, Larry, Carrie, etc.

There may even be multiple Marys or Jerrys at a company, and this can hurt both the performance of the Speech Recognition Engine and the user experience.

The more source data the Speech Recognition Engine has to work with, the more likely it is to return the correct result. A longer phrase gives the engine more sounds with which to tell one entry from another.

In these situations, I would recommend asking the caller for both the first and last name of the person they want to reach. You can also offer a separate option to say just the first name (perhaps they don’t know the person’s last name), which would take them to a menu that plays all of the people with that first name.

This way, when someone says “Mary Tyler Moore”, it’s much easier for the Speech Recognition Engine to recognize the correct name over a different name like “Jerry Lewis”. When dealing with non-English names (or names that are not spelled traditionally), it can be useful to use phonemes, which describe how a name is pronounced rather than how it is written.
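The directory approach described here could be sketched as follows: full names are the primary phrases, and a first-name index backs the fallback menu. The directory contents and the function names are made up for illustration. A real application would feed the phrases into its grammar and receive the spoken text from the recognizer.

```python
from collections import defaultdict

# Hypothetical directory entries; a real application would load its own data.
DIRECTORY = ["Mary Tyler Moore", "Mary Shelley", "Jerry Lewis", "Larry King"]


def index_by_first_name(full_names):
    """Group full names by first name for the 'first name only' option."""
    by_first_name = defaultdict(list)
    for name in full_names:
        first = name.split()[0]
        by_first_name[first].append(name)
    return dict(by_first_name)


def resolve(spoken, full_names, by_first_name):
    """Return the people matching what the caller said.

    One result means the call can be transferred; several results mean
    the caller should hear a menu of everyone with that first name.
    """
    if spoken in full_names:
        return [spoken]
    return by_first_name.get(spoken, [])
```

With this split, "Mary Tyler Moore" resolves to a single person, while "Mary" alone returns every Mary so the application can read the choices back to the caller.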

Reduce the Number of Words in Your Grammar File

Users often want to put every possible item into a single grammar file. This may work, but if you notice that you aren’t getting the results you expect, consider creating several prompts instead of a single prompt. The more words and phrases a grammar contains, the harder it is for the Speech Recognition Engine to determine which value is correct.

For example, let’s say you want to prompt the user to say where they would like to go on vacation. You could get anything from Phuket, Thailand to Salt Lake City, UT. Instead, you may want to first ask, “Which country would you like to visit?” to narrow the choices down to a specific country.

Generating grammar files on the fly is simple to do using Microsoft Speech Platform, and can also be done using LumenVox.
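To make the idea of small, generated grammars concrete, here is a minimal sketch that builds a W3C SRGS XML grammar from a list of phrases using only the Python standard library. The function name `build_grammar`, the rule id, and the destination data are assumptions for illustration. Whether your engine needs additional attributes (such as `tag-format` or `mode`) should be checked against its own documentation.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_grammar(phrases, rule_id="choices", lang="en-US"):
    """Build a minimal SRGS XML grammar with one alternative per phrase."""
    ET.register_namespace("", SRGS_NS)  # emit SRGS as the default namespace
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang, "root": rule_id},
    )
    rule = ET.SubElement(
        grammar, f"{{{SRGS_NS}}}rule", {"id": rule_id, "scope": "public"}
    )
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    for phrase in phrases:
        item = ET.SubElement(one_of, f"{{{SRGS_NS}}}item")
        item.text = phrase  # ElementTree escapes special characters
    return ET.tostring(grammar, encoding="unicode")


# Hypothetical two-step flow: ask for the country first, then load a much
# smaller grammar containing only that country's cities.
CITIES_BY_COUNTRY = {
    "Thailand": ["Phuket", "Bangkok"],
    "United States": ["Salt Lake City", "New York"],
}

country_grammar = build_grammar(CITIES_BY_COUNTRY.keys(), rule_id="country")
city_grammar = build_grammar(CITIES_BY_COUNTRY["Thailand"], rule_id="city")
```

Each prompt then gets a grammar with only a handful of alternatives, rather than one grammar holding every city in every country.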

Read more in this article: Create Microsoft Speech Compatible Grammar Files.

Please e-mail us if you would like help generating grammar files using LumenVox.

Why can’t it work like Siri?

I’ve gotten this question a few times. It would be great if you could just use open-ended prompts. However, there are a few technical limitations inherent in telephone systems. The codec used in the USA is 8-bit, 8 kHz mu-law (G.711). This means that audio is transferred at only 64 kbit/s, or 8 KB per second, and that frequencies above 4 kHz are lost. In contrast, the iPhone has a high-quality microphone that can record far more detailed audio, allowing for a much clearer picture of what a user is saying.
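The arithmetic behind the telephone figure can be checked with a short calculation. The comparison format (16-bit mono at 44.1 kHz) is an assumption chosen only to stand in for a typical higher-quality recording; it is not a measurement of any particular phone.

```python
def bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) audio data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


# Telephone audio: 8-bit samples at 8 kHz.
telephone = bytes_per_second(8_000, 8)  # 8000 bytes/s

# Assumed comparison format: 16-bit mono at 44.1 kHz.
higher_quality = bytes_per_second(44_100, 16)  # 88200 bytes/s

ratio = higher_quality / telephone
```

Under these assumptions the higher-quality recording carries roughly eleven times as much raw data per second. The raw rate is only part of the picture, since the 8 kHz sample rate also discards everything above 4 kHz, which is where many consonant sounds differ.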


Speech Recognition, when used appropriately, can make for a much better user experience than traditional DTMF. However, the guidelines outlined above should be considered when developing any Speech Recognition application.

Feel free to contact us with any questions.

For more information on writing a Speech Recognition application using Voice Elements, please see this article: How to Create A Simple Speech Recognition Application.