Can you hear me now?

The limitations of voice

Finally, after decades of watching characters in science fiction movies and television programs tell computers what to do, we have voice-recognition technology. Devices like the Amazon Echo, Google Home, and Apple HomePod let users perform a variety of functions simply by saying what they want.

It’s game-changing technology, but it’s young.

With any young technology, you’re going to have limitations. Still in its infancy, voice recognition is no exception. Many of these limitations came into sharp focus during a project we recently completed for a client. While this wasn’t our first voice user interface design project (we’ve been doing them for years), it taught us many new lessons, because the technology continues to change.

The product is an iPad app that serves as a digital store room attendant for facility managers at hotels and apartment complexes. It features a voice-activated inventory management and ordering interface that allows maintenance technicians to ask if a part is available, just as they would to a human attendant.

If the part is available, the app allows the technician to verbally “check out” the part and automatically updates the inventory. Read about our work to develop this app here.

The project is a proprietary, voice-activated digital assistant. Because we built it from the ground up, without relying on existing voice-activated technology like Alexa (the product was built as a defense against Amazon), we learned a great deal about the limitations of voice technology.

Through the process of developing this app, we learned firsthand some of the key challenges related to voice operation.

New technology

The first has to do with the immaturity of the technology and its ability to recognize certain vocal commands. Human speech patterns have nearly endless variation, not only between countries and languages, but also between cities and regions within the United States. Even individuals from the same place can speak in different ways.

At this point, there simply isn’t enough information on speech patterns available to computers for them to reliably process commands from different people. That’s why iPhone users have to repeat “Hey Siri” when they set up their phones. The phone has to “learn” how you talk so it can recognize your voice and speech patterns.

That same setup process is necessary for any voice-controlled device or application. In the case of our store room app, it requires any and all technicians who will use it to input their voice so their commands will be recognized.

Determining intent

The second challenge with voice technology is determining intent. When you interact with a software application on a desktop or a touchscreen, intent is simple. You indicate you want the software to perform a function by clicking a button or a link. It’s very clear.

With voice commands, it’s far less clear, in large part because of the speech pattern recognition problem mentioned above. It’s not easy to figure out what the user wants, because people can say the same thing in different ways.
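To make the problem concrete, here is a minimal sketch of how several phrasings of the same question might be mapped to a single intent. It is an illustration only, not how our app works: the intent name, the patterns, and the parse_intent function are all hypothetical, and production systems typically rely on trained language models rather than hand-written patterns.

```python
import re

# Hypothetical patterns: three ways a technician might ask the same question.
INTENT_PATTERNS = {
    "ask_availability": [
        r"do (?:you|we) have (?:any )?(?P<part>.+)",
        r"(?:is|are) there (?:any )?(?P<part>.+?)(?: in stock| left)?",
        r"got any (?P<part>.+)",
    ],
}


def parse_intent(utterance):
    """Return (intent, part) for a transcribed utterance, or (None, None)."""
    # Normalize case and drop trailing punctuation from the transcript.
    text = utterance.lower().strip().rstrip("?.!")
    for intent, patterns in INTENT_PATTERNS.items():
        for pattern in patterns:
            match = re.fullmatch(pattern, text)
            if match:
                return intent, match.group("part")
    # Nothing matched: the app would have to ask the user to rephrase.
    return None, None


for phrase in ("Do you have any air filters?",
               "Are there any air filters left?",
               "Got any air filters?"):
    print(phrase, "->", parse_intent(phrase))
```

Even this toy version needs a separate pattern for each phrasing, and anything it has not seen before falls through to "please rephrase." A button never has that problem.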

This can mean having to go through multiple steps to confirm the user’s desires. In the case of the store room attendant app, a technician may inquire about the available inventory of a part, then vocally request their desired quantity, then confirm they are taking it. That’s a minimum of three voice commands, compared to one click on a screen.
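The three-step exchange described above can be sketched as a small state machine. This is a simplified illustration under our own assumptions, not the app's actual code: the CheckoutDialog class, the intent names, and the spoken replies are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckoutDialog:
    """Hypothetical three-step voice flow: ask, request quantity, confirm."""
    inventory: dict
    state: str = "idle"
    part: str = ""
    quantity: int = 0

    def handle(self, intent, value=None):
        """Advance the dialog by one voice command and return the reply."""
        if self.state == "idle" and intent == "ask_availability":
            in_stock = self.inventory.get(value, 0)
            if in_stock == 0:
                return f"Sorry, {value} is out of stock."
            self.part = value
            self.state = "awaiting_quantity"
            return f"We have {in_stock} of {value}. How many do you need?"

        if self.state == "awaiting_quantity" and intent == "request_quantity":
            available = self.inventory[self.part]
            if value > available:
                return f"Only {available} available. How many do you need?"
            self.quantity = value
            self.state = "awaiting_confirmation"
            return f"Check out {value} of {self.part}?"

        if self.state == "awaiting_confirmation" and intent == "confirm":
            # Only now, after three commands, does the inventory change.
            self.inventory[self.part] -= self.quantity
            self.state = "idle"
            return "Done. Inventory updated."

        return "Sorry, I didn't catch that."


dialog = CheckoutDialog(inventory={"air filter": 5})
print(dialog.handle("ask_availability", "air filter"))
print(dialog.handle("request_quantity", 2))
print(dialog.handle("confirm"))
```

Each call to handle stands for one complete spoken command, so the inventory is not touched until the third round trip.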

Speed

Those steps to determine intent lead to the third challenge with voice: speed. The time it takes to proceed through multiple layers of confirmation is an obvious drag on how quickly a computer can perform a function.

But processing the commands themselves is also slower. Clicking a button or a link sends a network request through the software, and the computer begins performing the function within milliseconds. A voice command, by contrast, can add several seconds before the network request even occurs. In computing terms, that’s essentially a lifetime.

The main reason for that again goes back to being able to recognize speech patterns and intent. The computer first has to recognize that it’s being addressed (which is the reason voice assistants are given names, like Alexa and Siri). Then it has to recognize when the command is complete, as indicated by a second or two of silence.
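The cost of waiting for silence can be sketched in a few lines. This is a simplified illustration, assuming the audio has already been reduced to one loudness value per 100-millisecond frame; the function name, threshold, and timings are hypothetical, and real voice assistants use more sophisticated detection.

```python
def find_end_of_command(frame_energies, frame_ms=100,
                        silence_threshold=0.01, required_silence_ms=1500):
    """Return the index of the frame at which the command is judged
    complete, or None if the required silence never occurs."""
    silent_ms = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # The user is (still) talking: reset the silence timer.
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            # Count silence only after speech has started.
            silent_ms += frame_ms
            if silent_ms >= required_silence_ms:
                return i
    return None


# Half a second of quiet, one second of speech, then two seconds of quiet.
frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 20
end = find_end_of_command(frames)
last_speech = 14  # index of the final speech frame in this example
print("Delay after the user stops talking:", (end - last_speech) * 100, "ms")
```

In this example the user finishes speaking one and a half seconds in, but the assistant cannot act until another second and a half of silence has passed. That wait is built into every command, before any processing begins.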

What takes almost no time at all with a screen interface can take five, 10, or even 20 seconds or more to complete with a voice assistant.

Clearly, voice technology has a ways to go. But that doesn’t mean you should ignore it. Instead, start to learn what potential it has for you.

One of the benefits of today’s technology is that it allows you to run tests and learn how it adds value and how your customers might accept it. This means you can experiment with immature technologies, like voice, without fully deploying them.

When the technology matures, it will happen fast. And you’ll be ready.