3 Challenges To Designing A Voice Interface

Here's advice on how to overcome these challenges, as we increasingly rely on voice to interact with connected, smart devices.



Today we're using our voice to engage with our smartphones and cars, and in the not-so-distant future, voice interfaces will extend to other areas of our lives, perhaps even to our favorite appliances as part of a more intelligent connected home. Your kitchen, for example, could become a voice-enabled control center of sorts for the entire house.

But designing for a voice interface -- and integrating voice as part of the overall device experience -- requires different thinking than designing for the keyboard, mouse, and touchscreen. As designers for Nuance, a global leader in voice and natural language technologies, we're focused on this fundamental challenge. For companies looking to incorporate voice interfaces into their products, we see three fundamental adoption challenges that designers should focus on overcoming: a lack of trust, discovery issues, and simple usability concerns.

These issues matter because voice and natural language understanding have become table stakes for device interactions. The technology is now highly accurate and sophisticated, with elements of artificial intelligence that make conversations feel intuitive and natural.


Even the most sophisticated speech system in the world, however, will fail if it does not support users the way they expect. To deliver a better voice experience, we can combine fundamental design concepts with an understanding of natural conversational principles to build systems that listen, understand, and respond with relevant information. Here's how to overcome these three broad challenges.

Challenge #1: Lack of trust
When people talk, there is a natural cadence that leads us from start to finish within a conversation. A chat begins with input, which could be a nudge for attention ("Hey!") or a request ("Is there a coffee shop around here?"). The other party will recognize ("He said, 'Is there a ...'"), interpret this ("He's looking for a local place to get coffee ..."), and respond ("Travis Café is five minutes down the street"), based on contextual knowledge like location.

Virtual personal assistants should offer the same, because the closer an experience follows the path of natural conversation, the more trusted and understandable it will be. In our coffee shop conversation, we also may want to continue the dialogue to get more information -- is Travis Café popular among the locals? How do I get there from here? Allowing personal assistants to understand context and navigate an extended dialogue is all part of the design process.
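To make this concrete, here is a minimal sketch -- not Nuance's implementation, just an illustration under simple assumptions -- of that recognize, interpret, and respond loop, with a small context object so a follow-up like "How do I get there?" can resolve against the previous answer. The function names, keyword checks, and hard-coded answers are all hypothetical.

```python
# A minimal sketch, assuming text in and text out, of the recognize ->
# interpret -> respond loop described above. The function names, the
# keyword checks, and the hard-coded answers are all hypothetical.

def recognize(audio):
    # A real system would run speech recognition here; we pass text through.
    return audio

def interpret(utterance, context):
    text = utterance.lower()
    if "coffee" in text:
        return {"intent": "find_place", "category": "coffee shop"}
    if "get there" in text and "last_place" in context:
        return {"intent": "directions", "place": context["last_place"]}
    return {"intent": "unknown"}

def respond(request, context):
    if request["intent"] == "find_place":
        context["last_place"] = "Travis Cafe"  # would come from a local search
        return "Travis Cafe is five minutes down the street."
    if request["intent"] == "directions":
        return "Head up the street to reach " + request["place"] + "."
    return "Sorry, I didn't catch that."

context = {}
for turn in ["Is there a coffee shop around here?", "How do I get there?"]:
    print(respond(interpret(recognize(turn), context), context))
```

The point of the sketch is the carried-over context: the second turn only makes sense because the assistant remembers which place it just recommended.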

An abrupt, unexplained end to a dialogue at an unexpected point -- often because the system lacks the data or capability to continue -- hurts trust in a virtual personal assistant. It's OK to end conversations or refer users to other sources to continue; even humans don't know everything. The key is to establish a framework that users can recognize and understand.

(Source: Alex Washburn of Wired, under Creative Commons license)

One of the biggest barriers to a voice system achieving trust is inconsistency. Product designers are applying voice and natural language to devices that already boast established and accepted input methods, and a key step toward trust will be voice technologies first replicating and then improving upon these established methods. When using a television, for example, pressing the "Guide" button on the remote brings up a corresponding interface. When voice is incorporated, it is vital that a spoken request for the "Guide" brings up the same interface. Once users understand these consistencies between input methods, they will develop greater trust in the system. Once designers build that trust, they can turn to offering a better experience than that handheld remote through more sophisticated natural language and reasoning capabilities.
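One way to picture that consistency is to route the spoken request and the remote's button press through the exact same handler. The sketch below is a hypothetical illustration, not a description of any real TV platform.

```python
# A hypothetical sketch of the consistency point above: the spoken request
# for the "Guide" routes to the exact same handler as the remote's Guide
# button, so both inputs always produce the same interface.

def show_guide():
    print("Opening the program guide...")

BUTTON_HANDLERS = {"GUIDE": show_guide}
VOICE_INTENTS = {"guide": show_guide, "show me the guide": show_guide}

def on_button_press(button):
    BUTTON_HANDLERS[button]()

def on_voice_command(utterance):
    handler = VOICE_INTENTS.get(utterance.strip().lower().rstrip("."))
    if handler is not None:
        handler()

on_button_press("GUIDE")               # remote control path
on_voice_command("Show me the guide")  # voice path, same handler
```

Because both paths call the same function, the spoken "Guide" can never drift out of sync with the button.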

Challenge #2: Discovery
Quite simply, people need to know that they can speak to a system, and what kinds of things they can say. Basic identification of speech may be simple -- a microphone icon is straightforward and recognizable -- but guiding users around what they can say is often more challenging.

In some cases, a proactive introduction could be a useful solution. If, for example, a personal assistant is the main way users will interact with the device, the assistant might introduce itself during device setup, engaging the user through voice interactions right from the start.

So, what can I say to a device? With natural language, the power and challenge are one and the same: You can say anything.

Context is important and an integral part of a well-designed speech system. For instance, if speech is part of a pizza-ordering application, it will probably only support pizza-related conversation. For applications with broader scope, like personal assistants, the challenge is greater: these systems need to rely on context and user insights to keep the dialogue productive, without confining users to a narrow set of supported requests.
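A rough sketch of that scoping for the pizza example might look like the following; the keyword matching is a deliberately crude stand-in for a real natural language model, and the phrases are invented.

```python
# A rough sketch of domain scoping for the pizza example: in-domain requests
# are handled, everything else gets a graceful out-of-scope response instead
# of a wrong guess. The keyword set stands in for a real language model.

PIZZA_KEYWORDS = {"pizza", "topping", "toppings", "crust", "delivery", "order"}

def classify(utterance):
    words = set(utterance.lower().replace("?", "").split())
    return "order_pizza" if words & PIZZA_KEYWORDS else "out_of_scope"

def handle(utterance):
    if classify(utterance) == "order_pizza":
        return "Sure -- what size and toppings would you like?"
    return "I can only help with pizza orders here. What would you like to order?"

print(handle("I'd like a large pepperoni pizza"))
print(handle("What's the weather like tomorrow?"))
```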

Remember also that initiating human-to-human conversation is a two-way street -- we engage in conversations we're invited to by others, not just ones that we start. And it's often in the conversations that others start that we receive new, sometimes surprising and delightful, information.

Starting a dialogue means we're looking for something and expect a response, and this is largely the premise for our interactions with personal virtual assistants. However, today's assistants are becoming much more anticipatory and proactive, offering up information that we're likely interested in without having asked for it, such as sports scores, music recommendations, or an urgent email. Such proactivity can further reduce the challenge of discovering what you can talk to a personal assistant or voice system about.

With that possibility in mind, it's important to design thoughtful systems. They should use context to deliver proactive insight at the right times -- providing traffic updates when you're heading out, not in the middle of the night when you're sleeping. If they're offering to read out news headlines, they should do it when you're getting into the car, not when you're stepping into a meeting.
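As a hypothetical illustration, such gating can be as simple as a rule that checks the time of day and what the user appears to be doing before offering anything; the specific rules, hours, and message below are assumptions for the sketch.

```python
# A hypothetical sketch of gating a proactive suggestion on context: traffic
# updates are offered only when the user is heading out, and never overnight.
# The rules, hours, and message are illustrative assumptions.

from datetime import datetime

def should_offer_traffic_update(now, user_is_heading_out):
    if 0 <= now.hour < 6:           # never interrupt in the middle of the night
        return False
    return user_is_heading_out      # otherwise, only when the user is leaving

if should_offer_traffic_update(datetime.now(), user_is_heading_out=True):
    print("Heads up: traffic is heavy on your usual route.")
```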

Challenge #3: Usability
As natural language systems build this trust and become easy to discover, people will experiment with them and make requests that aren't supported. Systems must be flexible enough to account for the unknown inquiry. For instance, a person may direct their avatar in a computer game to choose a weapon by saying, "The cheapest one!" However, the system needs to know which option is the cheapest within a menu of reasonable choices, versus only understanding a command to select a rock.
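A minimal sketch of that kind of resolution follows: the superlative "the cheapest one" is interpreted against whatever options are currently on the menu, rather than being a fixed command. The weapon names and prices are invented for illustration.

```python
# A minimal sketch of resolving "The cheapest one!" against the options that
# are actually on the menu, rather than treating it as a fixed command.
# The weapon names and prices are invented for illustration.

weapons = [
    {"name": "sword", "price": 120},
    {"name": "bow", "price": 80},
    {"name": "rock", "price": 5},
]

def choose(utterance, options):
    text = utterance.lower()
    if "cheapest" in text:
        return min(options, key=lambda o: o["price"])
    if "most expensive" in text:
        return max(options, key=lambda o: o["price"])
    for option in options:          # fall back to a direct name match
        if option["name"] in text:
            return option
    return None

print(choose("The cheapest one!", weapons))  # resolves to the rock, by price
```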

Further, systems should be designed to provide responses to known-unknown scenarios through a conversational dialogue, such as, "I'm sorry, I can't create a playlist for you just yet," and then explain what elements are missing.

But we can't anticipate everything. When we communicate with other people, sometimes we don't understand the other party, don't know how to help them, or simply can't hear them. But we can typically resolve those issues through the conversation. The same should be part of voice experience design: the system should identify what went wrong so users understand how to re-engage -- or, better yet, redirect the conversation so it can still complete the task.
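In code form, that recovery behavior might amount to mapping each failure mode to a response that says what didn't work and how to continue; the error categories and wording below are illustrative assumptions, not any product's actual dialogue.

```python
# An illustrative sketch of the recovery behavior described above: each failure
# mode maps to a response that says what didn't work and how to continue,
# instead of ending the conversation abruptly. Categories and wording are
# assumptions for the sketch.

def recover(error_kind):
    responses = {
        "unsupported": "I'm sorry, I can't create a playlist for you just yet. "
                       "I can play an existing playlist if you name one.",
        "no_audio": "I didn't hear anything. Could you say that again?",
        "low_confidence": "I'm not sure I understood. Did you want music or news?",
    }
    return responses.get(error_kind, "Something went wrong -- let's try that again.")

print(recover("unsupported"))
```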

People often bring experiences with previous speech systems with them -- including ones that require specific voice commands. A virtual personal assistant may be listening for "Hey, can you put on some jazz for me?" but instead the user might say, "Play ... music ... jazz." A true natural language system shouldn't prescribe what "natural" means, and should support "simple" requests as well as full dialogues. Natural conversation is in the eye, or mind, of the beholder.
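A small sketch of accepting both phrasings for the same intent follows; real systems use trained natural language models, so the keyword spotting here is only a stand-in, and the genre list is invented.

```python
# A small sketch of accepting both terse and conversational phrasings for the
# same intent. Real systems use trained natural language models; the keyword
# spotting and genre list here are stand-ins for illustration.

def parse_music_request(utterance):
    text = utterance.lower()
    if "play" not in text and "put on" not in text:
        return None
    for genre in ("jazz", "rock", "classical"):
        if genre in text:
            return {"intent": "play_music", "genre": genre}
    return {"intent": "play_music", "genre": None}

print(parse_music_request("Hey, can you put on some jazz for me?"))
print(parse_music_request("Play ... music ... jazz."))  # both yield the same intent
```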

As we think about what else makes for a successful speech interaction, we need to keep in mind the importance of the two tracks we have in conversational feedback. In face-to-face conversations, people subconsciously follow a content track and a management track. The content track manages ideas -- recognize, interpret, and respond. The conversational management track is where we monitor the other party -- can they hear us, are they attentive, are they confused? If anything in this management track goes awry, we can solicit feedback or change the dialogue to get back to the content.

People look for the same feedback from virtual personal assistants and voice systems. It should be very obvious when the system is available and attentive, and when it is listening, processing, understanding, and responding. If people aren't sure when a system is listening, for example, they won't know when to talk, resulting in partial speech being captured and misrecognitions -- and of course, frustration.
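One way to reason about this feedback is as a small state machine in which every state change triggers a visible or audible cue. The states and cues below are illustrative assumptions, not a description of any particular product.

```python
# An illustrative state machine for surfacing system status, so users always
# know when to talk. The states and the cues tied to them are assumptions,
# not a description of any particular product.

from enum import Enum, auto

class AssistantState(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    RESPONDING = auto()

CUES = {
    AssistantState.IDLE: "dim icon",
    AssistantState.LISTENING: "pulsing icon and a short chime",
    AssistantState.PROCESSING: "spinner",
    AssistantState.RESPONDING: "spoken reply plus an on-screen transcript",
}

def set_state(state):
    # Every state change triggers a visible or audible cue.
    print(state.name + ": " + CUES[state])

for state in (AssistantState.LISTENING, AssistantState.PROCESSING,
              AssistantState.RESPONDING, AssistantState.IDLE):
    set_state(state)
```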

When we interact with other people, we respond to the physical and visual cues of our partners -- "Take that small orange piece, and place it next to the wheel here." We communicate with our bodies, our hands, our expressions, and our words.

This is where designing voice interfaces with context is key. We increasingly expect devices and their personal virtual assistants to have a basic understanding of us and the world -- to know where we are, and what we are doing or just did -- and to surface that knowledge through responses. We expect existing modalities -- touch, mouse, gesture, and others -- to coexist with speech. Browsing and selecting a photo may be easier via touch, but texting it may be easier through speech. Speech should not be viewed as the sole solution; it should work with, and keep up with, other input methods for a holistic conversational experience.


Tim Lynch leads all design activities for Nuance Communications' Mobile-Consumer division, encompassing a range of devices, including smartphones, televisions, the connected car, wearables, and many others.