The world changed profoundly when our interaction with the digital universe shifted from keystrokes to voice commands: “Hey, Siri,” “Hello Alexa,” “OK, Google” all quickly unlock doors into information and services in ways we couldn’t have imagined just five years ago.
Consumers have embraced speech-recognition technology:
- More than 60% of respondents use speech recognition technology when their hands are occupied, according to the 2016 KPCB Internet Trends Report.
- By 2018, 30% of all interactions with devices will use speech recognition, according to research company Gartner.
- Amazon’s Alexa-powered Echo products are among its most popular sellers.
The driver of this transformation is simple: People can speak up to four times faster than they can type. However, while the technology has caught on because speaking is more natural than typing, speech-recognition systems are still in their infancy. Significant challenges lie ahead if this user interface is to remain at the center of our daily digital lives.
Can you hear me now?
Consider the complexity of language. Native English-speaking adults understand an average of 22,000 to 32,000 vocabulary words and learn about one word a day, according to an American-Brazilian research project. Non-native English-speaking adults know an average range of 11,000 to 22,000 English words and learn about 2.5 words a day.
Approximately 170,000 words are used regularly by native speakers and the entire English language contains more than 1 million words, with 8,500 new words added each year. Yet most contemporary embedded speech-recognition systems use a vocabulary of fewer than 10,000 words. Accents and dialects increase the vocabulary size needed for a recognition system to be able to correctly capture and process a range of speaker variability within a single language. The gulf between technology capability and requirements is therefore vast.
Despite the gap, technology innovation hasn’t slept since the first voice-recognition technologies — IBM’s Shoebox machine and Bell Labs’ Audrey device — were introduced more than a half-century ago.
With the continually improving computing power and compact size of mobile processors, large-vocabulary engines that promote the use of natural speech are being built into OEM devices. Adoption is also picking up steam now that the footprint of such engines has been shrunk and optimized.
However, more needs to be done.
Effective speaker recognition requires segmenting the audio stream, detecting and/or tracking speakers, and identifying those speakers. The recognition engine fuses these results to make decisions more readily. For the engine to function at its full potential and to allow users to speak naturally and be understood, even in a noisy environment like a train station or airport, pre-processing techniques must be integrated to improve the quality of the audio input to the recognition system.
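The segmentation step can be illustrated with a minimal sketch. It assumes a simple energy threshold, which is only one of many possible techniques and is not described in this article as the method any particular engine uses. The function names, the 160-sample frame length, and the threshold value are all hypothetical choices for illustration.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def segment_speech(
    samples: List[float],
    frame_len: int = 160,      # hypothetical: 10 ms at a 16 kHz sample rate
    threshold: float = 0.01,   # hypothetical energy cutoff
) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) spans whose frame energy exceeds the threshold."""
    energies = frame_energies(samples, frame_len)
    segments = []
    seg_start = None
    for i, energy in enumerate(energies):
        active = energy > threshold
        if active and seg_start is None:
            seg_start = i * frame_len                    # a segment opens
        elif not active and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # the segment closes
            seg_start = None
    if seg_start is not None:
        # The stream ended while a segment was still open.
        segments.append((seg_start, len(energies) * frame_len))
    return segments
```

In a pipeline like the one described above, each returned span would then be handed to the speaker-detection and identification stages. A noisy train station or airport would defeat a fixed threshold like this one, which is why noise-reducing pre-processing matters.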
A new approach to recognition systems
The other key to improving voice recognition technology is distributed computing. Often today, voice inputs in edge and mobile devices are computed in the cloud and the results whisked back to the users. However, there are limitations to using cloud technology when it comes to its application in a real-time environment that requires privacy, security, and reliable connectivity. The world is moving quickly to a new model of collaborative embedded-cloud operation—called an embedded glue layer—that promotes uninterrupted connectivity and addresses emerging cloud challenges for the enterprise.
With an embedded glue layer, a user’s voice or visual data can be captured and processed locally, without dependence on the cloud. In its simplest form, the glue layer acts as an embedded service and collaborates with the cloud-based service to provide native on-device processing. The glue layer allows mission-critical voice tasks, where user or enterprise security, privacy and protection are required, to be processed natively on the device, and it ensures continuous availability.
Non-mission-critical tasks, such as natural language processing, can be processed in the cloud using low-bandwidth textual data as the mode of two-way transmission. The embedded recognition glue layer provides nearly the same scope as a cloud-based service, albeit as a native process. And it tightens voice security in much the same way that fingerprint data used for fingerprint recognition is stored on local devices rather than in the cloud.
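The division of labor described above can be sketched in a few lines. This is an illustrative sketch only, not an actual glue-layer API: `VoiceTask`, `GlueLayer`, and the three callables passed to it are hypothetical names, and the recognizer and cloud service are stand-ins supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VoiceTask:
    """Hypothetical description of one voice request."""
    name: str
    audio: bytes
    mission_critical: bool


class GlueLayer:
    """Hypothetical embedded glue layer: recognize on-device, send only text to the cloud."""

    def __init__(
        self,
        local_recognizer: Callable[[bytes], str],  # assumed on-device speech-to-text
        cloud_nlp: Callable[[str], str],           # assumed cloud service that accepts text
        cloud_available: Callable[[], bool],       # assumed connectivity check
    ) -> None:
        self.local_recognizer = local_recognizer
        self.cloud_nlp = cloud_nlp
        self.cloud_available = cloud_available

    def handle(self, task: VoiceTask) -> Dict[str, str]:
        # Raw audio is always processed on the device and never transmitted.
        transcript = self.local_recognizer(task.audio)
        # Mission-critical tasks stay local; so does everything when offline.
        if task.mission_critical or not self.cloud_available():
            return {"handled_by": "device", "text": transcript}
        # Non-critical tasks send only low-bandwidth text to the cloud.
        return {"handled_by": "cloud", "text": self.cloud_nlp(transcript)}
```

The design choice worth noting is that the cloud callable accepts a string rather than audio, so the voice recording itself never has to leave the device, and a lost connection degrades the service rather than interrupting it.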
This approach to voice-recognition technology will not only revolutionize applications and devices but also continue to fundamentally alter how we interact with the digital world in a safer, more secure, more productive manner.
Ali Iranpour is the Director of Mobile Strategy, Client Line of Business at Arm. He’s been with Arm since 2014 and is primarily responsible for developing the strategy around the mobile and wearables markets. Ali graduated from Lund University with a PhD, and prior to joining Arm worked for Sony Mobile and Ericsson.