Digital assistants: neural networks drive voice AI growth
Voice-enabled applications rank high among the big trends in tech. Companies have made great progress in using voice technology as an authentication tool. In healthcare, speech recognition services are streamlining workflows and saving clinicians hours of administration. And a fascinating third piece of the puzzle is the rise of voice AI for creating advanced and highly customizable digital assistants – a technology that has its origins in query-by-singing/humming software for smartphones.
Voice-based future
“When we came out of stealth mode in 2015, we knew that our technology could do so much more,” James Hom, Chief Product Officer and co-founder of SoundHound, told TechHQ. “We’ve always had a vision that people could talk to their devices and have spent a lot of time making sure that our models are robust.” Today, much as Hom predicted, SoundHound’s technology has become a platform for a wide range of voice-enabled applications.
In automotive, software engineers are working with OEMs to develop much more accurate voice assistants that can provide information to drivers and passengers based on natural speech queries. Clients include Honda, Hyundai, Kia, Mercedes, and Stellantis, which owns Alfa Romeo, Chrysler, Jeep, and Maserati, to name just a few brands in the group. Using voice to enter destinations and query navigation details means that drivers can keep their hands on the wheel and eyes on the road ahead. “Voice in automotive is a no-brainer,” said Hom.
Typically, voice-enabled applications rely on a two-step operation to make sense of spoken queries. The first step – automatic speech recognition – involves comparing a phoneme sequence against a pronunciation dictionary. In the second step, natural language processing is used to derive meaning from the recognized words. But this process takes time and doesn’t always provide the results that users expect. To improve on this, SoundHound’s team instead applies a neural network architecture to map input sequence data (the voice query) into output sequence data (a structured request). The method provides an efficient way of separating the intent from the variables in a spoken query – for example, the intent could be ‘tell me the weather’ with the variable ‘in London’.
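To make the idea of a structured request concrete, here is a minimal Python sketch of what an intent-plus-variables output might look like. It is an illustration only: the `StructuredRequest` type, the intent names, and the regular-expression rules are all hypothetical, and the hand-written rules merely stand in for the trained neural network described above. Nothing here reflects SoundHound's actual implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StructuredRequest:
    """Hypothetical output format: one intent plus named variables."""
    intent: str
    variables: dict = field(default_factory=dict)


# Toy rules standing in for a trained sequence-to-sequence model.
# Each entry pairs an assumed intent name with a regex whose named
# groups become the variables of the request.
RULES = [
    ("get_weather",
     re.compile(r"\bweather\b(?:.*?\bin (?P<location>[a-z ]+))?")),
    ("navigate",
     re.compile(r"\b(?:navigate|take me|directions) to (?P<destination>[a-z0-9 ]+)")),
]


def to_structured_request(query: str) -> StructuredRequest:
    """Map an already-transcribed query to an intent and its variables."""
    text = query.lower().strip(" ?.!")
    for intent, pattern in RULES:
        match = pattern.search(text)
        if match:
            # Keep only the variables that were actually spoken.
            variables = {k: v.strip() for k, v in match.groupdict().items() if v}
            return StructuredRequest(intent, variables)
    return StructuredRequest("unknown")


if __name__ == "__main__":
    print(to_structured_request("Tell me the weather in London"))
    print(to_structured_request("Take me to 221b Baker Street"))
```

The point of the sketch is the shape of the output: the same intent can be reused with different variables, so ‘the weather in London’ and ‘the weather in Paris’ differ only in the `location` value.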
Acoustic upgrade
There are other advantages too, such as the ability to tune the model so that it can perform well not just in a quiet setting, where voice commands can be clearly understood, but in noisy environments too. Engineers can use pilot data to compensate for the audio characteristics of different vehicles, for example. It also means that the technology is well suited for food ordering systems deployed in busy restaurants – another application that’s on the rise for voice AI.
Hom points out that use cases can even be combined – for example, a driver could ask the in-vehicle voice assistant to find restaurants nearby and then, after selecting one, choose items off the menu by piping the audio through to the automated food ordering system at the end-destination. Smart TVs are another application where voice AI is giving customers a whole new level of integration, as well as greater insight into user behaviour.
“One of the things that we are proud of is that we partner with the people we work with,” said Hom. This includes giving clients access to their data. SoundHound can provide dashboards that help customers to identify the features that users are asking their voice assistants for. Another popular selling point of the technology is the ability for companies to develop a branded voice. “We can train our product completely from scratch,” added Hom. “There are lots of options, including using machine learning to give more natural sounds.”
Embedded approach
Voice AI has taken great strides in solving the problems that made progress hard in the early days of speech recognition. There’s a big difference between asking for ‘no mayo’ versus ‘extra mayo’, for example, and models needed to reach a point where they could capture key details reliably. There have been advances in other areas too, such as embedded solutions, which have opened the door to voice assistants being available on the edge.
Packaging voice AI as a standalone chipset means that systems can operate in scenarios where connection to the internet could be intermittent. Such configurations offer continuity of service – for example, in automotive applications as vehicles pass through a tunnel. What’s more, active arbitration schemes mean that live data, such as weather or sports results, can be gathered when cloud connectivity becomes available. Another benefit of the edge approach is that developers have the option of building solutions that are completely cloud-independent, which may be useful in manufacturing or healthcare settings, for example. “The availability of more edge options will allow businesses to store and protect sensitive data locally – which could help brands build customer trust,” commented Hom.
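The arbitration idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any vendor's scheme: the intent names, the `arbitrate` function, and the `cloud_fetch` callback are hypothetical, and a real system would likely weigh latency, confidence scores, and cached data as well.

```python
from typing import Callable, Tuple

# Assumed split for illustration: intents the embedded engine can serve
# on its own, and intents that need live data from a cloud service.
LOCAL_INTENTS = {"set_temperature", "open_window"}
LIVE_DATA_INTENTS = {"get_weather", "get_sports_results"}


def arbitrate(intent: str, cloud_fetch: Callable[[str], str]) -> Tuple[str, str]:
    """Return (source, response) for an intent.

    cloud_fetch is a hypothetical callback that returns live data, or
    raises ConnectionError when there is no connectivity (for example,
    while a vehicle is in a tunnel).
    """
    if intent in LOCAL_INTENTS:
        # Served on device, so it works with or without a connection.
        return ("edge", f"handled {intent} on device")
    if intent in LIVE_DATA_INTENTS:
        try:
            return ("cloud", cloud_fetch(intent))
        except ConnectionError:
            # Stay responsive offline rather than failing silently.
            return ("edge", "live data unavailable while offline")
    return ("edge", "request not recognised")


if __name__ == "__main__":
    def offline(intent: str) -> str:
        raise ConnectionError("no signal")

    print(arbitrate("get_weather", lambda intent: "sunny, 18 C"))
    print(arbitrate("get_weather", offline))
    print(arbitrate("open_window", offline))
```

A fully cloud-independent deployment would correspond to a configuration in which every supported intent sits in the local set and `cloud_fetch` is never called.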
Bringing clients on board also involves making solutions available in multiple languages. “The core engine is language agnostic, but there is a language-dependent part, which is where our team of linguists and other experts fits in,” said Hom. As of 2022, SoundHound’s platform supports 25 languages, and the firm is aiming to add over 100 languages and variations as part of future developments.