By Nicolas Perony, Co-founder and CTO of OTO Systems
The e-commerce juggernaut premiered its voice-enabled fitness tracker in August 2020, but the potential of Amazon’s Halo could be hamstrung by the technology that underpins it.
In 2017, some three years before Dr Maulik Majmudar announced early access to Amazon Halo, “a tool that analyses energy and positivity in a user’s voice”, Teo Borschberg and I were hard at work pioneering the frontiers of speech technology. Unbeknown to us, the technology behemoth was also busy filing patents related to voice intelligence.
What Teo and I did know back then was that the voice market was huge, and set to explode. Humans and businesses were already exchanging over a hundred billion calls annually, and nearly half of America was using voice assistants on their smartphones.
Teo and I spun out emotion technology from SRI International’s Speech Technology and Research (STAR) Laboratory to create a startup called OTO Systems Inc. The STAR lab is known for making Apple’s Siri. For those who don’t know, SRI is thought of as “the birthplace of some of Silicon Valley’s most important innovations.”
Before the dawn of technology that measures tonality, automatic speech recognition could only crudely capture the context of human language. But what was lost in transcription was all the emotion, a sure tell for human behaviour.
Our dream at OTO? To ensure that people are better understood by machines. We wanted to use AI to improve the human experience by perceiving and measuring behaviour through tonality, all without compromising human privacy.
The first technology we worked with was based on a number of hand-designed acoustic features, including Mel-frequency cepstral coefficients (MFCCs), the basic building blocks of most speech understanding and modelling applications until the late 2010s. But MFCCs have real drawbacks: they are expensive to compute, highly parametric in nature, and offer only a lossy representation of the input signal.
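To make "highly parametric" and "lossy" concrete, here is a minimal sketch of a textbook MFCC pipeline for a single audio frame, written in plain Python. It is not OTO's or Amazon's code. The frame length, sample rate, number of mel filters and number of coefficients are all assumptions chosen for illustration, and each of them changes the result.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum of a Hamming-windowed frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spectrum.append((re * re + im * im) / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):      # rising edge of the triangle
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge of the triangle
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    """Window -> power spectrum -> mel filterbank -> log -> DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(n_coeffs)]

# Illustrative input: 256 samples of a 440 Hz tone sampled at 8 kHz (assumed values).
frame = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(256)]
coeffs = mfcc(frame, 8000)
print(len(frame), "samples ->", len(coeffs), "coefficients")
```

The 256 input samples are reduced to 13 numbers, and the original waveform cannot be reconstructed from them.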
Looking through patents granted to Amazon Technologies Inc, you’ll find number US10096319B1, granted in October 2018, and called “Voice-based determination of physical and emotional characteristics of users”. Filed by inventors Huafeng Jin and Shuo Wang, the patent states: “Features used for voice processing algorithms may include Mel-frequency cepstral coefficients (MFCCs).” Almost two years later, Amazon opened access to Halo, a tool that promises to measure “body composition, activity, sleep, and tone of voice”.
At OTO, we quickly abandoned MFCCs because they are an outdated and computationally expensive way of extracting information from audio signals. Instead, we focused on learning representations directly from the raw waveform, using modern deep learning techniques such as contrastive learning. This approach is much faster and potentially preserves more of the information contained in the input signal.
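For readers unfamiliar with contrastive learning, the sketch below shows the general idea using the widely used InfoNCE loss. It is a generic illustration, not DeepTone™'s training code. In a real system the embeddings would come from a neural encoder applied to raw waveform segments. Here they are hard-coded vectors, and the function names and the temperature value are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j != i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # cross-entropy with the true pair as target
    return total / len(anchors)

# Stand-ins for embeddings of two views (e.g. two nearby segments) of two recordings.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_matched = [[0.9, 0.1], [0.1, 0.9]]   # each row is close to its partner in view_a
view_b_swapped = [[0.1, 0.9], [0.9, 0.1]]   # partners deliberately mismatched

print(info_nce(view_a, view_b_matched))  # small loss
print(info_nce(view_a, view_b_swapped))  # much larger loss
```

Minimising this loss pulls together embeddings of segments that belong together and pushes apart the rest, so the encoder learns its own features instead of relying on hand-designed ones.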
But back to Amazon’s new wearable. Their biggest problem seems to be the lack of generalisation in the emotional expression that Halo can understand. US technology news brand The Verge asked Amazon whether Halo’s Tone feature had been tested across cultures, accents and genders. A spokesperson for the behemoth said this was a “top priority”, adding: “if you have an accent you can use Tone but your results will likely be less accurate. Tone was modelled on American English but it’s only day one and Tone will continue to improve.”
We know this is a hard problem to solve, and we spent years refining OTO’s embedded voice intelligence, DeepTone™. We trained this proprietary technology on tens of thousands of distinct voices with a variety of languages and accents to capture the diversity of emotional expression in much of the world’s population.
Running on the edge, DeepTone™ preserves human privacy while offering organisations real-time voice intelligence with deep insights that can radically improve the human experience. Frankly, Amazon has much better options for Halo than technology designed in the 1970s.
Today DeepTone™ is the perfect technology for wearables and home devices but, importantly, it is also being used to pioneer solutions in medicine for Alzheimer’s patients and to humanise robots.
At the heart of this invention is the understanding that communication is more about how something is said than what is said. Behavioural psychology, in general, reveals the importance of emotion as a key driver in decision-making, while research by Albert Mehrabian reveals that words themselves remain ambiguous when it comes to conveying meaning. This is because, when transcribed, they are stripped of the rich nuance carried by intonation.
When it comes to affective decision-making (like the intention to purchase a product), a speaker’s tone contains five times as much information as the words themselves.
The breakthrough we made with DeepTone™ was generalising the extraction of low-level behavioural markers, like anger, frustration and joy, from the raw waveform of anyone’s voice in any language. By clustering these markers into high-level behaviours relevant to industry scenarios (like negativity, frustration or satisfaction), OTO can isolate and identify key moments that reveal and rank human experience at scale, without compromising human privacy.
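As a rough illustration of the step from low-level markers to high-level behaviours and key moments, consider the toy sketch below. It is not the DeepTone™ algorithm: it swaps the learned clustering for a fixed, hand-written grouping, and the marker scores, the grouping, the threshold and all function names are hypothetical.

```python
# Hypothetical grouping of low-level markers into high-level behaviours.
BEHAVIOURS = {
    "negativity": ["anger", "frustration"],
    "satisfaction": ["joy"],
}

def behaviour_scores(markers):
    """Average the marker scores that belong to each high-level behaviour."""
    return {name: sum(markers[m] for m in members) / len(members)
            for name, members in BEHAVIOURS.items()}

def key_moments(frames, behaviour, threshold=0.6):
    """Return (time, score) pairs at or above the threshold, strongest first."""
    scored = [(t, behaviour_scores(markers)[behaviour]) for t, markers in frames]
    hits = [(t, s) for t, s in scored if s >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Made-up per-frame marker scores: (time in seconds, {marker: score in [0, 1]}).
frames = [
    (0.0, {"anger": 0.1, "frustration": 0.2, "joy": 0.7}),
    (0.5, {"anger": 0.8, "frustration": 0.9, "joy": 0.05}),
    (1.0, {"anger": 0.6, "frustration": 0.7, "joy": 0.1}),
]

for time, score in key_moments(frames, "negativity"):
    print(f"negativity peak at {time:.1f}s (score {score:.2f})")
```

Because only numeric scores are passed around, a pipeline of this shape never needs the words that were spoken.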
OTO would like to invite Dr Maulik Majmudar and Jeff Bezos to try out DeepTone™ using our web demo. Anyone interested in developing smart, speech-enabled products can build voice intelligence right into their applications with the DeepTone™ Software Development Kit or API. Journalists, analysts, and other interested persons can process audio files or streams to experience how the AI works. And work it does.
But see for yourself. We’d like to challenge anyone interested in the future of voice intelligence to put OTO’s DeepTone™ to the test.
OTO’s lightweight DeepTone™ engine uses cutting-edge voice technology to detect key behaviours and acoustic signals in real time. It extracts over 100 measurements, multiple times every second, to build a rich acoustic map of the voice.
COVID-19 has changed the way we live and work forever, accelerating the use of voice. Microsoft reports that on a single day in April 2020, 200 million Teams users interacted and generated more than 4.1 billion meeting minutes. Research analysts Juniper report that there will be “8 billion digital voice assistants in use by 2023, up from an estimated 2.5 billion at the end of 2018.”
Edge computing and the rollout of 5G will drive a surge in speech, but despite the hundreds of billions of calls that humans and brands exchange annually, only 10% are sampled for quality assurance. This means there’s an insight goldmine begging to be analysed in the name of improving human experience.
Protecting privacy while offering real-time speech analysis at scale is the breakthrough that healthcare, robotics, business and humanity need to create more useful and meaningful experiences between people and machines.