At OTO, we sense human behavior through speech. Our mission is to enable businesses to deliver a hyper-personalized customer experience at scale.

We currently work with contact centers to learn from hundreds of thousands of customer conversations in order to understand a range of key behaviors (interest in purchasing, sentiment, satisfaction, etc.) with state-of-the-art accuracy. The ability to measure such behaviors in real time opens up massive opportunities for businesses to scale the diagnosis and personalization of service and sales, down to each individual customer.

Voice is already big today, and it’s about to be gigantic.

Businesses and consumers exchange over a hundred billion phone calls every year. One in five Americans already interacts with a smart speaker on a daily basis, and the share of Google voice search in the US is reaching 30%. It’s all logical: voice is the most seamless form of human communication.

But as of today, computers have a limited capability to understand human sounds and grunts. Over the past 30 years, the pioneers of speech technology have made incredible strides, bringing speech recognition systems up to human-level performance (over 95% accuracy). However, most progress in voice technology today is limited to “speech-to-text”: it takes the rich dimensionality of voice and shrinks it down to a one-dimensional series of words. Despite recent progress in Natural Language Processing, words themselves remain very ambiguous when it comes to conveying meaning, because they are stripped of the rich nuances carried by intonation.

In fact, when it comes to affective decisions (such as the intention to purchase a product), a speaker’s tone contains five times as much information as the words themselves.

When you say “yeah, thank you” to Alexa, it cannot differentiate between a heartfelt “thank you” and an irritated “thank you”. However, OTO can. Below is an illustration of 3,000 “acoustically aware” variations of a happy vs tired “thank you”, extracted from customer conversations analyzed by OTO.

The story behind OTO

After Apple acquired the SRI International spin-off Siri, a group of SRI scientists led by Elizabeth Shriberg decided to further push the frontiers of speech understanding by combining deep expertise in behavioral science and artificial intelligence. The result is a technology that can computationally model a speaker’s intonation directly from raw sound waves to derive meaning, sentiment and behavior beyond words and in real time.

Intonation analysis has endless applications: smart speakers, live sales assistants, agent augmentation, AI assistants, health care, customer analysis, avatars, robotics, etc. All of these opportunities rely on one key technical innovation: going from speech-to-text to speech-to-meaning.

Fascinated by the prospect of bringing this innovation to the world, Nicolas Perony and I spun off this technology from SRI International and co-founded OTO Systems Inc.

How OTO uses intonation to deliver value today

Businesses and consumers exchange over a hundred billion calls per year, yet less than 10% of these calls are being sampled for quality assurance. This is an untapped gold mine of customer insights.

Applying speech-to-text to these calls, besides being expensive, only provides a summary of the topics discussed. But would it output an understanding of how each customer actually feels? Was the representative engaging? What did the customer like and dislike? What turned them off? What excited them? How satisfied are they? How likely are they to remain a customer?

Since we started nine months ago, OTO has extracted over 3 billion intonation measurements (compared to 20 million spoken words) from customer conversations, to model different sets of behaviors at over 90% accuracy on intonation alone! There is a clear acoustic signature in the voice of sellers and buyers, and we are pretty good at modeling it.


Everyone has heard: “what gets measured, gets done”. Now that we can measure intonation at scale, we can measure each customer’s behavior. Measuring behavior at scale means that businesses can take customer relationships to the next level. Higher personalization, satisfaction, conversion rate and lifetime value, along with reduced churn, all go straight to the bottom line.

We successfully demonstrated these ideas earlier this year when we modeled what a “top” call center agent sounds like when delivering great service. First, we monitored agent behavior to make quality control more efficient, then we coached agents in real time (during each call) to help all of them sound like a top performer.

As a result, we increased overall conversation engagement by 19%, which led to an increase in sales conversion rate of 5% on tens of thousands of inbound calls.

While we scale up the agent coaching, we now also focus on modeling the customer’s speech structure and dynamics to better understand the interaction as a whole, and steer the entire conversation toward better outcomes.

Building a library of human behaviors

We started where the value is, where the data is. But that’s just the beginning.

Human-machine voice interfaces will be predominant in the near future, which will force businesses to skillfully handle enormous volumes of voice interactions while maintaining a high degree of personalization. Their success will largely depend on their ability to understand and connect with their customers at the right level. OTO will be there to power this very skill.

Stay tuned. We will keep you updated as we transform how we hear the world.

The OTO Team

OTO is an SRI International spin-off venture building the next generation in voice tech.
