Skeptical about voice control

There was a Sunday NYT article on voice recognition and how we are all going to control our TVs and other devices with voice. Building on the Siri wave, there is a popular belief that voice will become a significant or even dominant way we interact with devices and services.

I’m a big believer in voice. Ignition is an investor in Spoken, which is doing great with its existing cloud voice processing business and has some great ideas for the future. We’re an investor in AVST, Twisted Pair, and Public Mobile: all voice-based businesses, all doing great. People are never going to get tired of talking to one another.

And that is what voice is really all about — people talking to people, not to devices. I will invest all day long in technologies that improve people talking to people — making it easier, more accessible, cheaper, augmenting with additional services, hosting conversations, etc.

On the other hand, we don’t talk to our tools and instruments. We touch them. A well-designed tool or instrument fits the hands naturally, and in the hands of a skilled practitioner it allows great creativity and great performances. The feedback during its use is important: we are very sensitive to that feedback and can adjust our use in very fine increments. We don’t attempt to use voice, which is an imprecise, error-prone method. In fact, trying to talk very precisely can be quite annoying and unnatural.

So are our computational devices more like tools, or more like people? Do we want to interact with them as tools, or as people? My gut says more like tools, and that we will be more effective using touch and gestures than voice.

There are always going to be edge cases in which voice control is preferred, such as people with disabilities and hands-free situations. But I’m not convinced voice control will become significant.

One thought on “Skeptical about voice control”

  1. A couple of quick points:
    – Voice as a separate communication modality has plenty of deficiencies even for inter-human transactions and is quickly being replaced by multi-modal interfaces (putting AVST and Twisted Pair both in the niche player category).
    – The Siri approach is broken because it is a generic interface that doesn’t have contextual information (save location and name) about the user. Try teaching it that you like dogs and see what it has to say. Imagine if you woke up every morning and had to start the relationship with your spouse anew (like 50 First Dates); you’d get tired of it pretty quickly. As humans, we need to see a relationship developing to build trust and intimacy.

    Social intelligence, an understanding of facial expressions, gestures, and vocal intonations, is the essence of effective communication. Social intelligence encourages engagement, trust, attention, learning, and bonding in relationships. Facial expressions and spoken words transmit information more quickly and instinctively than any textual method. They instantly signal what responses are appropriate, and they enrich text with vital signals (enthusiasm, skepticism, the relative importance of words and concepts) to enhance and motivate more perfect communication.
