Voice Interfaces (dcurt.is)
54 points by uptown on Sept 16, 2014 | hide | past | favorite | 48 comments


When I open to the home screen on my phone, I should be able to just say “Instagram” and have that app open.

Is it me, or is the benefit of that extremely small? I can move my finger and tap the 'Instagram' icon quicker than I can say "Instagram". Never mind that the app itself is highly visual, so there's no point launching it without already looking at the screen and holding the phone in your hand.

When I’m inputting my home address in a web browser (on mobile or desktop), I should be able to tap the “State” dropdown and just say “California” and have it select that option for me.

Or it should just already know your home address. That's why I don't think Google Now should be included in this criticism - the idea of it is to preempt any need for you to issue voice commands, because it's already presented you with the information you need.

On iOS, when I get a notification that covers the top of the screen, I should be able to just say “ignore” and have the notification instantly disappear.

I really don't think voice is a great interface at all, because it involves invading other people's personal space. I don't want to be sat on the train with people saying "Instagram", "Facebook", "ignore", "ignore" all around me. I don't see what's so bad about the current interfaces, nor do I see anything in this post that is particularly new or insightful.


"I don't want to be sat on the train with people saying "Instagram","Facebook", "ignore", "ignore" all around me. "

Is it possible to implement a voice virus or recursive echo algo using a pile of VUI phones, like in a sardine-like open-plan office or a crowded bar?

Could I yell into a crowded room "OK Google Now say Siri say ok google now say siri" and let it viral transmit between phones until everyone shuts off their phones in an explosive echo chamber?

Just curious if Siri and Now can either generally or specifically (back door?) talk to each other and, if so, whether anyone has exploited it yet.

I also wonder if, aside from the owner's voice, there is a back door where a stereotypical Midwestern newsreader voice auto-hacks in, and how that interacts, if at all. Or a recorded Steve Jobs voice auto-unlocks all iPhones or something. Or a string of hex digits or NATO phonetics as a VUI back door. If there isn't a back door, likely you're not looking hard enough or it's too heavily classified. And what kind of fun can I have once it becomes known?


> Is it me, or is the benefit of that extremely small?

It's not even a small benefit. It's a horrible interface. Streets and offices filled with people jabbering commands at machines would be just horrible.

It's a gimmick to sell phones: it looks cool in a demo and grows old after a little actual use.


Right now it's probably not worth it unless you're already using speech dictation for a disability (including RSI and the like).

Once we get eye tracking by default on both desktop and AR there will be nothing more natural than the look-and-speak interface though...

But yes; ideally we would quickly get sub-vocalization detection too so that we can address your noisiness concern as well.


>Is it me, or is the benefit of that extremely small? I can move my finger and tap the 'Instagram' icon quicker than I can say "Instagram".

That's because you're picturing it as the slow, clunky voice detection we have today. Imagine just picking up your phone and saying naturally: "untog's instagram" and it loads up, no touching, no prompting.


Er, I still don't understand how that is quicker. You mean I get to skip unlocking my keypad? Apple's fingerprint scanner thing is a much better solution to that problem.

And how often does that get triggered by accident? I say to a friend, "did you see my photo on Instagram?", and my phone loads the app.

I'm sorry, but voice is an awful interface. And it has been for decades, ever since people first proposed the idea of office workers sitting around dictating their work. It never happened, and that's a good thing.


I know what you're saying, but try to imagine a fully-functioning voice AI, like in the movie Her. In this hypothetical situation you just talk naturally to your device and it knows what you want and can realistically understand you. That's the future.


But an app like Instagram doesn't make any sense in that world. A fully-functioning voice AI only works over voice - once you have to do something visual it loses its purpose. Imagine browsing flight choices by voice only. How is that better than looking at a table?

In any case, I find it interesting that some people walked away from Her and held it up as an example of our technological future, while others walked away and held it up as an example of a horrible dystopian nightmare. To each their own, I suppose.


That is, however, the opposite of what the OP is proposing.


> Is it me, or is the benefit of that extremely small? I can move my finger and tap the 'Instagram' icon quicker than I can say "Instagram". Never mind that the app itself is highly visual, so there's no point launching it without already looking at the screen and holding the phone to your hand.

On the other hand, if you're disabled, cooking, holding a baby, or under a car hood with greasy hands - maybe voice is suddenly 100x more interesting.

From large vacuum-tube monstrosities to punch cards to the command line to mouse/windows UIs to touch to voice - all these progressions kept the previous generation usable (and at times were only workable in a hybrid solution).

In each progression, another class of viable usage models is exposed. In each progression more users can leverage the power of computing. Just because I find my phone easier to browse with on the train doesn't mean it replaces my laptop at home.


Exactly. The only time I really want to use voice commands is when I'm driving and should not be staring at and touching my phone. It's nice that it can transcribe my voice into a text message. It's nice that it can read a website to me. These are my use cases.

Unfortunately, they are still a little awkward to use. To get it to text someone, I still have to type in their name because it's terrible at learning how my friends' names are pronounced. And to get it to read a web page, I have to highlight the text first and then click Speak. It'd be nice if these were easier to do hands-free.


> Is it me, or is the benefit of that extremely small? I can move my finger and tap the 'Instagram' icon quicker than I can say "Instagram". Never mind that the app itself is highly visual, so there's no point launching it without already looking at the screen and holding the phone to your hand.

I actually built this for Android: https://play.google.com/store/apps/details?id=com.precon.wid...

and the key insight for a good UX was making it always listen and update dynamically, while you confirm that the recognition was correct and the app should open by tapping it like any other icon.


If you combine the new Moto X with the Moto Hint you essentially have this listening only to you.

That would enable you to say "OK [Moto], Launch Instagram" and have it hear you regardless of the people around you (specifying that due to a follow-up comment you've already received).

You can already say "OK Google [Now], [Open|Launch] Twitter" and it should work.


I think his post demonstrates a poor understanding of the state of voice technology.

> That being said, current voice recognition technology is incredibly good at certain things. It’s great at detecting and transcribing words, listening for specific commands, and making matches against expected inputs.

The current state of the art is not actually great at transcription or detection. In fact, only Google's and Apple's algorithms are any good, and they both involve sending all your voice recordings to their servers, where huge models do the transcription; that means lag, and it isn't efficient to be constantly streaming every minute of audio to them. Continuous listening or local voice transcription is possible, but you are limited by your hardware.

Listening for activation words is also a hard problem. In fact, the Moto X has a chip designed just to listen for "Ok Google", which makes some of the problems clear: 1. it uses lots of power, and 2. it has no flexibility.

Other examples he gives assume the phone won't pick up outside noise, or will differentiate between voices. This is unfortunately not the case -- you could maybe apply training and build a custom voice model, but it's still an unrealistic idea at this point.

> The reason current voice interfaces suck is because they force the speaker to consciously enter a “voice” mode and then create context around the action they want the computer to perform. This makes no sense; the computer should just always be listening for potential commands within the context of whatever the user is doing.

Yes, this is true. But that's all a side effect of power, bandwidth, and processing constraints given today's algorithms and models, which rely on lots of data. There's also the fact that voices aren't differentiated right now, and it'd be havoc if everyone were setting off each other's phones.


Your post is a bit ironic, seeing as the Moto X does differentiate between voices (if you're good at imitating someone's voice you can set it off, but sending a command is unlikely) and the power usage is not as bad as you assume (it's got average battery life but can easily make it through the day), but I will agree that the inflexibility of "Ok Google Now" is a niggling point.

The new Moto X is always listening as well, but allows you to change the trigger phrase via some magic they've worked out.


> Your post is a bit ironic seeing as the [..]

Ironic?

I think you're saying that they're incorrect, that the power usage of the voice chip isn't that bad and that it can differentiate between voices but not command phrases.

Which I think is a misunderstanding of what they're saying. The fact that you need an entirely custom chip to even make voice processing reasonable in power consumption shows that it's a domain-specific heavy calculation where silicon is the only place to speed it up. The new Moto X has a better domain-specific chip.


I see, thanks. I didn't know these things about the Moto X. It's really a spectacular feat of engineering!

But in the broader picture my point stands: he still made a number of false assumptions regarding voice tech ("right now it's good at x, y, z," when it isn't by any means). It's as if he were imagining today's voice tech to be the equivalent of capacitive touchscreens when in reality it isn't: they haven't quite been invented yet and our current level of tech is still resistive. So of course sensitivity is poor and multi-touch hasn't been implemented. It's not a design oversight, it's a technological limitation.

This isn't to say things aren't changing -- that's what the Moto X represents, improvement on the bleeding edge :)


I think this just shows that Siri has fallen behind Android. On Android:

"Who invented the light bulb?": lists the 3 inventors

"Open web-page The Economist": goes to theeconomist.com

"Launch Instagram": launches instagram app

"Send email to <person>": starts an email to <person>

There are lots more, though you do have to guess/remember the magic words if you don't want to do a web-search. And yes, it is annoying that you can't enter California in a drop-down by voice (but really, your browser should be auto-completing that for you anyway). But the hard problems have been solved, and I see a bright future here.


With the exception of the light bulb query, these work just fine with iOS too – the thrust of his post is that these actions should be available without having to specifically request a voice interaction mode to be turned on, which Android requires as well.


I certainly take your point; that would be awesome. And thanks for the iOS information!

I was more confused by the original article, though; he seemed to be suggesting that you would click on a particular UI component first: want to open a webpage? Go to the home screen, then say "Web Browser", then click on the browser bar, then say 'The Economist'.

But why touch on small areas of the screen at all, if Siri is just a button-hold away and takes you right there? I think the take-home message is that many of those UI components could go away - we don't need a browser bar, we don't need a home screen, if we choose to use voice instead.


The whole point of the article is that the button hold is bad because Siri has no context to understand what you are about to say and isn't very intelligent, which means you end up making up weird phrasings to try to indicate that you want the website "The Economist" to load in a browser, rather than directions to The Economist's office building, or facts about The Economist's website traffic from Wolfram Alpha, or The Economist's latest issue opened in iBooks...

By being in a browser and selecting the URL bar, the context is narrowed from "anything you can say in English" to "a website; if it's not a website, search for it".

By opening a dropdown, the context is "one item from this list".

N.b. he also talks about desktops, not just mobile.
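
To illustrate that narrowing, here is a minimal sketch (Java; the contexts, state list, and return strings are all made up for illustration) of interpreting the same utterance differently depending on which UI element currently has focus:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Optional;

    // Minimal sketch of the context-narrowing idea: the same utterance is
    // interpreted against whatever UI element has focus, not against
    // "anything you can say in English". Contexts and values are illustrative only.
    public class ContextualInterpreter {
        enum Context { URL_BAR, STATE_DROPDOWN }

        private static final List<String> STATES =
                Arrays.asList("Alabama", "Alaska", "Arizona", "California", "Colorado");

        static String interpret(Context context, String utterance) {
            String u = utterance.trim();
            switch (context) {
                case URL_BAR:
                    // "A website; if it's not a website, search for it."
                    return u.contains(".") ? "navigate:" + u : "search:" + u;
                case STATE_DROPDOWN:
                    // "One item from this list."
                    Optional<String> hit = STATES.stream()
                            .filter(s -> s.equalsIgnoreCase(u))
                            .findFirst();
                    return hit.map(s -> "select:" + s).orElse("no-match");
                default:
                    return "no-match";
            }
        }

        public static void main(String[] args) {
            System.out.println(interpret(Context.URL_BAR, "economist.com"));     // navigate:economist.com
            System.out.println(interpret(Context.URL_BAR, "The Economist"));     // search:The Economist
            System.out.println(interpret(Context.STATE_DROPDOWN, "california")); // select:California
        }
    }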


It's not just voice input that's severely neglected by UI designers—it's text-based interfaces in general. Consider his example of choosing "California" from a list of states—most people would find it way easier to just type "CA" than to scroll and hunt for "California" in a list—but how often do you see an interface that prods you to do this, or even makes it apparent that it's possible? (In some web apps that use custom menus, it isn't!) Why, when I want to apply a 10px blur in Photoshop, can't I just type "blur 10 pixels" instead of digging through two nested menus and a modal? WIMP interfaces are great for _discovering_ what an app can do, but terrible at letting you do a specific thing fast.
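
To make the "blur 10 pixels" idea concrete, here is a minimal sketch (Java, with a made-up two-command vocabulary; the command names are purely illustrative) of the kind of parser such a hybrid text UI would need:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Minimal sketch: parse a typed command like "blur 10 pixels" into an
    // action and a numeric amount. The two-command vocabulary is hypothetical.
    public class CommandParser {
        private static final Pattern CMD = Pattern.compile(
                "^\\s*(blur|sharpen)\\s+(\\d+)\\s*(px|pixels?)?\\s*$",
                Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) {
            Matcher m = CMD.matcher("blur 10 pixels");
            if (m.matches()) {
                String action = m.group(1).toLowerCase();  // "blur"
                int amount = Integer.parseInt(m.group(2)); // 10
                System.out.println(action + " by " + amount + "px");
            } else {
                System.out.println("unrecognized command");
            }
        }
    }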

Things like Spotlight and Wolfram Alpha are a step forward, but we're still light-years behind where we could be if we took text-based UI—or hybrid WIMP/text UI—seriously.


I worked with TellMe (a voice recognition company) right after they were acquired by Microsoft. My job was mainly to work on VUIs (voice user interfaces) for Windows Phone, Xbox, and Cortana during its early days.

There are two things blocking voice interfaces from being more prevalent. The first is the action word. As you said yourself, all voice systems today require the user to enter into a voice mode before actions can be recognized. It's sadly a necessity, as devices today can't differentiate between when they're being talked to and when a user is talking to someone else. Even if you only allow voice commands within a specific context (i.e. an alert pops up, the user says "ignore", the pop-up goes away), you're playing a very dangerous game. Take this scenario for instance:

Bob on his computer: "Have you seen this cat video? Look at how the cute kitty cat will completely..."

Phone: POP UP WITH ALERT ALL YOUR ICLOUD NUDES ARE BEING STOLEN

Bob on his computer: "...ignore..."

Phone: pop up goes away, never seen by Bob (unlike his nudes, which are now everywhere)

Bob on his computer "...everything around him! I would never ignore things that were important."

See the problem there? While the chance of it happening is slim, the effects of misinterpreting a command can be dire. The potential negatives outweigh the convenience. Maybe if everyone actively used voice commands, the net sum would be positive despite it destroying a select few people's lives, but that brings us to our other problem:

Voice commands are socially looked down upon and are rarely used by most people aged 25-50.

This group of people has two things in common - they typically know how to use a keyboard and trust its input, and they've all used the terrible voice commands of old phone systems, which leads them not to trust voice. The lack of trust means few people use it, making it seem weird to do in public. This is getting better, and certainly more people are using voice commands now than five years ago. But it's still an extremely small percentage of users that are voice power users. Ask yourself this: how many people do you know who send text messages with voice? What percentage of your friend group is that? I know two people - one of them was a voice engineer who had to test voice input on phones constantly.

Voice will get better, but it's going to take time. More people need to use it before we can take more risks, and we need to develop systems that know when they're being addressed before we can get rid of action words.


Something like this already happens all the time when programs steal focus. I have no idea how often I've been happily typing only to have some dialog pop up and accept my Enter keypress as confirmation for whatever it wanted. Things are better now that I've switched to Linux and enabled all the focus-stealing prevention, but it still happens every now and then. Sometimes I end up typing my passwords into a thieving application.


> Siri and Google Now are simply not yet ready to exist

After reading the article: Dustin's ideas are simply not ready to exist.

And somehow he hand-waves toward the ever-enlarging group that uses Siri and Google Now all the time. Sports, weather, directions, etc.


"ever-enlarging group that uses Siri and Google Now"

I was curious where we are on the stereotypical fad graph - the upswing, the trough of disillusionment, the decline, whatever. Turns out it's VERY hard to find usage stats.

"In the wild" I've never seen any human being use siri or now, other than fooling around a couple years back when it was new. My wife and I both have Now equipped phones and its mostly just an annoyance when it detects an upward swipe.

Anecdotally I find voice UI much like having printer support on my phone. Something I imagine would be insanely useful, so of course I have an app that can print to my printer. What an amazing network of possibilities. But in practice I never use it. Ever. Once to verify it works. Yup, it works. Never printed anything again. To some extent this is just "owning a laser printer" in general, which I really don't need and will not replace when it eventually breaks.


One source of this problem is that only Google and Nuance have voice recognition engines that are any good, and they are very closed up (unless you pay tons of money to Nuance like Apple does for Siri).

Most developers' only option for voice technology is to use Nuance's API, which requires uploading the voice sample and waiting for a response; this is nowhere near fast enough for pleasant interaction. Things will only get better when Apple, Google, and Microsoft open up really high-quality on-device speech APIs for their operating systems.


    In addition to using voice actions to launch
    activities, you can also call the system's
    built-in Speech Recognizer activity to obtain
    speech input from users. This is useful to obtain
    input from users and then process it, such as
    doing a search or sending it as a message.
https://developer.android.com/training/wearables/apps/voice....
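
For context, that section of the docs boils down to something like the following minimal sketch: fire the built-in recognizer activity and read back the top transcription. The RecognizerIntent constants are the standard Android ones; the class and request-code names are invented for illustration.

    import android.app.Activity;
    import android.content.Intent;
    import android.os.Bundle;
    import android.speech.RecognizerIntent;
    import java.util.List;

    public class VoiceInputActivity extends Activity {
        private static final int SPEECH_REQUEST_CODE = 0;

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            // Launch the system's built-in speech recognizer activity.
            Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                    RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
            startActivityForResult(intent, SPEECH_REQUEST_CODE);
        }

        @Override
        protected void onActivityResult(int requestCode, int resultCode, Intent data) {
            if (requestCode == SPEECH_REQUEST_CODE && resultCode == RESULT_OK && data != null) {
                // The recognizer returns a ranked list of candidate transcriptions.
                List<String> results =
                        data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
                String spokenText = results.get(0);
                // e.g. use spokenText as a search query or message body.
            }
            super.onActivityResult(requestCode, resultCode, data);
        }
    }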


My personal hunch is that Apple is going to heavily roll out physical/location context in the next few years. If my friend has a future Apple TV, I should be able to say "Hey, Siri"* and have Siri recognize that it's me and make my entire media library available for playback.

This isn't as hard as it sounds, I don't think--Apple only has to search my friend's contacts' voices, I'm probably on my friend's wifi already, etc.

* iOS 8 includes the ability to say "Hey, Siri" at any time when your iPhone is charging, but I've had crap luck with it during the betas (trying to use it in my car).


> Because they are still so frustratingly limited, Siri and Google Now are simply not yet ready to exist.

In fairness, Google Now's pitch is a lot more about location-awareness than it is about speech recognition, innit?


> In fairness, Google Now's pitch is a lot more about location-awareness than it is about speech recognition, innit

Google Now's pitch is a lot more about all-around awareness -- not just "location" awareness. Voice isn't really even part of it (the voice actions are all part of the Google app, which Now integrates, but they aren't really part of Now -- and actually predate it.)


Dustin is right: a truly pleasant user experience would necessarily involve continuous listening (without a hotword à la "OK Google"), as well as leveraging the user's context much more than Siri or Google Now do today. See for instance the vision depicted in the movie Her [1].

Beyond Google or Apple, a few startups (like us at wit.ai) are hard at work designing building blocks to help developers solve these problems.

[1] https://wit.ai/blog/2014/02/24/her-the-movie


When I highlight the browser address bar, I should be able to just say “The Economist” and have it automatically find the address in my favorites and go there.

Would be nice, but not as good as 'Computer_name...anything new in the Economist?' 'There's a provocative analysis of the arms trade and some particularly egregious punning of the sort you claim to despise.'

I have a marvelous scheme to stimulate market demand for such services, but this comment is insufficiently well-funded to contain it.


This post is right about one thing: Speech recognition technology is good at context-specific recognition, i.e. with a small grammar, as in VoiceXML IVR applications (anyone else remember those?). This has been true on typical PC hardware since at least 2000, so it should be easy to run that kind of speech recognition locally on a smartphone.

But, last time I did anything serious with that kind of speech recognition, it still required a push-to-talk button or the like. Maybe a trigger phrase would work now.
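
A minimal sketch of that kind of small-grammar matching, assuming you already have a transcription string from some recognizer (the command set and distance threshold here are illustrative only):

    import java.util.Arrays;
    import java.util.List;

    // Minimal sketch of the "small grammar" idea: accept a transcription only if
    // it is close enough (by edit distance) to one of a handful of expected
    // commands. Commands and threshold are illustrative only.
    public class SmallGrammarMatcher {
        private static final List<String> GRAMMAR =
                Arrays.asList("ignore", "open instagram", "call home");

        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            return d[a.length()][b.length()];
        }

        // Returns the best-matching command, or null if nothing is close enough.
        static String match(String transcription) {
            String best = null;
            int bestDist = Integer.MAX_VALUE;
            for (String cmd : GRAMMAR) {
                int dist = editDistance(transcription.toLowerCase().trim(), cmd);
                if (dist < bestDist) { bestDist = dist; best = cmd; }
            }
            return bestDist <= 2 ? best : null;  // small tolerance for misrecognition
        }

        public static void main(String[] args) {
            System.out.println(match("open instagram "));    // open instagram
            System.out.println(match("what's the weather")); // null
        }
    }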


Of the examples Dustin offers here, the ones that have voice replacing typing make the most sense:

- When I’m inputting my home address in a web browser (on mobile or desktop), I should be able to tap the “State” dropdown and just say “California” and have it select that option for me.

- When I highlight the browser address bar, I should be able to just say “The Economist” and have it automatically find the address in my favorites and go there. ...

- When I click the “To” field in a mail app or in Gmail, I should be able to just say a person’s name and have it fill in automatically (and maybe show me a dropdown to select which email address to send to).

Dustin's examples where voice recognition replaces a series of swipes and/or taps seem, to me, a bit frivolous:

- When I open to the home screen on my phone, I should be able to just say “Instagram” and have that app open.

- On iOS, when I get a notification that covers the top of the screen, I should be able to just say “ignore” and have the notification instantly disappear.

I know exactly where Instagram is on my phone. It's a swipe and a tap. And to ignore a notification on iOS, all one has to do is swipe up on it.

The incremental time spent vs. voice control seems negligible, and not something I'd fret about as a user or a UX designer.


The reason current voice interfaces suck is because they force the speaker to consciously enter a “voice” mode and then create context around the action they want the computer to perform

Isn't Google testing this right now? E.g. http://i.imgur.com/fJrQZ0H.jpg


Not exactly this, but iOS and Google both have voice-mode shortcuts you can use: "Hey Siri" (which I think only works when the device is plugged in) and "Ok, Google".

The problem I see is users getting frustrated with false positives - having the device do things they are not expecting. Pretty much the experience someone has when they first try vi or ed.


They still require the utterly ridiculous "Ok, Google" activation though. I know it's irrational, but I really don't want to have to use phrases I would never use in day-to-day life to communicate with my computer.


I just wonder whether anyone has experimented with using constructed or just uncommon languages for such purposes. Such an approach is certainly not consumer-friendly, but fun nonetheless.

Say, a geeky enough person could use Esperanto or Lojban to control always-listening home automation without worrying too much that it'll accidentally pick up unintended commands. Or maybe not a general-purpose conlang, but a specially constructed one tailored just to quickly express the necessary concepts to a machine.


Isn't that the whole point of an obscure catchword? To avoid the computer intercepting normal conversation?


Sure, but that's just an artifact of the crude capabilities of current speech recognition. A sufficiently advanced speech recognition engine wouldn't need you to insert a stopword to know that you're talking to it. It would just... know.


Sometimes I, as a human being, don't know if someone is talking to me unless they preface their statement with my name. How are computers supposed to be any better?


Long term, because it can monitor way more world state and be designed with as much processing power as necessary; you have one brain and don't want to spend 100% of your time working out if someone is talking to you.


That's a bit strange. I don't always know whether people are talking to me, and I wouldn't say my speech recognition is crude.

Why don't they just allow for naming a device? You'd use names in your everyday conversation.


I've given a small amount of thought to attempting a shell designed for spoken interaction, and also - relatedly - a scripting language designed to be spoken. I've not really gotten anywhere, though.

This doesn't really sound like the same thing as either of those, but it's close enough that I wanted to mention it.


So, let's assume that the issue is context (based on discussion, that seems to be a major consensus point).

Obvious solution is to allow more context to better tune understanding of voice, right?

Allowing better context is pretty much predicated on constant audio surveillance.

Is this worth it, truly?


You can open apps with Siri on iOS.


Okay? Noted? I'm not sure how much there is to respond to here.



