About This Case

Closed

27 Nov 2007, 11:59PM PT

Bonus Detail

  • Top 5 Qualifying Insights Earn $100 Bonus

Posted

15 Nov 2007, 12:00AM PT

Industries

  • Advertising / Marketing / Sales
  • Consumer Services / Retail Industry
  • Enterprise Software & Services
  • Hardware
  • Internet / Online Services / Consumer Software
  • Media / Entertainment
  • Start-Ups / Small Businesses / Franchises
  • Telecom / Broadband / Wireless

Buttons On Phones Are Nice, But What About Voice Interfaces?

 


Earn up to $100 for Insights on this case.

LetsTalk's PhoneTalk blog wants to add new voices to its website, and they're posting regular Cases here for the Techdirt Insight Community to add interesting new content to their site. The winning submissions for each Challenge Case will be posted (perhaps with some editing) on the PhoneTalk blog -- with credits to the author. The following is LetsTalk's next assignment:

While multitouch displays on mobile devices have been getting a lot of attention recently, voice interfaces are also on the rise. Services like 1-800-FREE411, 1-800-CALL411, and 1-800-GOOG411 (to name just a few) aim to provide speech-based services for mobile phones. The accuracy of automated speech recognition is obviously not perfect, but it is improving. Besides the accuracy issue, though, why isn't speech a more popular interface? What speech-based apps or services do you find particularly useful for your mobile phone? Will speech recognition ever become a truly mainstream interface? What will be the killer app for speech recognition? Do you use any speech services now? If not, what would get you to try one?

16 Insights

 



Mobile phone interface innovations are useful when they suit a consumer need, not merely when they add function.

Voice interfaces have not taken off because they are slow and often arduous. It is often much faster, easier, and more intuitive for phone users to simply push a button or two than to speak to a phone.

 Mobile phone users have 15 years of experience typing input into phones to access functionality.   Voice functionality has grown entirely as it relates to communications, not access or infrastructure manipulation. 

Voice functionality will continue to grow as it relates to functionality users want to interact with via speech. Goog-411 and Free-411 are growing because they are simple and because they deliver solid value to the end user. Speech interfaces will only grow once there is sufficient value delivered to the end user - sufficient for the user to learn a new mode of interacting with a machine. We speak to people, we type on machines.

Speech isn't more popular because it still doesn't work perfectly.  All it takes is one word off and you're getting connected to "the orange factory" instead of "the porridge factory". 

Speech-to-text is way more useful than just "call person xyz" if an all-in-one speech recognition program is implemented. I should be able to speak to my phone and do anything the phone can do without touching more than one button (to start the speech engine). I should be able to set the phone options, turn the phone off - anything - with voice commands. When the phone can be operated without manual dexterity, it becomes an accessibility option for people of limited dexterity.

Technology to strip out background noises in speech-to-text will also boost usability. 

Great speech recognition will require a starting command word so that talking about calling someone doesn't try to actually call the person.  For instance, if I say "x34 call so-and-so", then it should call them, otherwise the phone ignores what I'm saying.
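A minimal sketch of that command-word gating, in Python. The wake word "x34" is the poster's own example; the function name and behavior are hypothetical:

```python
# Sketch: ignore all speech unless it starts with an agreed command word.
WAKE_WORD = "x34"

def extract_command(transcript):
    """Return the command portion of an utterance, or None if no wake word."""
    words = transcript.strip().lower().split()
    if not words or words[0] != WAKE_WORD:
        return None               # ordinary conversation: the phone ignores it
    return " ".join(words[1:])    # e.g. "call so-and-so"

print(extract_command("I really should call so-and-so"))  # None
print(extract_command("x34 call so-and-so"))              # call so-and-so
```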

(Warning - slightly lengthy & detailed :-)

~~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~,

"Besides the accuracy issue, though, why isn't speech a more popular interface?"

I've been in speech recognition for over a decade, both hardware & software;
And for the past three years primarily in remote speech recognition.
I've learned a lot, and most of it's not good.

The short answer is: A combination of things.

1) As to accuracy:


After about 7 years of dedicated research, the leader in the speech recognition world, Microsoft (as it has proven with its introduction of the world's first 16kHz recognizer), has concluded there is not, nor will there ever be, such a thing as "Speaker Independent" speech recognition, and for a variety of very valid reasons.

     a) There are too many nuances in the ways five different people will say the same identical phrase five times each, for example. It's possible for IVR systems (which are not speech recognizers, per se) to do a pretty good job with short bursts of speech containing words & phrases that are fairly common with respect to a pronunciation denominator.

     b) Dialects (Southern, Northeastern, Midwestern, etc.) and accents.

As to the majority of attempts at speaker-independent recognition, from the majority of IVR's: 

The more complicated the words are, vis-a-vis their exclusivity, the better job the IVR will do.
Example: An IVR system will do much better with a phrase like "This is the uttermost sensation" than with "It's easy to recognize speech".

As regards the first of those two, it is fairly "exclusive" and there are very few words/phrases that sound (and are constructed) like "uttermost" and "sensation".

As regards the latter, it's easy for the IVR to stumble when deciding what, exactly, was uttered:
"It's easy to recognize speech" .. ?? Or did I hear "It's easy to wreck a nice beach" .. ??

And in that vein.. think of the thousands and thousands of words that sound alike.

+{ .. Accuracy, Part 2 }+

     c) Our voice vs. sloppy hardware ÷ bandwidth conservation + signal pollution.

It's hard for the general public to know or understand what really happens to our voice as it traverses the respective networks on its way to the terminating number/application, and/or how awful the signal is before it even gets started.

Major cell phone makers are not about to spend more than a dollar or so on a good microphone element.
Why? There's no need! We humans can extrapolate from the garbled speech data that cheapo mic. elements spit out - using surrounding words and phrases - what was almost certainly said, to a very high degree of accuracy.
So why in the world would they want to use a microphone element that does a far better job, but costs 500% more?

However, speech recognizers, especially where there's not a speech model available for each speaker, can't "fill in what they don't hear" - no matter how advanced they are, how much AI they might contain.
(Remember the old acronym.. "GIGO" ..?)

But that's just the beginning of the problem.
Let's take a phone call made from a T-Mobile, or AT&T/Cingular cellphone as an example (not to single anyone out; just an example) that terminates to a server somewhere.

This phone call will begin by taking the terrible analog signal from the $1.00 Pac Rim mic. element and encoding it into GSM. The resulting digital signal will then arrive at the POP where it is to enter the PSTN, or the VoIP network, in the next leg of its journey.

Now: This digital signal will be uncompressed, maybe re-converted into analog, maybe not, and then re-compressed into another codec; probably G.726 at 32kbps, or G.711, or, if the next network in the journey is a 2nd-tier VoIP provider, maybe a thin Speex or some other "bandwidth friendly" codec.

Presuming there are only three hops from cellphone to IVR (wireless network --> PSTN/VoIP --> terminating network), then our voice has been converted/re-converted three times!
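To make that concrete, here is a rough numerical sketch of how re-quantization error piles up across hops. Mu-law companding at three invented resolutions stands in for the real GSM/G.726/G.711 codecs; it is an illustration, not their actual math:

```python
# Toy model of three lossy codec hops; NOT the actual GSM/G.726/G.711 codecs.
import numpy as np

def companded_roundtrip(x, levels, mu=255.0):
    """Mu-law compress, quantize to `levels` steps, then expand back (lossy)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    y = np.round(y * levels) / levels                  # the quantization loss
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

rng = np.random.default_rng(0)
speech = rng.uniform(-0.5, 0.5, 8000)                  # stand-in for 1s of speech

signal = speech
for hop, levels in enumerate([127, 63, 31], start=1):  # each hop a bit thinner
    signal = companded_roundtrip(signal, levels)
    rms_err = np.sqrt(np.mean((signal - speech) ** 2))
    print(f"after hop {hop}: cumulative RMS error = {rms_err:.4f}")
```

The damage done at each hop is never undone by the next one; it only accumulates.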

However, this is the proverbial tip of the iceberg. Both the cheapo mic. elements that cell phones use, and the wireless networks that carry the signal (which are far more concerned with spreading their bandwidth across as many simultaneous users as possible than with quality) only push along about 3kHz - 4kHz of audio bandwidth.
Again, these providers know humans can extrapolate missing speech data extremely well, and therefore rely on that premise to define the economics vs. quality algorithm.

Problem is, a tremendous amount of speech data resides well above the 3kHz - 4kHz range (the range ceiling for any such large network), so if we say "My cousin Vinny does a lot of sailing," the IVR is very hard pressed to determine if Vinny owns a boat or is doing poorly in school. Distinguishing between the "S" sound in "sailing" and the "F" sound in "failing" requires that the IVR's recognizer get data up above 6kHz, which is just not there to begin with.
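A toy computation illustrates the point (assumption: synthetic noise bursts stand in for the two fricatives). Nearly all of the energy that separates an "s" from an "f" sits above the 4kHz ceiling just described, so a telephone channel never delivers it:

```python
import numpy as np

fs = 16000                            # a 16kHz "wideband" capture
rng = np.random.default_rng(1)

def band_noise(lo, hi):
    """White noise crudely band-limited to [lo, hi] Hz via FFT masking."""
    spec = np.fft.rfft(rng.standard_normal(fs))
    freqs = np.fft.rfftfreq(fs, 1 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(spec)

s_like = band_noise(4000, 8000)       # sibilant "s": energy mostly above 4kHz
f_like = band_noise(1000, 4000)       # "f"-like: energy mostly below 4kHz

def fraction_above(signal, cutoff):
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return spec[freqs > cutoff].sum() / spec.sum()

for name, sig in [("s-like", s_like), ("f-like", f_like)]:
    print(name, f"{fraction_above(sig, 4000):.0%} of energy above 4kHz")
# A 3-4kHz telephone channel discards that high band entirely, so the
# recognizer never gets the cue that distinguishes "sailing" from "failing".
```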

Add in the fact that most cell phones' microphone elements don't do noise canceling very well, if at all. So if the user happens to be near traffic, in wind noise, or in a busy mall, the recognizer at the other end will have all sorts of ambient sound it must hope to identify & discard - which is of course impossible if those sounds fall in the range (50Hz - 4kHz) that it "knows" is speech.

"Why isn't speech a more popular interface?"

[+] We just don't like talking to machines, for the most part [+]

In their book "Wired for Speech," Clifford Nass and Scott Brave present a ton of superb research about voice interaction, and point out that "we respond to voice technologies as we respond to actual people."
It's very, very hard to have a computer voice emulate a human - no matter how polite it is, how "cute" it sounds, how well it enunciates words & phrases, it's an all but impossible task.

Neospeech has done some pretty amazing things with their two most advanced TTS voices, but consider this.
For each incoming telephone call, there has to be a corresponding open port.
Each port cannot "share" the speech recognizer, or its voice, with other ports.
One port = One instance of recognizer & voice.

Recognizers, and voice licenses... are far from cheap.
If a company like, say, MasterCard wants to be sure their customers never get a busy signal, then they have to be ready for, say at peak hours, maybe 10,000 simultaneous incoming calls.
And that means 10,000 open ports.

And that means 10,000 instances of an expensive recognizer, and 10,000 instances of an expensive voice .. ??

Not very likely.

So, what we as consumers hear & interact with are economic versions of each, which don't work as well as they could, so right away.. we don't like them.
The website, http://www.gethuman.com/ gets hundreds of thousands of visits monthly, and understandably so.

Me, personally? The moment I hear an IVR begin, I begin to push the "0" + "#" keys to foil it, to make it say
"OK! I surrender! Let me get you to a human!" - and if that doesn't work, I begin to utter "Operator---Operator---Operator," because I know this default handoff is built into all the decent IVRs, to keep customers or potential customers from hanging up and never calling back!

~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
"What will be the killer app for speech recognition?"

I don't know what everyone's "Killer App." would be, but I do know there's a simple, easy solution to making speech recognition universally workable, accurate and friendly from mobile phones everywhere..
A process called "Distributed Speech Recognition".

There's a lot written about it, but it's really fairly simple.
Our speech is sampled by a "local" recognizer that performs the front-end processing in the mobile device, directly on our speech; this digital signal is then sent, via the cell phone's data channel, to the remote recognizer.

All the pitfalls and all the drawbacks of codec reconversions, channel error, bad hardware and the like are eliminated, and in the ETSI tests done several years ago, a less than 1% error rate was documented.
     Less Than One Percent.
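For readers who want the shape of the DSR pipeline, here is a simplified sketch. The real ETSI front-end computes MFCC features; this toy version just computes log band energies per frame, to show how compact the data sent over the data channel can be:

```python
import numpy as np

def frontend_features(audio, frame=200, bands=13):
    """Per-frame log band energies: a crude stand-in for the ETSI MFCC front-end."""
    n_frames = len(audio) // frame
    feats = []
    for i in range(n_frames):
        spec = np.abs(np.fft.rfft(audio[i * frame:(i + 1) * frame])) ** 2
        chunks = np.array_split(spec, bands)       # pool the spectrum into bands
        feats.append([np.log(c.sum() + 1e-10) for c in chunks])
    return np.array(feats, dtype=np.float32)

# One second of 8kHz audio (random stand-in), processed on the handset:
audio = np.random.default_rng(2).standard_normal(8000)
feats = frontend_features(audio)
print(feats.shape)                                 # (40, 13) feature matrix
print(feats.nbytes, "bytes of features vs", audio.size * 2, "for 16-bit audio")
# Only the features cross the network, untouched by any voice codec.
```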

Why don't we see it everywhere?
Because it would call for standardization by all the wireless carriers & cell phone makers; everyone would lose their "exclusivity," and lots of companies would lose big revenues from this emerging speech recognition & IVR market.
So we, as consumers, suffer.

Do I use any speech services now?
Of course - I use the awesome 16kHz Windows Vista™ speech recognition, and I hardly touch the keyboard these days..

In fact - Almost everything you read here, was done with my voice, while I caught up on straightening out my desk!

 

Bill Burke
http://wirelessspeech.blogspot.com/

Bill Burke
Tue Nov 20 12:49pm
Just some passing comments about various insights..

"Voice functionality has grown entirely as it relates to communications, not access or infrastructure manipulation"

This is a tad inaccurate. Speech recognition is widely used in various programming environments; then there is VoiceXML, and a growing number of websites that interact via speech command & control. There are also a great many industries where a variety of manufacturing functions are speech-driven.

Re: Access:
I surf the web all day long w/o touching the keyboard - this is built into Vista and requires no learning curve whatsoever.
Bill Burke
Tue Nov 20 12:53pm
Just another comment:

""Speech interfaces will only grow once there will is sufficient value delivered to the end user .. We speak to people, we type on machines"

Nuance has seen fit to run network TV ads each morning for their latest version of Dragon (read: $500,000), and well over 30% of Microsoft's staff uses speech recognition all day long for everyday work.
Bill Burke
Tue Nov 20 1:02pm
"Technology to strip out background noises in speech-to-text will also boost usability"

Unfortunately, software-based noise canceling is a holy grail that is years and years away, if at all possible.

It's possible to use hardware very close to a user's mouth that ignores everything not coming from that direction; but it still remains impossible to "trick" a recognizer into discarding one sound within the frequency range it's listening to while accepting another, different sound in that range.

There's been significant progress in this arena, but still no reliable process to isolate "speech only" as the only accepted sound waveform. Problem is: how would a recognizer know, for example, the difference between the hum of car tires at around 2kHz and the uttering of words at around 2kHz .. ??
Derek Kerton
Tue Nov 27 11:46pm
Hi, to your own last comment, I think the answer to that is in the array side of noise cancellation technology. By having multiple microphones at varying distances from the mouth, and aimed in different directions, technology is available to get a better idea of what was uttered by the mouth at 2kHz, and what is the hum of car tires at 2kHz.

I've seen/heard demos and it can produce remarkable results. Of course, we still have the problem you outlined about the desire for low-cost microphones.
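A minimal two-microphone delay-and-sum sketch (all numbers invented) shows what the array buys: the same 2kHz tone is kept or attenuated depending purely on its direction of arrival, the spatial cue a single microphone lacks:

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 2000 * t)            # 2kHz from the mouth (on-axis)
hum = 0.8 * np.sin(2 * np.pi * 2000 * t + 1.0)  # 2kHz tire hum, off-axis

# The voice reaches both mics in phase; the off-axis hum reaches mic 2 late.
mic1 = voice + hum
mic2 = voice + np.roll(hum, 11)                 # 11-sample arrival delay

beam = (mic1 + mic2) / 2                        # delay-and-sum, steered at mouth

def rms(x):
    return np.sqrt(np.mean(x ** 2))

print("hum level at a single mic:", round(rms(mic1 - voice), 3))
print("hum level after beamform :", round(rms(beam - voice), 3))  # attenuated
```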

I like your point about encoding the voice signal in the phone, and sending it upstream as data, thus avoiding the degradation along the voice path. You've reminded me that this is what Promptu is doing with their Multi Modal voice search, which I discussed in my post.
Speech interfaces aren't popular for several good reasons.
  • It takes longer. I can instantly recognize an appropriate icon and tap it within a second. But the same operation takes noticeably longer if a slowly-enunciating voice has to describe my options. (Especially if it then has to describe a second set of choices following from the first selection!)

  • Ineffectiveness. Speech interfaces only work if every possible request can be anticipated. Some services do have a very specific set of choices. (For example, Moviefone.) But others simply can't provide a voice prompt for every possible path that a user might take, and this problem is compounded by different words users offer for the same requests. I remember a frustrating call where I ended up shouting to the voice prompts "Operator! Representative? Live Person. Help!!!"

  • Bad design. When I call my mutual fund to check my balance, their speech interface asks me which fund to check. Since I only own one fund, it's a stupid question — annoying, and time-consuming. If I were speaking to a live operator, they could a.) see that I have only one fund, and b.) instantly provide the balance. Speech interfaces leave the user at the mercy of a pre-configured set of choices, and this is ultimately a disempowering experience.


Obviously speech recognition is already becoming a mainstream interface — but not because users like it. Banks implement it to save money by reducing the number of live operators. (And to torture their users with mandatory advertisements for their other products.) Since 1997 my dentist has also been using a headset recorder with a voice interface to record measurements of my teeth (while keeping both his hands in my mouth...)

This points to a key opportunity for speech interface systems — situations where a touch interface isn't practical. One great example? When users are driving. Onboard navigation systems can let users request real-time traffic information or driving directions while they're keeping their hands on the steering wheel. Speech interfaces could provide a variety of other services for drivers, from ordering meals, buying tickets, or even pre-ordering flowers for a big date. Ultimately users could even surf the web while driving, having the text of web pages read to them with a text-to-voice interface! Commutes represent the one part of the day when users actually have some extra time for using new services.

Most speech interfaces drive me crazy, but I think that qualifies me to say exactly what's missing. Most use an impersonal voice striking a "professionally" friendly tone which isn't fooling anyone. At least Moviefone uses a distinctive (and enthusiastic) voice with an original personality. At the very least a speech interface could give users the option of customizing the voices they'd hear. (Would users prefer a male voice, a female voice, a synthesized voice, or a live operator?) But I'd love to see a speech interface hire celebrities to record their voice prompts. Can't you just imagine it?

Press 1 for William Shatner
Press 2 for Elmo
Press 3 for Larry King
Press 4 for Mr. T...
Bill Burke
Thu Nov 22 8:02am
Excellent points! and maybe I can brighten Techreporter's day..

As re:
"Ultimately users could even surf the web while driving, having the text of web pages read to them with a text-to-voice interface! "

Adondo's PAL does just this:
http://www.adondo.net/adondo-pal-features.aspx
and a whole lot more, including the capability to manage one's Outlook by speech -and- reply to, and/or dictate & send new messages with (typically) 98% accuracy.

If Techreporter would like to contact me off-list, I'd be happy to provide him with a Professional version serial number!
Jeff Foley
Tue Dec 18 8:40am
Just wanted to counter two elements of this statement:
"Obviously speech recognition is already becoming a mainstream interface — but not because users like it. Banks implement it to save money by reducing the number of live operators."

Here at Nuance we commissioned some primary research on this subject, and found out that people don't hate the speech interfaces -- they hate poorly designed systems that hold them prisoner. Things like using the IVR as a gatekeeper, trapping callers so they can't get to an agent easily. (Touchtone systems suffer from this too.) When presented with examples of speech and touchtone interactions, folks overwhelmingly (80%) preferred the speech systems due to their ease of use, clarity, speed of transaction and completeness of the service experience.

I wrote a whitepaper on this and other related subjects presenting some of this research.(http://www.nuance.com/mythbusters/)

The second point is that it's not just financial institutions using speech in their call centers -- it's telecommunications and cable providers, insurance, healthcare, travel & hospitality (airlines, trains, rental cars, hotels), retail stores, and utilities. There are at least 3000 applications out there and the number is growing. And, like ATMs, people are embracing the automation when they realize it gets them service faster and more conveniently than waiting in line to talk to an agent.

"What will be the killer app for speech recognition?"

 After some thought:

A cell phone speech-driven interface for Twitter, blog posting, Jaiku and maybe even Facebook.
It would be built on Adondo's technology, which already empowers users to get instant real-time traffic reports, weather forecasts, and stock quotes; hear podcasts, blogs and feeds; or have a website's text content recited, via any phone.

It's unique because it uses the necessary local language model for extremely accurate speech transcription, can send email and can interact with websites and servers. I use Adondo's PAL technology each day, and I send hands-free email all day long by voice, with incredibly accurate transcribed results.

Just my take!

 

Bill Burke
http://www.phoneportals.com

Alexander CASASSSOVICI
Sun Nov 25 11:13am
SpinVox already provides a twitter extension, check it out ;)

A couple of thoughts about speech recognition:

* Accurate speech recognition requires a lot of CPU. To save CPU you must restrict yourself to recognizing a limited set of preprocessed commands, thus limiting the device to a single language and ruling out real speech-to-text.

* Speech requires a learning curve too, and it's probably harder to get used to speech-based HMIs than visual ones. E.g. your operator's voicemail: "to delete the message, say 'delete'"... that hint phase lasts forever.

* A lot of mobile phones already provide a speech recognition function when using a Bluetooth headset, to enable voice-based access to the user directory - mine was never able to recognize my name, with its fancy spelling.

* On the other hand, text-to-speech raises confidentiality issues; a lot of things must not be read out loud.

In conclusion, true speech-to-text is far from being realistic on mobile from a technology point of view; and from a usage point of view I'd say it has to be restricted to public info, things that can be said out loud - which fits pretty badly with a device that represents your intimate friend and notes collection.

Alexander CASASSSOVICI
Sun Nov 25 11:14am
Forgot to mention that imho the only speech-recognition based app that's really making value out of this function is SpinVox - I never listen to my voicemails, but getting them in text by SMS and mail rox :)

Besides the accuracy issue there really isn't much else; by the time I can say a name or number clearly enough for the phone to get it right, I could have dialed the number.

But one speech recognition app that I would try is one that ties into some kind of GPS system: you say an address to it, and it responds with a map of how to get to the address from your current location. It would also have to be programmable as far as what it repeats and what noises it makes - like whether it repeats the address you gave it, or beeps every so often just because.

It would also have to be motorcycle-friendly - something that can easily integrate into an existing GPS system and FRS communications system.

It would have to actually work, be reasonably priced or free, and not have a million different charges to use it.

The only speech recognition I have used that I like is the one that comes with Windows Vista, simply because it works fairly well and it lets me use my computer without using a mouse or keyboard.

 

What is the end game for voice-based interfaces? Truly hands-free operation.

The impact then will be in services that are one or more of:

  • key intensive
  • time intensive

Key intensive activities are those that require many keystrokes, e.g. navigating IVRs or text messages. Many handsets today (the Nokia E50 for one) have message readers that provide alternative methods for reading received SMSs. With the growing restrictions on use of mobile devices whilst driving, a service that allows you to dictate and send an SMS through a voice interface, as opposed to keystrokes, could prove to be very popular and useful.

Google's mobile service provides such a text message service that can be used to "write" an SMS. An awkward fit in the voice SMS space is the Bubble Talk application: essentially a voice SMS that inserts a voicemail into the recipient's queue without their phone ringing.

This is one use of the Google service but the main service, available from other providers such as Jingle Networks, is directory services.  You can use the voice interface to search for services and typically will be connected automatically.

A directory search would normally be a time-intensive and/or key-intensive activity. Creating an access channel via voice turns the interaction into more of a conversation or discussion. This has the benefit of creating appeal among less technically able groups who look for a more familiar access path.

Further examples of time and key intensive functions are navigation through IVR for actions such as:

  • Balance checking
  • Pre Paid top up
  • Payment checks

Virgin Mobile in the UK has a fairly efficient service that does a pretty good job of opening up the IVR menu to a voice interface. The usual troubles caused by accent and background noise can still confuse the system, though. Virgin has expanded the service into the 4321 Talk service, which allows subscribers to get updates on sport, soap operas, news, weather, etc. through a simple voice command set.

Another player in this area is TellMe, which offers voice search as well as an application that can be downloaded to view the results on your device.

Another application approach to voice access is the offering from MobiVox, which allows voice access to Skype from your mobile phone, with connections at a local call rate. This is targeted at more techno-savvy users and may not have the general appeal that voice search offers.

A good platform that extends the application of voice across several lines is IfByPhone. In particular, its Voice Broadcast offering evolves the call center and softens the impact on the customer experience caused by hang-ups, lack of CSR capacity, and the "IVR Robot".

What are the key benefits of a voice access approach?

  • Hands free functionality of useful services
  • Conversation type approach that attracts less technically capable users
Overall, the most appropriate use of such technology will be directory search, as the need for speed and the unplanned nature of the navigation mean this is best suited to a conversation or dialog channel for most users. With Google already in the space, the question would be how much room there is for new players.

Speech recognition has been relegated to a myopic set of applications such as looking up telephone numbers or retrieving flight information. A narrow vocabulary is easier for speech recognition software to implement accurately. Speech recognition is also challenged by noisy backgrounds. Noise removal algorithms help considerably. The quality of speech recognition software and noise removal algorithms will improve accuracy enough for developers to consider a wider set of applications beyond single purpose functions.

Speech is a great and natural interface for us to use, but past accuracy problems make users reluctant to rely on applications built around speech interfaces. If people have to say their input more than twice, they become frustrated and quickly stop using the application. Mobile devices and applications are a great place to implement voice interfaces because of the small size of the mobile device and state laws against using them in automobiles. Anything that uses a remote control is a perfect place to substitute a voice interface. How about changing the channel on the TV just by shouting, "Up Channel"?

Local search, information, and directions on location-aware mobile devices are a great application for voice recognition. I would love to have an application on my phone or Treo that I could activate from my Bluetooth headset to find the nearest gas station or bar, or check traffic conditions. For instance, I could ask it for the nearest gas station, and it would show me on the screen and read to me the top results, where I could choose one from a menu. Once I chose a station, I would have choices to call, map, get directions, or see more information. Results would be read aloud and shown on the display for me to view. This single application could do all of these functions without resorting to the keypad. The value comes from hands-free operation aggregating all of the information I may want about my environment at that specific time.

Sprint and Microsoft have their Live Search application for my phone, which uses GPS information to search using keyboard or voice input. It maps, calls, and directs to results, but it is limited to finding locations and businesses, not weather and other useful information. I tried to use it to find the local weather on my trip last weekend, and it gave me all local businesses with "weather" in their name. The application is nice because I do not have to fumble with the phone to input my search terms, but it only applies to businesses or people. The application could easily be triggered to present specific information based on keywords like sports scores, weather, airline, news, and many others. Sometimes I would like to just have a map of my current location. Live Search does not do that as far as I can tell. Why can't I just say, "Where am I?" and have it pop up with a map of my current location? This is a simple thing to do with a keyboard, but trickier with voice. Google Maps can't even do it.

Speech is a natural interface to use, especially in mobile applications, but the implementation is tricky because there are few standards or commonly accepted practices for an interface. An enhanced search client that takes into account all the things a mobile user would like to know about their surroundings is an excellent initial application, followed by a hybrid IM client and a voice-activated browser. All of these applications are within reach of current technology.

Bill Burke
Wed Nov 28 3:21pm
"Speech is a natural interface to use, especially in mobile applications, but the implementation is tricky because there are little standards or commonly accepted practices for an interface"

ETSI worked out a standardized interface that all but ensured -every- mobile speech-enabled device would work flawlessly with -every- remote speech recognition server.

What stopped DSR dead, in the US?
Money, the fear of lost revenue, and the word that wireless telcos love to hate.. "standardization".

E.g.: Just look at how many wireless protocols there already are for wireless voice, wireless broadband, et al., and the cries from each provider as to how much better theirs is/will be compared to the competitors' flavor(s) ... !!!

Speech is the most popular interface known to man. Long before the written word, man used talk to influence and alter the world around him. Young children rapidly learn it, entertainers do it to amuse, diplomats prefer it to war. People talk to other people, to animals, to plants, to themselves and to God. The main reason humans invented written symbols and language is to have permanent records. Fragments of ancient texts that have survived to the present day record laws and business accounts as often as not. Like the QWERTY keyboard, sometimes objects can become so familiar that we forget they were originally designed to work around technical and physical limitations. If we could talk to machines, then we would. In fact, we often do talk to machines - we just do not expect them to respond. Even the most forward-thinking can forget this; Steve Jobs announced that "voice is the killer app" before showing off his new phone with a touchscreen. He was half-right. Voice is the killer app, but not just because people want to talk to each other. Phones are designed to be talked through. It would be just as natural to design phones to be talked to.

 

Speech interfaces are not new. Back in 2000, UK cellco Orange acquired Wildfire Communications, and its voice recognition service, for US$142m. That deal was small compared to Nortel's purchase of Periphonics for US$436m the previous year. But in 2005, Orange terminated their Wildfire personal assistant service due to declining numbers of users. As a consequence, Orange had to manage the uproar from a legion of visually impaired users who relied on Wildfire to make their calls. Wildfire would have been more popular, and would still exist today, if it had worked well for the general public. Talking is great because it is a fast and effective form of communication. You would not email the fire station if your house was burning down. But talking is frustrating and time-consuming if the person you talk to has difficulty understanding what is being said. The same is true of voice recognition software. Misunderstanding only a few percent of the words said may seem like a reasonable level of performance, unless you are the person not being understood. When Orange pulled the plug on Wildfire, they had to meet their obligations to disabled users by providing voice recognition software on the handsets themselves. This has one obvious advantage; the quality of the line is not a factor in whether the speaker is understood. The drawback of voice recognition software on the handset is that handsets lack the processing power to match the sophistication of software run on dedicated servers. So an approach based on thin clients, where a universal voice recognition service is accessed over a network, continues to be the most popular way to deliver this functionality. It is especially popular when provided as a common front end to a service like booking tickets or directory enquiries. In the US, the 1-800-FREE411 and 1-800-GOOG411 directory services are a good example of the latter. The reason for that, though, has more to do with eliminating the cost of paying call centre staff to answer calls than it has to do with providing an enhanced service to customers.

 

The breakthrough for speech recognition is perhaps just around the corner.  Necessity is the mother of invention.  Handset manufacturers have been riding the crest of a wave over the last few years, always able to come up with new additions to their devices in order to generate replacement sales.  One of the interesting things about the iPhone, though, is that it shows the limit of the new ideas.  Better screens, in-built cameras, music, touchscreens… but what comes next?  Speech-driven interfaces are an obvious next step.  The poor history of Apple’s own speech recognition software shows that the technical challenge is enormous, but they have reason to keep on investing in research, and not just because of the social obligations to provide communications to the disabled.  If they do not, they will open up an opportunity for the networks to provide a valuable feature to their customers.  Why store phone numbers on your device, if you could just call your network and then tell them, through a spoken command, to put you through to the person you name?  Names pose an enormous challenge because, unlike commands, cultural and language differences cause many more variations in pronunciation.  But after video, speech is the last great frontier for mobile communications.  Whoever can get first-mover advantage in providing an effective voice interface to the most universal of demands – making calls, programming home devices like PVRs, and internet search – will reap the rewards.

I use voice commands on my N73 via my Bluetooth handset to call while driving. This is fine if I talk clearly and there isn't too much traffic noise. I tested out all possibilities on my N73, but got stuck trying to turn Bluetooth on/off with voice commands; no matter how much I shouted "Bluetooth," it just wouldn't recognise the command, but when I changed to "Greentooth" it recognised it every time. So the biggest thing with voice recognition systems is understanding dialect - or maybe my not-so-broad Scottish accent is actually too broad for these systems.

 

If I ever phone a company which has a voice recognition system I just put the phone down, as they can never understand me. Oh, the fun my wife had listening to me try to book tickets via an automated voice recognition cinema ticket line.

Voice recognition is improving and the possibilities are endless, just think of the things you could say round the house to activate things:

“Lights ON”

“TV ON”

“Door OPEN”, “Door LOCK”

“Heating ON”, “Temperature 24 degrees”

“Toilet FLUSH”

 

One thing I do find with saying voice commands into my mobile in public is that it's a bit embarrassing. I think voice recognition is more of a novelty item, but it would be of great benefit to the many blind or partially blind people in the world, so don't give up on it yet.

Speech recognition is quickly moving from a nice-to-have to a must-have feature on mobile phones.  Part of this, of course, has been the improvements in accuracy, as speech R&D advances take advantage of more powerful resources on mobile phones.  But the real reasons for the rise of speech tie directly into three issues: speed, convenience, and perhaps most importantly, safety.

Speed.  Put simply, speech makes it faster to use your phone.  And with the advent of mobile dictation capabilities, people can dictate text messages much faster than using a T9 keypad, iPhone touchscreen, or mini-keyboard.  Some people mistakenly scoff at the value of speech applications for mobile phones.  "Why would I speak a text message -- why wouldn't I just call the person?"  (Why send text messages in the first place?  Sometimes you don't want to interrupt the person with a call.)   "What about privacy -- I don't want to say what I'm typing out loud!"  (As opposed to cell phone conversations?  Besides, you can always go back to thumbing it.)  The bottom line is that all the rich content and communication capabilities being added to today's phones can be accessed more quickly with a speech interface.  Which ties into the next point...

Convenience.  Mobile phones come standard with a slew of "bullet point" features, many of which will never be used by most users.  Why do I need a notepad if this thing doesn't have a keyboard?  And no, I don't want to download videos or buy casino games with bad graphics... this is a work phone!  But hand in hand with the concept of speed is the ability to unlock the power of mobile phones, and give easier access to otherwise buried features.  Case in point -- ever try to find and download a ringtone?  It probably takes you 6 or 7 menus deep into your mobile OS -- most users abandon their quest long before making it to the bottom.  But by its nature, speech flattens that interface so that a couple of voice commands -- for instance, "Search for ringtones from Bon Jovi" -- get you directly to the right screen.  Mobile carriers and manufacturers are keenly aware of this ability, as it means raising the Average Revenue Per User (ARPU) by removing these consumer barriers to downloading content.

Safety.  This one in particular has driven the growth of mobile speech applications.  Cell phone distraction while driving reportedly causes 2,600 deaths and 330,000 injuries in the United States annually.  The media is not shy about highlighting stories about the dangers of distracted driving.  With voice commands, users keep their hands on the wheel and their eyes on the road.  Examples of useful applications include sending a text message, finding local businesses, and playing a specific MP3 music selection, as shown in this YouTube video.  Automotive manufacturers have figured this out and begun to include hands-free, eyes-free speech features in new cars -- but until these cars become more commonplace, the mobile phone is the perfect after-market complement to provide these features.

All this sounds great, right?  So why aren't more people using speech systems on their mobile phones?  Lay the blame at the feet of three issues: poor initial applications, limited availability, and general unfamiliarity with what speech can do.

Truthfully, a surprising number of people are already using simple speech applications.  Free 411 services and customer care applications have increasingly begun to augment their automated systems with speech, not only for the speed and convenience issues mentioned above but also for the safety of the drivers using these systems.  For client-side mobile applications, the best example of a speech application is voice-activated dialing (VAD).  Hundreds of millions of mobile phones come standard with some form of VAD.  But many people who experimented with early versions of VAD laughed it off as a useless gizmo.  It's too easy to make a mistake showing off speech recognition, mobile or otherwise, and those mistakes are laughers. Those earlier systems were speaker dependent -- they required repeated training of each name in your address book and had trouble in noisy environments.  Nowadays, most VAD applications are speaker independent, require no training (even for most accents), and automatically guess the pronunciation of every name in your address book.  Still, early experiences with the older systems have given many consumers a foul taste in their mouth when it comes to using speech on their phones.

More advanced speech applications, like mobile dictation, command and control of your phone, and voice search capabilities are only just starting to become available on most mobile phones.   For most people, they need to see that eye-opening killer app that makes the value of speech clear.  Road warriors writing and responding to emails while driving.  Music lovers accessing MP3s with a few words instead of a lot of fumbling and clicking.   Asking where to find the nearest Starbucks, and having your phone respond with a map and driving directions based on your GPS location.  These are no longer pipe dreams, as phones like the Centro have started bundling applications like Nuance Voice Control.

That said, expectations for speech systems are sometimes unrealistic -- blame too many science fiction shows for that one.  Until the interface improves even further, using a speech system requires an understanding of the structure of available commands.  Most speech recognition errors are due to Out Of Grammar errors -- in other words, the system could understand what the user said perfectly, but had no idea what to *do* with the information because it lacked context, or "grammar" in speech parlance.  (What does, "Dammit, I just want to know how to get to 123 Central Street" mean?  Nothing... unless the system is already running a navigation program, waiting to hear a street name and knows to ignore everything else that doesn't sound like an address.)  But as more and more users begin to see the benefits of the technology and train themselves on the use of these systems, and as phones gain the processing power to spend more time sifting through Out Of Grammar issues, adoption should continue to rise.
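The grammar point is easy to sketch. In the toy below, the active "grammar" is a single hypothetical navigation pattern; the recognizer may transcribe everything perfectly, but an utterance only becomes actionable when it contains a slot that grammar is listening for:

```python
import re

# Hypothetical grammar active while a navigation app waits for an address:
NAV_GRAMMAR = re.compile(
    r"(?:navigate|directions|get) to (\d+ [\w ]+ (?:street|avenue|road))", re.I)

def interpret(utterance, grammar=NAV_GRAMMAR):
    match = grammar.search(utterance)
    if match:
        return ("NAVIGATE", match.group(1))
    return ("OUT_OF_GRAMMAR", None)   # heard fine, but nothing to *do* with it

print(interpret("directions to 123 central street"))
# ('NAVIGATE', '123 central street')
print(interpret("dammit, I just want to know how to get to 123 central street"))
# Still ('NAVIGATE', '123 central street'): the grammar ignores the preamble
# and latches onto the one slot it knows.
```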

 

Current speech recognition technology is unpopular not because of accuracy, but because of speed. This is an implementation problem rather than one of technology.

Speech communication generally occurs bidirectionally. In current implementations, the phone or other device speaks very slowly, with pauses before starting. In most implementations, each command must be verified as correct. These two things make it seem very cumbersome, because with human speech we are used to (1) fast responses, and (2) verification queries only when we are misunderstood.
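A tiny sketch of the fix this implies: confirm only when recognition confidence is low, the way a person asks "what?" only when they actually misheard. The threshold and scores below are made up:

```python
def respond(command, confidence, threshold=0.85):
    """Execute confidently-heard commands at once; verify only doubtful ones."""
    if confidence >= threshold:
        return f"(executing '{command}' immediately)"
    return f"Did you say '{command}'?"

print(respond("call home", 0.97))   # fast path: no confirmation step
print(respond("call Rome", 0.55))   # slow path: one quick verification
```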

When you slow down a natural communication such as speech only a little, it seems painfully slow -- even if it is still faster than alternative forms of communication such as typing or clicking.

Accuracy is, of course, required for a fast speech interface. But the technology is available, or will be soon. A fast, natural speech interface will be necessary before speech is widely accepted.

As computers get smaller and cell phones / PDAs / iPods / GPSes, etc. get more powerful, we find the user communications interface (speech, typing, clicking) becoming a major limiting factor. Speech has the most potential to be a dominant interface, and will eventually become the primary human/machine interface for small devices.

A side note: As speech communication with computers evolves, we'll find that the machines can listen faster than we talk. Power users will learn to speak (and listen) faster with a varied diction, including faster consonants and click sounds. The development of a computer dialect seems like something out of science fiction, but it is actually no more far-fetched than watching a fast typist.

It has been many years since Microsoft's Bill Gates started talking about how speech based interfaces would change computing in profound ways, yet despite some notable progress with speech recognition on phones and computers there has been little change, and our interactions with data remain largely via keyboard and mouse input.

Simple speech routines like voice activated phone number access are fairly common, but speech as interface has not caught on nearly as much as you'd think given all the technological progress and capability.   Given that speech is the main way we communicate with each other, why isn't it the main way we choose to communicate with our technology?

I think part of the answer is simply ... habit.   Even though you can speak a lot faster than you can type, we have a long history of interacting with data using keyboards and touch rather than voice.   That's what we are comfortable with and that will be slow to change, though I'm confident it will change over time until we'll wonder how we ever could have lived without what will undoubtedly be flawless voice recognition software and voice controlled devices.

I'm constantly frustrated with how hard it is to effectively utilize the internet via my Treo, and also noticed how even the superb Google maps interface gets downright dangerous when you try to use it while driving down the interstate.    For me, a killer application would be voice recognition that would allow me to get mapping and data from the phone as I drove, but this is not a trivial application.   First, I'd want the phone to respond to my requests quickly and accurately.  Yet I would not want to have to memorize a menu of voice commands and would not want a limited menu, so the phone would have to have an enormous vocabulary or be able to learn from my requests the type of data I needed.  

My killer voice application would need to communicate data very effectively and allow me to "drill down" quickly to the data I needed.   For example for mapping I'd want a GPS style voice application with simple "A to B to C" directions, but for blogs I'd want my phone to access the internet, then the blogs I want, then list all available posts to read, then read the ones I wanted, allowing me to back up and fast forward quickly and easily.    I'd also want the ability to talk back to those blogs and post comments, so the application would have to allow me to interrupt my reading, comment, and then jump back in.

Another reason for slow adoption is that even huge voice control power has limitations that cannot be overcome. While reading, you can scan a lot of content and access navigation features very quickly. Voice would slow down - very considerably in some cases - many features of web surfing and reading. Thus even the best speech control applications probably won't signal a fundamental shift in the way we use our phones.

In summary, I think we'll see great innovation in the speech space over the coming years as the Open Handset Alliance brings much more powerful phones and applications, but our habits and the limits of speech recognition will make the inevitable changes come somewhat slowly.

 

Bill Burke
Wed Nov 28 3:13pm
Mobile data transmission is not yet reliable enough to handle voice data packets, and the limited transmission range could be one of the major factors in the slow mainstream adoption.
But the potential and productive uses are quite compelling. A good example is field news reporters: an individual just has to call a particular access number, and his/her voice report will be conveyed across different mediums, remarkably bridging the online and offline worlds. Most of these services convert and relay the data to other platforms, especially ones that utilize SMS, particularly Twitter and Jaiku, making the online and offline worlds truly interconnected. To a certain degree, the news is delivered to selected subscribers directly and almost in real time. Think of services like Jott, Utterz and SpinVox.
Another sector of society that will greatly benefit from this is people with physical limitations, such as sight problems, as these tools can make things much easier for them. But that's not to say casual users don't have anything to gain from such technology; this is where mobile individual and group interaction, plus reminders, come in.
Jott seems to have really found the right market with its delivery mechanisms to major, widely used platforms. Besides having its own SMS service, Jott will transcribe and send audio as text, flagging it whenever the audio indicates "Low Transcription Confidence".
Another service quite interesting in its implementation is MobiVox, which leverages Skype: MobiVox lets you call all your Skype and non-Skype contacts without dialing, using just its plain, simple voice prompts.

Why Isn't Speech A More Popular Interface

Let's start by making the distinction between speech in and speech out -- both of which have some interesting elements. Speech in is speech recognition, and speech out is usually TTS, or Text To Speech, where a computer voice reads a text record from a database.

Speech output as an interface has been around for decades, most commonly experienced as the dreaded IVR, or Interactive Voice Response. The IVR is an extremely unpopular solution because it basically replaces one-to-one communication with a human with a robotic interaction with a machine that would fare very poorly if given the Turing test. Many people find their issue does not fit nicely into one of the categories of the IVR menu, yet they still need to sit through scads of unrelated outgoing messages and menu options before having a chance at talking to a human.

In my opinion, the biggest problem with speech out, or TTS is that it takes a remarkably long time to read through a list of options for a user, especially when read at "least common denominator" comprehension speed. In contrast, the same list of choices could have been scan-read in 5% of the time, and a selection made. Text reading speed can be determined by the reader, while TTS is almost always speed controlled at the server.

Companies that are hot again, like TellMe, had an earlier heyday in the dot-com era. In 2000, TellMe and Quack were all hot to trot out the "Next Big Thing": voice portals. Vortals, as we geeks learned to call them, would allow voice access to all the same content you could get at a Yahoo. You could customize, personalize, access on the fly, and of course access from a mobile phone (which at the time had very little data service). But for the reasons stated above (slow delivery), and a problem with people's inability (or lack of desire) to learn the correct keywords and navigation commands, the Vortals more or less flopped. Well, TellMe limped along to greater glory, and my hockey teammate Steve, the Quack founder, got out in time by selling it to AOL. He's doing OK. I was at Disney's Infoseek portal in 2000, running the mobile team, so I did biz dev talks with a few of these same Vortal players, but in the end decided that people wouldn't use it.

Now, for speech input. It sounds very appealing at first: the ability to simply state commands or search queries into the device, and have it respond. But then you learn that speech recognition is about 95% accurate in a lab environment. Mobile phones don't get used in the lab too often; they are in noisy environments, in cars, outside, undergoing cell handoffs, digitization, jitter, cut-outs, and other artifacts that decrease the quality of a voice signal. The human brain is adept at interpolating the speech despite these artifacts, but machine recognition is nowhere near as resilient. In the real world, recognition rates drop drastically from 95% to unacceptable levels.

Since we normally use our PCs at home or work, the voice interface is usually experienced while mobile. Unfortunately, the time when we want voice recognition to work is when we are offering the servers the worst possible sound quality we have to offer. No wonder it has proven of limited value.

 

Will Speech Recognition Ever Become Mainstream

Most recently IVR systems have upgraded from their DTMF tone user inputs to the much more friendly voice recognition input. At first it was just a question of offering a very small set of speech options (like Yes and No, and 0 - 9). This was barely an improvement over DTMF, if at all.

The problem stems from the incredible number of possible words that can be humanly said, multiplied by the myriad different voices and intonations, cubed by the wide range of possible accents that could be articulating the words. It becomes very computationally intensive for a machine, in real time, to cross-reference the uninterrupted flow of sounds coming at it against all of the possible combinations of words that could be a hit, and thus correctly spell out the sentence. And then there's the challenge of interpreting what the sentence actually means. This task, which humans do remarkably well, is called Free Speech Recognition -- and it's not happening on your cellphone or server anytime soon.

What the technologists do to reduce the complexity of the task is try to reduce the possible result universe to a smaller, more probable set of vocabulary, phrases, accents, and meanings. Then, with this more finite universe of options, they can match the speech they hear against, say, a thousand expected utterances, come up with the most likely match, and deliver the expected result. The key is to limit interpretation to a finite set of options. That's why the first commercial speech recognition we saw was of the "Say your phone number" variety, since the answer universe consisted of just slightly more than 10 possible results, in a repeating sequence. It's more than 10 because there are multiple words for zero ("oh," "naught"), and because some people might utter "one oh eight" while others utter "one hundred eight". Despite this, it was relatively easy to interpret.
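A toy normalizer shows why that answer universe stays tiny despite the spoken variants ("oh", "naught", "one hundred eight"). The "hundred" handling below is a crude heuristic for illustration only:

```python
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "o": "0", "naught": "0",
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def normalize_number(utterance):
    """Collapse spoken-digit variants into one canonical digit string."""
    digits = []
    for word in utterance.lower().split():
        if word in WORD_TO_DIGIT:
            digits.append(WORD_TO_DIGIT[word])
        elif word == "hundred" and digits:
            digits.append("0")            # crude: "one hundred eight" -> 108
    return "".join(digits)

print(normalize_number("one oh eight"))        # 108
print(normalize_number("one hundred eight"))   # 108 -- same canonical result
```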

But with faster processors, newer database technology, faster storage, cheaper memory, more time, and better programming tools, developers are finally able to take it up a level and offer Quasi Free Speech Recognition. That is, they can make the universe of possible responses a lot bigger and still come up with decent real-time matches. Thus, a voice recognition 411 service can narrow the possible input utterances down to known town names and 50 states, then ask for a category using the finite set of categories out of the yellow pages, and produce respectable results. I'll note that the TTS results from services like Goog411 sound decidedly robotic, but for free, why complain?

So, you see that in the paragraph above, most of the recent progress hasn't been made by improvements in recognition accuracy of what you're actually saying, but in limiting the universe of responses, and throwing Moore's Law (a boatload of MIPS) at the challenge. So, yes, clearly Speech Recognition IS becoming mainstream and will continue to do so, at least as long as Moore's Law holds. And any true enhancements in actual Free Speech accuracy will also have a big impact.

 

What Speech-based Apps Do I Use

I use what is probably the most popular speech recognition app for mobile phones, and that is a voice dialer. A voice dialer is a software package installed on many cell phones that allows a user to program several phone numbers as "voice tags" and then to dial them with voice commands anytime afterwards. Voice tags are not complicated programs because by requiring the user to "train" the software for each voice tag, the task becomes one of simply matching waveforms against a very limited finite set -- the set of tags that the user has trained in.

Yet despite how simple this recognition app is, it is very useful because it allows a user to place calls while driving, for example, without ever entering the phone number manually or even opening the phonebook. A driver of the success of voice dialing has been Bluetooth headsets, which work hand in hand with the phone and the software: a single button on most headsets triggers the app, and the user simply speaks into the headset. As such, users can place calls with one button press, even if the phone is in a purse on the back seat. But this voice tagging has limited functionality: I need to remember for whom I have recorded tags, and I need to remember how I uttered each name when I recorded the tag. For example, "Richard White Cellphone" will get an error or a mismatch when the tag recorded was "Dick White Mobile".

Some software solutions go one step further on smartphones. For example, Voice Control by Nuance for Palm OS phones does not require you to record voice tags; it claims to voice-dial enable your entire contact list. But it does this by first doing a TTS interpretation of all the names in your contact list, then generating what it thinks is a representative waveform for the utterance of each name, then indexing all the waveforms much as Google indexes websites. When the user launches the program and utters a name, it matches the uttered waveform against the TTS tags IT created for your contacts. This offers less accuracy, and it also takes more time to install, more time to use, and more time to match your utterance against the database. In fact, I could never use this program because the fine print limits it to contact lists of no more than 2,000 people, and I had over 2k. At some point the index database becomes so big and cumbersome that even the vendor concedes the product runs too slowly to be useful. Moore's Law will fix this limitation.
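One way to picture that kind of tagless indexing is phonetic keying: every contact name is reduced to a compact key, and the recognizer's best-guess transcript is reduced the same way, so "Waite" and "White" land in the same bucket. The classic Soundex code below is only a stand-in for whatever proprietary acoustic model Nuance actually builds, and the contact names are hypothetical:

    def soundex(name):
        """Reduce a name to the classic 4-character Soundex key."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        name = "".join(c for c in name.lower() if c.isalpha())
        if not name:
            return ""
        key, prev = name[0].upper(), codes.get(name[0], "")
        for c in name[1:]:
            code = codes.get(c, "")
            if code and code != prev:
                key += code
            if c not in "hw":  # h and w don't reset the previous code
                prev = code
        return (key + "000")[:4]

    contacts = ["Richard White", "Sam Carter", "Derek Kerton"]
    index = {soundex(n.split()[-1]): n for n in contacts}
    print(index.get(soundex("Waite")))  # phonetic hit -> Richard White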

A product like Microsoft Voice Command is interesting because it promises features like Voice Control above, plus voice control of many other features of Windows smartphones. The app can be used to find contacts, but also to control MP3 playback or look up calendar appointments. But a lot of the coolest speech recognition apps like this one are currently available only for smartphones, and really only for the most powerful among them, with fast processors and ample memory.

 

What Will Be The Killer App For Speech Recognition

The killer app for speech recognition is where the recognition gets pushed off of the relatively dumb phone and onto the relatively powerful server. We talked above about how throwing Moore's Law and thousands of MIPS at the problem produces better results, so why not move the recognition off the device and use a voice channel to carry the audio, as is, to the server, where better recognition can take place?

In fact, the real killer app might be multi-modal speech recognition apps like voice search. Multi-modal refers to using two different modes of communication to close a single loop. For example: you have a dumb Verizon Get It Now phone, and you want to search for Abba because your buddies want to know the names of the band members. You launch a voice search app and utter "Abba"; the phone carries the sound to the server, which recognizes it; the server then delivers to your phone not voice through TTS, but text and images in a web or WAP page. The results could be structured like Yahoo's great OneSearch, with info on Abba, Wikipedia links, Amazon.com purchase links, links to ringtones and screen savers, and maybe even a mini-bio that tells you about Benny, Bjorn, Anni-Frid, and Agnetha. Because the info is text and visual, you are able to quickly scan through it and find the mini-bio.
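A sketch of that round trip from the phone's side: the audio goes upstream, and structured results come back as text the phone can render. The endpoint URL, audio format, and JSON shape below are all hypothetical -- every vendor defines its own:

    import json
    import urllib.request

    def voice_search(audio_bytes, server="http://example.com/voicesearch"):
        """POST raw audio upstream; get structured, scannable results back."""
        req = urllib.request.Request(server, data=audio_bytes,
                                     headers={"Content-Type": "audio/amr"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)  # e.g. {"query": "abba", "links": [...], "bio": "..."}

    # The phone never interprets the audio itself -- it just renders the
    # text and images in the response, which the user scans at a glance.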

A multi-modal search like the above eliminates the problem of difficult input (the number keypad) on most phones, works around the low processing power of most phones, and provides visually rich results that can be scanned quickly. If the solution provider is wise, it also delivers quality "OneSearch"-type results. Verizon actually has a voice search feature, and while the multi-modal technology is good, it's a pity that the results are useless. Instead of giving the subscriber what she most likely wants (like OneSearch), they only use the search to try to sell ringtones and wallpaper. I've nothing against selling this stuff, but is that really what the average user wants as a search result for Abba, or anything really? The Verizon service is powered by Medio, although they shouldn't take the blame for Verizon's insistence on hawking instead of satisfying customers. Promptu is another provider of this technology.

TellMe, acquired by Microsoft, is being integrated into Live Search; Google is actively pursuing the mobile voice search business, as are a stable of others.

 

Do You Use Any Speech Services Now

I use the voice tag app in my phone to place cellular calls to contacts, frequently together with my Bluetooth headsets. My favorite trick: while I'm listening to tunes on my stereo Bluetooth headset, I tap a single button, which pauses the MP3 and launches the recognition applet; it recognizes my command and places my call; and when the other party hangs up, my MP3 resumes playback. That's a pretty easy way to place a call.

One thing I have done is program the AT&T GSM command for call forwarding as a speed dial, and record a voice tag for it. Thus when I arrive at home, I activate voice recognition and say "Call Forwarding Home", and all my calls are forwarded over to my landline. I hate running downstairs to try to answer a call on my cell at home. I have 12 Uniden handsets around the house; I'm gonna use them. No, my house isn't that big. That's just the way I am.
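For anyone who wants to replicate this, the forwarding command is just a GSM supplementary service (MMI) string stored as an ordinary speed-dial entry. The strings below follow the GSM standard for forward-all-calls; the number is a placeholder, and some carriers deviate from the standard, so check yours:

    **21*14085551234#    register and activate forwarding of all calls to that number
    #21#                 deactivate forwarding
    *#21#                check the current forwarding status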

I also use the voice tagging feature of my HTC Tilt to launch programs on the smartphone. For example, I can say "Browser" and PocketIE is launched, or "Sling Player" and, without another touch, I'm watching my home TiVo.

Other than that, I mostly use phones with keyboards, so I'm not a prime candidate for voice input. I like to type my queries because I get better accuracy that way.

Derek Kerton
Tue Nov 27 11:59pm
Hey, I read a_chameleon's post, and need to add this to mine:

I liked his point about encoding the voice signal in the phone, sending it upstream as data, and still doing the recognition on the server -- thus avoiding the degradation along the voice path. a_chameleon reminded me that this is what Promptu is doing with their multi-modal voice search, which I discussed in my post. Definitely the best way to subdivide and delegate the tasks.
Derek Kerton
Wed Nov 28 10:21am
Well, after writing the post above, I followed up with an updated version of Microsoft Live Search. I had read that they launched a new version for Windows Mobile devices, and the new version added local gas prices and VOICE INPUT.

After just a few minutes playing with it, I'm very impressed. It uses speech recognition the way my comment above mentions: by using the data channel only and doing all the heavy lifting at the server farm. A local application is required, so it's not a browser service. The app's functionality can be thought of as a little like Yellow Pages Mobile 2.0 -- you can find businesses, get directions, see traffic and maps, and get local movie showtimes. Like Google Maps Mobile, the app needs a constant (good) data connection, since all the data resides on the server.

In terms of how it uses voice, the local app encodes voice input and sends it to the server invisibly. The server returns the correct match, which the app then displays in the field. To the user it's seamless, instant, and looks like it's done on the phone.

First you speak your city/state combo, which is promptly displayed in the correct field. Then you speak the name of a business or business category (e.g., "Hotels", or "Little Home Thai Restaurant"). Amazingly, using a limited answer universe of the yellow page listings for my city/state, the app delivered correct results in 10 out of 10 tries.

A few button clicks/screen taps are required to manage the application locally, since the phone itself is not responding to speech commands. You have to navigate to the appropriate fields and trigger the speech applet, for example.
Bill Burke
Wed Nov 28 2:54pm
Thanks for the mention, Derek.

The founder of Distributed Speech Recognition (http://wirelessspeech.blogspot.com/2006/10/when-will-distributed-speech.html), Motorola Labs' Dr. David Pearce, has an interesting article on the VoiceXML site: http://www.voicexmlreview.org/Nov2004/features/dsr.html

It is a real shame that DSR isn't being adopted here; ETSI proved, with some pretty serious testing, that it could achieve an almost 99% accuracy rate -- it literally kills all the monsters in the way of "everyday, easy to use" speech recognition that 'just works'...

Additionally, the Nuance site mentions their SpeechPAK product, which I understand helps things along in some arenas.

We're involved with the Speech Components Group and NUI @ Microsoft as an IHV developing the world's first (best we can tell) self-contained 16kHz wireless headset, which will do some pretty cool things, as well as allow Windows Vista's _awesome_ recognizer to do almost *everything* we now do via keyboarding.

There will be some definitive shouting from the rooftops by the folks in Redmond as to just how powerful, simple, and overwhelmingly effective speech recognition inside Vista really is... really soon.

Here's a 12-minute video that may open quite a few eyes as to what Vista can do with a wired, 16kHz headset:

http://on10.net/Blogs/laura/are-you-talking-to-me-no-im-talking-to-my-computer-check-out-this-sweet-voice-recognition-program/

Bill Burke
http://wirelessspeech.blogspot.com
