As you might have noticed, we've had quite a bit of Asian design coverage lately (with a few more stories to come): between the second annual Beijing Design Week, a trip to Shanghai for Interior Lifestyle China and last week's design events in Tokyo, we're hoping to bring you the best of design from the Eastern Hemisphere this fall.
Of course, I'll be the first to admit that our coverage hasn't been quite as quick as we'd like, largely due to the speed bump of the language barrier. At least two of your friendly Core77 Editors speak passable Mandarin, but when it comes to parsing large amounts of technical information, the process becomes significantly more labor-intensive than writing your average blog post... which is precisely why I was interested to learn that Microsoft Research is on the case.
In a recent talk in Tianjin, China, Chief Research Officer Rick Rashid (no relation to Karim) presented the company's latest breakthrough in speech recognition technology, a significant improvement on the 20–25% word error rate of current software. Working with a team from the University of Toronto, Microsoft Research has "reduced the word error rate for speech by over 30% compared to previous methods. This means that rather than having one word in 4 or 5 incorrect, now the error rate is one word in 7 or 8."
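If you want to sanity-check those quoted figures, the arithmetic works out: a roughly 30% relative reduction applied to a 20–25% word error rate lands in the neighborhood of one word in 6 to 7 (the exact starting rate and rounding are ours, for illustration only).

```python
# Back-of-the-envelope check of the quoted numbers: apply a 30%
# relative reduction to the 20-25% word error rate of prior systems.
for old_rate in (0.20, 0.25):
    new_rate = old_rate * (1 - 0.30)  # 30% relative reduction
    print(f"{old_rate:.0%} -> {new_rate:.1%}, i.e. roughly 1 word in {1 / new_rate:.1f}")
```

Starting from the low end of that range, 20% drops to 14%, or about one word in seven, which matches Rashid's "one word in 7 or 8" once the reduction exceeds 30%.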
An abridged transcript of the talk is available on the Microsoft Next blog if you want to follow along:
In the late 1970s a group of researchers at Carnegie Mellon University made a significant breakthrough in speech recognition using a technique called hidden Markov modeling which allowed them to use training data from many speakers to build statistical speech models that were much more robust. As a result, over the last 30 years speech systems have gotten better and better. In the last 10 years the combination of better methods, faster computers and the ability to process dramatically more data has led to many practical uses.
Just over two years ago, researchers at Microsoft Research and the University of Toronto made another breakthrough. By using a technique called Deep Neural Networks, which is patterned after human brain behavior, researchers were able to train more discriminative and better speech recognizers than previous methods.
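The hidden Markov modeling the transcript mentions can be illustrated with a toy forward-algorithm computation, which scores how likely an observation sequence is under a statistical model of hidden states. Every state, probability, and observation below is invented for illustration; real speech recognizers work with thousands of states and acoustic features, not two states and two symbols.

```python
# Toy hidden Markov model: the forward algorithm computes P(observations)
# by summing over all possible hidden-state paths. All numbers here are
# made up for illustration.

states = ["S1", "S2"]
start = {"S1": 0.6, "S2": 0.4}              # initial state probabilities
trans = {"S1": {"S1": 0.7, "S2": 0.3},      # state transition probabilities
         "S2": {"S1": 0.4, "S2": 0.6}}
emit = {"S1": {"a": 0.5, "b": 0.5},         # observation (emission) probabilities
        "S2": {"a": 0.1, "b": 0.9}}

def forward(obs):
    """Return P(obs | model), summing over all hidden state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

print(forward(["a", "b", "a"]))
```

Training such a model on speech from many speakers, as the Carnegie Mellon researchers did, amounts to estimating those transition and emission probabilities from data, which is what made the resulting recognizers robust across voices.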
Once Rashid has gotten the audience up to speed, he starts discussing how current technology is implemented in extant translation services (5:03). "It happens in two steps," he explains. "The first takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part. The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages."
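Rashid's two steps can be sketched in a few lines. The word list and the reordering rule below are invented stand-ins (real systems learn both statistically from parallel text), but they show why step two matters: Mandarin generally places time expressions before the verb, so a word-for-word translation comes out in the wrong order.

```python
# Toy illustration of the two-step translation Rashid describes:
# step 1 finds word equivalents, step 2 reorders them for Chinese.
# The lexicon and reordering rule are invented for illustration.

LEXICON = {"I": "wo", "ate": "chi", "yesterday": "zuotian"}
TIME_WORDS = {"zuotian"}  # Mandarin time expressions precede the verb

def translate_words(sentence):
    """Step 1: look up each word's equivalent (the 'easy' part)."""
    return [LEXICON[w] for w in sentence.split()]

def reorder(words):
    """Step 2: toy rule that moves time expressions after the subject,
    since word-for-word order would be wrong in Mandarin."""
    times = [w for w in words if w in TIME_WORDS]
    rest = [w for w in words if w not in TIME_WORDS]
    return rest[:1] + times + rest[1:]

print(" ".join(reorder(translate_words("I ate yesterday"))))
# "I ate yesterday" -> wo zuotian chi (literally "I yesterday eat")
```

Step one here is a naive dictionary lookup, which is why Rashid calls it "non-trivial" but easy; the hard part his team tackles statistically is captured, very crudely, by the single hand-written rule in step two.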
Short though it may be, the talk is a slow build of relatively dry subject matter until Rashid gets to the topic at hand at 6:45: "Now the last step that I want to take is to be able to speak to you in Chinese." But listening to him talk for those first seven-and-a-half minutes is exactly the point: the software has extrapolated Rashid's voice from an hour-long speech sample, and it modulates the translated audio based on his English speech patterns.
Thus, I recommend watching (or at least listening to) the video from the beginning to get a sense of Rashid's inflection and timbre... but if you're in some kind of hurry, here's the payoff:
Amazing. Read the whole post here.