A team of researchers at Oxford University has coaxed an artificial intelligence program into an impressive leap forward, and towards our own obsolescence. The program, known as LipNet, shows a particularly promising ability to read lips in video clips, thanks to machine learning and a novel way of approaching the data. The key difference is that rather than trying to teach the AI the mouth shapes of single words and phonemes, the researchers ask LipNet to interpret whole sentences. Using GRID, a huge bank of 3-second videos featuring brightly lit, forward-facing speakers, LipNet has learned to translate lip movements to text with a 93.4% accuracy rate. Compare that to humans' 52.3%. It doesn't look good.
To accomplish this, the team ran over 28,000 videos of actors speaking syntactically similar sentences through a neural network. Each sentence contained a command, color, preposition, letter, number, and adverb, always in that order. When tested on 300 sentences of the same type, human lip readers had an error rate of 47.7%, whereas LipNet netted just 6.6%.
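For readers curious how figures like these are typically scored, here is a minimal Python sketch. It assumes the error rates are word-level (the article does not say so), and the function names and example sentences are made up for illustration; this is not the researchers' evaluation code. It counts the word substitutions, insertions, and deletions needed to turn a transcript into the reference, then shows that the accuracy figures quoted above are simply 100 minus the error rates.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_from_error(error_rate_percent: float) -> float:
    """Convert an error rate in percent to an accuracy in percent."""
    return round(100.0 - error_rate_percent, 1)


# Made-up sentences, not taken from the GRID data set.
wer = word_error_rate(
    "please open the front door now",
    "please open the back door now",
)

# The article's figures: 6.6% error for LipNet, 47.7% for humans.
lipnet_accuracy = accuracy_from_error(6.6)
human_accuracy = accuracy_from_error(47.7)
```

In the made-up example, one wrong word out of six gives an error rate of one sixth. Applied to the reported error rates, the conversion reproduces the 93.4% and 52.3% accuracy figures.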
With this kind of accuracy, we might see better automation of closed captioning on news and entertainment videos, and some speculate it may become a feature in more personal communication as well. Imagine real-time transcription of a Skype or FaceTime conversation with poor audio quality. I want that already.
Detractors are quick to point out the structural limitations of the data set, since most movies, news broadcasts, and YouTube videos don't exclusively feature well-lit actors speaking short sentences directly into a camera. However, given progressively more realistic data sets, the LipNet framework appears capable of learning enough to do some good, even if it won't be stealing jobs any time soon.
Check out the testing data and paper here.