
Facebook Wav2vec-U for Unsupervised Speech Recognition

Speech recognition is a significant application of Natural Language Processing (NLP), but it should not benefit only people who are fluent in one of the world’s most widely spoken languages. One way to expand access to these NLP applications is to reduce the dependence on annotated data.

Last week Facebook announced that it had developed an AI model called wav2vec Unsupervised (wav2vec-U), a way to build speech recognition systems that require no transcribed data at all. The model rivals the performance of the best supervised models from only a few years ago, which were trained on nearly 1,000 hours of transcribed speech. Facebook trained the new model for Swahili, Tatar, Kyrgyz, and several other languages. This is a remarkable step toward building machines that can solve a range of tasks by learning from their observations.

The dominant form of AI for speech recognition falls under supervised learning, which learns to map an input to an output from example input-output pairs. However, supervised learning techniques consume vast computational time and resources, and they tend to overfit. Companies have to obtain tens of thousands of hours of audio and recruit human teams to manually transcribe the data, and the same process has to be repeated for each language.

How it works

The new wav2vec-U learns purely from recorded speech audio and unpaired text, eliminating the need for any transcriptions. Using a self-supervised model and a simple k-means clustering algorithm, it segments voice recordings into speech units that loosely correspond to individual sounds. (The word cat, for example, includes three sounds: “/K/”, “/AE/”, and “/T/”.)
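The clustering and segmentation step can be sketched roughly as follows. This is a minimal illustration and not Facebook's implementation: the 2-D toy frames stand in for the high-dimensional self-supervised features, the function names `kmeans` and `segment` are hypothetical, and the evenly spaced centroid initialization is an assumption chosen only to make the example reproducible.

```python
def kmeans(frames, k, iters=10):
    """Plain k-means over feature vectors; returns one cluster id per frame."""
    # Assumption: evenly spaced frames as initial centroids (deterministic, toy-sized).
    centroids = [list(frames[i * len(frames) // k]) for i in range(k)]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(f, centroids[c])))
            for f in frames
        ]
        # Update step: move each centroid to the mean of its member frames.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def segment(labels):
    """Merge runs of identical cluster ids into (start, end, cluster_id) spans."""
    spans, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            spans.append((start, i, labels[start]))
            start = i
    return spans


# Toy "audio": nine frames forming three well-separated groups,
# loosely playing the role of the three sounds in "cat".
frames = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
]
labels = kmeans(frames, k=3)
units = segment(labels)
print(units)
```

Each span is a candidate speech unit: a stretch of consecutive frames that fell into the same cluster. The cluster ids themselves are arbitrary; mapping them to phonemes is left to the next stage.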

The company used a Generative Adversarial Network (GAN), a combination of a generator network and a discriminator network, to learn to recognize words. The generator takes each audio segment, embedded by the self-supervised model and grouped by the k-means clustering algorithm, and predicts a phoneme corresponding to a sound in the language. It is trained by trying to fool the discriminator, which checks whether the predicted phoneme sequences look realistic. The discriminator is also a neural network. It is fed the output of the generator as well as real text from various sources that has been phonemized, and it learns to tell the two apart.
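The tug-of-war between the two networks can be illustrated with the standard GAN losses. The sketch below is an assumption for illustration only: it uses the textbook binary cross-entropy formulation, which is not necessarily the exact objective Facebook trained with, and the probabilities are made-up discriminator outputs rather than results from a real model.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real phoneme sequences should score near 1,
    generated ones near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator
    scores generated sequences as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability that a sequence is real).
d_real = [0.9, 0.8]   # scores for phonemized real text
d_fake = [0.1, 0.2]   # scores for generator output early in training

print(discriminator_loss(d_real, d_fake))
print(generator_loss(d_fake))
```

Early in training the discriminator easily spots generated sequences, so the generator's loss is high. As the generator's phoneme sequences become more realistic, the discriminator's scores for them rise and the generator's loss falls.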

Initially, the GAN’s output is poor in quality, but it improves with feedback from the discriminator.
“It takes about half a day, roughly 12 to 15 hours on a single GPU, to train an average Wav2vec-U model. This excludes self-supervised pre-training of the model, but we previously made these models publicly available for others to use,” Facebook AI research scientist manager Michael Auli told the media. “Half a day on a single GPU is not very much, and this makes the technology accessible to a wider audience to build speech technology for many more languages of the world.”

System Evaluation

The system was evaluated on the TIMIT benchmark, where it reduced the error rate by 57 percent compared with the next best unsupervised method.
[Image: wav2vec-U benchmark results, from the Facebook blog]
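To make the error-rate comparison concrete, here is a small sketch of how a phoneme error rate and a relative reduction can be computed. It assumes the usual edit-distance definition of error rate; the function names are hypothetical and the numbers in the usage lines are invented for illustration, not taken from the TIMIT results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, new):
    """Fraction by which `new` lowers the error rate relative to `baseline`."""
    return (baseline - new) / baseline


# One substituted phoneme out of three in "cat".
per = phoneme_error_rate(["K", "AE", "T"], ["K", "AH", "T"])
print(per)

# Made-up error rates: going from 40.0 to 10.0 is a 75 percent relative reduction.
print(relative_reduction(40.0, 10.0))
```

A relative reduction of 57 percent, as reported for TIMIT, means the new error rate is 43 percent of the previous best unsupervised error rate, not that the error rate dropped by 57 percentage points.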
Researchers also compared wav2vec-U’s performance with supervised models on the much larger Librispeech benchmark. Those supervised models typically use 960 hours of transcribed speech data.
[Image: benchmark results, from the Facebook blog]
“We found wav2vec-U as accurate as the state of the art from only a few years ago, while using no labeled training data at all. This shows that speech recognition systems with no supervision can achieve very good quality,” Facebook wrote on its blog.

Both TIMIT and Librispeech measure performance on English speech, for which good speech recognition technology already exists thanks to huge labeled data sets. However, unsupervised speech recognition is most impactful for languages for which little to no labeled data exists. Facebook researchers therefore tried the system on languages with few data resources, such as Swahili, Tatar, and Kyrgyz.

[Image from the Facebook blog]

Future Work

The code for wav2vec-U is open source and available on GitHub. It enables developers to build speech recognition systems using unlabeled speech audio recordings and unlabeled text. Facebook is also exploring the model’s potential to support future internal and external tools, such as video transcription.

Facebook acknowledges that more research must be done to figure out the best way to address bias. “We have not yet investigated potential biases in the model. Our focus was on developing a method to remove the need for supervision,” Facebook told the media. “A benefit of the self-supervised approach is that it may help avoid biases introduced through data labeling. But this is an important area that we are very interested in.”

