Facebook Wav2vec-U for Unsupervised Speech Recognition
Speech recognition is a significant application of Natural Language Processing. But it should not benefit only people who are fluent in one of the world's most widely spoken languages. One way to expand access to these NLP applications is to reduce the dependence on annotated data.
Last week Facebook announced that it has developed an AI model using wav2vec Unsupervised (wav2vec-U). This is a way to build speech recognition systems that require no transcribed data at all. The model beats the performance of the best supervised models from only a few years ago, which researchers trained on nearly 1,000 hours of transcribed speech. Facebook trained this new model for Swahili, Tatar, Kyrgyz, and several other languages. This is a remarkable step toward building machines that can solve a range of tasks by learning from their observations.
How it works
The company used a Generative Adversarial Network (GAN), a combination of a generator and a discriminator network, for word recognition. The generator takes each audio segment, which is embedded with self-supervised speech representations and segmented with a k-means clustering algorithm, and predicts a phoneme corresponding to a sound in the language. The generator tries to fool the discriminator, a second neural network that is fed the generator's output and checks whether the predicted phoneme sequences look realistic. The discriminator is trained with real text from various sources that was phonemized, so it can evaluate the generator's output.
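The preprocessing step above (cluster frame embeddings, then merge runs of identical clusters into segments) can be sketched in a few lines. This is a minimal toy illustration, not Facebook's implementation: it assumes tiny 2-D "embeddings" and a handful of clusters, whereas the real system works on high-dimensional self-supervised speech features with many more clusters.

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    """Assign each frame embedding in X to one of k cluster IDs.

    A bare-bones Lloyd's k-means; real pipelines would use a tuned
    library implementation (e.g. faiss or scikit-learn).
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Distance from every frame to every center, then pick the nearest.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def segment(labels):
    """Collapse consecutive identical cluster IDs into one segment each,
    so a run of frames with the same ID becomes a single unit the
    generator can map to a phoneme prediction."""
    out = [int(labels[0])]
    for l in labels[1:]:
        if l != out[-1]:
            out.append(int(l))
    return out

# Toy usage: four frame "embeddings" forming two obvious clusters.
frames = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
ids = kmeans_labels(frames, k=2)
units = segment(ids)  # two segments, one per cluster run
```

The generator would then consume these segment-level units and emit phoneme sequences for the discriminator to judge; that adversarial loop needs a full deep-learning stack and is omitted here.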
Both TIMIT and Librispeech measure performance on English speech, for which good speech recognition technology and huge labeled data sets already exist. However, unsupervised speech recognition is most impactful for languages with little to no labeled data. Facebook researchers therefore tried this system on languages with few data resources, such as Swahili, Tatar, and Kyrgyz.
Facebook acknowledges that more research must be done to figure out the best way to address bias. "We have not yet investigated potential biases in the model. Our focus was on developing a method to remove the need for supervision," Facebook told the media. "A benefit of the self-supervised approach is that it may help avoid biases introduced through data labeling. But this is an important area that we are very interested in."