Automatic Speech Recognition (ASR) can be defined as the machine conversion of speech into text in real time, and it is therefore often referred to as Speech-to-Text conversion. Well-known systems that use ASR include Siri, Google Now and Alexa. Speech recognition research has been under way for more than 50 years, but it has not yet reached the point where machines understand everything a person says in any acoustic environment.
The ultimate goal of ASR research is to allow a computer to recognize in real time, with one hundred percent accuracy, all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics or accent. In practical terms, an ASR system should accurately and efficiently convert a speech signal into a text transcription of the spoken words, independent of the speaker, the environment or the device used to record the speech (i.e. the microphone). The speaker’s sentence is typically a sequence of words interspersed with pauses and fillers. The system captures it as a speech waveform, which embodies the words of the sentence as well as the extraneous sounds and pauses in the spoken input. Next, the software attempts to decode the speech into the best estimate of the sentence using a syntactic decoder.
Here, it generates a valid sequence of representations. A speech recognizer requires two statistical models: an Acoustic Model and a Language Model. The statistical methods for continuous speech recognition were established more than 30 years ago, the most popular being Hidden Markov Models (HMMs). In speech recognition, decoding is equivalent to recognizing the sequence of words given the acoustic observations. Acoustic modelling is the heart of speech recognition: it estimates the probability of generating the acoustic features for given words and thus directly affects recognition quality. However, only partial information is available for training the Acoustic Model parameters, because the corresponding textual transcription is not time-aligned with the audio. This hidden alignment of the words in an utterance makes acoustic modelling more challenging.
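The hidden alignment mentioned above can be illustrated with a small Viterbi search. The sketch below is not taken from any particular toolkit: it assumes a strictly left-to-right HMM, a single self-loop/advance probability pair shared by all states, and a precomputed matrix of per-frame log-likelihoods. The names `viterbi_align`, `log_emit`, `log_self` and `log_next` are hypothetical.

```python
import numpy as np

def viterbi_align(log_emit, log_self, log_next):
    """Best left-to-right state alignment for a (T, S) matrix of frame scores.

    log_emit[t, s] is log p(frame t | state s). log_self and log_next are the
    log-probabilities of staying in a state or advancing to the next one
    (assumed identical for every state in this sketch).
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emit[0, 0]  # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log_self
            move = score[t - 1, s - 1] + log_next if s > 0 else -np.inf
            if move > stay:
                score[t, s], back[t, s] = move, s - 1
            else:
                score[t, s], back[t, s] = stay, s
            score[t, s] += log_emit[t, s]
    # The path must end in the last state; follow the back-pointers.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score[T - 1, S - 1])

# Toy example (made-up numbers): 6 frames, 3 states.
toy_emissions = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
    [0.10, 0.10, 0.80],
    [0.05, 0.05, 0.90],
])
toy_path, toy_score = viterbi_align(np.log(toy_emissions), np.log(0.5), np.log(0.5))
```

The returned path assigns one state to every frame, which is exactly the time alignment that the transcription alone does not provide.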
A Language Model effectively reduces and, more importantly, prioritizes the hypotheses produced by acoustic modelling. The probability of the acoustic features given a word transcription, estimated by the Acoustic Model, is combined with the probability of that word transcription, estimated by the Language Model, in order to compute the posterior probability of the transcription. Most current speech recognition systems use Hidden Markov Models (HMMs) to deal with the temporal variability of speech. Gaussian Mixture Models (GMMs) are used to determine how well each state of each HMM fits a frame of the speech input. An alternative way to evaluate the fit is to use a feed-forward neural network. Feed-forward neural networks with many hidden layers have been reported to outperform GMMs on a variety of benchmarks.
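The way the two models are combined can be sketched as follows. By Bayes' rule the posterior of a transcription W given acoustics X is proportional to P(X|W) P(W), so in the log domain the two scores are simply added. The hypothesis list, its probabilities and the `lm_weight` scaling factor are illustrative assumptions, not values from the experiments.

```python
import math

def best_transcription(hypotheses, lm_weight=1.0):
    """Pick the hypothesis maximising log P(X|W) + lm_weight * log P(W).

    lm_weight is an assumed tuning knob; with lm_weight=1.0 the score is the
    unnormalised log posterior given by Bayes' rule.
    """
    def total(h):
        return h["log_acoustic"] + lm_weight * h["log_lm"]
    return max(hypotheses, key=total)

# Made-up scores for two competing hypotheses of the same audio.
hypotheses = [
    {"words": "recognize speech",
     "log_acoustic": math.log(0.20), "log_lm": math.log(0.010)},
    {"words": "wreck a nice beach",
     "log_acoustic": math.log(0.25), "log_lm": math.log(0.0001)},
]

with_lm = best_transcription(hypotheses)                    # uses both models
acoustic_only = best_transcription(hypotheses, lm_weight=0.0)  # ignores the LM
```

In this toy case the acoustic score alone slightly prefers the second hypothesis, and the language model score reverses that ranking, which is the prioritizing role described above.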
One drawback of GMMs is that they are statistically inefficient for modelling data that lie on or near a nonlinear manifold. Speech is produced by modulating a relatively small number of parameters of a dynamical system, which implies that its true underlying structure is much lower-dimensional than is apparent in a window containing hundreds of coefficients. Artificial neural networks trained by backpropagating error derivatives have the potential to learn much better models of data that lie on or near a nonlinear manifold.
Two decades ago, researchers had some success using artificial neural networks with a single hidden layer to predict HMM states from windows of acoustic coefficients. Advances in machine learning algorithms and hardware have since led to more efficient methods for training deep neural networks (DNNs) with many layers and a very large output layer. In training, an acoustic model is built for each phoneme. HMM-based acoustic modelling captures the distinctive properties of speech while taking into account speaker variations, pronunciation variations and context-dependent phonetic coarticulation. For this reason, the acoustic training corpus has to be quite large to obtain a robust acoustic model. First, an initial set of single-Gaussian monophone HMMs is created.
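The text does not say how the initial single-Gaussian monophone models are set up. One common choice, assumed here purely for illustration, is a "flat start" in which every state receives the global mean and variance of the training features. The function names, the three-state default and the variance floor below are hypothetical.

```python
import numpy as np

def flat_start(features, num_states=3):
    """Give every HMM state the same diagonal Gaussian (global mean/variance).

    features: (num_frames, dim) array of acoustic feature vectors.
    """
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # small floor avoids zero variance
    return [{"mean": mean.copy(), "var": var.copy()} for _ in range(num_states)]

def log_gaussian(frame, state):
    """Log density of one frame under a diagonal-covariance Gaussian state."""
    diff = frame - state["mean"]
    return float(-0.5 * np.sum(np.log(2 * np.pi * state["var"])
                               + diff ** 2 / state["var"]))

# Tiny made-up feature matrix: two frames, two dimensions.
toy_features = np.array([[0.0, 0.0], [2.0, 4.0]])
toy_states = flat_start(toy_features)
```

Scores from `log_gaussian` play the role of the per-frame fit between an HMM state and a speech frame; later training passes would then refine each state's parameters separately.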
Each monophone is then expanded into a context-dependent unit spanning three phonemes (a triphone), with each phoneme represented by a three-state HMM. The large output layer in the DNNs is required to accommodate the large number of HMM states that arise when each phone is modelled by a number of “triphone” HMMs. Kaldi is an open-source toolkit that contains several recipes for implementing Automatic Speech Recognition. It supports features such as linear transforms, MMI, boosted MMI, MCE, discriminative analysis and deep neural networks. Kaldi consists of a library, command-line programs and scripts for acoustic modelling. The basic unit used in many neural networks computes the weighted sum of its inputs and then passes this sum through a nonlinear function.
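The basic unit just described can be sketched in a few lines, together with the time-delay variant that the following paragraph introduces. The sigmoid nonlinearity, the function names and the array shapes are assumptions made for this illustration; the text only requires some nonlinear function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def basic_unit(x, w, b=0.0):
    """Weighted sum of the inputs passed through a nonlinearity."""
    return sigmoid(np.dot(w, x) + b)

def time_delay_unit(frames, w, b=0.0):
    """The same unit with one weight vector per delay.

    frames: (T, D) input sequence.
    w:      (N + 1, D) weights, row d applying to the input delayed by d steps.
    Returns T - N activations, one per time step with a full N-step history.
    """
    n_delays = w.shape[0] - 1
    out = []
    for t in range(n_delays, frames.shape[0]):
        window = frames[t - n_delays : t + 1][::-1]  # row d is the input at t - d
        out.append(sigmoid(np.sum(w * window) + b))
    return np.array(out)

# Made-up input: a ramp of five one-dimensional frames.
ramp = np.arange(5, dtype=float).reshape(5, 1)
# Weight +1 on the current frame and -1 on the previous one, so the unit
# responds to the change between consecutive frames.
delta_weights = np.array([[1.0], [-1.0]])
ramp_response = time_delay_unit(ramp, delta_weights)
```

With a single delay the unit compares each frame to its predecessor; stacking such units lets higher layers cover progressively longer time spans.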
In a time-delay neural network (TDNN), this basic unit is modified by introducing delays. The inputs of a unit are now multiplied by several weights, one for each delay. In this way, a TDNN unit can relate and compare the current input to the past history of events. Each TDNN unit outlined in this section can encode temporal relationships within a range of N delays. Higher layers can attend to larger time spans, so the lower layers capture local, short-duration features while the higher layers capture more complex, longer-duration features. The learning procedure ensures that each unit in each layer has its weights adjusted in a way that improves the network’s overall performance.
The TIMIT dataset used in the experiments has the following composition:
• Total of 630 speakers
• Each speaker has 10 utterances, i.e. a total of 6300 utterances
• Speakers are drawn from 8 major dialect regions in the US, labelled dr1-dr8
• The overall male-female ratio is 70-30
• The 6300 utterances are drawn from three sets of sentences:
1) 2 dialect sentences designed at SRI (SA)
2) 450 phonetically compact sentences designed at MIT (SX)
3) 1890 diverse sentences designed at TI (SI)
Experiments were carried out on the TIMIT dataset using various models, such as GMMs and DNNs, for acoustic modelling, along with HMMs for modelling the temporal structure of speech. The Word Error Rate (WER) for each combination was recorded.
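For reference, the Word Error Rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the reference and the recognized transcription, divided by the number of reference words. The sketch below assumes whitespace tokenization and equal cost for all three error types; the function name and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)

# One word of a six-word reference is missing from the hypothesis.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, the WER can exceed 1.0 when the hypothesis is much longer than the reference.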