Language Translation using Long Short-Term Memory
Parul Singh and Peter Mascarenhas
CS6120 NLP Spring 2018
Deep Neural Networks (DNNs) are only as good as their training data. They are powerful because they can perform parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [1]. However, they perform poorly on sequence-to-sequence (S2S) learning. Large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network's parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem.
Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. This is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known in advance. For example, speech recognition and machine translation are sequential problems. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.
Since Machine Translation (MT) is an S2S problem, DNNs do not perform well on it, as they expect the dimensionality of inputs and outputs to be known and fixed. For our project we are using Long Short-Term Memory (LSTM) networks. The LSTM's ability to learn from data with long-range temporal dependencies makes it a natural choice for this application, given the time lag between the inputs and their corresponding outputs. It can also convert an input sentence of variable length into a fixed-dimensional vector representation.
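To make the fixed-dimensional encoding concrete, the following is a minimal sketch of an LSTM encoder in NumPy. The random weights, the toy vocabulary, and the sizes D and H are illustrative stand-ins of our choosing, not trained parameters:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: gates computed from input x and previous state (h, c)."""
    z = W @ x + U @ h + b                 # stacked pre-activations for all gates
    H = h.shape[0]
    i = 1 / (1 + np.exp(-z[:H]))          # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))       # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))     # output gate
    g = np.tanh(z[3*H:])                  # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def encode(tokens, embed, W, U, b, hidden=8):
    """Read a variable-length token sequence; return a fixed-size vector."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for t in tokens:
        h, c = lstm_step(embed[t], h, c, W, U, b)
    return h  # same dimensionality regardless of len(tokens)

rng = np.random.default_rng(0)
D, H = 4, 8  # embedding size and hidden size (toy values)
embed = {w: rng.normal(size=D) for w in ["the", "cat", "sat", "down"]}
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

v1 = encode(["the", "cat"], embed, W, U, b)
v2 = encode(["the", "cat", "sat", "down"], embed, W, U, b)
assert v1.shape == v2.shape == (H,)  # fixed size for different input lengths
```

Both sentences, despite their different lengths, come out as vectors of the same fixed size, which is exactly the property the decoder relies on.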
Figure 1: Our model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short-term dependencies in the data that make the optimization problem much easier.
We aim to build two deep LSTM recurrent neural networks (RNNs) with at least four layers each, which we call the Encoder and the Decoder. We also plan to reverse the order of the words of the input sentence to “establish communication” between the input and the output [4]. Our workflow comprises two essential steps. The first is to use the encoder LSTM to read the input sequence one time step at a time and obtain a large fixed-dimensional vector. This is very similar to the approach of Kalchbrenner and Blunsom [3], the pioneers in mapping input sentences to vectors. The second step is to use the decoder LSTM to extract the output sequence from that vector, as shown in Fig. 1.
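The preprocessing this workflow implies (reversing the input and delimiting the output with an end-of-sentence token, as in Fig. 1) can be sketched as follows; the `<sos>` and `<eos>` marker names are placeholders of our choosing:

```python
def prepare_pair(src_tokens, tgt_tokens):
    """Reverse the source sentence and add start/end markers to the target,
    producing encoder input, decoder input, and decoder target sequences."""
    enc_in = list(reversed(src_tokens))      # encoder reads the source reversed
    dec_in = ["<sos>"] + tgt_tokens          # decoder is primed with a start token
    dec_out = tgt_tokens + ["<eos>"]         # decoder learns to emit end-of-sentence
    return enc_in, dec_in, dec_out

enc_in, dec_in, dec_out = prepare_pair(["A", "B", "C"], ["W", "X", "Y", "Z"])
assert enc_in == ["C", "B", "A"]
assert dec_in == ["<sos>", "W", "X", "Y", "Z"]
assert dec_out == ["W", "X", "Y", "Z", "<eos>"]
```

At inference time the decoder stops as soon as it emits `<eos>`, which is how the model handles outputs of unknown length.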
3 Related Work
There is a large body of work on applications of neural networks to machine translation. Ili is a device that translates languages very quickly without being connected to the internet; it was a turning point in our decision to pursue this model. We have found so far that the most effective way to approach the problem is to start with an RNN language model or a Feedforward Neural Network Language Model (NNLM) [2]. More recently, researchers have begun to look into ways of including information about the source language in the NNLM, an approach described in some of the latest papers we have studied. Our work is closely related to that of Kalchbrenner and Blunsom [3], who were the first to map an input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words.
4 Dataset and Evaluation
Our model, like typical neural language models, relies on vector representations, so we will use a fixed-size vocabulary for both languages, French and English. We will consider the top 100 phrases that are useful to a traveler entering a new country. We will draw our data from the WMT'14 English-to-French dataset, which contains over 50 million words. We will train on this dataset to convert English sentences into French for now, and we will extend this to other languages as future work. We will use a fixed number of the most frequent words for the source language and for the target language. Every out-of-vocabulary word will be replaced with a special “UNK” token.
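A minimal sketch of the vocabulary truncation and UNK replacement described above; the toy corpus and vocabulary size are illustrative only:

```python
from collections import Counter

def build_vocab(corpus, max_size):
    """Keep the max_size most frequent words; everything else maps to UNK."""
    counts = Counter(w for sent in corpus for w in sent)
    return {"UNK"} | {w for w, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab):
    """Replace every out-of-vocabulary word with the UNK token."""
    return [w if w in vocab else "UNK" for w in sentence]

corpus = [["bonjour", "le", "monde"], ["bonjour", "tout", "le", "monde"]]
vocab = build_vocab(corpus, max_size=3)
assert replace_oov(["bonjour", "mes", "amis"], vocab) == ["bonjour", "UNK", "UNK"]
```

The same two functions would be applied independently to the English and French sides, each with its own frequency-ranked vocabulary.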
We will need to evaluate the performance of the LSTM and decide how many hidden layers are needed for good results. We would like to define the accuracy of the model by comparing each generated word against the ground truth, counting a generated word as correct even if it is a different word or sense, as long as it is right in meaning or yields a close, meaningful sentence.
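One way to make this lenient accuracy concrete is position-wise token matching backed by a synonym table; the table here is a hypothetical stand-in for whatever notion of "right in meaning" we settle on:

```python
def token_accuracy(predicted, reference, synonyms=None):
    """Fraction of positions where the predicted token matches the reference,
    counting an entry in the (hypothetical) synonym table as a match."""
    synonyms = synonyms or {}
    matches = 0
    for p, r in zip(predicted, reference):
        if p == r or r in synonyms.get(p, set()):
            matches += 1
    return matches / max(len(reference), 1)

syn = {"voiture": {"auto"}}  # hypothetical synonym table, not real evaluation data
assert token_accuracy(["la", "voiture", "rouge"], ["la", "auto", "rouge"], syn) == 1.0
assert token_accuracy(["la", "voiture"], ["le", "auto"], syn) == 0.5
```

A standard corpus-level metric such as BLEU could replace this sketch once we move beyond word-level comparisons to whole-sentence quality.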
References
[1] A. Razborov. On small depth threshold circuits. In Proc. 3rd Scandinavian Workshop on Algorithm Theory, 1992.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, pages 1137–1155, 2003.
[3] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
[4] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[5] T. Tran. Creating a language translation model using sequence to sequence learning approach.
[6] T. Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
[7] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, 2010.
[8] D. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[9] M. Sundermeyer, R. Schlüter, and H. Ney. LSTM neural networks for language modeling. In INTERSPEECH, 2012.