neural.networks.pt9


Welcome back to the series in which we explain the scientific foundations of neural network technologies and dive into their use in natural language processing and neural machine translation. The purpose of all this is to show you what is really going on behind the curtain.

In Part Eight of our blog, we took a closer look at the various ways in which semantic and syntactic relations are encoded in the linear space of word vectors. You might guess that these regularities can be exploited in translation software, and you would certainly be right. Today, I will explain how this comes about. You can find everything in detail in the paper “Exploiting Similarities among Languages for Machine Translation” by Mikolov et al., where the ideas were first formulated.

First of all, from a mathematical point of view, it is useful to think of translation in a more abstract manner. To do so, let me introduce a function called ‘trans’, which takes an English word and outputs its Spanish translation. The choice of languages does not really matter, at least not yet.

An example of the application of the function trans would be:


trans(‘cat’) = ‘gato’


We will try to approximate this function using neural networks. Also, note that we have already encountered a different function that takes words as input, namely vec. Its outputs, however, were not words but vectors.
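To make the distinction concrete, here is a toy sketch in Python; all of the values are made up purely for illustration.

# Toy sketch: trans maps words to words, vec maps words to vectors.
# All values here are made up for illustration.
trans = {'cat': 'gato', 'dog': 'perro'}        # English word -> Spanish word
vec = {'cat': [0.2, -0.1], 'dog': [0.3, 0.0]}  # English word -> word vector

print(trans['cat'])  # gato
print(vec['cat'])    # [0.2, -0.1]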

Word Vectors Put Into Practice

Now, to compute trans, we would obviously like to make use of word vectors. Unlike before, we have to deal with two languages, and each will have a vector representation in a separate space. To emphasise this, whenever we apply vec, we will make clear which language’s space we are working in.

Hence,

vec(‘cat’) is the word vector for ‘cat’ in the English space,

while

vec(‘gato’) is the word vector for ‘gato’ in the Spanish space.

A natural next step would be to look for a function M mapping word vectors of one language to word vectors of the other. Finding such an M would be as good as finding trans itself, as we would have the relation

trans = vec⁻¹ ∘ M ∘ vec

Here the ‘⁻¹’ just means that one takes the inverse of a function. The inverse of the function vec, denoted by vec⁻¹, takes as input a word vector and gives back the actual word; for example, vec⁻¹(x) = ‘dog’, where x is the word vector corresponding to ‘dog’, i.e.:

 vec(‘dog’) = x

The circle ‘∘’ stands for the composition of functions, that is, their successive application. This is nicely illustrated by an example:

‘perro’ = trans(‘dog’) = vec⁻¹(M(vec(‘dog’)))

Hence, one takes the word ‘dog’, applies vec to get the corresponding word vector, transforms that vector with M into a vector in the Spanish space, and finally applies vec⁻¹ to get the Spanish translation.
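Here is a minimal sketch of this pipeline in Python. The two-dimensional embeddings and the map M are hand-picked toy values, so this only illustrates the composition, not a real translation system. Note that vec⁻¹ is implemented as a nearest-neighbour search in the Spanish space: in practice M(vec(‘dog’)) will not land exactly on a Spanish word vector, so one takes the closest one.

import numpy as np

# Toy embeddings (made-up values, purely illustrative).
vec_en = {'dog': np.array([1.0, 0.0]), 'cat': np.array([0.0, 1.0])}
vec_es = {'perro': np.array([0.0, 1.0]), 'gato': np.array([1.0, 0.0])}

def M(x):
    # A hand-picked map from the English space to the Spanish space;
    # it simply swaps the two coordinates.
    return np.array([x[1], x[0]])

def vec_inverse_es(z):
    # vec⁻¹: return the Spanish word whose vector is closest to z.
    return min(vec_es, key=lambda w: np.linalg.norm(vec_es[w] - z))

def trans(word):
    # trans = vec⁻¹ ∘ M ∘ vec
    return vec_inverse_es(M(vec_en[word]))

print(trans('dog'))  # perro
print(trans('cat'))  # gato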

The Linear Structure

So far, so good. But how do you find such an M? And why did I use the letter ‘M’ in the first place? The answer lies in the linear structure. Remember from the last few posts that many semantic and syntactic relations were captured by addition and subtraction.

Think of the Semantic-Syntactic Word Relationship test from last time. To make use of this extra structure, we would like to preserve it when applying the transformation M. Specifically, this means that we want the following identity to hold:

M(x+y) = M(x) + M(y)

whenever x and y are word vectors (remember that x and y lie in the English space, while M(x) and M(y) are vectors in the Spanish space).

This property, together with the analogous one for scalar multiples, M(a·x) = a·M(x), is called linearity, as it preserves the linear structure, i.e. addition and subtraction. This is quite a strong restriction, but the number of possibilities is still immense. Don’t worry – I’ll explain more about it in the next blog.
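For one concrete linear choice of M (arbitrary toy coefficients, purely for illustration), both identities are easy to check numerically:

import numpy as np

def M(x):
    # One concrete linear choice of M (arbitrary toy coefficients).
    return np.array([2.0 * x[0] + x[1], 3.0 * x[1]])

x = np.array([1.0, -2.0])
y = np.array([0.5, 4.0])

print(np.allclose(M(x + y), M(x) + M(y)))    # True: addition is preserved
print(np.allclose(M(2.5 * x), 2.5 * M(x)))   # True: scalar multiples are preserved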

Why Is Linearity So Useful?

Vocabularies can contain hundreds of thousands of words, but defining M separately for each English word and its Spanish translation would be quite tedious. You might say that this is precisely what dictionaries do, and so the task should be manageable.

However, there are a few problems you might encounter when looking up words in a dictionary.

  • New words pop into existence every day, and it’s extremely difficult to keep track of them.
  • If you have multiple candidates, it is not clear how to choose the best one.
  • There might be mistakes, which go unnoticed, especially if the word is rare.

It turns out that by using M and its linearity, all those problems can be eliminated to a large degree. But first, let me explain why we do not even need to tell our function M the right translation of each word.

Consider the following example:

vec(‘man’) – vec(‘woman’) = vec(‘king’) – vec(‘queen’)

which is equivalent to

vec(‘queen’) = vec(‘king’) – vec(‘man’) + vec(‘woman’)

a simple analogy-reasoning task, like the ones we saw in Part Seven.

Let’s assume that we have told M that the Spanish translation of ‘man’ is ‘hombre’, of ‘woman’ is ‘mujer’ and of ‘king’ is ‘rey’. Using linearity, M already knows what the right translation of ‘queen’ should be:

M(vec(‘queen’)) = M(vec(‘king’) – vec(‘man’) + vec(‘woman’))

= M(vec(‘king’)) – M(vec(‘man’)) + M(vec(‘woman’))

= vec(‘rey’) – vec(‘hombre’) + vec(‘mujer’)
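This computation is easy to carry out in code. The Spanish vectors below are made-up values chosen so that the analogy holds exactly; with real embeddings the result would only be approximate, and one would read off the translation as the nearest Spanish word vector.

import numpy as np

# Made-up Spanish word vectors, chosen so the analogy holds exactly.
vec_es = {
    'rey':    np.array([1.0, 1.0]),
    'hombre': np.array([1.0, 0.0]),
    'mujer':  np.array([0.0, 0.0]),
    'reina':  np.array([0.0, 1.0]),
}

# M(vec('queen')) = vec('rey') - vec('hombre') + vec('mujer')
target = vec_es['rey'] - vec_es['hombre'] + vec_es['mujer']

# Read off the translation as the nearest Spanish word vector.
best = min(vec_es, key=lambda w: np.linalg.norm(vec_es[w] - target))
print(best)  # reina -- the Spanish word for 'queen'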

This might not seem like a big gain, as we had to translate three of the four words for M so that it could get the last one right. However, it turns out that M is specified by just m × n numbers, where m is the dimension of word vectors in the target language and n is the dimension of word vectors in the source language. Better still, by linearity M is already pinned down by how it acts on n linearly independent word vectors, so it suffices to translate manually on the order of n words.

As the size of the vocabulary can reach one million while the dimension of word vectors is only in the hundreds, this means that we have to translate just a tiny fraction of all words, and the rest follows by linearity as above.
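In practice, exact linearity does not hold for real embeddings, so instead of solving for M exactly one fits it to a small seed dictionary of translated pairs; the Mikolov et al. paper does this by minimising the sum of ||M(vec(w)) – vec(trans(w))||² over the seed pairs. Here is a minimal sketch of that idea on toy data, using an exact least-squares solve rather than the paper’s gradient-based optimisation.

import numpy as np

# Toy seed dictionary: rows of X are English word vectors, rows of Z
# are the vectors of their Spanish translations (made-up values).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Z = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

# Least-squares fit of the m x n numbers specifying M, so that
# M applied to each English vector approximates its Spanish counterpart.
B, *_ = np.linalg.lstsq(X, Z, rcond=None)  # solves X @ B ~ Z
M = B.T                                    # then M @ x ~ z for each seed pair

print(np.round(M, 3))  # here: [[0. 1.] [1. 0.]], the map that swaps coordinates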

I hope I have convinced you that the vector space structure of word vectors is quite useful for machine translation. By the way, the famous ‘M’ stands for matrix! Next time, I will explain what this means and how one can tackle the three problems of ordinary dictionaries stated above.

Stay Tuned For More!!