We recently started a mini-series explaining the scientific foundations of neural network technologies. We first explain what a neural network actually is, then dive into its use in natural language processing and, subsequently, in neural machine translation. The purpose is to show you what is really going on behind the curtain.

Part five of this series explored a new neural architecture that will help us with natural language processing tasks.

Part of this exploration was getting acquainted with the idea of a word vector and taking a glimpse at an early paper on how to implement it, written by Rumelhart, Hinton, and Williams in 1986. If you do not remember the main ideas, go back and read that post again; otherwise, today’s post will not make much sense. In 1986, the computing power to actually use such an implementation was not available, so we need to fast forward to the year 2003 and a paper by Bengio, Ducharme, Vincent, and Jauvin titled

A Neural Probabilistic Language Model

The model described is very similar to the one I outlined last time. One constructs a neural network with an input layer consisting of one neuron per word in the vocabulary, and the same goes for the output layer. The task is predicting a word from a fixed number of preceding words (a so-called window). Hence one trains by going through the corpus, and if one encounters a window “…it was a sunny day…”, then the input consists of ‘it’, ‘was’, ‘a’, ‘sunny’ and the output should be the distribution of possible next words. In that case, ‘day’ should have the largest output.
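To make the training setup concrete, here is a minimal sketch of how a corpus could be cut into windows. The function names `context_windows` and `one_hot` are made up for this illustration and do not come from the paper. Each pair consists of the preceding words (the input) and the word that follows them (the desired output), and `one_hot` shows what "one neuron per word in the vocabulary" means for a single input word.

```python
from typing import Iterator, List, Tuple


def context_windows(
    tokens: List[str], context_size: int
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (context, target) pairs: the preceding words and the next word."""
    for end in range(context_size, len(tokens)):
        context = tuple(tokens[end - context_size:end])
        yield context, tokens[end]


def one_hot(word: str, vocabulary: List[str]) -> List[int]:
    """Encode a word with one input neuron per vocabulary entry."""
    return [1 if word == entry else 0 for entry in vocabulary]


tokens = "it was a sunny day".split()
vocabulary = sorted(set(tokens))  # a toy vocabulary built from the window itself
pairs = list(context_windows(tokens, context_size=4))

print(pairs)                         # the single window: four input words, one target
print(one_hot("sunny", vocabulary))  # exactly one input neuron is switched on
```

In a real corpus the same loop simply slides over millions of words, producing one training example per position.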

There is also a statistical learning version of this task, called the n-gram method, where the ‘n’ stands for the number of words in one window, including the word to be predicted. In the above example, four input words plus the word ‘day’ give n = 5. This will make it possible to directly compare the performance of both methods later on.
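For comparison, the n-gram idea can be sketched in a few lines. This is the plain counting version under simplifying assumptions: the helper names are invented for this post, the corpus is a toy sentence, and there is no smoothing, whereas practical n-gram models add smoothing or back-off to cope with windows they have never seen.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_ngram(tokens: List[str], n: int) -> Dict[Tuple[str, ...], Counter]:
    """Count, for every (n-1)-word context, which words follow it and how often."""
    counts: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for end in range(n - 1, len(tokens)):
        context = tuple(tokens[end - (n - 1):end])
        counts[context][tokens[end]] += 1
    return counts


def next_word_probability(
    counts: Dict[Tuple[str, ...], Counter], context: Tuple[str, ...], word: str
) -> float:
    """Unsmoothed estimate: count(context followed by word) / count(context)."""
    followers = counts.get(context)
    if not followers:
        return 0.0  # unseen context; a real model would back off or smooth here
    return followers[word] / sum(followers.values())


corpus = "it was a sunny day and it was a sunny morning".split()
counts = train_ngram(corpus, n=5)
context = ("it", "was", "a", "sunny")
print(next_word_probability(counts, context, "day"))
print(next_word_probability(counts, context, "rain"))
```

The second call shows the weakness of pure counting: a word that never followed this context in the corpus gets probability zero, no matter how plausible it is.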

The Key: Word Vectors

As explained last time, predicting words from their context is just a dummy task. What we actually want are the word vectors. In the case of the above model, these can be found in the first hidden layer. The authors experiment with windows of n = 3 or n = 5 words, that is, 2 or 4 input words, so the first hidden layer consists of 2 or 4 word vectors, each of size 30, 60 or 100. There is also a second hidden layer containing 0, 50 or 100 neurons. All these numbers are examples of hyperparameters, which are not learned by the machine and have to be experimented with.

You might have noticed that the second hidden layer is comparable in size to the first one. Why does the essential data compression not happen in this layer? The authors do not comment on this, and one cannot exclude the possibility. But it is likely that at this stage a lot of information about the individual input words is already lost. Remember, the next layer is already the output layer, which contains a distribution over all the words in the vocabulary. This is the information that must still be present in the second hidden layer. Hence, it makes more sense to place the hidden layer containing the word vectors right at the start.
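The architecture described in this section can be sketched as a single forward pass. This is a simplified illustration, not the authors' implementation: the weights are random placeholders instead of trained values, bias terms are left out, the vocabulary is a five-word toy, and all names are invented for this post. The sizes (word vectors of size 30, a second hidden layer of 50 neurons) are picked from the ranges mentioned above.

```python
import math
import random
from typing import List

random.seed(0)

VOCABULARY = ["it", "was", "a", "sunny", "day"]
CONTEXT_SIZE = 4   # input words per window, as in "it was a sunny"
VECTOR_SIZE = 30   # size of one word vector
HIDDEN_SIZE = 50   # neurons in the second hidden layer


def random_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# In a trained model these are learned; here they are random placeholders.
embeddings = random_matrix(len(VOCABULARY), VECTOR_SIZE)  # one word vector per word
hidden_weights = random_matrix(HIDDEN_SIZE, CONTEXT_SIZE * VECTOR_SIZE)
output_weights = random_matrix(len(VOCABULARY), HIDDEN_SIZE)


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(context: List[str]) -> List[float]:
    # First hidden layer: look up each input word's vector and concatenate them.
    # (Multiplying a one-hot input by the embedding matrix gives the same result.)
    first_layer = [v for word in context for v in embeddings[VOCABULARY.index(word)]]
    # Second hidden layer: an ordinary layer with tanh activation.
    second_layer = [math.tanh(a) for a in matvec(hidden_weights, first_layer)]
    # Output layer: one probability per vocabulary word.
    return softmax(matvec(output_weights, second_layer))


distribution = predict_next(["it", "was", "a", "sunny"])
for word, probability in zip(VOCABULARY, distribution):
    print(f"{word}: {probability:.3f}")
```

With untrained weights the output is close to uniform. Training would adjust all three matrices so that ‘day’ receives the largest probability, and the rows of `embeddings` are the word vectors we are really after.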

If you read the original paper, you will notice that the authors put a lot of emphasis on the complexity of the model, that is, the number of arithmetic operations needed. This is quite often an issue with neural networks. As a result, the training corpus had a size of (only) 1 to 15 million words. Today we have faster computers and better algorithms, so training data may contain many billions of words.

Neural Network: Success Rate

How well did this early neural network do? It is not at all clear how to measure its performance. The authors used so-called ‘perplexity’, which is closely related to the information-theoretic concept of entropy introduced by Claude Shannon. It measures how well the model predicts the words of a text that it has not seen during training; the lower the perplexity, the better. It can be defined for the n-gram method as well as for neural nets. In the latter case, it is directly tied to the cost function: it is the exponential of the average negative log-likelihood. It turns out that the perplexity was significantly lower for neural nets. That means that the model based on word vectors indeed had more predictive power than the n-gram model. Certainly, a very promising result.
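As a small illustration, perplexity can be computed from the probabilities a model assigned to the words that actually came next in a held-out text. The function name and the example numbers below are made up for this sketch; they are not results from the paper.

```python
import math
from typing import List


def perplexity(probabilities: List[float]) -> float:
    """Exponential of the average negative log-probability of the true next words."""
    average_nll = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(average_nll)


# A model that always spreads its bets evenly over 4 candidate words
uniform_guesser = [0.25, 0.25, 0.25, 0.25]
# A model that is usually confident about the correct next word
confident_model = [0.9, 0.8, 0.7, 0.9]

print(perplexity(uniform_guesser))
print(perplexity(confident_model))
```

A useful way to read the number: a perplexity of about 4 means the model is, on average, as uncertain as if it had to choose uniformly among 4 words at each step.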

Assessment of Performance

At this point, it is worthwhile to say a bit more about the assessment of performance, especially the distinction between intrinsic and extrinsic evaluations. Perplexity is intrinsic, which means that it is a theoretical construct designed to capture a particular aspect of a network's performance in a single real number. In principle, the cost function can always be used as an intrinsic evaluation, if one remembers to plug in test data instead of training data. However, at the end of the day, a number is not what we want.

We want neural networks to be useful in practice.

This is where extrinsic evaluations come in. The most significant extrinsic evaluation is consumer satisfaction. Unfortunately, there is no simple formula to compute it yet.

However, in the next posts, we will discover other intrinsic evaluations, which give us a much clearer picture of the power of word vectors than just a single number. It will turn out that the linear structure of vector spaces holds the key to capturing not just statistics but also semantics. You might want to brush up on your high school linear algebra for the next post, then 😉

Stay tuned!