How To Measure Machine Translation Quality?

Is there a way to measure machine translation quality for many different language combinations, such as English to Spanish, or German to Japanese?

We can try to do so by looking at the differences between machine translation engines and evaluating the general quality of machine-translated content. That’s what we will do in this blog post. Let’s go!

 Which translation engines are out there?

Let’s start with a shortlist of machine translation engines:

  • Google Translate
  • Microsoft Translator
  • IBM Watson
  • Yandex Translate
  • DeepL
  • Amazon Translate

While there are dozens of machine translation products out there, such as Systran, Moses, etc., along with products built for specific niches or a smaller number of language combinations, today we will focus on major consumer machine-translation products.

 The types of machine translation engines

Now that we have the list, we also need to briefly explain the types of machine translation engines, at least the major types. We did cover this in a previous blog article, so just to repeat, there are three major types of machine translation:

  • Rule-based
  • Statistical
  • Neural

If we now join all the information into one small table, we get the following:


As you can see, the statistical and neural approaches to machine translation prevail. For most people, machine translation is synonymous with Google Translate, thanks to the large number of languages it supports and its long (though not always successful) history.

What does this all mean for us, the clients and consumers? Well, this is where the third factor comes into play: the assessment of quality.

 How to measure machine translation quality?

The debates on how to assess machine translation quality continue. Scientists, scholars, and translators constantly work on perfecting the way to automatically check machine translation, making it as close as possible to translations delivered by humans.

Because machine translation products come in different forms and sizes, you cannot really compare statistical and neural engines one-to-one. Another problem is that different groups have developed different ways to measure the quality of translations.

The most widely used metric is BLEU (bilingual evaluation understudy). Machine-translated sentences are compared to a set of good-quality reference translations and scored with a number between 0 and 1. The closer the score is to 1, the more closely the machine output matches the human reference translations. There are other metrics as well, such as NIST (based on BLEU), WER (word error rate), METEOR, LEPOR, or chrF.
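To make the idea behind BLEU more concrete, here is a minimal sketch in Python. It is a simplified, sentence-level version written for illustration only: it assumes plain whitespace tokenization, uses no smoothing, and the function names (ngrams, simple_bleu) are our own. Real evaluations normally score a whole test set with an established tool such as sacreBLEU, so treat the numbers from this sketch as a rough indication.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=4):
    """Simplified BLEU: clipped n-gram precision (n = 1..max_n) times a brevity penalty.

    Assumption: whitespace tokenization and no smoothing, so the score is 0.0
    as soon as one n-gram order has no match at all.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For every n-gram, keep the highest count found in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        # "Clipping": a candidate n-gram is credited at most as often as it
        # appears in a reference, so repeating a correct word does not help.
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: punish candidates shorter than the closest reference.
    ref_len = min((len(r) for r in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference = ["the cat sat on the mat"]
print(simple_bleu("the cat sat on the mat", reference))  # identical to the reference
print(simple_bleu("the cat sat on a mat", reference))    # one word differs
print(simple_bleu("dogs bark loudly at night", reference))  # nothing in common
```

An output identical to the reference scores 1, an output with nothing in common scores 0, and the sentence with one different word lands in between.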

Subjectively speaking, Google Translate has proven to be the most versatile machine translation product: it supports the most languages and has provided the best results for us and our clients. Google has also evaluated its own product and acknowledges the differences between various language combinations:

A graphic from Google Research highlighting the 2016 accuracy levels of Google Translate. Source:

Machine translation quality – not that terrible, after all

What you need to understand if you want to translate your content is the following:

  • There are many machine translation products out there
  • All of them use different approaches to machine translation (like neural or statistical)
  • Machine translation quality also largely depends on the language combination and the content
  • All of the machine translation products score differently, depending on the metric used
  • BLEU is still accepted as the main metric for machine translation quality
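The point that scores depend on the metric can be illustrated with word error rate (WER), one of the other metrics mentioned earlier. The sketch below is a minimal Python version, assuming whitespace tokenization and a single reference translation; the function name word_error_rate is our own. Unlike BLEU, lower is better here, and the value can exceed 1 when the machine output is much longer than the reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of words in the reference.

    Assumption: whitespace tokenization and exactly one reference translation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
print(word_error_rate(reference, "the cat sat on the mat"))  # no errors
print(word_error_rate(reference, "the cat sat on a mat"))    # one substitution in six words
```

The same machine output can therefore look quite different depending on whether you report n-gram overlap (BLEU) or the share of words that need fixing (WER), which is why scores from different metrics should not be compared directly.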

It’s a brave new world out there! Reach out to test out machine translation in your localization workflow – we will gladly help you choose the best approach for your business.
