Machine translation

December 20, 2022

Technology has improved productivity but hasn’t overcome the complexities of human language.

Introduction

Great technical minds have been working on machine translation for decades. The dream of creating a machine able to understand and re-express human language has yet to be realized, despite incredible progress in computer science.

Languages are extremely complex and involve unspoken knowledge and uncharted thought processes. Artificial intelligence simply cannot match a human’s ability to understand context and nuances in communication.

Increasingly powerful computers have enabled us to amass large bodies of data that can be managed by rule-based and/or statistical machine translation (“MT”) computer programs.

Google Translate is a well-known example of statistical machine translation: it gathers text which appears to be parallel between two languages and then applies statistical analysis to decide which parts of Language A and Language B correspond to one another. Quality is higher in popular languages due to the high volume of available content to index. Advances in quality have reached a plateau, but future improvements are expected primarily as a result of faster processing speeds and the continual growth of indexed content.

Depending on the language pairing, a free Google translation may be able to give you “some idea” about what the source text says, but you would be ill-advised to use it for anything else.

There are three main types of machine translation in use today:

  1. Basic rule-based translation that takes into consideration the spelling and grammar of the source and target languages. Different tools are used for specific language combinations and domains;
  2. Statistical translation based on probabilities and densities derived from large bodies of data;
  3. Hybrid translation that uses advanced computer-aided translation (CAT) tools to train rule-based and/or statistical systems to output better-quality translations automatically.

This document is designed to improve your understanding of machine translation and of how it can be used effectively for business or personal purposes.

Background

Human knowledge is unrivalled

Whether you realize it or not, you use several different types of knowledge to fully understand a single message in your own language. A machine translation (MT) system would need similar abilities to understand a message before it could ever re-express it in a different language that has its own characteristics.

Due to the complexities of languages, MT has not yet matched the quality of work produced by professional human translators.

The linguistic knowledge that humans use

  • Common sense knowledge – often gained by past experiences and intuition;
  • Morphological knowledge – formation of words in source and target languages;
  • Phonological knowledge – sound systems of languages;
  • Pragmatic knowledge – what the words mean in context;
  • Semantic knowledge – what the words and sentences mean independent of context;
  • Syntactic knowledge – rules and constraints that apply to words when forming sentences.

Why the marketplace still wants MT

Despite the challenges of matching human-quality translation, the market for MT is expected to grow by more than 20% a year and reach USD 983.3 million by 2022 (Hexa Research, November 2015). Hexa expects market pressures and investments in technology to advance to the point that MT systems will eventually produce translations with minimal errors and improved grammatical coherence. People have been making similar predictions for decades to no avail, but we are getting closer.

Hexa foresees the business model for growth in MT adoption being based on Software-as-a-Service, with services hosted in the cloud and accessed from desktop or mobile devices over a secure connection. This model follows one of the major trends in business communication: the integration of the translation process into project plans.

Since the volume of business content that needs to be translated is growing rapidly, machine translation has become essential for making content available in regional languages for users worldwide, although it works better in some domains and types of text than in others.

Despite the demand for MT and its cost benefits, human translation is still preferred.

Potential business benefits

Machine translation (MT) can be helpful to your business in areas where speed is important and quality is not essential. Our world is driven by constant messaging and collaboration that is generating massive amounts of information, a lot of which does not require professional, high-quality translation. Here are some day-to-day challenges that machine translation can help overcome.

Lower overhead
Customer service departments often need to provide multilingual support information for policies, processes or product descriptions. Where only minimum quality is required, machine translations can be proofread and revised by a bilingual subject-matter expert.

Worldwide communication
Employees in different countries often need to communicate frequently by email and work collaboratively on internal documents, presentations, training materials and other content. “Internal use” is key here.

Improved productivity
When staff must translate more basic content with the same resources, MT can be added to the workflow. Productivity gains can reach 30% when MT is properly used, but professional post-editing is still required.

E-commerce and social media
Websites and social media require native-language content to drive global sales and engagement. Content shelf life may be short, so on-demand, low-cost versions are needed. It should be made clear to readers that such content is machine-translated.

Information security
Secure enterprise-grade MT solutions protect your information. It is very tempting for staff to use free translation services readily available on the internet. What starts out as an innocent attempt to save time and money could be very costly should sensitive information fall into the wrong hands.

Rule-based machine translation

The simplest explanation of rule-based machine translation (RBMT) is that it uses a large collection of manually developed linguistic rules to map the structure of the source language onto the target language.

RBMT considers the spelling and grammar of the source and target languages, and uses specific tools depending on language combinations and domains. The systems rely on bilingual dictionaries for each language pairing, along with lexicons that users must edit and refine to improve translation.

Software programs apply these complex rules to transfer the grammatical structure of the source language into the target language. A specific tool may work well in one language or domain, but not another.
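
As a rough illustration of the mechanics only (not of any particular commercial product), the toy Python sketch below pairs a hand-written bilingual dictionary with a single reordering rule that moves the adjective after the noun. Every word list and rule in it is invented for the example; real RBMT systems apply thousands of such rules alongside full morphological analysis.

```python
# Toy rule-based translation: a hand-written English->Spanish dictionary
# plus one reordering rule (the adjective follows the noun in Spanish).
# All entries are illustrative; real systems use far richer resources.

BILINGUAL_DICT = {"the": "el", "red": "rojo", "car": "coche"}
ADJECTIVES = {"red"}  # a tiny stand-in for a part-of-speech lexicon

def translate(sentence: str) -> str:
    words = sentence.lower().split()

    # Rule 1: swap adjective-noun order ("red car" -> "car red")
    reordered, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i] in ADJECTIVES:
            reordered.extend([words[i + 1], words[i]])
            i += 2
        else:
            reordered.append(words[i])
            i += 1

    # Rule 2: word-for-word lexical transfer via the bilingual dictionary
    return " ".join(BILINGUAL_DICT.get(w, w) for w in reordered)

print(translate("the red car"))  # -> "el coche rojo"
```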

Quality

Well-developed RBMT systems can deliver reasonably good automated translations with predictable results.

Quality depends on the investment made in the development and ongoing improvement of the system. Highly skilled professionals are required to perform this manual work, which is costly and time-consuming.

As more and more rules are added over time, they can interact in ambiguous ways, which can degrade quality.

Limitations

RBMT requires an incredible amount of knowledge engineering. Specifically, it requires human experts with a deep understanding of linguistics who can express that knowledge in abstract representations the software can compute.

It is difficult to scale RBMT systems and to develop the vast grammatical and lexical resources needed for each specific type of text.

Development during the early years

The history of rule-based machine translation (RBMT) clearly shows the difficulty of getting machines to understand the context of source material and re-express it in other languages.

1945-1954

The first significant milestone in the development of RBMT came in 1954, when IBM and Georgetown University demonstrated that wartime code-breaking concepts and information theory could be computerized to process natural language. The demonstration, although limited to a small vocabulary and a handful of grammar rules, was impressive enough to kick-start funding of RBMT development around the world.

1955-1965
RBMT systems consisted of bilingual dictionaries, in which each source word had one or more equivalents in the target language, and basic rules for producing the correct order of words in the translation. Researchers quickly ran into roadblocks: rules for syntactic ordering were complex, and there were too many exceptions and variables (linguistic nuances). Translated documents were helpful only to those who needed fast translations and could live with crude output quality.

1966-1969
In 1966, the US government released a report by the Automatic Language Processing Advisory Committee (ALPAC) which concluded that RBMT was slow, inaccurate and twice as expensive as human translation. Government and corporate investment moved into the development of machine aids for translators (such as automated bilingual dictionaries), and basic research in computational linguistics.

The 1970s
Strong demand for RBMT came from multinational trade communities. The market wanted low-cost machine-aided translation systems that could process administrative and technical documentation into, and out of, a multitude of languages.

The 1980s
Advances in several countries, along with the advent of mainframe computer systems, gave engineers far greater processing power and enabled "indirect" translation, in which basic translations were enhanced by an intermediary representation such as a knowledge base. Microcomputers and word-processing software quickly drove the market toward cheaper desktop systems that could be interconnected with larger hosts.

The 1990s
The early 1990s saw an increase in activity around practical applications: translator workstations, controlled language and domain-restricted systems, and the integration of translation components into multilingual information systems. In the latter part of the 1990s, software companies brought RBMT to desktop PCs.

Statistical machine translation

How can a machine learn?

Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. Essentially, a powerful computer can "learn" how to apply statistically derived logic so that a string of source-text words in one language is mapped to a well-formed string of words in the target language.

The statistics come from maximum-likelihood models and typically account for word order and re-ordering.

Word order differences account for more variation in SMT performance than any other factor, so statistically predicting the words in the translation and deciding on their order is essential.

How SMT works

In statistical translation, the basic idea is that every string of source text has possible translations in a target language. An SMT system is trained on a large body of related bilingual translations (a parallel corpus) and assigns a "probability weighting" to every pair of strings, based on the likelihood that a human translator, presented with a specific source-text string, would produce a translation containing a specific string in the target language.

The probability weightings provide the framework for a table that associates a real number between zero and one with every possible pairing of a source and target language string. The number of possible permutations is incredibly large, so the table will be enormous and require a powerful computer to quickly and accurately produce translations.
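
A minimal sketch of that idea, with made-up numbers: the Python dictionary below plays the role of the probability table, associating each source string with candidate target strings weighted between zero and one, and the "translation" simply picks the highest-weighted candidate. A real system would hold millions of such entries and combine them with a language model and a decoder, as described below.

```python
# Toy "probability table": each source string maps to candidate target
# strings with an invented probability weighting between 0 and 1.
PHRASE_TABLE = {
    "bank": {"banco": 0.7, "orilla": 0.3},
    "the bank of the river": {"la orilla del río": 0.8,
                              "el banco del río": 0.2},
}

def best_translation(source: str) -> str:
    """Pick the candidate target string with the highest weighting."""
    candidates = PHRASE_TABLE.get(source, {source: 1.0})
    return max(candidates, key=candidates.get)

print(best_translation("bank"))                   # -> "banco"
print(best_translation("the bank of the river"))  # -> "la orilla del río"
```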

Pros and cons

On the upside, an SMT system can quickly learn to translate automatically from actual data, and the more data, the better. The same learning approach can be applied to other corpora, so the framework is built once and reused across multiple language pairs while keeping them independent of one another.

On the downside, it is difficult to model "disorganized" word associations and complex translation phenomena. SMT is built on parallel corpora, not language knowledge.

Complexity

SMT systems must integrate three computational challenges (a toy sketch follows this list):

  • Language model probability
    • Conditioned on whether one, two, three or more words tend to follow or precede a given word. The larger the corpus, the more accurate the estimates will be.
  • Translation model probability
    • Based on fertility (the number of target words generated from a source word), distortion (which predicts the target word's position) and translation probabilities (the likelihood that a given source word or phrase produces a given target word or phrase).
  • Search method that maximizes the quality of the translated product
    • A decoder for phrase-based, hierarchical or syntax-based translation.
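
The Python sketch below is a highly simplified, invented illustration of how those three pieces fit together: a bigram language model scores target word order, a word-level translation model supplies translation probabilities, and a brute-force "decoder" searches every combination for the highest product. Real decoders use beam search over phrase tables and also model fertility and distortion, both omitted here.

```python
from itertools import product

# Word-level translation model: P(target word | source word), invented numbers.
TRANSLATION_MODEL = {
    "das": {"the": 0.9, "that": 0.1},
    "haus": {"house": 0.8, "home": 0.2},
}

# Bigram language model: P(next word | previous word); "<s>" marks sentence start.
LANGUAGE_MODEL = {
    ("<s>", "the"): 0.5, ("<s>", "that"): 0.2,
    ("the", "house"): 0.4, ("the", "home"): 0.1,
    ("that", "house"): 0.05, ("that", "home"): 0.05,
}

def lm_score(words):
    """Probability of the target word sequence under the bigram model."""
    score, prev = 1.0, "<s>"
    for w in words:
        score *= LANGUAGE_MODEL.get((prev, w), 1e-6)
        prev = w
    return score

def decode(source_words):
    """Brute-force 'decoder': try every combination of word translations and
    keep the one that maximizes translation score * language model score."""
    options = [TRANSLATION_MODEL[w].items() for w in source_words]
    best, best_score = None, 0.0
    for combo in product(*options):
        words = [t for t, _ in combo]
        tm = 1.0
        for _, p in combo:
            tm *= p
        score = tm * lm_score(words)
        if score > best_score:
            best, best_score = words, score
    return " ".join(best)

print(decode(["das", "haus"]))  # -> "the house"
```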

Popularity

Since 2000, SMT has grown in popularity and now dominates machine translation research. The main reasons for success are:

  • The availability of large monolingual and bilingual corpora;
  • Availability of open-source software for performing basic SMT processes;
  • Availability of widely accepted metrics for evaluating systems.

SMT quality and usage

The quality of an SMT product is considered to be lower than that of human translation. Quality improves in proportion to the size of the parallel corpus used.

Google Translate (GT) is a widely known SMT system that can help a reader understand the general content of a foreign-language text, but it will not produce accurate translations and, when it has no match for a word, tends to copy it into the output verbatim.

Individuals access GT from their mobile phones or personal computers when translation quality and data security are not important. GT performs at its best when English is the target language and the source is an official European Union (EU) language.

Businesses use SMT when the scale and volume of work is too large to be handled by human translators, and their type of content or usage allows for it.

Hybrid machine translation

Hybrid machine translation (HMT) integrates the best characteristics of RBMT and SMT to deliver the ideal combination of quality, speed, productivity and cost-effectiveness.

RBMT provides predictable and consistent translations, cross-domain utility, and high efficiency. SMT components learn from actual monolingual and multilingual corpora and improve translation quality within specified domains.

The HMT system is augmented by computer-aided translation (CAT) tools specifically designed to help human translators save time when revising the translations proposed by the HMT engines. Post-editing results are re-integrated into the software through the dictionary or through additional training of the system, so that the same mistake is not repeated.

RBMT and SMT have complementary properties


RBMT and SMT both have pros and cons, which is why "hybrid" models have emerged. Sometimes rule-based MT engines are used to enrich the lexical resources available to an SMT decoder. In other instances, parts of the SMT infrastructure are used, together with linguistic processing and manual validation, to extend the lexicon of an RBMT engine. Common types of HMT:

Parallel multi-engine

This model involves running RBMT and SMT sub-systems in parallel to create a larger architecture. The final output is generated by combining the output of the two sub-systems.

Serial multi-pass

Translations are first performed by a rule-based engine, and statistics are then used to smooth and refine the output. This approach attempts to improve the lower-quality output of the RBMT engine.
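
A minimal sketch of the serial idea, using hypothetical stand-in functions rather than real engines: a rule-based first pass produces a literal draft, and a statistical second pass re-scores known paraphrases of that draft against corpus frequencies to smooth the wording.

```python
# Serial multi-pass sketch with hypothetical stand-ins, not real engines:
# pass 1 produces a rule-based draft, pass 2 statistically smooths it.

def rule_based_pass(source: str) -> str:
    """Stand-in for an RBMT engine producing a literal, rule-driven draft."""
    drafts = {"je voudrais une chambre": "i would like to have a room"}
    return drafts.get(source, source)

# Invented statistics: how often each phrasing appears in a target-language
# corpus. A real system would use a full statistical language model.
CORPUS_COUNTS = {
    "i would like to have a room": 3,
    "i would like a room": 57,
}

# Toy paraphrase candidates the statistical pass may choose between.
PARAPHRASES = {
    "i would like to have a room": ["i would like to have a room",
                                    "i would like a room"],
}

def statistical_pass(draft: str) -> str:
    """Stand-in for the SMT pass: keep the paraphrase most frequent in the corpus."""
    candidates = PARAPHRASES.get(draft, [draft])
    return max(candidates, key=lambda s: CORPUS_COUNTS.get(s, 0))

print(statistical_pass(rule_based_pass("je voudrais une chambre")))
# -> "i would like a room"
```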

Statistics-guided

This approach uses statistical data to generate lexical and syntactic rules that pre-process the data. A second set of rules is used to post-process the statistical output to perform functions such as normalization. The accuracy of the translation depends on how similar the input text is to the training corpus.

HMT quality and usage

The gap in quality between human translation and HMT is still quite wide. HMT does deliver some quality improvements over legacy stand-alone RBMT and SMT systems, but HMT’s complexity comes with higher costs.

HMT systems are used by both translation service providers and large enterprise customers because they enable organizations to manage data, reuse translated content and create domain-specific terminology. HMT offers customization capabilities to "purpose-build" translation models that improve quality in specific domains and meet the communication demands of today's digital world. For example, HMT can quickly learn to understand foreign-language information in e-mails, web pages, presentations and corporate correspondence.

Advancements in new SMT technologies such as Language Transformation (data pre-processing), Language Optimization Technologies and Terminology Management Solutions are achieving the same quality improvements offered by HMT while reducing the need for legacy technology.

Summary of machine translation

  1. Languages are so complex that MT has not been able to match the quality of professional human translation.
  2. MT is useful in a variety of situations where quality is not a concern and speed is important.
  3. MT continues to grow in popularity due to the globalization of our world, content creation on a massive scale and the limited speed of human translation.
  4. Rule-based MT can deliver reasonably good automated translations with predictable results but is difficult to scale and requires a large amount of knowledge engineering. Software generally works well in one language combination or domain, rather than several.
  5. Statistical MT can "learn" to translate from bilingual data available in immense volumes. Because no explicit language knowledge is required, a large number of source and target language combinations can be supported.
  6. Hybrid MT integrates the best characteristics of rule-based and statistical MT, but is more costly due to high complexity.
  7. Regardless of the MT used, post-editing is required to produce quality work. The amount of post-editing required still most often necessitates the use of more traditional translation processes.