Machine Translation @ the NCLT & the CNGL

Dublin City University, Ireland

Title: Confident MT: Estimating Translation Quality for Improved Statistical Machine Translation
Duration: November 2011 -- November 2014
Funded by: IRCSET (Irish Research Council for Science, Engineering and Technology)
People: Rasoul Samad Zadeh Kaljahi, Raphael Rubino, Jennifer Foster, Johann Roturier, Fred Hollowood

There is clear commercial demand for high-quality Machine Translation (MT). For localization purposes, a software company such as Symantec needs to deliver helpful content to its customers in their native languages. However, evaluating MT output with automatic metrics is only possible when a reference translation is available. In the more realistic setting where no such reference exists, reliable techniques for estimating the quality of translation system output are needed.

As more and more customers move away from traditional call centres and corporate websites in favour of self-service via dedicated discussion forums, there is a growing need for machine translation of User-Generated Content (UGC). Because UGC is an unedited mix of writing styles containing spelling mistakes, abbreviations and non-standard punctuation, it poses a particular challenge for Natural Language Processing (NLP) tools that have been trained on well-formed text.

The aim of the Confident MT project is to develop Confidence Estimation (CE) methods, also known as Quality Estimation (QE) methods, to measure the reliability of MT output in the context of UGC about Symantec products. The CE methods will be applied across a range of MT systems (such as Rule-Based, Example-Based, Phrase-Based SMT and Syntax-Enhanced SMT), and the results will be used to inform the optimal combination of MT systems.

In the context of this project, we have created the SymForum data set for quality estimation of machine translation. The data set is available for download. It is described in detail in the following publication, which should be cited if you use the data set in your research:

Rasoul Kaljahi, Jennifer Foster, Johann Roturier, 2014, Syntax and Semantics in Quality Estimation of Machine Translation, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), Doha, Qatar.


Last update: October 23 2014