Machine Translation @ the NCLT & the CNGL

Dublin City University, Ireland

Title: Confident MT: Estimating Translation Quality for Improved Statistical Machine Translation
Duration: November 2011 -- November 2014
Funded by: IRCSET (Irish Research Council for Science, Engineering and Technology)
People: Rasoul Samad Zadeh Kaljahi, Raphael Rubino, Jennifer Foster, Johann Roturier, Fred Hollowood

There is clear commercial demand for high-quality Machine Translation (MT). For localization purposes, a software company such as Symantec needs to deliver helpful content to its customers in their native languages. However, MT evaluation via automatic metrics is only possible when a reference translation is available. In the more realistic setting where no such reference is available, reliable techniques for estimating the quality of translation system output are needed.

As more and more customers move away from traditional call centres and corporate websites in favour of self-service via dedicated discussion forums, there is a growing need for machine translation of User-Generated Content (UGC). Because UGC is an unedited mix of writing styles containing spelling mistakes, abbreviations and non-standard punctuation, it poses a particular challenge for Natural Language Processing (NLP) tools that have been trained on well-formed text.

The aim of the Confident MT project is to develop Confidence Estimation (CE, or QE for Quality Estimation) methods to measure the reliability of MT output in the context of UGC about Symantec products. The CE methods will be applied across a range of MT systems (such as Rule-Based, Example-Based, Phrase-Based SMT and Syntax-Enhanced SMT) and the results will be used to inform the optimal combination of MT systems.

ConfidentMT Datasets

We have created two data sets as part of the ConfidentMT project:
  • SymForum: an English/French data set for quality estimation of machine-translated Norton forum text
  • Foreebank: an English/French data set for evaluating syntactic parser accuracy on Norton forum text and for measuring the effect of grammatical noise on parsing
Note: to access either data set on the Symantec website, click the + icon next to the title of the paper describing that data set.


The SymForum data set is available for downloading here. It is described in detail in the following publication, which should be cited if you use the data set in your research:

Rasoul Kaljahi, Jennifer Foster and Johann Roturier, 2014. Syntax and Semantics in Quality Estimation of Machine Translation. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), Doha, Qatar. Paper.


The Foreebank data set is split into two components: the DCU side, which contains phrase structure trees without their leaves, is available for downloading here; the Symantec side, which contains the sentences themselves, is available for downloading here. A script for combining the yields with their trees is contained in the DCU side of the data set.
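
To illustrate what combining yields with trees involves, here is a minimal Python sketch. It is not the script shipped with the DCU side, and the file format is an assumption: it supposes bracketed trees in which each preterminal has had its leaf removed, leaving "(TAG)" or "(TAG )", and a matching sentence already split into tokens. The function name attach_yield is hypothetical.

```python
import re

# Assumed format (not taken from the Foreebank release): a preterminal whose
# leaf was stripped appears as "(TAG)" or "(TAG )" in the bracketed tree.
EMPTY_PRETERMINAL = re.compile(r"\(([^\s()]+)\s*\)")


def attach_yield(tree, tokens):
    """Fill the empty preterminals of `tree`, left to right, with `tokens`."""
    slots = EMPTY_PRETERMINAL.findall(tree)
    if len(slots) != len(tokens):
        raise ValueError(
            f"tree has {len(slots)} empty leaves but sentence has {len(tokens)} tokens"
        )
    remaining = iter(tokens)
    # The replacement function consumes one token per empty preterminal.
    return EMPTY_PRETERMINAL.sub(
        lambda match: f"({match.group(1)} {next(remaining)})", tree
    )


if __name__ == "__main__":
    skeleton = "(S (NP (PRP)) (VP (VBZ) (NP (NN))))"
    print(attach_yield(skeleton, ["it", "needs", "help"]))
```

The length check matters because a tree and a sentence that fall out of alignment would otherwise be combined silently into a wrong analysis. For real use, the script included in the DCU side of the data set should be preferred, since it knows the actual file layout.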

The Foreebank data set is described in detail in the following publication, which should be cited if you use it in your research:

Rasoul Kaljahi, Jennifer Foster, Johann Roturier, Corentin Ribeyre, Teresa Lynn and Joseph Le Roux, 2015. Foreebank: Syntactic Analysis of Customer Support Forums. In EMNLP.  Paper. Poster.


Last update: October 23 2014