The commercial demand for high-quality Machine Translation (MT) is obvious. For localization purposes, a software company such as Symantec needs to deliver helpful content to its customers in their native languages. However, MT evaluation via automatic metrics is possible only when a reference translation is available. In the more realistic setting where no such reference is available, reliable techniques for estimating the quality of translation system output are needed.
As more and more customers move away from traditional call centres and corporate websites in favour of self-service via dedicated discussion forums, there is a growing need for machine translation of User-Generated Content (UGC). Because UGC is an unedited mix of writing styles containing spelling mistakes, abbreviations and non-standard punctuation, it poses a particular challenge for Natural Language Processing (NLP) tools that have been trained on well-formed text.
The aim of the Confident MT project is to develop Confidence Estimation (CE, also known as Quality Estimation, QE) methods to measure the reliability of MT output in the context of UGC about Symantec products. The CE methods will be applied across a range of MT systems (such as Rule-Based, Example-Based, Phrase-Based Statistical MT (SMT) and Syntax-Enhanced SMT), and the results will be used to decide how best to combine these MT systems.
We have created the SymForum data set for quality estimation of machine translation in the context of this project. The data set is available for download here. It is described in detail in the following publication, which should be cited if you use the data set in your research:
Rasoul Kaljahi, Jennifer Foster and Johann Roturier. 2014. Syntax and Semantics in Quality Estimation of Machine Translation. Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), Doha, Qatar.