National Centre for Language Technology

Dublin City University, Ireland

NCLT Seminar Series 2006/2007

The NCLT seminar series takes place on Wednesdays from 4-5 pm in Room L2.21 (School of Computing).

The schedule of presenters for the 2006/2007 series (Semester 2) is as follows:

March 21st: Joachim Wagner, "Automatic Grammaticality Judgments"
March 28th: Ines Rehbein, "Annotation Schemes and Parser Evaluation for German"
April 4th: Andy Way, "Some Current Trends in Machine Translation"
April 11th: Conor Cafferkey, "Exploiting Multi-Word Units in Probabilistic Treebank-Based Parsing and Generation"
April 18th: Masanori Oya, "Zero pronoun identification in Japanese corpus"
April 25th: Sharon O'Brien, "The Link Between Controlled Language & Post-Editing: An Empirical Investigation of Technical, Temporal and Cognitive Effort"
May 9th: John Tinsley and Ventsislav Zhechev, "Robust Language Pair-Independent Sub-Tree Alignment"
May 16th: Fred Jelinek, "Language Modeling by Random Forests"
May 23rd: Sisay Adafre, "Estimating Importance Features for Fact Mining (With a Case Study in Biography Mining)"
May 30th: Mary Hearne, "Shortest Derivation Estimation for DOP"
June 6th: Yuqing Guo, "Non-Local Dependency Recovery for Chinese"
June 14th: Ann Devitt, Yanjun Ma & Nicolas Stroppa, Jennifer Foster, Joachim Wagner, Ines Rehbein, Conor Cafferkey & Deirdre Hogan, Yuqing Guo, and Karolina Owczarzak, "NCLT Special: Warm-Up for Prague" (presentations and poster session)
July 4th: Johann Roturier, "How useful is machine-translated technical documentation? Let's ask users!"
July 11th: Yvette Graham, Joachim Wagner, Jennifer Foster, dry run for the LFG conference/ParGram (titles listed below)
July 18th: Gearóid Ó Donnchadha, "A feature valuation approach to the prohibition on two definite determiners in genitive noun phrases in Irish"

Automatic Grammaticality Judgments

Joachim Wagner

In this talk I will present an evaluation of four approaches to automatic grammaticality judgments. Such judgments can be used to grade essays automatically or to trigger a computationally expensive error analysis. The first approach follows the traditional view that the grammar determines grammaticality: the test corpus is parsed with the XLE parser, and "starred" sentences are classified as ungrammatical. The second approach is similar; here we prune a PTB-trained PCFG so that it rejects ungrammatical input. Thirdly, n-gram methods are considered: if a sentence contains an n-gram below a certain frequency threshold, it is rejected. Finally, my own approach (developed in collaboration with Jennifer Foster) is included in the evaluation. It compares the probability assigned to a sentence by a statistical parser with a probability estimated from a reference corpus of grammatical sentences in order to judge grammaticality.

View slides
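The third (n-gram) approach lends itself to a minimal sketch. The following is an illustration only, under assumed defaults (bigrams, a frequency threshold of 1); the function names and toy corpus are hypothetical, not taken from the talk:

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    # Count all n-grams in a reference corpus of grammatical text.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def flag_ungrammatical(sentence, counts, n=2, threshold=1):
    # Reject a sentence if any of its n-grams falls below the frequency threshold.
    return any(counts[tuple(sentence[i:i + n])] < threshold
               for i in range(len(sentence) - n + 1))

# Hypothetical reference corpus and test sentences:
reference = "the cat sat on the mat".split()
counts = ngram_counts(reference)
print(flag_ungrammatical("the cat sat".split(), counts))  # False: all bigrams attested
print(flag_ungrammatical("cat the sat".split(), counts))  # True: "cat the" is unseen
```

In practice the counts would come from a large corpus such as the BNC, and the threshold would be tuned on held-out data.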


Annotation Schemes and Parser Evaluation for German

Ines Rehbein

A long-standing and unresolved issue in the parsing literature is whether parsing less-configurational languages is harder than, for example, parsing English. German is a case in point. Results from Dubey and Keller (2003) suggest that state-of-the-art parsing scores for German are generally lower than those obtained for English, while recent results from Kuebler et al. (2006) raise the possibility that this might be an artifact of the encoding schemes and data structures of the treebanks that serve as training resources for probabilistic parsers. In this talk I present new experiments to test this claim. We use the PARSEVAL metric, the Leaf-Ancestor metric and a dependency-based evaluation, and present complementary approaches measuring the effect of controlled error insertion on treebank trees and parser output. We also provide extensive cross-treebank conversion. The results of the experiments show that, contrary to Kuebler et al. (2006), the question of whether or not German is harder to parse than English remains undecided.

View slides


Some Current Trends in Machine Translation

Andy Way

For reasons that I can't quite recall, I've been the track coordinator for EACL-06 and ACL-07. Given that, I have some interesting (I think!) statistics on the trends from one conference to the other regarding topics of papers, and country of origin of those same papers. I'll then finish with some predictions arising from those figures.

View slides


Exploiting Multi-Word Units in Probabilistic Treebank-Based Parsing and Generation

Conor Cafferkey

I present the results of several experiments using multi-word units (MWUs) as a means to impose constraints on both probabilistic parsing and surface generation with automatically-acquired (treebank-based) grammars. In the case of surface realisation from LFG f-structures with automatically-acquired treebank-based LFG approximations modest but significant gains in accuracy can be made. Experiments integrating the same MWUs in treebank-based probabilistic parsing yielded smaller, but still statistically significant gains. I analyse the results and offer a number of explanations as to why the gains achieved are smaller than might be naively expected.


Zero pronoun identification in Japanese corpus

Masanori Oya

This talk is about zero pronoun identification in a Japanese corpus. Since zero pronouns appear very frequently in Japanese texts, identifying them is an important issue in Japanese NLP; it is also required for long-distance dependency (LDD) resolution at the level of f-structure representation and for automatic case-frame extraction from large corpora. I introduce a simple method of zero pronoun identification that uses verbal morphological features signalling the transitivity of a verb, together with the probability of co-occurrence between a verb and nouns marked with certain case particles. I will present and analyse the results of applying the method to 500 sentences randomly chosen from the Kyoto Text Corpus, alongside the output of a Japanese dependency parser that does not take zero pronouns into account, and discuss the advantages and drawbacks of the method as well as possible ways to improve its performance.

View slides


The Link Between Controlled Language & Post-Editing: An Empirical Investigation of Technical, Temporal and Cognitive Effort

Sharon O'Brien

Studies on Controlled Language (CL) suggest that by removing certain linguistic features that are known to be problematic for Machine Translation (MT) from a source text, the MT output can be improved. A further assumption is that an improvement in MT output will result in lower post-editing effort. With the ever-increasing emphasis in the translation industry on higher volumes and faster throughput, it is not surprising that this assumption is of interest to those who manage multi-lingual high-volume translation projects. Increasingly, translation service providers are asked to provide post-editing services in addition to their traditional translation/localisation services. The expectation is that post-editing will be faster than human translation and that, therefore, post-editing should not cost as much as translation. However, the assumption that CL reduces post-editing effort has not been tested empirically. It is worthy of closer inspection, not least because CLs can cover a broad range of linguistic features (O'Brien 2003). This paper presents results from a study designed to test the assumed link between CL and post-editing effort by measuring the technical, temporal and cognitive post-editing effort (Krings 2001) for English sentences in a user manual that have been translated into German using an MT system and that have been subsequently post-edited by nine professional translators. In this study, the linguistic features known to be problematic for MT are called negative translatability indicators, or NTIs for short. The post-editing effort for sentences containing NTIs is compared with the post-editing effort for sentences where all known NTIs have been removed. In addition, relative post-editing effort (Krings 2001), a comparison of post-editing effort and translation effort, is measured. A comparison will be made between NTIs that generate a high level of post-editing effort and those that generate a lower level of post-editing effort.
The methodologies employed include the use of the keyboard monitoring tool, Translog (Jakobsen 1999, Hansen 2002), and Choice Network Analysis (Campbell 1999).

Campbell, Stuart (1999), A Cognitive Approach to Source Text Difficulty in Translation, in Target, 11:1, pp. 33-63.

Hansen, Gyde (ed) (2002), Empirical Translation Studies: Process and Product, Copenhagen Studies in Language 27, Copenhagen: Samfundslitteratur.

Jakobsen, Arnt Lykke (1999), Logging Target Text Production with Translog, in Hansen, Gyde (ed), Probing the Process in Translation: Methods and Results, Copenhagen Studies in Language 24, Copenhagen: Samfundslitteratur, pp. 9-20.

Krings, Hans P. (2001), Repairing Texts: Empirical Investigations of Machine Translation Post-Editing Processes, Kent/Ohio: Kent State University Press, edited/translated by G. S. Koby et al.

O'Brien, Sharon (2003), Controlling Controlled English: An Analysis of Several Controlled Language Rule Sets, in Proceedings of EAMT/CLAW 2003, Dublin: Dublin City University, pp. 105-114.

View slides


Robust Language Pair-Independent Sub-Tree Alignment

John Tinsley and Ventsislav Zhechev

Data-driven approaches to machine translation (MT) achieve state-of-the-art results. Many syntax-aware approaches, such as Example-Based Machine Translation and Data-Oriented Translation, make use of tree pairs aligned at sub-sentential level. Obtaining sub-sentential alignments manually is time-consuming and error-prone, and requires expert knowledge of both source and target languages. We propose a novel, language pair-independent algorithm which automatically induces alignments between phrase-structure trees. We evaluate the alignments themselves against a manually aligned gold standard, and perform an extrinsic evaluation by using the aligned data to train and test a DOT system. Our results show that translation accuracy is comparable to that of the same translation system trained on manually aligned data, and coverage improves.

View slides


Language Modeling by Random Forests

Fred Jelinek

Automatic Speech Recognition is based on several components: signal processor, acoustic model, language model, and search. In this talk, we explore the use of Random Forests (RFs) in language modeling, the problem of predicting the next word based on words already seen. The goal is to develop a new language model smoothing technique based on randomly grown Decision Trees (DTs). This new technique is complementary to many of the existing techniques dealing with data sparseness.
Random forests were studied by Breiman in the context of classification into a relatively small number of classes. We study their application to n-gram language modeling which could be thought of as classification into a very large number of classes. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are long (>4). We show that our RF language models are superior to regular n-gram language models in reducing both the entropy and the word error rate in a large vocabulary speech recognizer.
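As a rough illustration of the forest-averaging idea (this is not Jelinek's actual algorithm, which grows full decision trees over questions about the history), the sketch below builds a toy "forest" in which each tree clusters n-gram histories by a random subset of history positions and the forest averages add-one-smoothed relative-frequency estimates. All names, the smoothing scheme, and the vocabulary size are assumptions made for the example:

```python
import random
from collections import Counter, defaultdict

def train_rf_lm(corpus, order=3, n_trees=4, seed=0):
    # Each "tree" keeps a random subset of history positions and backs off
    # the rest -- a one-level stand-in for a randomly grown decision tree.
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        keep = frozenset(rng.sample(range(order - 1), rng.randint(1, order - 1)))
        counts = defaultdict(Counter)
        for i in range(order - 1, len(corpus)):
            hist = tuple(corpus[i - (order - 1) + j] if j in keep else None
                         for j in range(order - 1))
            counts[hist][corpus[i]] += 1
        trees.append((keep, counts))
    return trees

def rf_prob(trees, history, word, order=3, vocab=50):
    # Average the add-one-smoothed estimates of all trees in the forest.
    probs = []
    for keep, counts in trees:
        hist = tuple(history[j] if j in keep else None for j in range(order - 1))
        c = counts[hist]
        probs.append((c[word] + 1) / (sum(c.values()) + vocab))
    return sum(probs) / len(probs)

corpus = "the cat sat on the mat the cat ate".split()
trees = train_rf_lm(corpus)
# A seen continuation scores higher than an unseen one:
print(rf_prob(trees, ("the", "cat"), "sat") > rf_prob(trees, ("the", "cat"), "zzz"))  # True
```

Because each tree ignores different history positions, the forest can back off in several directions at once, which is the intuition behind its better generalisation to unseen histories.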


Estimating Importance Features for Fact Mining
(With a Case Study in Biography Mining)

Sisay Adafre

We present a transparent model for ranking sentences that incorporates topic relevance as well as an aboutness and importance feature. We describe and compare five methods for estimating the importance feature. The two key features that we use are graph-based ranking and ranking based on reference corpora of sentences known to be important. Independently those features do not improve over the baseline, but combined they do. While our experimental evaluation focuses on informational queries about people, our importance estimation methods are completely general and can be applied to any topic.

View slides


How useful is machine-translated technical documentation? Let's ask users!

Johann Roturier

Previous studies suggest that the application of Controlled Language (CL) rules can significantly improve the readability, consistency, and machine-translatability of technical documentation. One of the justifications for the application of CL rules is that they can reduce the post-editing effort required to bring Machine Translation (MT) output to acceptable quality. In certain situations, however, post-editing services may not always be a viable solution. Web-based information is often expected to be made available in real-time to ensure that its access is not restricted to certain users based on their locale. Uncertainties remain with regard to the actual usefulness of MT output for such users, as no empirical study has examined the impact of CL rules on the usefulness and comprehensibility of MT technical documents from a Web user's perspective. This presentation focuses on the results of an online experiment conducted at Symantec, a leader in Internet security technology. Using a customer satisfaction questionnaire, a set of machine-translated technical support documents was published and randomly evaluated by genuine French and German users. The findings indicate that the introduction of CL rules can have a significant impact on the comprehensibility of German MT documents.

View slides


Using F-structures in Automatic Machine Translation Evaluation

Karolina Owczarzak, Yvette Graham, Josef van Genabith and Andy Way

C-Structures and F-Structures for the British National Corpus

Joachim Wagner, Djamé Seddah, Jennifer Foster and Josef van Genabith

A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors

Joachim Wagner, Jennifer Foster and Josef van Genabith


A feature valuation approach to the prohibition on two definite determiners in genitive noun phrases in Irish

Gearóid Ó Donnchadha

The objective of this talk is to explain the prohibition on two determiners in genitive noun phrases in Irish using the frameworks of the Minimalist Program and Distributed Morphology. I will first give a brief overview of Generative Syntax, the Minimalist Program and Distributed Morphology. This will be followed with a recap of previous work on Irish noun phrases involving the DP Hypothesis. I will then introduce the notion of feature valuation in Distributed Morphology which includes a particular view of nominalisation. These concepts provide the framework for an elegant explanation of Determiner-Noun agreement, Genitive case assignment and Definiteness agreement. The prohibition on two determiners in genitive noun phrases in Irish follows naturally from this explanation.

View slides

Dublin City University   Last update: 1st October 2010