National Centre for Language Technology

Dublin City University, Ireland

NCLT Seminar Series 2004/2005

Johann Roturier will present the next seminar entitled "Controlled Language And Its Impact On Translation Automation" on Wednesday March 30th at 4pm in Room L2.21.

The schedule of presenters for the 2004/2005 series is as follows:

November 3rd 2004 Gareth Jones DCU at CLEF 2004
November 10th 2004 Andrew Erritty The unscented Kalman filter and its application in vocal tract parameter tracking
November 17th 2004 Ríona Finn and Sara Morrissey RoBerT: An Irish Sign Language Translator
November 24th 2004 Mary Hearne Data-Oriented Models of Parsing and Translation
December 1st 2004 Michael Burke Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar
December 1st 2004 Bin Wang NLP progress at ICT
December 8th 2004 Thomas Koller Developing a plurilingual ICALL System for Romance languages aimed at advanced learners
December 15th 2004 Nano Gough Linguistics-lite Example Based Machine Translation
January 12th 2005 David Dorran Audio Time-Scale Modification: keeping control of the tempo.
January 26th 2005 Jennifer Foster Good Reasons for Noting Bad Grammar: Empirical Investigations into the Parsing of Ungrammatical Written English
February 9th 2005 Anna Khasin and Bart Mellebeek TransBooster: boosting the performance of Machine Translation by reducing complex sentences to simple structures.
February 16th 2005 John Hopwood Intuitive creative language learning in three-dimensional environments.
March 23rd 2005 Declan Groves Towards Example-Based SMT
March 30th 2005 Johann Roturier Controlled Language And Its Impact On Translation Automation

DCU at CLEF 2004

The Cross-Language Evaluation Forum (CLEF) organises an annual workshop comparing information retrieval systems on a range of European language retrieval tasks. DCU participated in a number of the tasks at CLEF 2004 including French and Russian retrieval, bilingual and multilingual retrieval, and cross-language image retrieval.
In this talk I will begin with a brief introduction to CLEF, and then give a summary of results from our participation in CLEF 2004. I will compare our results with those of other participants, and attempt to draw some preliminary conclusions.

The unscented Kalman filter and its application in vocal tract parameter tracking

Kalman filtering (KF) is a probabilistic technique for producing optimal estimates of a system's hidden state given noisy measurements of the system. This technique can be applied to the problem of tracking vocal tract (VT) parameters from an acoustic speech signal. However, the KF requires that the state be linearly related to the measurements. This limits the KF to tracking linear prediction coefficients which are prone to instabilities.
The unscented Kalman filter (UKF) is a recently proposed nonlinear filtering technique which has been shown to be more accurate, more stable and easier to implement than the extended Kalman filter (a popular nonlinear extension of the KF). In this talk, I will provide an overview of the UKF technique and describe how we apply it to the problem of tracking VT parameters which are nonlinearly related to the speech signal. I will also present the results of applying this approach to speech. I shall conclude by outlining our plans for future work.
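The core of the UKF is the unscented transform, which pushes a small set of "sigma points" through the nonlinear function instead of linearising it. The following is a minimal scalar sketch of that step only, not the tracker described in the talk; the function name, the choice of `kappa` and the test function are illustrative assumptions.

```python
import math


def unscented_transform_1d(f, mean, var, kappa=2.0):
    """Propagate a scalar Gaussian (mean, var) through f using sigma points.

    Illustrative sketch: n is fixed to 1 and kappa is an assumed tuning value.
    """
    n = 1  # state dimension
    spread = math.sqrt((n + kappa) * var)
    # Three sigma points: the mean and one point on either side of it.
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    # Push each sigma point through the (possibly nonlinear) function.
    ys = [f(x) for x in points]
    # Recover the mean and variance of the output from the weighted points.
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var


if __name__ == "__main__":
    # For a standard normal input and f(x) = x**2 the estimate is close to
    # mean 1.0 and variance 2.0, the exact moments of x**2.
    print(unscented_transform_1d(lambda x: x * x, 0.0, 1.0))
```

For a linear function the transform reproduces the exact mean and variance, which is a convenient sanity check; a full UKF would wrap this step in the usual predict and update cycle with multivariate sigma points.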

RoBerT: An Irish Sign Language Translator

"RoBerT" is an online machine translation system. It is bilingual and unidirectional, translating English into Irish Sign Language (ISL) for the domain of weather reports. The system is based on the transfer method, with special emphasis on robustness. This approach is rule-based and indirect, comprising three stages: analysis, transfer and generation. To ensure grammaticality of the user's input and output, the data is parsed according to the English and ISL grammars respectively. Between these parsing stages, the sentence structures are altered through the application of language-dependent transfer rules. The final translation, a playlist of the appropriate ISL videos, is generated from the output. In this presentation, we will present the principal modules of the system, discuss the animation process and demonstrate the translator in action.

Data-Oriented Models of Parsing and Translation

The merits of combining the positive elements of the rule-based and data-driven approaches to MT are clear: a combined model has the potential to be highly accurate, robust, cost-effective to build and adaptable. While the merits are clear, however, how best to combine these techniques into a model which retains the positive characteristics of each approach, while inheriting as few of the disadvantages as possible, remains an unsolved problem. One possible solution to this challenge is the Data-Oriented Translation (DOT) model originally proposed by Poutsma (1998, 2000, 2003), which is based on Data-Oriented Parsing (DOP) (e.g. Bod, 1992; Bod et al., 2003) and combines examples, linguistic information and a statistical translation model.
In my (recently submitted) thesis, the main issues I address are:
- how the DOT model of translation relates to the other main MT methodologies currently in use;
- how best to implement the DOT model (given that it inherits the computational complexity associated with DOP) so that it can be subjected to empirical assessment;
- whether the positive characteristics of the model identified on a theoretical level are also in evidence when empirical evaluation is carried out.
In this talk, I will describe the evolution of my research and present my main findings.

Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar

Scaling wide-coverage, constraint-based grammars such as Lexical-Functional Grammars (LFG) (Kaplan and Bresnan, 1982; Bresnan, 2001) or Head-Driven Phrase Structure Grammars (HPSG) (Pollard and Sag, 1994) from fragments to naturally occurring unrestricted text is knowledge-intensive, time-consuming and (often prohibitively) expensive. A number of researchers have recently presented methods to automatically acquire wide-coverage, probabilistic constraint-based grammatical resources from treebanks (Cahill et al., 2002; Cahill et al., 2003; Cahill et al., 2004; Miyao et al., 2003; Miyao et al., 2004; Hockenmaier and Steedman, 2002; Hockenmaier, 2003), addressing the knowledge acquisition bottleneck in constraint-based grammar development. Research to date has concentrated on English and German. In this paper we report on an experiment to induce wide-coverage, probabilistic LFG grammatical and lexical resources for Chinese from the Penn Chinese Treebank (CTB) (Xue et al., 2002) based on an automatic f-structure annotation algorithm. Currently 96.751% of the CTB trees receive a single, covering and connected f-structure, 0.112% do not receive an f-structure due to feature clashes, while 3.137% are associated with multiple f-structure fragments. From the f-structure-annotated CTB we extract a total of 12975 lexical entries with 20 distinct subcategorisation frame types. Of these, 3436 are verbal entries with a total of 11 different frame types. We extract a number of PCFG-based LFG approximations.
Currently our best automatically induced grammars achieve an f-score of 81.57% against the trees in unseen articles 301-325; an f-score of 86.06% (all grammatical functions) and 73.98% (preds-only) against the dependencies derived from the f-structures automatically generated for the original trees in 301-325; and 82.79% (all grammatical functions) and 67.74% (preds-only) against the dependencies derived from the manually annotated gold-standard f-structures for 50 trees randomly selected from articles 301-325.
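The f-scores quoted above are harmonic means of precision and recall. As a rough sketch of how such a score is computed over dependencies, the following treats each analysis as a set of (relation, head, dependent) triples; the triples are invented for illustration and do not come from the CTB experiments, and real evaluation software may count duplicates differently.

```python
def f_score(gold, predicted):
    """Balanced f-score between two collections of items, compared as sets."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)  # share of predicted items that are correct
    recall = matched / len(gold)          # share of gold items that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical dependency triples, for illustration only.
    gold = [("subj", "saw", "John"), ("obj", "saw", "Mary"), ("adjunct", "saw", "yesterday")]
    predicted = [("subj", "saw", "John"), ("obj", "saw", "Mary"),
                 ("obj", "saw", "yesterday"), ("adjunct", "Mary", "yesterday")]
    print(round(f_score(gold, predicted), 4))
```

In this toy case two of four predicted triples are correct (precision 0.5) and two of three gold triples are recovered (recall 2/3), giving an f-score of 4/7.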

NLP progress at ICT

In this talk, I will introduce some work on NLP in our group. This work includes Chinese word segmentation (there are no separators between words in Chinese sentences) & part-of-speech tagging, syntactic parsing (including full parsing and shallow parsing), semantic analysis, Chinese language grammar theory, and Chinese-English Machine Translation systems based on rules and corpora.
Concretely, we combine word segmentation and POS tagging in one multi-level HMM framework. For syntactic parsing, an inverse-role parser which combines the advantages of LR and chart parsers is put forward. A PCFG parser and a maximum-entropy-based shallow parser are also investigated in our work. Machine learning methods are introduced in our semantic analysis work, especially for word sense disambiguation. We have built a rule-based CEMT system which contains about 400 rules and uses LFG (Lexical Functional Grammar) for node description. Corpus-based methods (including template-based and example-based methods) have also been investigated, along with some relevant techniques such as bilingual corpus alignment, sentence matching, and template generation. A further line of work is Chinese grammar for NLP: we introduce a category grammar for Chinese, with some revisions to the original grammar to reflect the characteristics of Chinese.

Developing a plurilingual ICALL System for Romance languages aimed at advanced learners

Research in plurilingual teaching and learning of Romance languages has shown that a combined approach to teaching Romance languages is very promising. It can exploit the similarities between these languages in many ways in order to teach them contrastively. Thus far several European projects have been devoted to plurilingual teaching of Romance languages. However, materials for plurilingual learning of Romance languages focus almost exclusively on receptive skills and lack any kind of intelligent automatic analysis of learner input, as well as flexible and dynamic feedback.
My research goal is the design, implementation, deployment and evaluation of a plurilingual ICALL (Intelligent Computer-Assisted Language Learning) software system ESPRIT for French, Italian and Spanish. I investigate how Natural Language Processing (NLP) tools can enhance the simultaneous teaching and learning of these languages.
These NLP tools comprise input analysis modules, animated grammar presentations, different kinds of corpora and intelligent corpora and text tools. I investigate whether the integration of a plurilingual error-sensitive parser and a modular input template as analysis modules helps to provide useful language production activities and to dynamically obtain flexible and precise feedback. Animated grammar presentations dynamically present grammatical properties and processes and support a high degree of interactivity. To provide rich input, I investigate the use of existing corpora and the creation of small specialised corpora. Intelligent corpora and text tools dynamically provide useful information on corpora and unrestricted texts respectively.
A continuous evaluation process helps to avoid technically driven development.

Linguistics-lite Example Based Machine Translation

The development of large-scale rules and grammars for a Rule-Based Machine Translation (RBMT) system is labour-intensive, error-prone and expensive. Current research in Machine Translation (MT) tends to focus on the development of corpus-based systems which can overcome the problem of knowledge-acquisition.
Corpus-Based Machine Translation (CBMT) can take the form of Statistical Machine Translation (SMT) or Example-Based Machine Translation (EBMT). Despite the benefits of EBMT, SMT is currently the dominant paradigm and many systems classified as example-based integrate additional rule-based and statistical techniques. The benefits of an EBMT system which does not require extensive linguistic resources and can produce reasonably intelligible and accurate translations cannot be overlooked. We show that our EBMT system can outperform an SMT system trained on the same data.
We apply the Marker Hypothesis: a psycholinguistic theory which states that all natural languages are 'marked' for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes. We use this technique in different environments to segment aligned English-French phrases and sentences. Subsequently, we apply an alignment algorithm which can deduce smaller aligned chunks and words. We generalise these alignments by replacing certain function words with an associated tag. As such, we cluster on marker words and add flexibility to our matching process. In a post hoc stage we treat the World Wide Web as a large corpus and validate and correct instances of determiner-noun and noun-verb boundary friction.
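Marker-based segmentation and generalisation can be illustrated roughly as follows. This is a sketch under stated assumptions, not the system's implementation: the small marker lexicon, the tag names, and the rule that a new chunk starts at a marker word only once the current chunk contains a non-marker word are all chosen here for illustration.

```python
# Hypothetical closed-class marker lexicon; a real one would be much larger.
MARKERS = {
    "the": "DET", "a": "DET", "an": "DET",
    "in": "PREP", "on": "PREP", "to": "PREP", "of": "PREP", "with": "PREP",
    "and": "CONJ", "or": "CONJ", "but": "CONJ",
    "he": "PRON", "she": "PRON", "it": "PRON", "you": "PRON",
}


def marker_chunks(sentence):
    """Split a sentence into chunks that each begin at a marker word."""
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        is_marker = word in MARKERS
        # Start a new chunk at a marker word, but only if the current chunk
        # already holds at least one non-marker (content) word.
        if is_marker and has_content:
            chunks.append(current)
            current, has_content = [], False
        current.append(word)
        if not is_marker:
            has_content = True
    if current:
        if chunks and not has_content:
            chunks[-1].extend(current)  # trailing markers join the previous chunk
        else:
            chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]


def generalise(chunk):
    """Replace a chunk-initial marker word with its tag."""
    first, _, rest = chunk.partition(" ")
    tag = MARKERS.get(first)
    return chunk if tag is None else f"<{tag}> {rest}".strip()


if __name__ == "__main__":
    for chunk in marker_chunks("You click on the red button to view the options"):
        print(chunk, "->", generalise(chunk))
```

With this lexicon the example sentence is split into "you click", "on the red button", "to view" and "the options"; generalising the second chunk gives "<PREP> the red button", which can then match any chunk that differs only in its initial preposition.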
We show that scaling-up data in a marker-based EBMT system improves the quality of our translations. We also report on the benefits of extracting lexical equivalences from the corpus using Mutual Information.

Audio Time-Scale Modification: keeping control of the tempo.

Audio time-scale modification is an effect that alters the duration of an audio signal without affecting its pitch or timbre. In other words, the duration of the original signal is increased or decreased but the perceptually important features of the original signal remain unchanged; in the case of speech, the time-scaled signal sounds as if the original speaker has spoken at a quicker or slower rate; in the case of music, the time-scaled signal sounds as if the musicians have played at a different tempo.
This talk outlines the techniques employed to achieve such an effect and discusses the common pitfalls associated with them.
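One of the simplest time-domain approaches is plain overlap-add (OLA): windowed frames are read from the input at one hop size and written to the output at another. The sketch below is an assumption about a representative technique rather than the method presented in the talk; the function name, frame length and hop size are illustrative.

```python
import math


def ola_time_stretch(signal, rate, frame_len=256, synthesis_hop=128):
    """Time-scale a list of samples by plain overlap-add.

    rate > 1 shortens the signal (faster), rate < 1 lengthens it (slower).
    """
    analysis_hop = max(1, round(synthesis_hop * rate))
    # Hann window to cross-fade neighbouring frames.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len) for i in range(frame_len)]
    n_frames = max(1, (len(signal) - frame_len) // analysis_hop + 1)
    out_len = (n_frames - 1) * synthesis_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for k in range(n_frames):
        read_pos = k * analysis_hop    # where the frame is taken from the input
        write_pos = k * synthesis_hop  # where it is placed in the output
        for i in range(frame_len):
            if read_pos + i < len(signal):
                out[write_pos + i] += window[i] * signal[read_pos + i]
                norm[write_pos + i] += window[i]
    # Divide by the summed window so overlapping frames do not change the gain.
    return [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)]


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
    faster = ola_time_stretch(tone, 2.0)
    print(len(tone), len(faster))
```

Because the frames are repositioned without regard to the waveform, plain OLA introduces phase discontinuities on pitched material, which is one well-known pitfall; methods such as WSOLA and the phase vocoder were developed to reduce this kind of artefact.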

Good Reasons for Noting Bad Grammar: Empirical Investigations into the Parsing of Ungrammatical Written English

This talk is concerned with the parsing of ungrammatical written English sentences. A 20,000 word corpus was developed which consists of ungrammatical sentences which were noticed while reading a variety of English texts. Each sentence in this corpus was corrected, producing a second corpus of grammatical sentences. In this talk I argue that the compilation of such a corpus is a useful computational linguistic resource, outline the methodological decisions which were made in compiling the corpus, present the results of a small questionnaire study which was used to investigate the reliability of the corpus data, and briefly describe three parsing applications of the corpus. The first is a parser which uses a bottom-up active chart parsing algorithm and an error grammar to parse ungrammatical sentences. The error grammar is derived from a conventional grammar and the differences between the sentences in the ungrammatical corpus and the corrected grammatical corpus are used to inform this derivation process. The second application is rooted in the linguistic framework of typed feature structures. An extended notion of a typed feature structure is presented which allows the inconsistent information contained in an agreement error to be stored. A form of relaxed unification is also defined which operates on these feature structures so that sentences containing an agreement error can be parsed. This idea was tested on corpus sentences by modifying the parser in the Linguistic Knowledge Base, a widely-used natural language parser/generator which employs typed feature structures as linguistic objects. The third application is a parser evaluation method which measures a parser's ability to parse ungrammatical sentences by comparing the parses it produces for the ungrammatical sentences from the corpus to the parses it produces for the equivalent grammatical sentences in the corrected corpus. 
This method is flexible enough to be applied to any type of parser, regardless of the linguistic framework used to encode analyses. The method was applied to two wide-coverage probabilistic parsers, and the results of the evaluation are presented.

TransBooster: boosting the performance of Machine Translation by reducing complex sentences to simple structures.

Most of the freely available, wide-coverage Machine Translation systems on the Internet are based on a rather simple architecture which is often unable to correctly interpret complex sentences. Our project aims at boosting the quality of MT engines by reducing those complex structures to simple sentences, which, embedded within a minimal context, are to be spoon-fed to the MT system.
In this talk we will give an overview of our sentence-reduction algorithm and give a first comparison between TransBooster and the MT system Logomedia on Penn II Treebank sentences.

Intuitive creative language learning in three-dimensional environments.

Drawing upon our work across a range of educational and cultural projects, I propose to show examples of the different ways that virtual reality software has been used with language learners in order to enhance their understanding and their abilities to use new technologies. My aim is to create intuitive three-dimensional learning spaces into which the new tools and resources widely available on the internet can be integrated to allow learners to learn effectively and creatively.
The principal software applied involves variations of a package called "Creative VR" that we have developed in conjunction with users over a 4-year period, but which was based on work done in virtual reality companies in Leicester, initially for the arcade and corporate markets, subsequently for educational and cultural projects. The software helps learners to design and learn from their own designer learning environments and can be used to complement personalized learning approaches. Integrated into the three-dimensional galleries are such tools as MP3s and WAVs, videos (AVIs, MPGs), documents (PDF, Word, PowerPoint, etc.) and internet sites and blogs. Multi-user communication via voice/IP interaction is used to share projects.
The presentation will show such projects as: primary school language learning in Liverpool, VR interfaces for video-conferencing, enhancing exchange trips, Black Country Pathfinder European languages portfolio project, VR as a communication tool for the deaf and BSL or ISL users, a VR Ulysses.
At some point the following topics may well come up! Role-play games for languages (such as the Sims, Broken Sword, Monkey Island). Student expectations and what gets in the way of exciting language education in England (different systems in Wales and Scotland). The future of applications using voice/IP technology. Blogs and digital video. Why VR got a bad name. Some new VR applications with a possible use for language learning.
About the speaker: John Hopwood was a Head of Modern Languages for 16 years in a secondary school for 14-18 year olds in the English state sector. From the mid-1990s onwards he worked with Dr Al Humrich (founder of Virtuality VR company) in Educality Ltd, designing VR applications for language learning and heritage. He works as an ICT/languages consultant at St Julie's High School in Liverpool and is a regular guest lecturer at Leicester University Museum Studies department. Via St Julie's he works regularly with CILT (National Languages Centre, UK) and the Specialist Schools' Trust.

Towards Example-Based SMT

Statistical Machine Translation (SMT) typically takes as its basis a noisy channel model in which the target language sentence T is distorted by the channel into the source language sentence S. To recover the original target language it makes use of a language model Pr(T) and translation model Pr(S|T).
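Under the noisy channel model, decoding amounts to choosing the target sentence T that maximises Pr(T) x Pr(S|T). The toy sketch below makes that concrete over a fixed candidate list; the sentences and probabilities are invented for illustration, and a real decoder searches a far larger space than two candidates.

```python
import math


def decode(source, candidates, language_model, translation_model):
    """Return the candidate T maximising log Pr(T) + log Pr(S|T)."""
    def score(target):
        # Work in log space so that small probabilities do not underflow.
        return (math.log(language_model[target])
                + math.log(translation_model[(source, target)]))
    return max(candidates, key=score)


if __name__ == "__main__":
    source = "la maison bleue"
    candidates = ["the blue house", "the house blue"]
    # Invented probabilities: the language model prefers fluent English,
    # while the word-order-preserving candidate scores slightly higher
    # under the translation model.
    language_model = {"the blue house": 0.02, "the house blue": 0.001}
    translation_model = {
        (source, "the blue house"): 0.3,
        (source, "the house blue"): 0.4,
    }
    print(decode(source, candidates, language_model, translation_model))
```

Here the language model outweighs the translation model's slight preference for the literal word order, so the fluent candidate "the blue house" is selected.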
The translation models (TMs) in traditional SMT systems try to model word-to-word correspondences between source and target words. The model is often further restricted so that each source word is assigned exactly one target word. One criticism of the IBM-style TM is that it does not model structural or syntactic aspects of the language and fails to capture dependencies between groups of words (e.g. complex nouns), meaning that language pairs with very different word order are not modelled well by these TMs.
This has led to the recent development of phrase-based SMT systems that make use of phrase translations in their TMs to improve the overall quality of translation output. A number of methods have been proposed for the extraction of phrase translation pairs for use in SMT, with those based on word-to-word alignments proving to be the most successful. EBMT systems make effective use of both phrasal and lexical correspondences. The marker-based approach to EBMT has been proven to be extremely successful in extracting phrase translation pairs, as well as lexical correspondences, using a linguistic-light approach.
In this talk I will discuss current methods of phrase-based SMT and outline preliminary research into the integration of Example-Based and Statistical Machine Translation.

Controlled Language And Its Impact On Translation Automation

Johann Roturier is currently involved in a research project whose objective is to automate the translation process of technical documents in the field of computer security. Due to the time-critical nature of this type of communication, which needs to be promptly distributed in a number of languages, MT presents itself as a prospective candidate. The limitations of RBMT are often epitomized by its inability to process an unrestricted input to produce consistent translations of acceptable quality. However, the quality of this output can be significantly improved if writers create documents with MT in mind (Bernth & Gdaniec, 2001). Previous initiatives showed that certain Controlled Language (CL) rules must be applied to the source text to achieve this objective. By applying lexical, syntactic and semantic restrictions, CL attempts to improve the clarity of the source text so as to reduce ambiguities during the automatic translation process (Kamprath et al., 1998). In this talk, I will first introduce the concept of CL and reflect on its relevance for translation automation. The findings of a preliminary study that was conducted to assess the effectiveness of CL rules on MT output will then be presented. Finally, I will discuss the opportunity that a CL environment creates for the possible automation of the Post-Editing (PE) process when the minimal PE tasks require no linguistic analysis.

Dublin City University   Last update: 1st October 2010