National Centre for Language Technology

Dublin City University, Ireland


Centre for Next Generation Localisation

School of Computing

School of Applied Languages and Intercultural Studies

School of Electronic Engineering


NCLT Seminar Series 2007/2008

The NCLT seminar series takes place on Wednesdays from 4-5 pm in Room L2.21 (School of Computing).

The seminar will comprise a mixture of research talks and, this year, a round of tutorials based on chapters from the new (draft) edition of the Jurafsky and Martin book, Speech and Language Processing, which can be found here.

View current chapter allocation

The schedule of presenters for the 2007/2008 series (Semester 1) will be added below as they are confirmed:

October 24th 2007 John McKenna Phonetics [Chpt. 7]
October 31st 2007 Lamia Tounsi Sub-automata Research and Compression of Electronic Dictionaries
November 7th 2007 Joachim Wagner An overview of the new NCLT cluster
November 14th 2007 Ines Rehbein Hidden Markov Models [Chpt. 6 (1)]
November 21st 2007 Grzegorz Chrupala Maximum Entropy Models [Chpt. 6 (2)]
November 28th 2007 TLT'07 dry-runs: John Tinsley, "Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation"; Ines Rehbein, "Why is it so Difficult to Compare Treebanks? TIGER and TüBa-D/Z revisited"
January 16th 2008 Conor Cafferkey Parsing with Context-Free Grammars [Chpt. 13]
January 23rd 2008 Jennifer Foster Statistical Parsing [Chpt. 14]
January 30th 2008 Mary Hearne Machine Translation [Chpt. 25]
February 6th 2008 Patrik Lambert Minimum-Translation-Error Discriminative Alignment Training
February 13th 2008 Yanjun Ma Generative Word Alignment Models
February 20th 2008  
February 27th 2008 Ventsi Zhechev  


Phonetics

Dr. John McKenna

In presenting J&M's Chapter 7 on phonetics, we will quickly run through articulatory phonetics and phonology, only pausing where necessary (as dictated by those in attendance). I intend to spend more time on acoustic phonetics as this is almost a prerequisite for appreciating the technical aspects of both ASR (automatic speech recognition) and waveform generation in speech synthesis. In summary, we'll look more closely at understanding time and frequency representations of the speech signal and how digital signal processing plays its part. I plan to conduct the session in a fairly informal and interactive fashion. The primary presentation source will be the pdf of the chapter - so please bring a printout along - but I will have some supporting material.

Sub-automata Research and Compression of Electronic Dictionaries

Dr. Lamia Tounsi

Acyclic finite-state automata are widely used in Natural Language Processing to represent and store large data sets such as dictionaries. Our work studies the internal structure of acyclic automata; more precisely, we are interested in finding structures inside a finite-state automaton, which we call sub-automata. We propose an O(n^3) algorithm to compute all sub-automata of a given automaton. This study can be used in applications that aim to decompose a very large FSA into smaller ones, to discover frequently occurring data and to reduce memory consumption. The second part of our work applies our algorithm to the compression and indexing of automata that represent electronic dictionaries. We also propose a compression algorithm that reduces the memory required to store the automata while preserving efficient access to the data. The main proposals are, on the one hand, the use of the directed acyclic word graph, originally designed for indexing text, to index the sub-automata and, on the other hand, a heuristic to select the most interesting substructures to factorize. The best candidates for factorization are those which increase memory storage efficiency and reduce the size of the initial automaton.

View slides

An overview of the new NCLT cluster

Joachim Wagner

NLP research often demands resources not available on a single desktop PC. Training statistical models can be very memory-intensive, corpus processing very CPU-intensive, and some tasks require large amounts of temporary disk space. As many users share the same machines for their experiments, there have been resource conflicts in the past (for example "disk full"). To address these needs and problems, 5 new machines have been bought and organised in a cluster over the last 6 months. The resources of the cluster are managed centrally and allocated exclusively for experiments. In this talk I will give an overview of the cluster, show how to use it and outline the plan for integrating the old machines into the cluster and adding more new machines.

View slides

Hidden Markov Models

Ines Rehbein

In my presentation I will focus on the first part of (Jurafsky & Martin, 2007: Chapter 6): Hidden Markov Models. HMMs are probabilistic sequence classifiers which can be used to compute a probability distribution over possible label sequences and are applied to a wide range of NLP tasks such as speech recognition, tagging, chunking, word sense disambiguation, and so forth. I will talk about different aspects of applications (evaluation, decoding, training), and introduce the Forward, Viterbi and Forward-Backward algorithms.
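As an illustration of the decoding problem mentioned above, here is a minimal sketch of the Viterbi algorithm in Python. It is not taken from the talk or the book; the two-tag model and all probabilities (`start_p`, `trans_p`, `emit_p`) are invented toy values, and a real implementation would work with log probabilities to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) for an observation sequence under an HMM."""
    # best[t][s]: probability of the most likely state path ending in s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Follow the back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model (hypothetical numbers): two tags, two words
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"fish": 0.6, "sleep": 0.4}, "V": {"fish": 0.3, "sleep": 0.7}}

path, prob = viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Replacing `max` with a sum over predecessors in the inner loop turns the same recurrence into the Forward algorithm, which addresses the evaluation problem instead of decoding.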

View slides

Maximum Entropy Models

Grzegorz Chrupala

This talk gives a short introduction to Maximum Entropy models and their use for classification (e.g. document classification) and sequence labeling (e.g. POS-tagging). Maximum entropy models belong to the family of log-linear, or exponential, classifiers. They solve multi-class classification problems using multinomial logistic regression. MaxEnt is based on the idea of building a probabilistic model which satisfies constraints learned from the training data but otherwise makes no additional assumptions. That is, a MaxEnt model is the most uniform distribution which is consistent with the constraints. MaxEnt models in themselves are classifiers. Markov models based on Maximum Entropy, or MEMMs, can be used for sequence labeling. They work by using the Viterbi algorithm to find the best sequence of labels given the conditional probability distributions for each element in the sequence. Their advantage in comparison to HMMs is the ability to condition on arbitrary features. For example, for POS-tagging, features such as suffixes, capitalization or surrounding punctuation can be used, which are difficult to encode in HMMs.
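To make the log-linear form concrete, the following Python sketch computes P(label | features) for a classifier with binary features. The function name `maxent_probs`, the feature strings and the weights are hypothetical illustrations, not material from the talk; training the weights is not shown.

```python
import math


def maxent_probs(features, weights, labels):
    """P(label | features) under a log-linear model with binary features."""
    # Score of a label: sum of the weights of all active (feature, label) pairs
    scores = {y: sum(weights.get((f, y), 0.0) for f in features) for y in labels}
    # Subtract the maximum score before exponentiating, for numerical stability
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())  # normalisation constant
    return {y: e / z for y, e in exps.items()}


labels = ["NN", "VBG", "JJ"]
features = ["suffix=ing", "capitalised=no"]

# With no learned weights the model is uniform over the labels
print(maxent_probs(features, {}, labels))

# A single (made-up) weight shifts probability mass towards VBG
weights = {("suffix=ing", "VBG"): 2.0}
print(maxent_probs(features, weights, labels))
```

The first call shows the "no additional assumptions" property: with no constraints encoded in the weights, every label receives the same probability.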

View slides

Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation

John Tinsley

We use existing tools to automatically build two parallel treebanks from existing parallel corpora. We then show that combining the data extracted from both the treebanks and the corpora into a single translation model can improve the translation quality in a baseline phrase-based statistical machine translation system.

View paper

Why is it so Difficult to Compare Treebanks? TIGER and TüBa-D/Z revisited

Ines Rehbein

This paper is a contribution to the ongoing discussion on treebank annotation schemes and their impact on PCFG parsing results. We provide a thorough comparison of two German treebanks: the TIGER treebank and the TüBa-D/Z. We use simple statistics on sentence length and vocabulary size, and more refined methods such as perplexity and its correlation with PCFG parsing results, as well as a Principal Components Analysis. Finally we present a qualitative evaluation of a set of 100 sentences from the TüBa-D/Z, manually annotated in the TIGER as well as in the TüBa-D/Z annotation scheme, and show that even the existence of a parallel subcorpus does not support a straightforward and easy comparison of both annotation schemes.

View paper


Parsing with Context-Free Grammars

Conor Cafferkey

I will present an overview of the chapter "Parsing with Context-Free Grammars" from the new edition of Jurafsky and Martin's "Speech and Language Processing". The chapter covers full parsing with CFGs -- including CKY, Earley and agenda-based (chart) parsing -- as well as partial parsing (in particular machine learning-based base-phrase chunking). Since much of the material should be quite familiar to many people I will put a particular focus on the new additions to the chapter, and will provide some additional examples not covered in the book.
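For readers who want a reminder of how CKY works, here is a minimal recogniser sketch in Python. It assumes a grammar in Chomsky normal form, given as two hypothetical dictionaries (`lexical` for word-to-category rules, `binary` for rules with two non-terminals on the right-hand side); the three-word grammar is a toy example, not one from the chapter.

```python
from collections import defaultdict


def cky_recognise(words, lexical, binary, start="S"):
    """Return True if the CNF grammar derives the word sequence from `start`."""
    n = len(words)
    chart = defaultdict(set)  # chart[(i, j)]: non-terminals spanning words[i:j]
    for i, word in enumerate(words):
        chart[(i, i + 1)] = set(lexical.get(word, ()))
    for span in range(2, n + 1):          # width of the span
        for i in range(n - span + 1):     # left edge
            j = i + span                  # right edge
            for k in range(i + 1, j):     # split point
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((left, right), set())
    return start in chart[(0, n)]


# Toy grammar (hypothetical): S -> NP VP, NP -> Det N
lexical = {"the": {"Det"}, "dog": {"N"}, "barks": {"VP"}}
binary = {("Det", "N"): {"NP"}, ("NP", "VP"): {"S"}}

print(cky_recognise("the dog barks".split(), lexical, binary))
print(cky_recognise("dog the barks".split(), lexical, binary))
```

Storing back-pointers alongside each chart entry would turn this recogniser into a parser, and attaching rule probabilities to the entries gives the probabilistic CKY covered in the statistical parsing session.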

View slides

Statistical Parsing

Jennifer Foster

I will present an overview of statistical parsing, based on a draft Chapter 14 of the new edition of Jurafsky and Martin's "Speech and Language Processing". The chapter covers the following topics: PCFGs, using PCFGs for syntactic disambiguation and language modelling, probabilistic CKY, obtaining rule probabilities, PCFG limitations, lexicalised history-based generative parsing, discriminative parsing, parser evaluation and the human parsing mechanism. I will cover all but the last two topics and also include a very brief overview of the dependency parsing field.

View slides

Machine Translation

Mary Hearne

In this talk, I will present an overview of the Jurafsky & Martin chapter on Machine Translation (Chpt. 25, 2007). I will briefly outline the history of MT as a field of research and where the different approaches fit in. However, I will focus most of the talk on the basics of word- and phrase-based statistical MT, which is the dominant approach to MT both in the research arena and in Jurafsky & Martin's chapter. I aim to outline how the IBM word-alignment models (implemented in GIZA++) work, how phrase-pair induction currently works and the basics of decoding for SMT. However, it's a long chapter and each of these topics involves a lot of detail, so I'm not sure how much material I'll get through -- Yanjun will also present some of this chapter later on in the seminar series.

Minimum-Translation-Error Discriminative Alignment Training

Patrik Lambert

In current Statistical Machine Translation (SMT) systems, alignment is trained in a separate stage before the translation model. Consequently, alignment model parameters are tuned for the translation task only indirectly. The speaker will present a framework for discriminative training of alignment models with automated translation metrics as the maximisation criterion. Thus, no link labels at the word level are needed. First the n-gram-based machine translation system will be introduced. Then the difficulties of word alignment evaluation and its correlation with machine translation quality will be discussed. After this, the alignment system will be described. Finally, the speaker will present the minimum-translation-error alignment training method on small corpora, and its extension to large corpora (the alignment model coefficients are tuned over a small part of the corpus and used to align the whole corpus).

Generative Word Alignment Models

Yanjun Ma

This talk will focus on generative word alignment models in Statistical Machine Translation (SMT). I will first give a formal definition of word alignment and introduce the evaluation methods and mainstream approaches for this task. The talk will then cover HMM word alignment models, including the first-order HMM word alignment model (Vogel, Ney and Tillmann 1996) and zero-order word alignment models (IBM Model 1 and IBM Model 2). I will also go through how to use the EM algorithm for unsupervised parameter estimation (cf. Mary's talk). Fertility-based models, including IBM Models 3, 4 and 5, will follow. I will introduce heuristics (hill-climbing) for the approximate parameter estimation of these more complicated models. Advantages and limitations of each of these models will also be illustrated during the talk. Finally, I will give a brief introduction to GIZA++, the implementation of these models, and its limitations in use. An overview of some recent attempts to improve generative models will end this talk.
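As a small illustration of EM for unsupervised parameter estimation, the sketch below trains IBM Model 1 lexical translation probabilities t(f | e) in Python. It is a simplified textbook-style version, not GIZA++ and not code from the talk; the function name `ibm_model1`, the `NULL` token convention and the three-sentence corpus are assumptions made for the example.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM from (e_words, f_words) sentence pairs."""
    f_vocab = {f for _, fs in pairs for f in fs}
    # Uniform initialisation over the f vocabulary
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        for es, fs in pairs:
            es = ["NULL"] + es      # allow f words to align to an empty word
            for f in fs:
                # E-step: distribute one count for f over the candidate e words
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts
        t = defaultdict(float, {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t


# Toy parallel corpus (hypothetical)
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = ibm_model1(pairs)
print(t[("das", "the")], t[("Haus", "the")])
```

Because Model 1 ignores word positions, its likelihood has a single global optimum and the uniform start is sufficient; the fertility-based models do not share this property, which is why they rely on the hill-climbing heuristics mentioned above.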

Dublin City University   Last update: 1st October 2010