Detecting Grammatical Errors with Treebank-Induced, Probabilistic Parsers

Joachim Wagner (2012): Detecting Grammatical Errors with Treebank-Induced, Probabilistic Parsers. PhD Thesis, Dublin City University, Dublin, Ireland.

Abstract

Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
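As a simple illustration of how an artificial ungrammatical test set could be derived from grammatical text, the Python sketch below distorts a tokenised sentence with one of three simple operations. The operations shown (deleting, duplicating or swapping a word) are assumptions chosen for brevity; realistic error types such as agreement errors would require more linguistic knowledge than this sketch uses, and the function name is purely illustrative.

```python
import random

def insert_error(tokens, rng=random):
    """Derive an artificially ungrammatical sentence from a grammatical one.

    Illustrative sketch only: applies one of three simple distortion
    operations to a list of tokens and returns the distorted copy.
    """
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    operation = rng.choice(['delete', 'duplicate', 'swap'])
    i = rng.randrange(len(tokens) - 1)
    if operation == 'delete':        # missing-word error
        del tokens[i]
    elif operation == 'duplicate':   # extra-word error
        tokens.insert(i, tokens[i])
    else:                            # word-order error
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

print(insert_error("the cat sat on the mat".split()))
```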

Software

I'm sorry to inform you that the source code of the software used in my error detection research is currently not available.

Maybe I can help you in another way? If you explain what you are trying to do with the source code, there may be a way to achieve your goal without accessing it. Please contact me at jwagner@computing.dcu.ie.

Introduction

This thesis is concerned with the task of automatic grammaticality judgements, i.e. detecting whether or not a sentence contains a grammatical error, using probabilistic parsing with treebank-induced grammars. A classifier capable of distinguishing between syntactically well-formed and syntactically ill-formed sentences has the potential to be useful in a wide range of applications.

In the area of computer-assisted language learning (CALL), the use of parsing technology often faces scepticism as systems using traditional, hand-crafted grammars fell short of expectations (Borin, 2002; Nerbonne, 2002). However, today's dominant parsing technology uses a different type of grammar: grammars that have been automatically induced from treebanks, i.e. text annotated with syntactic structures. Given sufficiently large treebanks, such grammars tend to be highly robust to unexpected input and achieve wide coverage of unrestricted text (van Genabith, 2006). This robustness also extends to grammatical errors: almost all input is parsed into a (more or less plausible) parse tree, meaning that parsability cannot be used as a criterion for grammaticality.

In this thesis, we present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. This method combines well with n-gram and deep grammar-based approaches, as well as combinations thereof, in a machine learning-based framework. In addition, voting classifiers are proposed to tune the accuracy trade-off between finding all errors and not overflagging grammatical sentences as ungrammatical.
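The voting idea can be made concrete with a minimal sketch: assuming several binary classifiers each return True when they consider a sentence ungrammatical, the sentence is flagged only when enough of them agree, and the required number of votes becomes the knob for the accuracy trade-off. The function name and interface below are purely illustrative.

```python
def vote_ungrammatical(judgements, min_votes):
    """Flag a sentence as ungrammatical only if at least `min_votes` of the
    individual classifiers flag it. Raising `min_votes` reduces overflagging
    of grammatical sentences at the cost of missing more errors; lowering it
    does the opposite.
    """
    return sum(judgements) >= min_votes

# judgements: True = this classifier says "ungrammatical"
print(vote_ungrammatical([True, False, True, True, False], min_votes=3))   # True
print(vote_ungrammatical([True, False, False, False, False], min_votes=3)) # False
```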

Outline of Chapters

Chapter 2: Research on methods for detecting grammatical errors in English text using probabilistic parsing draws from a wide range of fields including computational linguistics, computer-assisted language learning and machine learning. Chapter 2 aims to provide an overview of the relevant fields and of the main concepts of data-driven error detection.

Chapter 3: Statistical methods, including those we employ, rely heavily on data for training and evaluation. The first half of Chapter 3 presents the corpora we use, including preprocessing, part-of-speech tagging and error annotation. The second part of Chapter 3 deals with evaluation. We discuss the implications of evaluating with accuracy on two test sets (grammatical and ungrammatical test data), both during development and in final testing.
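As a minimal sketch of what evaluating with accuracy on two test sets means in practice, the function below reports one accuracy for grammatical test data (rewarding the absence of overflagging) and one for ungrammatical test data (rewarding found errors); the function name and the toy predictions are invented for illustration.

```python
def two_accuracies(predictions_grammatical, predictions_ungrammatical):
    """Compute accuracy separately on grammatical and ungrammatical test data.

    Each prediction is True if the classifier flags the sentence as
    ungrammatical. Reporting the pair of accuracies, rather than a single
    pooled accuracy, makes the trade-off between the two visible.
    """
    acc_grammatical = sum(not p for p in predictions_grammatical) / len(predictions_grammatical)
    acc_ungrammatical = sum(predictions_ungrammatical) / len(predictions_ungrammatical)
    return acc_grammatical, acc_ungrammatical

print(two_accuracies([False, False, True, False], [True, True, False, True]))
```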

Chapter 4: The APP/EPP method for using the parse probability assigned by a probabilistic, treebank-induced grammar (APP) for grammaticality judgements occupies a full chapter as it requires the development of a new data-driven model that we call the EPP model. In addition, Chapter 4 provides empirical evidence that parse probabilities reflect grammaticality and provides necessary background on probabilistic parsing.
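The core decision rule can be sketched in a few lines: assuming an EPP model already supplies an estimated log parse probability for the input sentence, the sentence is flagged as ungrammatical when this estimate exceeds the actual log parse probability (APP) by more than some margin. The names, numbers and margin below are illustrative assumptions, not values from the thesis.

```python
def judge_app_epp(logprob_actual_parse, logprob_estimated, margin):
    """Flag a sentence as ungrammatical if its actual log parse probability
    (APP) falls short of the log probability estimated for comparable
    grammatical input (EPP) by more than `margin`.

    `logprob_estimated` would come from a model trained on parsed
    grammatical text; here it is simply passed in.
    """
    return logprob_estimated - logprob_actual_parse > margin

# hypothetical numbers for illustration only
print(judge_app_epp(logprob_actual_parse=-62.3, logprob_estimated=-55.1, margin=5.0))  # True
```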

Chapter 5: We evaluate four basic methods for automatically judging the grammaticality of a sentence that share the property that they do not rely on machine learning to set their parameters: (a) parsing with a hand-crafted precision grammar, (b) flagging unattested or rare part-of-speech n-grams, (c) pruning rare rules from PCFGs in an attempt to make them more discriminative, and (d) the distorted treebank method that compares parse results with vanilla and error-distorted treebank grammars.
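As an illustration of method (b) above, the sketch below flags a sentence if its part-of-speech tag sequence contains an n-gram that is unattested or rare in a reference corpus of grammatical text. The n-gram length, frequency threshold and toy counts are assumptions for the example, not the settings used in the thesis.

```python
from collections import Counter

def flag_rare_pos_ngrams(pos_tags, reference_counts, n=5, min_count=1):
    """Flag a sentence if it contains a POS n-gram that occurs fewer than
    `min_count` times in a reference corpus of grammatical text.

    `reference_counts` maps POS n-grams (tuples of tags) to their corpus
    frequency; building it from a tagged corpus is not shown here.
    """
    for i in range(len(pos_tags) - n + 1):
        ngram = tuple(pos_tags[i:i + n])
        if reference_counts.get(ngram, 0) < min_count:
            return True   # suspicious n-gram found: flag as ungrammatical
    return False

# toy reference counts and toy tag sequences, for illustration only
reference = Counter({('DT', 'NN', 'VBZ', 'DT', 'NN'): 42})
print(flag_rare_pos_ngrams(['DT', 'NN', 'VBZ', 'DT', 'NN'], reference))  # False
print(flag_rare_pos_ngrams(['DT', 'DT', 'VBZ', 'DT', 'NN'], reference))  # True
```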

Chapter 6: Three of the methods from Chapter 5 (leaving out the unsuccessful PCFG pruning method) are developed further using machine learning in Chapter 6. First, we extract features for learning and show the effect of machine learning on each basic method. Then we combine feature sets, building stronger methods. Finally, we propose to tune the accuracy trade-off of machine learning-based methods in a voting scheme.
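A minimal sketch of the feature-combination step, assuming scikit-learn's DecisionTreeClassifier as a stand-in learner and entirely invented feature values: each row concatenates features derived from several basic methods, and a single classifier is trained on the combined representation.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row concatenates illustrative features from different methods, e.g.
# [precision-grammar parsable?, lowest POS n-gram frequency,
#  distorted-vs-vanilla log-probability difference]
X = [
    [1, 120.0, 0.3],   # grammatical training sentence
    [1,  85.0, 0.1],   # grammatical training sentence
    [0,   0.0, 4.2],   # ungrammatical training sentence
    [1,   2.0, 3.7],   # ungrammatical training sentence
]
y = [0, 0, 1, 1]       # 0 = grammatical, 1 = ungrammatical

classifier = DecisionTreeClassifier(random_state=0).fit(X, y)
print(classifier.predict([[1, 3.0, 3.9]]))   # likely flagged as ungrammatical
```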

Chapter 7: We compare results of Chapters 4 to 6, expand the evaluation to a breakdown by error type and test selected methods on real learner data. We observe varied performance and draw conclusions for future work involving our methods.

Chapter 8: The final chapter summarises the contributions, main results and lessons learned. We discuss implications for future work on error detection methods, not limited to the methods we present.

Keywords

grammar checker, error detection, natural language processing, probabilistic grammar, precision grammar, decision tree learning, ROC curve, voting classifier, n-gram language models, learner corpus, error corpora


© 2012, 2014 Joachim Wagner jwagner@computing.dcu.ie