Tài liệu Báo cáo khoa học: "Lemmatisation as a Tagging Task" pdf

5 385 0
  • Loading ...
1/5 trang

Thông tin tài liệu

Ngày đăng: 19/02/2014, 19:20

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 368–372,Jeju, Republic of Korea, 8-14 July 2012.c2012 Association for Computational LinguisticsLemmatisation as a Tagging TaskAndrea GesmundoDepartment of Computer ScienceUniversity of Genevaandrea.gesmundo@unige.chTanja Samardˇzi´cDepartment of LinguisticsUniversity of Genevatanja.samardzic@unige.chAbstractWe present a novel approach to the task ofword lemmatisation. We formalise lemmati-sation as a category tagging task, by describ-ing how a word-to-lemma transformation rulecan be encoded in a single label and how aset of such labels can be inferred for a specificlanguage. In this way, a lemmatisation sys-tem can be trained and tested using any super-vised tagging model. In contrast to previousapproaches, the proposed technique allows usto easily integrate relevant contextual informa-tion. We test our approach on eight languagesreaching a new state-of-the-art level for thelemmatisation task.1 IntroductionLemmatisation and part-of-speech (POS) taggingare necessary steps in automatic processing of lan-guage corpora. This annotation is a prerequisitefor developing systems for more sophisticated au-tomatic processing such as information retrieval, aswell as for using language corpora in linguistic re-search and in the humanities. Lemmatisation is es-pecially important for processing morphologicallyrich languages, where the number of different wordforms is too large to be included in the part-of-speech tag set. The work on morphologically richlanguages suggests that using comprehensive mor-phological dictionaries is necessary for achievinggood results (Hajiˇc, 2000; Erjavec and Dˇzeroski,2004). However, such dictionaries are constructedmanually and they cannot be expected to be devel-oped quickly for many languages.In this paper, we present a new general approachto the task of lemmatisation which can be used toovercome the shortage of comprehensive dictionar-ies for languages for which they have not been devel-oped. Our approach is based on redefining the taskof lemmatisation as a category tagging task. Formu-lating lemmatisation as a tagging task allows the useof advanced tagging techniques, and the efficient in-tegration of contextual information. We show thatthis approach gives the highest accuracy known oneight European languages having different morpho-logical complexity, including agglutinative (Hungar-ian, Estonian) and fusional (Slavic) languages.2 Lemmatisation as a Tagging TaskLemmatisation is the task of grouping together wordforms that belong to the same inflectional morpho-logical paradigm and assigning to each paradigm itscorresponding canonical form called lemma. For ex-ample, English word forms go, goes, going, went,gone constitute a single morphological paradigmwhich is assigned the lemma go. Automatic lemma-tisation requires defining a model that can determinethe lemma for a given word form. Approaching itdirectly as a tagging task by considering the lemmaitself as the tag to be assigned is clearly unfeasible:1) the size of the tag set would be proportional to thevocabulary size, and 2) such a model would overfitthe training corpus missing important morphologi-cal generalisations required to predict the lemma ofunseen words (e.g. the fact that the transformationfrom going to go is governed by a general rule thatapplies to most English verbs).Our method assigns to each word a label encod-368ing the transformation required to obtain the lemmastring from the given word string. The generic trans-formation from a word to a lemma is done in foursteps: 1) remove a suffix of length Ns; 2) add anew lemma suffix, Ls; 3) remove a prefix of lengthNp; 4) add a new lemma prefix, Lp. The tupleτ ≡ Ns, Ls, Np, Lp defines the word-to-lemmatransformation. Each tuple is represented with alabel that lists the 4 parameters. For example, thetransformation of the word going into its lemma isencoded by the label 3, ∅, 0, ∅. This label can beobserved on a specific lemma-word pair in the train-ing set but it generalizes well to the unseen wordsthat are formed regularly by adding the suffix -ing.The same label applies to any other transformationwhich requires only removing the last 3 charactersof the word string.Suffix transformations are more frequent than pre-fix transformations (Jongejan and Dalianis, 2009).In some languages, such as English, it is sufficientto define only suffix transformations. In this case, allthe labels will have Npset to 0 and Lpset to ∅. How-ever, languages richer in morphology often requireencoding prefix transformations too. For example,in assigning the lemma to the negated verb forms inCzech the negation prefix needs to be removed. Inthis case, the label 1, t, 2, ∅ maps the word nevˇedˇelto the lemma vˇedˇet. The same label generalises toother (word, lemma) pairs: (nedok´azal, dok´azat),(neexistoval, existovat), (nepamatoval, pamatovat).1The set of labels for a specific language is inducedfrom a training set of pairs (word, lemma). For eachpair, we first find the Longest Common Substring(LCS) (Gusfield, 1997). Then we set the value ofNpto the number of characters in the word that pre-cede the start of LCS and Nsto the number of char-acters in the word that follow the end of LCS. Thevalue of Lpis the substring preceding LCS in thelemma and the value of Lsis the substring follow-ing LCS in the lemma. In the case of the examplepair (nevˇedˇel, vˇedˇet), the LCS is vˇedˇe, 2 charactersprecede the LCS in the word and 1 follows it. Thereare no characters preceding the start of the LCS in1The transformation rules described in this section are welladapted for a wide range of languages which encode morpho-logical information by means of affixes. Other encodings can bedesigned to handle other morphological types (such as Semiticlanguages). 0 50 100 150 200 250 300 350 0 10000 20000 30000 40000 50000 60000 70000 80000 90000label set sizeword-lemma samplesEnglishSloveneSerbianFigure 1: Growth of the label set with the number of train-ing instances.the lemma and ‘t’ follows it. The generated label isadded to the set of labels.3 Label set inductionWe apply the presented technique to induce the la-bel set from annotated running text. This approachresults in a set of labels whose size convergencesquickly with the increase of training pairs.Figure 1 shows the growth of the label set sizewith the number of tokens seen in the training set forthree representative languages. This behavior is ex-pected on the basis of the known interaction betweenthe frequency and the regularity of word forms thatis shared by all languages: infrequent words tend tobe formed according to a regular pattern, while ir-regular word forms tend to occur in frequent words.The described procedure leverages this fact to in-duce a label set that covers most of the word occur-rences in a text: a specialized label is learnt for fre-quent irregular words, while a generic label is learntto handle words that follow a regular pattern.We observe that the non-complete convergence ofthe label set size is, to a large extent, due to the pres-ence of noise in the corpus (annotation errors, ty-pos or inconsistency). We test the robustness of ourmethod by deciding not to filter out the noise gener-ated labels in the experimental evaluation. We alsoobserve that encoding the prefix transformation inthe label is fundamental for handling the size of thelabel sets in the languages that frequently use lemmaprefixes. For example, the label set generated for369Czech doubles in size if only the suffix transforma-tion is encoded in the label. Finally, we observe thatthe size of the set of induced labels depends on themorphological complexity of languages, as shown inFigure 1. The English set is smaller than the Sloveneand Serbian sets.4 Experimental EvaluationThe advantage of structuring the lemmatisation taskas a tagging task is that it allows us to apply success-ful tagging techniques and use the context informa-tion in assigning transformation labels to the wordsin a text. For the experimental evaluations we usethe Bidirectional Tagger with Guided Learning pre-sented in Shen et al. (2007). We chose this modelsince it has been shown to be easily adaptable forsolving a wide set of tagging and chunking tasks ob-taining state-of-the-art performances with short ex-ecution time (Gesmundo, 2011). Furthermore, thismodel has consistently shown good generalisationbehaviour reaching significantly higher accuracy intagging unknown words than other systems.We train and test the tagger on manually anno-tated G. Orwell’s “1984” and its translations to sevenEuropean languages (see Table 2, column 1), in-cluded in the Multext-East corpora (Erjavec, 2010).The words in the corpus are annotated with bothlemmas and detailed morphosyntactic descriptionsincluding the POS labels. The corpus contains 6737sentences (approximatively 110k tokens) for eachlanguage. We use 90% of the sentences for trainingand 10% for testing.We compare lemmatisation performance in differ-ent settings. Each setting is defined by the set of fea-tures that are used for training and prediction. Table1 reports the four feature sets used. Table 2 reportsthe accuracy scores achieved in each setting. We es-tablish the Base Line (BL) setting and performancein the first experiment. This setting involves onlyfeatures of the current word, [w0], such as the wordform, suffixes and prefixes and features that flag thepresence of special characters (digits, hyphen, caps).The BL accuracy is reported in the second column ofTable 2).In the second experiment, the BL feature set isexpanded with features of the surrounding words([w−1], [w1]) and surrounding predicted lemmas([lem−1], [lem1]). The accuracy scores obtained inBase Line [w0], flagChars(w0),(BL) prefixes(w0), suffixes(w0)+ context BL + [w1], [w−1], [lem1], [lem−1]+ POS BL + [pos0]+cont.&POS BL + [w1], [w−1], [lem1], [lem−1],[pos0], [pos−1], [pos1]Table 1: Feature sets.Base + + +cont.&POSLanguage Line cont. POS Acc. UWACzech 96.6 96.8 96.8 97.7 86.3English 98.8 99.1 99.2 99.6 94.7Estonian 95.8 96.2 96.5 97.4 78.5Hungarian 96.5 96.9 97.0 97.5 85.8Polish 95.3 95.6 96.0 96.8 85.8Romanian 96.2 97.4 97.5 98.3 86.9Serbian 95.0 95.3 96.2 97.2 84.9Slovene 96.1 96.6 97.0 98.1 87.7Table 2: Accuracy of the lemmatizer in the four settings.the second experiment are reported in the third col-umn of Table 2. The consistent improvements overthe BL scores for all the languages, varying fromthe lowest relative error reduction (RER) for Czech(5.8%) to the highest for Romanian (31.6%), con-firm the significance of the context information. Inthe third experiment, we use a feature set in whichthe BL set is expanded with the predicted POS tag ofthe current word, [pos0].2The accuracy measuredin the third experiment (Table 2, column 4) showsconsistent improvement over the BL (the best RERis 34.2% for Romanian). Furthermore, we observethat the accuracy scores in the third experiment areclose to those in the second experiment. This allowsus to state that it is possible to design high qualitylemmatisation systems which are independent of thePOS tagging. Instead of using the POS information,which is currently standard practice for lemmatisa-tion, the task can be performed in a context-wise set-ting using only the information about surroundingwords and lemmas.In the fourth experiment we use a feature set con-sisting of contextual features of words, predictedlemmas and predicted POS tags. This setting com-2The POS tags that we use are extracted from the mor-phosyntactic descriptions provided in the corpus and learnedusing the same system that we use for lemmatisation.370bines the use of the context with the use of the pre-dicted POS tags. The scores obtained in the fourthexperiment are considerably higher than those in theprevious experiments (Table 2, column 5). The RERcomputed against the BL varies between 28.1% forHungarian and 66.7% for English. For this set-ting, we also report accuracies on unseen words only(UWA, column 6 in Table 2) to show the generalisa-tion capacities of the lemmatizer. The UWA scores85% or higher for all the languages except Estonian(78.5%).The results of the fourth experiment show that in-teresting improvements in the performance are ob-tained by combining the POS and context informa-tion. This option has not been explored before.Current systems typically use only the informationon the POS of the target word together with lem-matisation rules acquired separately from a dictio-nary, which roughly corresponds to the setting ofour third experiment. The improvement in the fourthexperiment compared to the third experiment (RERvarying between 12.5% for Czech and 50% for En-glish) shows the advantage of our context-sensitiveapproach over the currently used techniques.All the scores reported in Table 2 represent per-formance with raw text as input. It is important tostress that the results are achieved using a generaltagging system trained only a small manually an-notated corpus, with no language specific externalsources of data such as independent morphologicaldictionaries, which have been considered necessaryfor efficient processing of morphologically rich lan-guages.5 Related WorkJurˇsiˇc et al. (2010) propose a general multilinguallemmatisation tool, LemGen, which is tested onthe same corpora that we used in our evaluation.LemGen learns word transformations in the form ofripple-down rules. Disambiguition between multi-ple possible lemmas for a word form is based on thegold-standard morphosyntactic label of the word.Our system outperforms LemGen on all the lan-guages. We measure a Relative Error Reductionvarying between 81% for Serbian and 86% for En-glish. It is worth noting that we do not use manuallyconstructed dictionaries for training, while Jurˇsiˇc etal. (2010) use additional dictionaries for languagesfor which they are available.Chrupała (2006) proposes a system which, likeour system, learns the lemmatisation rules from acorpus, without external dictionaries. The mappingsbetween word forms and lemmas are encoded bymeans of the shortest edit script. The sets of editinstructions are considered as class labels. They arelearnt using a SVM classifier and the word contextfeatures. The most important limitation of this ap-proach is that it cannot deal with both suffixes andprefixes at the same time, which is crucial for effi-cient processing of morphologically rich languages.Our approach enables encoding transformations onboth sides of words. Furthermore, we propose amore straightforward and a more compact way ofencoding the lemmatisation rules.The majority of other methods are concentratedon lemmatising out-of-lexicon words. Toutanovaand Cherry (2009) propose a joint model for as-signing the set of possible lemmas and POS tagsto out-of-lexicon words which is language indepen-dent. The lemmatizer component is a discrimina-tive character transducer that uses a set of within-word features to learn the transformations from in-put data consisting of a lexicon with full morpho-logical paradigms and unlabelled texts. They showthat the joint model outperforms the pipeline modelwhere the POS tag is used as input to the lemmati-sation component.6 ConclusionWe have shown that redefining the task of lemma-tisation as a category tagging task and using an ef-ficient tagger to perform it results in a performancethat is at the state-of-the-art level. The adaptive gen-eral classification model used in our approach makesuse of different sources of information that can befound in a small annotated corpus, with no need forcomprehensive, manually constructed morphologi-cal dictionaries. For this reason, it can be expectedto be easily portable across languages enabling goodquality processing of languages with complex mor-phology and scarce resources.7 AcknowledgementsThe work described in this paper was partiallyfunded by the Swiss National Science Foundationgrants CRSI22 127510 (COMTIS) and 122643.371ReferencesGrzegorz Chrupała. 2006. Simple data-driven context-sensitive lemmatization. In Proceedings of the So-ciedad Espa˜nola para el Procesamiento del LenguajeNatural, volume 37, page 121131, Zaragoza, Spain.Tomaˇz Erjavec and Saˇso Dˇzeroski. 2004. Machine learn-ing of morphosyntactic structure: lemmatizing un-known Slovene words. Applied Artificial Intelligence,18:17–41.Tomaˇz Erjavec. 2010. Multext-east version 4: Multi-lingual morphosyntactic specifications, lexicons andcorpora. In Nicoletta Calzolari (Conference Chair),Khalid Choukri, Bente Maegaard, Joseph Mariani,Jan Odijk, Stelios Piperidis, Mike Rosner, and DanielTapias, editors, Proceedings of the Seventh conferenceon International Language Resources and Evaluation(LREC’10), pages 2544–2547, Valletta, Malta. Euro-pean Language Resources Association (ELRA).Andrea Gesmundo. 2011. Bidirectional sequence clas-sification for tagging tasks with guided learning. InProceedings of TALN 2011, Montpellier, France.Dan Gusfield. 1997. Algorithms on Strings, Trees, andSequences - Computer Science and Computational Bi-ology. Cambridge University Press.Jan Hajiˇc. 2000. Morphological tagging: data vs. dic-tionaries. In Proceedings of the 1st North Americanchapter of the Association for Computational Linguis-tics conference, pages 94–101, Seattle, Washington.Association for Computational Linguistics.Bart Jongejan and Hercules Dalianis. 2009. Automatictraining of lemmatization rules that handle morpholog-ical changes in pre-, in- and suffixes alike. In Proceed-ings of the Joint Conference of the 47th Annual Meet-ing of the ACL and the 4th International Joint Confer-ence on Natural Language Processing of the AFNLP,pages 145–153, Suntec, Singapore, August. Associa-tion for Computational Linguistics.Matjaˇz Jurˇsiˇc, Igor Mozetiˇc, Tomaˇz Erjavec, and NadaLavraˇc. 2010. LemmaGen: Multilingual lemmatisa-tion with induced ripple-down rules. Journal of Uni-versal Computer Science, 16(9):1190–1214.Libin Shen, Giorgio Satta, and Aravind Joshi. 2007.Guided learning for bidirectional sequence classifica-tion. In Proceedings of the 45th Annual Meeting of theAssociation of Computational Linguistics, pages 760–767, Prague, Czech Republic. Association for Compu-tational Linguistics.Kristina Toutanova and Colin Cherry. 2009. A globalmodel for joint lemmatization and part-of-speech pre-diction. In Proceedings of the 47th Annual Meetingof the ACL and the 4th IJCNLP of the AFNLP, page486494, Suntec, Singapore. Association for Computa-tional Linguistics.372 . smaller than the Sloveneand Serbian sets.4 Experimental EvaluationThe advantage of structuring the lemmatisation task as a tagging task is that it allows. (Hungar-ian, Estonian) and fusional (Slavic) languages.2 Lemmatisation as a Tagging TaskLemmatisation is the task of grouping together wordforms that
- Xem thêm -

Xem thêm: Tài liệu Báo cáo khoa học: "Lemmatisation as a Tagging Task" pdf, Tài liệu Báo cáo khoa học: "Lemmatisation as a Tagging Task" pdf, Tài liệu Báo cáo khoa học: "Lemmatisation as a Tagging Task" pdf

Gợi ý tài liệu liên quan cho bạn