Báo cáo khoa học: "Multilingual adaptations of ANNIE, a reusable information extraction tool" doc

4 216 0
Báo cáo khoa học: "Multilingual adaptations of ANNIE, a reusable information extraction tool" doc

Đang tải... (xem toàn văn)

Thông tin tài liệu

Gate Unicade Editar - Unicade Sampler-tat File Edit Options Help H IL IC[411 'Anal UrocodeirIS  9 ■ Arabic: 13.0 3 zp- J .11 JSI J S JI9 U. Armenian: tipEnni Duval& lau.tht 11.6.241 laUlnalagbula Chinese: 1 - kkl' ,3 tT• . -ftals , c. Farsi I Persian: j3 Georgian: 8063b 3368  66).5. 863033. Hebrew: - .5 pnn s5 no royipt 5m5 513- . 'PC Hindi: 4 isbi T 1tsd I  31 t 4tsi d4f Japanese: f1 , 14/37 tIt's:641 - 1 - 0 -t-tiafokt*-)(-mth „ Korean: 1-1 - L -W-21ff  VoiR. Marathi: 4t ti -1 /9113) 9I - Wa, 1: MT TA - d - ff41 - . Pashto: .41 4 9  aseP Sanskrit:  vITH I Thai: iuAttn9t. - anlei Urdu: - bcej31 Las" Yiddish: UEP1 - rn Lou DY px w5a Inv VP TS. Multilingual adaptations of ANNIE, a reusable information extraction tool Diana Maynard, Hamish Cunningham Dept of Computer Science University of Sheffield Sheffield, Si 4DP, UK diana@dcs.shef.ac.uk Abstract In this demo we will present GATE, an archi- tecture and framework for language engineer- ing, and ANNIE, an information extraction sys- tem developed within it. We will demonstrate how ANNIE has been adapted to perform NE recognition in different languages, including In- dic and Slavonic languages as well as Western European ones, and how the resources can be reused for new applications and languages. 1 Introduction GATE' is an architecture, development envi- ronment, and framework for building systems that process human language (Cunningham et al., 2002; Maynard et al., 2002). It has been in development at the University of Sheffield since 1995, and has been used for many R&D projects, including Information Extraction in multiple languages and media, and for multiple tasks and clients. GATE is available freely, as an open source system, under the GNU library licence, and has been downloaded by around 2500 sites worldwide. The core architecture and some applications developed within GATE have been previously demonstrated (Cunningham et al., 2002); however, this demonstration will fo- cus on the multilingual aspects of GATE, and adaptations of its IE system for different lan- guages. Version 2 of GATE has a large number of added features from the previous version, such as: • comprehensive multilingual support via Unicode 'This work has been supported by the Engineering and Physical Sciences Research Council (EPSRC) un- der grants GR/K25267 and GR/M31699, and by several smaller grants. • tools for performance evaluation • support for manual annotation • reusable visualisation components • database support (Oracle, PostgresQL) • support for distributed resources from the Web • comprehensive document format support (SGML, XML, HTML, RDF, email, plain text) Figure 1: Unicode text in Gate2 2 Processing Resources GATE provides a baseline set of reusable and extendable language processing components for common NLP tasks, known collectively as AN- NIE (A Nearly New Information Extraction System). These include a Unicode tokeniser, sentence splitter, POS tagger, gazetteer, seman- tic tagger, name coreferencer (orthomatcher) 219 IMMFis AnnotatiooidA Din surnarui Nr.336/1111110.2002 • Gnidul calatoriilorfa4a vita in ball Bilandul activitadii poobor buzodani in anul 111.(partea b•Polidistii au posibOtatea legislativa de aid indeplini mai blue sarcinileii misnimle cc le revin• A incep-ut cnn nou an de lupta cu infractorii: Printre primii la calatorli fara vita, 5eribedn Nadine din  a font prinsa pa aerop-ort incercand nO fuga de datorii neonorate de circa Un het din biserici font intuit tocmai de pazni-cul care l-a prins/S-a spart biblioleca Casei de culkira dar hodii RIJ s-au afire de nici o carte/Explozie la sonda 304 Grajdana •Jurnal rutier - Ultima zi din an sub semnul imprudendei Somne papilare" de Col Nicotae Rutaru • "kitrnet pas cu pas" de Adrian firlacdru4-Tanasescu•invataturi din arhivele brancovenes.ti• MO -AMA PERFOPMANTEI "Este momentui sa ne redescoperm pasiunea  p. _1We 5e4 s _BLOCK Date Romanian 20 34 kind= date, rule= DateOnlyFinal) Location Romanian 73 79 Me= LocFinal} Date ilkoutan I an 132 136 {kind= date, rure-DateOnlyFinal) Person Romanian 334 349 {rule= PersonFinal} Location Romanian 354 359 {rule= LocFinal} Money Romanian•434 _ 451 {kincl=number, rule=MoneyCurrency Person Romanian  707 726 {rule=PersonFinal} Person Romanian 754 771{ruie.Per$onFl} 'MOW and pronominal coreferencer. For more details, see (Cunningham et al., 2002). ANNIE cur- rently produces precision and recall figures for named entity recognition of around 90%, de- pending on the text type. An online demo of ANNIE is available at http://gate.ac.uk/a,nnie/index.jsp . A set of movies demonstrating document and corpus loading, processing and storing, manual an- notation of documents and corpora, creat- ing, running, saving and restoring applica- tions and viewing their results is available at http: / /www.gat e. ac. uk/ demos / movies .ht ml. 3 Multilingual support - the GATE Unicode Kit GATE is one of the few architectures to sup- port multilingual processing, using Unicode as its default text encoding. A Unicode enabled graphical user interface (GUI) needs to address two main issues: the capability to display text and the ability to enter text in other languages than the default one. It also provides a means of entering text in a variety of languages and scripts, using virtual keyboards where the language is not supported by the underlying operating platform (Java it- self does not support input in many languages covered by Unicode, although it supports Uni- code representation). Figure 1 depicts text in various scripts displayed in GATE. The facili- ties have been developed as part of the EMILLE project (Baker et al., 2002), designed to con- struct a 63 million word corpus of South Asian languages. There are currently 28 languages supported in GATE, and more are planned for the future. Since GATE is an open architecture, new virtual keyboards can easily be defined by users and added to the system. Apart from the input methods, GUK also provides a simple Unicode-aware text editor which is important because not all platforms provide one by default or the users may not know which one of the already installed edi- tors is Unicode-aware. Besides providing text visualisation and editing facilities, the GUK ed- itor also performs encoding conversion opera- tions. The editor has proved a useful tool dur- ing the development and testing of GATE in a cross-platform environment, while the ability to handle Unicode enables applications developed within GATE to be easily ported to new lan- guages. 4 The future isn't English Robust tools for multilingual information ex- traction are becoming increasingly sought af- ter now that we have capabilities for processing texts in different languages and scripts. While the default IE system is English-specific, some of the modules can be reused directly (e.g. the Unicode-based tokeniser can handle Indo- European languages), and/or easily customised for new languages (Pastra et al., 2002). So far, ANNIE has been adapted to do IE in Bulgarian, Romanian, Bengali, Greek, Spanish, Swedish, German, Italian, and French, and we are cur- rently porting it to Arabic, Chinese and Rus- sian, as part of the MUSE project 2 . Figure 3: Romanian news text annotated in GATE 4.1 NE in Slavonic languages The Bulgarian NE recogniser (Paskaleva et al., 2(02) was built using three main processing re- sources: a tokeniser, a gazetteer and a semantic grammar built using JAPE. There was no POS tagger available in Bulgarian, and consequently we had no need of a sentence splitter either. The main changes to the system were in terms 2http://www.des.shefac.uk/research/groups/n1p/muse/ 220 al Gate 17-1111 Applications - ej , BG system CI.  Language Resources = A Text EiG Processing Resources BG Names t" . ■ Unicode Tokeniser =  BG Gazetteer g Data stores Messages I di BO system  Text BO I Emol - apgslee aerompe 8 METO4HEI EBFIOna AHM,I9 e H8 crepes reproahuln m chpaagnn ca e Sanagaa Espana. tR - I'M K IHNHI . X8M 30 Mavr 2000  — D - r M Cunningham Dr. M Cunningham D-rAHrenoaa Type • Set Stag End Features Lookup ,Default, 71 78 trninorType=country, majorType=location} Lookup ,Default> 60 68 trninorType=country, majorType=location} Lookup ,Default> 37 43 trninorType=country, majorType=location} Lookup ,Default> 0 8 trninorType=country, majorType=location} Person ,Default> 210 216 tkInd=personName, rule=KallnaPersonl Person ,Default> 173 179 tkInd=personName, rule=HamishPerson} Person ,Default> 148 154 tkInd=personName, rule=HamishPerson} Person ,Default> 108 113 tkind=personName, rule.HamishPersonBet Person ,Default> 266 270 tkind=personName, rule=GaliaPersont Person ,Default> 248 253 tkind=personName, rule=GaliaPersont Person ,Default> 230 236 tkind=personName, rule=KalinaPersonl E <Default> 1—T7 Lookup F SpaceToken DA Gate 2.0alphal File Edit Tools Help Figure 2: Bulgarian named entities in GATE of the gazetteer lists (e.g. lists of first names, days of the week, locations etc. were tailored for Bulgarian), and in terms of some of the pattern matching rules in the grammar. For example, Bulgarian makes far more use of morphology than English does, e.g. 91% of Bulgarian sur- names could be directly recognised using mor- phological information. The lack of a POS tag- ger meant that many rules had to be specified in terms of orthographic features rather than parts of speech. Figure 2 shows some Bulgarian text annotated in GATE. Since the structure of the Bulgarian and Russian languages is quite similar, we antic- ipate that converting the Bulgarian system to Russian will be fairly straightforward, and will involve mostly replacing and/or updating gazetteer lists at least to obtain comparable results. 4.2 NE in Romanian The Romanian NE recogniser (Hamza et al. ; 2002) was developed from ANNIE in a simi- lar way to the Bulgarian one, using a tokeniser, gazetteer and a JAPE semantic grammar. Fig- ure 3 shows some Romanian text annotated in GATE. Romanian is a more flexible language than English in terms of word order; it is also agglu- tinative e.g. definite articles attach to nouns, making a definite and indefinite form of both common and proper nouns. As with Bulgarian, the tokeniser did not need to be modified, while the gazetteer lists and grammar rules needed some changes ; most of which were fairly minor. For both Bulgar- ian and Romanian, the modifications necessary were easily implemented by a native speaker who did not require any other specialist skills beyond a basic grasp of the JAPE language and the GATE architecture. No Java skills or other programming knowledge was necessary. The Gate Unicode kit was invaluable in enabling the preservation of the diacritics in Romanian, by saving them with UTF-8 encoding. 4.3 NE in other languages ANNIE has also been adapted to perform NE recognition on English, French and German di- alogues in the AMITIES project 3 , a screenshot of which is shown in Figure 4. Since French and German are more similar to English in many ways than e.g. Slavonic languages, it was very easy to adapt the gazetteers and grammars ac- cordingly. 3 http://www.dcs.shcf.ac.uk/nlp/amitics 221 M ArruChat  7EIEF File AMITIES: Amities, how can I help, je vous ecoute, was lcann ich ftir Sie tun? USER: Guten Tag, Ich mochte gem wissen, movie! ich out metnem Konto hate AMITIES: Darf ich bitte Ihre Kontonummer haben? USER: Mon numero est 5522333344945555 AMITIES: Pouvez-vous me donner votre adresse et votre date de naissance, s'il vous plait? USER: I live at 547 Ratchet Lane, and my date of birth is the 18th of May, 1969 AMITIES: Thank you. Here is the information about your account. Figure 4: AMITIES multilingual dialogue 4.4 Surprise languages We are currently investigating methods of adapting ANNIE to new languages with the minimum of resources and time. Our previous experiments with languages other than English have demonstrated that we can get reasonable results in around 2 person months using a na- tive speaker and hand-coded semantic tagging rules, without requiring resources such as dictio- naries or POS taggers for that language. We are also planning participation in the TIDES-based "surprise language experiment", which requires various NLP tasks such as IE, IR, summarisa- tion and MT to be carried out in a month on a surprise language, the nature of which will not be known in advance. The open and flex- ible architecture of GATE, and the separation of linguistic data from processing makes it an ideal environment within which to perform such a task. Any available linguistic resources such as dictionaries and POS taggers can be simply plugged into the model, but if these are not available we can simply modify other compo- nents as necessary. 5 Conclusion In this demo we have shown how an existing set of IE tools has been modified to a diverse set of languages with minimum overhead. The ad- vantage of having such low-overhead portability is that it enables quick deployment of IE tools with acceptable performance, which, even if not developed in end-used applications ; can be used to bootstrap the creation of IE-annotated cor- pora and/or facilitate the training of learning tools for adaptive IE. In addition, some adap- tive IE tools are now using the ANNIE corn- ponents to provide them with richer linguistic information (Ciravegna et al., 2002). References P. Baker, A. Hardie, T. McEnery, H. Cunning- ham, and R. Gaizauskas. 2002. EMILLE, A 67-Million Word Corpus of Indic Lan- guages: Data Collection, Mark-up and Har- monisation. In Proceedings of 3rd Lan- guage Resources and Evaluation Conference (LREC'2002), pages 819-825. F. Ciravegna, A. Dingli, D. Petrelli, and Y. Wilks. 2002. User-System Cooperation in Document Annotation Based on Information Extraction. In 13th International Confer- ence on Knowledge Engineering and Knowl- edge Management (EKAW02), pages 122- 137, Siguenza, Spain. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Pro- ceedings of the 40th Anniversary Meeting of the Association for Computational Linguis- tics. 0. Hamza, D. Maynard V.Tablan, C. Ursu, H. Cunningham, and Y. Wilks. 2002. Named Entity Recognition in Romanian. Techni- cal report, Department of Computer Science, University of Sheffield. D. Maynard, V. Tablan, H. Cunningham, C. Ursu, H. Saggion ; K. Bontcheva, and Y. Wilks. 2002. Architectural elements of language engineering robustness. Journal of Natural Language Engineering - Special Is- sue on Robust Methods in Analysis of Natural Language Data, 8(2/3):257-274. E. Paskaleva, G. Angelova, M.Yankova, K. Bontcheva, H. Cunningham, and Y. Wilks. 2002. Slavonic named enti- ties in gate. Technical Report CS-02-01, University of Sheffield. Katerina Pastra, Diana Maynard, Hamish Cun- ningham, Oana Hamza, and Yorick Wilks. 2002. How feasible is the reuse of grammars for named entity recognition? In Proceed- ings of 3rd Language Resources and Evalu- ation Conference. 222 . - bcej31 Las" Yiddish: UEP1 - rn Lou DY px w 5a Inv VP TS. Multilingual adaptations of ANNIE, a reusable information extraction tool Diana Maynard, Hamish. calatoriilorfa 4a vita in ball Bilandul activitadii poobor buzodani in anul 111.(partea b•Polidistii au posibOtatea legislativa de aid indeplini mai

Ngày đăng: 24/03/2014, 03:20

Từ khóa liên quan

Mục lục

  • Page 1

  • Page 2

  • Page 3

  • Page 4

Tài liệu cùng người dùng

  • Đang cập nhật ...

Tài liệu liên quan