DSpace at VNU: Building a treebank for Vietnamese dependency parsing

2013 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)

Building a Treebank for Vietnamese Dependency Parsing

Luong Nguyen Thi
Dalat University, Lamdong, Vietnam
Email: luongnt@dlu.edu.vn

Linh Ha My, Hung Nguyen Viet, Huyen Nguyen Thi Minh, Phuong Le Hong
VNU University of Science, Hanoi, Vietnam
Email: halinh.hus@gmail.com, hungnguyenviet@vnu.edu.vn, huyenntm@vnu.edu.vn, phuonglh@vnu.edu.vn

Abstract: The problem of Vietnamese syntactic parsing, especially constituency parsing, has recently been tackled by several research groups. A common effort of the Vietnamese language processing community has allowed the creation of VietTreebank, a reference parsed corpus containing about 10,000 sentences for the constituency parsing task. In this paper, we present our work to build a reference treebank, based on VietTreebank, for the dependency parsing task, which has not yet been well studied for Vietnamese. First, we define a dependency label set by adapting the dependency schema developed by the NLP group at Stanford University and taking into account the particularities of Vietnamese grammar. Then we propose an algorithm to convert a constituency treebank to a dependency one. The algorithm is tested on a set of 100 sentences of the VietTreebank corpus and gives very good results. Finally, we carry out an experiment on Vietnamese dependency parsing using the MaltParser tool and the dependency treebank converted from VietTreebank.

I. INTRODUCTION

Dependency parsing has attracted considerable interest as an approach to syntactic parsing in recent years. The basic idea of dependency parsing is to find a syntactic structure that consists of lexical items linked by binary asymmetric relations called dependencies. There have been many studies on dependency parsing, and many tools have been developed to solve this problem. In particular, methods based on machine learning give high-accuracy parsing results on English, Chinese, and Swedish. For
Vietnamese, most studies have centered on constituency parsing, such as [1], [2]. The Vietnamese treebank reported in [2] consists of about 10,000 sentences in Penn Treebank format. For dependency parsing, there exist only two works: one by Nguyễn Lê Minh et al. [3], which uses the MST parser on a corpus of 450 sentences, and one by Lê Hồng Phương et al. [4], which uses a lexicalized tree-adjoining grammar parser trained on a subset of the Vietnamese treebank.

In this paper, we report on our work on building a large corpus for Vietnamese dependency parsing. We first develop algorithms for converting constituency structures to dependency structures. We then use the resulting dependency treebank to train MaltParser, a language-independent dependency parser [5], and report the parsing results.

This paper is organized as follows. The next section introduces dependency parsing, giving basic concepts and some existing works. The following section presents the construction of a Vietnamese dependency treebank. Finally, the last section reports experimental results on Vietnamese dependency parsing with MaltParser.

978-1-4799-1350-3/13/$31.00 ©2013 IEEE

II. DEPENDENCY PARSING

A. Definition

The dependency parsing of a sentence consists in determining the binary asymmetric relations, called dependencies, between its lexical elements. A dependency relation between two tokens can be named to clarify the relationship between them. A dependency structure is determined by the relationship between the central token (head) and its dependent token (dependent), denoted by an arrow. By convention, the arrow starts at the head and points to the dependent. In comparison to constituency structure, dependency structure is more appropriate for representing the syntactic structures of free word order languages, such as Czech or Turkish.

In dependency parsing, each syntactic parse of a sentence can be represented by a dependency graph. A dependency graph is a graph where each node is a token of the
sentence. Arcs (edges) of the graph represent dependency relationships between pairs of nodes, and the name of an arc is the dependency label between those nodes. For example, consider the English sentence: "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas". Figure 1 shows its dependency graph, containing 13 nodes corresponding to the 13 words and 12 relationships connecting these words. The relationships present in the sentence include prep(Bills, on) and pobj(on, ports) [6]. By convention, a special node that does not correspond to any token in the sentence is introduced to represent the root of the dependency graph.

Fig. 1. Dependency graph of an English sentence.

Dependency parsing is the problem of constructing the most probable dependency graph for a given input sentence. The input of a dependency parser is a tokenized and part-of-speech tagged sentence. Most studies on dependency parsing employ machine learning techniques. To build a supervised dependency parser for a language, we need a large dependency treebank of that language.

B. Related Works

Recently, dependency parsing has received the attention of many research groups. There have been many studies and tools for dependency parsing: MaltParser, StanfordParser, MSTParser. Most dependency parsing tools achieve high accuracy and are suitable for many languages, such as English, Chinese, German, and Czech. The accuracy of a parser is evaluated using two indices: the unlabeled attachment score (ASU), which is the proportion of tokens with the correct head, and the labeled attachment score (ASL), which is the proportion of tokens with the correct head and the correct dependency type.

1) MSTParser: MSTParser (http://sourceforge.net/projects/mstparser/) is developed by Ryan McDonald et al. [7]. MSTParser has two processes: training and analysis. In training, MSTParser uses on-line algorithms [8]. In analysis, MSTParser uses a graph-based algorithm. The accuracy of MSTParser on a variety of languages is quite high: ASU = 92.8%, ASL = 90.7% for Japanese; ASU = 91.1%, ASL = 85.9% for Chinese; ASU = 90.4%, ASL = 87.3% for German.

2) Stanford Parser: Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) is developed by the NLP group at Stanford University. Stanford Parser defines 53 dependency types for English based on the Penn Treebank [6]. The accuracy of the parser is quite high, in particular for English: ASU = 87.2% and ASL = 84.2%. This parser has been extended to parse languages other than English, such as Chinese, German, French, and Arabic.

3) MaltParser: MaltParser (http://www.maltparser.org/) is developed by Johan Hall et al. MaltParser is among the most effective dependency parsing tools, with high accuracy for more than 20 languages. MaltParser has two processes: training and analysis. In training, MaltParser uses the support vector machine algorithm. In analysis, MaltParser uses a transition-based algorithm. The accuracy of the tool is high, for example ASU = 88.1%, ASL = 86.3% for English and ASU = 88.1%, ASL = 83.4% for German.

For Vietnamese, few works on dependency parsing exist because of the lack of a dependency treebank for training. In [3], MST was used to parse dependency structures in Vietnamese text. Experiments conducted on 450 POS-tagged Vietnamese sentences give an accuracy of ASU = 67.7% and ASL = 63.11%. Each dependency is assigned a label by the automatic scoring algorithm in MST. No concrete label definition is given.

In [4], dependencies were determined from derivation trees produced by TAG parsing. Each word in the sentence is represented by an elementary tree. Derivation trees were constructed from these elementary trees and converted to dependencies by transforming each derivation operation into a labeled dependency relation. There were 13 labels divided into three types: arg (relationship between a head word and its argument), mod (modification relation between a word and its head word), and coord (relationship between the lexical heads of two coordinated phrases within a conjunction).

As we can see, the most important step in developing a dependency parser for Vietnamese is to build a reference dependency treebank. The definition of a dependency label set is essential for this task. In the next section, we present our work on constructing a Vietnamese dependency treebank.

III. BUILDING VIETNAMESE DEPENDENCY TREEBANK

To build a dependency treebank for Vietnamese, we first define a dependency scheme specific to this language. Then we design an algorithm to convert the available Vietnamese constituency treebank [2] to a dependency treebank. The original constituency treebank is a corpus containing about 10,000 sentences in Penn Treebank format. An example sentence is

(S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) ( .))

(in English: "What used to be the land of bombs was no longer the land of the poor"), where
• S, NP, PP are the labels of phrases and clauses;
• Nc, N, R are the labels of tokens;
• SUB, H, DOB are the functional syntactic labels of phrases, clauses, or tokens.

The conversion algorithm has two steps: (1) determining all the dependencies in the sentence and (2) labeling the dependency relations. The first step is solved by determining the central element (head element) of every grammatical phrase and clause using head rules. The second step is done by using a dependency label set and a rule for labeling dependencies.

A. Dependency Schema

Different dependency labels represent different types of relationships between pairs of tokens of a sentence. Typically, the set of dependency labels depends on the particular language. Nevertheless, many languages may share an important subset of dependency labels. The dependency schema developed by the NLP group at Stanford University defines 53 types of English dependency. All of them are binary relations, where each dependency defines a relation between the head and its dependent. We adapt and extend this schema to build a dependency schema for Vietnamese which takes into account the particularities of Vietnamese grammar [9]. This schema consists of 48 labels, all of which are explicitly defined and consistent with Vietnamese syntax. The most common dependency labels are given below:
• vmod: verb modifier, for example vmod(đi, qua) in (VP (V-H đi) (V qua));
• rmod: adverb modifier, for example rmod(Xa xa, nữa) in (AP (A-H Xa xa) (R nữa));
• dobj: direct object of a verbal phrase, for example dobj(còn, người) in (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo)));
• pobj: object of a prepositional phrase, for example pobj(bằng, cùi_tay) in (PP-MNR (E-H bằng) (NP (M hai) (N-H cùi_tay) (A cụt_lủn))).

Figure 2 shows a dependency parse of the sentence "Mảnh đất của đạn bom không còn người nghèo". In this figure, an edge from "Mảnh" to "đất" indicates that "đất" is the modifier of "Mảnh". The label of this edge is the name of the relationship between them.

Fig. 2. An example of dependency parsing in Vietnamese.

All dependency relations of this sentence are:

ncdep(Mảnh-1, đất-2)
prepc(Mảnh-1, của-3)
nsubj(còn-7, Mảnh-1)
pobj(của-3, đạn-4)
nn(đạn-4, bom-5)
neg(còn-7, không-6)
Root(ROOT-0, còn-7)
dobj(còn-7, người-8)
amod(người-8, nghèo-9)
punct(còn-7, .-10)

B. Head Rules

In order to determine the head element of each phrase, we build a head rule table. This table constitutes an important part of our work. Our head rules follow those presented in [10].

Phrase   Dir.   Priority list
S        →      -H;S;VP;AP;NP;.*
SBAR     →      -H;SBAR;S;VP;AP;NP;.*
SQ       →      -H;SQ;VP;AP;NP;.*
NP       →      -H;NP;Nc;Nu;Np;N;P;.*
VP       →      -H;VP;V;A;AP;N;NP;S;.*
AP       →      -H;AP;A;N;S;.*
RP       ←      -H;RP;R;T;NP;.*
PP       →      -H;PP;E;VP;SBAR;AP;QP;.*
QP       →      -H;QP;M;.*
XP       →      -H;XP;X;.*
YP       →      -H;YP;Y;.*
MDP      →      -H;MDP;T;I;A;P;R;X;.*
WHNP     →      -H;WHNP;NP;Nc;Nu;Np;N;P;.*
WHAP     →      -H;WHAP;A;N;V;P;X;.*
WHRP     →      -H;WHRP;P;E;T;X;.*
WHPP     →      -H;WHPP;E;P;X;.*
WHXP     →      -H;XP;X;.*
UCP      →      -H;.*
WHADV    →      -H;R;.*
WHVP     →      -H;V;.*

For example, the rule

VP → -H;VP;V;A;AP;N;NP;S;.*

can be understood as follows: to find the head of a VP phrase, we browse its elements from left to right to find the first element marked as -H; if there is such an element, it is the head of the VP phrase; if not, we look for a VP element to be the head; if no VP is found, we look for V, and so on. If there is no such element, we take the first element from the left as the head (".*").

The following example describes how to find the head of the phrase (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))). First, we look up the head rule for the VP phrase in the list of head rules. The head rule of the VP phrase is: VP → -H;VP;V;A;AP;N;NP;S;.* Second, following this head rule, we browse the phrase from left to right to find the first element marked as -H, which is (V-H còn). This means that the token "còn" is the head of this VP phrase.

C. Conversion Algorithm

The conversion algorithm has two stages. In the first stage, a constituency parse is constructed from the bracket format of each sentence of the treebank. For example, the parsed sentence (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) ( .)) has the constituency parse shown in Figure 3.

Fig. 3. A constituency parse of a sentence in the Vietnamese treebank.

In the second stage, the constituency parse is converted to a dependency parse. This stage has three steps. First, find the head of each phrase in the sentence using the head rule table (see Algorithm 1). Second, find a label for each dependency (head, dependent) (see Algorithm 2). Finally, build all the labeled dependencies using a recursive routine calling the two previous steps (see Algorithm 3).

Algorithm 1 FindHeadP(P, lstHeadRules, lstElements)
Require: P: a phrase; lstElements: list of elements in P; lstHeadRules: list of head rules
Ensure: head of P
  for headRule ∈ lstHeadRules do
    if headRule.Phrase = P then
      hr ← headRule
      break
    end if
  end for
  lstRightHR ← hr.Right
  for rightEle ∈ lstRightHR do
    for element ∈ lstElements do
      if element.Phrase = rightEle or element.Pos = rightEle then
        return element
      end if
    end for
  end for

Algorithm 2 GetDependencyLabel(h, d, lstLabels)
Require: (h, d), where h is a head and d is its dependent; lstLabels: list of labels
Ensure: a dependency label l for the arc h → d
  for labelEle ∈ lstLabels do
    left ← GetInformation(h, labelEle.Left)
    right ← GetInformation(d, labelEle.Right)
    center ← GetCenterInformation(h, d, labelEle.Center)
    if IsLabel(left, right, center) then
      l ← labelEle.Label
      break
    end if
  end for
  return l

Algorithm 3 ConvertToDP(Root, lstHeadRules, lstLabels, dpTree)
Require: Root: root node of the constituency tree; lstHeadRules: list of head rules; lstLabels: list of dependency labels; dpTree: saved dependency tree
Ensure: head of the sentence
  if Root = null then
    return
  end if
  if IsLeaf(Root) then
    lstElements ← Word(Root)
    return FindHeadP(Phrase(Root), lstHeadRules, lstElements)
  end if
  if AllChildIsLeaf(Root) then
    for child ∈ Root do
      append Word(child) to lstElements
    end for
    h ← FindHeadP(Phrase(Root), lstHeadRules, lstElements)
    for child ∈ Root do
      if Word(child) ≠ h then
        label ← GetDependencyLabel(h, child, lstLabels)
        add (h, child, label) to dpTree
      end if
    end for
    return h
  end if
  lstHeadChilds ← null
  for child ∈ Root do
    append ConvertToDP(child, lstHeadRules, lstLabels, dpTree) to lstHeadChilds
  end for
  h ← FindHeadP(Phrase(Root), lstHeadRules, lstHeadChilds)
  for headChild ∈ lstHeadChilds do
    if headChild ≠ h then
      label ← GetDependencyLabel(h, headChild, lstLabels)
      add (h, headChild, label) to dpTree
    end if
  end for
  return h

D. Results

To evaluate the accuracy of the conversion algorithm, we first select a subset of 100 sentences from the Vietnamese treebank and manually annotate them with dependency relations. We then run the conversion algorithm presented above on these sentences to get dependency parses and compare them to the manual annotation. The result is very good: the unlabeled attachment score is 99.6%, and the labeled attachment score is perfect on matched attachments.

As an example, from the constituency parse (S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) (N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) (A nghèo))) ( .)), the automatic conversion algorithm produces the following dependency parse:

No.   Token   POS   Head   Label
1     Mảnh    Nc    7      nsubj
2     đất     N     1      ncdep
3     của     E     1      prepc
4     đạn     N     3      pobj
5     bom     N     4      nn
6     không   R     7      neg
7     còn     V     0      Root
8     người   N     7      dobj
9     nghèo   A     8      amod
10    .             7      punct

Table I shows the percentage of the most common labels assigned to dependencies over the whole Vietnamese treebank of about 10,000 sentences.

TABLE I
PERCENTAGE OF COMMON DEPENDENCY LABELS ON THE VIETNAMESE TREEBANK

No.   Label   %
1     vmod    9.95
2     rmod    6.36
3     nsubj   5.81
4     dobj    5.7
5     pobj    5.6
6     nn      5.55
7     conj    4.67

IV. EXPERIMENTS WITH MALTPARSER

In this section, we present parsing experiments on the Vietnamese dependency treebank constructed in the previous section. We use MaltParser to train and test dependency parsing models on the treebank using cross-validation. Ten data sets are created for training and testing. In each round, 500 sentences are randomly selected as the test set and the rest are used to train MaltParser. The configuration of the parser that we use is as follows:
• Transition system: Arc-Eager
• Parser configuration: Nivre with allowroot=true and allow_reduce=false
• Feature model: NivreEager.xml
• Learner: liblinear
• Oracle: Arc-Eager

The experimental results are described in Table II.

TABLE II
DEPENDENCY PARSING ACCURACY WITH MALTPARSER

No.       Test (500 sentences)   ASU     ASL
1         1-500                  76.43   70.45
2         1001-1500              75.58   68.40
3         2001-2500              72.37   65.12
4         3001-3500              74.16   66.58
5         4001-4500              69.69   63.47
6         5001-5500              74.10   67.42
7         6001-6500              73.49   67.27
8         7001-7500              72.76   65.91
9         8001-8500              69.04   63.16
10        9001-9500              72.82   65.74
Average                          73.03   66.35

The average ASU is 73.03% and the average ASL is 66.35%. In these
experiments, MaltParser was not optimized for Vietnamese; therefore, the accuracy was not high. The accuracy can be improved by fixing some errors in the dependency treebank, such as wrong roots in sentences with many clauses or wrong dependencies for special tokens. The set of guidelines for dependency annotation needs to be defined more clearly to improve the quality of dependency identification.

V. CONCLUSION

There have been several works on constituency parsing for Vietnamese but not many on dependency parsing, as little data exists for training dependency parsers. However, dependency parsing can provide more useful information for natural language processing than constituency parsing. Our work aims to automatically build a Vietnamese dependency treebank from constituency treebanks, which are more widely available. The dependency label set is defined based on Vietnamese grammar in a way that allows us to compare our labels directly with English dependency labels. To this end, the English dependency label set developed by the NLP group at Stanford University is used as a reference.

Once the Vietnamese dependency treebank of about 10,000 sentences had been converted from VietTreebank, we carried out experiments on Vietnamese dependency parsing using MaltParser. The evaluation results give 73.03% for the average ASU and 66.35% for the average ASL. As a first step, these experimental results help to reveal some errors in the reference data. In the next step, we will revise the corpus and carry out experiments with different parsers to find the best methods for Vietnamese dependency parsing.

ACKNOWLEDGMENT

This work is supported by the VNU research grant QG.12.22.

REFERENCES

[1] L. T. Hương, P. H. Quang, and N. T. Thủy, “Một cách tiếp cận việc tự động phân tích cú pháp văn tiếng việt,” Tạp chí tin học Điều khiển học, vol. 15, no. 4, 2000.
[2] P. T. Nguyen, L. V. Xuan, T. M. H. Nguyen, V. H. Nguyen, and P. Le-Hong, “Building a large syntactically-annotated corpus of Vietnamese,” in Proceedings of the
3rd Linguistic Annotation Workshop, ACL-IJCNLP, Singapore, 2009.
[3] N. L. Minh, H. T. Điệp, and T. M. Kế, “Nghiên cứu luật hiệu chỉnh kết dùng phương pháp MST phân tích cú pháp phụ thuộc tiếng việt,” in ICT-rda 8, Hanoi, Vietnam, 2008, pp. 258–267.
[4] P. Le-Hong, T. M. H. Nguyen, and R. Azim, “Vietnamese parsing with an automatically extracted tree-adjoining grammar,” in Proceedings of the IEEE International Conference in Computer Science: Research, Innovation and Vision of the Future, RIVF, HCMC, Vietnam, 2012.
[5] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kubler, S. Marinov, and E. Marsi, “Maltparser: A language-independent system for data-driven dependency parsing,” Natural Language Engineering, vol. 13, no. 2, pp. 95–135, 2007.
[6] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, “Generating typed dependency parses from phrase structure parses,” in Proceedings of LREC 2006, Genoa, Italy, 2006.
[7] R. McDonald, K. Lerman, and F. Pereira, “Multilingual dependency parsing with a two-stage discriminative parser,” in Proceedings of the Tenth Conference on Computational Natural Language Learning, 2006.
[8] R. McDonald, K. Crammer, and F. Pereira, “Online large-margin training of dependency parsers,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.
[9] Q. B. Diệp and V. T. Hoàng, Ngữ pháp Tiếng Việt (Vietnamese Grammar). NXB Giáo dục, Hà Nội, Việt Nam, 1999.
[10] P. Le-Hong, T. M. H. Nguyen, P. T. Nguyen, and A. Roussanaly, “Automated extraction of tree adjoining grammars from a treebank for Vietnamese,” in Proceedings of The Tenth International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+10), Yale University, New Haven, CT, USA, 2010.
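
The head-finding and conversion steps described in Section III can be illustrated with a small Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the names Node, parse, find_head_child, convert, and to_heads are hypothetical, only a subset of the head rules is included, every rule is searched left to right (the right-to-left direction of the RP rule is ignored), dependency labels are not assigned, and the final punctuation token is written as "(. .)" so that it carries a tag.

```python
import re

# Subset of the head-rule table (Section III-B): phrase -> categories by priority.
# The leading "-H" test and the trailing ".*" fallback live in find_head_child.
HEAD_RULES = {
    "S": ["S", "VP", "AP", "NP"],
    "NP": ["NP", "Nc", "Nu", "Np", "N", "P"],
    "VP": ["VP", "V", "A", "AP", "N", "NP", "S"],
    "AP": ["AP", "A", "N", "S"],
    "PP": ["PP", "E", "VP", "SBAR", "AP", "QP"],
}


class Node:
    """Hypothetical tree node: a phrase with children, or a leaf with a word."""

    def __init__(self, label, children=(), word=None):
        parts = label.split("-")
        self.cat = parts[0]              # e.g. "NP" from "NP-SUB"
        self.is_head = "H" in parts[1:]  # functional label -H
        self.children = list(children)
        self.word = word
        self.index = None                # 1-based token position (leaves only)


def parse(text):
    """Read one bracketed tree; leaves are numbered in sentence order."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos, counter = 0, 0

    def read():
        nonlocal pos, counter
        label = tokens[pos + 1]          # tokens[pos] is "("
        pos += 2
        if tokens[pos] != "(":           # leaf: (TAG word)
            node = Node(label, word=tokens[pos])
            counter += 1
            node.index = counter
            pos += 2                     # skip the word and ")"
            return node
        children = []
        while tokens[pos] == "(":
            children.append(read())
        pos += 1                         # skip ")"
        return Node(label, children)

    return read()


def find_head_child(node):
    """Pick the head child: -H first, then the priority list, then leftmost."""
    for kid in node.children:
        if kid.is_head:
            return kid
    for cat in HEAD_RULES.get(node.cat, []):
        for kid in node.children:
            if kid.cat == cat:
                return kid
    return node.children[0]              # the ".*" fallback


def convert(node, heads):
    """Return the lexical head of node; record dependent -> head in heads."""
    if node.word is not None:
        return node
    lexical = [convert(kid, heads) for kid in node.children]
    head = lexical[node.children.index(find_head_child(node))]
    for leaf in lexical:
        if leaf is not head:
            heads[leaf.index] = head.index
    return head


def to_heads(text):
    """Head index of every token (0 marks the root), in sentence order."""
    heads = {}
    root = convert(parse(text), heads)
    heads[root.index] = 0
    return [heads[i] for i in sorted(heads)]


EXAMPLE = ("(S-TTL (NP-SUB (Nc-H Mảnh) (N đất) (PP (E-H của) (NP (N-H đạn) "
           "(N-H bom)))) (VP (R không) (V-H còn) (NP-DOB (N-H người) "
           "(A nghèo))) (. .))")
print(to_heads(EXAMPLE))  # [7, 1, 1, 3, 4, 7, 0, 7, 8, 7]
```

On the example sentence of Section III, the sketch yields the same head for every token as the dependency parse listed in the paper (for instance, "Mảnh" attaches to "còn" and "bom" attaches to "đạn"). Assigning labels such as nsubj or pobj would require the label rules of Algorithm 2, which are not reproduced here.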
