Integrated linguistic to Statistical Machine Translation = Tích hợp thông tin ngôn ngữ vào dịch máy tính thống kê

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

HOAI-THU VUONG

INTEGRATED LINGUISTIC TO STATISTICAL MACHINE TRANSLATION

MASTER THESIS

HANOI - 2012

Contents

1 Introduction
1.1 Overview
1.1.1 A Short Comparison Between English and Vietnamese
1.2 Machine Translation Approaches
1.2.1 Interlingua
1.2.2 Transfer-based Machine Translation
1.2.3 Direct Translation
1.3 The Reordering Problem and Motivations
1.4 Main Contributions of this Thesis
1.5 Thesis Organization

2 Related works
2.1 Phrase-based Translation Models
2.2 Type of orientation phrases
2.2.1 The Distance Based Reordering Model
2.3 The Lexical Reordering Model
2.4 The Preprocessing Approaches
2.5 Translation Evaluation
2.5.1 Automatic Metrics
2.5.2 NIST Scores
2.5.3 Other scores
2.5.4 Human Evaluation Metrics
2.6 Moses Decoder

3 Shallow Processing for SMT
3.1 Our proposal model
3.2 The Shallow Syntax
3.2.1 Definition of the shallow syntax
3.2.2 How to build the shallow syntax
3.3 The Transformation Rule
3.4 Applying the transformation rule into the shallow syntax tree

4 Experiments
4.1 The bilingual corpus
4.2 Implementation and Experiments Setup
4.3 BLEU Score and Discussion

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future work

Appendix A A hand written of the transformation rules
Appendix B Script to train the baseline model
Bibliography

List of Tables

Corpus Statistical
Details of our experimental, AR is named as using automatic rules, MR is named as using handwritten rules
Size of phrase tables
Translation performance for the English-Vietnamese task

List of Figures

The machine translation pyramid
The concept architecture of Moses Decoder
An overview of preprocess before training and decoding
A pair of source and target language
The training process
The decoding process
A shallow syntax tree
The building of the shallow syntax
The building of the shallow syntax
Chapter 1 Introduction

In this chapter, we give a brief overview of Statistical Machine Translation (SMT), state the problem we address, and present the motivations of our work and the main contributions of this thesis. First, we introduce Machine Translation (MT), one of the major applications of Natural Language Processing (NLP), and the statistical approach to it. Then we introduce the main problem of this thesis and our research motivations. The next section describes the main contributions of this thesis. Finally, we outline the content of the thesis.

1.1 Overview

In the field of NLP, MT is a major application that automatically translates a sentence from one language into another. MT is very useful in real life: it helps us browse websites written in foreign languages that we do not understand, or understand the content of an advertising board on the street. However, high-quality MT is still a challenge for researchers.

Firstly, the difficulty comes from the ambiguity of natural language at various levels. At the lexical level, we have problems with word morphology, such as tense, and with word segmentation in languages such as Vietnamese, Japanese, Chinese or Thai, in which no symbol separates two words. For example, in the Vietnamese sentence "học sinh học sinh học.", "học" is a verb meaning "study", "học sinh" is a noun meaning "pupil" or "student", and "sinh học" is a noun meaning the subject "biology". At the syntactic level, we have structural ambiguity, such as prepositional phrase attachment. For example, consider the sentence "the man saw the girl with the telescope". We can understand either that the man used the telescope to see the girl, or that the girl who has the telescope is seen by the man. Ambiguity is even harder to resolve at the semantic level.

Secondly, Jurafsky and Martin (2009) show that a pair of languages differs in several respects, such as structure and lexicon, which makes MT challenging. In particular, the difference we focus on in this thesis is the order of words in each language. For example, English is a Subject-Verb-Object (SVO) language: the subject comes first, the verb follows the subject, and the object ends the sentence. In the sentence "I go to school", "I" is the subject, "go to" is the verb and "school" is the object. In contrast, Japanese is an SOV language and Classical Arabic is a VSO language.

In the past, rule-based methods were preferred. MT systems were built from rules written manually by humans, so in a closed domain or restricted area the quality of a rule-based system is very high. However, with the growth of the internet and social networks, we need MT systems with broad coverage, for which the rule-based method is not suitable. A new approach was therefore needed, and statistics was applied to the field of MT. At the same time, statistical methods were being applied in many other studies, such as automatic speech recognition, which motivated the idea of using statistics for MT. Nowadays, some MT systems that use statistical methods, such as Google¹, can be compared with human translation.

1.1.1 A Short Comparison Between English and Vietnamese

English and Vietnamese share some similarities: both use a Latin-based alphabet and both are SVO languages. For example:

en: I go to school
vn: Tôi đi học

But the order of words in an English noun phrase differs from that in a Vietnamese one. For example:

en: a black hat
vn: mũ màu_đen

In the English example, hat is the head of the noun phrase and stands at the end of the phrase. In Vietnamese, mũ is also the head noun, but it comes before its modifier màu_đen rather than at the end of the phrase. Word reordering can also be seen in wh-questions:

en: what is your job?
vn: công_việc của anh là gì ?
In the wh-question example, the word what corresponds to gì in Vietnamese. The different positions of these two words are easy to see: what stands at the beginning of the English question, whereas gì stands at the end of the Vietnamese one. This is because English follows the S-structure and Vietnamese follows the D-structure.

¹ http://translate.google.com

1.2 Machine Translation Approaches

In this section, we give a short survey of approaches in the field of machine translation. We begin with the most complex method (interlingua) and end with the simplest one (direct translation). From a source sentence, we use analysis methods to obtain more complex structures, and then generate the corresponding structures or sentences in the target language. The most complex structure is the interlingua (Figure 1).

Figure 1: The machine translation pyramid

1.2.1 Interlingua

The interlingua systems (Farwell and Wilks, 1991; Mitamura, 1999) are based on the idea of finding a language, called the interlingua, that can represent the source language and from which it is easy to generate sentences in other languages. Figure 1 shows the process of this approach. Analysis is the understanding process: in this step, we use NLP techniques to map the source sentence to a data structure in the interlingua, and the generation process then retrieves the target sentence. The problem is how complex the interlingua should be. If the interlingua is simple, we can get many translation options. Conversely, the more complex the interlingua is, the more effort analysis and generation cost.

1.2.2 Transfer-based Machine Translation

Another approach is to analyze the source sentence into a complex structure (simpler than the interlingua structure), then use transfer rules to obtain a similar structure in the target language, and finally generate the target sentence. In this model, MT involves three phases: analysis, transfer and generation. Normally, we use all three phases. However, we sometimes use only two of the three, for instance transferring from the source sentence directly to a structure in the target language and then generating the target sentence. For example, the following simple transfer rule, taken from Jurafsky and Martin (2009), maps a source structure to a target structure:

[Nominal → Adj Noun] (source language) ⇒ [Nominal → Noun Adj] (target language)

1.2.3 Direct Translation

1.2.3.1 Example-based Machine Translation

Example-based machine translation was first introduced by Nagao (1984), who used a bilingual corpus of parallel texts as the main knowledge base at run time. The idea behind it is to find matching patterns in the bilingual corpus and combine them with the parallel text to generate a new target sentence. This method resembles the process in the human brain. The problems of example-based machine translation come from the matching criteria, the length of the fragments, etc.

1.2.3.2 Statistical Machine Translation

Extending the idea of using statistics for speech recognition, Brown et al. (1990, 1993) introduced a statistical method for MT, a version of the noisy channel model. When the noisy channel is applied to machine translation, the target sentence is assumed to be transformed into the source sentence by the noisy channel. We can represent the MT problem as three tasks of the noisy channel:

• forward task: compute the fluency of the target sentence
• learning task: from a parallel corpus, find the conditional probability between the target sentence and the source sentence
• decoding task: find the best target sentence for a source sentence

The decoding task can be expressed by the formula:

    \hat{e} = \arg\max_{e} \Pr(e \mid f)

Applying Bayes' rule, we have:

    \hat{e} = \arg\max_{e} \frac{\Pr(f \mid e) \, \Pr(e)}{\Pr(f)}

Because the denominator Pr(f) is the same for every candidate e, we have:

    \hat{e} = \arg\max_{e} \Pr(f \mid e) \, \Pr(e)

Jurafsky and Martin (2000, 2009) define Pr(e) as the fluency of the target sentence, known as the language model. It is usually modeled by an n-gram model, that is, an (n-1)-th order Markov model. Pr(f | e) is defined as the faithfulness between the source and target language. We use an alignment model to compute this value based on the translation unit of the SMT system. Depending on the definition of the translation unit, there are several approaches:

• word based: using a word as the translation unit (Brown et al., 1993)
• phrase based: using a phrase as the translation unit (Koehn et al., 2003)
• syntax based: using a syntactic structure as the translation unit (Yamada and Knight, 2001)

1.3 The Reordering Problem and Motivations

In the field of MT, the reordering problem is the task of reordering the words in the target language to obtain the best target sentence. The reordering model is sometimes called the distortion model.

Phrase-based Statistical Machine Translation (PBSMT), introduced by Koehn et al. (2003) and Och and Ney (2004), is currently the state-of-the-art model in word choice and local word reordering. Its translation unit is a sequence of words without linguistic information. Therefore, in this thesis we integrate linguistic information such as chunking, a shallow syntax tree and transformation rules, with the specific aim of solving the global reordering problem.

There are several studies on integrating syntactic resources within SMT. Chiang (2005) shows significant improvement by keeping the strengths of phrases while incorporating syntax into SMT. He built a kind of syntax tree based on a synchronous Context Free Grammar (CFG), known as hierarchical phrases, used a log-linear model to determine the weights of the extracted rules, and developed a variant of the CYK algorithm to implement decoding. The reordering of phrases is thus defined by the synchronous CFG.

Some approaches have been applied at the word level (Collins et al., 2005). They are particularly useful for languages with rich morphology, for reducing data sparseness. Other kinds of syntactic reordering methods require parse trees, such as the work in Quirk et al. (2005), Collins et al. (2005) and Huang and Mi (2010). A parse tree is more powerful in capturing the sentence structure. However, tree structures are expensive to create, and building a good-quality parser is also a hard task. All the above approaches require much decoding time, which is expensive.

The approach we are interested in here balances translation quality with decoding time. Reordering as a preprocessing step (Xia and McCord, 2004; Xu et al., 2009; Talbot et al., 2011; Katz-Brown et al., 2011) is very effective: it yields significant improvements over state-of-the-art phrase-based and hierarchical machine translation systems, and the quality of the reordering model can be evaluated separately.

1.4 Main Contributions of this Thesis

Inspired by this preprocessing approach, we propose a combined approach that preserves the strength of phrase-based SMT in local reordering and decoding time as well as the strength of integrating syntax in reordering. To this end, we use an intermediate level of syntax between Part-of-Speech (POS) tags and the parse tree: shallow parsing. Firstly, we use shallow parsing as a preprocessing step for both training and testing. Secondly, we apply a series of transformation rules to the shallow tree. We obtained two sets of transformation rules: the first set is written by hand, and the other is extracted automatically from the bilingual corpus. The experimental results on the English-Vietnamese pair show that our approach achieves significant improvements over MOSES, a state-of-the-art phrase-based system.

...

• o_i = M if a_i - a_{i-1} = 1
• o_i = S if a_i - a_{i-1} = -1
• o_i = D if a_i - a_{i-1} \neq \pm 1

At decoding time, they define three feature functions such as:

• f_m = \sum_{i=1}^{n} \log p(o_i = M | ...)
• f_s = \sum_{i=1}^{n} \log p(o_i = S | ...

...
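
The orientation definitions quoted above (o_i = M, S or D, depending on the difference a_i - a_{i-1}) can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the excerpt: M, S and D are read in the usual way as monotone, swap and discontinuous, the start position a_0 is taken to be 0, and the function names and the example alignment are purely hypothetical.

```python
from typing import List


def orientation(prev_pos: int, curr_pos: int) -> str:
    """Classify one step from the difference a_i - a_{i-1}."""
    step = curr_pos - prev_pos
    if step == 1:
        return "M"  # monotone: the next source position follows directly
    if step == -1:
        return "S"  # swap: the next source position directly precedes
    return "D"      # discontinuous: any other jump


def orientations(positions: List[int], start: int = 0) -> List[str]:
    """Label each a_i in `positions`, using `start` as a_0 (an assumption)."""
    labels = []
    prev = start
    for pos in positions:
        labels.append(orientation(prev, pos))
        prev = pos
    return labels


# Hypothetical alignment: the i-th target phrase is aligned to source phrase a_i.
example_alignment = [1, 2, 4, 3, 6]
print(orientations(example_alignment))  # ['M', 'M', 'D', 'S', 'D']
```

The sketch only labels orientations. The feature functions f_m and f_s would additionally need the probabilities p(o_i | ...), whose conditioning is elided in the excerpt, so they are left out here.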

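
Returning to the noisy-channel decision rule of Section 1.2.3.2, the following toy Python sketch picks the candidate e that maximizes Pr(f | e) * Pr(e). It is only an illustration under stated assumptions: the candidate list, the two probability tables and every number in them are made up, and a real decoder searches a very large hypothesis space instead of scoring a fixed list.

```python
import math
from typing import Dict, Iterable, Tuple


def decode(
    source: str,
    candidates: Iterable[str],
    translation_model: Dict[Tuple[str, str], float],
    language_model: Dict[str, float],
) -> str:
    """Return the candidate e that maximizes Pr(f | e) * Pr(e) for source f."""

    def log_score(target: str) -> float:
        # Adding log-probabilities is equivalent to multiplying probabilities.
        # Pr(f) is omitted because it is identical for every candidate.
        return (math.log(translation_model[(source, target)])
                + math.log(language_model[target]))

    return max(candidates, key=log_score)


# Hypothetical toy values, chosen only to show the effect of the language model.
f = "tôi đi học"
candidates = ["I go to school", "I school go to"]
language_model = {"I go to school": 0.6, "I school go to": 0.05}      # Pr(e)
translation_model = {(f, "I go to school"): 0.3,
                     (f, "I school go to"): 0.4}                      # Pr(f | e)

best = decode(f, candidates, translation_model, language_model)
print(best)  # I go to school
```

In this toy setting the disfluent candidate has the higher Pr(f | e), but the language model term Pr(e) penalizes it (0.4 * 0.05 = 0.02 against 0.3 * 0.6 = 0.18), which is the division of labor between faithfulness and fluency described in the text.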