Bilingual sentence alignment based on sentence length and word translation

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

HAI-LONG TRIEU

BILINGUAL SENTENCE ALIGNMENT BASED ON SENTENCE LENGTH AND WORD TRANSLATION

MASTER THESIS OF INFORMATION TECHNOLOGY

Major: Computer science
Code: 60 48 01
SUPERVISOR: PhD. Phuong-Thai Nguyen

Hanoi - 2014

ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."

Signed

Acknowledgements

I would like to thank my advisor, PhD Phuong-Thai Nguyen, not only for his supervision but also for his enthusiastic encouragement, sound suggestions, and the knowledge he has given me during my Master's course. I would also like to express my deep gratitude to M.A. Phuong-Thao Thi Nguyen from the Institute of Information Technology, Vietnam Academy of Science and Technology, who provided valuable data for my evaluation. I would like to thank PhD Van-Vinh Nguyen for examining my work and giving advice on it, and M.A. Kim-Anh Nguyen and M.A. Truong Van Nguyen for their help and comments on my work, especially M.A. Kim-Anh Nguyen for supporting me and checking some issues in my research.
In addition, I would like to express my thanks to the lecturers and professors of the Faculty of Information Technology, University of Engineering and Technology (UET), Vietnam National University, Hanoi, who taught and helped me throughout my time at UET. Finally, I would like to thank my family and friends for their support, sharing, and confidence throughout my study.

Abstract

Sentence alignment plays an important role in machine translation. It is an essential task in processing parallel corpora, which are ample and substantial resources for natural language processing. Before these abundant materials can be used in applications, parallel corpora first have to be aligned at the sentence level. This process maps sentences in the source-language text to their corresponding units in the target-language text. Parallel corpora aligned at the sentence level are a useful resource for a number of applications in natural language processing, including statistical machine translation, word disambiguation, and cross-language information retrieval. This task also helps to extract structural information and derive statistical parameters from bilingual corpora.

A number of algorithms with different approaches have been proposed for sentence alignment. They can be classified into a few major categories. First, there are methods based on the similarity of sentence lengths, measured in words or characters. These methods are simple but effective for language pairs whose sentence lengths are highly correlated. The second set of methods is based on word correspondences or a lexicon. These methods take into account lexical information about the texts, either by matching content in the texts or by using cognates. They may use an external dictionary, so they are more accurate but slower than the first group.
There are also hybrid methods that combine the advantages of these first two approaches and therefore obtain high-quality alignments. In this thesis, I summarize general issues related to sentence alignment, evaluate the approaches proposed for this task, and focus on the hybrid method, especially the proposal of Moore (2002), an effective method with high performance in terms of precision. Based on an analysis of the limits of this method, I propose an algorithm that uses a new feature, bilingual word clustering, to improve the quality of Moore's method. The baseline method (Moore, 2002) is introduced through an analysis of its framework, and I describe the advantages as well as the weaknesses of this approach. In addition, I describe the background knowledge and algorithm of bilingual word clustering, and the new feature used in sentence alignment. Finally, I present the experiments performed in this research and the evaluations that demonstrate the benefits of the proposed method.

Keywords: sentence alignment, parallel corpora, natural language processing, word clustering.

Table of Contents

ORIGINALITY STATEMENT
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
CHAPTER ONE Introduction
1.1. Background
1.2. Parallel Corpora
1.2.1. Definitions
1.2.2. Applications
1.2.3. Aligned Parallel Corpora
1.3. Sentence Alignment
1.3.1. Definition
1.3.2. Types of Alignments
1.3.3. Applications
1.3.4. Challenges
1.3.5. Algorithms
1.4. Thesis Contents
1.4.1. Objectives of the Thesis
1.4.2. Contributions
1.4.3. Outline
1.5. Summary
CHAPTER TWO Related Works
2.1. Overview
2.2. Overview of Approaches
2.2.1. Classification
2.2.2. Length-based Methods
2.2.3. Word Correspondences Methods
2.2.4. Hybrid Methods
2.3. Some Important Problems
2.3.1. Noise of Texts
2.3.2. Linguistic Distances
2.3.3. Searching
2.3.4. Resources
2.4. Length-based Proposals
2.4.1. Brown et al., 1991
2.4.2. Vanilla: Gale and Church, 1993
2.4.3. Wu, 1994
2.5. Word-based Proposals
2.5.1. Kay and Roscheisen, 1993
2.5.2. Chen, 1993
2.5.3. Melamed, 1996
2.5.4. Champollion: Ma, 2006
2.6. Hybrid Proposals
2.6.1. Microsoft's Bilingual Sentence Aligner: Moore, 2002
2.6.2. Hunalign: Varga et al., 2005
2.6.3. Deng et al., 2007
2.6.4. Gargantua: Braune and Fraser, 2010
2.6.5. Fast-Champollion: Li et al., 2010
2.7. Other Proposals
2.7.1. Bleu-align: Sennrich and Volk, 2010
2.7.2. MSVM and HMM: Fattah, 2012
2.8. Summary
CHAPTER THREE Our Approach
3.1. Overview
3.2. Moore's Approach
3.2.1. Description
3.2.2. The Algorithm
3.3. Evaluation of Moore's Approach
3.4. Our Approach
3.4.1. Framework
3.4.2. Word Clustering
3.4.3. Proposed Algorithm
3.4.4. An Example
3.5. Summary
CHAPTER FOUR Experiments
4.1. Overview
4.2. Data
4.2.1. Bilingual Corpora
4.2.2. Word Clustering Data
4.3. Metrics
4.4. Discussion of Results
4.5. Summary
CHAPTER FIVE Conclusion and Future Work
5.1. Overview
5.2. Summary
5.3. Contributions
5.4. Future Work
5.4.1. Better Word Translation Models
5.4.2. Word-Phrase
Bibliography

List of Figures

Figure 1.1. A sequence of beads (Brown et al., 1991).
Figure 2.1. Paragraph length (Gale and Church, 1993).
Figure 2.2. Equation in dynamic programming (Gale and Church, 1993).
Figure 2.3. A bitext space in Melamed's method (Melamed, 1996).
Figure 2.4. The method of Varga et al., 2005.
Figure 2.5. The method of Braune and Fraser, 2010.
Figure 2.6. Sentence Alignment Approaches Review.
Figure 3.1. Framework of sentence alignment in our algorithm.
Figure 3.2. An example of Brown's cluster algorithm.
Figure 3.3. English word clustering data.
Figure 3.4. Vietnamese word clustering data.
Figure 3.5. Bilingual dictionary.
Figure 3.6. Looking up the probability of a word pair.
Figure 3.7. Looking up in a word cluster.
Figure 3.8. Handling in the case: one word is contained in dictionary.
Figure 4.1. Comparison in Precision.
Figure 4.2. Comparison in Recall.
Figure 4.3. Comparison in F-measure.

List of Tables

Table 1.1. Frequency of alignments (Gale and Church, 1993).
Table 1.2. Frequency of beads (Ma, 2006).
Table 1.3. Frequency of beads (Moore, 2002).
Table 1.4. An entry in a probabilistic dictionary (Gale and Church, 1993).
Table 2.1. Alignment pairs (Sennrich and Volk, 2010).
Table 4.1. Training data-1.
Table 4.2. Topics in Training data-1.
Table 4.3. Training data-2.
Table 4.4. Topics in Training data-2.
Table 4.5. Input data for training clusters.
Table 4.6. Topics for Vietnamese input data to train clusters.
Table 4.7. Word clustering data sets.

[...]

... statistical machine translation. Meanwhile, Melamed (1996) proposes a different form of word correspondence, geometric correspondence, for sentence alignment. These algorithms are more accurate than length-based methods because they use lexical information from source and translation lexicons, rather than sentence length alone, to determine the translation relationship between sentences in ...

... performance of length-based algorithms declines quickly.

2.2.3 Word Correspondences Methods

The second approach, which tries to overcome the disadvantages of length-based approaches, is the word-based method. It relies on lexical information from translation lexicons and/or on the recognition of cognates. These methods take into account the lexical information about texts. Most algorithms match content ...
correspondence and a combination of these first two methods. There are also some other techniques, such as methods based on BLEU score, support vector machine, and hidden Markov model classifiers.

2.2.2 Length-based Methods

Length-based approaches model the relationship between the lengths of sentences that are mutual translations. The length is measured in characters or words of a sentence ...

... English sentence, and "19f" means the sentence length (19 words) of a French sentence. There is a sequence of beads as follows:

• An 𝑒𝑓-bead (one English sentence aligned with one French sentence), followed by
• An 𝑒𝑓𝑓-bead (one English sentence aligned with two French sentences), followed by
• An 𝑒-bead (one English sentence), followed by
• A ¶𝑒¶𝑓 bead (one English paragraph and one French paragraph).

An alignment, ...

... text. A comparison between this text (T') and the actual target text (T) is performed on a sentence-by-sentence basis.

[Figure 2.4. The method of Varga et al., 2005. Diagram labels: S text, T text, Dictionary, T' text, Aligning based on Combination: Sentence Length and Shared-Words, Aligned Sentences.]

There are two major components that determine the similarity score between a source and a target sentence. The token-based score is ...

... source text and the target text, these algorithms are usually more robust than the length-based algorithms. Nevertheless, algorithms based on a lexicon are slower than those based on sentence length because they require considerably more expensive computation. In addition, they usually depend on cognates or a bilingual lexicon. The method of Chen requires an initial bilingual lexicon; the proposal ...
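The bead notation used in this section (𝑒𝑓-, 𝑒𝑓𝑓-, and 𝑒-beads) can be sketched as a small data structure. This is only an illustration: the class name `Bead`, the helper `describe`, and the sentence lengths are hypothetical, and paragraph beads (¶𝑒¶𝑓) are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bead:
    """One alignment unit: some English sentences paired with some French ones."""
    english: Tuple[int, ...]  # word counts of the English sentences in this bead
    french: Tuple[int, ...]   # word counts of the French sentences in this bead

    @property
    def kind(self) -> str:
        """Label such as 'ef', 'eff', or 'e': one letter per sentence in the bead."""
        return "e" * len(self.english) + "f" * len(self.french)


def describe(alignment: List[Bead]) -> List[str]:
    """Return the bead labels of an alignment, in order."""
    return [bead.kind for bead in alignment]


# Hypothetical sentence lengths, chosen only for illustration.
alignment = [
    Bead(english=(17,), french=(19,)),     # ef-bead: one sentence on each side
    Bead(english=(25,), french=(12, 15)),  # eff-bead: one English, two French
    Bead(english=(8,), french=()),         # e-bead: English sentence left unaligned
]
kinds = describe(alignment)
```

Storing lengths rather than the sentences themselves mirrors the length-based view, in which an alignment is judged only from how many words or characters each side of a bead contains.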
of the text are not considered. Statistical methods are used for this task instead of the content of texts. In other words, these methods consider only the length of sentences when deciding on an alignment. They are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated ...

... the distance function, and let $D(i, j)$ be the minimum distance between sentences $s_1, \ldots, s_i$ and their translations $t_1, \ldots, t_j$ under the maximum-likelihood alignment. $D(i, j)$ is computed by minimizing over six cases (substitution, deletion, insertion, contraction, expansion, and merger). These, in effect, impose a set of slope constraints. $D(i, j)$ is defined with the initial condition $D(i, j) = 0$. This ...

CHAPTER ONE Introduction

1.1 Background

Parallel corpora play an important role in a number of tasks such as machine translation, cross-language information retrieval, word disambiguation, sense disambiguation, bilingual lexicography, automatic translation verification, and automatic acquisition of knowledge about translation. Building a parallel ...

... the cost of expanding $x_1$ to $y_1$ and $y_2$, and
• $d(x_1, y_1; x_2, y_2)$ be the cost of merging $x_1$ and $x_2$ and matching with $y_1$ and $y_2$.

The Dynamic Programming Algorithm is summarized in the following recursion equation.

[Figure 2.2. Equation in dynamic programming (Gale and Church, 1993).]

Let $s_i$, $i = 1 \ldots I$, be the sentences of one language, and $t_j$, $j = 1 \ldots J$, be the translations of those sentences in the ...
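The minimization over six cases described above can be sketched as a short dynamic program. This is a minimal sketch under stated assumptions, not the method of Gale and Church (1993): the cost function `bead_cost` is a toy stand-in (length mismatch plus a fixed penalty per bead type) rather than their probabilistic distance, the penalty values are invented for illustration, and the table is seeded with `D[0][0] = 0`.

```python
from math import inf

# Bead types as (source sentences consumed, target sentences consumed).
# The penalties are illustrative placeholders, not values from any paper.
BEAD_PENALTY = {
    (1, 1): 0.0,  # substitution
    (1, 0): 5.0,  # deletion
    (0, 1): 5.0,  # insertion
    (2, 1): 2.0,  # contraction
    (1, 2): 2.0,  # expansion
    (2, 2): 3.0,  # merger
}


def bead_cost(src_lens, tgt_lens):
    """Toy distance: total length mismatch plus a fixed penalty for the bead type."""
    penalty = BEAD_PENALTY[(len(src_lens), len(tgt_lens))]
    return abs(sum(src_lens) - sum(tgt_lens)) + penalty


def align(src_lens, tgt_lens):
    """Return (total cost, beads); each bead is (source indices, target indices)."""
    n_src, n_tgt = len(src_lens), len(tgt_lens)
    D = [[inf] * (n_tgt + 1) for _ in range(n_src + 1)]
    back = [[None] * (n_tgt + 1) for _ in range(n_src + 1)]
    D[0][0] = 0.0  # assumption of this sketch: empty prefixes align at zero cost
    for i in range(n_src + 1):
        for j in range(n_tgt + 1):
            # Try each of the six bead types as the last bead ending at (i, j).
            for di, dj in BEAD_PENALTY:
                if i < di or j < dj:
                    continue
                cost = D[i - di][j - dj] + bead_cost(
                    src_lens[i - di:i], tgt_lens[j - dj:j]
                )
                if cost < D[i][j]:
                    D[i][j] = cost
                    back[i][j] = (di, dj)
    # Follow the back-pointers from the end to recover the bead sequence.
    beads = []
    i, j = n_src, n_tgt
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return D[n_src][n_tgt], beads


# Hypothetical sentence lengths in words: the second source sentence is
# roughly as long as the second and third target sentences combined.
total, beads = align([10, 20, 5], [11, 9, 10, 5])
```

Because each step consumes at most two sentences on either side, the recursion can never pair one sentence with an arbitrarily long run of sentences, which is one way to read the slope constraints mentioned in the text.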