... involving the comparable corpora. Re-scoring through the Comparable Corpora Comparable corpora could be considered for the disambiguation of translation alternatives and thus selection of best phrasal ... with 10.35%, 8.27% and 3.08% for the WWW-based, the NTCIR-based and comparable corpora-based techniques, respectively, compared to the hybrid two-stages comparable corpora and linguistics-based ... (1998-1999) for Japanese and Mainichi Daily News (1998-1999) for English were considered as comparable corpora. We have also considered documents of NTCIR-2 test collection as comparable corpora in order...
... WordNet) and Collins bilingual dictionary. Section 5 evaluates our methodologies and Section 6 concludes the paper suggesting some future developments. 2 Comparable Corpora Comparable corpora ... tiz∈ T∗ and for every language Lj, and is known, then the corpus is parallel and aligned at document level. For the purpose of this paper it is enough to assume that two corpora are comparable, ... Italian and 27,821 English news partitioned by AdnKronos into four fixed categories: QUALITY OF LIFE, MADE IN ITALY, TOURISM, CULTURE AND SCHOOL. The English and the Italian corpora are comparable, ...
... Research, and International Communications Foundation. References J. Allan and R. Papka and V. Lavrenko, On-line new event detection and tracking, Proc. of the DARPA Workshop, 1998. J. Allan and V. Lavrenko ... chose the TDT3 English corpora as our gold standard corpora. TDT3 consists of 34,600 stories with 60 manually identified topics. We then created Japanese corpora (Mainichi and Yomiuri newspapers) ... stories, and tracking. 3.1 Extracting Bilingual Story Pairs We extract story pairs which consist of positive English story and its associated Japanese stories using the TDT English and Mainichi and...
... (Cucerzan and Yarowsky, 1999) and (Collins and Singer, 1999) present algorithms to obtain NEs from untagged corpora. However, they focus on the classification stage of already segmented entities, and ... transliteration candidates in another language. Time sequence scoring is then used to re-rank the list and choose the candidate best temporally aligned with the NE. Pairs of NEs and the best candidates ... pairs of English NEs and their Russian transliterations. Negative examples here and during the rest of the training were pairs of randomly selected non-NE English and Russian words. New...
... terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. ... from parallel corpora by means of symmetrical word alignment and/or by phrase generation (Koehn et al., 2003). Our toolkit exploits comparable corpora in order to find and extract comparable ... extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual...
... 1: Comparable Corpora The corpora can be categorized into two separate groups, group S (for Small) consisting of EK-S, ET-S, ER-S, and EH-S and group L (for Large) consisting of EK-L and ... language and the pairing of articles in the comparable corpora is known in advance. We want to emphasize here that such corpora are indeed available in many domains such as technical documents and ... Here, t_j (resp. s_i) denotes the jth (resp. ith) character in w_T (resp. w_S) and A = a_1^m is the hidden alignment between w_T and w_S where t_j is aligned...
... Acquisition from Comparable Corpora Takehito Utsuro Takashi Horiuchi and Kohei Hino Graduate School of Informatics, Takeshi Hamamoto and Takeaki Nakayama Kyoto University Dpt. Information and Computer ... sequences, and word frequency vectors v (d j) and v (dlYT) are generated. Then, cosine similarities between v (d j) and v (dr') are calculated and pairs of articles di and dE ... help of any existing bilingual lexicons. On the other hand, later works such as Kaji and Aizono (1996), Fung and Yee (1998), Rapp (1999), and Tanaka (2002) studied to exploit existing bilingual lexicons...
... The taxes increased across the board and everyone must pay more. act high and mighty - to act proud and powerful The woman always acts high and mighty and nobody likes her. act one's ... energy The man stood up and belted out several old songs. Mr_doody2004@yahoo.com 29 B back Idioms back and forth - backwards and forwards, first one way and then the other way ... bite the hand that feeds you - to harm or turn against someone who does good things for you He is biting the hand that feeds him when he criticizes and fights against his boss....
... pair (isola, island). 2. Remove the words isola and island from their respective vocabularies. 3. Since island is not in the vocabulary, the indirect association between arcipelago and island is not ... on precision and recall of bilingual lexicon extraction from parallel corpora. This assumption should also be reasonable for many types of comparable corpora such as Wikipedia or news corpora, which are ... promising performance. 5 Conclusions and Future Work We have designed an algorithm that focuses on acquiring and keeping only highly confident translation candidates from multilingual comparable corpora. By employing...
... including the joint probability phrase-based model (Marcu and Wong, 2002) and a variant on the alignment template approach (Och and Ney, 2004), and contrast them to the performance of the word-based ... pairs, and has 127 million words in English, and 106 million words in Arabic. The table shows the number of unique Arabic phrases, and gives the average number of translations into English and their ... corpus, and another created from the target language portion of the corpus, • An index that tells us the correspondence between sentence numbers and positions in the source and target language corpora, •...
... on me. backhanded compliment and left-handed compliment an unintended or ambiguous compliment. □ Backhanded compliments are the only kind he ever gives! □ And I think his left-handed compliments ... abbreviations and symbols are used, and these are explained in the section “Terms and Symbols.” The user who understands the meaning of entry head, variable, and wildcard term is equipped to understand ... back and made a motion with his hand indicating that Mary should go first. “After you,” smiled Bob. again and again repeatedly; again and even more [times]. □ He knocked on the door again and...
... and that of P2 is 0.939. Both corpora are more comparable than P0 of which the comparability is 0.881. Furthermore, both P1 and P2 are more comparable than P1 (comparability 0.912) and ... studies and is now standard. 3.2.2 Results and Analysis In a first series of experiments, bilingual lexicons were extracted from the corpora obtained by our approach (P1 and P2), the corpora ... English documents and 87k French documents) consisting of the corpora LAT94, MON94 and SDA94; P2T (368k English documents and 378k French documents) consisting of Wiki-En and Wiki-Fr. 1 http://trec.nist.gov 2 http://www.clef-campaign.org 3 The...
... acquire translation candidates based on comparable and unrelated corpora comes from (Rapp, 1995). Similar approaches are described in (Diab and Finch, 2000), (Koehn and Knight, 2002) and (Gaussier et ... behind our work and gives an overview and a theoretical background of the methods. Section 4 evaluates and discusses initial results. Finally, section 5 proposes several extensions and gives a ... 479–484, Portland, Oregon, June 19-24, 2011. ©2011 Association for Computational Linguistics Identifying Word Translations from Comparable Corpora Using Latent Topic Models Ivan Vulić, Wim De Smet and...
... corpora (Myaeng and Jeong, 1999; Oh and Choi, 2001) use the phonetic differences between conventional Korean words and loanwords. However, these methods require manually tagged training corpora, ... or comparable corpus is not available, such as Mongolian and Japanese. Fujii et al. (2004) proposed a method that does not require tagged corpora or parallel corpora to extract loanwords and ... bilingual comparable corpora, and matched named entities in each language corpus if they were similar to each other. Thus, Lam et al.’s method cannot be used for a language pair where comparable corpora...
... several comparable corpora for English and Russian, including large reference corpora (the BNC and the Russian Reference Corpus) and corpora of major British and Russian newspapers. All corpora ... Barcelona. Michael Carl and Andy Way, editors. 2003. Recent advances in example-based machine translation. Kluwer, Dordrecht. Ido Dagan and Kenneth Church. 1997. Termight: Coordinating humans and machines ... corpus-based extraction and the very large lexicon. In Lars Borin, editor, Language and Computers, Parallel corpora, parallel worlds, pages 137–149. Rodopi. John S. Justeson and Slava M. Katz. 1995....