Scientific report: "Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge" doc

... 2012.c2012 Association for Computational LinguisticsDetecting Highly Conﬁdent Word Translations from Comparable Corpora without Any Prior KnowledgeIvan Vuli´c and Marie-Francine MoensDepartment ... topic model ondocument-aligned comparable corpora and intro-duce different methods for identifying word trans-lations across languages, underpinned by per-topic word distributions from the trained ... because it is sharedby each of the paired bilingual comparable doc- uments. Topics for each document are sampled from θ, from which the words are then sampledin conjugation with the vocabulary...

Scientific report: "Detecting Compositionality in Multi-Word Expressions" doc

... evaluation set is derived from WordNet in a semi-supervised way. Graph connectivity measures are employed for unsupervised parameter tuning. 1 Introduction and related work. Multi-word expressions ... sequences of words that tend to cooccur more frequently than chance and are either idiosyncratic or decomposable into multiple simple words (Baldwin, 2006). Deciding idiomaticity of MWEs is highly ... accuracy that is comparable to the best manually selected combination of parameters. Thirdly, we propose a semi-supervised approach for extracting non-compositional MWEs from WordNet, to decrease annotation...

Scientific report: "Power-Law Distributions for Paraphrases Extracted from Bilingual Corpora" pdf

... same with any other phrase vertex in G, then the paraphrases will be captured. The transition probability from any vertex u to any other vertex v in G, i.e., the probability of hopping from u ... identifying similar words on the graph of WordNet (Rao et al., 2008) and a related measure, the hitting time is known to perform well in harvesting paraphrases on a graph constructed from multiple ... from additional monolingual data. The initial problem in harvesting paraphrases from a phrase-table is the identification of the search space. Previous work has relied on breadth first search from...

Scientific report: Nautilin-63, a novel acidic glycoprotein from the shell nacre of Nautilus macromphalus doc

... extracted from the nacre of the cephalopod N. macromphalus [34]. In particular, we obtained approximately 40 short sequences of different shell proteins, both extracted from the acid-soluble and from ... determined by monitoring the pH decrease (Fig. 3). In the blank experiment (without sample), the pH decreased without any time lag (approximately 120 s), corresponding to the rapid precipitation ... were gently dried by capillarity and then by rapid incubation at 50 °C. Blank controls were performed without any sample. Samples were subsequently carbon-sputtered and observed at 15 keV by SEM (JEOL 6400;...

Scientific report: "Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages" docx

... endings from words in English. For Pashto, we utilize a morphological decomposition algorithm that has been shown to be effective for Arabic speech recognition (Xiang et al., 2006). We start from ... suffixes are stripped off from the Pashto words under the two constraints: (1) Longest matched affixes first; (2) Remaining stem must be at least two characters long. 2.3 Partial Word. For low-resource ... English and Pashto word to generate one more alternative for the word alignment. 3 Confidence-Based Alignment Combination. Now we describe the algorithm to combine multiple sets of word alignments...

Scientific report: "Simultaneous Tokenization and Part-of-Speech Tagging for Arabic without a Morphological Analyzer" doc

... regexes. If it text-matches any closed-class expression, we pick a random choice from among those regexes and otherwise from the open-class regexes that it text-matches. Any POS ambiguities for a ... testing, we run each word through all the open and closed regexes. Text-matches for an open-class regex give rise to features as just described. Also, if the word matches any closed-class regex, ... Table 4 lists the matching regular expression for three words. The first, yjry, text-matches the generic regular expressions for any string/NOA, any string/IV, etc. These are summarized in one listing,...

Scientific report: "Pivot Approach for Extracting Paraphrase Patterns from Bilingual Corpora" ppt

... that word alignment error is the major factor that influences the performance of the methods learning paraphrases from bilingual corpora. The LW based features validate the quality of word alignment ... induced from SE, we extract the pivot pattern PC aligning to PE(e) as in Algorithm 2. Note that the Chinese patterns are not extracted from parse trees. They are only sequences of Chinese words and ... j) ∈ a}| ∀(i,j)∈a w(c_i|e_j)) (5) where a denotes the word alignment between c and e. n is the number of words in c. c_i and e_j are words of c and e. w(c_i|e_j) is computed as follows: w(c_i|e_j)...

Scientific report: "Bilingual Terminology Mining – Using Brain, not brawn comparable corpora" ppt

... features of the harvested comparable corpora: the number of documents, and the number of words for each language and each type of discourse. French Japanese doc. words doc. words Scientific 65 425,781 ... 538 807,287 Table 2: Comparable corpora statistics. From these documents, we created two comparable corpora: scientific corpora, composed only of scientific documents; mixed corpora, composed ... style or discourse. For lexical alignment from comparable corpora, good results on single words can be obtained from large corpora (several millions words); the accuracy of proposed translation...

Document: Scientific report: "Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents" doc

... exploits dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences in H that can be mapped to word sequences in T. In this way a semantic ... loss in precision. Like lexical phrase tables, SPTs are extracted from parallel corpora. As a first step we annotate the parallel corpora with named-entity taggers for the source and target languages, ... Ney, 2000) to align the resulting semantically augmented corpora. Finally, we extract the semantic phrase table from the augmented aligned corpora using the Moses toolkit (Koehn et al., 2007)....

Document: Scientific report: "Collecting Highly Parallel Data for Paraphrase Evaluation" doc

... Linguistics (COLING-2008). Chris Callison-Burch. 2008. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language ... 7(1):1–29. Ali Ibrahim, Boris Katz, and Jimmy Lin. 2003. Extracting structural paraphrases from aligned monolingual corpora. In Proceedings of the 41st Annual Meeting of the Association for Computational ... presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple...
