... 100–108, Suntec, Singapore, 2-7 August 2009. ©2009 ACL and AFNLP. Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling. Daichi Mochihashi, Takeshi Yamada, Naonori Ueda, NTT Communication ... Japanese word segmentation. Our model can also be considered a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any word indications. 1 ... a character n-gram in a word n-gram from a Bayesian perspective; Section 3 introduces a novel language model for word segmentation, which we call the Nested Pitman-Yor language model. Section...
... Methodology. For each language pair, we train two log-linear translation models as described above (§3), once with English as the source and once with English as the target language. For a baseline, ... models trained to maximize likelihood: infrequent source words act as “garbage collectors”, with many target words aligned to them (the word dislike in the Model 4 alignment in Figure 2 is an ... 2011. ©2011 Association for Computational Linguistics. Unsupervised Word Alignment with Arbitrary Features. Chris Dyer, Jonathan Clark, Alon Lavie, Noah A. Smith, Language Technologies Institute, Carnegie Mellon...
... Sweden, 11-16 July 2010. ©2010 Association for Computational Linguistics. Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure. Minwoo Jeong and Ivan Titov, Saarland ... problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second language” podcast dataset, where each episode ... stories, etc. This is especially common with the emergence of Web 2.0 technologies: many texts on the web are now accompanied with comments and discussions. Segmentation of these parallel parts...
... separate experiments without such optimization.
1  word w
2  word bigram w1 w2
3  single-character word w
4  a word starting with character c and having length l
5  a word ending with character c
... c2 of two consecutive words
12 the ending characters c1 and c2 of two consecutive words
13 a word of length l and the previous word w
14 a word of length l and the next word w
Table 1: feature ... sub-words, which include single-character words and the most frequent multiple-character words from the training corpus. Thus it can be seen as a step towards a word-based model. However, sub-words...
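The feature templates in Table 1 combine surface words with character and length information. As a minimal sketch of how such templates could be instantiated for a segmented sentence, the function below emits (template, value...) tuples; the template names ("w", "bigram", "single_char", ...) are illustrative labels of my own, not identifiers from the paper.

```python
def segmentation_features(words):
    """Sketch of instantiating Table-1-style templates for a word sequence.
    Template names are hypothetical; numbers refer to Table 1 rows."""
    feats = []
    for i, w in enumerate(words):
        feats.append(("w", w))                        # 1: word w
        if len(w) == 1:
            feats.append(("single_char", w))          # 3: single-character word
        feats.append(("start_len", w[0], len(w)))     # 4: starting char + length
        feats.append(("end_char", w[-1]))             # 5: ending character
        if i > 0:
            prev = words[i - 1]
            feats.append(("bigram", prev, w))                 # 2: word bigram
            feats.append(("end_chars", prev[-1], w[-1]))      # 12: ending chars of two words
            feats.append(("len_prev", len(w), prev))          # 13: length + previous word
        if i + 1 < len(words):
            feats.append(("len_next", len(w), words[i + 1]))  # 14: length + next word
    return feats
```

In a log-linear or CRF-style segmenter, each such tuple would be hashed to a weight; the sketch only shows the extraction step.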
... scored segmentations. 3.2.1 Possible Segmentations of a Word. Possible segmentations of a word token are restricted to those derivable from a table of prefixes and suffixes of the language ... We have presented a robust word segmentation algorithm which segments a word into a prefix*-stem-suffix* sequence, along with experimental results. Our Arabic word segmentation system implementing ... tokens with prefix P / number of tokens starting with sub-string P   (6)
Sscore = number of tokens with suffix S / number of tokens ending with sub-string S   (7)
PSscore = number of tokens with...
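Equation (7) is a simple ratio over corpus counts. A minimal sketch, assuming the corpus is available as (surface token, analyzed suffix) pairs (the analyzed suffixes are an assumed input, with "" for tokens that have no suffix):

```python
def suffix_score(analyzed_tokens, S):
    """Sketch of equation (7): Sscore = (# tokens analyzed as having suffix S)
    / (# tokens whose surface form merely ends with the string S).
    `analyzed_tokens` is an assumed list of (surface, suffix) pairs."""
    with_suffix = sum(1 for tok, suf in analyzed_tokens if suf == S)
    ends_with = sum(1 for tok, _suf in analyzed_tokens if tok.endswith(S))
    return with_suffix / ends_with if ends_with else 0.0
```

The denominator counts every token ending in the string S, so the score penalizes strings that frequently occur word-finally without functioning as a suffix; Pscore in equation (6) is the mirror image over prefixes.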
... 29–32, Suntec, Singapore, 4 August 2009. ©2009 ACL and AFNLP. A Novel Word Segmentation Approach for Written Languages with Word Boundary Markers. Han-Cheol Cho†, Do-Gil Lee§, Jung-Tae Lee§, ... module. 1 Introduction. Word segmentation (WS) has been a fundamental research issue for languages that do not have word boundary markers (WBMs); on the contrary, other languages that do have ... applications work under the assumption that a user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as...
... Vietnamese word segmentation is very problematic, especially without a manually segmented test corpus. Therefore, we perform two experiments: one uses human judgment of the word segmentation ... ways of segmentation, i.e. the important words are segmented correctly while less important words may be segmented incorrectly. Table 6 presents the human judgments for our word segmentation ... inhomogeneity in judging word segmentation. However, the percentage of acceptable segmentations is satisfactory: nearly eighty percent of the word segmentation output does not make the...
... contrasts with recent work on language modeling with tree substitution grammars (Post and Gildea, 2009), where larger treelet contexts are incorporated by using sophisticated priors to learn a segmentation ... NNTS. Head Annotations: We annotate every non-terminal or preterminal with its head word if the head is a closed-class word, and with its head tag otherwise. Klein and Manning (2003) used head tag ... Gigaword, version 3. In Linguistic Data Consortium, Philadelphia, Catalog Number LDC2003T05. Keith Hall. 2004. Best-first Word-lattice Parsing: Techniques for Integrated Syntactic Language Modeling. Ph.D....
... . wm, and len(wi) is the length of a word wi, used here to be able to compare segmentations that result in different numbers of words. This best segmentation can be computed easily using ... set of all the possible segmentations; then we are looking for:

    arg max_{W ∈ Seg(s)}  Σ_{wi ∈ W}  a(wi) · len(wi),

where W is the segmentation corresponding to the sequence of words w0 w1 . . . wm, ... With this measure, we can redefine the sentence segmentation problem as the maximization of the autonomy measure of its words. For a character sequence s,...
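The arg max above decomposes over words, so it can be computed with a standard left-to-right dynamic program over character positions (a Viterbi-style search). A minimal sketch, assuming the autonomy measure a(·) is given as a function and that candidate words are capped at `max_len` characters:

```python
def best_segmentation(s, autonomy, max_len=8):
    """Return the segmentation of character string s maximizing
    sum over words of autonomy(w) * len(w), via dynamic programming.
    `autonomy` is the assumed measure a(.); `max_len` bounds word length."""
    # best[i] = (best score, best word list) for the prefix s[:i]
    best = [(0.0, [])] + [(float("-inf"), None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            if best[j][1] is None:
                continue  # prefix s[:j] not reachable
            w = s[j:i]
            score = best[j][0] + autonomy(w) * len(w)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [w])
    return best[len(s)][1]
```

For example, with a toy autonomy table where a("ab") = 2.0 and a("c") = 1.0 dominate, `best_segmentation("abc", ...)` picks ["ab", "c"] over the unsegmented "abc"; the len(wi) weighting is what makes scores of segmentations with different word counts comparable.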
... the different words coupled with the same word in the synonym pairs as synonyms. For instance, the words ‘head’, ‘chief’ and ‘forefront’ in the bilingual sentences are replaced with ‘chief’, since ... and k are a word in a different language E and a latent topic, respectively. It has been shown that a word e in a different language is an appropriate representation of s in synonym modeling (Bannard ... correctly. For instance, functional words in one language tend to correspond to functional words in another language (Deng and Gao, 2007), and the syntactic dependency of words in each language can...