Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling

Scientific report: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling" doc

Scientific report

... 100–108, Suntec, Singapore, 2-7 August 2009. © 2009 ACL and AFNLP Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling Daichi Mochihashi Takeshi Yamada Naonori Ueda NTT Communication ... Japanese word segmentation. Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any word indications. ... a character n-gram in word n-gram from a Bayesian perspective, Section 3 introduces a novel language model for word segmentation, which we call the Nested Pitman-Yor language model. Section...

Scientific report: "Unsupervised Word Alignment with Arbitrary Features" potx

Scientific report

... Methodology For each language pair, we train two log-linear translation models as described above (§3), once with English as the source and once with English as the target language. For a baseline, ... models trained to maximize likelihood: infrequent source words act as “garbage collectors”, with many target words aligned to them (the word dislike in the Model 4 alignment in Figure 2 is an ... 2011. © 2011 Association for Computational Linguistics Unsupervised Word Alignment with Arbitrary Features Chris Dyer Jonathan Clark Alon Lavie Noah A. Smith Language Technologies Institute Carnegie Mellon...

Document - Scientific report: "Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure" pdf

Scientific report

... Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure Minwoo Jeong and Ivan Titov Saarland ... problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second language” podcast dataset where each episode ... stories, etc. This is especially common with the emergence of the Web 2.0 technologies: many texts on the web are now accompanied with comments and discussions. Segmentation of these parallel parts...

Scientific report: "Chinese Segmentation with a Word-Based Perceptron Algorithm" docx

Scientific report

... separate experiments without such optimization. 1: word w; 2: word bigram w1w2; 3: single-character word w; 4: a word starting with character c and having length l; 5: a word ending with character c ... c2 of two consecutive words; 12: the ending characters c1 and c2 of two consecutive words; 13: a word of length l and the previous word w; 14: a word of length l and the next word w. Table 1: feature ... sub-words, which include single-character words and the most frequent multiple-character words from the training corpus. Thus it can be seen as a step towards a word-based model. However, sub-words...

Scientific report: "Language Model Based Arabic Word Segmentation" pdf

Scientific report

... scored segmentations. 3.2.1 Possible Segmentations of a Word Possible segmentations of a word token are restricted to those derivable from a table of prefixes and suffixes of the language ... We have presented a robust word segmentation algorithm which segments a word into a prefix*-stem-suffix* sequence, along with experimental results. Our Arabic word segmentation system implementing ... tokens with prefix P / number of tokens starting with sub-string P (6) Sscore = number of tokens with suffix S / number of tokens ending with sub-string S (7) PSscore = number of tokens with...
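The ratio scores quoted in the excerpt above can be illustrated with a minimal Python sketch. It assumes a toy list of already segmented tokens; the tuple layout, the transliterated tokens, and the function names `prefix_score` and `suffix_score` are hypothetical and do not come from the paper. Only the two ratios themselves (tokens with a given affix divided by tokens that start or end with the same substring) follow the excerpt.

```python
from typing import List, Optional, Tuple

# Hypothetical toy data: (surface token, gold prefix or None, gold suffix or None).
Token = Tuple[str, Optional[str], Optional[str]]

def prefix_score(tokens: List[Token], prefix: str) -> float:
    """Tokens segmented with `prefix` / tokens whose surface form starts with it."""
    with_prefix = sum(1 for _, p, _ in tokens if p == prefix)
    starting = sum(1 for surface, _, _ in tokens if surface.startswith(prefix))
    return with_prefix / starting if starting else 0.0

def suffix_score(tokens: List[Token], suffix: str) -> float:
    """Tokens segmented with `suffix` / tokens whose surface form ends with it."""
    with_suffix = sum(1 for _, _, s in tokens if s == suffix)
    ending = sum(1 for surface, _, _ in tokens if surface.endswith(suffix))
    return with_suffix / ending if ending else 0.0

# Made-up transliterated tokens, for illustration only.
toy_tokens: List[Token] = [
    ("alkitab", "al", None),
    ("albayt", "al", None),
    ("alam", None, None),      # starts with "al" but has no prefix
    ("kutubna", None, "na"),
    ("baytna", None, "na"),
    ("sana", None, None),      # ends with "na" but has no suffix
    ("makana", None, None),
]

al_score = prefix_score(toy_tokens, "al")   # 2 of 3 tokens starting with "al"
na_score = suffix_score(toy_tokens, "na")   # 2 of 4 tokens ending with "na"
```

A score near 1 suggests that the substring almost always acts as an affix when it appears at the word edge; how the paper combines these scores is not visible in the truncated excerpt.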

Scientific report: "A Novel Word Segmentation Approach for Written Languages with Word Boundary Markers" pptx

Scientific report

... 29–32, Suntec, Singapore, 4 August 2009. © 2009 ACL and AFNLP A Novel Word Segmentation Approach for Written Languages with Word Boundary Markers Han-Cheol Cho†, Do-Gil Lee§, Jung-Tae Lee§, ... module. 1 Introduction Word segmentation (WS) has been a fundamental research issue for languages that do not have word boundary markers (WBMs); on the contrary, other languages that do have ... applications work under the assumption that a user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as...

Document: "Word Segmentation for Vietnamese Text Categorization: An online corpus approach" pptx

College - University

... Vietnamese word segmentation is very problematic, especially without a manual segmentation test corpus. Therefore, we perform two experiments, one is done by human judgment for word segmentation ... ways of segmentation, i.e. the important words are segmented correctly while less important words may be segmented incorrectly. Table 6 represents the human judgment for our word segmentation ... inhomogeneous phenomenon in judgment word segmentation. However, the acceptable segmentation percentage is satisfactory. Nearly eighty percent of word segmentation outcome does not make the...

Document - Scientific report: "Large-Scale Syntactic Language Modeling with Treelets" docx

Scientific report

... contrasts with recent work on language modeling with tree substitution grammars (Post and Gildea, 2009), where larger treelet contexts are incorporated by using sophisticated priors to learn a segmentation ... NNTS. Head Annotations We annotate every non-terminal or preterminal with its head word if the head is a closed-class word and with its head tag otherwise. Klein and Manning (2003) used head tag ... gigaword, version 3. In Linguistic Data Consortium, Philadelphia, Catalog Number LDC2003T05. Keith Hall. 2004. Best-first Word-lattice Parsing: Techniques for Integrated Syntactic Language Modeling. Ph.D....

Document - Scientific report: "Unsupervized Word Segmentation: the case for Mandarin Chinese" doc

Scientific report

... . w_m, and len(w_i) is the length of a word w_i, used here to be able to compare segmentations resulting in a different number of words. This best segmentation can be computed easily using ... set of all the possible segmentations, then we are looking for: $\arg\max_{W \in Seg(s)} \sum_{w_i \in W} a(w_i) \cdot len(w_i)$, where W is the segmentation corresponding to the sequence of words $w_0 w_1 \ldots w_m$, ... With this measure, we can redefine the sentence segmentation problem as the maximization of the autonomy measure of its words. For a character sequence s,...
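The maximization in the excerpt above (the sum of a(w_i) · len(w_i) over the words of a segmentation) can be sketched in Python. The excerpt is truncated before it names the search method, so the dynamic-programming search here is an assumption rather than the paper's stated procedure. The function `best_segmentation`, the `max_len` cap, and the toy autonomy table are likewise hypothetical, introduced only for illustration.

```python
from typing import Callable, List, Tuple

def best_segmentation(
    s: str,
    autonomy: Callable[[str], float],
    max_len: int = 4,  # assumed cap on candidate word length, not from the source
) -> Tuple[List[str], float]:
    """Return the segmentation W of s maximizing sum(a(w) * len(w) for w in W)."""
    n = len(s)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for the prefix s[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word in s[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = s[start:end]
            score = best[start] + autonomy(word) * len(word)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        words.append(s[start:end])
        end = start
    return words[::-1], best[n]

# Made-up autonomy values; unseen strings get a penalty of -1.0.
toy_scores = {"ab": 1.0, "cd": 0.5, "abc": 0.2}

def toy_autonomy(word: str) -> float:
    return toy_scores.get(word, -1.0)

words, score = best_segmentation("abcd", toy_autonomy)
```

Weighting each word's autonomy by its length, as the excerpt explains, keeps segmentations with different numbers of words comparable; without it, a sum over words would favor whichever segmentation happens to contain more positively scored pieces.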

Document - Scientific report: "Word Alignment with Synonym Regularization" doc

Scientific report

... the different words coupled with the same word in the synonym pairs as synonyms. For instance, the words ‘head’, ‘chief’ and ‘forefront’ in the bilingual sentences are replaced with ‘chief’, since ... and k are a word in a different language E and a latent topic, respectively. It has been shown that a word e in a different language is an appropriate representation of s in synonym modeling (Bannard ... correctly. For instance, functional words in one language tend to correspond to functional words in another language (Deng and Gao, 2007), and the syntactic dependency of words in each language can...