... International Chinese Word Segmentation Bake-off. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan, July 2003. Xue, N. 2003. Chinese Word Segmentation ... co-occurrence. Word-based model. In this model, statistical data about word boundary frequencies for each character is retrieved word-wise. For example, in the case of a monosyllabic word only two word ... introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CB’s) into either word-boundaries (WB’s) or non-word-boundaries. In Chinese, CB’s are delimited...
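The boundary-classification view in the last excerpt can be illustrated with a minimal sketch. It assumes one boolean decision per internal character boundary (between adjacent characters); the function name `split_at_boundaries` and this input layout are hypothetical and do not come from the excerpted paper, which may define CB's differently.

```python
def split_at_boundaries(chars, is_word_boundary):
    """Turn per-boundary decisions into a list of words.

    chars: the sentence as a string of n characters.
    is_word_boundary: n-1 booleans; entry i is True when the boundary
    between chars[i] and chars[i+1] is classified as a word boundary
    (assumed layout, for illustration only).
    """
    if len(is_word_boundary) != max(len(chars) - 1, 0):
        raise ValueError("need exactly one decision per internal boundary")
    if not chars:
        return []
    words = []
    start = 0
    for i, is_wb in enumerate(is_word_boundary):
        if is_wb:
            # Close the current word at the boundary after chars[i].
            words.append(chars[start:i + 1])
            start = i + 1
    # The end of the sentence always closes the last word.
    words.append(chars[start:])
    return words


print(split_at_boundaries("abcde", [False, True, False, True]))
# → ['ab', 'cd', 'e']
```

With n characters there are n-1 binary decisions, so a classifier over boundaries fully determines the segmentation.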
... as that in English. Chinese word segmentation is therefore the first step for any Chinese information processing system [1]. Almost all methods for Chinese word segmentation developed so far, ... Automatic Word Segmentation System for Written Chinese Texts", Journal of Chinese Information Processing, Vol. 1, No. 2, 1987 (in Chinese) [2] Fan C.K., Tsai W.H., "Automatic Word Identification ... of Hong Kong, Hong Kong Abstract Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without making use...
... decoding. 3 Chinese Word Segmentation (CWS). 3.1 Word segmentation as character tagging. Considering the ambiguity problem that a Chinese character may appear in any relative position in a word and ... Character- and word-based features. As studied in previous work, word-based feature templates usually include the word itself, sub-words contained in the word, contextual characters/words and so ... are incorporated into word-based CWS models, some word-based features are no longer of interest, such as the starting character of a word, sub-words contained in the word, contextual characters...
... obtain accuracy improvements on both segmentation and Joint S&T. 2 Segmentation and POS Tagging. Given a Chinese character sequence $C_{1:n} = C_1 C_2 \ldots C_n$, the segmentation result can be depicted ... end of the word • s: a single-character word. We can extract the segmentation result by splitting the labelled result into subsequences of pattern s or bm*e, which denote single-character word and ... 3-gram word language model measuring the fluency of the segmentation result, a 4-gram POS language model functioning as the product of state-transition probabilities in HMM, and a word-POS co-occurrence...
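The extraction step described in this excerpt, splitting the labelled result into subsequences matching s or bm*e, can be sketched as follows. The function name `tags_to_words` and the choice to pass the tags as a single string are assumptions made for illustration; the excerpt does not show the authors' implementation.

```python
import re


def tags_to_words(chars, tags):
    """Recover words from a b/m/e/s tag string.

    chars: the sentence as a string.
    tags: one tag per character, written as a string over {b, m, e, s}
    (assumed encoding). Words are maximal matches of `s` or `bm*e`.
    """
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words = []
    pos = 0
    for match in re.finditer(r"s|bm*e", tags):
        if match.start() != pos:
            # A gap means some tags fit neither pattern (e.g. a stray 'm').
            raise ValueError(f"ill-formed tag sequence at position {pos}")
        words.append(chars[match.start():match.end()])
        pos = match.end()
    if pos != len(tags):
        raise ValueError(f"ill-formed tag sequence at position {pos}")
    return words


print(tags_to_words("我喜欢北京", "sbebe"))
# → ['我', '喜欢', '北京']
```

Rejecting tag strings that are not fully covered by the two patterns is a design choice of this sketch; a real decoder might instead constrain the tagger so that ill-formed sequences never arise.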
... stacked sub-word model. Given multiple word segmentations of one sentence, we formally define a sub-word structure that maximizes the agreement of non-word-break positions. Based on the sub-word structure, ... predicted words and their POS information as clues to find a new word. After one word is found and classified, solvers move on and search for the next possible word. This word-by-word method ... data for sub-word tagging. 3 Method. 3.1 Architecture. In our stacked sub-word model, joint word segmentation and POS tagging is decomposed into two steps: (1) coarse-grained word segmentation...
... Generation of Words with Internal Structures. Words with rich internal structures can be described using a context-free grammar formalism as word → root (3), word → word suffix (4), word → prefix word (5). Here ... trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed ... 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48. Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based...
... $\mathrm{len}(w_i)$, where $W$ is the segmentation corresponding to the sequence of words $w_0 w_1 \ldots w_m$, and $\mathrm{len}(w_i)$ is the length of a word $w_i$, used here to be able to compare segmentations resulting ... redefine the sentence segmentation problem as the maximization of the autonomy measure of its words. For a character sequence $s$, if we call $\mathrm{Seg}(s)$ the set of all the possible segmentations, then ... against the corpora from the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005). These corpora cover 4 different segmentation guidelines from various origins: Academia Sinica...
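Maximizing a per-word score over all segmentations of a sequence can be done by dynamic programming. The sketch below assumes the objective is the sum over words of autonomy(w) times len(w); the exact formula is elided in the excerpt, so this form, the `autonomy` callback, and the `max_len` cap are all assumptions made for illustration.

```python
def best_segmentation(s, autonomy, max_len=4):
    """Return the segmentation of s with the highest total score.

    Assumed objective (not confirmed by the excerpt):
        sum(autonomy(w) * len(w) for w in segmentation)
    autonomy: hypothetical callable mapping a candidate word to a float.
    max_len: hypothetical cap on candidate word length.
    """
    n = len(s)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for s[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = s[start:end]
            score = best[start] + autonomy(word) * len(word)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(s[start:end])
        end = start
    return words[::-1]


# Toy autonomy scores, invented for the example.
toy = {"ab": 2.0, "c": 1.0}
print(best_segmentation("abc", lambda w: toy.get(w, 0.1)))
# → ['ab', 'c']
```

Weighting each word by its length keeps the totals comparable across segmentations with different numbers of words, which matches the stated purpose of len(w_i) in the excerpt.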
... 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22. Weiwei Sun. 2011. A stacked sub-word model for joint Chinese word segmentation and part-of-speech ... improve the segmentation of out-of-vocabulary (OOV) words. Unlike languages such as Japanese that use a distinct character set (i.e. katakana) for foreign words, the transliterated words in Chinese, ... POS tags. The joint approach to word segmentation and POS tagging has been reported to improve word segmentation and POS tagging accuracies by more than 1% in Chinese (Zhang and Clark, 2008)....
... Vietnamese word segmentation is very problematic, especially without a manual segmentation test corpus. Therefore, we perform two experiments: one is done by human judgment for word segmentation ... ways of segmentation, i.e. the important words are segmented correctly while less important words may be segmented incorrectly. Table 6 presents the human judgment for our word segmentation ... inhomogeneous phenomenon in judgment word segmentation. However, the acceptable segmentation percentage is satisfactory. Nearly eighty percent of the word segmentation outcome does not make the...
... specific to Chinese, are shown in Table 2. The word segmentation features are extracted from word bigrams, capturing word, word length and character information in the context. The word length ... last word can be a complete word or a partial word. A problem arises in whether to give POS tags to incomplete words. If partial words are given POS tags, it is likely that some partial words are ... pattern “number word” + “number word” can help to prevent segmenting a long number word into two words. In order to avoid error propagation and make use of POS information for word segmentation, ...
... monosemous word is usually synonymous with some polysemous words. For example, the words "信守, 严守, 恪守, 遵照, 遵从, 遵循, 遵守" have a similar meaning to one of the senses of the ambiguous word ... in Chinese, which can be used as a knowledge source for WSD. 3.1 Definition of Equivalent Pseudoword. If the ambiguous words in the corpus are replaced with their synonymous monosemous words, ... ambiguous word need to simulate the function of the real ambiguous word, and to acquire semantic knowledge as the real ambiguous word does. Thus, we call it an equivalent pseudoword (EP)...
... of the Chinese side of the training data, including the total vocabulary (Voc), number of character vocabulary (Char.voc) in Voc, and the running words (Run.words) when different word segmentations ... iterations). 4 Word Lattice Decoding. 4.1 Word Lattices. In the decoding stage, the various segmentation alternatives can be encoded into a compact representation of word lattices. A word lattice ... Given a Chinese sentence $c_1^J$ consisting of $J$ characters $\{c_1, \ldots, c_J\}$ and an English sentence $e_1^I$ consisting of $I$ words $\{e_1, \ldots, e_I\}$, $A_{C \to E}$ will denote a Chinese-to-English word...
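Encoding several segmentation alternatives into one lattice can be sketched as below. The edge representation (start offset, end offset, word) and the function name `build_lattice` are assumptions made for illustration; the excerpt's own definition of a word lattice is elided and may differ, for example by attaching weights to edges.

```python
def build_lattice(segmentations):
    """Merge alternative segmentations of one sentence into lattice edges.

    segmentations: list of word lists that all spell the same sentence.
    Returns sorted, de-duplicated edges (start, end, word), where start and
    end are character offsets acting as lattice nodes (assumed encoding).
    """
    edges = set()
    sentence = None
    for words in segmentations:
        joined = "".join(words)
        if sentence is None:
            sentence = joined
        elif joined != sentence:
            raise ValueError("segmentations must cover the same sentence")
        pos = 0
        for word in words:
            # Words shared by several segmentations collapse into one edge.
            edges.add((pos, pos + len(word), word))
            pos += len(word)
    return sorted(edges)


print(build_lattice([["ab", "c"], ["a", "bc"], ["a", "b", "c"]]))
# → [(0, 1, 'a'), (0, 2, 'ab'), (1, 2, 'b'), (1, 3, 'bc'), (2, 3, 'c')]
```

Three segmentations with seven words in total yield only five edges here, which is the sense in which the lattice is a compact representation.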