Identifying coordinated compound words for Vietnamese word segmentation


VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Entry for the Student Scientific Research Award, 2012

Title: Identifying coordinated compound words for Vietnamese word segmentation
Student: Nguyễn Minh Cường
Gender: Male
Class: K53CA
Faculty: Computer Science (KHMT)
Supervisors: Dr. Nguyễn Phương Thái, MSc. Trần Ngọc Anh

Ha Noi – 2012

Abstract

Word segmentation is considered the first step in most natural language processing applications. Vietnamese word segmentation faces difficulties that Western languages do not. English and many other languages separate words with blanks, which makes word segmentation easy for a tokenizer. A Vietnamese word can consist of one, two, or more syllables. In natural language processing, a dictionary is an essential resource for analyzing language problems, from simple to complex. Most Vietnamese dictionaries define only a small number of coordinated compound words. Because most natural language processing systems depend heavily on a dictionary in the word segmentation step, many problems appear when the tokenizer has to detect coordinated compound words. We are building a coordinated compound word dictionary with a large number of entries, which we hope will improve the accuracy of Vietnamese word segmentation.

Chapter 1 Introduction

Word segmentation is considered the first step in most natural language processing applications. Vietnamese word segmentation faces difficulties that Western languages do not. English and many other languages separate words with blanks, which makes word segmentation easy for a tokenizer. A Vietnamese word can consist of one, two, or more syllables. In general, the meaning of a Vietnamese compound word is formed by combining the meanings of its syllables, and blanks separate syllables rather than words.
This creates problems for all natural language processing tasks. The main problems include word ambiguity, unknown word detection, and proper name recognition.

Chapter 2 Vietnamese word segmentation

2.1 Coordinated Compound Word

2.1.1 Definition

Coordinated compound words are made up of two or more syllables with similar meanings, and the meaning of the word is a combination of the meanings of those syllables. The syllables that make up a coordinated compound word stand in an equal relation. In other words, the meaning of a coordinated compound word is more general than that of each syllable and is based equally on all of them. The order of the syllables in a coordinated compound word can often be changed. For example: “quần áo”, “áo quần”, “chung riêng”, “riêng chung”, “đen đỏ”, “đỏ đen”, “ốm đau”, “đau ốm”, …

2.1.2 Types of coordinated compound words

There are three types of coordinated compound words:
• All syllables are of Vietnamese origin: “đất nước”, “trời đất”, “đất cát”, “ruộng đất”, “ruộng vườn”, “ruộng nương”, “ấm chén”, “bát đĩa”, “đỏ đen”, “trắng đen”, “may rủi”, etc.
• All syllables are borrowed from Chinese: “ân nghĩa”, “nam nữ”, “đầu não”, “đấu tranh”, “học tập”, “lợi lộc”, “thuận lợi”, etc.
• One syllable is of Vietnamese origin and one is borrowed from Chinese: “binh lính”, “bụng dạ”, “lính tráng”, “nuôi dưỡng”, “gan dạ”, etc.

2.2 VCL Dictionary

In natural language processing, a dictionary is an essential resource for analyzing language problems, from simple to complex. A good-quality lexicon should provide the language processing system with information at many different levels, such as morphology, grammar, and semantics, and should be usable in both monolingual and multilingual processing systems.

VCL (Vietnamese Computational Lexicon) is a dictionary from Vietlex with 35,000 words, created for natural language processing purposes.
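To illustrate why dictionary coverage matters for segmentation, here is a minimal sketch of greedy longest-match segmentation over syllables. The paper does not say which segmentation algorithm it targets, so the algorithm, the function name `segment`, the `max_len` parameter, and the toy lexicons (which are not VCL) are all assumptions made for illustration.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation over a list of syllables.

    Hypothetical helper: at each position, take the longest run of
    syllables found in the lexicon, falling back to one syllable.
    """
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy data (assumed): the same sentence with and without the compound defined.
sentence = "tôi mua quần áo".split()
small_lexicon = set(sentence)
large_lexicon = small_lexicon | {"quần áo"}

print(segment(sentence, small_lexicon))  # compound split into two words
print(segment(sentence, large_lexicon))  # compound kept as one word
```

With the smaller lexicon the compound “quần áo” is split into two single-syllable words; once the compound is listed, the same procedure keeps it as one word.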
Each word in VCL is represented with morphological, syntactic, and semantic information.

• Morphology: morphological information includes HeadWord and WordType.

Figure 1 Basic information and morphology of “bàn” (noun)

• Syntax: syntactic information includes category (noun, verb, adverb, adjective, etc.), subcategory (proper noun, countable noun, abstract noun, etc.), frame set, forward, and backward.

Figure 2 Syntactics of “bàn” (verb) – frameset
Figure 3 Syntactics of “ăn” (verb) – forward, backward

• Semantics: semantic information includes logical constraints and semantic constraints.
  - Logical constraints include categorial meaning, synonyms, and antonyms. Categorial meaning can be understood as a “semantic word type”; for example, “tướng sĩ” and “tướng tá” belong to “People”, and “trâu” and “bê” belong to “Mammal”. Synonyms and antonyms help with analysing and using words correctly.

Figure 4 Semantics tree

  - Semantic constraints: information about the “semantic role” of a word in a sentence: agent, experiencer, possessor, force, patient, recipient, reference, concomitant, etc.

Figure 5 Semantics information of “bắt” (verb)
Figure 6 VCL in XML format

Chapter 3 Building Coordinated Compound Word Dictionary

Vietnamese word segmentation relies heavily on the definitions of words in a dictionary, so a good dictionary is very important. The existing dictionary contains only a small number of coordinated compound words. The purpose of building a coordinated compound word dictionary is to increase the accuracy of Vietnamese word segmentation when detecting coordinated compound words. Building this dictionary on top of the VCL dictionary involves several steps.

3.1 Finding coordinated compound words already defined in the VCL dictionary

This step is supported by a small web-based system. After this step, the dictionary has more than 1,600 coordinated compound words.
• Use the Rails 3.1 framework with a MongoDB database (through Mongoid).
• Read the VCL dictionary and store it in the database.
• Display the dictionary.
• Approach: most of the coordinated compound words defined in the VCL dictionary have one or more of the characteristics below:
  - [...] the syllables and the word “và”
  - The field contains the reverse of the main word.
• Query all possible cases and sort the results so that the words meeting the most conditions come first.
• Write a script that helps choose the correct coordinated compound words with a single click (the chosen word is then displayed in italics and its flag is set to true).

Figure 1 Example of coordinated compound word

3.2 Classifying compound words and simple words

Try to classify these compound words and the other simple words from the dictionary by ‘categorial meaning’ (semantic word type). Within each class, match two simple words that belong to the same ‘categorial meaning’ to make new coordinated compound words. For example:

giường, chiếu, chăn, màn => giường chiếu, chăn màn, chăn [...], màn chiếu

Figure 2 Classifying the simple words by ‘categorial meaning’

3.3 Finding new coordinated compound words by reversing existing ones

quần áo => áo quần
chung thủy => thủy chung
đỏ đen => đen đỏ
rừng núi => núi rừng
bay lượn => lượn bay

Create all possible reversed words from the coordinated compound words that have already been reviewed. Each newly created word has the same ‘categorial meaning’, [...] the original word.

Figure 3 Listing all reversed forms of the coordinated compound words for checking

3.4 Review and estimate the accuracy of the dictionary

The new coordinated compound words (about 3,000 words) have the same format as the VCL dictionary, so they can easily be used to improve the accuracy of Vietnamese word segmentation.

3.5 Future work

For some reasons (time limits, knowledge of Vietnamese words), [...] the dictionary is still small. Work is continuing on finding more words to enlarge the dictionary.
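The candidate-generation steps of Sections 3.2 and 3.3 can be sketched as follows. This is only an illustration with in-memory data: the paper's own system is built on Rails and its code is not shown, so the function names, the class label "Bedding", and the sample inputs are all hypothetical. Every candidate produced this way still needs the manual review the paper describes.

```python
from itertools import permutations


def reverse_compound(word):
    """Swap the syllable order of a compound, e.g. two syllables a b -> b a."""
    return " ".join(reversed(word.split()))


def reversed_candidates(reviewed, known):
    """Section 3.3 (sketch): reversed forms not yet in the dictionary."""
    seen = set(known)
    candidates = []
    for word in reviewed:
        flipped = reverse_compound(word)
        if flipped not in seen:
            candidates.append(flipped)
            seen.add(flipped)
    return candidates


def pair_candidates(classes, known):
    """Section 3.2 (sketch): ordered pairs of simple words that share
    one 'categorial meaning' and are not yet in the dictionary."""
    candidates = []
    for label, words in classes.items():
        for first, second in permutations(words, 2):
            compound = f"{first} {second}"
            if compound not in known:
                candidates.append((label, compound))
    return candidates


# Assumed sample data, not taken from VCL.
reviewed = ["quần áo", "chung thủy", "đỏ đen"]
known = set(reviewed) | {"đen đỏ"}
print(reversed_candidates(reviewed, known))

# "Bedding" is a made-up class label for illustration.
classes = {"Bedding": ["giường", "chiếu", "chăn", "màn"]}
pairs = pair_candidates(classes, known={"chăn màn"})
print(len(pairs), pairs[:3])
```

Pairing within a class over-generates on purpose: it lists every ordered pair, so a reviewer can keep the real compounds and discard the rest.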

Date posted: 12/04/2014, 15:43
