
Incorporating Context Information for the Extraction of Terms

Katerina T. Frantzi
Dept. of Computing
Manchester Metropolitan University
Manchester, M1 5GD, U.K.
K.Frantzi@doc.mmu.ac.uk

Abstract

The information used for the extraction of terms can be considered rather 'internal', i.e. coming from the candidate string itself. This paper presents the incorporation of 'external' information derived from the context of the candidate string. It is embedded in the C-value approach for automatic term recognition (ATR), in the form of weights constructed from statistical characteristics of the context words of the candidate string.

1 Introduction & Related Work

The applications of term recognition (specialised dictionary construction and maintenance, human and machine translation, text categorization, etc.), and the fact that new terms appear at high speed in some domains (e.g. in computer science), reinforce the need for automating the extraction of terms. ATR also makes it possible to work with large amounts of real data that could not be handled manually. We should note that by ATR we mean neither dictionary string matching nor term interpretation (which deals with the relations between terms and concepts).

Terms may consist of one or more words. When the aim is the extraction of single-word terms, domain-dependent linguistic information (i.e. morphology) is used (Ananiadou, 1994). Multi-word ATR usually uses linguistic information in the form of a grammar that mainly allows noun phrases or compounds to be extracted as candidate terms: (Bourigault, 1992) extracts maximal-length noun phrases and their subgroups (depending on their grammatical structure and position) as candidate terms. (Dagan and Church, 1994) accept sequences of nouns, which gives them high precision, but not as good a recall as that of (Justeson and Katz, 1995), who allow some prepositions (i.e. of) to be part of the extracted candidate terms.
(Frantzi and Ananiadou, 1996) stand between these two approaches, allowing the extracted compounds to contain adjectives but no prepositions. (Daille et al., 1994) also allow adjectives to be part of the two-word English terms they extract.

Of the above, only (Bourigault, 1992) does not use any statistical information. (Justeson and Katz, 1995) and (Dagan and Church, 1994) use the frequency of occurrence of the candidate string as a measure of its likelihood of being a term. (Daille et al., 1994) agree that frequency of occurrence "presents the best histogram", but also suggest the likelihood ratio for the extraction of two-word English terms. (Frantzi and Ananiadou, 1996), besides the frequency of occurrence, also consider the frequency of the candidate string as a part of longer candidate terms, as well as the number of these longer candidate terms it is found nested in.

In this paper, we extend C-value, the statistical measure proposed by (Frantzi and Ananiadou, 1996), by incorporating information gained from the textual context of the candidate term.

2 Context information for terms

The idea of incorporating context information for term extraction came from the observation that "Extended term units are different in type from extended word units in that they cannot be freely modified" (Sager, 1978). Therefore, information from the modifiers of the candidate strings could be used in the procedure of their evaluation as candidate terms. This could be extended beyond adjective/noun modification, to verbs that belong to the candidate string's context. For example, in medical domains the form shows of the verb to show is very often followed by a term, e.g. shows a basal cell carcinoma. There are cases where the verbs that appear with terms can even be domain independent, like the form called of the verb to call, or the form known of the verb to know, which are often involved in definitions in various areas, e.g.
is known as the singular existential quantifier, is called the Cartesian product.

Since context carries information about terms, it should be involved in the procedure for their extraction. We incorporate context information in the form of weights constructed in a fully automatic way.

2.1 The Linguistic Part

The corpus is tagged, and a linguistic filter accepts only specific part-of-speech sequences. The choice of the linguistic filter affects the precision and recall of the results: a 'closed' filter, that is, one that is strict about the part-of-speech sequences it accepts, like the Noun+ filter that (Dagan and Church, 1994) use, will improve the precision but hurt the recall. On the other hand, an 'open' filter, one that accepts more part-of-speech sequences, like that of (Justeson and Katz, 1995), which accepts prepositions as well as adjectives and nouns, will have the opposite effect.

In our choice of the linguistic filter, we lie somewhere in the middle, accepting strings consisting of adjectives and nouns:

(Noun|Adjective)+ Noun    (1)

However, we do not claim that this specific filter should be used in all cases, but that its choice depends on the application: the construction of domain-specific dictionaries requires high coverage, and would therefore tolerate low precision in order to achieve high recall, while when speed is required, high precision is preferable, so that the manual filtering of the extracted list of candidate terms can be as fast as possible. So, in the first case we could choose an 'open' linguistic filter (e.g. one that accepts prepositions), while in the second, a 'closed' one (e.g. one that only accepts nouns).

The type of context involved in the extraction of candidate terms is also an issue. At this stage of the work, adjectives, nouns and verbs are considered. However, further investigation of the context used is needed (as discussed in the future work).
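As an illustration of the filter in expression (1), the following is a minimal sketch rather than the implementation used in this work. It assumes the corpus has already been tagged and that the tags have been reduced to a hypothetical coarse set ("N" for noun, "A" for adjective, anything else for other words). The function name extract_candidates and the choice to return every matching sub-sequence, including nested ones, are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical coarse tags: "N" = noun, "A" = adjective.
# Any other tag ends a run of candidate words.
TaggedToken = Tuple[str, str]


def extract_candidates(sentence: List[TaggedToken]) -> List[Tuple[str, ...]]:
    """Return every contiguous word sequence matching (Noun|Adjective)+ Noun."""
    candidates = []
    n = len(sentence)
    for start in range(n):
        for end in range(start, n):
            _, tag = sentence[end]
            if tag not in ("N", "A"):
                break  # a word that is neither noun nor adjective ends the run
            # The sequence must end in a noun and have at least one word before it.
            if tag == "N" and end > start:
                candidates.append(tuple(w for w, _ in sentence[start:end + 1]))
    return candidates


# Hand-tagged toy sentence, for illustration only.
sentence = [("shows", "V"), ("a", "D"), ("basal", "A"),
            ("cell", "N"), ("carcinoma", "N")]
print(extract_candidates(sentence))
```

Keeping the nested sub-sequences is a deliberate choice here, since the statistical part later needs to know which candidate strings appear inside longer ones. A 'closed' or 'open' filter could be obtained by changing the set of accepted tags.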
2.2 The Statistical Part

The procedure involves the following steps:

Step 1: The raw corpus is tagged, and the strings that obey the (Noun|Adjective)+ Noun expression are extracted from the tagged corpus.

Step 2: For these strings, C-value is calculated, resulting in a list of candidate terms (ranked by C-value as their likelihood of being terms). The length of the string is incorporated in the C-value measure, resulting in C-value':

\[
\text{C-value}'(a) =
\begin{cases}
\log_2 |a| \, f(a) & |a| = \max, \\
\log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right) & \text{otherwise}
\end{cases}
\tag{2}
\]

where
a is the examined string,
|a| the length of a in terms of number of words,
f(a) the frequency of a in the corpus,
T_a the set of candidate terms that contain a,
P(T_a) the number of these candidate terms.

At this point the incorporation of the context information takes place.

Step 3: Since C-value is a measure for extracting terms, the top of the previously constructed list has a higher density of terms than any other part of the list. This top of the list, that is, the 'first' of these ranked candidate terms, will give the weights to the context. We take the top-ranked candidate strings, and from the initial corpus we extract their context, which currently consists of the adjectives, nouns and verbs that surround the candidate term. For each of these adjectives, nouns and verbs, we consider three parameters:

1. its total frequency in the corpus,
2. its frequency as a context word (of the 'first' candidate terms),
3. the number of these 'first' candidate terms it appears with.

These characteristics are combined in the following way to assign a weight to the context word:

\[
\text{Weight}(w) = 0.5 \left( \frac{t(w)}{n} + \frac{ft(w)}{f(w)} \right)
\tag{3}
\]

where
w is the noun/verb/adjective to be assigned a weight,
n the number of the 'first' candidate terms considered,
t(w) the number of candidate terms the word w appears with,
ft(w) w's total frequency appearing with candidate terms,
f(w) w's total frequency in the corpus.
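Steps 2 and 3 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation behind this work: the function names c_value_prime and context_weight are hypothetical, candidate strings are represented as tuples of words, and the case |a| = max is read as "a is not contained in any longer candidate string", so that P(T_a) is never zero in the second case. The frequencies in the example are invented toy values.

```python
import math
from typing import Dict, Tuple

Term = Tuple[str, ...]


def is_nested(short: Term, long: Term) -> bool:
    """True if `short` occurs as a contiguous sub-sequence of the longer `long`."""
    k = len(short)
    return len(long) > k and any(
        long[i:i + k] == short for i in range(len(long) - k + 1)
    )


def c_value_prime(freq: Dict[Term, int]) -> Dict[Term, float]:
    """Equation (2), computed for every candidate string in `freq`."""
    scores = {}
    for a, f_a in freq.items():
        containing = [b for b in freq if is_nested(a, b)]  # the set T_a
        if not containing:
            # Assumption: "|a| = max" is treated as "a is not nested".
            scores[a] = math.log2(len(a)) * f_a
        else:
            mean_longer = sum(freq[b] for b in containing) / len(containing)
            scores[a] = math.log2(len(a)) * (f_a - mean_longer)
    return scores


def context_weight(t_w: int, n: int, ft_w: int, f_w: int) -> float:
    """Equation (3): weight of a context word w."""
    return 0.5 * (t_w / n + ft_w / f_w)


# Toy frequencies, invented for illustration only.
freq = {
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 8,
    ("basal", "cell"): 6,
}
scores = c_value_prime(freq)
for term in sorted(scores, key=scores.get, reverse=True):
    print(" ".join(term), round(scores[term], 3))

# A context word seen with 2 of the 4 top-ranked candidates,
# 3 times as a context word and 10 times in the whole corpus.
print(context_weight(t_w=2, n=4, ft_w=3, f_w=10))
```

In the toy data the two shorter strings are penalised because most of their occurrences come from the longer string that contains them. The weight in equation (3) grows both when a word accompanies many different top-ranked candidates and when a large share of its occurrences are next to such candidates.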
A variation intended to improve the results, which involves human interaction, is the following: the candidate terms used for the extraction of context are first manually evaluated, and only the 'real terms' proceed to the extraction of the context and the assignment of weights (as previously).

At this point a list of context words together with their weights has been created.

Step 4: The list previously created by C-value' is now re-ordered, taking into account the weights obtained from Step 3. For each of the candidate strings of the list, its context (the adjectives, nouns and verbs that surround it) is extracted from the corpus. These context words have either been found at Step 3, and therefore been assigned a weight, or not. In the latter case, they are now assigned a weight equal to 0. Each of these candidate strings is now ready to be assigned a context weight, which is the sum of the weights of its context words:

\[
\text{wei}(a) = \sum_{b \in C_a} \text{Weight}(b) + 1
\tag{4}
\]

where
a is the examined n-gram,
C_a the context of a,
Weight(b) the weight calculated (in Step 3) for the word b.

The candidate terms are now re-ranked according to:

\[
\text{NC-value}(a) = \frac{1}{\log(N)} \, \text{C-value}'(a) \cdot \text{wei}(a)
\tag{5}
\]

where
a is the examined n-gram,
C-value'(a) is calculated in Step 2,
wei(a) is the sum of the context weights for a, calculated in Step 4,
N the size of the corpus in terms of number of words.

3 Future work

Our future work involves:

1. The investigation of the context used for the evaluation of the candidate string, and the amount of information that various kinds of context carry. We said that for this prototype we considered the adjectives, nouns and verbs that surround the candidate string. However, could 'something else' also carry useful information? Should adjectives, nouns and verbs all be considered to carry the same amount of information, or should they be assigned different weights?

2. The investigation of the assignment of weights on the parameters used for the measures.
Currently, the measures contain the parameters in a 'flat' way, that is, without really considering the 'weight' (the importance) of each of them. So, at this point the measures describe which parameters are to be used, and not the degree to which they should be used.

3. The comparison of this method with other ATR approaches. The experimentation on real data will show whether this approach actually improves the results in comparison with previous approaches. Moreover, the application on real data should cover more than one domain.

4 Acknowledgement

I thank my supervisors Dr. S. Ananiadou and Prof. J. Tsujii. I also thank Dr. T. Sharpe from the Medical School of the University of Manchester for the eye-pathology corpus.

References

Sophia Ananiadou. 1988. A Methodology for Automatic Term Recognition. Ph.D Thesis, University of Manchester Institute of Science and Technology.

Didier Bourigault. 1992. Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases. In Proceedings of the International Conference on Computational Linguistics, COLING-92, pages 977-981.

Ido Dagan and Ken Church. 1994. Termight: Identifying and Translating Technical Terminology. In Proceedings of the European Chapter of the Association for Computational Linguistics, EACL-94, pages 34-40.

Béatrice Daille, Éric Gaussier and Jean-Marc Langé. 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of the International Conference on Computational Linguistics, COLING-94, pages 515-521.

Katerina T. Frantzi and Sophia Ananiadou. 1996. A Hybrid Approach to Term Recognition. In Proceedings of the International Conference on Natural Language Processing and Industrial Applications, NLP+IA-96, pages 93-98.

John S. Justeson and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. In Natural Language Engineering, 1:9-27.

Juan C. Sager. 1978. Commentary in Table Ronde sur les Problèmes du Découpage du Terme. Service des Publications, Direction des Francaise, Montreal, 1979, pages 39-52.

Date posted: 22/02/2014
