Data Analysis Machine Learning and Applications Episode 3 Part 6 doc

Quantitative Text Analysis Using L-, F- and T-Segments
Reinhard Köhler and Sven Naumann

Table 1. Text numbers in the corpus with respect to genre and author

          Brentano  Goethe  Rilke  Schnitzler  Total
poetry        10       10     10       -         30
prose          2        9     10      15         36

3 Distribution of segment types

Starting from the hypothesis that L-, F- and T-segments are not only units which are easily defined and easy to determine, but also possess a certain psychological reality, i.e. that they play a role in the process of text generation, it seems plausible to assume that these units display a lawful distributional behaviour similar to that of well-known linguistic units such as words or syntactic constructions (cf. Köhler (1999)). A first confirmation, though on data from only a single Russian text, was found in Köhler (2007). A corresponding test on the data of the present study corroborates the hypothesis. In each of the 66 texts, the rank-frequency distributions of the three kinds of segment patterns follow the Zipf-Mandelbrot distribution, which was fitted to the data in the following form:

$$P_x = \frac{(b+x)^{-a}}{F(n)}, \quad x = 1, 2, 3, \ldots, n, \quad a \in \mathbb{R},\ b > -1,\ n \in \mathbb{N}, \quad F(n) = \sum_{i=1}^{n} (b+i)^{-a} \tag{1}$$

Figure 1 shows the fit of this distribution to the data of one of the texts on the basis of L-segments on a log-log scale. In this case, the goodness-of-fit test yielded $P(\chi^2) \approx 1.0$ with 92 degrees of freedom. N = 941 L-segments were found in the text, forming $x_{max} = 112$ different patterns. Similar results were obtained for all three kinds of segments and all texts.

Fig. 1. Rank-Frequency Distribution of L-Segments

Various experiments with the frequency distributions show promising differences between authors and genres. However, these differences alone do not yet allow for a crisp discrimination.

4 Length distribution of L-segments

As a consequence of our general hypothesis, not only the segment types but also the length of the segments should follow lawful patterns. Here, we study the distribution of L-segment length.
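As an illustration of the Zipf-Mandelbrot form given in Eq. (1) of Section 3, the following minimal sketch computes the probabilities P_x for ranks 1 to n. The function name and the parameter values a = 1.2 and b = 0.5 are hypothetical and chosen only for illustration; they are not the fitted values of the study. Only n = 112 is taken from the text (the number of different L-segment patterns in the example text).

```python
def zipf_mandelbrot_pmf(a, b, n):
    """Return [P_1, ..., P_n] with P_x = (b + x)**(-a) / F(n), as in Eq. (1)."""
    if b <= -1:
        raise ValueError("b must be greater than -1")
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Unnormalised weights (b + x)^(-a) for ranks x = 1..n.
    weights = [(b + x) ** (-a) for x in range(1, n + 1)]
    # Norming constant F(n) = sum_{i=1}^{n} (b + i)^(-a).
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical parameters (illustration only); n = 112 as in the example text.
probs = zipf_mandelbrot_pmf(a=1.2, b=0.5, n=112)
print("sum of probabilities:", round(sum(probs), 6))
print("P_1 =", probs[0], " P_112 =", probs[-1])
```

Fitting a and b to observed rank frequencies would additionally require an optimisation step, which is not shown here.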
First, a theoretical model is set up on the basis of three plausible assumptions:

1. There is a tendency in natural language to form compact expressions. This can be achieved at the cost of more complex constituents on the next level. An example is the following: The phrase "as a consequence" consists of 3 words, where the word "consequence" has 3 syllables. The same idea can be expressed using the shorter expression "consequently", which consists of only 1 word of 4 syllables. Hence, more compact expressions on one level go along with more complex expressions on the next level. Here, the consequence of the formation of longer words is relevant. The variable K will represent this tendency.
2. There is an opposed tendency, viz. word length minimization. It is a consequence of the same tendency of effort minimization which is responsible for the first tendency, but now considered on the word level. We will denote this requirement by M.
3. The mean word length in a language can be considered as constant, at least for a certain period of time. This constant will be represented by q.

According to a general approach proposed by Altmann (cf. Altmann and Köhler (1996)) and substituting k = K − 1 and m = M − 1, the following equation can be set up:

$$P_x = \frac{k + x - 1}{m + x - 1}\, q\, P_{x-1} \tag{2}$$

which yields the hyper-Pascal distribution (cf. Wimmer and Altmann (1999)):

$$P_x = \frac{\binom{k+x-1}{x}}{\binom{m+x-1}{x}}\, q^x P_0, \quad x = 0, 1, 2, \ldots \tag{3}$$

with $P_0^{-1} = {}_2F_1(k, 1; m; q)$, the hypergeometric function, as norming constant. Here, (3) is used in a 1-displaced form because length 0 is not defined, i.e. L-segments consisting of 0 words are impossible. This model is not likely to be adequate for F- and T-segments as well, because the requirements concerning the basic properties frequency and polytextuality do not imply interactions between adjacent levels; for these segments a simpler one can be set up.
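The recurrence in Eq. (2) lends itself to a short numerical sketch of the 1-displaced hyper-Pascal distribution. The code below builds the terms P_x / P_0 step by step and normalises them over a finite range, which is an assumption made here for simplicity: it approximates the norming constant 2F1(k, 1; m; q) by a truncated sum rather than evaluating the hypergeometric function. The function names and the parameter values k = 2, m = 3, q = 0.5 are hypothetical.

```python
def hyper_pascal_pmf(k, m, q, x_max):
    """Probabilities P_0..P_x_max built from the recurrence in Eq. (2).

    Normalisation uses the finite sum over 0..x_max, i.e. a truncated
    approximation of 2F1(k, 1; m; q) (an assumption of this sketch).
    """
    if k <= 0 or m <= 0 or not 0 < q < 1:
        raise ValueError("this sketch assumes k > 0, m > 0 and 0 < q < 1")
    terms = [1.0]  # P_0 up to the norming constant
    for x in range(1, x_max + 1):
        # P_x = (k + x - 1) / (m + x - 1) * q * P_{x-1}
        terms.append(terms[-1] * (k + x - 1) / (m + x - 1) * q)
    total = sum(terms)
    return [t / total for t in terms]


def displaced_length_probs(k, m, q, max_len):
    """1-displaced form: segment length = x + 1, so length 0 never occurs."""
    pmf = hyper_pascal_pmf(k, m, q, max_len - 1)
    return {x + 1: p for x, p in enumerate(pmf)}


# Hypothetical parameter values, for illustration only.
lengths = displaced_length_probs(k=2.0, m=3.0, q=0.5, max_len=30)
for length in (1, 2, 3, 4):
    print(length, round(lengths[length], 4))
```

With q well below 1 the neglected tail is small, so the truncated sum is close to the exact norming constant; for q near 1 a larger range or an exact evaluation would be needed.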
Due to length limitations of our contribution in this volume, we will not describe the appropriate model for these segment types, but it can be said here that their length distributions can be modeled and explained by the hyper-Poisson distribution.

Fig. 2. Theoretical and empirical distribution of L-segments in a poem

Fig. 3. Theoretical and empirical distribution of L-segments in a short story

The empirical tests on the data from the 66 texts support our hypothesis with good and very good $\chi^2$ values. Figures 2 and 3 show typical graphs of the theoretical and empirical distributions as modeled using the hyper-Pascal distribution. Figure 2 is an example of poetry; Figure 3 shows a narrative text. Good indicators of text genre or authors could not yet be found on the basis of these distributions. However, only a few of the available characteristics have been considered so far. The same is true of the corresponding experiments with F- and T-segments.

5 TTR studies

Another hypothesis investigated in our study is the assumption that the dynamic behavior of the segments with respect to the increase of types in the course of the given text, the so-called TTR, is analogous to that of words or other linguistic units. Word TTR has the longest history; the large number of approaches presented in linguistics is described and discussed in Altmann (1988, p. 85-90), who also gives a theoretical derivation of the so-called Herdan model, the most commonly used one in linguistics:

$$y = x^a \tag{4}$$

where x represents the number of tokens, i.e. the individual position of a running word in a text, and y the number of types, i.e. different words. a is an empirical parameter. However, this model is appropriate only in the case of very large inventories, such as the vocabulary of a language. For smaller inventories, other models must be derived (cf. Köhler, R. and Martináková-Rendeková, Z.
(1998), Köhler, R. (2003a) and Köhler, R. (2003b)). For segment TTR, we expect the following model to work, an equation which was derived by Altmann (1980) for the Menzerath-Altmann Law and later in the framework of synergetic linguistics:

$$y = a x^b e^{cx}, \quad c < 0. \tag{5}$$

The value of y at x = 1 must be equal to unity, because the first segment of a text is necessarily the first type; this fixes $a = e^{-c}$. Therefore, we can remove this parameter from the model and simplify (5) as shown in (6):

$$y = e^{-c} x^b e^{cx} = x^b e^{c(x-1)}, \quad c < 0. \tag{6}$$

Figures 4 and 5 show the excellent fits of this model to data from one of the poems and one of the prose texts. Goodness-of-fit was determined using the determination coefficient $R^2$, which was above 0.99 in all 66 cases.

Fig. 4. L-segment TTR of a poem

Fig. 5. L-segment TTR of a short story

The parameters b and c of the TTR model turned out to be quite promising characteristics of text genre and author. They are not likely to discriminate these factors sufficiently when taken alone but seem to carry a remarkable amount of information. Figure 6 shows the relationship between the parameters b and c.

Fig. 6. Relationship between the values of b and c in the corpus

6 Conclusion

Our study has shown that L-, F- and T-segments on the word level display a lawful behavior in all aspects investigated so far and that some of the parameters, in particular those of the TTR, seem promising for text classification. Further investigations on more languages and on more text genres will give more reliable answers to these questions.

References

ALTMANN, G. and KÖHLER, R. (1996): "Language Forces" and Synergetic Modelling of Language Phenomena. In: P. Schmidt [Ed.]: Glottometrika 15. Issues in General Linguistic Theory and The Theory of Word Length. WVT, Trier, 62-76.
ANDERSEN, S.
(2005): Word length balance in texts: Proportion constancy and word-chain-lengths in Proust's longest sentence. Glottometrics 11, 32-50.
BORODA, M. (1982): Häufigkeitsstrukturen musikalischer Texte. In: J. Orlov, M. Boroda, G. Moisei and I. Nadarejšvili [Eds.]: Sprache, Text, Kunst. Quantitative Analysen. Brockmeyer, Bochum, 231-262.
HERDAN, G. (1966): The Advanced Theory of Language as Choice and Chance. Springer, Berlin et al., 423.
KÖHLER, R. (1999): Syntactic Structures. Properties and Interrelations. Journal of Quantitative Linguistics 6, 46-57.
KÖHLER, R. (2000): A study on the informational content of sequences of syntactic units. In: L.A. Kuz'min [Ed.]: Jazyk, glagol, predloženie. K 70-letiju G. G. Sil'nitskogo. Smolensk, 51-61.
KÖHLER, R. and ALTMANN, G. (2000): Probability Distributions of Syntactic Units and Properties. Journal of Quantitative Linguistics 7/3, 189-200.
KÖHLER, R. (2006a): The frequency distribution of the lengths of length sequences. In: J. Genzor and M. Bucková [Eds.]: Favete linguis. Studies in honour of Victor Krupa. Slovak Academic Press, Bratislava, 145-152.
KÖHLER, R. (2006b): Word length in text. A study in the syntagmatic dimension. To appear.
UHLÍŘOVÁ, L. (2007): Word frequency and position in sentence. To appear.
WIMMER, G. and ALTMANN, G. (1999): Thesaurus of Univariate Discrete Probability Distributions. Stamm, Essen.

Structural Differentiae of Text Types – A Quantitative Model
Olga Pustylnikov and Alexander Mehler
Faculty of Linguistics and Literature Study, University of Bielefeld, Germany
{Olga.Pustylnikov, Alexander.Mehler}@uni-bielefeld.de

Abstract. The categorization of natural language texts is a well established research field in computational and quantitative linguistics (Joachims 2002). In the majority of cases, the vector space model is used in terms of a bag of words approach.
That is, lexical features are extracted from input texts in order to train some categorization model and, thus, to attribute, for example, authorship or topic categories. Parallel to these approaches, there has been some effort to perform text categorization not in terms of lexical features but in terms of structural features of documents. More specifically, quantitative text characteristics have been computed in order to derive a sort of structural text signature which nevertheless allows reliable text categorizations (Kelih & Grzybek 2005; Pieper 1975). This "bag of features" approach regains attention when it comes to categorizing websites and other document types whose structure is far away from the simplicity of tree-like structures. Here we present a novel approach to structural classifiers which systematically computes structural signatures of documents. In summary, we present a text categorization algorithm which, in the absence of any lexical features, nevertheless performs a remarkably good classification even if the classes are thematically defined.

1 Introduction

An alternative way to categorize documents, apart from the well established "bag of words" approach, is to categorize by means of structural features. This approach functions in the absence of any lexical information, utilizing quantitative characteristics of documents computed from the logical document structure.¹ That means that markers like content words are completely disregarded. Features like distributions of sections, paragraphs, sentence length etc. are considered instead.

Capturing structural properties to build a classifier assumes that given category separations are reflected by structural differences. According to Biber (1995), we can expect that functional differences correlate with structural and formal representations of text types. This may explain good overall results in terms of F-Measure.²

¹ See also Mehler et al. (2006).
² The harmonic mean of precision and recall is used here to measure the overall success of the classification.

However, the F-Measure gives no information about the quality of the investigated categories. That is, no a priori knowledge is provided about the suitability of the categories for representing homogeneous classes and for applying them in machine learning tasks. Since natural language categories, e.g. in the form of web documents or other textual units, do not necessarily come with a well defined structural representation, it is important to know how the classifier behaves when dealing with such categories.

Here, we investigate a large number of existing categories, thematic classes or rubrics taken from a 10-year newspaper corpus of Süddeutsche Zeitung (SZ 2004), where a rubric represents a recurrent part of the newspaper like 'sports' or 'tv-news'. We systematically test their goodness in a structural classifier framework, asking more specifically for a maximal subset of all rubrics which gives an F-Measure above a predefined cut-off c ∈ [0, 1] (e.g. c = 0.9). We evaluate the classifier in a way that allows us to exclude possible drawbacks with respect to:

• the categorization model used (here SVM³ and Cluster Analysis),⁴
• the text representation model used (here the bag of features approach) and
• the structural homogeneity of categories used.

The first point relates to distinguishing supervised and unsupervised learning. That is, we perform both sorts of learning, although we do not systematically evaluate them comparatively with respect to all possible parameters. Rather, we investigate the potential of our features by evaluating them with respect to both scenarios. The representation format (vector representation) is restricted by the model used (e.g. SVM). Thus, we concentrate on the third point and apply an iterative categorization procedure (ICP)⁵ to explore the structural suitability of categories.
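The F-Measure defined in footnote 2 and the cut-off c mentioned above can be made concrete with a small sketch. It computes the harmonic mean of precision and recall and keeps the rubrics whose individual score exceeds c. This per-rubric filter is a deliberate simplification and not the ICP itself, whose steps are not given in this part of the text; the function names and the precision/recall values below are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rubrics_above_cutoff(scores, c=0.9):
    """Keep rubrics whose own F-Measure exceeds the cut-off c.

    scores maps a rubric name to a (precision, recall) pair.
    Simplified stand-in for the selection step, not the ICP.
    """
    return sorted(
        rubric
        for rubric, (precision, recall) in scores.items()
        if f_measure(precision, recall) > c
    )


# Hypothetical per-rubric results, for illustration only.
example_scores = {
    "sports": (0.95, 0.93),
    "tv-news": (0.97, 0.91),
    "misc": (0.60, 0.40),
}
print(rubrics_above_cutoff(example_scores, c=0.9))
```

In the setting described in the text, removing a rubric changes the classification problem for the remaining ones, so an actual procedure would have to retrain and re-evaluate after each removal.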
In summary, our experiments have two goals:

1. to study given categories using the ICP in order to filter out structurally inconsistent types, and
2. to make judgements about the structural classifier's behavior when dealing with categories of different size and quality levels.

2 Category selection

The 10-year corpus of the SZ used in the present study contains 95 different rubrics. The frequency distribution of these rubrics shows an enormous inequality for the whole set (see Figure 1). In order to minimize the calculation effort, we reduce the initial set of 95 rubrics to a smaller subset according to the following criteria.

1. First, we compute the mean $\mu$ and the standard deviation $\sigma$ of the rubric cardinalities for the whole set.
2. Second, we pick out all rubrics R whose cardinality |R| (the number of examples within the corpus) lies in the interval

$$\mu - \sigma/2 < |R| < \mu + \sigma/2$$

Fig. 1. Categories/Articles-Distribution of 95 Rubrics of SZ (axes: categories, 0 to 100; articles, 0 to 12000).

This selection method specifies a window around the mean value of all documents, leaving out the unusual cases.⁶ The resulting subset comprises 68 categories.

3 The evaluation procedure

The data representation format for the subset of rubrics uses a vector representation (bag of features approach) where each document is represented by a feature vector.⁷ The vectors are calculated as structural signatures of the underlying documents. To avoid drawbacks (see Sec. 1) caused by the evaluation method in use, we compare three different categorization scenarios:

1. a supervised scenario by means of SVM-light,⁸
2. an unsupervised scenario in terms of Cluster Analysis, and
3. a baseline experiment based on random clustering.

³ Support Vector Machines.
⁴ Supervised vs. unsupervised, respectively.
⁵ See Sec. 4.
⁶ The method is taken from Bock (1974).
Rieger (1989) uses it to identify above-average agglomeration steps in the clustering framework. Gleim et al. (2007) successfully applied the method to develop quality filters for wiki articles.
⁷ See Mehler et al. (2007) for a formalization of this approach.
⁸ Joachims (2002).
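The category selection step of Section 2 (keep every rubric whose size lies within half a standard deviation of the mean rubric size) can be sketched as follows. The function name and the rubric sizes are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def select_rubrics(sizes):
    """Keep rubrics whose size lies strictly inside mean +/- stdev / 2.

    sizes maps a rubric name to its number of articles.
    The population standard deviation (pstdev) is an assumption.
    """
    values = list(sizes.values())
    center = mean(values)
    half_spread = pstdev(values) / 2
    lower, upper = center - half_spread, center + half_spread
    return {name: n for name, n in sizes.items() if lower < n < upper}


# Hypothetical rubric sizes, for illustration only.
example_sizes = {"a": 100, "b": 120, "c": 110, "d": 5000, "e": 3}
print(sorted(select_rubrics(example_sizes)))
```

In this toy input the very large rubric "d" and the very small rubric "e" fall outside the window, which mirrors the stated aim of leaving out the unusual cases.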
