Data Analysis, Machine Learning and Applications (Episode 3, Part 3)

Analysis of Dwell Times in Web Usage Mining

Patrick Mair (1) and Marcus Hudec (2)

(1) Department of Statistics and Mathematics, and ec3, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Vienna, Austria, patrick.mair@wu-wien.ac.at
(2) Department of Scientific Computing, and ec3, University of Vienna, Universitätsstr. 5, 1010 Vienna, Austria, marcus.hudec@univie.ac.at

Abstract. In this contribution we focus on the dwell times a user spends on various areas of a web site within a session. We assume that dwell times may be adequately modeled by a Weibull distribution, a flexible and common choice in survival analysis. Furthermore, we introduce heterogeneity through various parameterizations of the dwell time densities by means of proportional hazards models. Under these assumptions the observed data stem from a mixture of Weibull densities. Estimation is based on the EM algorithm, and model selection may be guided by the BIC. Identification of the mixture components corresponds to a segmentation of users/sessions. A real-life data set stemming from the analysis of a worldwide operating eCommerce application is provided. The corresponding computations are performed with the mixPHM package in R.

1 Introduction

Web usage mining focuses on the analysis of the visiting behavior of users on a web site. The common starting point is the so-called click-stream data, which are derived from web-server logs and may be viewed as the electronic trace a user leaves on a web site. Adequate modeling of the dynamics of browsing behavior is of particular relevance for the optimization of eCommerce applications. Recently, Montgomery et al. (2004) proposed a dynamic multinomial probit model of navigation patterns which led to a remarkable increase of conversion rates. Park and Fader (2004) developed multivariate exponential-gamma models which enhance cross-site customer acquisition. These papers indicate the potential that such approaches offer for web-shop providers.

In this paper we focus on modeling dwell times, i.e., the time a user spends viewing a particular page impression. They are defined by the time span between two subsequent page requests and can be calculated by taking the difference between the two logged time points at which the page requests were issued. For the analysis of complex web sites which consist of a large number of pages, it is often reasonable to reduce the number of different pages by aggregating individual page impressions into semantically related page categories reflecting meaningful regions of the web site.

The analysis of dwell times is an important source of information with regard to the relevance of the content for different users and the effectiveness of a page in attracting visitors. In this paper we are particularly interested in the segmentation of users into groups which exhibit similar behavior with regard to the dwell times they spend on various areas of the site. Such a segmentation analysis is an important step towards a better understanding of the way a user interacts with a web site. It is therefore relevant for the prediction of user behavior as well as for a user-specific customization or even personalization of web sites.
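As an aside, the derivation of dwell times from logged request time points described above is easy to sketch in R, the language used throughout this paper. The following toy example is purely illustrative; the data frame and its column names are our own assumptions, not the paper's data structures.

logs <- data.frame(
  session = c(1, 1, 1, 2, 2),
  time    = c(0, 35, 212, 0, 48),   # logged request times in seconds
  page    = c("landing", "search", "checkout", "landing", "service")
)
logs <- logs[order(logs$session, logs$time), ]
# dwell time = difference between two subsequent request time stamps;
# the last page view of a session has no successor and is dropped here
logs$dwell <- ave(logs$time, logs$session, FUN = function(x) c(diff(x), NA))
dwell <- na.omit(logs[, c("session", "page", "dwell")])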
2 Model specification and estimation

2.1 Weibull mixture model

Since survival analysis focuses on duration times until some event occurs (e.g., the death of a patient in medical applications), it seems straightforward to apply these concepts to the analysis of dwell times in web usage mining applications.

With regard to the dwell time distributions, we assume that they follow a Weibull distribution with density function f(t) = \lambda \gamma t^{\gamma-1} \exp(-\lambda t^{\gamma}), where \lambda is a scale parameter and \gamma the shape parameter. For modeling the heterogeneity of the observed population, we assume K latent segments of sessions. While the Weibull assumption holds within all segments, different segments exhibit different parameter values. This leads to the underlying idea of a Weibull mixture model. For each page category p (p = 1, ..., P) under consideration, the resulting mixture has the form

  f(t_p) = \sum_{k=1}^{K} \pi_k f(t_p; \lambda_{pk}, \gamma_{pk}) = \sum_{k=1}^{K} \pi_k \lambda_{pk} \gamma_{pk} t_p^{\gamma_{pk}-1} \exp(-\lambda_{pk} t_p^{\gamma_{pk}}),    (1)

where t_p represents the dwell time on page category p and the mixing proportions \pi_k correspond to the relative size of each segment.

In order to reduce the number of parameters involved, we impose restrictions on the hazard rates of the different components of the mixture and pages, respectively. An elegant way of doing this is offered by the concept of Weibull proportional hazards models (WPHM). The general formulation of a WPHM (see, e.g., Kalbfleisch and Prentice (1980)) is

  h(t; Z) = \lambda \gamma t^{\gamma-1} \exp(Z\beta),    (2)

where Z is a matrix of covariates and \beta are the regression parameters. The term \lambda \gamma t^{\gamma-1} is the baseline hazard rate h_0(t) due to the Weibull assumption, and h(t; Z) is the hazard proportional to h_0(t) resulting from the regression part of the model.
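To make Eq. (1) concrete, the following R sketch evaluates a Weibull mixture density for a single page category. Note that R's dweibull() uses a shape/scale parameterization, so the rate-type scale parameter \lambda used above translates into scale = \lambda^{-1/\gamma}; all parameter values below are invented for illustration.

# Density in the paper's parameterization:
# f(t) = lambda * gamma * t^(gamma - 1) * exp(-lambda * t^gamma)
dweib_rate <- function(t, lambda, gamma) {
  dweibull(t, shape = gamma, scale = lambda^(-1 / gamma))
}

# Mixture density of Eq. (1) for one page category, with weights w = pi_k
dmix <- function(t, w, lambda, gamma) {
  rowSums(sapply(seq_along(w), function(k) w[k] * dweib_rate(t, lambda[k], gamma[k])))
}

t  <- seq(1, 100, length.out = 100)
ft <- dmix(t, w = c(0.6, 0.4), lambda = c(0.05, 0.005), gamma = c(1.5, 0.8))

With such a two-segment mixture, one component can capture short, quickly decaying dwell times while the other captures long visits, which is exactly the kind of heterogeneity the latent segments are meant to absorb.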
2.2 Parsimonious modeling strategies

We propose five different models with respect to different proportionality restrictions on the hazard rates so as to reduce the number of parameters. In the mixPHM package by Mair and Hudec (2007) the most general model is called separate: the WPHM is computed for each component and page separately. Hence, the hazard of session i belonging to component k (k = 1, ..., K) on page category p (p = 1, ..., P) is

  h(t_{i,p}; 1) = \lambda_{k,p} \gamma_{k,p} t_{i,p}^{\gamma_{k,p}-1} \exp(\beta \cdot 1).    (3)

The parameter matrices can be represented jointly as

  \Lambda = \begin{pmatrix} \lambda_{1,1} & \cdots & \lambda_{1,P} \\ \vdots & \ddots & \vdots \\ \lambda_{K,1} & \cdots & \lambda_{K,P} \end{pmatrix}    (4)

for the scale parameters and

  \Gamma = \begin{pmatrix} \gamma_{1,1} & \cdots & \gamma_{1,P} \\ \vdots & \ddots & \vdots \\ \gamma_{K,1} & \cdots & \gamma_{K,P} \end{pmatrix}    (5)

for the shape parameters. Both the scale and the shape parameters can vary freely, and there is no assumption of hazard proportionality in the separate model. In fact, the parameters (2 × K × P in total) are the same as if they were estimated directly by means of a Weibull mixture model.

Next, we impose a proportionality assumption across the latent components. In the classification version of the EM algorithm (see next section), each iteration step yields a "crisp" assignment of each session to a component. Thus, if we consider this component vector g as a main effect in the WPHM, i.e., h(t; g), we impose proportional hazards for the components across the pages (main.g in mixPHM). Again, the elements of the scale parameter matrix \Lambda can vary freely, whereas the shape parameter matrix reduces to the vector \Gamma = (\gamma_{1,1}, ..., \gamma_{1,P}). Thus, the shape parameters are constant over the components, and the number of parameters is reduced to K × P + P.

If we impose page main effects in the WPHM, i.e., h(t; p) or main.p, respectively, the elements of \Lambda are again not restricted at all, but this time the shape parameters are constant over the pages, i.e., \Gamma = (\gamma_{1,1}, ..., \gamma_{K,1}). The total number of parameters is now K × P + K.

For the main-effects model h(t; g + p) we impose proportionality restrictions on both \Lambda and \Gamma such that the total number of parameters is reduced to K + P. For the scale parameter matrix, the proportionality restrictions of this main.gp model hold row-wise as well as column-wise:

  \Lambda = \begin{pmatrix} \lambda_1 & c_2 \lambda_1 & \cdots & c_P \lambda_1 \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_K & c_2 \lambda_K & \cdots & c_P \lambda_K \end{pmatrix} = \begin{pmatrix} \lambda_1 & \cdots & \lambda_P \\ d_2 \lambda_1 & \cdots & d_2 \lambda_P \\ \vdots & \ddots & \vdots \\ d_K \lambda_1 & \cdots & d_K \lambda_P \end{pmatrix}.    (6)

The c- and d-scalars are proportionality constants over the pages and components, respectively. The shape parameters are constant over both components and pages. Thus, \Gamma reduces to a single shape parameter \gamma, which implies that the hazard rates are proportional over components and pages.

To relax the rather restrictive assumption with respect to \Lambda, we can extend the main-effects model by the corresponding component-page interaction term, i.e., h(t; g * p). In mixPHM notation this model is called int.gp. The elements of \Lambda can vary freely, whereas \Gamma is again reduced to one parameter only, leaving us with a total of K × P + 1 parameters. With respect to the hazard rates, this relaxation again implies proportional hazards over components and pages.

2.3 EM estimation of parameters

In order to estimate such mixtures of WPHM, we use the EM algorithm (Dempster et al. (1977), McLachlan and Krishnan (1997)). In the E-step we establish the expected likelihood values for each session with respect to the K components. At this point it is important to take into account the probability that a session i of component k visits page p, denoted by Pr_{k,p}, which is estimated by the corresponding relative frequency. The elements of the resulting K × P matrix are model parameters and have to be taken into account when determining the total number of parameters. The resulting likelihood W_{k,p}(s_i) for session i being in component k, for each page p individually, is

  W_{k,p}(s_i) = \begin{cases} f(y_p; \hat{\lambda}_{k,p}, \hat{\gamma}_{k,p}) \Pr_{k,p}(s_i) & \text{if } p \text{ was visited by } s_i, \\ 1 - \Pr_{k,p}(s_i) & \text{if } p \text{ was not visited by } s_i. \end{cases}    (7)

To establish the joint likelihood, a crucial assumption is made: independence of the dwell times over page categories. To make this assumption feasible, a well-advised page categorization must be established. For instance, if some page categories were hierarchical, the independence assumption would not hold. Without this independence assumption, a multivariate Weibull mixture model would have to be fitted which takes into account the covariance structure of the observations. This would require each session to have a full observation vector of length P, i.e., each page category would have to be visited within each session, which does not seem realistic in the context of dwell times in web usage mining. Given a reasonable independence assumption, however, the likelihood over all pages that session i belongs to component k is

  L_k(s_i) = \prod_{p=1}^{P} W_{k,p}(s_i).    (8)

Thus, by looking at each session i separately, a vector of likelihood values (L_1(s_i), L_2(s_i), ..., L_K(s_i)) results.

At this point the M-step is carried out. The mixPHM package provides three different methods. The classical version of the EM algorithm (maximization EM; EMoption = "maximization" in mixPHM) computes the posterior probabilities that session i belongs to group k and does not make a group assignment within each iteration step, but rather updates the matrix of posterior probabilities Q.
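Before turning to the remaining M-step variants, here is a hedged sketch of the per-session E-step quantities in Eqs. (7) and (8). Working on the log scale avoids numerical underflow when multiplying many small likelihood contributions; the object names and input layout are illustrative assumptions, and mixPHM's internal implementation may differ.

# xi:       dwell-time vector of one session (length P, 0 = page not visited)
# Lam, Gam: K x P matrices of estimated scale (rate-type) and shape parameters
# Pr:       K x P matrix of estimated visit probabilities Pr_{k,p}
logLik_k <- function(xi, k, Lam, Gam, Pr) {
  v <- xi > 0
  sum(log(Pr[k, v]) +
      dweibull(xi[v], shape = Gam[k, v],
               scale = Lam[k, v]^(-1 / Gam[k, v]), log = TRUE)) +
    sum(log(1 - Pr[k, !v]))   # non-visited pages contribute 1 - Pr_{k,p}, Eq. (7)
}

# Log-likelihood vector over all K components for one session, cf. Eq. (8):
# sapply(1:K, function(k) logLik_k(xi, k, Lam, Gam, Pr))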
A faster EM version, called classification EM, was proposed by Celeux and Govaert (1992) (EMoption = "classification" in mixPHM): within each iteration step, each session is assigned to the component with the largest likelihood value L_k(s_i), so the computation of the posterior matrix is not needed. A randomized version of the M-step combines the two approaches above: after the computation of the posterior matrix Q, a randomized group assignment is performed according to the corresponding probability values (EMoption = "randomization"). As usual, the joint likelihood L is updated at each EM iteration l until a certain convergence criterion \varepsilon is reached, i.e., |L^{(l)} - L^{(l-1)}| < \varepsilon. Theoretical issues concerning EM convergence in Weibull mixture models can be found in Ishwaran (1996) and Jewell (1982).

3 Real-life example

In this section we use a real data set of a large Austrian company which runs a web-shop to demonstrate our modeling approach. We restrict the empirical analysis to a subset of 333 buying sessions and 7 page categories and perform a dwell-time-based clustering under the corresponding hazard proportionality assumptions by using the mixPHM package in R (R Development Core Team, 2007).

   bestview checkout service figurines jewellery landing search
6        16      592      30        12       183       0     13
15      136      157       0       139       430      11      0
23      428     2681      17      2058      2593      56    186
37      184      710      52        12       450      34      0
61        0      874     307       570         6      25     53

The above extract of the data matrix shows the dwell times of 5 sessions; non-visited page categories are coded as 0.

We start with a rather exploratory approach to determine an appropriate proportionality model with an adequate number of clusters K. Using the msBIC function we can accomplish such a heuristic model search.

> res.bic <- msBIC(x, K = 2:5, method = "all")
> res.bic
Bayes Information Criteria
Survival distribution: Weibull
               K=2      K=3      K=4      K=5
separate  23339.27 23202.23 23040.01 22943.11
main.g    23355.66 23058.25 22971.86 22863.43
main.p    23503.73 23368.77 23165.60 23068.47
int.gp    23572.21 23422.51 23305.63 23075.76
main.gp   23642.74 23396.51 23271.72 23087.64

It is obvious that the main.g model with K = 5 components fits quite well compared to the other models (if we fit models for K > 5, the BIC values do not decrease noticeably anymore). To demonstrate the imposed hazard proportionalities, we compare this model to the more flexible separate model. First, we fit the two models again using the phmclust function, which is the core routine of the mixPHM package. The matrices of shape parameters \Gamma_{sep} and \Gamma_{g}, respectively, for the first 5 pages (due to limited space) are:

> res.sep <- phmclust(x, 5, method = "separate")
> res.sep$shape[, 1:5]
           bestview checkout   service figurines jewellery
Component1 3.686052 2.692687 0.8553160 0.9057708 1.2503048
Component2 1.327496 3.393152 1.6260679 0.9716507 0.9941698
Component3 1.678135 2.829635 1.0417360 1.0706117 0.6902553
Component4 1.067241 1.847353 0.9860697 0.9339892 0.6321027
Component5 1.369876 2.030376 1.4565000 0.6434554 1.2414859

> res.g <- phmclust(x, 5, method = "main.g")
> res.g$shape[, 1:5]
           bestview checkout  service figurines jewellery
Component1 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component2 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component3 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component4 1.362342 2.981528 1.116042 0.7935599 0.9145463
Component5 1.362342 2.981528 1.116042 0.7935599 0.9145463

The shape parameters in the latter model are constant across components.
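The Weibull hazard rate underlying the following figures is h(t) = \lambda \gamma t^{\gamma-1}, so the fitted hazards can be sketched directly from the parameter matrices. The snippet below assumes the fitted object also carries a scale matrix in res.g$scale, analogous to the $shape matrix printed above; that slot name and its exact parameterization are assumptions on our part, not documented mixPHM behavior.

# Weibull hazard with rate-type scale lambda; if the package stores the
# R-type scale b instead, convert via lambda = b^(-gamma)
haz <- function(t, lambda, gamma) lambda * gamma * t^(gamma - 1)

t <- seq(0.5, 100, length.out = 200)
p <- "service"
plot(t, haz(t, res.g$scale[1, p], res.g$shape[1, p]), type = "l",
     xlab = "Dwell Time", ylab = "Hazard Function", main = p)
for (k in 2:5) {
  lines(t, haz(t, res.g$scale[k, p], res.g$shape[k, p]), lty = k)
}
legend("topright", legend = paste("Cluster", 1:5), lty = 1:5)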
As a consequence, the page-wise within-group hazard rates can vary freely for both models, while the group-wise within-page hazard rates can cross only for the separate model (see Figure 1). From Figure 2 it is obvious that the hazards are proportional across components for each page. Note that, due to space limitations, both plots show only three selected pages to demonstrate the hazard characteristics. The hazard plots allow one to assess the relevance of different page categories with respect to cluster formation. Similar plots for the dwell time distributions are available.

Fig. 1. Hazard plot for model separate: Weibull hazard functions of clusters 1-5 over dwell time, for the pages bestview, service, and figurines.

Fig. 2. Hazard plot for model main.g: Weibull hazard functions of clusters 1-5 over dwell time, for the pages bestview, service, and figurines.

4 Conclusion

In this work we presented a flexible framework for analyzing dwell times on web pages by adopting concepts from survival analysis for probability-based clustering. Unobserved heterogeneity is modeled by mixtures of Weibull-distributed dwell times. Application of the EM algorithm leads to a segmentation of sessions.

Since the Weibull distribution is rather richly parameterized, it offers a sizeable amount of flexibility for the hazard rates. More parsimonious modeling may be achieved either by imposing proportionality restrictions on the hazards or by making simpler distributional assumptions (e.g., constant hazard rates). The mixPHM package therefore covers additional survival distributions such as the Exponential, Rayleigh, Gaussian, and Log-logistic.

A segmentation of sessions as achieved by our method may serve as a starting point for the optimization of a web site. Identification of typical user behavior allows an efficient dynamic modification of content as well as an optimization of adverts for different groups of users.

References

CELEUX, G. and GOVAERT, G. (1992): A Classification EM Algorithm for Clustering and Two Stochastic Versions. Computational Statistics & Data Analysis, 14, 315-332.
DEMPSTER, A.P., LAIRD, N.M. and RUBIN, D.B. (1977): Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
ISHWARAN, H. (1996): Identifiability and Rates of Estimation for Scale Parameters in Location Mixture Models. The Annals of Statistics, 24, 1560-1571.
JEWELL, N.P. (1982): Mixtures of Exponential Distributions. The Annals of Statistics, 10, 479-484.
KALBFLEISCH, J.D. and PRENTICE, R.L. (1980): The Statistical Analysis of Failure Time Data. Wiley, New York.
MAIR, P. and HUDEC, M. (2007): mixPHM: Mixtures of Proportional Hazard Models. R package version 0.5.0. http://CRAN.R-project.org/
MCLACHLAN, G.J. and KRISHNAN, T. (1997): The EM Algorithm and Extensions. Wiley, New York.
MONTGOMERY, A.L., LI, S., SRINIVASAN, K. and LIECHTY, J.C. (2004): Modeling Online Browsing and Path Analysis Using Clickstream Data. Marketing Science, 23, 579-595.
PARK, Y. and FADER, P.S. (2004): Modeling Browsing Behavior at Multiple Websites. Marketing Science, 23, 280-303.
R DEVELOPMENT CORE TEAM (2007): R: A Language and Environment for Statistical Computing. Vienna, Austria. ISBN 3-900051-07-0.

Classifying Number Expressions in German Corpora

Irene Cramer (1), Stefan Schacht (2), Andreas Merkel (2)

(1) Dortmund University, Germany, irene.cramer@uni-dortmund.de
(2) Saarland University, Germany, {stefan.schacht, andreas.merkel}@lsv.uni-saarland.de

Abstract. Number and date expressions are essential information items in corpora and therefore play a major role in various text mining applications. So far, however, number expressions have been investigated in a rather superficial manner. In this paper we introduce a comprehensive number classification and present promising initial results of a classification experiment using various machine learning algorithms (amongst others AdaBoost and Maximum Entropy) to extract and classify number expressions in a German newspaper corpus.

1 Introduction

In many natural language processing (NLP) applications such as Information Extraction and Question Answering, number expressions play a major role: questions about the altitude of a mountain, the final score of a football match, or the opening hours of a museum make up a significant amount of users' information needs. However, common Named Entity task definitions do not consider number and date/time expressions in detail, or, as in the Conference on Computational Natural Language Learning (CoNLL) 2003 shared task (Tjong Kim Sang (2003)), do not incorporate them at all. We therefore present a novel, extended classification scheme for number expressions which covers all Message Understanding Conference (MUC) types (Chinchor (1998a)) but additionally includes various structures not considered in common Named Entity definitions. In our approach, numbers are classified according to two aspects: their function in the sentence and their internal structure. We argue that our classification covers most of the number expressions occurring in text corpora. Based on this classification scheme, we have annotated the German CoNLL 2003 data and trained various machine learning algorithms to automatically extract and classify number expressions. We also plan to incorporate the number extraction and classification system described in this paper into an open-domain Web-based Question Answering system for German.

As mentioned above, the recognition of certain date, time, and number expressions is especially important in the context of Information Extraction and Question Answering. For example, the MUC Named Entity definitions (Chinchor (1998b)) include the following basic types: date and time (<TIMEX>) as well as monetary amount and percentage (<NUMEX>), and thus fostered the development of extraction systems able to handle number and date/time expressions. Famous Information Extraction systems developed in conjunction with MUC are, e.g., FASTUS (Appelt et al. (1993)) and LaSIE (Humphreys et al. (1998)). At that time, many researchers used finite-state approaches to extract Named Entities.
More recent Named Entity definitions, such as CoNLL 2003 (Tjong Kim Sang (2003)), which aim at the development of machine-learning-based systems, however, again excluded number and date expressions. Nevertheless, due to the increasing interest in Question Answering and the TREC QA tracks (Voorhees et al. (2000)), a number of research groups have recently been investigating various techniques to quickly and accurately extract information items of different types from text corpora and the Web, respectively. Many answer typologies naturally include number and date expressions, e.g., the ISI Question Answer Typology (Hovy et al. (2002)). Unfortunately, the corresponding papers only report the performance of the whole Question Answering system, so we could not find any performance values that would be directly comparable to our results. A very interesting and partially comparable work (Ahn et al. (2005)), which considers only a small fraction of our classification, investigates the extraction and interpretation of time expressions; the reported accuracy values range between about 40% and 75%.

Paper plan: This paper is structured as follows. Section 2 presents our classification scheme and the annotation. Section 3 deals with the features and the experimental setting. Section 4 analyzes the results and comments on future perspectives.

2 Classification of number expressions

Many researchers use regular expressions to find numbers in corpora; however, most numbers are part of a larger construct such as "2,000 miles" or "Paragraph 249 Bürgerliches Gesetzbuch". Consequently, the number without its context has no meaning or is highly ambiguous (2,000 miles vs. 2,000 cars). In applications such as Question Answering it is therefore necessary to detect this additional information. Table 1 shows example questions that obviously ask for number expressions as answers. The examples clearly indicate that we are not looking for mere digits but for multi-word units or even phrases consisting of a number and its specifying context. Thus, a number is not a stand-alone piece of information and, as the examples show, might not even look like a number at all. This paper therefore proposes a novel, extended classification that handles number expressions similarly to Named Entities and thus provides a flexible and scalable method to incorporate these various entity types into one generic framework. We classify numbers according to their internal structure (which corresponds to their text extension) and their function (which corresponds to their class). We also included all MUC types to guarantee that our classification conforms with previous work.
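As a toy illustration of the ambiguity argument above (not the system described in this paper), a bare digit regex loses exactly the context that disambiguates the number, whereas extending the match by the following token keeps it:

x <- c("2,000 miles", "2,000 cars", "Paragraph 249 Buergerliches Gesetzbuch")
bare <- regmatches(x, regexpr("[0-9][0-9.,]*", x))
# "2,000" "2,000" "249"  -- ambiguous without context
ctx <- regmatches(x, regexpr("[0-9][0-9.,]*\\s+[[:alpha:]]+", x))
# "2,000 miles" "2,000 cars" "249 Buergerliches"  -- specifying context retained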
[The preview continues with disconnected fragments of other contributions in this volume.]

From "Comparing Homograph Norms with Corpus Data":

Table 3. Overall quantitative results for several concordance widths and various numbers of associations considered. (The label of the first row, S1 > SS, is inferred from the accompanying text; the last row is cut off in this preview.)

Width    |      ±1       |      ±3       |      ±10      |      ±30      |     ±100      |     ±300      |     ±1000
Assoc    | 2   3   4   5 | 2   3   4   5 | 2   3   4   5 | 2   3   4   5 | 2   3   4   5 | 2   3   4   5 | 2   3   4   5
S1 > SS  | 1   2   3   4 | 15  23  36  41| 34  59  85  90| 59  84  95 101| 77  97 108 114| 85  96 105 102| 75  82  87  90
S2 > SS  | 0   1   3   4 | 10  22  30  42| 32  53  64  67| 54  66  86  86| 74  86  94  97| 77  85  94 103| 76  89  88  92
S1 > S2  | 1   2   3   4 | 13  22  29  33| 31  54  67  68| 47  65  65  70| 60  67  65  68| 68  66  63  68| 65  59  68  68
S1 = SS  | 131 126 123 120|105 94  74  67| 78  45  24  18| 43  22  16  10| 18  9   4   2 | 8   2   1   1 | 2   1   0   0
S2 = SS  | 132 127 123 120|108 97  79  65| 70  51  35  26| 39  25  [...]

"[...] four, and 114 with five. The corresponding sequence of ratios (S1 > SS / S1 < SS) looks even better: 1.88, 3.13, 4.70, and 6.33. This means that [...]"

From "Classifying Number Expressions in German Corpora":

[An illegible fragment of a score table follows; only the row label "itemization.score" is recoverable.]

"[...] cases and thus significantly outperforms the rest of the classifiers."

Table 5. Overview of the F-measure values (AB: AdaBoost, DT: Decision Tree, KNN: k-Nearest Neighbor, ME: Maximum Entropy, NB: Naive Bayes). Only one row label survives in this preview:

class   AB    DT    KNN   ME    NB       class   AB    DT    KNN   ME    NB
other   0.99  0.37  0.67  0.72  0.53     [...]   0.37  0.43  0.54  0.82  0.49
[...]   0.87  0.41  0.38  0.21  0.84     [...]   0.99  0.13  0.73  0.61  0.15
[...]   0.05  0.38  0.36  0.73  0.43     [...]   0.76  0.40  0.02  0.28  0.31

"[...] examples and a short explanation of the class' sense and extension.

2.2 Corpora and annotation

According to our findings in Web data and newspaper corpora, we developed guidelines which we used to annotate the German CoNLL 2003 data. To ensure a consistent and accurate annotation of the corpus, we worked every part over in several passes and performed a special reviewing process for critical cases. Table 3 shows [...]"

From "Comparing Homograph Norms with Corpus Data":

Table 2. Results for the first ten homographs (numbers to be multiplied by 10^-6).

Homograph  S1    S2    SS       Homograph  S1    S2    SS
bar        223   199   37       board      205   799   53
beam       1305  1424  123      bolt       1794  3747  962
bill       166   95    202      bound      675   692   139
block      194   945   112      bowl       327   644   25
bluff      934   2778  226      break      156   63    95

"In Table 3, for each concordance width we also distinguish four cases, where each relates to a different number [...]"

[Reference-list fragments from "Classifying Number Expressions in German Corpora":]

[...] Proceedings of the Message Understanding Conference (MUC-7).
TJONG KIM SANG, E.F. and DE MEULDER, F. (2003): Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Proceedings of the Conference on Computational Natural Language Learning.
VOORHEES, E. and TICE, D. (2000): Building a Question Answering Test Collection. Proceedings of SIGIR-2000.
WITTEN, I.H. and FRANK, E. (2005): Data Mining: Practical Machine Learning [...]
[...] STICKEL, M. and TYSON, M. (1993): FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. SRI International.
BIKEL, D., MILLER, S., SCHWARTZ, R. and WEISCHEDEL, R. (1997): Nymble: A High-Performance Learning Name-Finder. Proceedings of the 5th ANLP.
CARRERAS, X., MÀRQUEZ, L. and PADRÓ, L. (2003): A Simple Named Entity Extractor Using AdaBoost. Proceedings of CoNLL-2003.
CHINCHOR, N. [...]

From "Comparing Homograph Norms with Corpus Data":

"[...] the readings 60, 67, 65, and 68 from Table 3, and can compute the corresponding values of 58, 61, 66, and 64 for S1 < S2. Both sequences appear very similar. Interpreted linguistically, this means that intra-sense association strengths tend to be similar for the primary and the secondary sense, at least for our selection of homographs. Let us finally look at columns 3 and 4 of Table 3, which should give us [...]"

From "Classifying Number Expressions in German Corpora":

"[...] possible translations: 'more than 3 years' or 'for 3 years'. In German, such structures are typically disambiguated by prosody. Particular text type: a comparison between CoNLL and the corpora we used to develop our guidelines showed that there might be a very particular style. We also had the impression that the CoNLL training and test data differ with respect to type distribution and style. We therefore based [...]"