Data Analysis Machine Learning and Applications Episode 2 Part 5 pps

ADSL customer segmentation combining several SOMs
Francoise Fessant et al.

[Figure 6 (diagram). Labels: log file; customers' daily activity profiles per application (web down, P2P up, unknown down); Step 1 to Step 5; typical application days; binary profiles; concatenation; global daily activity profile; typical days; proportion of days spent in each "typical day" for the month; typical customers.]
Fig. 6. The multi-level exploratory data analysis approach.

The first step leads to the formation of 9 to 13 clusters of “typical application days” profiles, depending on the application. Their behaviours can be summarized into inactive days, days with an average or high activity on some limited time periods (early or late evening, or noon, for instance), and days with a very high activity on a long time segment (working hours, afternoon or night). Figure 7 illustrates the result of the first step for one application: it shows the mean hourly volume profiles of the 13 clusters revealed after the clustering for the web down application (the mean profile of a cluster is computed as the mean of all the observations that have been classified in the cluster; the hourly volumes are plotted in natural statistics). The other applications can be described similarly.

[Figure 7 (plot). Web down application: volume (in bytes, scale x 10^6) against hours (0 to 25) for clusters C1 to C13.]
Fig. 7. Mean daily volumes of clusters for web down application.

The second clustering leads to the formation of 14 clusters of “typical days”. Their behaviours differ in terms of traffic time periods and intensity. The main characteristics are a similar activity in the up and down traffic directions and a similar usage of the peer-to-peer and unknown applications within clusters. The usage of the web application can be quite different in intensity. Globally, the time periods of traffic are very similar for the three applications in a cluster.
10 percent of the days show a high daily activity on the three applications, and 25 percent of the days are inactive days. If we project the other applications on the map of days, we can observe some correlations between applications: days with a high daily web traffic are also days with high mail, ftp and streaming activities, and the traffic time periods are similar. The chat and games applications can be correlated to peer-to-peer in the same way.

The last clustering leads to the formation of 12 clusters of customers, which can be characterized by the preponderance of a limited number of typical days. Figure 8 illustrates the characteristic behaviour of one “typical customer” (cluster 6), which groups 5 percent of the customers, very active on all the applications (with a high activity all day long, 7 days out of 10, and very few days with no activity). We plot the mean profile of the cluster, computed as the mean of all the customers classified in the cluster (top left, in black). We also give the mean profile computed on all the observations (bottom left, in grey), for comparison. The profile can be discussed according to its variations against the mean profile in order to reveal its specific characteristics. The visual inspection of the left part of Figure 8 shows that the mean customer associated with the cluster is mainly active on “typical day 12”, for 78 percent of the month. The contributions of the other “typical days” are low and are lower than the global mean. Typical day 12 corresponds to very active days.
The mean profile of “typical day 12” is shown in the top right part of the figure, in black.

[Figure 8 (plots). Panels: typical customer 6 (cluster 6, 5%) and global mean, against typical day cluster number (1 to 14); typical day 12 (cluster 12, 9%) and global mean, against typical application day cluster number, for unknown up, unknown down, p2p up, p2p down, web up and web down; typical day 6 for the p2p down application (cluster 6, 12%) and global mean, volume (in bytes) against hours.]
Fig. 8. Profile of one cluster of customers (top left) and mean profile (bottom) and profiles of associated typical days and typical application days.

The day profile is formed by the aggregation of the individual application clustering results (a line delimits the set of descriptors for each application). We also give the mean profile computed on all the observations (bottom, in grey). Typical day 12 is characterized by a preponderant typical application day on each application (from 70 percent to 90 percent for each). These typical application days correspond to high daily activities. For example, we plot the mean profile of “typical day 6” for the peer-to-peer down application in the same figure (bottom right; in black, the hourly profile of the typical day for the application, and in grey, the global average hourly profile; the volumes are given in bytes). These days show a very high activity throughout the day and even at night for the application (12 percent of the days). Figure 8 schematizes and synthesizes the complete customer segmentation process.
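The two kinds of descriptors discussed above (a day profile built by concatenating per-application binary profiles, and a customer profile given by the proportion of days spent in each typical day over the month) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the 1-based cluster labels and the example month are assumptions, and the SOM training and the clustering of the maps are not shown (cluster labels are assumed to be already available).

```python
from collections import Counter
from typing import List, Sequence


def day_profile(app_day_labels: Sequence[int], clusters_per_app: Sequence[int]) -> List[int]:
    """Concatenate one binary (one-hot) block per application into a day profile.

    app_day_labels[k] is the (assumed, 1-based) typical application day assigned
    to this day for application k; clusters_per_app[k] is the number of clusters
    found for that application.
    """
    profile: List[int] = []
    for label, n_clusters in zip(app_day_labels, clusters_per_app):
        if not 1 <= label <= n_clusters:
            raise ValueError("cluster label out of range")
        block = [0] * n_clusters
        block[label - 1] = 1  # mark the cluster this day belongs to
        profile.extend(block)
    return profile


def customer_profile(typical_day_labels: Sequence[int], n_typical_days: int) -> List[float]:
    """Proportion of a customer's observed days that fall in each typical day."""
    if not typical_day_labels:
        raise ValueError("customer has no observed days")
    counts = Counter(typical_day_labels)
    total = len(typical_day_labels)
    return [counts[k] / total for k in range(1, n_typical_days + 1)]


# Hypothetical month of 30 days: 21 days in typical day 12, 6 in typical day 3,
# 3 in typical day 1 (made-up data, for illustration only).
month = [12] * 21 + [3] * 6 + [1] * 3
profile = customer_profile(month, n_typical_days=14)
```

In this made-up example the customer spends 70 percent of the month in typical day 12, which is the kind of dominant contribution the text describes for cluster 6.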
Our step-by-step approach aims at striking a practical balance between the faithful representation of the data and the interpretative power of the resulting clustering. The segmentation results can be exploited at several levels according to the level of detail expected. The customer level gives an overall view of the customer behaviours. The analysis also allows a detailed insight into the daily cycles of the customers in the segments. The approach is highly scalable and deployable, and the clustering technique used allows easy interpretations. All the other segments of customers can be discussed similarly in terms of daily profiles and hourly profiles on the applications.

[Figure 9 (plots). Panels: typical customer 10 (cluster 10, 3%) and global mean, against typical day cluster number (1 to 14); typical day 1 (cluster 1, 4.5%) and global mean, against typical application day cluster number, for unknown up, unknown down, p2p up, p2p down, web up and web down; typical day 12 for the web down application (8%) and global mean, volume (in bytes) against hours.]
Fig. 9. Profile of another cluster of customers (top left) and mean profile (bottom) and profiles of associated typical days and typical application days.

We have identified segments of customers with a high or very high activity throughout the day on the three applications (24 percent of the customers), other segments of customers with very little activity (27 percent of the customers), and segments of customers with activity on some limited time periods on one or two applications, for example, a segment of customers with an overall low activity mainly restricted to working hours on web applications.
This segment is detailed in Figure 9. The mean customer associated with cluster 10 (3 percent of the customers) is mainly active on “typical day 1”, for 42 percent of the month. The contributions of the other “typical days” are close to the global mean. Typical day 1 (4.5 percent of the days) is characterized by a preponderant typical application day on the web application only (both in up and down directions); no specific typical day appears for the two other applications. The characteristic web days are working days with a high daily web activity on the segment 10h-19h.

Figure 10 depicts the organization of the 12 clusters on the map (each of the clusters is identified by a number and a colour). The topological ordering inherent to the SOM algorithm is such that clusters with close behaviours lie close to each other on the map, and it is possible to visualize how the behaviour evolves smoothly from one place of the map to another. The map is globally organized along an axis going from the north east (cluster 12) to the south west (cluster 6), from low activity to high activity on all the applications, non-stop throughout the day.

[Figure 10 (map). Customers map with clusters 1 to 12. Region labels: heavy users (high traffic on all applications); users with very little activity; web activity 10h-19h; web activity; P2P activity, afternoon and evening; average activity.]
Fig. 10. Interpretation of the learned SOM and its 12 clusters of customers.

4 Conclusion

In this paper, we have shown how the mining of network measurement data can reveal the usage patterns of ADSL customers. A specific scheme of exploratory data analysis has been presented to shed light on the usages of applications and daily traffic profiles. Our data-mining approach, based on the analysis and the interpretation of Kohonen self-organizing maps, allows us to define accurate and easily interpretable profiles of the customers.
These profiles exhibit very heterogeneous behaviours, ranging from a large majority of customers with a low usage of the applications to a small minority with a very high usage. The knowledge gathered about the customers is not only qualitative; we are also able to quantify the population associated with each profile, the volumes consumed on the applications or the daily cycle. Our methodologies are under continuous development in order to improve our knowledge of customers' behaviours.
Finding New Technological Ideas and Inventions with Text Mining and Technique Philosophy

Dirk Thorleuchter
Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany
Dirk.Thorleuchter@int.fraunhofer.de

Abstract. Text mining refers generally to the process of deriving high quality information from unstructured texts. Unstructured texts come in many shapes and sizes. They may be stored in research papers, articles in technical periodicals, reports, documents, web pages etc. Here we introduce a new approach for finding textual patterns representing new technological ideas and inventions in unstructured technological texts. This text mining approach follows the statements of technique philosophy. Therefore a technological idea or invention represents not only a new mean, but a new purpose and mean combination. By systematic identification of the purposes, means and purpose-mean combinations in unstructured technological texts compared to specialized reference collections, a (semi-) automatic finding of ideas and inventions can be realized. Characteristics that are used to measure the quality of these patterns found in technological texts are comprehensibility and novelty to humans and usefulness for an application.

1 Introduction

The planning of technological and scientific research and development (R&D) programs is a very demanding task; e.g. in the R&D-program of the German ministry of defense there are more than 1000 different R&D-projects running simultaneously. They all refer to about 100 different technologies in the context of security and defense. There is always a lot of change in these programs: many projects are starting and many projects are running out. One task of our research group is finding new R&D-areas for this program. New ideas or new inventions are a basis for a new R&D-area. That means that for planning new R&D-areas it is necessary to identify a lot of new technological ideas and inventions from the scientific community (Ripke et al. (1972)). Up to now, the identification of new ideas and inventions in unstructured texts has been done manually (that means by humans) without the support of text mining. Therefore in this paper we describe the theoretical background of a text mining approach to discover (semi-) automatically textual patterns representing new ideas and inventions in unstructured technological texts.

Hotho (2004) describes the characteristics that are used to measure the quality of these textual patterns extracted by knowledge discovery tasks. The characteristics are comprehensibility and novelty to the users and usefulness for a task. In this paper the users are program planners or researchers, and the task is to find ideas and inventions which can be used as a basis for new R&D-areas.

It is known from cognition research that the analysis and evaluation of textual information requires the knowledge of a context (Strube (2003)). The selection of the context depends on the users and the tasks. Referring to our users and our task, we have on the one hand textual information about worldwide existing technological R&D-projects (in the following this is called "raw information"). This information contains a lot of new technological ideas and inventions. New means that ideas and inventions are unknown to the user (Ipsen (2002)). On the other hand we have descriptions of our own R&D-projects. This represents our knowledge base, and in the following this is called "context information". Ideas and inventions in the context information are already known to the user.

To create a text mining approach for finding ideas and inventions inside the raw information, we first have to create a common structure for raw and context information. This is necessary for the comparison between raw and context information, e.g. to distinguish new (that means unknown) ideas and inventions from known ideas and inventions. In short we have to do 2 steps:

1. Create a common structure for raw and context information as a basis for the text mining approach.
2. Create a text mining approach for finding new, comprehensible and useful ideas and inventions inside the raw information.

Below we describe steps 1 and 2 in detail.

2 A common structure for raw and context information

In order to perform knowledge discovery tasks (e.g. finding ideas and inventions), raw information and context information have to be structured and formatted in a common way, as described above. In general the structure should be rich enough to allow for interesting knowledge discovery operations, and it should be simple enough to allow an automatic conversion of all kinds of textual information at a reasonable cost, as described by Feldman et al. (1995).

Raw information is stored in research papers, articles in technical periodicals, reports, documents, databases, web pages etc. That means raw information contains a lot of different structures and formats. Normally context information also contains different structures and formats. Converting all structures and formats to a common structure and format for raw and context information while keeping all structure information available costs plenty of work. Therefore our structure approach is to convert all information into plain text format. That means that we first destroy all existing structures and then build up a new common structure for raw and context information.

The new structure should refer to the relationship between terms or term-combinations (Kamphusmann (2002)). In this paper we realize this by creating sets of domain specific terms which occur in the context of a term or a combination of terms. For the structure formulation we define the term unit as a word. First we create a set of domain specific terms.

Definition 1.
Let (a text) $T = [Z_1, \ldots, Z_n]$ be a list of terms (words) $Z_i$ in order of appearance, and let $n \in \mathbb{N}$ be the number of terms in $T$ and $i \in [1, \ldots, n]$. Let $\Sigma = \{\tilde{Z}_1, \ldots, \tilde{Z}_m\}$ be a set of domain specific stop terms (Lustig (1986)) and let $m \in \mathbb{N}$ be the number of terms in $\Sigma$. $\Omega$, the set of domain specific terms in text $T$, is defined as the relative complement $T$ without $\Sigma$. Therefore:

$$\Omega = T \setminus \Sigma \tag{1}$$

For each $Z_i \in \Omega$ we create a set of domain specific terms which occur in the context of term $Z_i$.

Definition 2. Let $l \in \mathbb{N}$ be a context length of term $Z_i$, that means the maximum distance between $Z_i$ and a term $Z_j$ in text $T$. Let the distance be the number of terms (words) which occur between $Z_i$ and $Z_j$ including the term $Z_j$, and let $j \in [1, \ldots, n]$. $\Phi_i$ is defined as the set of those domain specific terms which occur in an l-length context of term $Z_i$ in text $T$:

$$\Phi_i = \left\{ Z_j \;\middle|\; (Z_j \in \Omega) \wedge (|i - j| \le l) \wedge (Z_i \neq Z_j) \right\} \tag{2}$$

For each combination of terms in $\Phi_i$ we create a set of domain specific terms which occur in the context of this combination of terms.

Definition 3. Let $G_p \in \Omega$ be a term in a list of terms with number $p \in [1, \ldots, z]$. Let $G_1, \ldots, G_z$ be a list of terms (in the following this is called a term-combination) with $G_p \neq G_q \;\forall p \neq q \in [1, \ldots, z]$ that occur together in an l-length context of term $G_1$ in text $T$. Let $z \in \mathbb{N}$ be the number of terms in the term-combination $G_1, \ldots, G_z$. $\Psi^{T}_{G_1, \ldots, G_z}$ is defined as the set of domain specific terms which occur together with the term-combination $G_1, \ldots, G_z$ in an l-length context of term $G_1$ in text $T$:

$$\Psi^{T}_{G_1, \ldots, G_z} = \left\{ \Phi_i \setminus \bigcup_{p=2}^{z} G_p \;\middle|\; G_1 = Z_i \wedge \bigcup_{p=2}^{z} G_p \subset \Phi_i \right\} \tag{3}$$

Figure 1 presents an example of the relationships in the set $\Psi^{T}_{G_1, \ldots, G_z}$. The term-combination (sensor, infrared, uncooled) has a relationship to the term-combination (focal, array, plane) because uncooled infrared sensors can be built by using the focal plane array technology. The text $T$ could be a) the textual raw information or b) the textual context information.
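Definitions 1 to 3 can be read as plain set operations over a tokenized text. The sketch below is one possible reading under stated assumptions, not the author's implementation: the function names are hypothetical, positions are 0-based, the subset condition of Definition 3 is taken as non-strict, the result is the union over all occurrences of the first term of the combination, and the example sentence and stop term list are made up around the sensor example.

```python
from typing import List, Sequence, Set


def domain_terms(text: List[str], stop_terms: Set[str]) -> Set[str]:
    """Definition 1: terms of the text that are not domain specific stop terms."""
    return set(text) - stop_terms


def context_set(text: List[str], i: int, domain: Set[str], l: int) -> Set[str]:
    """Definition 2: domain specific terms at distance <= l from position i,
    excluding the term at position i itself."""
    lo = max(0, i - l)
    hi = min(len(text), i + l + 1)
    return {text[j] for j in range(lo, hi)
            if text[j] in domain and text[j] != text[i]}


def combination_context(text: List[str], combo: Sequence[str],
                        domain: Set[str], l: int) -> Set[str]:
    """Definition 3 (assumed reading): terms that co-occur with the whole
    term-combination in an l-length context of its first term."""
    first, rest = combo[0], set(combo[1:])
    result: Set[str] = set()
    for i, term in enumerate(text):
        if term != first:
            continue
        ctx = context_set(text, i, domain, l)
        if rest <= ctx:            # all other terms of the combination are nearby
            result |= ctx - rest   # keep only the additional context terms
    return result


# Made-up example sentence and stop term list (assumptions for illustration).
text = "uncooled infrared sensor can be built with focal plane array technology".split()
stop = {"can", "be", "built", "with"}
domain = domain_terms(text, stop)
related = combination_context(text, ("sensor", "infrared", "uncooled"), domain, l=8)
```

With the context length 8 the combination (sensor, infrared, uncooled) is related to focal, plane, array and technology; with a shorter context length such as 4 the same call returns an empty set, because no further domain specific term lies close enough to the first term.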
As a result we get $\Psi^{raw}_{G_1, \ldots, G_z}$ in case of a) and $\Psi^{context}_{G_1, \ldots, G_z}$ in case of b).

Definition 4. To identify terms or term-combinations in the raw information which also occur in the context information (that means the terms or term-combinations are known to the user), we define $\Psi^{known}_{G_1, \ldots, G_z}$ as the set of terms which occur in both $\Psi^{raw}_{G_1, \ldots, G_z}$ and $\Psi^{context}_{G_1, \ldots, G_z}$:

$$\Psi^{known}_{G_1, \ldots, G_z} = \Psi^{raw}_{G_1, \ldots, G_z} \cap \Psi^{context}_{G_1, \ldots, G_z} \tag{4}$$

[Figure 1 (diagram).]
Fig. 1. Example for the relationships in $\Psi^{T}_{G_1, \ldots, G_z}$: Uncooled infrared sensors can be built by using the focal plane array technology.

3 Relevant aspects for the text mining approach from technique philosophy

The text mining approach follows the statements of technique philosophy (Rohpohl (1996)). Below we describe some relevant aspects of the statements and some specific conclusions for our text mining approach.

a) A technological idea or invention represents not only a new mean, but a new purpose and mean combination. That means that to find an idea or invention it is necessary to identify a mean and an appertaining purpose in the raw information. Appertaining means that purpose and mean shall occur together in an l-length context. Therefore for our text mining approach we first want to identify a mean and then an appertaining purpose, or vice versa.

b) Purposes and means can be exchanged. That means a purpose can become a mean in a specific context and vice versa. Example: A raw material (mean) is used to create an intermediate product (purpose). The intermediate product (mean) is then used to produce a product (purpose). In this example the intermediate product changes from purpose to mean because of the different context. Therefore for our text mining approach it is possible to identify textual patterns representing means or purposes, but it is not possible to distinguish between means and purposes without the knowledge of the specific context.
c) A purpose or a mean is represented by a technical term or by several technical terms. Therefore purposes or means can be represented by a combination of domain specific terms (e.g. $G_1, \ldots, G_z$) which occur together in an l-length context. The purpose-mean combination is a combination of 2 term-combinations and it also occurs in an l-length context as described in 3 a). For the formulation, a term-combination $G_1, \ldots, G_z$ represents a mean (a purpose) only if $\Psi^{raw}_{G_1, \ldots, G_z} \neq \emptyset$, which means there are further domain specific terms representing a purpose (a mean) which occur in an l-length context together with the term-combination $G_1, \ldots, G_z$ [...]

[Figure residue (database schema with example rows). Recoverable table names: Learner, Learner inst, Machine, Experiment, Data inst. Example values include the learner J48 (version 1.2) with the parameters C (conf threshold) and M (min nr inst/leaf).]

[...] manner and in complete detail in an experiment database. Such databases serve as a detailed log of previously performed experiments and a repository of verifiable learning experiments that can be reused by different researchers. We present an existing database containing 250,000 runs of classifier learning systems, and show how it can be queried and mined to answer a wide range of questions on learning [...]

[...] power of this database by showing how SQL queries and data mining techniques can be used to investigate classifier learning behavior. Section 5 concludes.

2 A database for classification experiments

To efficiently store and allow queries about all aspects of previously performed classification experiments, the relationships between the involved learning algorithms, datasets, experimental procedures and results [...]

[...] of many learning experiments and to obtain a clear picture of the performance of the involved algorithms and the effects of parameter settings and dataset characteristics. We believe that this discussion may be of interest to anyone who may want to use this database for their own purposes, or set up a similar database for their own research. We describe the structure of the database in Sect. 2 and the [...]

[...] purposes for his task.

5 Evaluation and outlook

We have done a first evaluation with a text about R&D-projects from the USA as raw information (Fenner et al. (2006)), a text about own R&D-projects as context information (Thorleuchter (2007)), a stop word list created for the raw information and the parameter values l = 8 and pmin = 50 %. The aim is to find new, comprehensible and useful ideas and inventions in [...]

[...] reasons, Blockeel (2006) proposed the use of experiment databases: databases describing a large number of learning experiments in complete detail, serving as a detailed log of previously performed experiments and an (online available) repository of learning experiments that can be reused by different researchers. Blockeel and Vanschoren (2007) provide a detailed account of the advantages and disadvantages [...]

[...] mining approach

References

FELDMAN, R. and DAGAN, I. (1995): Kdt - knowledge discovery in texts. In: Proceedings of the First International Conference on Knowledge Discovery (KDD). Montreal, 112-113.
FENNER, J. and THORLEUCHTER, D. (2006): Strukturen und Themengebiete der mittelstandsorientierten Forschungsprogramme in den USA. Fraunhofer INT's edition, Euskirchen, 2.
HOTHO, A. (2004): Clustern [...] mit Hintergrundwissen. Univ. Diss., Karlsruhe, 29.
IPSEN, C. (2002): F&E-Programmplanung bei variabler Entwicklungsdauer. Verlag Dr. Kovac, Hamburg, 10.
KAMPHUSMANN, T. (2002): Text-Mining. Symposion Publishing, Düsseldorf, 28.
LUSTIG, G. (1986): Automatische Indexierung zwischen Forschung und Anwendung. Georg Olms Verlag, Hildesheim, 92.
RIPKE, M. and STÖBER, G. (1972): Probleme und Methoden der Identifizierung potentieller [...]

[Figure residue (database schema, continued). Recoverable table names: Dataset, Eval meth inst, Eval meth parval, Testset of, Evaluation, Prediction. Example values include the dataset anneal (origin uci, size 898, def acc 0.7617), the method cross-validation with nbfolds 10, and an evaluation with pred acc 0.9844 and mn abs err 0.0056.]
