LNAI 9047

Alexander Gammerman · Vladimir Vovk · Harris Papadopoulos (Eds.)

Statistical Learning and Data Sciences
Third International Symposium, SLDS 2015
Egham, UK, April 20–23, 2015
Proceedings

Lecture Notes in Artificial Intelligence
Subseries of Lecture Notes in Computer Science

LNAI Series Editors: Randy Goebel, University of Alberta, Edmonton, Canada; Yuzuru Tanaka, Hokkaido University, Sapporo, Japan; Wolfgang Wahlster, DFKI and Saarland University, Saarbrücken, Germany
LNAI Founding Series Editor: Joerg Siekmann, DFKI and Saarland University, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/1244

Editors: Alexander Gammerman, University of London, Egham, Surrey, UK; Vladimir Vovk, University of London, Egham, Surrey, UK; Harris Papadopoulos, Frederick University, Nicosia, Cyprus

ISSN 0302-9743, ISSN 1611-3349 (electronic), Lecture Notes in Artificial Intelligence
ISBN 978-3-319-17090-9, ISBN 978-3-319-17091-6 (eBook)
DOI 10.1007/978-3-319-17091-6
Library of Congress Control Number: 2015935220
LNCS Sublibrary: SL7 – Artificial Intelligence
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper. Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

In memory of Alexey Chervonenkis

Preface

This volume contains the Proceedings of the Third Symposium on Statistical Learning and Data Sciences, which was held at Royal Holloway, University of London, UK, during April 20–23, 2015.

The original idea of the Symposium on Statistical Learning and Data Sciences is due to two French academics, Professors Mireille Gettler Summa and Myriam Touati, from Paris Dauphine University. Back in 2009 they thought that a "bridge" was required between the various academic groups involved in research on Machine Learning, Statistical Inference, Pattern Recognition, Data Mining, Data Analysis, and so on: a sort of multilayer bridge to connect those fields. This is reflected in the symposium logo, which shows the Passerelle Simone-de-Beauvoir bridge. The idea was implemented, and the First Symposium on Statistical Learning and Data Sciences was held in Paris in 2009. The event was indeed a great "bridge" between various communities, with interesting talks by J.-P. Benzécri, V. Vapnik, A. Chervonenkis, D. Hand, L. Bottou, and many others. Papers based on those talks were later presented in a volume of the Modulad journal and separately in a post-symposium book entitled Statistical Learning and Data Sciences, published by Chapman & Hall, CRC Press. The second symposium, which was equally successful, was held in Florence, Italy, in 2012. Over
the last years since the first symposium, progress in the theory and applications of learning and data mining has been very impressive. In particular, the arrival of technologies for collecting huge amounts of data has raised many new questions about how to store it and what type of analytics can handle it, in what is now known as Big Data. Indeed, the sheer scale of the data is very impressive: for example, the Large Hadron Collider computers have to store 15 petabytes a year (1 petabyte = $10^{15}$ bytes). Obviously, handling this requires distributed clusters of computers, streaming, parallel processing, and other technologies.

This volume is concerned with various modern techniques, some of which could be very useful for handling Big Data. The volume is divided into five parts. The first part is devoted to two invited papers by Vladimir Vapnik. The first paper, "Learning with Intelligent Teacher: Similarity Control and Knowledge Transfer," is a further development of his research on learning with privileged information, with special attention to the knowledge representation problem. The second, "Statistical Inference Problems and their Rigorous Solutions," suggests a novel approach to pattern recognition and regression estimation. Both papers promise to become milestones in the developing field of statistical learning. The second part consists of 16 papers that were accepted for presentation at the main event, while the other three parts reflect new research in important areas of statistical learning to which the symposium devoted special sessions. Specifically, the special sessions included in the symposium's program were:

– Special Session on Conformal Prediction and its Applications (CoPA 2015), organized by Harris Papadopoulos (Frederick University, Cyprus), Alexander Gammerman (Royal Holloway, University of London, UK), and Vladimir Vovk (Royal Holloway, University of London, UK)
– Special Session on New Frontiers in Data Analysis for Nuclear Fusion, organized by Jesus Vega (Asociacion EURATOM/CIEMAT para Fusion, Spain)
– Special Session on Geometric Data Analysis, organized by Fionn Murtagh (Goldsmiths College, London, UK)

Overall, 36 papers were accepted for presentation at the symposium after being reviewed by at least two independent academic referees. The authors of these papers come from 17 different countries, namely: Brazil, Canada, Chile, China, Cyprus, Finland, France, Germany, Greece, Hungary, India, Italy, Russia, Spain, Sweden, UK, and USA.

A special session at the symposium was devoted to the life and work of Alexey Chervonenkis, who tragically died in September 2014. He was one of the founders of modern Machine Learning, and a beloved colleague and friend. All his life he was connected with the Institute of Control Problems in Moscow; over the last 15 years he worked at Royal Holloway, University of London, while in recent years he also worked for the Yandex Internet company in Moscow. This special session included talks in memory of Alexey by Vladimir Vapnik, his long-standing colleague and friend, and by Alexey's former students and colleagues.

We are very grateful to the Program and Organizing Committees; the success of the symposium would have been impossible without their hard work. We are indebted to the sponsors: the Royal Statistical Society, the British Computer Society, the British Classification Society, Royal Holloway, University of London, and Paris Dauphine University. Our special thanks to Yandex for their help and support in organizing the symposium and the special session in memory of Alexey Chervonenkis. This volume of the proceedings of the symposium is also dedicated to his memory. Rest in peace, dear friend.

February 2015
Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos

Organization

General Chairs: Alexander Gammerman, UK; Vladimir Vovk, UK

Organizing Committee: Zhiyuan Luo, UK; Mireille Summa, France; Yuri Kalnishkan, UK; Myriam Touati, France; Janet Hales, UK

Program
Committee Chairs: Harris Papadopoulos, Cyprus; Xiaohui Liu, UK; Fionn Murtagh, UK

Program Committee Members: Vineeth Balasubramanian, India; Giacomo Boracchi, Italy; Paula Brito, Portugal; Léon Bottou, USA; Lars Carlsson, Sweden; Jane Chang, UK; Frank Coolen, UK; Gert de Cooman, Belgium; Jesus Manuel de la Cruz, Spain; Jose-Carlos Gonzalez-Cristobal, Spain; Anna Fukshansky, Germany; Barbara Hammer, Germany; Shen-Shyang Ho, Singapore; Carlo Lauro, Italy; Guang Li, China; David Lindsay, UK; Henrik Linusson, Sweden; Hans-J. Lenz, Germany; Ilia Nouretdinov, UK; Matilde Santos, Spain; Victor Solovyev, Saudi Arabia; Jesus Vega, Spain; Rosanna Verde, Italy

Contents

Invited Papers
– Learning with Intelligent Teacher: Similarity Control and Knowledge Transfer. Vladimir Vapnik and Rauf Izmailov
– Statistical Inference Problems and Their Rigorous Solutions. Vladimir Vapnik and Rauf Izmailov

Statistical Learning and its Applications
– Feature Mapping Through Maximization of the Atomic Interclass Distances. Savvas Karatsiolis and Christos N. Schizas
– Adaptive Design of Experiments for Sobol Indices Estimation Based on Quadratic Metamodel. Evgeny Burnaev and Ivan Panin
– GoldenEye++: A Closer Look into the Black Box. Andreas Henelius, Kai Puolamäki, Isak Karlsson, Jing Zhao, Lars Asker, Henrik Boström, and Panagiotis Papapetrou
– Gaussian Process Regression for Structured Data Sets. Mikhail Belyaev, Evgeny Burnaev, and Yermek Kapushev
– Adaptive Design of Experiments Based on Gaussian Processes. Evgeny Burnaev and Maxim Panov
– Forests of Randomized Shapelet Trees. Isak Karlsson, Panagiotis Papapetrou, and Henrik Boström
– Aggregation of Adaptive Forecasting Algorithms Under Asymmetric Loss Function. Alexey Romanenko
– Visualization and Analysis of Multiple Time Series by Beanplot PCA. Carlo Drago, Carlo Natale Lauro, and Germana Scepi
– Recursive SVM Based on TEDA. Dmitry Kangin and Plamen Angelov

Baire Metric for High Dimensional Clustering

Implementation of Algorithm

In our implementation, (i) we
take one random projection axis at a time. (ii) By means of the maximum value of the projection vector (10317 projected values on a random axis), we rescale so that projection values are in the closed/open interval [0, 1). This we do to avoid having a single projection value equal to 1. (iii) We cumulatively add these rescaled projection vectors. (iv) We take the mean vector of the individually rescaled projection vectors. That mean vector is then what we endow with the Baire metric.

Now consider our processing pipeline, as just described, in the following terms. Take a cloud of 10317 points in a 34352-dimensional space. (This sparse matrix has density 0.285%; the maximum value is 3.218811, and the minimum value is 0.) Our linear transformation, R, maps these 10317 points into a 99-dimensional space. R consists of uniformly distributed random values (and the column vectors of R are not normalized). The projections are rescaled to be between 0 and 1 on these new axes, i.e., projections are in the (closed/open) interval [0, 1).

By the central limit theorem, and by the concentration (data piling) effect of high dimensions [10,21], we have as dimension m → ∞: pairwise distances become equidistant, and orientation tends to be uniformly distributed. We find also: the norms of the target space axes are Gaussian distributed; and, as typifies sparsified data, the norms of the 10317 points in the 99-dimensional target space are distributed as a negative exponential or a power law.

We find: (i) the correlation between any pair of our random projections is greater than 0.98894, and most are greater than 0.99; (ii) the correlation between the first principal component loadings and our mean random projection is 0.9999996, and the correlation between the first principal component loadings and each of our input random projections is greater than 0.99; (iii) correlations with the second and subsequent principal component loadings are close to 0.

In summary, we have the following. We do not impose unit norm on the column vectors of our random linear mapping, R. The norms of the initial coordinate system are distributed as negative exponential, and the linear mapping into the subspace gives norms of the subspace coordinate system that are Gaussian distributed. We find very high correlation (0.99 and above, with just a few instances of 0.9889 and above) between all of the following: pairs of projected (through linear mapping with uniformly distributed values) vectors; projections on the first, and only the first, principal component of the subspace; and the mean set of projections among the sets of projections on all subspace axes. For computational convenience, we use the latter, the mean subspace set of projections, to endow with the Baire metric.

F. Murtagh and P. Contreras

With reference to other important work in [12,21,22], which uses conventional random projection, the following may be noted. Our objective is less to determine or model cluster properties as they are in very high dimensions than it is to extract useful analytics by "re-representing" the data. That is to say, we are having our data coded (or encoded) in a different way. (In [15], the discussion is along the lines of alternatively encoded data being subject to the same general analysis method. This is as compared to the viewpoint of having a new analysis method developed for each variant of the data.)
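Steps (i)–(iv) of the pipeline above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' code: the matrix dimensions are toy stand-ins for the paper's 10317 × 34352 sparse matrix, and `baire_cluster` is a hypothetical helper name for the digit-prefix grouping that the Baire (longest-common-prefix) metric induces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the paper's 10317 x 34352 sparse matrix (density ~0.3%).
n_points, n_dims, n_axes = 1000, 500, 99
X = rng.random((n_points, n_dims)) * (rng.random((n_points, n_dims)) < 0.003)

# Random linear mapping R with uniformly distributed entries;
# as in the paper, the column vectors are deliberately not normalized.
R = rng.random((n_dims, n_axes))
P = X @ R                      # projections onto the 99 random axes

# Rescale each axis into [0, 1): divide by slightly more than the axis
# maximum so that no projection value equals exactly 1.
P = P / (P.max(axis=0) * (1 + 1e-12))

# Mean of the individually rescaled projection vectors: the vector
# that is then endowed with the Baire metric.
m = P.mean(axis=1)

def baire_cluster(values, k):
    """Group points by the first k decimal digits of their projection.

    Points sharing a longer common digit prefix are closer in the Baire
    metric; each precision k yields one partition, and the nested
    partitions over k = 1, 2, ... form the hierarchy.
    """
    clusters = {}
    for i, v in enumerate(values):
        key = format(v, ".10f")[2:2 + k]   # first k digits after "0."
        clusters.setdefault(key, []).append(i)
    return clusters

# Partitions at digit precisions 1, 2, 3: coarse to fine.
for k in (1, 2, 3):
    print(k, len(baire_cluster(m, k)))
```

Reading the partitions level by level (of digit precision) gives the top-down hierarchy directly, with no pairwise-distance computations.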
Traditional approaches to clustering [16] use pairwise distances between adjacent clusters of points, or clusters are formed by assigning points to cluster centres. A direct reading of a partition is the approach pursued here. Furthermore, we determine these partitions level by level (of digit precision). The hierarchy, or tree, results from our set of partitions. This is different from the traditional (bottom-up, usually agglomerative) process, where the sequence of partitions of the data results from the hierarchy. See [3] for further discussion. To summarize: in the traditional approach, the hierarchy is built, and then the partition of interest is determined from it; in our new approach, a set of partitions is built, and then the hierarchy is determined from them.

Conclusions

We determine a hierarchy from a set of random projection based partitions. As we have noted above, the traditional hierarchy forming process first determines the hierarchical data structure, and then derives the partitions from it. One justification for our work is the interest in big data analytics, and therefore in having a top-down, rather than bottom-up, hierarchy formation process. Such hierarchy construction processes can also be termed, respectively, divisive and agglomerative. In this article, we have described how our work has many innovative features.

References

1. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4), 671–687 (2003)
2. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM, New York (2001)
3. Contreras, P., Murtagh, F.: Fast, linear time hierarchical clustering using the Baire metric. Journal of Classification 29, 118–143 (2012)
4. Dasgupta, S., Gupta, A.: An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms
22(1), 60–65 (2003)
5. Dasgupta, S.: Experiments with random projection. In: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 143–151. Morgan Kaufmann, San Francisco (2000)
6. Deegalla, S., Boström, H.: Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: ICMLA 2006: Proceedings of the 5th International Conference on Machine Learning and Applications, pp. 245–250. IEEE Computer Society, Washington DC (2006)
7. Fern, X.Z., Brodley, C.: Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the Twentieth International Conference on Machine Learning. AAAI Press, Washington DC (2003)
8. Fradkin, D., Madigan, D.: Experiments with random projections for machine learning. In: KDD 2003: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517–522. ACM, New York (2003)
9. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz maps into a Hilbert space. In: Conference in Modern Analysis and Probability. Contemporary Mathematics, vol. 26, pp. 189–206. American Mathematical Society, Providence (1984)
10. Hall, P., Marron, J.S., Neeman, A.: Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society, Series B 67, 427–444 (2005)
11. Critchley, F., Heiser, W.: Hierarchical trees can be perfectly scaled in one dimension. Journal of Classification 5, 5–20 (1988)
12. Kaski, S.: Dimensionality reduction by random mapping: fast similarity computation for clustering. In: Proceedings of the 1998 IEEE International Joint Conference on Neural Networks, pp. 413–418 (1998)
13. Li, P., Hastie, T., Church, K.: Very sparse random projections. In: KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 1, pp. 287–296. ACM, New York (2006)
14. Lin, J., Gunopulos, D.: Dimensionality reduction by random projection and latent semantic indexing. In: 3rd SIAM International Conference on Data Mining. SIAM, San Francisco (2003)
15. Murtagh, F.: Correspondence Analysis and Data Coding with R and Java. Chapman and Hall (2005)
16. Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1), 86–97 (2012)
17. Murtagh, F., Contreras, P.: Fast, linear time, m-adic hierarchical clustering for search and retrieval using the Baire metric, with linkages to generalized ultrametrics, hashing, Formal Concept Analysis, and precision of data measurement. p-Adic Numbers, Ultrametric Analysis and Applications 4, 45–56 (2012)
18. Murtagh, F., Contreras, P.: Linear storage and potentially constant time hierarchical clustering using the Baire metric and random spanning paths. In: Proceedings, European Conference on Data Analysis. Springer (forthcoming, 2015)
19. Murtagh, F., Contreras, P.: Search and retrieval in massive data: sparse p-adic coding for linear time hierarchical clustering and linear space storage (in preparation, 2015)
20. Starck, J.L., Murtagh, F., Fadili, J.M.: Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press (2010)
21. Terada, Y.: Clustering for high-dimension, low-sample size data using distance vectors, 16 pp. (2013). http://arxiv.org/abs/1312.3386
22. Urruty, T., Djeraba, C., Simovici, D.A.: Clustering by random projections. In: Perner, P. (ed.)
ICDM 2007. LNCS (LNAI), vol. 4597, pp. 107–119. Springer, Heidelberg (2007)
23. Vempala, S.S.: The Random Projection Method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 65. American Mathematical Society, Providence (2004)

Optimal Coding for Discrete Random Vector

Bernard Colin (Département de Mathématiques, Université de Sherbrooke, Sherbrooke J1K 2R1, Canada), Jules de Tibeiro (Secteur Sciences, Université de Moncton à Shippagan, Shippagan E8S 1P6, Canada; jules.de.tibeiro@umoncton.ca), and François Dubeau (Département de Mathématiques, Université de Sherbrooke, Sherbrooke J1K 2R1, Canada)

Abstract. Based on the notion of mutual information between the components of a discrete random vector, we construct, for data reduction reasons, an optimal quantization of the support of its probability measure. More precisely, we propose a simultaneous discretization of the whole set of components of the discrete random vector which takes into account, as much as possible, the stochastic dependence between them. Computational aspects and an example are presented.

Keywords: Divergence · Mutual information · Optimal quantization · Correspondence analysis

1 Introduction and Motivation

In statistics and data analysis, it is usual to take into account simultaneously a given number of discrete random variables or discrete random vectors. This is particularly the case in surveys, censuses, data mining, etc., but also when some multidimensional discrete probabilistic models seem well suited to the phenomenon under study. In this context, one often wants to use data to adjust some statistical forecasting models, or simply to account for the stochastic dependence between random variables or random vectors. For example, in the non-parametric framework, descriptive and exploratory models of data analysis such as, among others, correspondence analysis, are dedicated to determining the stochastic dependence between random variables or random vectors by means of the associations between their respective categories. Similarly, the parametric framework
of usual discrete multidimensional distributions leads to the estimation of the parameters of the joint distribution in order to estimate the stochastic dependence between random variables or vectors.

However, for various reasons (ease of use, clearness of results and graphical displays, confidentiality of the data, etc.), one has in practice to create, for each random variable or component of a random vector, new categories by grouping old ones. For instance, an educational level such as "Primary", "Secondary", "College", or "University" is more often used than the exact number of years of study. For this purpose, one usually creates classes for each variable, regardless of the stochastic dependence that may exist between them. This approach, however, although very widespread, deprives the statistician of information that could be crucial in a predictive model since, in doing so, it arbitrarily degrades the information on the stochastic relationship between the variables, which could consequently affect the quality of the forecast. To alleviate this problem, we propose to adapt to the discrete case the approach introduced in the continuous case by Colin, Dubeau, Khreibani and de Tibeiro [7], which is based on the existence of an optimal partition of the support of the probability measure of a random vector, corresponding to a minimal loss of mutual information between its components resulting from data reduction.

© Springer International Publishing Switzerland 2015. A. Gammerman et al. (Eds.): SLDS 2015, LNAI 9047, pp. 432–441, 2015. DOI: 10.1007/978-3-319-17091-6_38

2 Theoretical Framework

2.1 Generalities

We briefly present hereafter the theoretical frame relative to determining a finite optimal partition of the support $S_P$ of the probability measure of a discrete random vector. Let $(\Omega, \mathcal{F}, \mu)$ be any probability space, where $\Omega$ is a given set, $\mathcal{F}$ is a $\sigma$-field of subsets of $\Omega$, and the probability measure $\mu$ is supposed to be absolutely continuous with respect to a reference measure. Let $X = (X_1, X_2, \dots, X_k)$ be a random vector defined on $(\Omega, \mathcal{F}, \mu)$ with values in a countable set $\mathcal{X}$, usually identical to $\mathbb{N}^k$, $(\mathbb{N}^*)^k = (\mathbb{N} \cup \{0\})^k$ or $\mathbb{Z}^k$. If $P$ is the probability measure image of $\mu$ under the mapping $X$, one has:

$$P(x_1, x_2, \dots, x_k) = P(X_1 = x_1, X_2 = x_2, \dots, X_k = x_k) = \mu\{\omega \in \Omega : X_1(\omega) = x_1, X_2(\omega) = x_2, \dots, X_k(\omega) = x_k\}$$

where $x = (x_1, x_2, \dots, x_k) \in \mathcal{X}$. Finally, $P_{X_1}, P_{X_2}, \dots, P_{X_k}$ denote respectively the marginal probability measures of the components $X_1, X_2, \dots, X_k$ of the random vector $X$.

2.2 Mutual Information

As defined in the continuous case, the mutual information $I_\varphi(X_1, X_2, \dots, X_k)$ between the random variables $X_1, X_2, \dots, X_k$ is nothing else than the $\varphi$-divergence $I_\varphi\big(P, \otimes_{i=1}^{k} P_{X_i}\big)$ between the probability measures $P$ and $\otimes_{i=1}^{k} P_{X_i}$, given by:

$$I_\varphi(X_1, X_2, \dots, X_k) = I_\varphi\Big(P,\ \otimes_{i=1}^{k} P_{X_i}\Big) = \sum_{x \in \mathcal{X}} \varphi\!\left(\frac{P(x)}{\otimes_{i=1}^{k} P_{X_i}(x)}\right) \otimes_{i=1}^{k} P_{X_i}(x) = E_{\otimes_{i=1}^{k} P_{X_i}}\!\left[\varphi\!\left(\frac{P(X)}{\otimes_{i=1}^{k} P_{X_i}(X)}\right)\right]$$

where $\varphi$ is a convex function from $\mathbb{R}^+ \setminus \{0\}$ to $\mathbb{R}$ (see Csiszár [10], Aczél and Daróczy [1], Rényi [20] for details). It is easy to check, using some elementary calculations, that all the properties of divergence and mutual information, as set out in the continuous framework, are also valid in a discrete setting (positivity if $\varphi(1) \geq 0$, convexity with respect to the joint probability measure, independence of components, etc.)
and, in particular, the one relating to the loss of information arising from a transformation of the random vector $X$. This property, known as the "data-processing theorem", states that it is not possible to strictly increase the mutual information between random variables or random vectors by applying a transformation to them (see [1], [9], [10], [23]).

3 Optimal Partition

3.1 Mutual Information Explained by a Partition

Without loss of generality, and for the sake of simplicity, it is assumed that, for $i = 1, 2, \dots, k$, each component $X_i$ of the random vector $X$ is a random variable with values in $\mathbb{N}^*$. Moreover, for every $i = 1, 2, \dots, k$, we denote by $\eta_{i l_i} \in \mathbb{N}^*$ any integer value of the random variable $X_i$, where $l_i \in \mathbb{N}^*$. Given $k$ integers $n_1, n_2, \dots, n_k$, we consider for every $i = 1, 2, \dots, k$ a partition $\mathcal{P}_i$ of the support $S_{P_{X_i}}$ of the random variable $X_i$, obtained by using a set $\{\gamma_{i j_i}\}$ of $n_i$ intervals of the form:

$$\gamma_{i j_i} = [x_{i(j_i - 1)},\, x_{i j_i}) \quad \text{for } j_i = 1, 2, \dots, n_i - 1, \qquad \gamma_{i n_i} = [x_{i(n_i - 1)},\, \infty)$$

where the bounds of the intervals are real numbers such that:

$$0 = x_{i0} < x_{i1} < x_{i2} < \dots < x_{i(n_i - 1)} < \infty$$

Remark: the choice of real bounds for the half-open intervals $\gamma_{i j_i}$ follows from the fact that it is not a priori excluded that one of the elements of the optimal partition be a single point $x$ of $(\mathbb{N}^*)^k$.

The "product partition" $\mathcal{P}$ of the support $S_P$ into $n = n_1 \times n_2 \times \dots \times n_k$ cells is then given by:

$$\mathcal{P} = \otimes_{i=1}^{k} \mathcal{P}_i = \{\times_{i=1}^{k}\, \gamma_{i j_i}\} \quad \text{where } j_i = 1, 2, \dots, n_i \text{ for } i = 1, 2, \dots, k$$

If $\sigma(\mathcal{P})$ is the $\sigma$-algebra generated by $\mathcal{P}$ (the algebra generated by $\mathcal{P}$ in the present case), the restriction of $P$ to $\sigma(\mathcal{P})$ is given, for every $j_1, j_2, \dots, j_k$, by $P\big(\times_{i=1}^{k} \gamma_{i j_i}\big)$, from which it follows easily that the marginal probability measures are given, for $i = 1, 2, \dots, k$, by $P_{X_i}(\gamma_{i j_i})$, where $j_i = 1, 2, \dots, n_i$. The mutual information explained by the partition $\mathcal{P}$ of the support $S_P$, denoted by $I_\varphi(\mathcal{P})$, is then defined by:

$$I_\varphi(\mathcal{P}) = \sum_{j_1, j_2, \dots, j_k} \varphi\!\left(\frac{P\big(\times_{i=1}^{k} \gamma_{i j_i}\big)}{\prod_{i=1}^{k} P_{X_i}(\gamma_{i j_i})}\right) \prod_{i=1}^{k} P_{X_i}(\gamma_{i j_i})$$

and the loss of mutual information arising from the data reduction due to the partition $\mathcal{P}$ is given by $I_\varphi(X_1, X_2, \dots, X_k) - I_\varphi(\mathcal{P})$, which is positive due to the "data-processing theorem".

3.2 Existence of an Optimal Partition

For any given sequence $n_1, n_2, \dots, n_k$ of integers and for every $i = 1, 2, \dots, k$, let $\mathcal{P}_{i, n_i}$ be the class of partitions of $S_{P_{X_i}}$ into $n_i$ disjoint intervals $\gamma_{i j_i}$, as introduced in the previous subsection, and let $\mathcal{P}_n$ be the class of partitions of $S_P$ of the form:

$$\mathcal{P}_n = \otimes_{i=1}^{k} \mathcal{P}_{i, n_i}$$

where $n$ is the multi-index $(n_1, n_2, \dots, n_k)$ of size $|n| = k$. We define an optimal partition $\mathcal{P}$ as a member of $\mathcal{P}_n$ for which the loss of mutual information $I_\varphi(X_1, X_2, \dots, X_k) - I_\varphi(\mathcal{P})$ is minimum. Therefore, we have to solve the following optimization problem:

$$\min_{\mathcal{P} \in \mathcal{P}_n} \big( I_\varphi(X_1, X_2, \dots, X_k) - I_\varphi(\mathcal{P}) \big)$$

or, equivalently, the dual optimization problem:

$$\max_{\mathcal{P} \in \mathcal{P}_n} I_\varphi(\mathcal{P}) = \max_{\mathcal{P} \in \mathcal{P}_n} \sum_{j_1, j_2, \dots, j_k} \varphi\!\left(\frac{P\big(\times_{i=1}^{k} \gamma_{i j_i}\big)}{\prod_{i=1}^{k} P_{X_i}(\gamma_{i j_i})}\right) \prod_{i=1}^{k} P_{X_i}(\gamma_{i j_i})$$

which is, in turn, equivalent to the problem of finding the real bounds $x_{i j_i}$ of the intervals $\gamma_{i j_i}$ for every $i = 1, 2, \dots, k$ and every $j_i = 1, 2, \dots, n_i$.

If the support $S_P$ is finite, there is a finite number of members of $\mathcal{P}_n$, so an optimal partition automatically exists, while if the support $S_P$ is denumerably infinite, the set $\mathcal{P}_n$ is denumerable. In that case, the countable set of real numbers $I_\varphi(\mathcal{P}_n)$, where $\mathcal{P}_n \in \mathcal{P}_n$ for all $n \in \mathbb{N}^*$, may be ordered according to a non-decreasing sequence, bounded above by $I_\varphi(X_1, X_2, \dots, X_k)$. So in this case, the existence of an upper bound for the sequence $(\mathcal{P}_n)_{n \geq 0}$ ensures the existence of an optimal partition, or possibly of a "quasi-optimal" partition if the upper bound is not attained by a member of $\mathcal{P}_n$.

As an example, let $X = (X_1, X_2)$ be a bivariate discrete random vector with finite support $S_P = \{0, 1, 2, \dots, p\} \times \{0, 1, 2, \dots, q\}$. If $n_1$ and $n_2$ are respectively the given numbers of elements of the partitions $\mathcal{P}_1$ and $\mathcal{P}_2$ of the sets $\{0, 1, 2, \dots, p\}$ and $\{0, 1, 2, \dots, q\}$, then $\mathcal{P} = \mathcal{P}_1 \otimes \mathcal{P}_2$ is a "joint"
partition of the set $\{0, 1, 2, \dots, p\} \times \{0, 1, 2, \dots, q\} \subseteq (\mathbb{N}^*)^2$ with exactly $n_1 \times n_2$ non-empty cells. It is easy to check that the finite number of elements of the set $\mathcal{P}_n$, where $n = (n_1, n_2)$, is given by:

$$\mathrm{card}(\mathcal{P}_n) = \binom{p}{n_1 - 1} \binom{q}{n_2 - 1}$$

In order to find the optimal partition, it is sufficient to consider all possible values of $I_\varphi(\mathcal{P}_n)$ where $\mathcal{P}_n \in \mathcal{P}_n$, for all $n = 1, 2, \dots, \mathrm{card}(\mathcal{P}_n)$. If the support $S_P$ is unbounded, it is possible, for numerical reasons, to bring the problem back to one with bounded support by means of the transformation:

$$T: \quad U = F(X_1), \quad V = G(X_2)$$

from $\mathbb{N}^* \times \mathbb{N}^*$ into $\mathrm{Im}(F) \times \mathrm{Im}(G) \subset [0, 1] \times [0, 1]$, where $F$ and $G$ are respectively the cumulative distribution functions of the random variables $X_1$ and $X_2$. One can easily check that for such a transformation one has $I_\varphi(X_1, X_2) = I_\varphi(U, V)$. In other words, $T$ is a "mutual information invariant" transformation.

4 Computational Aspects

4.1 Example

In order to illustrate the process of determining the optimal quantization, let us consider the following contingency table $N = \{n_{ij}\}$, involving two random categorical variables $X_1$ and $X_2$:

Table: Contingency table involving two random categorical variables

X1\X2    1    2    3    4    5    6    7    8    9   10
  1     80   60  168  470  236  145  166  160  180  125
  2     36   20   74  191   99   52   64   60   60   55
  3     87   89  185  502  306  190  207  160  211  187
  4     47   61  127  304  187   91  194  110  111  115
  5     99   60  137  400  264  133  193  110  194   53
  6     42   21  112  449  188   65   33   51   97   82
  7     23   19   96  427   93   70   94   29   79   43
  8     28   10   53  164   56   30   23   20   30   17
  9     11   11   21   45   36   20   28   21   27   15
 10     10   58   62   79   87   54  129   80   49   41

from which we can easily deduce a joint probability measure $P = \{p_{ij}\}$ by dividing each entry by $n = \sum_{i=1}^{10}\sum_{j=1}^{10} n_{ij}$.

Let us suppose that we want to find an optimal partition of the support of this discrete probability measure into $3 \times 3$ elements. To this end, let $\{[1, \alpha_1), [\alpha_1, \alpha_2), [\alpha_2, 10]\}$ and $\{[1, \beta_1), [\beta_1, \beta_2), [\beta_2, 10]\}$ be respectively the elements of the partitions of the supports of the probability measures $P_{X_1}$ and $P_{X_2}$, where $\alpha_1, \alpha_2, \beta_1$ and $\beta_2$ are of the form $k + \frac{1}{2}$ with $k = 1, 2, \dots, 9$. Let us suppose that the function $\varphi$ is the $\chi^2$-metric given by $\varphi(t) = (1 - t)^2$, where:

$$t = \frac{p_{ij}}{p_{i\bullet}\, p_{\bullet j}}$$

The mutual information is then given by:

$$I_\varphi(X_1, X_2) = I_\varphi\Big(P,\ \otimes_{k=1}^{2} P_{X_k}\Big) = \sum_{i=1}^{10}\sum_{j=1}^{10} \left(1 - \frac{p_{ij}}{p_{i\bullet}\, p_{\bullet j}}\right)^2 p_{i\bullet}\, p_{\bullet j} = \sum_{i=1}^{10}\sum_{j=1}^{10} \frac{p_{ij}^2}{p_{i\bullet}\, p_{\bullet j}} - 1 = 0.0650$$

For every choice of $\alpha = (\alpha_1, \alpha_2)$ and $\beta = (\beta_1, \beta_2)$ (there are exactly 36 × 36 = 1296 possible choices) resulting in a 9-cell partition $\mathcal{P}_{\alpha,\beta} = \{C_{lk},\ l, k = 1, 2, 3\}$ of the support $S_P$ of the probability measure $P$, we have to calculate, using standard notation, the following probabilities:

$$\pi_{lk} = \sum_{(i,j) \in C_{lk}} p_{ij}, \qquad \pi_{l\bullet} = \sum_{k=1}^{3} \pi_{lk}, \qquad \pi_{\bullet k} = \sum_{l=1}^{3} \pi_{lk}$$

It follows that:

$$I_\varphi(\mathcal{P}_{\alpha,\beta}) = \sum_{l=1}^{3}\sum_{k=1}^{3} \left(1 - \frac{\pi_{lk}}{\pi_{l\bullet}\, \pi_{\bullet k}}\right)^2 \pi_{l\bullet}\, \pi_{\bullet k}$$

and we have to solve the following optimization problem:

$$\max_{\alpha, \beta}\ I_\varphi(\mathcal{P}_{\alpha,\beta})$$

For this elementary example, methods of exhaustion allow one to easily find a solution $(\alpha^*, \beta^*)$ satisfying the equality:

$$I_\varphi(\mathcal{P}_{\alpha^*,\beta^*}) = \max_{\alpha,\beta} \sum_{l=1}^{3}\sum_{k=1}^{3} \left(1 - \frac{\pi_{lk}}{\pi_{l\bullet}\, \pi_{\bullet k}}\right)^2 \pi_{l\bullet}\, \pi_{\bullet k}$$

In the present case, using an enumeration algorithm¹, we have found $\alpha^* = (3.5, 4.5)$ and $\beta^* = (5.5, 8.5)$, for which the partitions of the supports of $P_{X_1}$ and $P_{X_2}$ are respectively given by $(\{1, 2, 3\}, \{4\}, \{5, 6, 7, 8, 9, 10\})$ and $(\{1, 2, 3, 4, 5\}, \{6, 7, 8\}, \{9, 10\})$. Furthermore, the ratio of the initial mutual information explained by $\mathcal{P}_{\alpha^*,\beta^*}$ is given by:

$$\frac{I_\varphi(\mathcal{P}_{\alpha^*,\beta^*})}{I_\varphi(X_1, X_2)} = \frac{0.0326}{0.0650} \approx 0.51$$

Similarly, for an optimal partition of $S_P$ into $4 \times 4$ cells, we obtain respectively, for $\alpha^* = (\alpha_1^*, \alpha_2^*, \alpha_3^*)$ and $\beta^* = (\beta_1^*, \beta_2^*, \beta_3^*)$, the solutions $\alpha^* = (3.5, 4.5, 9.5)$ and $\beta^* = (4.5, 5.5, 8.5)$, for which the corresponding partitions of the supports of $P_{X_1}$ and $P_{X_2}$ are respectively given by $(\{1, 2, 3\}, \{4\}, \{5, 6, 7, 8, 9\}, \{10\})$ and $(\{1, 2, 3, 4\}, \{5\}, \{6, 7, 8\}, \{9, 10\})$. Moreover:

$$\frac{I_\varphi(\mathcal{P}_{\alpha^*,\beta^*})}{I_\varphi(X_1, X_2)} = \frac{0.0380}{0.0650} \approx 0.59$$

For the resolution of the optimization problems arising from the examples presented hereafter, some more sophisticated numerical methods are needed. For example, we have to invoke in these cases
some combinatorial algorithms (see B. Korte and J. Vygen [17]) or metaheuristic methods (see J. Dréo et al. [11] or T. Ibaraki, K. Nonobe and M. Yagiura [15]).

¹ MATLAB programs are available on request at the address: francois.dubeau@usherbrooke.ca

4.2 Some Usual Multivariate Distributions

Multinomial Distribution: One says that the random vector X = (X1, X2, ..., Xk) is distributed as a multinomial distribution, denoted by Mn(n; p1, p2, ..., pk), with parameters n, p1, p2, ..., pk, if its probability density function is expressed in the following form:

P(X1 = x1, X2 = x2, ..., Xk = xk) = C(n; x1, x2, ..., xk) Π_{i=1}^{k} pi^{xi} I_{SP}(x1, x2, ..., xk)

where C(n; x1, x2, ..., xk) = n!/(x1! x2! ⋯ xk!) is the multinomial coefficient, where 0 ≤ pi ≤ 1 for all i = 1, 2, ..., k with Σ_{i=1}^{k} pi = 1, and where SP ⊆ [0, 1, 2, ..., n]^k = {(x1, x2, ..., xk) : Σ_{i=1}^{k} xi = n}. In this case, each member of Pn has exactly Π_{i=1}^{k} ni non-empty cells.

Multivariate Poisson Distribution:

Multivariate Poisson distribution of type I: Given two independent random variables X1 and X2 distributed as two Poisson distributions P(λ) and P(μ) with parameters λ and μ, one says that the bivariate random vector X = (X1, X2) is distributed as a bivariate Poisson distribution of type I, P(λ, μ; γ), with parameters λ, μ and γ, if its cumulative distribution function H(x1, x2) is given by:

H(x1, x2) = F(x1) G(x2) [1 + γ (1 − F(x1)) (1 − G(x2))]

where −1 ≤ γ ≤ 1, and where F and G are respectively the cumulative distribution functions of X1 and X2.

Multivariate Poisson distribution of type II: According to Johnson, Kotz and Balakrishnan [16], one says that the random vector X = (X1, X2, ..., Xk) is distributed as a multivariate Poisson distribution of type II, P(λ1, λ2, ..., λk), with parameters λ1, λ2, ..., λk, if its probability density function is given by:

P(∩_{i=1}^{k} {Xi = xi}) = ( Π_{i=1}^{k} e^{−λi} λi^{xi} / xi! ) × exp{ Σ_{i<j} λij C(xi) C(xj) + Σ_{i<j<l} λijl C(xi) C(xj) C(xl) + ⋯ + λ12⋯k C(x1) C(x2) ⋯ C(xk) } I_{N∗k}(x1, x2, ..., xk)

where λi > 0 for all i = 1, 2, ..., k, where C(•) is a Gram–Charlier polynomial of type B (see [16]), and where λijl = E[Xi Xj Xl] for all i, j, l ∈ {1, 2, ..., k}. See also Holgate [13], [14], Campbell [6], Aitken and Gonin [3], Aitken [2] and Consael [8] for more details.

Multivariate Hypergeometric Distribution: This distribution arises from random sampling (without replacement) in a population with k categories. The random vector X = (X1, X2, ..., Xk) is said to have a multivariate hypergeometric distribution, denoted by HM(n; m; m1, m2, ..., mk), with parameters n, m, m1, m2, ..., mk, if its probability density function is given by:

P(X1 = n1, X2 = n2, ..., Xk = nk) = [ Π_{i=1}^{k} C(mi, ni) ] / C(m, n)

where Σ_{i=1}^{k} mi = m, Σ_{i=1}^{k} ni = n and 0 ≤ ni ≤ min(n, mi) for all i = 1, 2, ..., k. Moreover, each marginal distribution is a univariate hypergeometric distribution H(n, m).

Negative Multinomial Distribution: The random vector X = (X1, X2, ..., Xk) is distributed as a negative multinomial distribution, denoted by MN(p0, p1, p2, ..., pk), with parameters p0, p1, p2, ..., pk, if its probability density function is given by:

P(X1 = n1, X2 = n2, ..., Xk = nk) = [ Γ(n + Σ_{i=1}^{k} ni) / (n1! n2! ⋯ nk! Γ(n)) ] p0^n Π_{i=1}^{k} pi^{ni} I_{N∗k}(n1, n2, ..., nk)

where 0 < pi < 1 for i = 1, 2, ..., k and where Σ_{i=0}^{k} pi = 1. For more details and applications of this distribution one can see: Bates and Neyman [5], Sibuya, Yoshimura and Shimizu [21], Patil [18], Neyman [19], Sinoquet and Bonhomme [22], Guo [12], and Arbous and Sichel [4].

Conclusions

Let X = (X1, X2, ..., Xk) be a discrete random vector defined on a probability space (Ω, F, μ) with values in a countable set X (usually N^k, N∗^k = (N ∪ {0})^k or Z^k), and with a joint probability measure P absolutely continuous with respect to a counting measure. For a given measure of mutual information Iϕ(X1, X2, ..., Xk) between the components of X, we have shown, using a criterion based on the minimization of the mutual information loss, that for k given integers n1, n2, ..., nk there exists an optimal partition of the support SP of P into Π_{i=1}^{k} ni elements, given by the Cartesian product of the elements of the partitions of the supports of the components X1, X2, ..., Xk into, respectively, n1, n2, ..., nk classes. This procedure retains the stochastic dependence between the random variables X1, X2, ..., Xk as much as possible, which may be significantly important for some data analyses or statistical inferences, such as tests of independence. As illustrated by an example, this optimal partition performs, from this point of view, better than any other partition having the same number of classes. Although this way of carrying out a quantization of the support of a probability measure is less usual than those associated with marginal classes of equal "width" or of equal probabilities, we think that practitioners could seriously consider it, at least in the case where the conservation of the stochastic dependence between the random variables is important. Finally, from a practical point of view, we have paid attention to some semiparametric cases for which one can assume that the probability measure P is a member of a given family depending
on the unknown parameter θ.

References

1. Aczél, J., Daróczy, Z.: On Measures of Information and Their Characterizations. Academic Press, New York (1975)
2. Aitken, A.C.: A further note on multivariate selection. Proceedings of the Edinburgh Mathematical Society 5, 37–40 (1936)
3. Aitken, A.C., Gonin, H.T.: On fourfold sampling with and without replacement. Proceedings of the Royal Society of Edinburgh 55, 114–125 (1935)
4. Arbous, A.G., Sichel, H.S.: New techniques for the analysis of absenteeism data. Biometrika 41, 77–90 (1954)
5. Bates, G.E., Neyman, J.: Contributions to the theory of accident proneness. University of California Publications in Statistics 1, 215–253 (1952)
6. Campbell, J.T.: The Poisson correlation function. Proceedings of the Edinburgh Mathematical Society (Series 2) 4, 18–26 (1938)
7. Colin, B., Dubeau, F., Khreibani, H., de Tibeiro, J.: Optimal quantization of the support of a continuous multivariate distribution based on mutual information. Journal of Classification 30, 453–473 (2013)
8. Consael, R.: Sur les processus composés de Poisson à deux variables aléatoires. Académie Royale de Belgique, Classe des Sciences, Mémoires 7, 4–43 (1952)
9. Csiszár, I.: A class of measures of informativity of observation channels. Periodica Mathematica Hungarica 2(1–4), 191–213 (1972)
10. Csiszár, I.: Information measures: a critical survey. In: Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, A, 73–86 (1977)
11. Dréo, J., Pétrowski, A., Siarry, P., Taillard, E.: Metaheuristics for Hard Optimization. Springer (2006)
12. Guo, G.: Negative multinomial regression models for clustered event counts. Technical Report, Department of Sociology, University of North Carolina, Chapel Hill, NC (1995)
13. Holgate, P.: Estimation for the bivariate Poisson distribution. Biometrika 51, 241–245 (1964)
14. Holgate, P.: Bivariate generalizations of Neyman's type A distribution. Biometrika 53, 241–245 (1966)
15. Ibaraki, T., Nonobe, K., Yagiura, M.: Metaheuristics: Progress as Real Problem Solvers. Springer (2005)
16. Johnson, N.L., Kotz, S., Balakrishnan, N.: Discrete Multivariate Distributions. John Wiley & Sons, Inc. (1997)
17. Korte, B., Vygen, J.: Combinatorial Optimization: Theory and Algorithms. Algorithms and Combinatorics 21, Fourth Edition. Springer (2008)
18. Patil, G.P.: On sampling with replacement from populations with multiple characters. Sankhya, Series B 30, 355–364 (1968)
19. Neyman, J.: Certain chance mechanisms involving discrete distributions (inaugural address). In: Proceedings of the International Symposium on Discrete Distributions, pp. 4–14, Montréal (1963)
20. Rényi, A.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. I, pp. 547–561. University of California Press, Berkeley (1961)
21. Sibuya, M., Yoshimura, I., Shimizu, R.: Negative multinomial distribution. Annals of the Institute of Statistical Mathematics 16, 409–426 (1964)
22. Sinoquet, H., Bonhomme, R.: A theoretical analysis of radiation interception in a two-species plant canopy. Mathematical Biosciences 105, 23–45 (1991)
23. Ziv, J., Zakai, M.: On functionals satisfying a data-processing theorem. IEEE Transactions on Information Theory IT-19, 275–282 (1973)

Author Index

Acorn, Erik 179
Ahlberg, Ernst 251, 271, 323
Angelov, Plamen 156
Angelov, Plamen Parvanov 169
Asker, Lars 96
Balasubramanian, Vineeth N. 291
Belyaev, Mikhail 106
Bernstein, Alexander 414
Bezerra, Clauber Gomes 169
Boström, Henrik 96, 126, 251, 271
Bradley, Patrick E. 406
Burnaev, Evgeny 86, 106, 116
Canto, Sebastián Dormido 376
Carlsson, Lars 251, 271, 323
Cassor, Frédérik 389
Cavallaro, Lorenzo 313
Cherubin, Giovanni 313
Colin, Bernard 432
Contreras, Pedro 424
Costa, Bruno Sielly Jales 169
Couceiro, Miguel 234
Craddock, Rachel 281
de Tibeiro, Jules 432
Demertzis, Konstantinos 223
Denis, Christophe 301
Díaz, Norman 356
Dipsis, Nikos 179
Dormido-Canto, Sebastián 356, 366
Drago, Carlo 147
Dubeau, François 432
Farias, Gonzalo 356
Farid, Mohsen 397
Ferdous, Mohsina Mahmuda 214
Gammerman, Alexander 281, 313
Gaudio, P. 347
Gelfusa, M. 347
Guedes, Luiz Affonso 169
Hasselgren, Catrin 323
Hebiri, Mohamed 301
Henelius, Andreas 96
Iliadis, Lazaros 223
Izmailov, Rauf 3, 33
Jaiswal, Ritvik 291
Johansson, Ulf 251, 271
Jordaney, Roberto 313
Kangin, Dmitry 156
Kapushev, Yermek 106
Karatsiolis, Savvas 75
Karlsson, Isak 96, 126
Kuleshov, Alexander 414
Lauro, Carlo Natale 147
Le Roux, Brigitte 389
Linusson, Henrik 251, 271
Liu, Xiaohui 214
Liu, Xin 241
Logofatu, Doina 203
Lungaroni, M. 347
Luo, Zhiyuan 241
Martínez, F. 366
Murari, A. 347
Murtagh, Fionn 397, 424
Nouretdinov, Ilia 241, 281, 313
Offer, Charles 281
Olaniyan, Rapheal 203
Panin, Ivan 86
Panov, Maxim 116
Papadopoulos, Harris 260
Papapetrou, Panagiotis 96, 126
Papini, Davide 313
Pastor, I. 366
Pavón, Fernando 376
Peluso, E. 347
Pincus, Tamar 179
Plavin, Alexander 193
Potapenko, Anna 193
Puolamäki, Kai 96
Rattá, G.A. 337
Rodríguez, M.C. 366
Romanenko, Alexey 137
Scepi, Germana 147
Schizas, Christos N. 75
Smith, James 281
Sönströd, Cecilia 271
Spjuth, Ola 323
Stamate, Daniel 203
Stathis, Kostas 179
Vapnik, Vladimir 3, 33
Vega, Jesús 337, 356, 366, 376
Vinciotti, Veronica 214
Vorontsov, Konstantin 193
Waldhauser, Tamás 234
Wang, Huazhen 241
Wang, Zhi 313
Wilson, Paul 214
Yanovich, Yury 414
Zhao, Jing 96