Principal Component Analysis, Second Edition
I.T. Jolliffe
Springer

Preface to the Second Edition

Since the first edition of the book was published, a great deal of new material on principal component analysis (PCA) and related topics has been published, and the time is now ripe for a new edition. Although the size of the book has nearly doubled, there are only two additional chapters. All the chapters in the first edition have been preserved, although two have been renumbered. All have been updated, some extensively. In this updating process I have endeavoured to be as comprehensive as possible. This is reflected in the number of new references, which substantially exceeds those in the first edition. Given the range of areas in which PCA is used, it is certain that I have missed some topics, and my coverage of others will be too brief for the taste of some readers. The choice of which new topics to emphasize is inevitably a personal one, reflecting my own interests and biases. In particular, atmospheric science is a rich source of both applications and methodological developments, but its large contribution to the new material is partly due to my long-standing links with the area, and not because of a lack of interesting developments and examples in other fields. For example, there are large literatures in psychometrics, chemometrics and computer science that are only partially represented. Due to considerations of space, not everything could be included.

The main changes are now described. Chapters to describing the basic theory and providing a set of examples are the least changed. It would have been possible to substitute more recent examples for those of Chapter 4, but as the present ones give nice illustrations of the various aspects of PCA, there was no good reason to do so. One of these examples has been moved to Chapter. One extra property (A6) has been added to Chapter 2, with Property A6 in Chapter becoming A7. Chapter
has been extended by further discussion of a number of ordination and scaling methods linked to PCA, in particular varieties of the biplot. Chapter has seen a major expansion. There are two parts of Chapter concerned with deciding how many principal components (PCs) to retain and with using PCA to choose a subset of variables. Both of these topics have been the subject of considerable research in recent years, although a regrettably high proportion of this research confuses PCA with factor analysis, the subject of Chapter. Neither Chapter nor have been expanded as much as Chapter or Chapters and 10.

Chapter in the first edition contained three sections describing the use of PCA in conjunction with discriminant analysis, cluster analysis and canonical correlation analysis (CCA). All three sections have been updated, but the greatest expansion is in the third section, where a number of other techniques have been included, which, like CCA, deal with relationships between two groups of variables. As elsewhere in the book, Chapter includes yet other interesting related methods not discussed in detail. In general, the line is drawn between inclusion and exclusion once the link with PCA becomes too tenuous.

Chapter 10 also included three sections in the first edition on outlier detection, influence and robustness. All have been the subject of substantial research interest since the first edition; this is reflected in expanded coverage. A fourth section, on other types of stability and sensitivity, has been added. Some of this material has been moved from Section 12.4 of the first edition; other material is new.

The next two chapters are also new and reflect my own research interests more closely than other parts of the book. An important aspect of PCA is interpretation of the components once they have been obtained. This may not be easy, and a number of approaches have been suggested for simplifying PCs to aid interpretation. Chapter 11 discusses these, covering the well-established idea
of rotation as well as recently developed techniques. These techniques either replace PCA by alternative procedures that give simpler results, or approximate the PCs once they have been obtained. A small amount of this material comes from Section 12.4 of the first edition, but the great majority is new. The chapter also includes a section on physical interpretation of components.

My involvement in the developments described in Chapter 12 is less direct than in Chapter 11, but a substantial part of the chapter describes methodology and applications in atmospheric science and reflects my long-standing interest in that field. In the first edition, Section 11.2 was concerned with ‘non-independent and time series data.’ This section has been expanded to a full chapter (Chapter 12). There have been major developments in this area, including functional PCA for time series, and various techniques appropriate for data involving spatial and temporal variation, such as (multichannel) singular spectrum analysis, complex PCA, principal oscillation pattern analysis, and extended empirical orthogonal functions (EOFs). Many of these techniques were developed by atmospheric scientists and are little known in many other disciplines.

The last two chapters of the first edition are greatly expanded and become Chapters 13 and 14 in the new edition. There is some transfer of material elsewhere, but also new sections. In Chapter 13 there are three new sections, on size/shape data, on quality control and a final ‘odds-and-ends’ section, which includes vector, directional and complex data, interval data, species abundance data and large data sets. All other sections have been expanded, that on common principal component analysis and related topics especially so.

The first section of Chapter 14 deals with varieties of non-linear PCA. This section has grown substantially compared to its counterpart (Section 12.2) in the first edition. It includes material on the Gifi
system of multivariate analysis, principal curves, and neural networks. Section 14.2 on weights, metrics and centerings combines, and considerably expands, the material of the first and third sections of the old Chapter 12. The content of the old Section 12.4 has been transferred to an earlier part in the book (Chapter 10), but the remaining old sections survive and are updated. The section on non-normal data includes independent component analysis (ICA), and the section on three-mode analysis also discusses techniques for three or more groups of variables. The penultimate section is new and contains material on sweep-out components, extended components, subjective components, goodness-of-fit, and further discussion of neural nets.

The appendix on numerical computation of PCs has been retained and updated, but the appendix on PCA in computer packages has been dropped from this edition mainly because such material becomes out-of-date very rapidly.

The preface to the first edition noted three general texts on multivariate analysis. Since 1986 a number of excellent multivariate texts have appeared, including Everitt and Dunn (2001), Krzanowski (2000), Krzanowski and Marriott (1994) and Rencher (1995, 1998), to name just a few. Two large specialist texts on principal component analysis have also been published. Jackson (1991) gives a good, comprehensive coverage of principal component analysis from a somewhat different perspective than the present book, although it, too, is aimed at a general audience of statisticians and users of PCA. The other text, by Preisendorfer and Mobley (1988), concentrates on meteorology and oceanography. Because of this, the notation in Preisendorfer and Mobley differs considerably from that used in mainstream statistical sources. Nevertheless, as we shall see in later chapters, especially Chapter 12, atmospheric science is a field where much development of PCA and related topics has occurred, and Preisendorfer
and Mobley’s book brings together a great deal of relevant material.

A much shorter book on PCA (Dunteman, 1989), which is targeted at social scientists, has also appeared since 1986. Like the slim volume by Daultrey (1976), written mainly for geographers, it contains little technical material.

The preface to the first edition noted some variations in terminology. Likewise, the notation used in the literature on PCA varies quite widely. Appendix D of Jackson (1991) provides a useful table of notation for some of the main quantities in PCA collected from 34 references (mainly textbooks on multivariate analysis). Where possible, the current book uses notation adopted by a majority of authors where a consensus exists.

To end this Preface, I include a slightly frivolous, but nevertheless interesting, aside on both the increasing popularity of PCA and on its terminology. It was noted in the preface to the first edition that both terms ‘principal component analysis’ and ‘principal components analysis’ are widely used. I have always preferred the singular form as it is compatible with ‘factor analysis,’ ‘cluster analysis,’ ‘canonical correlation analysis’ and so on, but had no clear idea whether the singular or plural form was more frequently used. A search for references to the two forms in key words or titles of articles using the Web of Science for the six years 1995–2000 revealed that the number of singular to plural occurrences were, respectively, 1017 to 527 in 1995–1996; 1330 to 620 in 1997–1998; and 1634 to 635 in 1999–2000. Thus, there has been nearly a 50 percent increase in citations of PCA in one form or another in that period, but most of that increase has been in the singular form, which now accounts for 72% of occurrences. Happily, it is not necessary to change the title of this book.

I.T. Jolliffe
April, 2002
Aberdeen, U.K.

Preface to the First Edition

Principal component analysis is probably the oldest and best known
of the techniques of multivariate analysis. It was first introduced by Pearson (1901), and developed independently by Hotelling (1933). Like many multivariate methods, it was not widely used until the advent of electronic computers, but it is now well entrenched in virtually every statistical computer package.

The central idea of principal component analysis is to reduce the dimensionality of a data set in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. Computation of the principal components reduces to the solution of an eigenvalue-eigenvector problem for a positive-semidefinite symmetric matrix. Thus, the definition and computation