Information Theory, Pattern Recognition and Neural Networks, MacKay


Information Theory, Inference, and Learning Algorithms
David J.C. MacKay
Copyright 1995-2002. Draft 3.1.1, October 5, 2002.

Information Theory, Pattern Recognition and Neural Networks
Approximate roadmap for the eight-week course in Cambridge (lecture topics, with chapter references where given):

  Introduction to Information Theory
  Information content & typicality
  Symbol codes
  Arithmetic codes
  Noisy channels; definition of mutual information and capacity
  The noisy-channel coding theorem (Chapter 10)
  Clustering; Bayesian inference (Chapters 3, 23, 25)
  Monte Carlo methods (Chapters 32, 33)
  Variational methods (Chapter 36)
  Neural networks: the single neuron (Chapter 43)
  Capacity of the single neuron (Chapter 44)
  Learning as inference (Chapter 45)
  The Hopfield network; content-addressable memory (Chapter 46)

Suggested work between lectures includes exercise 3.8 (p. 73) and reading chapter 34 (Ising models).

Contents

Part I: Data Compression
  Introduction to Information Theory, with solutions to Chapter 1's exercises
  Probability, Entropy, and Inference, with solutions to Chapter 2's exercises
  More about Inference, with solutions to Chapter 3's exercises
  The Source Coding Theorem, with solutions to Chapter 4's exercises
  Symbol Codes, with solutions to Chapter 5's exercises
  Stream Codes, with solutions to Chapter 6's exercises
  Further Exercises on Data Compression, with solutions to Chapter 7's exercises
  Codes for Integers

Part II: Noisy-Channel Coding
  Correlated Random Variables, with solutions to Chapter 8's exercises
  Communication over a Noisy Channel, with solutions to Chapter 9's exercises
  10 The Noisy-Channel Coding Theorem, with solutions to Chapter 10's exercises
  11 Error-Correcting Codes & Real Channels, with solutions to Chapter 11's exercises

Part III: Further Topics in Information Theory
  12 Hash Codes: Codes for Efficient Information Retrieval, with solutions to Chapter 12's exercises
  13 Binary Codes
  14 Very Good Linear Codes Exist
  15 Further Exercises on Information Theory, with solutions to Chapter 15's exercises
  16 Message Passing
  17 Communication over Constrained Noiseless Channels, with solutions to Chapter 17's exercises
  18 Language Models and Crosswords
  19 Cryptography and Cryptanalysis: Codes for Information Concealment
  20 Units of Information Content
  21 Why have sex? Information acquisition and evolution

Part IV: Probabilities and Inference
  Introduction to Part IV
  An Example Inference Task: Clustering
  Exact Inference by Complete Enumeration
  Maximum Likelihood and Clustering
  Useful Probability Distributions
  Exact Marginalization
  Exact Marginalization in Trellises
  More on trellises, with solutions to Chapter 28's exercises
  Exact Marginalization in Graphs
  Laplace's method
  Model Comparison and Occam's Razor
  Monte Carlo methods, with solutions to Chapter 32's exercises
  Efficient Monte Carlo methods
  Ising Models, with solutions to Chapter 34's exercises
  Exact Monte Carlo Sampling
  Variational Methods, with solutions to Chapter 36's exercises
  Independent Component Analysis and Latent Variable Modelling
  Further exercises on inference
  Decision theory
  What Do You Know if You Are Ignorant?
  Bayesian Inference and Sampling Theory

Part V: Neural Networks
  42 Introduction to Neural Networks
  43 The Single Neuron as a Classifier, with solutions to Chapter 43's exercises
  Capacity of a single neuron, with solutions to Chapter 44's exercises
  Learning as Inference, with solutions to Chapter 45's exercises
  The Hopfield network, with solutions to Chapter 46's exercises
  From Hopfield Networks to Boltzmann Machines
  Supervised Learning in Multilayer Networks
  Gaussian processes
  Deconvolution
  More about Graphical models and belief propagation

Part VI: Complexity and Tractability
  52 Valiant, PAC
  53 NP completeness

Part VII: Sparse Graph Codes
  Introduction to sparse graph codes
  Low-density parity-check codes
  Convolutional codes
  Turbo codes
  Repeat-accumulate codes

Part VIII: Appendices
  A Notation
  B Useful formulae, etc.
  Bibliography


About Chapter 1

I hope you will find the mathematics in the first chapter easy. You will need to be familiar with the binomial distribution. And to solve the exercises in the text, which I urge you to, you will need to remember Stirling's approximation for the factorial function, x! \simeq x^x e^{-x}, and be able to apply it to \binom{N}{r} = \frac{N!}{(N-r)!\, r!}. These topics are reviewed below. [Unfamiliar notation? See Appendix A, p. 602.]

The binomial distribution

Example 0.1: A bent coin has probability f of coming up heads. The coin is tossed N times. What is the probability distribution of the number of heads, r? What are the mean and variance of r?

Solution: The number of heads has a binomial distribution,

    P(r \mid f, N) = \binom{N}{r} f^r (1-f)^{N-r}.    (1)

The mean, E[r], and variance, var[r], of this distribution are defined by

    E[r] \equiv \sum_{r=0}^{N} P(r \mid f, N)\, r    (2)

    var[r] \equiv E[(r - E[r])^2]    (3)
           = E[r^2] - (E[r])^2 = \sum_{r=0}^{N} P(r \mid f, N)\, r^2 - (E[r])^2.    (4)

Rather than evaluating the sums over r in (2) and (4) directly, it is easiest to obtain the mean and variance by noting that r is the sum of N independent random variables, namely, the number of heads in the first toss (which is either zero or one), the number of heads in the second toss, and so forth. In general,

    E[x + y] = E[x] + E[y]  for any random variables x and y;
    var[x + y] = var[x] + var[y]  if x and y are independent.    (5)

So the mean of r is the sum of the means of those random variables, and the variance of r is the sum of their variances. The mean number of heads in a single toss is f \times 1 + (1-f) \times 0 = f, and the variance of the number of heads in a single toss is

    f \times 1^2 + (1-f) \times 0^2 - f^2 = f - f^2 = f(1-f),    (6)

so the mean and variance of r are

    E[r] = N f    and    var[r] = N f (1-f).    (7)

Figure 1. The binomial distribution P(r | f=0.3, N=10), on a linear scale (top) and a logarithmic scale (bottom).
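To make (7) concrete, here is a small numerical check in Python; it is an added illustration, not part of MacKay's text. It evaluates the sums in (2) and (4) directly and compares them with Nf and Nf(1-f), using the values f = 0.3 and N = 10 from Figure 1.

    from math import comb

    def binomial_pmf(r, f, N):
        # P(r | f, N) = C(N, r) f^r (1 - f)^(N - r), equation (1)
        return comb(N, r) * f**r * (1 - f)**(N - r)

    f, N = 0.3, 10
    mean = sum(binomial_pmf(r, f, N) * r for r in range(N + 1))              # the sum in (2)
    second_moment = sum(binomial_pmf(r, f, N) * r**2 for r in range(N + 1))
    variance = second_moment - mean**2                                       # the sum in (4)

    print(mean, N * f)                # both about 3.0
    print(variance, N * f * (1 - f))  # both about 2.1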
Approximating x! and \binom{N}{r}

Let's derive Stirling's approximation by an unconventional route. We start from the Poisson distribution,

    P(r \mid \lambda) = e^{-\lambda} \frac{\lambda^r}{r!},    r \in \{0, 1, 2, \ldots\}.    (8)

For large \lambda, this distribution is well approximated, at least in the vicinity of r \simeq \lambda, by a Gaussian distribution with mean \lambda and variance \lambda:

    e^{-\lambda} \frac{\lambda^r}{r!} \simeq \frac{1}{\sqrt{2\pi\lambda}}\, e^{-\frac{(r-\lambda)^2}{2\lambda}}.    (9)

Let's plug r = \lambda into this formula:

    e^{-\lambda} \frac{\lambda^\lambda}{\lambda!} \simeq \frac{1}{\sqrt{2\pi\lambda}}    (10)

    \Rightarrow  \lambda! \simeq \lambda^\lambda e^{-\lambda} \sqrt{2\pi\lambda}.    (11)

This is Stirling's approximation for the factorial function, including several of the correction terms that are usually forgotten:

    x! \simeq x^x e^{-x} \sqrt{2\pi x}  \Leftrightarrow  \ln x! \simeq x \ln x - x + \tfrac{1}{2} \ln 2\pi x.    (12)

We can use this approximation to approximate \ln \binom{N}{r}, where \binom{N}{r} \equiv \frac{N!}{(N-r)!\, r!}:

    \ln \binom{N}{r} \simeq (N - r) \ln \frac{N}{N-r} + r \ln \frac{N}{r}.    (13)

Since all the terms in this equation are logarithms, this result can be rewritten in any base. We will denote natural logarithms (log_e) by 'ln', and logarithms to base 2 (log_2) by 'log'. [Recall that \log_2 x = \log_e x / \log_e 2. Note that \partial \log_2 x / \partial x = \frac{1}{\log_e 2} \frac{1}{x}.]

If we introduce the binary entropy function,

    H_2(x) \equiv x \log \frac{1}{x} + (1 - x) \log \frac{1}{1 - x},    (14)

then we can rewrite the approximation (13) as

    \log \binom{N}{r} \simeq N H_2(r/N),    (15)

or, equivalently,

    \binom{N}{r} \simeq 2^{N H_2(r/N)}.    (16)

If we need a more accurate approximation, we can include terms of the next order:

    \log \binom{N}{r} \simeq N H_2(r/N) - \tfrac{1}{2} \log \left[ 2\pi N\, \frac{N-r}{N}\, \frac{r}{N} \right].    (17)

Figure 2. The Poisson distribution P(r | \lambda = 15), on a linear scale (top) and a logarithmic scale (bottom).

Figure 3. The binary entropy function H_2(x).
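As a quick sanity check on approximations (15)-(17), again an added Python sketch rather than text from the book, the snippet below compares the exact value of log_2 C(N, r) with the leading-order estimate N H_2(r/N) and the corrected estimate (17), for N = 1000 and r = 100.

    from math import comb, log2, pi

    def H2(x):
        # binary entropy function, equation (14), in bits
        return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

    N, r = 1000, 100
    exact = log2(comb(N, r))                 # exact value of log2 C(N, r)
    leading = N * H2(r / N)                  # approximation (15)
    corrected = leading - 0.5 * log2(2 * pi * N * ((N - r) / N) * (r / N))   # approximation (17)

    print(exact)      # about 464.4
    print(leading)    # about 469.0
    print(corrected)  # about 464.4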
1  Introduction to Information Theory

    The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
                                                        (Claude Shannon, 1948)

In the first half of this book we study how to measure information content; we learn how to compress data; and we learn how to communicate perfectly over imperfect communication channels. We start by getting a feeling for this last problem.

1.1  How can we achieve perfect communication over an imperfect, noisy communication channel?

Some examples of noisy communication channels are:

  - an analogue telephone line, over which two modems communicate digital information;
  - the radio communication link from the Jupiter-orbiting spacecraft, Galileo, to earth;
  - reproducing cells, in which the daughter cells' DNA contains information from the parent cells;
  - a disc drive.

[Margin figures: parent cell -> daughter cells; computer memory -> disc drive -> computer memory.]

The last example shows that communication doesn't have to involve information going from one place to another. When we write a file on a disc drive, we'll read it off in the same location, but at a later time.

These channels are noisy. A telephone line suffers from cross-talk with other lines; the hardware in the line distorts and adds noise to the transmitted signal. The deep space network that listens to Galileo's puny transmitter receives background radiation from terrestrial and cosmic sources. DNA is subject to mutations and damage. A disc drive, which writes a binary digit (a one or zero, also known as a bit) by aligning a patch of magnetic material in one of two orientations, may later fail to read out the stored binary digit: the patch of material might spontaneously flip magnetization, or a glitch of background noise might cause the reading circuit to report the wrong value for the binary digit, or the writing head might not induce the magnetization in the first place because of interference from neighbouring bits.

In all these cases, if we transmit data, e.g., a string of bits, over the channel, there is some probability that the received message will not be identical to the transmitted message. We would prefer to have a communication channel for which this probability was zero, or so close to zero that for practical purposes it is indistinguishable from zero.

Let's consider a noisy disc drive that transmits each bit correctly with probability (1 - f) and incorrectly with probability f. This model communication channel is known as the binary symmetric channel (figure 1.1). The transmitted symbol is x and the received symbol is y, with

    P(y=0 | x=0) = 1 - f;    P(y=0 | x=1) = f;
    P(y=1 | x=0) = f;        P(y=1 | x=1) = 1 - f.

As an example, let's imagine that f = 0.1, that is, ten per cent of the bits are flipped (figure 1.2). A useful disc drive would flip no bits at all in its entire lifetime. If we expect to read and write a gigabyte per day for ten years, we require a bit error probability of the order of 10^{-15}, or smaller. There are two approaches to this goal.

The physical solution

The physical solution is to improve the physical characteristics of the communication channel to reduce its error probability. We could improve our disc drive by using more reliable components in its circuitry; evacuating the air from the disc enclosure so as to eliminate the turbulent forces that perturb the reading head from the track; using a larger magnetic patch to represent each bit; or using higher-power signals or cooling the circuitry in order to reduce thermal noise. These physical modifications typically increase the cost of the communication channel.

Figure 1.1. The binary symmetric channel. The transmitted symbol is x and the received symbol y. The noise level, the probability of a bit's being flipped, is f.

Figure 1.2. A binary data sequence of length 10000 transmitted over a binary symmetric channel with noise level f = 0.1. [Dilbert image Copyright 1997 United Feature Syndicate, Inc., used with permission.]
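The binary symmetric channel is easy to simulate. The following Python sketch is an illustration added here, not code from the book: it sends 10000 random bits through a BSC with f = 0.1 and counts how many arrive flipped; the fraction comes out close to f, as in figure 1.2.

    import random

    def binary_symmetric_channel(bits, f, rng):
        # flip each bit independently with probability f (figure 1.1)
        return [b ^ (rng.random() < f) for b in bits]

    rng = random.Random(0)
    f = 0.1
    x = [rng.randint(0, 1) for _ in range(10000)]    # transmitted sequence
    y = binary_symmetric_channel(x, f, rng)          # received sequence

    flips = sum(xi != yi for xi, yi in zip(x, y))
    print(flips / len(x))   # close to f = 0.1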
The 'system' solution

Information theory and coding theory offer an alternative (and much more exciting) approach: we accept the given noisy channel and add communication systems to it so that we can detect and correct the errors introduced by the channel. As shown in figure 1.3, we add an encoder before the channel and a decoder after it. The encoder encodes the source message s into a transmitted message t, adding redundancy to the original message in some way. The channel adds noise to the transmitted message, yielding a received message r. The decoder uses the known redundancy introduced by the encoding system to infer both the original signal s and the added noise.

Figure 1.3. The 'system' solution for achieving reliable communication over a noisy channel: Source (s) -> Encoder (t) -> Noisy channel (r) -> Decoder (s-hat). The encoding system introduces systematic redundancy into the transmitted vector t. The decoding system uses this known redundancy to deduce from the received vector r both the original source vector and the noise introduced by the channel.

Whereas physical solutions give incremental channel improvements only at an ever-increasing cost, system solutions can turn noisy channels into reliable communication channels, with the only cost being a computational requirement at the encoder and decoder.

Information theory is concerned with the theoretical limitations and potentials of such systems: 'What is the best error-correcting performance we could achieve?' Coding theory is concerned with the creation of practical encoding and decoding systems.

1.2  Error-correcting codes for the binary symmetric channel

We now consider examples of encoding and decoding systems. What is the simplest way to add useful redundancy to a transmission? [To make the rules of the game clear: we want to be able to detect and correct errors, and retransmission is not an option. We get only one chance to encode, transmit, and decode.]
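To see the shape of the 'system' solution of figure 1.3 as code, here is a minimal Python sketch, added as an illustration and not taken from the text. It wires an encoder and a decoder around the binary symmetric channel; the trivial identity 'code' used here adds no redundancy, so the decoded message has an error rate of roughly f, which is the baseline the codes below set out to beat.

    import random

    def bsc(bits, f, rng):
        # binary symmetric channel: flip each bit with probability f
        return [b ^ (rng.random() < f) for b in bits]

    def transmit(source, encode, decode, f, rng):
        # the pipeline of figure 1.3: source -> encoder -> noisy channel -> decoder
        t = encode(source)    # transmitted vector
        r = bsc(t, f, rng)    # received vector
        return decode(r)      # estimate s-hat of the source

    def identity(bits):
        # a 'code' that adds no redundancy at all
        return list(bits)

    rng = random.Random(1)
    s = [rng.randint(0, 1) for _ in range(10000)]
    s_hat = transmit(s, identity, identity, f=0.1, rng=rng)
    errors = sum(a != b for a, b in zip(s, s_hat))
    print(errors / len(s))   # about 0.1: without coding, the error rate is just f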
Repetition codes

A straightforward idea is to repeat every bit of the message a prearranged number of times, for example, three times, as shown in figure 1.4. We call this repetition code 'R3'.

Figure 1.4. The repetition code R3.

    Source sequence s    Transmitted sequence t
    0                    000
    1                    111

Imagine that we transmit the source message s = 0 1 ... over a binary symmetric channel with noise level f = 0.1 using this repetition code. We can describe the channel as 'adding' a sparse noise vector n to the transmitted vector t.

B.8  Some numbers (from Appendix B, Useful formulae, etc.)

    2^8192  ~  10^2466     number of distinct kilobyte files
    2^1024  ~  10^308      number of states of a 2D Ising model with 32x32 spins
    2^1000  ~  10^301      number of binary strings of length 1000
    2^500   ~  3x10^150
    2^469   ~  10^141      number of binary strings of length 1000 having 100 1s and 900 0s
    2^266   ~  10^80       number of electrons in the universe
    2^200   ~  1.6x10^60
    2^190   ~  10^57       number of electrons in the solar system
    2^171   ~  3x10^51     number of electrons in the earth
    2^100   ~  10^30
    2^98    ~  3x10^29     age of the universe / picoseconds
    2^58    ~  3x10^17     age of the universe / seconds

    3x10^10    number of bits in the wheat genome
    6x10^9     number of bits to list one human genome
    2x10^8     number of bits in the C. elegans genome
    2x10^8     number of bits in the Arabidopsis thaliana genome
    3x10^7     one year / seconds
    10^7       number of bits to list one E. coli genome
    6x10^6     number of years since the human/chimpanzee divergence
    2x10^5     number of generations since the human/chimpanzee divergence
    3x10^-8    probability of error in transmission of coding DNA, per nucleotide, per generation

The table also lists, among other quantities, the population of the earth (about 6x10^9), the number of bits in the compressed PostScript file that is this book and in the Unix kernel (about 2x10^7 each), the number of genes in the human genome and in Arabidopsis thaliana (a few x10^4), the number of base pairs in a gene (about 10^3), and the probability of undetected error in a hard disc drive, after error correction and detection. Scale markers in the table include 2^20 = 1048576 ~ 10^6, 2^10 = 1024 ~ 10^3, 2^-10 ~ 10^-3, 2^-20 ~ 10^-6, 2^-30 ~ 10^-9, and 2^-60 ~ 10^-18.
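Returning to the repetition code R3 described above: the following Python sketch, an added illustration rather than code from the book, encodes a random source message with R3, passes it through a binary symmetric channel with f = 0.1, and decodes each block of three received bits by majority vote. The measured bit error rate falls from about f to about 3 f^2 (1-f) + f^3, roughly 0.028, the probability that two or three of the three copies of a bit are flipped.

    import random

    def encode_R3(s):
        # repeat each source bit three times (figure 1.4)
        return [b for b in s for _ in range(3)]

    def bsc(bits, f, rng):
        # binary symmetric channel: flip each bit with probability f
        return [b ^ (rng.random() < f) for b in bits]

    def decode_R3(r):
        # majority vote over each block of three received bits
        return [int(sum(r[i:i + 3]) >= 2) for i in range(0, len(r), 3)]

    rng = random.Random(2)
    f = 0.1
    s = [rng.randint(0, 1) for _ in range(100000)]
    s_hat = decode_R3(bsc(encode_R3(s), f, rng))

    errors = sum(a != b for a, b in zip(s, s_hat))
    print(errors / len(s))   # about 0.028, compared with about 0.1 for uncoded transmission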
[...] When B/K = 1/5 and N = 5, the expectation and variance of n_B are 1 and 4/5. The standard deviation is 0.89. When B/K = 1/5 and N = 400, the expectation and variance of n_B are 80 and 64. The standard deviation is 8.

Probability, Entropy, and Inference

Independence. Two random variables X and Y are independent (sometimes written X ⊥ Y) if and only if

    P(x, y) = P(x) P(y).    (2.11)

Exercise 2.2: Are the random variables X and [...]

[...] P and Q: in general D_KL(P||Q) ≠ D_KL(Q||P), so D_KL, although it is sometimes called the 'K-L distance', is not strictly a 'distance'. This quantity is important in pattern recognition and neural networks.
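The asymmetry of D_KL is easy to check numerically. The following Python sketch is an added illustration, not from the book: it computes D_KL(P||Q) = sum_x P(x) log2[P(x)/Q(x)] for a pair of small distributions and shows that swapping the arguments changes the value.

    from math import log2

    def kl_divergence(P, Q):
        # D_KL(P||Q) = sum_x P(x) log2( P(x) / Q(x) ), in bits
        return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

    P = [0.5, 0.25, 0.25]
    Q = [0.9, 0.05, 0.05]

    print(kl_divergence(P, Q))   # about 0.74 bits
    print(kl_divergence(Q, P))   # about 0.53 bits: swapping the arguments changes the answer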
