Mathematics and Statistics
The necessary mathematical background for machine learning falls into two main areas:
Linear algebra
Analysis
Some of the works cited above contain introductory chapters covering the necessary concepts. Garrity [33] (Chapters 1 and 2, and Section 6.4) can be considered a ‘one-stop shop’ for a quick introduction to the necessary mathematical background. The book by Mahajan [48] and its associated course on MIT OpenCourseWare provide good exercises in thinking about mathematical approaches and approximating answers.
When thinking about mathematics:
Focus on the concepts rather than the computation. Computers are there to do the computational work for you. What is most important is that you understand the ‘what’ and the ‘why’ of mathematical ideas.
Think geometrically. Drawing or imagining pictures, especially when dealing with linear algebra, can help you understand what is going on.
The necessary statistical background includes:
Basic probability
Bayesian statistics
Information theory
The following concepts in linear algebra are important to understand for machine learning:
Vector spaces
Vectors and matrices, operations on vectors and matrices
Dimension, rank, and nullspace
Eigenvalues, singular value decomposition
Fundamental theorem of linear algebra
The books by Bishop [1] and Goodfellow et al. [4] have particularly good coverage of linear algebra concepts required for machine learning. Strang [23] and its associated course on the MIT OpenCourseWare site provide a thorough grounding in linear algebra.
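As a quick illustration of several of the concepts listed above, the sketch below (written with NumPy, which is a convenience assumption rather than something the references prescribe) computes the rank, eigenvalues, and singular value decomposition of a small rank-deficient matrix and recovers a nullspace vector from the SVD.

```python
import numpy as np

# A small 3x3 matrix with linearly dependent rows (row 3 = row 1 + row 2),
# so its rank is 2 and its nullspace is one-dimensional.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

print("rank:", np.linalg.matrix_rank(A))      # -> 2

# Eigenvalues (A is square, so they are defined).
print("eigenvalues:", np.linalg.eigvals(A))

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A)
print("singular values:", s)                  # one of them is ~0 (rank deficiency)

# The right singular vector for the ~0 singular value spans the nullspace.
null_vec = Vt[-1]
print("A @ null_vec is ~0:", np.allclose(A @ null_vec, 0))
```

Reading the smallest singular value as ‘how close the matrix is to losing a dimension’ is one example of the geometric viewpoint recommended above.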
Analysis can be thought of as providing the underpinnings of calculus and further advanced mathematics. Important concepts from analysis include:
Set theory
Sequences and limits
Continuity
Manifolds
Most books on machine learning take it for granted that the reader has at least a passing familiarity with these concepts. The appendix in Schapire et al. [21] provides adequate coverage of most of them. Rudin [61] is a standard undergraduate text in analysis that provides everything you will need on the subject for machine learning, and more. Mattuck [49] is much less intense than Rudin’s book, but it provides a more than adequate background for machine learning. Alcock’s book [26] is intended as a companion to an analysis course and emphasizes the ‘why’ more than the ‘what’ of analysis, but it covers enough of the basic ideas to serve as a stand-alone reference for the purposes of machine learning.
Probability theory is at the heart of many modern machine learning methods. Key concepts to know include the following:
Random variables: discrete, continuous; expectation and variance
Chain rule
Important discrete and continuous distributions, including binomial, Dirichlet, exponential, Gaussian
Functions of random variables
Independence
Mixture distributions
Conditional distributions and conditional independence
Bertsekas and Tsitsiklis [28] is intended as a text for engineering students encountering probability for the first time. The texts by Bishop [1] and Murphy [17] introduce key concepts of probability and important probability distributions.
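As a small illustration of several of the listed concepts (random variables, expectation and variance, and mixture distributions), the NumPy sketch below samples from a two-component Gaussian mixture and compares Monte Carlo estimates of its mean and variance with the analytic values; the particular weights and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component Gaussian mixture: weights, means, standard deviations.
w = np.array([0.3, 0.7])
mu = np.array([0.0, 5.0])
sd = np.array([1.0, 2.0])

# Sample by first drawing the (discrete) component, then the (continuous) value.
n = 200_000
component = rng.choice(2, size=n, p=w)
x = rng.normal(mu[component], sd[component])

# Monte Carlo estimates of expectation and variance vs. the analytic values.
print("E[X]   (sampled):", x.mean(), " analytic:", w @ mu)
print("Var[X] (sampled):", x.var(),  " analytic:", w @ (sd**2 + mu**2) - (w @ mu) ** 2)
```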
Statistics in machine learning is almost entirely Bayesian. In fact, some authors, like Murphy [17], go out of their way to point out why traditional (frequentist) statistics is inappropriate for machine learning. And, as pointed out in Section 2.2 above, some areas of machine learning such as Bayesian networks are grounded in Bayesian statistics. For much of the 20th century Bayesian statistics was ignored or vigorously opposed by frequentist statisticians because of its supposed ‘subjectivity’; McGrayne’s book [52] is worth reading for a discussion of the issues between frequentists and Bayesians.
Bayes’ theorem is the key concept in Bayesian statistics: i.e.,
prior knowledge + new data = better knowledge
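In symbols, and as a standard statement of the theorem rather than a formula taken from any of the references above, for a parameter $\theta$ and observed data $D$:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
$$

that is, the posterior is proportional to the likelihood times the prior.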
The biggest barrier to implementing Bayesian methods was that, except in special cases, most problems had no analytical solution. The advent of low-cost computing, combined with the development of efficient numerical tools for Bayesian analysis like BUGS (now JAGS [41]), has made it possible to carry out Bayesian analysis for a wide range of problems.
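One of those special cases with an analytical solution is the conjugate Beta-Binomial model, sketched below with SciPy (the prior parameters and data are made up for the illustration); for models outside such conjugate families, samplers like BUGS or JAGS do the equivalent work numerically.

```python
from scipy import stats

# Conjugate Beta-Binomial update: Beta(a, b) prior on a coin's heads probability.
a_prior, b_prior = 2, 2          # weak prior belief that the coin is roughly fair
heads, tails = 7, 3              # new data: 10 flips

# The posterior is Beta(a + heads, b + tails) -- no numerical sampler needed here.
posterior = stats.beta(a_prior + heads, b_prior + tails)

print("posterior mean:", posterior.mean())            # ~0.643
print("95% credible interval:", posterior.interval(0.95))
```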
Gelman et al. [34] is one of the most comprehensive works available on Bayesian analysis, and is recommended for a thorough grounding in the subject. Kruschke [46] is a good reference for those wanting to get a quick start in implementing Bayesian analysis. McElreath [51] is a text on Bayesian statistics with examples using the Stan language.
Missing data are inevitable in the real world. For example, in household travel surveys, some respondents tend to refuse to answer questions about household income. Rather than simply throwing out observations with missing items, the modern approach is to impute the missing values, usually through a technique known as multiple imputation. Chapter 18 of Gelman et al. [34] discusses missing data in a Bayesian context. van Buuren [29] is a recent text on imputation of missing data.
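The sketch below gives a deliberately simplified flavor of multiple imputation using only NumPy: missing values are filled in several times by drawing from a model fitted to the observed data, an estimate is computed on each completed dataset, and the results are pooled. A proper analysis (as in Gelman et al. [34] or van Buuren [29]) would also propagate uncertainty in the imputation model itself and pool variances as well as point estimates; everything here, including the toy income data, is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for survey income data with roughly 20% of responses missing.
income = rng.normal(60_000, 15_000, size=500)
income[rng.random(500) < 0.2] = np.nan

observed = income[~np.isnan(income)]
m = 20                                   # number of imputed datasets
estimates = []
for _ in range(m):
    completed = income.copy()
    n_missing = int(np.isnan(completed).sum())
    # Draw imputations from a simple model fitted to the observed values,
    # rather than plugging in a single 'best guess'.
    completed[np.isnan(completed)] = rng.normal(observed.mean(),
                                                observed.std(),
                                                size=n_missing)
    estimates.append(completed.mean())

# Pool the point estimates across the m completed datasets.
print("pooled estimate of mean income:", np.mean(estimates))
```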
Information theory deals with the quantification of information content of data. Machine learning makes use of a number of concepts from information theory, such as entropy, mutual information, and Kullback-Leibler divergence.
Cover and Thomas [32] is a standard text on information theory that provides more than the minimum needed for machine learning; Chapter 2 contains a thorough introduction to the basic concepts, and should be more than sufficient for machine learning. MacKay’s text [13] discusses information theory from the perspective of machine learning. Bishop [1] and Murphy [17] each provide a brief background in information theory.
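The definitions below follow the standard conventions (entropy measured in bits, as in Chapter 2 of Cover and Thomas [32]); the NumPy implementation and the example distributions are simply an illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print("H(p)    =", entropy(p))            # 1.5 bits
print("H(q)    =", entropy(q))            # log2(3), about 1.585 bits
print("D(p||q) =", kl_divergence(p, q))   # >= 0, and 0 only when p == q
```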
Algorithms and data structures are at the heart of all machine learning work. This is where ‘the rubber meets the road’. Cormen et al. [] is one of the most comprehensive, yet accessible, works on algorithms; examples are written in pseudocode so that they can be easily implemented in the language of the reader’s choice. Cormen et al. also discuss randomized algorithms, which can often produce ‘almost as good’ answers far faster than traditional deterministic algorithms; Motwani and Raghavan [55] is devoted entirely to randomized algorithms. Sedgewick [63] is another useful reference on algorithms.
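As a toy illustration of the ‘faster, almost as good’ idea behind many randomized algorithms, the sketch below estimates the median of a large array from a small random sample instead of processing every element; the data, sample size, and distribution are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)

# Exact answer: a median computed over all 10 million values.
exact_median = np.median(data)

# Randomized alternative: estimate the median from a small random sample.
# The answer is only approximate, but the work is a tiny fraction of the full computation.
sample = rng.choice(data, size=10_000, replace=False)
approx_median = np.median(sample)

print("exact:", exact_median, " approximate:", approx_median)
```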
The series of texts by Knuth [45] has long been regarded as a classic on algorithms. The discussion on random number generation in Volume 2 is a must-read for anybody who uses random numbers in their work.
Almost any machine learning application will require some use of numerical analysis. As discussed below, there are software packages in machine learning that will do the numerical analysis (e.g., optimization, testing) for you. But you still need to know what is going on inside the computer so that you can fix things when they go wrong, as they inevitably will at times.
Press et al. [57] is an extensive reference on numerical analysis algorithms that shows implementations in computer code (C++ in the latest edition; earlier editions were written in C and Fortran). The book is especially valuable for discussions on traps one can fall into and how to avoid them.
The books by Nocedal and Wright [56] and Luenberger and Ye [47] cover both unconstrained and constrained optimization methods.
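The sketch below is not taken from either book; it is a minimal example of unconstrained optimization, minimizing a convex quadratic with plain gradient descent and checking the answer against a direct linear solve.

```python
import numpy as np

# Unconstrained minimization of the convex quadratic
#   f(x) = 0.5 * x^T A x - b^T x
# by plain gradient descent -- the kind of routine an optimization library
# normally handles for you, written out here only to make the idea concrete.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # symmetric positive definite
b = np.array([1.0, 1.0])

x = np.zeros(2)
step = 0.1
for _ in range(1_000):
    grad = A @ x - b            # gradient of f at x
    x = x - step * grad

print("gradient descent solution:", x)
print("direct solve of A x = b  :", np.linalg.solve(A, b))
```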
All numerical implementations will use floating point numbers, which can present some unpleasant surprises for the programmer who is unaware of the issues involved; for example, one should never, ever use equality comparisons for floating point numbers. The paper by Goldberg [38] should be read and understood by anybody who deals with floating point numbers.
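A minimal Python demonstration of the point (the tolerance of 1e-9 is an arbitrary choice for the example; Goldberg [38] discusses how to design such comparisons properly):

```python
import math

# The classic surprise: 0.1 + 0.2 is not exactly 0.3 in binary floating point.
print(0.1 + 0.2 == 0.3)                 # False
print(0.1 + 0.2)                        # 0.30000000000000004

# Compare with a tolerance instead of testing for exact equality.
print(math.isclose(0.1 + 0.2, 0.3))     # True
print(abs((0.1 + 0.2) - 0.3) < 1e-9)    # the same idea, spelled out
```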
Random number generation is used in many machine learning applications. But a number of random number generators in use are faulty. Volume 2 of Knuth [45] has an extensive discussion on random number generation, including the pitfalls of some of the most popular random number generators. The GNU site on random number generation algorithms [37] is also worth accessing for a more modern discussion of the issues.
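If you work in Python, one example of a modern, well-tested generator is NumPy's default Generator (backed by PCG64); the snippet below shows seeded, reproducible use. This is an illustration of the point, not a recommendation drawn from the references above.

```python
import numpy as np

# NumPy's Generator interface uses PCG64 by default; seeding it makes results reproducible.
rng = np.random.default_rng(seed=12345)
print(rng.integers(0, 100, size=5))
print(rng.standard_normal(3))

# The legacy np.random.seed / np.random.rand interface is based on the older
# Mersenne Twister; NumPy's documentation recommends the Generator interface for new code.
```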