NPCDS/MITACS Spring School on Statistical and Machine Learning: Topics at the Interface

Research in learning has a long history of cross-fertilization between Statistics and Computer Science. Examples include the tree-growing algorithms developed in the 1980s (the statistically motivated CART and the machine-learning motivated C4.5), graphical models (log-linear models in statistics, probabilistic or Bayesian networks in computer science), and the high-profile treatment of machine learning in the popular statistical text "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Hastie, Tibshirani and Friedman. We seek to continue this rich and mutually beneficial tradition by hosting an event taught by, and directed at, both statistical and machine learners.

The week will begin with a one-day introduction to key concepts in statistical and machine learning. This will be used as an opportunity to stress (and contrast) the underlying philosophies of the two disciplines. For example, statisticians emphasize probabilistic models for learning, together with techniques for quantifying the variation in the estimated model that results from variation in the learning sample. For many machine learners, the algorithm is the model, and emphasis is placed on developing flexible yet interpretable methods that learn well in challenging contexts (computer vision, natural language processing).

Following this introduction, several specific topics will be discussed in detail. A topic may originate in one of the two disciplines, but all of them contain significant contributions from each area. Neural networks will be covered first. Developed in a machine learning context, these flexible models have been adopted by researchers in many fields. Coverage will include statistical approaches (for example, Bayesian methods for neural networks) and a variety of machine learning approaches (for instance, mixtures-of-experts models). Next we will cover model-based clustering, a family of unsupervised methods that identify previously unknown clusters in data by fitting mixtures of probability models.
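
To make the idea concrete, here is a minimal sketch of model-based clustering, assuming Python with scikit-learn; the iris data, the Gaussian mixture family, and the range of candidate cluster counts are illustrative choices rather than the school's material.

    # A minimal sketch of model-based clustering: fit a Gaussian mixture
    # for each candidate number of clusters and select by BIC, a common
    # criterion in this family of methods.
    from sklearn.datasets import load_iris
    from sklearn.mixture import GaussianMixture

    X = load_iris().data  # illustrative data set

    models = [
        GaussianMixture(n_components=k, covariance_type="full",
                        random_state=0).fit(X)
        for k in range(1, 7)
    ]
    best = min(models, key=lambda m: m.bic(X))

    print("chosen number of clusters:", best.n_components)
    print("first ten cluster assignments:", best.predict(X)[:10])

Each mixture component plays the role of one cluster, and the probability model makes the choice of the number of clusters a model-selection problem rather than an ad hoc decision.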

Day four will focus on support vector machines, a machine learning technique that has received considerable attention in recent years across a wide variety of disciplines, including learning and AI, operations research, engineering, and statistics. The technique is not well understood by many statisticians, despite bold claims of outstanding performance from its proponents. Presentations by statisticians and machine learners with research expertise in this area will focus on the basic ideas behind support vector machines and on the promise and perils of their use. The event will conclude on the fifth day with coverage of manifold learning, including its relation to semi-supervised learning. Increasingly popular among machine learners, manifold methods attempt to identify a nonlinear subspace (a surface) in a high-dimensional space such that the data all lie close to that surface. Brief connections will be drawn to earlier manifold methods such as principal surfaces and self-organizing maps, and to the kernel methods that form the basis of the support vector machines discussed earlier.
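
As a small illustration of the basic idea behind support vector machines, the following sketch, again assuming Python with scikit-learn, fits a kernel classifier to synthetic two-moons data; the data set, kernel, and parameter values are illustrative assumptions, not the presenters' examples.

    # A minimal sketch of a kernel support vector machine classifier.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The RBF kernel implicitly maps the data into a high-dimensional
    # feature space in which a separating hyperplane is sought.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))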
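
Similarly, a minimal manifold-learning sketch, under the same scikit-learn assumption: Isomap, one of several manifold methods and chosen here purely for illustration, embeds points sampled near a two-dimensional surface sitting in three-dimensional space.

    # A minimal sketch of manifold learning on the classic swiss-roll
    # data: points lying near a 2-D surface embedded in 3-D space.
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap

    X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

    # Isomap seeks a low-dimensional embedding that preserves geodesic
    # distances along the surface, "unrolling" the nonlinear subspace.
    embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
    print("embedded shape:", embedding.shape)  # (1000, 2)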