Supervised Machine Learning
Machine learning, also called Statistical Learning or Data Mining, refers to the practice of using computers to learn from data. A machine learning model uses statistical techniques or other algorithms to find patterns in data, and/or to make predictions based on past observations. It has become an important tool in many different fields, underlying innovations such as personalized recommendations, fraud detection, and medical diagnostics. Many of the techniques used in machine learning have been around for decades. Machine learning’s popularity and growth in recent years are a result of the increased availability of data and more powerful computers, which make it feasible to analyze large amounts of data at acceptable speeds.
Types of Machine Learning
There are three fundamental types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning refers to learning to predict the value of a target variable from the values of other variables. Supervised learning models perform tasks such as detecting spam, flagging fraudulent credit card transactions, or determining whether a tumor is malignant or benign based on a medical image.
Unsupervised learning refers to the process of learning from data when there is no particular target variable of interest. Unsupervised learning models may be used to perform tasks such as finding themes in a large collection of blog posts or segmenting customers into groups by similarity. Another example may be seen on Amazon.com: if a product has many reviews, the main themes in the reviews are summarized at the top. It is likely an unsupervised machine learning model that creates that summary.
Reinforcement learning refers to a setting in which a learning agent observes its environment, selects and performs actions, and is either rewarded or penalized. The agent learns the best policy for selecting actions, that is, the policy that increases its net rewards over time. Examples of systems that use reinforcement learning are autonomous driving systems and game-playing systems.
In this textbook we will be focusing on supervised learning.
Supervised Machine Learning
Let’s take a more in-depth look at supervised learning. Supervised learning involves constructing a model that predicts an outcome variable (which can be a number or a category) from one or more input variables by training the model on known input-output pairs, called the training set or training data. Because the values of the outcome variable are called labels, the training data for supervised learning is also called labeled training data.
The goal of supervised learning is to build a model that can make effective predictions for new, never-before-seen data. The performance of the model on such new, never-before-seen data is called generalization performance or out-of-sample performance.
Two Major Categories of Supervised Learning
There are two major categories of supervised learning, defined by the nature of the value the model is predicting: classification and regression.
Classification models predict a class label, which is a choice from a pre-defined list of categories (such as will respond to offer and won’t respond to offer, or malignant and benign). Binary classification models choose between two categories (class labels), and multiclass classification models choose between three or more categories (class labels).
Regression models predict a numeric quantity (such as yield of a corn farm predicted from previous yields, weather, and the number of employees). Note that the term regression in this context refers to a category of supervised learning model, not to the specific statistical technique called linear regression. Linear regression is, however, one of the underlying algorithms that can be used within supervised learning regression models.
How Does the Model Make its Predictions?
Supervised learning models can use many different underlying algorithms to make their predictions, including statistical models, tree-based models, distance-based models, and deep-learning models such as neural networks. Each algorithm has strengths and weaknesses related to how interpretable or explainable it is, how many computing resources it requires to train and run, and its effectiveness on different types of data. Because it is difficult to know ahead of time which underlying algorithm will perform best, it is common practice to try several different algorithms and determine which one works best with your data.
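For a sense of what this looks like in practice, here is a minimal sketch that fits two different algorithms to the same labeled data and compares their accuracy. The dataset (the breast cancer dataset that ships with scikit-learn) and the two estimators are arbitrary choices for illustration, and the cross_val_score helper used here performs cross validation, which is explained in the next section.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# A small labeled dataset that ships with scikit-learn (tumor features -> malignant/benign)
X, y = load_breast_cancer(return_X_y=True)

# Two candidate algorithms for the same classification task
candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),  # higher iteration limit so the solver converges
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Compare average cross-validated accuracy to see which works better on this data
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```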
Estimating a Model’s Generalization Performance
Recall that a model’s generalization performance is how well it performs when making predictions for new, previously unseen data. Since generalization performance is the ultimate goal for a supervised learning model, you may be wondering how we can estimate it while we are building the model, well before the model is actually used to make predictions for new data.
Remember that to build a supervised learning model we need to have some data for which both the inputs and the target variable are known. This is called our labeled data. To estimate generalization performance while we are building the model, we divide our labeled data into parts. One part, the training data (also called the training set), is used to train the model. We hold back another part on which to test the model after it is built. The part of the data that we hold back for testing is called the test data (also called the test set). This is typically repeated several times to generate a robust estimate of generalization performance. The process of dividing our labeled data into training and test data several times, building the model on the training data and testing it on the test data each time, and then using the average performance on the test data to estimate generalization performance is called cross validation.
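A minimal sketch of these two ideas in scikit-learn follows; the dataset, the estimator, and its settings are placeholder choices for illustration. The first part holds back a single test set, and the second uses the cross_val_score helper, which repeats the train/test cycle across several folds and lets us average the results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled data: inputs X and known target values (labels) y
X, y = load_breast_cancer(return_X_y=True)

# Hold back 25% of the labeled data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)                            # train on the training data only
print("test-set accuracy:", model.score(X_test, y_test))  # evaluate on the held-back data

# Cross validation: 5 folds, each fold takes a turn as the test set
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=42), X, y, cv=5)
print("estimated generalization accuracy:", scores.mean())
```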
Overfitting and Underfitting
The concept of performing cross validation to estimate generalization performance is one of the key concepts of supervised machine learning. Another fundamental concept is that a model’s flexibility to fit the training data must be balanced: a model can underfit if it is not flexible enough and overfit if it is too flexible.
Overfitting refers to a model that fits the training data very closely, so that it can make very accurate predictions on the training data. This accuracy on the training data, however, comes at the expense of generalization performance. How does this happen? When a model’s underlying algorithm is very flexible, it can fit the training data so closely that it reflects random idiosyncrasies of the training data that aren’t representative of the population. When the model is then applied to new data, its performance suffers because the new data doesn’t share the same random idiosyncrasies. Thus, fitting the training data too closely can actually hurt performance on new, previously unseen data.
Underfitting refers to a model that is too simple (i.e., not complex or flexible enough) to reach its performance potential. It is not flexible enough to capture even the characteristics of the training data that are reflective of the population.
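To make this concrete, the sketch below (an illustrative example with an arbitrary dataset and algorithm) fits decision trees of different flexibility and compares training accuracy to test accuracy. A tree allowed to grow without limit typically scores near-perfectly on the training data but worse on the test data (overfitting), while a very shallow tree tends to score lower on both (underfitting).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth controls how flexibly the tree can fit the training data;
# None lets the tree keep splitting until it fits the training data as closely as possible
for depth in [1, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train accuracy={tree.score(X_train, y_train):.3f}, "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```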
Avoiding Overfitting and Underfitting
So, the goal of a supervised learning model is to achieve the best generalization performance possible, and to do that the model needs to avoid both underfitting and overfitting. How can this be accomplished?
First, we must remember that the model’s performance on the test data (data on which it was NOT trained) is the key performance metric, because its performance on the test data is our best proxy for how it will perform on new, previously-unseen data in the future. The model’s performance on the training data is irrelevant and should be ignored.
Second, many of the algorithms that can be used within supervised learning models have settings, called parameters or hyperparameters, that may be adjusted to control the algorithm’s flexibility to fit the training data. We will learn a technique called model tuning or hyperparameter tuning that will help us find the best settings for the algorithm’s parameters, that is, the settings that avoid both underfitting and overfitting. Using model tuning we can build models that are flexible enough to reach their performance potential, but not so flexible that they overfit the training data at the expense of generalization performance.
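As a preview of model tuning, here is a minimal sketch using scikit-learn’s GridSearchCV with a k-nearest-neighbors classifier, whose n_neighbors setting controls its flexibility. The dataset and the candidate settings are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate settings that control the algorithm's flexibility
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

# GridSearchCV tries each setting with cross validation on the training data
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best setting:", search.best_params_)
print("cross-validated accuracy for that setting:", search.best_score_)
print("test-set accuracy:", search.score(X_test, y_test))
```

After fitting, GridSearchCV refits the best setting on the full training data by default, so the search object can then be used to score or predict like any other fitted model.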
Tools for Supervised Learning
Supervised learning techniques are becoming more commonly used. As a result, there are many tools available that can do some aspects of supervised learning. The tool that we will use is the scikit-learn package in Python. Scikit-learn is a free, popular Python package that provides a set of efficient and effective tools for building machine learning models.
There are many algorithms that may be used within supervised learning models. They were developed by different people at different times, and many even have their own Python packages. One major contribution of scikit-learn is that it provides a consistent interface for building machine learning models that implement a large number of these machine learning algorithms. In addition to the consistent interface, scikit-learn also provides many other tools to support the model-building process.
In scikit-learn the underlying algorithms are called estimators and are implemented as Python classes. There are over 100 estimator classes implemented in scikit-learn. Examples of the algorithms that are implemented as estimators include linear regression, logistic regression, decision trees, random forests, k-nearest neighbors, gradient-boosted decision trees, support vector machines, and neural networks. We won’t be covering all of those algorithms in this book, but the list includes many commonly used algorithms that you may read or hear about.
Each estimator has a constructor function that is used to create an instance of that estimator class. For example, the KNeighborsClassifier() constructor is used to create an instance of a k-nearest-neighbors model to be used for classification. An instance of each estimator class has the following methods:
- fit() - used to fit the model to data. This means that the algorithm learns to make predictions on the data.
- predict() - used to make predictions after fitting the model to the data.
- score() - used to score the fitted model on data for which the labels (outcomes) are known. In other words, the fitted model makes predictions on new data, and then the predictions are compared to the actual target variable values for that data and a performance score of some type is calculated.
The fact that over 100 estimators can all be used with the same methods in scikit-learn is what we mean when we say that scikit-learn provides a common interface for many different machine learning algorithms.
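A minimal sketch of this interface, using the KNeighborsClassifier example above (the dataset is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # constructor creates an estimator instance
knn.fit(X_train, y_train)                   # learn from the labeled training data
predictions = knn.predict(X_test)           # predict class labels for new inputs
accuracy = knn.score(X_test, y_test)        # compare predictions to the known labels
print(predictions[:5], accuracy)
```

Swapping in a different estimator, such as DecisionTreeClassifier(), would leave the rest of this code unchanged; that interchangeability is the point of the common interface.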
Scikit-learn also provides classes that may be used to transform data. These are called transformers, and they are used to perform operations such as scaling numeric data and encoding categorical data. Transformers all have the following methods:
- fit() - used to fit the transformer to the data. The transformer extracts from the data what it needs to know in order to do the transformation.
- transform() - used to transform the data.
- fit_transform() - combines the fit() and transform() methods.
Here, again, scikit-learn provides a consistent interface to a diverse group of classes that may be used to transform data.
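For example, here is a minimal sketch of the transformer interface using StandardScaler, which scales each numeric column to mean 0 and standard deviation 1 (the data is a small made-up array used only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up numeric data with two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
scaler.fit(X)                     # learn each column's mean and standard deviation
X_scaled = scaler.transform(X)    # apply the scaling

# fit_transform() does both steps in one call
X_scaled_again = StandardScaler().fit_transform(X)
print(X_scaled)
```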
Scikit-learn also provides a variety of tools to help with the following parts of the model-building process:
- estimating generalization performance
- model tuning (finding the best parameter settings for the algorithm in the model)
- evaluating model performance with various metrics
- combining model-building steps into an integrated pipeline (a short sketch follows this list)
- datasets
- visualizations
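For instance, here is a minimal sketch of a pipeline that chains a transformer and an estimator and then estimates its generalization performance; the specific steps, settings, and dataset are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Chain a transformer (scaling) and an estimator (k-nearest neighbors) into one model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# The pipeline behaves like any other estimator, so it can be cross-validated directly
scores = cross_val_score(pipe, X, y, cv=5)
print("estimated generalization accuracy:", scores.mean())
```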
Comprehensive documentation is available at the scikit-learn website. In particular, check out the User Guide and the API reference.