9 Preprocessing Data
The term preprocessing refers to the various things we do to the data to prepare it for the specific machine learning algorithms that we will use in our models. The purpose of preprocessing is to improve model performance or interpretability.
Several different tasks are considered part of preprocessing, including the following:
- Imputing missing data - Filling in missing data so that features with missing data may be used in the machine-learning model (a brief imputation sketch follows this list). There are several strategies that may be used to fill in missing data, including the following:
- Mean/Median/Mode imputation - For numeric data the missing values are sometimes replaced with the mean or median for that feature. For categorical data the missing values are sometimes filled with the most common value (“mode”) for that feature.
- K-nearest neighbors (K-nn) imputation - This strategy uses K-nn to predict the missing values. The missing value is estimated from the value of that feature for the k nearest neighbors of the instance with the missing value.
- Regression imputation - uses a regression model built with the other features as the independent variables to predict the missing values in a feature.
- Scaling data - Transforming the features to a particular scale.
- Min-max scaling - This transforms each feature so that it has a range of 0 to 1.
- Robust scaling - Similar to standard scaling, but the transformed values for a feature are calculated as \(\frac{x_i - median(x)}{IQR(x)}\). This type of scaling is preferred if the data has significant outliers.
- Standard scaling - This transforms each feature so that it has a mean of 0 and a standard deviation of 1.
- Transforming data to reduce skewness - To prepare the data for models that assume that features are normally distributed, some features may be transformed by taking a logarithm of that feature (log transformation) or using a power function of the feature (Box-Cox transformation or Yeo-Johnson transformation)
- Discretization of numeric variables - This is also called binning because it is a process of converting numeric variables into discrete intervals or bins. It is often used to reduce the effect of outliers or to improve model interpretability.
- Encoding of categorical variables - Some algorithms require that categorical variables are encoded with numbers.
- Adding polynomial terms and interaction terms
- Polynomial terms are polynomials of the original features. For example, if \(x\) is one of the original features, \(x^2\) and \(x^3\) are polynomial terms of \(x\). Polynomial terms are sometimes added so that linear models can capture non-linear relationships.
- Interaction terms are products of original features. For example, if \(a\) and \(b\) are features, their interaction term is \(a*b\). Interaction terms can capture conditional (moderation) relationships in the data.
- Dimensionality reduction - Another name for features is dimensions. Dimensionality reduction techniques are techniques used to reduce the number of features, either by keeping some features and eliminating others (feature selection) or combining multiple features together into a single feature using a technique such as principal component analysis (PCA).
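As a brief illustration of the imputation strategies listed above, here is a minimal sketch using scikit-learn’s SimpleImputer and KNNImputer classes on made-up data (the column names and values are invented for illustration; imputation is not used elsewhere in this chapter):
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data with missing values in a numeric and a categorical column
demo = pd.DataFrame({'age':  [22, 35, None, 58, 41],
                     'city': ['NY', None, 'LA', 'NY', 'LA']})

# Median imputation for the numeric feature
num_imputer = SimpleImputer(strategy = 'median')
demo[['age']] = num_imputer.fit_transform(demo[['age']])

# Mode ("most frequent") imputation for the categorical feature
cat_imputer = SimpleImputer(strategy = 'most_frequent')
demo[['city']] = cat_imputer.fit_transform(demo[['city']])

# K-nn imputation (numeric features only) estimates each missing value from
# the values of that feature for the k nearest rows
# knn_imputer = KNNImputer(n_neighbors = 3)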
In this chapter we will learn how to perform scaling of numeric features and encoding of categorical features using scikit-learn tools.
9.1 Module and Function Imports
Below are the module, class, and function imports used in this chapter.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
9.2 Scaling
Scaling refers to changing the scale on which the features are measured. Several different types of scalers are provided by scikit-learn. We will focus on the following two:
- StandardScaler() - ensures that for each feature in the training data the mean is zero and the variance is 1. In other words, it converts every value of a feature to its z-score. Recall that the formula for a z-score is \(\frac{x_i - mean(x)}{stdev(x)}\)
- MinMaxScaler() - shifts the data so that each feature in the training data ranges between 0 and 1. The formula for min-max scaling is \(\frac{x_i - min(x)}{max(x) - min(x)}\)
These scalers are types of transformers in scikit-learn, so each has a fit() method, a transform() method, and a fit_transform() method that combines the fit and transform steps. An important principle to remember with scaling or any other transformation is that test data should always be transformed based on the characteristics of the training data. In other words, transformers such as scalers are always fit to the training data and then used to transform both the training and the test data. They are never fit to the test data.
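Here is a minimal sketch of that principle, using a small made-up array rather than a real dataset: the scaler is fit to the training data only and then used to transform both the training and the test data.
# Hypothetical training and test data with a single numeric feature
demo_train = np.array([[10.0], [20.0], [30.0], [40.0]])
demo_test = np.array([[25.0], [55.0]])

# Fit the scaler to the training data only
scaler = StandardScaler()
scaler.fit(demo_train)

# Transform both training and test data with the scaler fit to the training data
demo_train_scaled = scaler.transform(demo_train)
demo_test_scaled = scaler.transform(demo_test)

# MinMaxScaler is used the same way; note that test values outside the training
# range will fall outside the 0 to 1 range
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
demo_train_mm = minmax.fit_transform(demo_train)
demo_test_mm = minmax.transform(demo_test)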
9.3 Encoding of Categorical Variables
In machine learning the most common way to encode categorical variables is with one-hot encoding or one-out-of-N encoding. One-hot encoding is a strategy that replaces a single categorical feature with new binary features, one for each possible value of the categorical feature. For each row only one of the newly-created columns will have a 1 in it. The rest of the newly-created columns will have 0 for that row. The column with the 1 in it indicates the category for that row.
Note that this is different from how dummy variables are typically coded in statistics, where a categorical feature with \(n\) possible values is typically encoded into \(n-1\) features, with one of the categories (the reference category) represented by zeros on all the new features.
We can use scikit-learn’s OneHotEncoder() class to one-hot-encode our data. As a transformer class, its instances have the same fit(), transform(), and fit_transform() methods as other transformers.
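For example, here is a minimal sketch of OneHotEncoder() on a small made-up column (the data is invented for illustration):
# Hypothetical categorical data
demo = pd.DataFrame({'customer_type': ['retail', 'wholesale', 'government', 'retail']})

# One binary column is created for each category
ohe = OneHotEncoder()
encoded = ohe.fit_transform(demo[['customer_type']]).toarray()

# Column names for the new binary features
ohe.get_feature_names_out()
# array(['customer_type_government', 'customer_type_retail',
#        'customer_type_wholesale'], dtype=object)

# drop = 'first' would instead produce the statistics-style coding with n-1 columns
ohe_dummy = OneHotEncoder(drop = 'first')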
9.3.1 Be Careful: Numbers Can Encode Categoricals
Sometimes numbers are used to represent categories. For example, in a column for type of customer 1 could represent “retail,” 2 could represent “wholesale,” and 3 could represent “government.” Watch out for such variables and make sure to treat numbers that represent categories as categories rather than as numeric values.
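As a hypothetical illustration (the column name and codes are invented), one simple way to handle this is to convert the numeric codes to a category or string dtype, or to list the column among the categorical columns passed to the encoder, as is done with Pclass later in this chapter:
# Hypothetical data where customer type is coded as a number
demo = pd.DataFrame({'customer_type': [1, 2, 3, 1, 2]})

# Option 1: convert the numeric codes to a categorical dtype
demo['customer_type'] = demo['customer_type'].astype('category')

# Option 2: leave the codes as numbers but pass the column to OneHotEncoder
# along with the other categorical columns (the approach used for Pclass below)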
9.4 Scaling and Encoding Categorical Variables in the Context of Model Building
Below we build a supervised machine learning model to predict whether passengers survived the Titanic disaster. We will use logistic regression as the prediction algorithm. Logistic regression is similar to linear regression, but it is used to predict categories rather than continuous numeric variables. We are building this model so that we can see how scaling and other transformations are used in the context of building a machine learning model. Before training the logistic regression model the features must be transformed. For logistic regression models it is recommended to standard-scale the numeric features and encode the categorical features. The transformer is always fit to the training data only and then used to transform both the training and test data.
Note that the model building process we use below is not the full machine learning model building process. We are using the default parameter settings for logistic regression rather than tuning them, and we are only doing one train-test split. The purpose of the simplified model building process we present below is to show how preprocessing steps such as scaling numeric features and encoding categorical features fit within the common steps of the model building process.
9.4.1 Load the Titanic Data
We will use a subset of the Titanic data, which describes passengers on the Titanic. This will be a useful dataset with which to practice preprocessing because it is easy to understand and it has a mix of numeric and categorical features. Below is a brief description of the variables:
- Survived - the target variable. Has values 0 or 1, with 1 meaning that the passenger survived the disaster
- SibSp - represents the number of siblings and spouses traveling with the passenger
- Parch - represents the number of parents and children traveling with the passenger
- Pclass - represents the class of ticket: first class, second class, or third class
- Embarked - represents the port from which the passenger embarked.
Note: This is real data, but it is only a subset of the actual Titanic dataset.
titanic = pd.read_csv('https://neuronjolt.com/data/titanic.csv')
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 712 non-null int64
1 Pclass 712 non-null int64
2 Sex 712 non-null object
3 Age 712 non-null float64
4 SibSp 712 non-null int64
5 Parch 712 non-null int64
6 Embarked 712 non-null object
dtypes: float64(1), int64(4), object(2)
memory usage: 39.1+ KB
titanic.sample(5)
|     | Survived | Pclass | Sex    | Age  | SibSp | Parch | Embarked |
|-----|----------|--------|--------|------|-------|-------|----------|
| 195 | 0        | 3      | female | 3.0  | 3     | 1     | S        |
| 685 | 1        | 2      | female | 23.0 | 0     | 0     | C        |
| 581 | 0        | 3      | male   | 19.0 | 0     | 0     | S        |
| 266 | 0        | 2      | male   | 32.0 | 0     | 0     | S        |
| 111 | 0        | 3      | male   | 22.0 | 0     | 0     | S        |
9.4.2 Split data into train and test data
We will start our model building by splitting the data into training and test data. Next, before we build the logistic regression model we need to set up transformers to standard-scale the numeric features (Age, SibSp, and Parch) and one-hot encode the categorical features (Pclass, Sex, and Embarked). After setting up those transformers we will fit them to the training data and use them to transform the training and test data. Finally, we will create an instance of a logistic regression model, train it on the transformed training data, and then score it on the transformed test data.
Note that we need to call train_test_split() on a DataFrame containing the features because we need to have column names available so that we can specify transformations by column name. Below we create a DataFrame with the features and a Series with the target before we do the train-test split. By specifying the stratify = target parameter within the train_test_split() function we instruct the split to keep the ratio of target variable values constant in each part of the split. This means that the percentage who survived will be the same in both the training and test data and the percentage who died will be the same in both the training and test data. It is recommended to use stratified splits when building classification models.
# Specify all columns except for Survived as the features
features = titanic.drop(columns = 'Survived')

# Specify the target variable
target = titanic['Survived']

# Split into train and test sets with a stratified split
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    stratify = target)
# Check shape of untransformed features
X_train.shape
(534, 6)
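To confirm that the stratified split kept the proportion of survivors roughly the same in the training and test targets, we could compare the value proportions (a quick check; the exact numbers depend on the split):
# Proportion of each Survived value in the training and test targets
y_train.value_counts(normalize = True)
y_test.value_counts(normalize = True)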
9.4.3 Instantiate a ColumnTransformer object
We can use scikit-learn’s ColumnTransformer class to package together different transformers for different columns. After we define it, the ColumnTransformer object itself becomes a transformer, with the same fit(), transform(), and fit_transform() methods as any other transformer.
We create an instance of the ColumnTransformer class with its ColumnTransformer() constructor. Its first argument is a list of tuples specifying the transformer objects to be applied to subsets of the features. The tuples are of the format (name, transformer, columns). Here we will apply standard scaling to the numeric columns and one-hot encoding to the categorical columns. The handle_unknown = 'ignore' parameter is set so that when an unknown category is encountered during the transform (typically while transforming the test data), the resulting one-hot encoded column values for this feature will be all zeros. If this parameter is not set, an unknown value encountered during the transform would result in an error.
ct = ColumnTransformer([
    ("scaling", StandardScaler(), ['Age', 'SibSp', 'Parch']),
    ("onehot", OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex', 'Embarked'])
])
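As a small standalone illustration of what handle_unknown = 'ignore' does (made-up data, not part of the Titanic workflow), an encoder fit on training data that never saw a category produces all zeros for that category when transforming:
# Fit an encoder on data that contains only the categories 'S' and 'C'
demo_enc = OneHotEncoder(handle_unknown = 'ignore')
demo_enc.fit(pd.DataFrame({'Embarked': ['S', 'C', 'S']}))

# Transforming an unseen category ('Q') yields all zeros for that row
demo_enc.transform(pd.DataFrame({'Embarked': ['Q']})).toarray()
# array([[0., 0.]])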
9.4.4 Transform the training and test data
Now that we have instantiated the ColumnTransformer we can fit it to the training data and then use it to transform both the training data and the test data.
# Fit the column transformer to the training data
ct.fit(X_train)

# Transform the training data
X_train_trans = ct.transform(X_train)

# Transform the test data
X_test_trans = ct.transform(X_test)
# Check shape of transformed features
X_train_trans.shape
(534, 11)
9.4.4.1 Look at the names of the newly-created features
Notice that after transformation there are now 11 features instead of the original 6 features. It is hard to see what happened by looking directly at the transformed features, because they are numpy arrays without any column names. However, we can use the fitted ColumnTransformer object’s get_feature_names_out() method to see the column names for the features after transformation.
ct.get_feature_names_out()
array(['scaling__Age', 'scaling__SibSp', 'scaling__Parch',
'onehot__Pclass_1', 'onehot__Pclass_2', 'onehot__Pclass_3',
'onehot__Sex_female', 'onehot__Sex_male', 'onehot__Embarked_C',
'onehot__Embarked_Q', 'onehot__Embarked_S'], dtype=object)
We can see that the numeric columns have been replaced by scaled versions and the categorical columns have been replaced by multiple columns, one for each unique value of the original categorical variable. For example, Pclass has been replaced by onehot__Pclass_1, onehot__Pclass_2, and onehot__Pclass_3.
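If we want to inspect the transformed training data with these readable column names, one option is to wrap the transformed array in a DataFrame (a quick sketch; the transformed values themselves are unchanged):
# View the transformed training data with the new feature names as column names
X_train_trans_df = pd.DataFrame(X_train_trans, columns = ct.get_feature_names_out())
X_train_trans_df.head()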
9.4.5 Fit a logistic regression model to the transformed training data
The max_iter parameter defines the maximum number of iterations the logistic regression solver can perform as it looks for the best coefficients for the logistic regression equation. Setting this to a higher number can prevent the logistic regression from failing with a “could not converge” error message. A good number to set it to is 2,000.
# Instantiate the logistic regression estimator
logreg = LogisticRegression(max_iter = 2_000)
# Fit the logistic regression estimator to the transformed training data
logreg.fit(X_train_trans, y_train)
LogisticRegression(max_iter=2000)
9.4.6 Score the model on the test data
Score the logistic regression model on the transformed test data to get an estimate of the logistic model’s generalization performance.
logreg.score(X_test_trans, y_test)
0.7640449438202247
The score on the test data is our best estimate of how well the logistic regression model would perform if we applied it to other passengers on the Titanic to predict whether they survived.
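For example, to apply the model to a new passenger we would first transform that passenger’s data with the ColumnTransformer that was fit to the training data and then call the model’s predict() method (the passenger below is made up for illustration):
# A hypothetical new passenger, with the same feature columns as the training data
new_passenger = pd.DataFrame({'Pclass': [2], 'Sex': ['female'], 'Age': [28.0],
                              'SibSp': [0], 'Parch': [1], 'Embarked': ['S']})

# Transform with the fitted ColumnTransformer, then predict with the fitted model
logreg.predict(ct.transform(new_passenger))        # predicted class (0 or 1)
logreg.predict_proba(ct.transform(new_passenger))  # predicted probability of each class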
This example shows how preprocessing transformations such as standard scaling and one-hot encoding fit into the context of machine-learning model building. Since it is common to have to apply different transformations to different columns, the preprocessing transformations are typically packaged into a ColumnTransformer, which can be set up to apply different transformers to different columns.
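A common next step, not covered in this chapter, is to chain the ColumnTransformer and the estimator into a single scikit-learn Pipeline so that fitting the pipeline fits the transformer and the model together and scoring it transforms the test data automatically. A minimal sketch of that idea:
from sklearn.pipeline import Pipeline

# Chain the column transformer and the logistic regression into one estimator
pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scaling", StandardScaler(), ['Age', 'SibSp', 'Parch']),
        ("onehot", OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex', 'Embarked'])
    ])),
    ("model", LogisticRegression(max_iter = 2_000))
])

# fit() fits the transformer and the model to the training data; score()
# transforms the test data with the fitted transformer before scoring
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)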