9  Preprocessing Data

The term preprocessing refers to the transformations we apply to data to prepare it for the specific machine learning algorithms that we will use in our models. The purpose of preprocessing is to improve model performance or interpretability.

Several different tasks are considered part of preprocessing. In this chapter we will learn how to perform two of the most common: scaling of numeric features and encoding of categorical features, using scikit-learn tools.

9.1 Module and Function Imports

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

9.2 Scaling

Scaling refers to changing the scale on which the features are measured. Several different types of scalers are provided by scikit-learn. We will focus on the following two:

  • StandardScaler() - ensures that for each feature in the training data the mean is zero and the variance is 1. In other words, it converts every value of a feature to its z-score. Recall that the formula for a z-score is \(\frac{x_i - \text{mean}(x)}{\text{stdev}(x)}\).
  • MinMaxScaler() - shifts and rescales the data so that each feature in the training data ranges between 0 and 1. The formula for min-max scaling is \(\frac{x_i - \min(x)}{\max(x) - \min(x)}\).

These scalers are types of transformers in scikit-learn, so each has a fit() method, a transform() method, and a fit_transform() method that combines the fit and transform steps. An important principle to remember with scaling or any other transformation is that test data should always be transformed based on the characteristics of the training data. In other words, transformers such as scalers are always fit to the training data and then used to transform both the training and the test data. They are never fit to the test data.
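
For example, here is a minimal sketch of the fit-on-train, transform-both pattern. The feature values are made up purely for illustration:

# Hypothetical numeric feature matrix, for illustration only
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

X_tr, X_te = train_test_split(X, random_state=0)

scaler = StandardScaler()
scaler.fit(X_tr)                      # learn means and variances from the training data only
X_tr_scaled = scaler.transform(X_tr)  # transform the training data
X_te_scaled = scaler.transform(X_te)  # transform the test data with the training statistics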

9.3 Encoding of Categorical Variables

In machine learning the most common way to encode categorical variables is with one-hot encoding, also called one-out-of-N encoding. One-hot encoding replaces a single categorical feature with new binary features, one for each possible value of the categorical feature. For each row exactly one of the newly-created columns will have a 1 in it; the rest will have a 0 for that row. The column with the 1 in it indicates the category for that row.

Note that this is different from how dummy variables are typically coded in statistics, where a categorical feature with \(n\) possible values is encoded into \(n-1\) features, with the reference category represented by zeros on all the new features.

We can use scikit-learn’s OneHotEncoder() class to one-hot-encode our data. As a transformer class, its instances have the same fit(), transform() and fit_transform() methods as other transformers.
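
As a small self-contained illustration (the color values below are made up, and we assume a recent scikit-learn version in which the dense-output parameter of OneHotEncoder() is named sparse_output):

# Toy categorical column, for illustration only
colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

encoder = OneHotEncoder(sparse_output=False)  # return a dense array rather than a sparse matrix
encoded = encoder.fit_transform(colors)

print(encoder.get_feature_names_out())
# ['color_blue' 'color_green' 'color_red']
print(encoded)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]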

9.3.1 Be Careful: Numbers Can Encode Categoricals

Sometimes numbers are used to represent categories. For example, in a column for type of customer 1 could represent “retail,” 2 could represent “wholesale,” and 3 could represent “government.” Watch out for such variables and make sure to treat numbers that represent categories as categories rather than as numeric values.
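
A defensive habit is to convert such columns to strings before encoding, so that no downstream step can mistake the codes for quantities. Here is a minimal sketch using a made-up customer_type column; alternatively, such columns can simply be listed among the categorical columns when setting up the encoding, as we do with Pclass below.

# Made-up example: 1 = retail, 2 = wholesale, 3 = government
customers = pd.DataFrame({'customer_type': [1, 2, 3, 1, 2]})

# Convert the numeric codes to strings so they are treated as categories
customers['customer_type'] = customers['customer_type'].astype(str)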

9.4 Scaling and Encoding Categorical Variables in the Context of Model Building

Below we build a supervised machine learning model to predict whether passengers survived the Titanic disaster. We will use logistic regression as the prediction algorithm. Logistic regression is similar to linear regression, but it is used to predict categories rather than continuous numeric values. We are building this model so that we can see how scaling and other transformations are used in the context of building a machine learning model. Before training the logistic regression model, the features must be transformed. For logistic regression models it is recommended to standard-scale the numeric features and encode the categorical features. The transformer is always fit to the training data only and then used to transform both the training and test data.

Note that the model building process we use below is not the full machine learning model building process. We are using the default parameter settings for logistic regression rather than tuning them, and we are performing only one train-test split. The purpose of the simplified model building process presented below is to show how preprocessing steps such as scaling numeric features and encoding categorical features fit within the common steps of the model building process.

9.4.1 Load the Titanic Data

We will use a subset of the Titanic data, which describes passengers on the Titanic. This will be a useful dataset with which to practice preprocessing because it is easy to understand and it has a mix of numeric and categorical features. Below is a brief description of the variables:

  • Survived - the target variable; has values 0 or 1, with 1 meaning that the passenger survived the disaster
  • Pclass - represents the class of ticket: 1 = first class, 2 = second class, 3 = third class
  • Sex - the sex of the passenger
  • Age - the age of the passenger in years
  • SibSp - represents the number of siblings and spouses traveling with the passenger
  • Parch - represents the number of parents and children traveling with the passenger
  • Embarked - represents the port from which the passenger embarked: C = Cherbourg, Q = Queenstown, S = Southampton

Note: This is real data, but it is only a subset of the actual Titanic dataset.

titanic = pd.read_csv('https://neuronjolt.com/data/titanic.csv')
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  712 non-null    int64  
 1   Pclass    712 non-null    int64  
 2   Sex       712 non-null    object 
 3   Age       712 non-null    float64
 4   SibSp     712 non-null    int64  
 5   Parch     712 non-null    int64  
 6   Embarked  712 non-null    object 
dtypes: float64(1), int64(4), object(2)
memory usage: 39.1+ KB
titanic.sample(5)
     Survived  Pclass     Sex   Age  SibSp  Parch Embarked
195         0       3  female   3.0      3      1        S
685         1       2  female  23.0      0      0        C
581         0       3    male  19.0      0      0        S
266         0       2    male  32.0      0      0        S
111         0       3    male  22.0      0      0        S

9.4.2 Split data into train and test data

We will start our model building by splitting the data into training and test data. Next, before we build the logistic regression model we need to set up transformers to standard-scale the numeric features (Age, SibSp, and Parch) and one-hot encode the categorical features (Pclass, Sex, and Embarked). After setting up those transformers we will fit them to the training data and use them to transform the training and test data. Finally, we will create an instance of a logistic regression model, train it on the transformed training data, and then score it on the transformed test data.

Note that we need to call train_test_split() on a DataFrame containing the features because we need to have column names available so that we can specify transformations by column name. Below we create a DataFrame with the features and a Series with the target before we do the train-test split. By specifying the stratify = target parameter within the train_test_split() function we instruct the split to keep the distribution of target variable values the same in each part of the split. This means that the percentage of passengers who survived (and thus the percentage who died) will be approximately the same in the training and test data; we verify this after the split below. It is recommended to use stratified splits when building classification models.

# specify all columns except for Survived as the features
features = titanic.drop(columns = 'Survived')

# Specify the target variable
target = titanic['Survived']

# split into train and test sets with a stratified split
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    target, 
                                                    stratify = target)
# Check shape of untransformed features
X_train.shape
(534, 6)
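
We can confirm that the split is stratified by comparing the distribution of target values in the training and test targets:

# Compare the proportion of each target value in the two parts of the split
print(y_train.value_counts(normalize = True))
print(y_test.value_counts(normalize = True))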

9.4.3 Instantiate a ColumnTransformer object

We can use scikit-learn’s ColumnTransformer class to package together different transformers for different columns. After we define it, the ColumnTransformer object itself becomes a transformer, with the same fit(), transform(), and fit_transform() methods as any other transformer.

We create an instance of the ColumnTransformer class with its ColumnTransformer() constructor. Its first argument is a list of tuples specifying the transformer objects to be applied to subsets of the features. The tuples have the format (name, transformer, columns). Here we will apply standard scaling to the numeric columns and one-hot encoding to the categorical columns. The handle_unknown = 'ignore' parameter is set so that when an unknown category is encountered during the transform (typically while transforming the test data), the resulting one-hot encoded column values for this feature will be all zeros. If this parameter is not set, an unknown value encountered during the transform results in an error.

ct = ColumnTransformer([
    ("scaling", StandardScaler(), ['Age', 'SibSp', 'Parch']),
    ("onehot", OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex', 'Embarked'])
])
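
To see the effect of handle_unknown = 'ignore' in isolation, here is a small sketch with made-up port values; as before, it assumes a recent scikit-learn version in which the dense-output parameter of OneHotEncoder() is named sparse_output:

# Fit an encoder that has only ever seen the categories 'C' and 'S'
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(pd.DataFrame({'port': ['C', 'S', 'C']}))

# 'Q' was never seen during fit, so its row is encoded as all zeros
print(enc.transform(pd.DataFrame({'port': ['S', 'Q']})))
# [[0. 1.]
#  [0. 0.]]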

9.4.4 Transform the training and test data

Now that we have instantiated the ColumnTransformer we can fit it to the training data and then use it to transform both the training data and the test data.

# Fit the column transformer to the training data
ct.fit(X_train)

# Transform the training data
X_train_trans = ct.transform(X_train)

# Transform the test data
X_test_trans = ct.transform(X_test)
# Check shape of transformed features
X_train_trans.shape
(534, 11)

9.4.4.1 Look at the names of the newly-created features

Notice that after transformation there are now 11 features instead of the original 6 features. It is hard to see what happened by looking directly at the transformed features, because they are numpy arrays without any column names. However, we can use the fitted ColumnTransformer object’s get_feature_names_out() method to see the column names for the features after transformation.

ct.get_feature_names_out()
array(['scaling__Age', 'scaling__SibSp', 'scaling__Parch',
       'onehot__Pclass_1', 'onehot__Pclass_2', 'onehot__Pclass_3',
       'onehot__Sex_female', 'onehot__Sex_male', 'onehot__Embarked_C',
       'onehot__Embarked_Q', 'onehot__Embarked_S'], dtype=object)

We can see that the numeric columns have been replaced by scaled versions and the categorical columns have been replaced by multiple columns, one for each unique value of the original categorical variable. For example, Pclass has been replaced by onehot__Pclass_1, onehot__Pclass_2, and onehot__Pclass_3.
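
If it helps to inspect the transformed training data with labeled columns, the transformed array can be wrapped in a DataFrame using these feature names (this assumes the transformed result is a dense array, as it is here):

# Wrap the transformed training data in a DataFrame with readable column names
X_train_trans_df = pd.DataFrame(X_train_trans,
                                columns = ct.get_feature_names_out(),
                                index = X_train.index)
X_train_trans_df.head()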

9.4.5 Fit a logistic regression model to the transformed training data

The max_iter parameter defines the maximum number of iterations the logistic regression solver may perform as it searches for the best coefficients for the logistic regression equation. Setting it to a higher number can prevent the solver from stopping with a convergence warning before it has found good coefficients. Here we set it to 2,000.

# Instantiate the logistic regression estimator
logreg = LogisticRegression(max_iter = 2_000)

# Fit the logistic regression estimator to the transformed training data
logreg.fit(X_train_trans, y_train)
LogisticRegression(max_iter=2000)

9.4.6 Score the model on the test data

Score the logistic regression model on the transformed test data to get an estimate of the logistic model’s generalization performance.

logreg.score(X_test_trans, y_test)
0.7640449438202247

The score on the test data is our best estimate of how well the logistic regression model would perform if we applied it to other passengers on the Titanic to predict whether they survived.
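
For example, we can call the trained model's predict() method on the transformed test data to generate survival predictions:

# Predicted survival (1 = predicted to survive) for the first five test passengers
logreg.predict(X_test_trans[:5])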

This example shows how preprocessing transformations such as standard scaling and one-hot encoding fit into the context of machine learning model building. Since it is common to apply different transformations to different columns, the preprocessing transformations are typically packaged into a ColumnTransformer, which can be set up to apply different transformers to different columns.