1  Pandas Fundamentals

Pandas is a package that provides data structures and tools for practical, real-world data manipulation and analysis in python. It is designed to work with tabular data, that is, data structured with rows representing entities and columns representing variables related to those entities, like most spreadsheet data. Pandas is the most widely-used data manipulation package for python, and is a fundamental part of the business analyst’s toolset. In the Reddit thread linked below a user asks if pandas is used in industry and whether it is worthwhile to learn pandas. Take a look at the responses the user receives.

Reddit Thread: How much is Pandas used in the Industry? Is it worth learning?

This chapter introduces some fundamental features and tools of pandas. We will learn about the fundamental pandas data structures, the DataFrame and the Series, as well as tools for investigating the data in a DataFrame or Series and tools for filtering or subsetting data based on various criteria. This chapter also covers how to sort data, and introduces special methods for working with string data in columns and dates in columns.

It is important to recognize that the pandas skills and knowledge covered in this textbook are a useful and important subset, but we won’t be covering all of pandas, and we won’t present every feature of the pandas tools we do cover! To learn pandas really well you will need to continue to practice pandas skills frequently, and consult references to broaden your skillset from time to time. As you practice using the tools (pandas objects, properties and methods) presented in this textbook it is recommended that you take a look at their descriptions in the pandas documentation below to get a feel for all that they can do.

The pandas website has many good references to help you learn. You can consult the user guide or the API reference there. There is also a nice one-page pandas cheat sheet posted there.

If you like to learn from books I recommend the following book. A link is provided to the free edition, but it is also available on Amazon.com if you want a hard copy:

If you prefer to learn from videos the following two YouTube video series are highly regarded for learning pandas.

OK. Let’s start by importing the pandas package into our python session. When working with add-on packages in python you first need to install the package if it is not already installed, and then import the package or specific functions from the package into your python session to make the package and its functions available to your code. Pandas is typically imported with the following statement:

import pandas as pd

We will also import the numpy package in the standard manner to have access to numpy functions.

import numpy as np

1.1 Pandas data structures

Pandas has two primary data structures:

  • Series (1-dimensional) - An array of data of a single data type, akin to a column of data in a typical tabular data structure
  • DataFrame (2-dimensional) - Made up of multiple Series. Each Series can be a different data type. The DataFrame holds tabular data made up of rows and columns, like we might see in an Excel table or CSV file. Each of the columns is a Series object.

The DataFrame and Series data structures in pandas are implemented as objects. In the object-oriented programming paradigm objects can have properties, which are descriptive elements, and methods, which are functions that operate on the object or that the object can perform. You can think of properties as information about the object and methods as things the object can do. So, as you learn pandas you will be learning lots of different properties and methods that apply to Series or DataFrames. You will also learn pandas functions that can operate on Series or DataFrames.

A pandas DataFrame can be created in many ways. Below we will see how to create a DataFrame from a python dict object (“dictionary”) and from a CSV file. CSV is a common text representation of tabular data. CSV stands for “comma-separated values”. In a CSV file each line of text represents a row of the table, with commas separating the column values within the row.
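For example, a tiny CSV file with three columns and two rows of data might look like this (a made-up illustration):

name,year,score
Hannah,2,98
Sam,1,67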

1.2 Creating a DataFrame

There are many different ways of creating a pandas DataFrame. We will focus on two commonly-used methods: creating a DataFrame from a python dict (“dictionary”) and creating a DataFrame from a comma-separated-values (CSV) text file.

1.2.1 Creating a DataFrame from a python dict

The code below defines a basic python dict, then creates and displays a pandas DataFrame called student_data from that dict, using the pandas DataFrame() function. Since the DataFrame() function is supplied by pandas and is not a part of base python we need to precede it with pd. That is because when we imported the pandas package we imported it with the statement import pandas as pd. The as pd part establishes pd as an alias for pandas, and in our code we type pd instead of pandas when we want to access pandas functions.

The keys of the python dict become the column names in the new DataFrame, and the values of the python dict become the column values in the new DataFrame.

scores = {
    'name': ["Hannah", "Sam", "Bill", "Erin", "Latasha"], 
    'year': [2, 1, 1, 4, 4],
    'score': [98, 67, 75, 88, 99],
    'gender': ["F", "NB", "M", "F", "F"]
}

student_data = pd.DataFrame(scores)

student_data
name year score gender
0 Hannah 2 98 F
1 Sam 1 67 NB
2 Bill 1 75 M
3 Erin 4 88 F
4 Latasha 4 99 F

Notice that pandas automatically adds a row index to the DataFrame, in this case the integers from 0 to 4. We can see the row index displayed on the left side of the DataFrame. There is also a column index created from the column names. The index property of a DataFrame refers to the row index, and the columns property of a DataFrame refers to the column index.

Note, also, that row indexes and column indexes are special pandas data types. We will work with them more closely in later chapters.

The properties and methods of a DataFrame or Series object can be accessed with dot notation. The identifier (“name”) student_data refers to a specific pandas DataFrame object, the one we created above. Because student_data is an object of type DataFrame it has the properties and methods shared by all objects of the DataFrame type. To access a property or method, use the DataFrame name followed by a dot (.), followed by the name of the property or method. For example, the code below uses the index property of the student_data DataFrame to display information about the row index. We can see from the code output that the index is of type RangeIndex that goes from 0 to 5 by 1. Similar to a python range, the stop value, 5, is not included.

student_data.index
RangeIndex(start=0, stop=5, step=1)

Similarly, we can use the columns property to access information on the DataFrame’s column index.

student_data.columns
Index(['name', 'year', 'score', 'gender'], dtype='object')

We can also specify an index for a DataFrame using the set_index() method. In the code below we specify that the index for the student_data DataFrame should be the name column.

student_data = student_data.set_index('name')

Notice in the code above that we had to assign the result of applying the set_index() method on the student_data DataFrame back to the student_data object for the changes to persist. This is something to watch out for when using pandas: some methods or functions return a copy of the object with the operation applied, while other operations alter the object to which the operation is applied. Many of the functions or methods that return a copy of the object with the operation applied have an inplace parameter that is by default set to False. If you set this parameter to True the operation will alter the object in place rather than returning a copy of the object with the operation applied.

Let’s re-create the student_data DataFrame and investigate this concept. When we re-create the student_data DataFrame from scratch with the code below the DataFrame is created using the default index of consecutive integers.

student_data = pd.DataFrame(scores)

student_data
name year score gender
0 Hannah 2 98 F
1 Sam 1 67 NB
2 Bill 1 75 M
3 Erin 4 88 F
4 Latasha 4 99 F

The code below uses the set_index() method to change the student_data DataFrame’s index to the name column. The set_index() method, though, is one of those methods that by default only returns a copy of the object with the method applied, rather than altering the object in place. So, here it creates and returns a copy of the student_data DataFrame that has the name column as the index, but that altered copy of the DataFrame is not assigned to anything, so when we display the student_data DataFrame we see that it is unchanged; it still has the integer index.

# This will not actually change the student_data DataFrame's index!
student_data.set_index('name')
student_data
name year score gender
0 Hannah 2 98 F
1 Sam 1 67 NB
2 Bill 1 75 M
3 Erin 4 88 F
4 Latasha 4 99 F

To make the index change persistent we need to tell the set_index() method to actually alter the DataFrame in place by setting the inplace parameter to True, as in the following code. Look at the left side of the DataFrame in the output and you will see that the index for the DataFrame is now the data that was in the name column. Note that when a column becomes the index, like this, it is no longer a column and can no longer be accessed in the same way we normally access a DataFrame column.

# This WILL change the student_data DataFrame's index.
student_data.set_index('name', inplace=True)
student_data
year score gender
name
Hannah 2 98 F
Sam 1 67 NB
Bill 1 75 M
Erin 4 88 F
Latasha 4 99 F

1.2.2 Creating a DataFrame from a CSV file

Creating a DataFrame from a CSV file is very useful, because organizations typically have a lot of data in spreadsheets, and spreadsheet applications like Microsoft Excel and LibreOffice Calc can easily save their data as a CSV file. Organizations that share free data also often use the CSV format to disseminate the data.

The code below reads a CSV file from a server on the Internet and then creates a new DataFrame object called boxscores from the CSV file. This data is boxscore data from the 2023-24 NBA season. Each row of the data represents one player’s performance in one game.

# Here we read a csv file from the web
url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)

Note that if the file is read successfully there is no feedback provided by pandas.

After reading the file into the boxscores DataFrame we should investigate the data. The following section introduces many ways to get more familiar with the data in a DataFrame. These are important because, unlike in GUI applications such as Microsoft Excel, the data is not automatically displayed for us to see.

1.3 Investigating a DataFrame

1.3.1 The info() method

The info() method shows summary information about the DataFrame. From the output of the info() method we can see how many columns and rows the DataFrame has, the name and data type of each of the columns, the number of non-missing entries in each column, and the amount of memory used to store the DataFrame.

boxscores.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   date      26498 non-null  int64 
 1   team      26498 non-null  object
 2   opponent  26498 non-null  object
 3   player    26498 non-null  object
 4   position  26498 non-null  object
 5   min       26498 non-null  int64 
 6   oreb      26498 non-null  int64 
 7   dreb      26498 non-null  int64 
 8   reb       26498 non-null  int64 
 9   ast       26498 non-null  int64 
 10  stl       26498 non-null  int64 
 11  blk       26498 non-null  int64 
 12  to        26498 non-null  int64 
 13  pf        26498 non-null  int64 
 14  plus_m    26498 non-null  int64 
 15  pts       26498 non-null  int64 
 16  fg_m      26498 non-null  int64 
 17  fg_a      26498 non-null  int64 
 18  tres_m    26498 non-null  int64 
 19  tres_a    26498 non-null  int64 
 20  ft_m      26498 non-null  int64 
 21  ft_a      26498 non-null  int64 
dtypes: int64(18), object(4)
memory usage: 4.4+ MB

1.3.2 The head(n) and tail(n) methods

You can use these methods to view the first n rows or the last n rows of the DataFrame, respectively. n defaults to 5.

boxscores.head()
date team opponent player position min oreb dreb reb ast ... to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
0 20231024 LAL DEN A. Davis PF 34 1 7 8 4 ... 2 3 -17 17 6 17 1 2 4 4
1 20231024 LAL DEN T. Prince PF 30 1 2 3 1 ... 1 0 -14 18 6 8 4 6 2 2
2 20231024 LAL DEN L. James SF 29 1 7 8 5 ... 0 1 7 21 10 16 1 4 0 1
3 20231024 LAL DEN D. Russell PG 36 0 4 4 7 ... 3 3 1 11 4 12 2 5 1 2
4 20231024 LAL DEN A. Reaves SG 31 4 4 8 4 ... 2 2 -14 14 4 11 1 2 5 7

5 rows × 22 columns

boxscores.tail(2)
date team opponent player position min oreb dreb reb ast ... to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
26496 20240414 SAC POR M. Jones G 6 0 0 0 0 ... 1 2 -8 0 0 1 0 1 0 0
26497 20240414 SAC POR C. Jones G 4 0 1 1 1 ... 0 0 1 8 3 3 0 0 2 2

2 rows × 22 columns

1.3.3 The sample(n) method

The sample() method returns n randomly-selected rows of the DataFrame. This often gives a better feel for the data than head() or tail(), especially if the DataFrame has any inherent order to it.

boxscores.sample(3)
date team opponent player position min oreb dreb reb ast ... to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
22653 20240323 PHX SA B. Beal SG 28 0 1 1 12 ... 1 1 15 13 6 10 1 3 0 0
8764 20231221 MEM IND Z. Williams F 18 0 1 1 2 ... 2 2 6 16 6 9 3 4 1 2
26115 20240412 UTAH LAC D. Bazley F 23 2 3 5 1 ... 2 1 -1 12 5 7 1 2 1 1

3 rows × 22 columns

1.4 Setting options for the pandas environment

Notice that when we displayed subsets of the DataFrame with the head() and sample() methods there were too many columns to display under the default pandas settings. Some of the columns are left out of the display, and we see ellipses (...) in the center of the DataFrame output instead. We can use the pandas get_option() and set_option() functions to set various environment options. There is also a reset_option() function to reset an option or options to the default setting, and a describe_option() function that can be used to show more information about all available options or about a specific option.

To view all the available options you can use the describe_option() function with nothing specified in the parentheses. The code to run to see all the available options is shown below, but we will not run it here because it creates a long list of output!

pd.describe_option()
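You can also pass describe_option() the name of a single option (or a substring that matches one or more option names) to see just its description and current value:

pd.describe_option('display.max_columns')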

The option we want to set to control how many columns are displayed is display.max_columns. Let’s see the maximum number of columns being displayed by default and change the settings so that we can see all of the columns.

pd.get_option('display.max_columns')
20

We know from our investigation of the boxscores DataFrame above that it has 22 columns. We may also add some new columns later on. So, let’s set the pandas display.max_columns setting to 30 columns.

pd.set_option('display.max_columns', 30)

After changing that setting, when we display selected rows of the DataFrame we will now be able to see all the columns, because we have 22 columns of data, and the display.max_columns setting is now set to 30.

boxscores.sample(3)
date team opponent player position min oreb dreb reb ast stl blk to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
22320 20240321 NY DEN I. Hartenstein C 26 3 5 8 3 1 2 3 6 -15 20 8 10 0 0 4 5
25152 20240407 SA PHI V. Wembanyama C 43 3 15 18 6 0 7 9 1 -1 33 11 20 5 10 6 7
19765 20240305 MIA DET N. Jovic F 21 0 4 4 2 2 0 0 2 8 8 3 5 1 2 1 2

We can also set other options to control the number of rows displayed, and number of digits to show after the decimal point.

pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 2)
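If you later want to return an option to its default, reset_option() does so. We show it below without running it, so that the settings above stay in effect:

pd.reset_option('display.max_rows')  # restore the default row limit
pd.reset_option('all')               # or reset every option at once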

1.5 Other properties and methods for investigating data

Other useful properties and methods for investigating your data include shape, columns, describe(), value_counts(), unique(), and nunique(). Some of these methods and properties can be applied to both DataFrame and Series objects, and others apply only to one or the other.

1.5.1 The shape property

The shape property returns a tuple that indicates the number of rows and number of columns. This can be useful if you need to programmatically extract the number of rows and columns.

Before going into more details, let’s read in some new data from a CSV file into a new DataFrame named store.

store = pd.read_csv("https://neuronjolt.com/data/superstore.csv")

Now, let’s use the shape property of the store DataFrame to return a tuple that indicates the number of rows and columns in the store DataFrame.

store.shape
(9994, 20)

Because we know the format of the tuple returned by the shape property we can extract parts of it with our code. The code below uses indexing to extract the first element of the tuple (the element at index 0, which is the number of rows in the store DataFrame), store it in a variable named num_rows, and include that variable in a string that is printed to output.

num_rows = store.shape[0]
print(f"The store DataFrame has {num_rows} rows.")
The store DataFrame has 9994 rows.

1.5.2 Accessing individual columns with []

Recall that the columns property returns the Index object with the column names.

boxscores.columns
Index(['date', 'team', 'opponent', 'player', 'position', 'min', 'oreb', 'dreb',
       'reb', 'ast', 'stl', 'blk', 'to', 'pf', 'plus_m', 'pts', 'fg_m', 'fg_a',
       'tres_m', 'tres_a', 'ft_m', 'ft_a'],
      dtype='object')

Individual columns or lists of columns can be accessed through square brackets, similar to values in a python dict. Notice that when one column is indicated in the square brackets a pandas Series is returned. When a list of columns is indicated in the square brackets, even if there is only one column in the list, a pandas DataFrame is returned. This is a common pattern in pandas.

In the code cell below one column is indicated, so a Series is returned.

boxscores['player'].head()
0      A. Davis
1     T. Prince
2      L. James
3    D. Russell
4     A. Reaves
Name: player, dtype: object

In the code cell below we use the type() function to verify that a Series is returned. The type() function returns the type or class of its argument.

type(boxscores['player'].head())
pandas.core.series.Series

In the next two examples a list of columns is indicated in the square brackets, so a DataFrame is returned.

boxscores[['player']].head()
player
0 A. Davis
1 T. Prince
2 L. James
3 D. Russell
4 A. Reaves
type(boxscores[['player']].head())
pandas.core.frame.DataFrame
boxscores[['player', 'position']].head()
player position
0 A. Davis PF
1 T. Prince PF
2 L. James SF
3 D. Russell PG
4 A. Reaves SG
type(boxscores[['player', 'position']].head())
pandas.core.frame.DataFrame

1.5.3 The describe() method

The describe() method gives summary information about each column (Series) in the DataFrame, such as the number of non-null values, the five-number summary, the mean, and the standard deviation.

boxscores.describe()
date min oreb dreb reb ast stl blk to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
count 2.65e+04 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00 26498.00
mean 2.02e+07 22.41 0.98 3.06 4.04 2.48 0.69 0.48 1.20 1.74 0.00 10.60 3.91 8.25 1.19 3.26 1.58 2.02
std 4.44e+03 11.17 1.35 2.72 3.47 2.66 0.94 0.86 1.37 1.45 11.57 8.99 3.35 6.19 1.49 3.06 2.30 2.75
min 2.02e+07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -58.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 2.02e+07 14.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 1.00 -7.00 3.00 1.00 3.00 0.00 1.00 0.00 0.00
50% 2.02e+07 23.00 1.00 2.00 3.00 2.00 0.00 0.00 1.00 2.00 0.00 9.00 3.00 7.00 1.00 3.00 1.00 1.00
75% 2.02e+07 32.00 1.00 4.00 6.00 4.00 1.00 1.00 2.00 3.00 7.00 16.00 6.00 12.00 2.00 5.00 2.00 3.00
max 2.02e+07 54.00 15.00 20.00 31.00 23.00 7.00 10.00 11.00 6.00 46.00 73.00 25.00 47.00 12.00 23.00 24.00 32.00


By default the describe() method will only show the summary for numeric variables. However, if the include="all" parameter is set it will also show a summary for other types of variables.

store.describe(include="all")
Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City State Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
count 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994.00 9994 9994 9994 9994 9994 9994.00 9994.00 9994.00 9994.00
unique 5009 1237 1334 4 793 793 3 1 531 49 NaN 4 1862 3 17 1850 NaN NaN NaN NaN
top CA-2017-100111 9/5/2016 12/16/2015 Standard Class WB-21850 William Brown Consumer United States New York City California NaN West OFF-PA-10001970 Office Supplies Binders Staple envelope NaN NaN NaN NaN
freq 14 38 35 5968 37 37 5191 9994 915 2001 NaN 3203 19 6026 1523 48 NaN NaN NaN NaN
mean NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 55190.38 NaN NaN NaN NaN NaN 229.86 3.79 0.16 28.66
std NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32063.69 NaN NaN NaN NaN NaN 623.25 2.23 0.21 234.26
min NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1040.00 NaN NaN NaN NaN NaN 0.44 1.00 0.00 -6599.98
25% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 23223.00 NaN NaN NaN NaN NaN 17.28 2.00 0.00 1.73
50% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 56430.50 NaN NaN NaN NaN NaN 54.49 3.00 0.20 8.67
75% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 90008.00 NaN NaN NaN NaN NaN 209.94 5.00 0.20 29.36
max NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 99301.00 NaN NaN NaN NaN NaN 22638.48 14.00 0.80 8399.98


The describe() method can also be used on a Series object.

boxscores['pts'].describe()
count    26498.00
mean        10.60
std          8.99
min          0.00
25%          3.00
50%          9.00
75%         16.00
max         73.00
Name: pts, dtype: float64

1.5.4 The value_counts() method

The value_counts() method returns the counts of all the unique values, sorted from most frequent to least frequent, as a pandas Series object.

store['Ship Mode'].value_counts()
Ship Mode
Standard Class    5968
Second Class      1945
First Class       1538
Same Day           543
Name: count, dtype: int64

The value_counts() method can even be used for numeric variables. The code below uses value_counts() to count how many player-games are in the boxscores data for each unique number of rebounds. The code uses something called method chaining to add a head() method to limit how many rows are displayed. Method chaining uses the dot (.) operator. The operations execute from left to right, and the object returned by each operation becomes the object on which the next method is called.

So, in the code below we start with the reb column in the boxscores DataFrame. We apply value_counts() to it to return a Series of counts for each number of rebounds. Then the head() method is applied to that Series of counts, so that only the first 5 elements in the Series are displayed. Chaining things together like this is very common when working with pandas. To understand and use chained methods you need to get into the habit of thinking about each operation and what it returns, then thinking about what other operations can be applied to the object returned by that operation.

boxscores['reb'].value_counts().head()
reb
2    3674
1    3594
3    3425
0    3258
4    2980
Name: count, dtype: int64

1.5.5 The unique() method

The unique() method returns the unique values of a Series. The code below returns all the unique values found in the Region column of the store DataFrame, as a numpy array.

store['Region'].unique()
array(['South', 'West', 'Central', 'East'], dtype=object)

1.5.6 The nunique() method

The nunique() method returns the number of unique values in a Series. From the code below we can see that there are 4 unique values in the Region column of the store DataFrame.

store['Region'].nunique()
4

1.6 Methods to calculate summary statistics

Take a look at the descriptive statistics section of the pandas user guide (https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics) to see some examples and a deeper explanation. These methods can be used on DataFrames or Series objects.

Some commonly-used methods include the following:

  • count() (counts the number of non-missing observations)
  • sum()
  • mean()
  • median()
  • min()
  • max()
  • std()

Note that these methods can take an axis parameter to indicate which axis they work across. If axis = 0 it means they work across the rows and thus apply to the columns. If axis = 1 it means they work across the columns and thus apply to the rows.

Let’s create a simple DataFrame of quiz scores and take a look. Note that we are creating the quiz_df DataFrame from a python dict with the pandas DataFrame() function, as before, but we are using method chaining to direct the output of the DataFrame() function directly to the set_index() method, with which we specify that the name column should be used as the DataFrame’s row index.

quizzes = {
    'name': ["Hannah", "Sam", "Anjali", "Erin", "Latasha"], 
    'quiz_1': [89, 74, 79, 90, 95],
    'quiz_2': [73, 75, 78, 88, 86],
    'quiz_3': [92, 88, 85, 95, 100],
    'quiz_4': [100, 90, 88, 92, 98]
}

quiz_df = pd.DataFrame(quizzes).set_index("name")

quiz_df
quiz_1 quiz_2 quiz_3 quiz_4
name
Hannah 89 73 92 100
Sam 74 75 88 90
Anjali 79 78 85 88
Erin 90 88 95 92
Latasha 95 86 100 98

Calculate the standard deviation of the scores for quiz 2.

quiz_df['quiz_2'].std()
6.670832032063167

To calculate the averages per student we need to calculate the means for the rows. The rows go across the columns, which are axis 1, so we use the axis = 1 parameter to specify that we want the means calculated across the columns.

quiz_df.mean(axis = 1)
name
Hannah     88.50
Sam        81.75
Anjali     82.50
Erin       91.25
Latasha    94.75
dtype: float64

To calculate the averages per quiz we need to calculate the means for the columns, which go across the rows, which are axis 0. So, we specify axis = 0 to calculate the means across the rows.

quiz_df.mean(axis = 0)
quiz_1    85.4
quiz_2    80.0
quiz_3    92.0
quiz_4    93.6
dtype: float64

1.7 Sorting with the sort_values() method

This method can apply to a Series or a DataFrame. If applied to a DataFrame the by parameter indicates what to sort on (typically a column name or a list of column names). Both take an ascending parameter that defaults to True. To sort in descending order you need to set ascending = False.

Let’s take another look at the quiz_df DataFrame, but sort it by the quiz_1 score, in descending order.

quiz_df.sort_values(by = 'quiz_1', ascending = False)
quiz_1 quiz_2 quiz_3 quiz_4
name
Latasha 95 86 100 98
Erin 90 88 95 92
Hannah 89 73 92 100
Anjali 79 78 85 88
Sam 74 75 88 90

Note that sort_values() is another one of those DataFrame methods that returns an altered copy of the DataFrame instead of altering it in place. So, the code above showed us what quiz_df looks like if sorted in descending order of the score on the first quiz, but didn’t actually change the quiz_df DataFrame. To actually change the quiz_df DataFrame we would need to use the inplace = True parameter in the sort_values() method or assign the result of the sort back to quiz_df.

When we display quiz_df below we can see that it is not sorted.

quiz_df
quiz_1 quiz_2 quiz_3 quiz_4
name
Hannah 89 73 92 100
Sam 74 75 88 90
Anjali 79 78 85 88
Erin 90 88 95 92
Latasha 95 86 100 98

Let’s do the sort again, this time setting the inplace = True parameter. This time the quiz_df DataFrame is actually changed.

quiz_df.sort_values(by = 'quiz_1', ascending = False, inplace = True)

quiz_df
quiz_1 quiz_2 quiz_3 quiz_4
name
Latasha 95 86 100 98
Erin 90 88 95 92
Hannah 89 73 92 100
Anjali 79 78 85 88
Sam 74 75 88 90
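sort_values() can also be applied directly to a Series, in which case no by parameter is needed. For example, this sketch sorts just the quiz 2 scores in descending order:

quiz_df['quiz_2'].sort_values(ascending = False)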

Below we sort the boxscores DataFrame by position (ascending) and pts (descending), display only the player, position, and pts columns, and show the first 10 rows. We use a list to indicate the sort columns. Note that we use a list for the ascending parameter as well, to indicate the sort order for each of the columns. Since position lines up with True, position is sorted in ascending order, and ‘C’ comes first. pts, however, is sorted in descending order within each position because pts is the second column in the sort list and False is the second Boolean in the list for the ascending parameter.

(boxscores
 .sort_values(by = ['position', 'pts'], ascending = [True, False])
 .loc[:, ['player', 'position', 'pts']]
 .head(10)
)
player position pts
13726 J. Embiid C 70
13768 K. Towns C 62
8542 J. Embiid C 51
6396 J. Embiid C 50
2072 J. Embiid C 48
19829 A. Sengun C 45
7832 J. Embiid C 42
13588 N. Jokic C 42
24234 N. Jokic C 42
7277 J. Embiid C 41

For most of the examples in this textbook we will use a method-chaining approach, as in the code example above. As a reminder, though, sometimes you may want to break the operations into steps, particularly when you want to reuse one of the intermediate results. The code below shows what the code would look like if we broke the example above into two steps, naming the intermediate result sorted_pos_pts.

sorted_pos_pts = boxscores.sort_values(by = ['position', 'pts'], 
                                       ascending = [True, False])
sorted_pos_pts[['player', 'position', 'pts']].head(10)
player position pts
13726 J. Embiid C 70
13768 K. Towns C 62
8542 J. Embiid C 51
6396 J. Embiid C 50
2072 J. Embiid C 48
19829 A. Sengun C 45
7832 J. Embiid C 42
13588 N. Jokic C 42
24234 N. Jokic C 42
7277 J. Embiid C 41

1.8 Removing duplicate rows from a DataFrame

The drop_duplicates() method by default returns a copy of a DataFrame with the duplicate rows removed. To remove duplicates in place set the inplace = True parameter. You can use the subset parameter to indicate the column(s) to use to determine whether a row is a duplicate or not. By default all columns are included.

The code cell below creates a new version of the quiz_df DataFrame that has duplicate rows.

quiz_df_with_dupes = quiz_df.sample(10, replace = True)

quiz_df_with_dupes
quiz_1 quiz_2 quiz_3 quiz_4
name
Sam 74 75 88 90
Latasha 95 86 100 98
Anjali 79 78 85 88
Sam 74 75 88 90
Anjali 79 78 85 88
Anjali 79 78 85 88
Hannah 89 73 92 100
Erin 90 88 95 92
Latasha 95 86 100 98
Erin 90 88 95 92

Let’s now use drop_duplicates() to see what the quiz_df_with_dupes DataFrame looks like with duplicate rows removed.

quiz_df_with_dupes.drop_duplicates()
quiz_1 quiz_2 quiz_3 quiz_4
name
Sam 74 75 88 90
Latasha 95 86 100 98
Anjali 79 78 85 88
Hannah 89 73 92 100
Erin 90 88 95 92

Let’s take a look at another example. In the boxscores DataFrame each row represents a player-game, so we have multiple rows for each player. What if we wanted to see all the unique player-team combinations for the season? If we just select the player and team columns the result has lots of duplicates.

(boxscores[['player', 'team']]
 .sort_values(by = ['player', 'team'])
 .head(12)
)
player team
2888 A. Bailey CHA
7856 A. Bailey CHA
16641 A. Bailey CHA
21311 A. Bailey CHA
21927 A. Bailey CHA
22577 A. Bailey CHA
22839 A. Bailey CHA
23483 A. Bailey CHA
25894 A. Bailey CHA
26178 A. Bailey CHA
136 A. Black ORL
1177 A. Black ORL

We can use drop_duplicates() to reduce it to one row per unique player-team combination.

(boxscores[['player', 'team']]
 .drop_duplicates()
 .sort_values(by = ['player', 'team'])
 .head(12)
)
player team
2888 A. Bailey CHA
136 A. Black ORL
1294 A. Brooks BKN
165 A. Burks DET
16867 A. Burks NY
219 A. Caruso CHI
312 A. Coffey LAC
0 A. Davis LAL
220 A. Dosunmu CHI
217 A. Drummond CHI
181 A. Edwards MIN
18060 A. Flagler OKC
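As a sketch of the subset parameter mentioned above, the code below treats two rows as duplicates whenever they share the same player value, so only each player's first row is kept, even for players who appeared with more than one team:

# Columns outside the subset are ignored when checking for duplicates
(boxscores[['player', 'team']]
 .drop_duplicates(subset = ['player'])
 .sort_values(by = ['player', 'team'])
 .head(12)
)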

1.9 Selecting a subset of a pandas DataFrame

Selecting a subset of data according to some criteria is a very common data operation. For example, a business analyst might want to see the sales for one family of products in one region of the country. A sports analyst might want to look at stats for a particular player or set of players. This section demonstrates several different ways of extracting subsets from pandas DataFrames. The first two methods we cover, selecting by label with loc[] and selecting by position with iloc[], are fundamental, so we need to cover them, but they are neither as interesting nor as frequently used as the third method we will present - using a Boolean array as a selector. So, as you read through the first two subsections below, know that there is more interesting material ahead!

loc[] can also use a callable, such as a function, as a selector, but that is a more advanced usage, and it is only helpful in specific circumstances. We haven’t yet covered the code skills we need to create those circumstances, so we will hold off on covering the use of loc[] with a callable until the next chapter.

1.9.1 Selecting by label with loc[]

loc[] is a method used to access a selection from a DataFrame by index (row label) and column name (column label, optional). It may also be used with Boolean arrays or callables (functions) as the selectors, as we will see below. loc[] may be used to access values in a DataFrame or to set values in a DataFrame.

In the code below we select the first four rows (rows with index 0 to 3). Note that, unlike base python slices, when slicing by label the last label is included in the slice.

# Note that both endpoints are included if you select a slice of index values
boxscores.loc[0:3]
date team opponent player position min oreb dreb reb ast stl blk to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
0 20231024 LAL DEN A. Davis PF 34 1 7 8 4 0 2 2 3 -17 17 6 17 1 2 4 4
1 20231024 LAL DEN T. Prince PF 30 1 2 3 1 0 1 1 0 -14 18 6 8 4 6 2 2
2 20231024 LAL DEN L. James SF 29 1 7 8 5 1 0 0 1 7 21 10 16 1 4 0 1
3 20231024 LAL DEN D. Russell PG 36 0 4 4 7 1 0 3 3 1 11 4 12 2 5 1 2

In this next example we use loc[] to select according to a slice of rows and two specific columns, indicated in the list ['player', 'position']. The rows are indicated first, then there is a comma, and then the columns are indicated. If there is no comma included loc[] accesses the specified rows for all the columns.

boxscores.loc[0:3, ['player', 'position']]
player position
0 A. Davis PF
1 T. Prince PF
2 L. James SF
3 D. Russell PG

In this example we access the first and fourth row, specifying them by index label with the list [0, 3].

boxscores.loc[[0, 3]]
date team opponent player position min oreb dreb reb ast stl blk to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
0 20231024 LAL DEN A. Davis PF 34 1 7 8 4 0 2 2 3 -17 17 6 17 1 2 4 4
3 20231024 LAL DEN D. Russell PG 36 0 4 4 7 1 0 3 3 1 11 4 12 2 5 1 2

In the next example we use lists to specify both the row and the column labels for the selection.

boxscores.loc[[0, 3], ['player', 'pts']]
player pts
0 A. Davis 17
3 D. Russell 11

In this final example we use loc[] to access two specific rows, and a slice of columns.

boxscores.loc[[0, 3], 'player':'reb']
player position min oreb dreb reb
0 A. Davis PF 34 1 7 8
3 D. Russell PG 36 0 4 4
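We noted above that loc[] can also be used to set values. As a small sketch (with a purely hypothetical score adjustment), the code below uses a row label and a column label to assign a single value in the quiz_df DataFrame, which has name as its index:

# Hypothetical adjustment: overwrite Sam's quiz 1 score
quiz_df.loc['Sam', 'quiz_1'] = 79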

1.9.2 Selecting by position with iloc[]

iloc[] selects by row and column position (0-based). It isn’t very useful with the box score data, but could be useful with other types of data. Recall that when using loc[] slices include both ends. With iloc[] slices behave more like base python slices; the endpoints are not included.

In the first code example we select the second and third rows (indexes 1 and 2) and the third and fourth columns (indexes 2 and 3).

boxscores.iloc[1:3, 2:4]
opponent player
1 DEN T. Prince
2 DEN L. James

In the next example we use a list to select the row indexes and another list to select the columns.

boxscores.iloc[[0,3,49], [1,2,4]]
team opponent position
0 LAL DEN PF
3 LAL DEN PG
49 ATL CHA PF

In the example below we select specific rows by their indexes and use a colon (:) in the second position to indicate that all columns are included. A colon in the first position would similarly indicate that all rows are included.

boxscores.iloc[[1,5,7], :]
date team opponent player position min oreb dreb reb ast stl blk to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
1 20231024 LAL DEN T. Prince PF 30 1 2 3 1 0 1 1 0 -14 18 6 8 4 6 2 2
5 20231024 LAL DEN R. Hachimura PF 15 2 1 3 0 0 0 0 2 -8 6 3 10 0 3 0 0
7 20231024 LAL DEN M. Lewis F 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1.9.3 Boolean Arrays as selectors with loc[]

This method of selecting a subset from a DataFrame is very powerful and frequently used. We can use a Boolean array to select rows and columns. A Boolean array is an array of True or False values. Where the array is True the row or column is included and where the array is False the row or column is not included. For example, let’s view all boxscore lines for player-games in which the player scored 60 or more points. First, take a look at the Boolean array we will use. This Boolean array is created by the inequality boxscores['pts'] >= 60. It will have as many True and False values in total as there are values in the pts column of the boxscores DataFrame. For every row where pts is greater than or equal to 60 the Boolean array will have a True value. For every other row it will have a False value.

boxscores['pts'] >= 60
0        False
1        False
2        False
3        False
4        False
         ...  
26493    False
26494    False
26495    False
26496    False
26497    False
Name: pts, Length: 26498, dtype: bool

Next, we can use that Boolean array within loc[] to display only the rows of the boxscores DataFrame for which pts is >= 60. Only the rows that are in the same positions (same row index) as the True values in the Boolean array will be included in the subset. In addition to using the Boolean array to select the rows, we use a list to select a subset of the columns to be included. Finally, we use method chaining to tack on a sort at the end that will sort the result in descending order of points.

boxscores.loc[boxscores['pts'] >= 60, ['date', 'team', 'opponent', 'player', 'pts']].sort_values(by = 'pts', ascending = False)
date team opponent player pts
14280 20240126 DAL ATL L. Doncic 73
13726 20240122 PHI SA J. Embiid 70
7396 20231213 MIL IND G. Antetokounmpo 64
13768 20240122 MIN CHA K. Towns 62
14327 20240126 PHX IND D. Booker 62
23635 20240329 NY SA J. Brunson 61
15660 20240203 GS ATL Steph Curry 60

You might have noticed that the code example above has a long line of code. For readability it is common practice in programming to try to keep code lines short. Ideally they should be 80 characters or less. However, recall that python is very picky about whitespace; line endings and indentation have meaning to the python interpreter. So, when building pandas code that includes a lot of method chaining one common practice to keep the lines shorter is to enclose the code within parentheses, since python ignores the whitespace for code within parentheses, brackets or braces. The code example below is the same code as the example above, but it is enclosed in parentheses to permit shorter lines, which results in more readable code.

(boxscores
 .loc[boxscores['pts'] >= 60, ['date', 'team', 'opponent', 'player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
)
date team opponent player pts
14280 20240126 DAL ATL L. Doncic 73
13726 20240122 PHI SA J. Embiid 70
7396 20231213 MIL IND G. Antetokounmpo 64
13768 20240122 MIN CHA K. Towns 62
14327 20240126 PHX IND D. Booker 62
23635 20240329 NY SA J. Brunson 61
15660 20240203 GS ATL Steph Curry 60

A Note on Style: An alternative to method chaining is to create intermediate objects and perform operations on them step by step. For example, we could accomplish the same thing as we did in the code above with the following code. Which approach you prefer is a matter of style, and each has its advantages and disadvantages. I prefer the method-chaining approach because it is clear and concise, and because it doesn’t create names for the objects created in the intermediate steps. For example, we can see that in the code below we had to create a new name, gt_60, to refer to the intermediate DataFrame. If you need to use that intermediate DataFrame again it is great to give it a name like this, but if you don’t have a need to refer to it again, naming it creates something unneeded in your code, and introduces more opportunity for error. How does it create opportunity for error? As you may already have experienced, writing code is typically an iterative process. You write some code, try it out, fix errors, revise it, try it again, etc. If you have lots of intermediate names created in your code it is very easy to write code that uses the wrong intermediate names, or that uses names created in a step that has been removed in one of your revisions. If you refer to a name that was created in a since-removed step your code will run just fine in the current session, but if you close your session and re-run the code there will be an error!

These observations are not just theory, by the way. I’ve spent thousands of hours helping students with code and grading student code, and I have seen many, many errors caused by losing track of intermediate names.

gt_60 = boxscores.loc[boxscores['pts'] >= 60, 
                      ['date', 'team', 'opponent', 'player', 'pts']]
gt_60.sort_values(by = 'pts', ascending = False)
date team opponent player pts
14280 20240126 DAL ATL L. Doncic 73
13726 20240122 PHI SA J. Embiid 70
7396 20231213 MIL IND G. Antetokounmpo 64
13768 20240122 MIN CHA K. Towns 62
14327 20240126 PHX IND D. Booker 62
23635 20240329 NY SA J. Brunson 61
15660 20240203 GS ATL Steph Curry 60

When creating the Boolean array we can combine various logical tests with logical and (&) and logical or (|). When we do so each criterion has to be enclosed in parentheses. Let’s take a look at statlines for players who scored at least 40 points AND had at least 15 rebounds. Since there are two criteria that both need to be true, we enclose each in parentheses and connect them with the & character.

boxscores.loc[(boxscores['pts'] >= 40) & (boxscores['reb'] >= 15), 
              ['date', 'player', 'pts', 'reb']]
date player pts reb
3932 20231118 G. Antetokounmpo 40 15
7832 20231216 J. Embiid 42 15
11342 20240106 G. Antetokounmpo 48 17
13726 20240122 J. Embiid 70 18
17488 20240214 D. Avdija 43 15
19086 20240229 A. Davis 40 15
19253 20240301 G. Antetokounmpo 46 16
19829 20240305 A. Sengun 45 16
23645 20240329 V. Wembanyama 40 20
24234 20240402 N. Jokic 42 16
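Logical or works the same way, using the | character between the parenthesized criteria. As a sketch, the code below selects statlines in which the player scored at least 60 points OR pulled down at least 20 rebounds:

boxscores.loc[(boxscores['pts'] >= 60) | (boxscores['reb'] >= 20), 
              ['date', 'player', 'pts', 'reb']]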

1.9.4 Use of the query() method to subset DataFrames

In addition to using Boolean arrays with loc[] we can also subset DataFrames with the query() method and strings, similar to SQL, that specify the conditions we want to apply. The strings can refer to either column names or index names. This can lead to code that is more compact and, arguably, easier to read and understand. Notice that with the query() method we can use either & or the word and for logical and, and either | or the word or for logical or.

Let’s make a sample DataFrame named scores_df to use to investigate the query() method. Note that this sample DataFrame has some missing values. We will learn how to deal with those later in the chapter. The scores_df DataFrame shows quiz scores for students in multiple sections of a class.

scores_df = pd.DataFrame(
    {
        'section': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A'],
        'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'q1': [99, 87, 55, 100, 75, None, 88, 90, 92, np.nan],
        'q2': [95, 90, 78, 94, 85, 90, 90, 89, 99, 100],
        'q3': [None, np.nan, None, 75, 93, None, 85, 88, 90, 92]
    }
)

scores_df
section student q1 q2 q3
0 A 1 99.0 95 NaN
1 B 2 87.0 90 NaN
2 A 3 55.0 78 NaN
3 A 4 100.0 94 75.0
4 B 5 75.0 85 93.0
5 B 6 NaN 90 NaN
6 A 7 88.0 90 85.0
7 A 8 90.0 89 88.0
8 B 9 92.0 99 90.0
9 A 10 NaN 100 92.0

Next, let’s filter scores_df to only show scores for students in section A who scored 90 or above on the first quiz (the q1 column). The first code example below accomplishes the filtering by using loc[] with a Boolean array created from two criteria combined with an &, as we learned to do earlier in this chapter.

scores_df.loc[(scores_df['section'] == 'A') & (scores_df['q1'] >= 90)]
section student q1 q2 q3
0 A 1 99.0 95 NaN
3 A 4 100.0 94 75.0
7 A 8 90.0 89 88.0

This next code example accomplishes the same filtering, but uses query() instead of loc[]. The query string is text, so it must be enclosed in single quotes or double quotes. Since one of the criteria refers to the section, which is a string value, we need to use different delimiters for the query string and for the section value. In the code example below single quotes are used to delimit the query string, and double quotes are used around the section value. We could also have used double quotes for the query string and single quotes for the section value and it would work the same way.

scores_df.query('section == "A" and q1 >= 90')
section student q1 q2 q3
0 A 1 99.0 95 NaN
3 A 4 100.0 94 75.0
7 A 8 90.0 89 88.0

If the column names are not legal python identifiers (i.e. they have spaces in them or begin with numbers) we need to add backticks around them in the query string. Let’s change the column names of scores_df so that they are not legal python identifiers, and then demonstrate how to refer to them with backticks.

scores_df.columns = ['section', 'student number', '1 quiz', '2 quiz', '3 quiz']
scores_df
section student number 1 quiz 2 quiz 3 quiz
0 A 1 99.0 95 NaN
1 B 2 87.0 90 NaN
2 A 3 55.0 78 NaN
3 A 4 100.0 94 75.0
4 B 5 75.0 85 93.0
5 B 6 NaN 90 NaN
6 A 7 88.0 90 85.0
7 A 8 90.0 89 88.0
8 B 9 92.0 99 90.0
9 A 10 NaN 100 92.0

We see above that the column names have been changed. The code below uses query() to select the data for students with student number >= 5 and score on the third quiz >=90. Since student number and 3 quiz are both invalid python identifiers (the first contains a space and the second begins with a number and contains a space) we need to put backticks around them in the query string.

scores_df.query('`student number` >= 5 and `3 quiz` >= 90')
section student number 1 quiz 2 quiz 3 quiz
4 B 5 75.0 85 93.0
8 B 9 92.0 99 90.0
9 A 10 NaN 100 92.0

You can refer to a variable within the query string by prefixing the variable name with @. This gives you the flexibility to make your code more dynamic. For example, in the code below we first restore the column names for scores_df to the original column names. Then we create a variable named honors_thresh that contains the threshold score value for receiving honors. Finally, we use query() to select all the data for students who scored at or above the threshold score on the first quiz.

# First, restore sane column names to the scores_df DataFrame
scores_df.columns = ['section', 'student', 'q1', 'q2', 'q3']

# Assign threshold score for honors to a variable
honors_thresh = 93

quiz_1_honors = scores_df.query('q1 >= @honors_thresh')
quiz_1_honors
section student q1 q2 q3
0 A 1 99.0 95 NaN
3 A 4 100.0 94 75.0

Arithmetic operators can also be used within the query string.

In the code below we identify students who improved at least 10 points from quiz 1 to quiz 2.

scores_df.query('q2 >= q1 + 10')
section student q1 q2 q3
2 A 3 55.0 78 NaN
4 B 5 75.0 85 93.0

1.9.4.1 The isin() method

The isin() method returns a Boolean array based on whether or not each value in a Series is in a specified collection of values. It is very useful for specifying multiple criteria without having to write lots of fully-stated criteria connected with logical or.

For example, let’s look at date, player, team, and pts for the top scoring performances by the Lakers or Clippers, sorted in decreasing order of points scored. Only rows for player-games in which 40 or more points were scored are included.

(boxscores
 .loc[(boxscores['team'].isin(['LAL', 'LAC'])) &
      (boxscores['pts'] >= 40), ['date', 'player', 'team', 'pts']]
 .sort_values(by = 'pts', ascending = False)
)
date player team pts
20318 20240308 D. Russell LAL 44
6855 20231208 K. Leonard LAC 41
11779 20240109 A. Davis LAL 41
23829 20240331 P. George LAC 41
9174 20231223 L. James LAL 40
9312 20231225 A. Davis LAL 40
19086 20240229 A. Davis LAL 40
21558 20240316 L. James LAL 40
23807 20240331 L. James LAL 40

1.10 Operations on columns

In this section we will learn how to create new columns from existing columns, how to remove rows or columns, and how to rename columns.

1.10.1 Adding new columns by assigning to them

You can create a new column simply by assigning to it. For example, in the code example below we create new columns named tres_pct, fg_pct, and ft_pct to represent percentage of three pointers made, percentage of shots made, and percentage of free throws made, respectively.

Operations are performed element-wise, so you don’t need to loop over rows. What does that mean? Take a look at the first line in the code below. What it means is that the value in each row of the new column tres_pct (three point shooting percentage) is calculated by taking that row’s tres_m (three pointers made) and dividing it by that row’s tres_a (three pointers attempted).

boxscores['tres_pct'] = boxscores['tres_m']/boxscores['tres_a']
boxscores['fg_pct'] = boxscores['fg_m']/boxscores['fg_a']
boxscores['ft_pct'] = boxscores['ft_m']/boxscores['ft_a']

Assigning directly to a column that did not previously exist in the DataFrame adds that column to the DataFrame.

1.10.2 Adding new columns with assign()

We can also use the assign() method to create a copy of the DataFrame with specified columns added. This is useful for method chaining. For example, below we calculate points per minute and view the player name, points per minute, points, rebounds, assists, steals, and minutes for the 10 highest points-per-minute games among players who played at least 28 minutes. Note that with this approach the new column ppm is not permanently added to the boxscores DataFrame.

(boxscores
 .assign(ppm = boxscores['pts'] / boxscores['min'])
 .loc[boxscores['min'] >= 28, ['player', 'ppm', 'pts', 'reb', 'ast', 'stl', 'min']]
 .sort_values(by = 'ppm', ascending = False)
 .head(10)
)
player ppm pts reb ast stl min
13726 J. Embiid 1.89 70 18 5 1 37
7396 G. Antetokounmpo 1.73 64 14 3 4 37
14327 D. Booker 1.63 62 5 4 1 38
13768 K. Towns 1.63 62 8 2 0 38
14280 L. Doncic 1.62 73 10 7 1 45
2072 J. Embiid 1.55 48 11 6 1 31
2560 G. Antetokounmpo 1.50 54 12 3 0 36
24354 M. Flynn 1.47 50 6 5 3 34
15660 Steph Curry 1.46 60 6 4 0 41
7782 J. Brunson 1.43 50 6 9 5 35

1.10.3 Using drop() to remove rows or columns from a DataFrame

With drop() we specify the labels to drop and the axis to which the labels refer (0 for rows, 1 for columns). drop() then returns a copy of the DataFrame with the specified rows or columns dropped. It does not affect the original DataFrame unless we assign the result back to it or set inplace = True.

As an example, let’s drop the section column from the scores_df DataFrame.

scores_df.drop('section', axis = 1, inplace = True)

scores_df
student q1 q2 q3
0 1 99.0 95 NaN
1 2 87.0 90 NaN
2 3 55.0 78 NaN
3 4 100.0 94 75.0
4 5 75.0 85 93.0
5 6 NaN 90 NaN
6 7 88.0 90 85.0
7 8 90.0 89 88.0
8 9 92.0 99 90.0
9 10 NaN 100 92.0
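drop() can remove rows as well. With axis = 0, which is the default, the labels refer to the row index. The sketch below returns a copy of scores_df without the rows labeled 0 and 1; the original DataFrame is unchanged because we neither assign the result back nor set inplace = True.

# Return a copy of scores_df without the rows labeled 0 and 1
scores_df.drop([0, 1], axis = 0)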

1.10.4 Renaming DataFrame columns

To rename columns in a DataFrame use the rename() method with the columns parameter set to a dict in the form {old_name: new_name}.

scores_df.rename(columns = {'q1': 'first_quiz',
                            'q2': 'second_quiz',
                            'q3': 'third_quiz'})
student first_quiz second_quiz third_quiz
0 1 99.0 95 NaN
1 2 87.0 90 NaN
2 3 55.0 78 NaN
3 4 100.0 94 75.0
4 5 75.0 85 93.0
5 6 NaN 90 NaN
6 7 88.0 90 85.0
7 8 90.0 89 88.0
8 9 92.0 99 90.0
9 10 NaN 100 92.0

Note that your change won’t persist unless you set inplace = True or assign the result of the rename() method back to scores_df. When we display scores_df below we see that it still has the old names for the columns that represent the quiz scores.

scores_df
student q1 q2 q3
0 1 99.0 95 NaN
1 2 87.0 90 NaN
2 3 55.0 78 NaN
3 4 100.0 94 75.0
4 5 75.0 85 93.0
5 6 NaN 90 NaN
6 7 88.0 90 85.0
7 8 90.0 89 88.0
8 9 92.0 99 90.0
9 10 NaN 100 92.0

To rename columns we can also just assign new column labels over the old ones with the columns property. This operation doesn’t return a copy of the DataFrame; it changes the column labels in place.

scores_df.columns = ['student', 'quiz_1', 'quiz_2', 'quiz_3']
scores_df
student quiz_1 quiz_2 quiz_3
0 1 99.0 95 NaN
1 2 87.0 90 NaN
2 3 55.0 78 NaN
3 4 100.0 94 75.0
4 5 75.0 85 93.0
5 6 NaN 90 NaN
6 7 88.0 90 85.0
7 8 90.0 89 88.0
8 9 92.0 99 90.0
9 10 NaN 100 92.0

1.11 Vectorized string methods with the str accessor

Series has a set of vectorized string processing methods. Vectorized means that they operate on each element of the Series individually and then return a Series of all the individual results. These methods also exclude missing/NA values automatically. These string methods are accessed via the Series object’s str accessor and generally have names matching python’s built-in string methods. For more information on these string methods see the pandas string methods documentation.

As an example, let’s look at the top-10 scoring performances by players whose first initial is R. In the code below the .str after boxscores['player'] lets us access the string method startswith(). We use the startswith() method to check for player values that start with the character ‘R’. The code then uses method chaining to sort the result by pts descending and then show the first 10 rows.

(boxscores
 .loc[boxscores['player'].str.startswith('R'), ['player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
 .head(10)
)
player pts
11513 R. Barrett 37
17582 R. Hachimura 36
5369 R. Jackson 35
25943 R. Barrett 35
23312 R. Hachimura 32
25218 R. Hachimura 30
20036 R. Hachimura 29
13748 R. Barrett 29
24136 R. Barrett 28
7009 R. Barrett 27

Another useful string method that may be applied to a Series element-wise with the str accessor is contains(). With this method we can check to see if a string contains a specific value. For example, the following code shows the unique player names for all players whose name includes the string ‘Ham’.

boxscores.loc[boxscores['player'].str.contains('Ham'), 'player'].unique()
array(['R.J. Hampton'], dtype=object)

The contains() method has a parameter, case, that defaults to True, which makes the test case-sensitive. If we don’t want it to be case sensitive we can set case = False. Note how the code example below is identical to the code example above, except that we have specified that the test is not case-sensitive by specifying case = False. Many more names are a match when we remove the case sensitivity.

boxscores.loc[boxscores['player'].str.contains('Ham', case = False), 'player'].unique()
array(['C. Cunningham', 'M. Branham', 'M. Beauchamp', 'R.J. Hampton',
       'Julian Champagnie', 'L. Shamet', 'D. Graham', 'Justin Champagnie'],
      dtype=object)

1.11.1 Extracting elements from lists or strings with str.get()

When a vectorized string method results in a Series of lists or strings the str.get() method may be used to extract elements from the lists or strings by position.

For example, to extract the last names of the NBA players we could use str.split() to divide each name into a list, and then use the str.get() method to extract the second item in the list.

boxscores['player'].str.split().str.get(1)
0           Davis
1          Prince
2           James
3         Russell
4          Reaves
           ...   
26493         Len
26494    Mitchell
26495      Duarte
26496       Jones
26497       Jones
Name: player, Length: 26498, dtype: object
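str.get() also works by position on plain strings. As a quick sketch, the code below extracts each player’s first initial (the character at position 0):

boxscores['player'].str.get(0)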

1.12 Accessing parts of datetime fields with the dt accessor

Base python has a datetime type that is used for storing data that represents dates and times. Pandas has a dt accessor for accessing parts of datetime objects in a Series that is similar to the str accessor that we use to apply string methods to each item in a Series. We can use dt to access components of datetime objects, including the following:

  • date - the date without the time values
  • month - month as an integer (1 - 12)
  • day - day as an integer (1 - 31)
  • year - year as an integer
  • isocalendar().week - week of the year as an integer
  • weekday - day of week returned as an integer (0 = Monday; 6 = Sunday)
  • day_name() - day of week returned as a string

Before we demonstrate some of these components let’s first transform the date column in the boxscores DataFrame to a datetime type. We saw when we ran the info() method on the boxscores DataFrame previously that the date column is of type int64. Pandas provides a to_datetime() function that converts a Series of values to datetimes. It tries to infer the string format of the dates in the column of values it is to convert; if it can’t infer the format you can indicate it with the format parameter.

boxscores['date'] = pd.to_datetime(boxscores['date'], format = "%Y%m%d")

After that transformation, when we run the info() method on the boxscores DataFrame we see that the date column is now of type datetime64[ns].

boxscores.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 25 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      26498 non-null  datetime64[ns]
 1   team      26498 non-null  object        
 2   opponent  26498 non-null  object        
 3   player    26498 non-null  object        
 4   position  26498 non-null  object        
 5   min       26498 non-null  int64         
 6   oreb      26498 non-null  int64         
 7   dreb      26498 non-null  int64         
 8   reb       26498 non-null  int64         
 9   ast       26498 non-null  int64         
 10  stl       26498 non-null  int64         
 11  blk       26498 non-null  int64         
 12  to        26498 non-null  int64         
 13  pf        26498 non-null  int64         
 14  plus_m    26498 non-null  int64         
 15  pts       26498 non-null  int64         
 16  fg_m      26498 non-null  int64         
 17  fg_a      26498 non-null  int64         
 18  tres_m    26498 non-null  int64         
 19  tres_a    26498 non-null  int64         
 20  ft_m      26498 non-null  int64         
 21  ft_a      26498 non-null  int64         
 22  tres_pct  20559 non-null  float64       
 23  fg_pct    24895 non-null  float64       
 24  ft_pct    14232 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(17), object(4)
memory usage: 5.1+ MB
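One more note on to_datetime(): by default it raises an error when it encounters a value it cannot parse. If you would rather convert unparseable values into NaT (pandas’ missing-timestamp marker) and handle them afterwards, you can pass errors = "coerce". A minimal sketch with made-up values (our date column parsed cleanly, so this is just for reference):

# The second value does not match the format, so with errors = "coerce"
# it becomes NaT instead of raising an error
pd.to_datetime(pd.Series(['20231225', 'not a date']), format = "%Y%m%d", errors = "coerce")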

To see a bit of how the dt accessor works, let’s view 5 random rows of the date column and then use the dt accessor and one of the datetime methods to extract just the day of the week from the dates for those rows.

bs_selection = boxscores.sample(5)
bs_selection['date']
10392   2024-01-01
10465   2024-01-01
19810   2024-03-05
7437    2023-12-13
13121   2024-01-18
Name: date, dtype: datetime64[ns]

Now let’s use the dt accessor and the day_name() method to see which days of the week those dates are.

bs_selection['date'].dt.day_name()
10392       Monday
10465       Monday
19810      Tuesday
7437     Wednesday
13121     Thursday
Name: date, dtype: object
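The other dt components listed earlier work the same way. For example (a quick sketch using the same sampled rows):

# Extract the month and year as integers, and the ISO week number
bs_selection['date'].dt.month
bs_selection['date'].dt.year
bs_selection['date'].dt.isocalendar().week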

Let’s look at the top 10 scoring performances in October, the first month of the season.

(boxscores.loc[boxscores['date'].dt.month == 10, ['date', 'player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
 .head(10)
)
            date            player  pts
 575  2023-10-28         Z. LaVine   51
 504  2023-10-27         L. Doncic   49
 444  2023-10-27  Donovan Mitchell   43
1034  2023-10-30       Steph Curry   42
 553  2023-10-27       Steph Curry   41
 960  2023-10-30         D. Murray   41
 563  2023-10-27            D. Fox   39
 329  2023-10-26        D. Lillard   39
 336  2023-10-26         K. Durant   39
 871  2023-10-29            D. Fox   37

Next, let’s look at the top scoring performances that occurred on a Monday.

(boxscores
 .loc[boxscores['date'].dt.day_name() == 'Monday', ['date', 'player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
 .head(10)
)
             date         player  pts
13726  2024-01-22      J. Embiid   70
13768  2024-01-22       K. Towns   62
24107  2024-04-01      D. Booker   52
 9344  2023-12-25      L. Doncic   50
 2072  2023-11-06      J. Embiid   48
14975  2024-01-29      L. Doncic   45
 4191  2023-11-20       J. Tatum   45
 2100  2023-11-06      C. Thomas   45
 8209  2023-12-18  C. Cunningham   43
13801  2024-01-22      K. Durant   43

1.13 Dealing with missing data

Pandas attempts to make working with missing data as painless as possible. For example, all of the descriptive statistics methods on pandas objects exclude missing data by default, because their skipna parameter defaults to True.

For numbers, pandas uses the floating-point value NaN (Not a Number) to represent missing data. The python special value None is also treated as missing data.

Let’s practice working with the scores_df DataFrame, which was created with some missing data.

scores_df
   student  quiz_1  quiz_2  quiz_3
0        1    99.0      95     NaN
1        2    87.0      90     NaN
2        3    55.0      78     NaN
3        4   100.0      94    75.0
4        5    75.0      85    93.0
5        6     NaN      90     NaN
6        7    88.0      90    85.0
7        8    90.0      89    88.0
8        9    92.0      99    90.0
9       10     NaN     100    92.0
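If you are following along in your own session and need to recreate scores_df, the following construction is consistent with the output above (reconstructed from the display, so treat it as a sketch rather than the original definition):

scores_df = pd.DataFrame({
    'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'quiz_1': [99, 87, 55, 100, 75, np.nan, 88, 90, 92, np.nan],
    'quiz_2': [95, 90, 78, 94, 85, 90, 90, 89, 99, 100],
    'quiz_3': [np.nan, np.nan, np.nan, 75, 93, np.nan, 85, 88, 90, 92]
})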

1.13.1 Methods for identifying and working with missing values

The following methods are very useful for dealing with missing values:

  • dropna() - Filters axis labels based on whether values for each label have missing data. Can use the thresh parameter to set a threshold for how many values must be non-missing for the row or column to be kept. Specify rows or columns with the axis parameter.
  • fillna() - Fills in missing data with a specified value or by using an interpolation method such as “ffill” or “bfill”.
  • isna() - Returns Boolean values, with True where values are missing. This can be combined with sum() to count missing values or mean() to calculate percentage of missing values.
  • notna() - Negation of isna(). Returns True for non-missing values and False for missing values.

In the code cell below we count the missing values in each column (the default) of the scores_df DataFrame. We make use of a handy trick that works in many programming languages. scores_df.isna() returns a DataFrame of Booleans (True and False values) of the same size and shape as the scores_df DataFrame, with True in the positions that have missing values. When a math operation such as the sum() method is applied to Booleans, the True values are treated as ones and the False values are treated as zeros. So, summing the Booleans gives the count of True values, and since True occurs only where values are missing, we get a count of missing values.

scores_df.isna().sum()
student    0
quiz_1     2
quiz_2     0
quiz_3     4
dtype: int64

If we want to count the missing values by row we can set axis = 1 within the sum() method, which sums across the columns.

scores_df.isna().sum(axis = 1)
0    1
1    1
2    1
3    0
4    0
5    2
6    0
7    0
8    0
9    1
dtype: int64

Next we calculate the proportion of missing values by column. When we apply the mean() method to Booleans the True values are treated as ones and the False values are treated as zeros. The mean() method takes the column means by default. When you take the mean of a set of zeros and ones you end up with the proportion of ones! In this case the proportion of ones is also the proportion of missing values, because the True values were treated as ones.

Just to illustrate, if we have the values 1, 1, 0, 0, 0 and we take the mean we get 2/5, or 0.40. This is also the proportion of ones in the values (2 ones out of 5 total values).

scores_df.isna().mean()
student    0.0
quiz_1     0.2
quiz_2     0.0
quiz_3     0.4
dtype: float64

We can calculate the proportion of missing values in the rows by setting the axis parameter to 1.

scores_df.isna().mean(axis = 1)
0    0.25
1    0.25
2    0.25
3    0.00
4    0.00
5    0.50
6    0.00
7    0.00
8    0.00
9    0.25
dtype: float64

If we set the axis parameter to None we can calculate the proportion of missing values in the entire DataFrame.

scores_df.isna().mean(axis = None)
0.15

We can also use the Boolean Series returned by isna() within loc[] to select rows based on missing values. The code below selects just the rows that have missing values for quiz 1.

scores_df.loc[scores_df['quiz_1'].isna()]
   student  quiz_1  quiz_2  quiz_3
5        6     NaN      90     NaN
9       10     NaN     100    92.0
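The notna() method works the same way in reverse. For example, this sketch keeps only the rows where a quiz 1 score is present:

# notna() returns True where values are NOT missing, so this selects
# the eight students who have a quiz 1 score
scores_df.loc[scores_df['quiz_1'].notna()]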

The dropna() method can be used to drop rows or columns that have missing values. By default the axis parameter is set to 0, so rows with missing values are dropped.

scores_df
   student  quiz_1  quiz_2  quiz_3
0        1    99.0      95     NaN
1        2    87.0      90     NaN
2        3    55.0      78     NaN
3        4   100.0      94    75.0
4        5    75.0      85    93.0
5        6     NaN      90     NaN
6        7    88.0      90    85.0
7        8    90.0      89    88.0
8        9    92.0      99    90.0
9       10     NaN     100    92.0
scores_df.dropna()
   student  quiz_1  quiz_2  quiz_3
3        4   100.0      94    75.0
4        5    75.0      85    93.0
6        7    88.0      90    85.0
7        8    90.0      89    88.0
8        9    92.0      99    90.0

If we set axis = 1, columns with missing values are dropped. Now we only see the columns for student and quiz 2, since quiz 1 and quiz 3 both have some missing values.

scores_df.dropna(axis = 1)
   student  quiz_2
0        1      95
1        2      90
2        3      78
3        4      94
4        5      85
5        6      90
6        7      90
7        8      89
8        9      99
9       10     100

We can use the thresh parameter to specify how many values must be non-missing for the column or row to be kept. In the code below we drop columns that do not have at least 7 non-missing values. Quiz 3 has only 6 non-missing values, so it is dropped, and the result retains the columns for student, quiz 1, and quiz 2.

scores_df.dropna(axis = 1, thresh = 7)
   student  quiz_1  quiz_2
0        1    99.0      95
1        2    87.0      90
2        3    55.0      78
3        4   100.0      94
4        5    75.0      85
5        6     NaN      90
6        7    88.0      90
7        8    90.0      89
8        9    92.0      99
9       10     NaN     100
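The thresh parameter works for rows as well. For instance, the following sketch drops any row with fewer than 3 non-missing values, which for this DataFrame removes only student 6 (the row with two missing scores):

# axis = 0 (rows) is the default; keep rows with at least 3 non-missing values
scores_df.dropna(thresh = 3)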

1.13.2 Examples of how missing data affects various calculations

When doing mathematical calculations it is important to consider the effects of missing data. The pandas basic statistical methods ignore missing data by default because their skipna parameter defaults to True.

scores_df.loc[:, ['quiz_1', 'quiz_2', 'quiz_3']].mean()  # skipna = True is set by default
quiz_1    85.75
quiz_2    91.00
quiz_3    87.17
dtype: float64

Skipping the missing values may lead to misleading summary statistics if there are a lot of missing values. For example, imagine taking the mean score for a quiz for which 20 students have scores and 30 students do not yet have scores. You would get a mean, but it only reflects scores for 40% of the students. If we set skipna = False the summary statistic will only be calculated if there are no missing values. The code below only produces a mean score for quiz 2, because that is the only quiz for which none of the scores are missing.

scores_df.loc[:, ['quiz_1', 'quiz_2', 'quiz_3']].mean(skipna = False)
quiz_1     NaN
quiz_2    91.0
quiz_3     NaN
dtype: float64

Normal arithmetic operators, unlike the pandas summary methods, do not ignore missing values; any arithmetic involving NaN produces NaN. The code below adds a new column called quiz_avg to the scores_df DataFrame containing each student’s calculated average quiz score. Note that it only creates an average for students who have no missing quiz scores.

scores_df['quiz_avg'] = (scores_df['quiz_1'] + scores_df['quiz_2'] + scores_df['quiz_3']) / 3
scores_df
   student  quiz_1  quiz_2  quiz_3  quiz_avg
0        1    99.0      95     NaN       NaN
1        2    87.0      90     NaN       NaN
2        3    55.0      78     NaN       NaN
3        4   100.0      94    75.0     89.67
4        5    75.0      85    93.0     84.33
5        6     NaN      90     NaN       NaN
6        7    88.0      90    85.0     87.67
7        8    90.0      89    88.0     89.00
8        9    92.0      99    90.0     93.67
9       10     NaN     100    92.0       NaN

To create an average quiz score for all students we could first use the fillna() method to replace the missing values with zeros and then make the new column.

# Here we fill the missing values with zeros first
scores_df = scores_df.fillna(0)
scores_df['quiz_avg'] = (scores_df['quiz_1'] + scores_df['quiz_2'] + scores_df['quiz_3']) / 3
scores_df
   student  quiz_1  quiz_2  quiz_3  quiz_avg
0        1    99.0      95     0.0     64.67
1        2    87.0      90     0.0     59.00
2        3    55.0      78     0.0     44.33
3        4   100.0      94    75.0     89.67
4        5    75.0      85    93.0     84.33
5        6     0.0      90     0.0     30.00
6        7    88.0      90    85.0     87.67
7        8    90.0      89    88.0     89.00
8        9    92.0      99    90.0     93.67
9       10     0.0     100    92.0     64.00
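Keep in mind that filling with zeros treats a missed quiz as a score of zero, which pulls those students’ averages down; whether that is appropriate depends on the grading policy. If instead you want each student’s average over only the quizzes they actually took, the row-wise mean() skips missing values by default. A sketch, applied to the DataFrame as it was before the fillna() step above:

# mean(axis = 1) averages across the quiz columns for each row,
# skipping missing values by default (skipna = True)
scores_df['quiz_avg'] = scores_df[['quiz_1', 'quiz_2', 'quiz_3']].mean(axis = 1)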