1 Pandas Fundamentals
Pandas is a package that provides data structures and tools for practical, real-world data manipulation and analysis in python. It is designed to work with tabular data, that is, data structured with rows representing entities and columns representing variables related to those entities, like most spreadsheet data. Pandas is the most widely-used data manipulation package for python, and is a fundamental part of the business analyst’s toolset. In the Reddit thread linked below a user asks if pandas is used in industry and whether it is worthwhile to learn pandas. Take a look at the responses the user receives.
Reddit Thread: How much is Pandas used in the Industry? Is it worth learning?
This chapter introduces some fundamental features and tools of pandas. We will learn about the fundamental pandas data structures, the DataFrame and the Series, as well as tools for investigating the data in a DataFrame or Series and tools for filtering or subsetting data based on various criteria. This chapter also covers how to sort data, and introduces special methods for working with string data in columns and dates in columns.
It is important to recognize that the pandas skills and knowledge covered in this textbook are a useful and important subset, but we won’t be covering all of pandas, and we won’t present every feature of the pandas tools we do cover! To learn pandas really well you will need to continue to practice pandas skills frequently, and consult references to broaden your skillset from time to time. As you practice using the tools (pandas objects, properties and methods) presented in this textbook it is recommended that you take a look at their descriptions in the pandas documentation below to get a feel for all that they can do.
The pandas website has many good references to help you learn. You can consult the user guide or the API reference there. There is also a nice one-page pandas cheat sheet posted there.
If you like to learn from books I recommend the following book. A link is provided to the free edition, but it is also available on Amazon.com if you want a hard copy:
- Python for Data Analysis, 3rd Edition by Wes McKinney, the original author of pandas
If you prefer to learn from videos, there are also a number of highly regarded YouTube video series for learning pandas.
OK. Let’s start by importing the pandas package into our python session. When working with add-on packages in python you first need to install the package if it is not already installed, and then import the package or specific functions from the package into your python session to make the package and its functions available to your code. Pandas is typically imported with the following statement:
import pandas as pd
We will also import the numpy package in the standard manner to have access to numpy functions.
import numpy as np
1.1 Pandas data structures
Pandas has two primary data structures:
- Series (1-dimensional) - An array of data of a single data type, akin to a column of data in a typical tabular data structure
- DataFrame (2-dimensional) - Made up of multiple Series. Each Series can be a different data type. The DataFrame holds tabular data made up of rows and columns, like we might see in an Excel table or CSV file. Each of the columns is a Series object.
The DataFrame and Series data structures in pandas are implemented as objects. In the object-oriented programming paradigm objects can have properties, which are descriptive elements, and methods, which are functions that operate on the object or that the object can perform. You can think of properties as information about the object and methods as things the object can do. So, as you learn pandas you will be learning lots of different properties and methods that apply to Series or DataFrames. You will also learn pandas functions that can operate on Series or DataFrames.
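As a quick illustration of the difference, a property is accessed without parentheses, while a method is called with parentheses. The sketch below uses a small throwaway Series (the Series type is introduced formally in the next section):
# A small Series just for illustration
s = pd.Series([3, 1, 4, 1, 5])

# shape is a property: descriptive information, accessed without parentheses
s.shape

# mean() is a method: an action the object can perform, called with parentheses
s.mean()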
A pandas DataFrame can be created in many ways. Below we will see how to create a DataFrame from a python dict
object (“dictionary”) and from a CSV file. CSV is a common text representation of tabular data. CSV stands for “comma-separated values”. In a CSV file each row represents a row in a table, and the rows use commas to separate the columns within the rows.
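For example, the student data we will put into a DataFrame in the next section could be stored in a CSV file whose text looks like this (a small illustration; the first line holds the column names):
name,year,score,gender
Hannah,2,98,F
Sam,1,67,NB
Bill,1,75,M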
1.2 Creating a DataFrame
There are many different ways of creating a pandas DataFrame. We will focus on two commonly-used methods: creating a DataFrame from a python dict (“dictionary”) and creating a DataFrame from a comma-separated-values (CSV) text file.
1.2.1 Creating a DataFrame from a python dict
The code below defines a basic python dict, then creates and displays a pandas DataFrame called student_data
from that dict, using the pandas DataFrame()
function. Since the DataFrame()
function is supplied by pandas and is not a part of base python, we need to precede it with pd
. That is because when we imported the pandas package we imported it with the statement import pandas as pd
. The as pd
part establishes pd
as an alias for pandas, and in our code we should type pd
instead of pandas
when we want to access pandas functions.
The keys of the python dict become the column names in the new DataFrame, and the values of the python dict become the column values in the new DataFrame.
scores = {
    'name': ["Hannah", "Sam", "Bill", "Erin", "Latasha"],
    'year': [2, 1, 1, 4, 4],
    'score': [98, 67, 75, 88, 99],
    'gender': ["F", "NB", "M", "F", "F"]
}

student_data = pd.DataFrame(scores)
student_data
name | year | score | gender | |
---|---|---|---|---|
0 | Hannah | 2 | 98 | F |
1 | Sam | 1 | 67 | NB |
2 | Bill | 1 | 75 | M |
3 | Erin | 4 | 88 | F |
4 | Latasha | 4 | 99 | F |
Notice that pandas automatically adds a row index to the DataFrame, in this case the integers from 0 to 4. We can see the row index displayed on the left side of the DataFrame. There is also a column index created from the column names. The index
property of a DataFrame refers to the row index, and the columns
property of a DataFrame
refers to the column index.
Note, also, that row indexes and column indexes are special pandas data types. We will work with them more closely in later chapters.
The properties and methods of a DataFrame or Series object can be accessed with dot notation. The identifier (“name”) student_data
refers to a specific pandas DataFrame object, the one we created above. Because student_data
is an object of type DataFrame
it has the properties and methods shared by all objects of the DataFrame type. To access a property or method, use the DataFrame name followed by a dot (.
), followed by the name of the property or method. For example, the code below uses the index
property of the student_data
DataFrame to display information about the row index. We can see from the code output that the index is of type RangeIndex
that goes from 0 to 5 by 1. Similar to a python range, the stop value, 5, is not included.
student_data.index
RangeIndex(start=0, stop=5, step=1)
Similarly, we can use the columns
property to access information on the DataFrame’s column index.
student_data.columns
Index(['name', 'year', 'score', 'gender'], dtype='object')
We can also specify an index for a DataFrame using the set_index()
method. In the code below we specify that the index for the student_data
DataFrame should be the name
column.
student_data = student_data.set_index('name')
Notice in the code above that we had to assign the result of applying the set_index()
method on the student_data
DataFrame back to the student_data
object for the changes to persist. This is something to watch out for when using pandas: some methods or functions return a copy of the object with the operation applied, while other operations alter the object to which the operation is applied. Many of the functions or methods that return a copy of the object with the operation applied have an inplace
parameter that is by default set to False
. If you set this parameter to True
the operation will alter the object in place rather than returning a copy of the object with the operation applied.
Let’s re-create the student_data
DataFrame and investigate this concept. When we re-create the student_data
DataFrame from scratch with the code below the DataFrame is created using the default index of consecutive integers.
student_data = pd.DataFrame(scores)
student_data
name | year | score | gender | |
---|---|---|---|---|
0 | Hannah | 2 | 98 | F |
1 | Sam | 1 | 67 | NB |
2 | Bill | 1 | 75 | M |
3 | Erin | 4 | 88 | F |
4 | Latasha | 4 | 99 | F |
The code below uses the set_index()
method to change the student_data
DataFrame’s index to the name
column. The set_index()
method, though, is one of those methods that by default only returns a copy of the object with the method applied, rather than altering the object in place. So, here it creates and returns a copy of the student_data
DataFrame that has the name
column as the index, but that altered copy of the DataFrame is not assigned to anything, so when we display the student_data
DataFrame we see that it is unchanged; it still has the integer index.
# This will not actually change the student_data DataFrame's index!
student_data.set_index('name')
student_data
name | year | score | gender | |
---|---|---|---|---|
0 | Hannah | 2 | 98 | F |
1 | Sam | 1 | 67 | NB |
2 | Bill | 1 | 75 | M |
3 | Erin | 4 | 88 | F |
4 | Latasha | 4 | 99 | F |
To make the index change persistent we need to tell the set_index()
method to actually alter the DataFrame in place by setting the inplace
parameter to True
, as in the following code. Look at the left side of the DataFrame in the output and you will see that the index for the DataFrame is now the data that was in the name
column. Note that when a column becomes the index, like this, it is no longer a column and can no longer be accessed in the same way we normally access a DataFrame column.
# This WILL change the student_data DataFrame's index.
student_data.set_index('name', inplace=True)
student_data
year | score | gender | |
---|---|---|---|
name | |||
Hannah | 2 | 98 | F |
Sam | 1 | 67 | NB |
Bill | 1 | 75 | M |
Erin | 4 | 88 | F |
Latasha | 4 | 99 | F |
1.2.2 Creating a DataFrame from a CSV file
Creating a DataFrame from a CSV file is very useful, because organizations typically have a lot of data in spreadsheets, and spreadsheet applications like Microsoft Excel and LibreOffice Calc can easily save their data as a CSV file. Organizations that share free data also often use the CSV format to disseminate the data.
The code below reads a CSV file from a server on the Internet and then creates a new DataFrame object called boxscores
from the CSV file. This data is boxscore data from the 2023-24 NBA season. Each row of the data represents one player’s performance in one game.
# Here we read a csv file from the web
= "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
url = pd.read_csv(url) boxscores
Note that if the file is read successfully there is no feedback provided by pandas.
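If you want quick feedback that the read succeeded, one simple sanity check (a sketch; the shape property is covered later in this chapter) is to look at the new DataFrame's dimensions:
# Quick sanity check: number of rows and columns read from the CSV
boxscores.shape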
After reading the file into the DataFrame boxscores
we should investigate the data. The following section introduces many ways to get more familiar with the data in a DataFrame. These are important because, unlike in GUI applications such as Microsoft Excel, the data is not automatically displayed for us to see.
1.3 Investigating a DataFrame
1.3.1 The info()
method
The info()
method shows summary information about the DataFrame. From the output of the info() method we can see how many columns and rows the DataFrame has, the name and data type of each of the columns, the number of non-missing entries in each column, and the amount of memory used to store the DataFrame.
boxscores.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 26498 non-null int64
1 team 26498 non-null object
2 opponent 26498 non-null object
3 player 26498 non-null object
4 position 26498 non-null object
5 min 26498 non-null int64
6 oreb 26498 non-null int64
7 dreb 26498 non-null int64
8 reb 26498 non-null int64
9 ast 26498 non-null int64
10 stl 26498 non-null int64
11 blk 26498 non-null int64
12 to 26498 non-null int64
13 pf 26498 non-null int64
14 plus_m 26498 non-null int64
15 pts 26498 non-null int64
16 fg_m 26498 non-null int64
17 fg_a 26498 non-null int64
18 tres_m 26498 non-null int64
19 tres_a 26498 non-null int64
20 ft_m 26498 non-null int64
21 ft_a 26498 non-null int64
dtypes: int64(18), object(4)
memory usage: 4.4+ MB
1.3.2 The head(n)
and tail(n)
methods.
You can use these methods to view the first n
rows or the last n
rows of the DataFrame, respectively. n
defaults to 5.
boxscores.head()
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20231024 | LAL | DEN | A. Davis | PF | 34 | 1 | 7 | 8 | 4 | ... | 2 | 3 | -17 | 17 | 6 | 17 | 1 | 2 | 4 | 4 |
1 | 20231024 | LAL | DEN | T. Prince | PF | 30 | 1 | 2 | 3 | 1 | ... | 1 | 0 | -14 | 18 | 6 | 8 | 4 | 6 | 2 | 2 |
2 | 20231024 | LAL | DEN | L. James | SF | 29 | 1 | 7 | 8 | 5 | ... | 0 | 1 | 7 | 21 | 10 | 16 | 1 | 4 | 0 | 1 |
3 | 20231024 | LAL | DEN | D. Russell | PG | 36 | 0 | 4 | 4 | 7 | ... | 3 | 3 | 1 | 11 | 4 | 12 | 2 | 5 | 1 | 2 |
4 | 20231024 | LAL | DEN | A. Reaves | SG | 31 | 4 | 4 | 8 | 4 | ... | 2 | 2 | -14 | 14 | 4 | 11 | 1 | 2 | 5 | 7 |
5 rows × 22 columns
boxscores.tail(2)
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26496 | 20240414 | SAC | POR | M. Jones | G | 6 | 0 | 0 | 0 | 0 | ... | 1 | 2 | -8 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
26497 | 20240414 | SAC | POR | C. Jones | G | 4 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 8 | 3 | 3 | 0 | 0 | 2 | 2 |
2 rows × 22 columns
1.3.3 The sample(n)
method
The sample()
method returns n randomly-selected
rows of the DataFrame. This often gives a better feel for the data than head()
or tail()
, especially if the DataFrame has any inherent order to it.
boxscores.sample(3)
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22653 | 20240323 | PHX | SA | B. Beal | SG | 28 | 0 | 1 | 1 | 12 | ... | 1 | 1 | 15 | 13 | 6 | 10 | 1 | 3 | 0 | 0 |
8764 | 20231221 | MEM | IND | Z. Williams | F | 18 | 0 | 1 | 1 | 2 | ... | 2 | 2 | 6 | 16 | 6 | 9 | 3 | 4 | 1 | 2 |
26115 | 20240412 | UTAH | LAC | D. Bazley | F | 23 | 2 | 3 | 5 | 1 | ... | 2 | 1 | -1 | 12 | 5 | 7 | 1 | 2 | 1 | 1 |
3 rows × 22 columns
1.4 Setting options for the pandas environment
Notice that when we displayed subsets of the DataFrame with the head()
and sample()
methods there were too many columns for us to see, with the default settings for pandas. Some of the columns are left out of the display, and we see ellipses (...
) in the center of the DataFrame output instead. We can use the pandas get_option()
and set_option()
functions to set various environment options. There is also a reset_option()
function to reset an option or options to the default setting, and a describe_option()
function that can be used to show more information about all options available or more information about a specific option.
To view all the available options you can use the describe_option()
method with nothing specified in the parentheses. The code to run to see all the available options is shown below, but we will not run it here because it creates a long list of output!
pd.describe_option()
The option we want to set to control how many columns are displayed is display.max_columns
. Let’s see the maximum number of columns being displayed by default and change the settings so that we can see all of the columns.
pd.get_option('display.max_columns')
20
We know from our investigation of the boxscores
DataFrame above that it has 22 columns. We may also add some new columns later on. So, let’s set the pandas display.max_columns
setting to 30 columns.
pd.set_option('display.max_columns', 30)
After changing that setting, when we display selected rows of the DataFrame we will now be able to see all the columns, because we have 22 columns of data, and the display.max_columns
setting is now set to 30.
boxscores.sample(3)
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22320 | 20240321 | NY | DEN | I. Hartenstein | C | 26 | 3 | 5 | 8 | 3 | 1 | 2 | 3 | 6 | -15 | 20 | 8 | 10 | 0 | 0 | 4 | 5 |
25152 | 20240407 | SA | PHI | V. Wembanyama | C | 43 | 3 | 15 | 18 | 6 | 0 | 7 | 9 | 1 | -1 | 33 | 11 | 20 | 5 | 10 | 6 | 7 |
19765 | 20240305 | MIA | DET | N. Jovic | F | 21 | 0 | 4 | 4 | 2 | 2 | 0 | 0 | 2 | 8 | 8 | 3 | 5 | 1 | 2 | 1 | 2 |
We can also set other options, such as the number of rows displayed and the number of digits to show after the decimal point.
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 2)
1.5 Other properties and methods for investigating data
Other useful properties and methods for investigating your data include shape
, columns
, describe()
, value_counts()
, unique()
, and nunique()
. Some of these methods and properties can be applied to both DataFrame and Series objects, and others apply only to one or the other.
1.5.1 The shape
property
The shape
property returns a tuple that indicates the number of rows and number of columns. This can be useful if you need to programmatically extract the number of rows and columns.
Before going into more details, let’s read in some new data from a CSV file into a new DataFrame named store
.
= pd.read_csv("https://neuronjolt.com/data/superstore.csv") store
Now, let’s use the shape
property of the store
DataFrame to return a tuple that indicates the number of rows and columns in the store
DataFrame.
store.shape
(9994, 20)
Because we know the format of the tuple returned by the shape
property we can extract parts of it with our code. The code below uses indexing to extract the first element of the tuple (the element at index 0, which is the number of rows in the store
DataFrame), store it in a variable named num_rows
, and include that variable in a string that is printed to output.
num_rows = store.shape[0]
print(f"The store DataFrame has {num_rows} rows.")
The store DataFrame has 9994 rows.
1.5.2 Accessing individual columns with []
Recall that the columns
property returns the Index
object with the column names.
boxscores.columns
Index(['date', 'team', 'opponent', 'player', 'position', 'min', 'oreb', 'dreb',
'reb', 'ast', 'stl', 'blk', 'to', 'pf', 'plus_m', 'pts', 'fg_m', 'fg_a',
'tres_m', 'tres_a', 'ft_m', 'ft_a'],
dtype='object')
Individual columns or lists of columns can be accessed through square brackets, similar to values in a python dict. Notice that when one column is indicated in the square brackets a pandas Series is returned. When a list of columns is indicated in the square brackets, even if there is only one column in the list, a pandas DataFrame is returned. This is a common pattern in pandas.
In the code cell below one column is indicated, so a Series is returned.
boxscores['player'].head()
0 A. Davis
1 T. Prince
2 L. James
3 D. Russell
4 A. Reaves
Name: player, dtype: object
In the code cell below we use the type()
function to verify that a Series is returned. The type()
function returns the type or class of its argument.
type(boxscores['player'].head())
pandas.core.series.Series
In the next two examples a list of columns is indicated in the square brackets, so a DataFrame is returned.
boxscores[['player']].head()
player | |
---|---|
0 | A. Davis |
1 | T. Prince |
2 | L. James |
3 | D. Russell |
4 | A. Reaves |
type(boxscores[['player']].head())
pandas.core.frame.DataFrame
boxscores[['player', 'position']].head()
player | position | |
---|---|---|
0 | A. Davis | PF |
1 | T. Prince | PF |
2 | L. James | SF |
3 | D. Russell | PG |
4 | A. Reaves | SG |
type(boxscores[['player', 'position']].head())
pandas.core.frame.DataFrame
1.5.3 The describe()
method
The describe()
method gives summary information about each column (Series) in the DataFrame, such as the number of non-null values, the five-number summary, the mean, and the standard deviation.
boxscores.describe()
date | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.65e+04 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 |
mean | 2.02e+07 | 22.41 | 0.98 | 3.06 | 4.04 | 2.48 | 0.69 | 0.48 | 1.20 | 1.74 | 0.00 | 10.60 | 3.91 | 8.25 | 1.19 | 3.26 | 1.58 | 2.02 |
std | 4.44e+03 | 11.17 | 1.35 | 2.72 | 3.47 | 2.66 | 0.94 | 0.86 | 1.37 | 1.45 | 11.57 | 8.99 | 3.35 | 6.19 | 1.49 | 3.06 | 2.30 | 2.75 |
min | 2.02e+07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -58.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 2.02e+07 | 14.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | -7.00 | 3.00 | 1.00 | 3.00 | 0.00 | 1.00 | 0.00 | 0.00 |
50% | 2.02e+07 | 23.00 | 1.00 | 2.00 | 3.00 | 2.00 | 0.00 | 0.00 | 1.00 | 2.00 | 0.00 | 9.00 | 3.00 | 7.00 | 1.00 | 3.00 | 1.00 | 1.00 |
75% | 2.02e+07 | 32.00 | 1.00 | 4.00 | 6.00 | 4.00 | 1.00 | 1.00 | 2.00 | 3.00 | 7.00 | 16.00 | 6.00 | 12.00 | 2.00 | 5.00 | 2.00 | 3.00 |
max | 2.02e+07 | 54.00 | 15.00 | 20.00 | 31.00 | 23.00 | 7.00 | 10.00 | 11.00 | 6.00 | 46.00 | 73.00 | 25.00 | 47.00 | 12.00 | 23.00 | 24.00 | 32.00 |
By default the describe()
method will only show the summary for numeric variables. However, if the include="all"
parameter is set it will also show a summary for other types of variables.
="all") store.describe(include
Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994.00 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994.00 | 9994.00 | 9994.00 | 9994.00 |
unique | 5009 | 1237 | 1334 | 4 | 793 | 793 | 3 | 1 | 531 | 49 | NaN | 4 | 1862 | 3 | 17 | 1850 | NaN | NaN | NaN | NaN |
top | CA-2017-100111 | 9/5/2016 | 12/16/2015 | Standard Class | WB-21850 | William Brown | Consumer | United States | New York City | California | NaN | West | OFF-PA-10001970 | Office Supplies | Binders | Staple envelope | NaN | NaN | NaN | NaN |
freq | 14 | 38 | 35 | 5968 | 37 | 37 | 5191 | 9994 | 915 | 2001 | NaN | 3203 | 19 | 6026 | 1523 | 48 | NaN | NaN | NaN | NaN |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 55190.38 | NaN | NaN | NaN | NaN | NaN | 229.86 | 3.79 | 0.16 | 28.66 |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32063.69 | NaN | NaN | NaN | NaN | NaN | 623.25 | 2.23 | 0.21 | 234.26 |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1040.00 | NaN | NaN | NaN | NaN | NaN | 0.44 | 1.00 | 0.00 | -6599.98 |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23223.00 | NaN | NaN | NaN | NaN | NaN | 17.28 | 2.00 | 0.00 | 1.73 |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 56430.50 | NaN | NaN | NaN | NaN | NaN | 54.49 | 3.00 | 0.20 | 8.67 |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 90008.00 | NaN | NaN | NaN | NaN | NaN | 209.94 | 5.00 | 0.20 | 29.36 |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 99301.00 | NaN | NaN | NaN | NaN | NaN | 22638.48 | 14.00 | 0.80 | 8399.98 |
The describe()
method can also be used on a Series
object.
boxscores['pts'].describe()
count 26498.00
mean 10.60
std 8.99
min 0.00
25% 3.00
50% 9.00
75% 16.00
max 73.00
Name: pts, dtype: float64
1.5.4 The value_counts()
method
The value_counts()
method returns the counts of all the unique values, sorted from most frequent to least frequent, as a pandas Series object.
store['Ship Mode'].value_counts()
Ship Mode
Standard Class 5968
Second Class 1945
First Class 1538
Same Day 543
Name: count, dtype: int64
The value_counts()
method can even be used for numeric variables. The code below uses value_counts()
to count how many player-games are in the boxscores data for each unique number of rebounds. The code uses something called method chaining to add a head()
method to limit how many rows are displayed. Method chaining uses the dot (.
) operator. The operations execute from left to right, and the object returned by each operation becomes the object on which the subsequent operation is called.
So, in the code below we start with the reb
column in the boxscores
DataFrame. We apply value_counts()
to it to return a Series of counts for each number of rebounds. Then, to that Series the head()
method is applied, so that only the first 5 elements in the Series are displayed. Chaining things together like this is very common when working with pandas. To understand and use chained methods you need to get into the habit of thinking about each operation and what it returns, then thinking about what other operations can be applied to the object returned by that operation.
boxscores['reb'].value_counts().head()
reb
2 3674
1 3594
3 3425
0 3258
4 2980
Name: count, dtype: int64
1.5.5 The unique()
method
The unique()
method returns the unique values of a Series
. The code below returns all the unique values found in the Region
column of the store
DataFrame, as a numpy array.
store['Region'].unique()
array(['South', 'West', 'Central', 'East'], dtype=object)
1.5.6 The nunique()
method
The nunique()
method returns the number of unique values in a Series. From the code below we can see that there are 4 unique values in the Region
column in the store
DataFrame.
store['Region'].nunique()
4
1.6 Methods to calculate summary statistics
Take a look at the descriptive statistics section of the pandas user guide (https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics) to see some examples and a deeper explanation. These methods can be used on DataFrames or Series objects.
Some commonly-used methods include the following:
- count() (counts the number of non-missing observations)
- sum()
- mean()
- median()
- min()
- max()
- std()
Note that these methods can take an axis
parameter to indicate which axis they work across. If axis = 0
it means they work across the rows and thus apply to the columns. If axis = 1
it means they work across the columns and thus apply to the rows.
Let’s create a simple DataFrame of quiz scores and take a look. Note that we are creating the quiz_df
DataFrame from a python dict with the pandas DataFrame
function, as before, but we are using method chaining to direct the output of the DataFrame
function directly to the set_index
method with which we specify that the name
column should be used as the DataFrame’s row index.
quizzes = {
    'name': ["Hannah", "Sam", "Anjali", "Erin", "Latasha"],
    'quiz_1': [89, 74, 79, 90, 95],
    'quiz_2': [73, 75, 78, 88, 86],
    'quiz_3': [92, 88, 85, 95, 100],
    'quiz_4': [100, 90, 88, 92, 98]
}

quiz_df = pd.DataFrame(quizzes).set_index("name")
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Hannah | 89 | 73 | 92 | 100 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
Calculate the standard deviation of the scores for quiz 2.
quiz_df['quiz_2'].std()
6.670832032063167
To calculate the averages per student we need to calculate the means for the rows. Each row goes across the columns, which are axis 1, so we use the axis = 1
parameter to specify that we want the means calculated across the columns.
quiz_df.mean(axis = 1)
name
Hannah 88.50
Sam 81.75
Anjali 82.50
Erin 91.25
Latasha 94.75
dtype: float64
To calculate the averages per quiz we need to calculate the means for the columns, which go across the rows, which are axis 0. So, we specify axis = 0
to calculate the means across the rows.
quiz_df.mean(axis = 0)
quiz_1 85.4
quiz_2 80.0
quiz_3 92.0
quiz_4 93.6
dtype: float64
1.7 Sorting with the sort_values()
method
This method can apply to a Series or a DataFrame. If applied to a DataFrame
the by
parameter indicates what to sort on (typically a column name or a list of column names). Both take an ascending
parameter that defaults to True. To sort in descending order you need to set ascending = False
.
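When sort_values() is applied to a Series there is no by parameter to supply. As a quick sketch, here is how we could sort a single quiz column of the quiz_df DataFrame created earlier, in descending order:
# Sorting a single column (a Series) in descending order
quiz_df['quiz_2'].sort_values(ascending = False)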
Let’s take another look at the quiz_df
DataFrame, but sort it by the quiz_1
score, in descending order.
quiz_df.sort_values(by = 'quiz_1', ascending = False)
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Latasha | 95 | 86 | 100 | 98 |
Erin | 90 | 88 | 95 | 92 |
Hannah | 89 | 73 | 92 | 100 |
Anjali | 79 | 78 | 85 | 88 |
Sam | 74 | 75 | 88 | 90 |
Note that sort_values()
is another one of those DataFrame methods that returns an altered copy of the DataFrame instead of altering it in place. So, the code above showed us what quiz_df
looks like if sorted in descending order of the score on the first quiz, but didn’t actually change the quiz_df
DataFrame. To actually change the quiz_df
DataFrame we would need to use the inplace = True
parameter in the sort_values()
method or assign the result of the sort back to quiz_df
.
When we display quiz_df
below we can see that it is not sorted.
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Hannah | 89 | 73 | 92 | 100 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
Let’s do the sort again, this time setting the inplace = True
parameter. This time the quiz_df
DataFrame is actually changed.
quiz_df.sort_values(by = 'quiz_1', ascending = False, inplace = True)
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Latasha | 95 | 86 | 100 | 98 |
Erin | 90 | 88 | 95 | 92 |
Hannah | 89 | 73 | 92 | 100 |
Anjali | 79 | 78 | 85 | 88 |
Sam | 74 | 75 | 88 | 90 |
Below we sort the boxscores
DataFrame by position
and pts
in descending order, display only the player
, position
, and pts
columns, and show the first 10 rows. We use a list to indicate the sort columns. Note that we use a list for the sort order as well, to show what sort order to use for each of the columns. Since position
lines up with True
, position
is sorted in ascending order, and ‘C’ comes first. pts
, however, are sorted in descending order within each position because pts
is the second column in the sort order list and False
is the second Boolean in the list for the ascending
parameter.
(boxscores
 .sort_values(by = ['position', 'pts'], ascending = [True, False])
 .loc[:, ['player', 'position', 'pts']]
 .head(10)
)
player | position | pts | |
---|---|---|---|
13726 | J. Embiid | C | 70 |
13768 | K. Towns | C | 62 |
8542 | J. Embiid | C | 51 |
6396 | J. Embiid | C | 50 |
2072 | J. Embiid | C | 48 |
19829 | A. Sengun | C | 45 |
7832 | J. Embiid | C | 42 |
13588 | N. Jokic | C | 42 |
24234 | N. Jokic | C | 42 |
7277 | J. Embiid | C | 41 |
For most of the examples in this textbook we will use a method-chaining approach, as in the code example above. Sometimes, though, you may want to break the operations into steps, particularly when you want to reuse one of the intermediate results. The code below shows what it would look like if we broke the example above into two steps, naming the intermediate result sorted_pos_pts
.
sorted_pos_pts = boxscores.sort_values(by = ['position', 'pts'],
                                       ascending = [True, False])
sorted_pos_pts[['player', 'position', 'pts']].head(10)
player | position | pts | |
---|---|---|---|
13726 | J. Embiid | C | 70 |
13768 | K. Towns | C | 62 |
8542 | J. Embiid | C | 51 |
6396 | J. Embiid | C | 50 |
2072 | J. Embiid | C | 48 |
19829 | A. Sengun | C | 45 |
7832 | J. Embiid | C | 42 |
13588 | N. Jokic | C | 42 |
24234 | N. Jokic | C | 42 |
7277 | J. Embiid | C | 41 |
1.8 Removing duplicate rows from a DataFrame
The drop_duplicates()
method by default returns a copy of a DataFrame with the duplicate rows removed. To remove duplicates in place set the inplace = True
parameter. You can use the subset
parameter to indicate the column(s) to use to determine whether a row is a duplicate or not. By default all columns are included.
The code cell below creates a new version of the quiz_df
DataFrame that has duplicate rows.
quiz_df_with_dupes = quiz_df.sample(10, replace = True)
quiz_df_with_dupes
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Sam | 74 | 75 | 88 | 90 |
Latasha | 95 | 86 | 100 | 98 |
Anjali | 79 | 78 | 85 | 88 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Anjali | 79 | 78 | 85 | 88 |
Hannah | 89 | 73 | 92 | 100 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
Erin | 90 | 88 | 95 | 92 |
Let’s now use drop_duplicates()
to see what the quiz_df_with_dupes
DataFrame looks like with duplicate rows removed.
quiz_df_with_dupes.drop_duplicates()
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Sam | 74 | 75 | 88 | 90 |
Latasha | 95 | 86 | 100 | 98 |
Anjali | 79 | 78 | 85 | 88 |
Hannah | 89 | 73 | 92 | 100 |
Erin | 90 | 88 | 95 | 92 |
Let’s take a look at another example. In the boxscores
DataFrame each row represents a player-game, so we have multiple rows for each player. What if we wanted to see all the unique player-team combinations for the season? If we just select the player
and team
columns the result has lots of duplicates.
(boxscores[['player', 'team']]
 .sort_values(by = ['player', 'team'])
 .head(12)
)
player | team | |
---|---|---|
2888 | A. Bailey | CHA |
7856 | A. Bailey | CHA |
16641 | A. Bailey | CHA |
21311 | A. Bailey | CHA |
21927 | A. Bailey | CHA |
22577 | A. Bailey | CHA |
22839 | A. Bailey | CHA |
23483 | A. Bailey | CHA |
25894 | A. Bailey | CHA |
26178 | A. Bailey | CHA |
136 | A. Black | ORL |
1177 | A. Black | ORL |
We can use drop_duplicates()
to reduce it to one row per unique player-team combination.
(boxscores[['player', 'team']]
 .drop_duplicates()
 .sort_values(by = ['player', 'team'])
 .head(12)
)
player | team | |
---|---|---|
2888 | A. Bailey | CHA |
136 | A. Black | ORL |
1294 | A. Brooks | BKN |
165 | A. Burks | DET |
16867 | A. Burks | NY |
219 | A. Caruso | CHI |
312 | A. Coffey | LAC |
0 | A. Davis | LAL |
220 | A. Dosunmu | CHI |
217 | A. Drummond | CHI |
181 | A. Edwards | MIN |
18060 | A. Flagler | OKC |
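Note that we could keep all of the DataFrame's columns while deduplicating on just those two columns by using the subset parameter described above. A sketch:
# Keep the first row for each unique player-team combination,
# retaining all of the columns
boxscores.drop_duplicates(subset = ['player', 'team']).head()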
1.9 Selecting a subset of a pandas DataFrame
Selecting a subset of data according to some criteria is a very common data operation. For example, a business analyst might want to see the sales for one family of products in one region of the country. A sports analyst might want to look at stats for a particular player or set of players. This section demonstrates several different ways of extracting subsets from pandas DataFrames. The first two methods we cover, selecting by label with loc[]
and selecting by position with iloc[]
are fundamental, so we need to cover them, but they are neither as interesting nor as frequently used as the third method we will present - using a Boolean array as a selector. So, as you read through the first two subsections below, know that there is more interesting material ahead!
loc[]
can also use a callable, such as a function, as a selector, but that is a more advanced usage, and it is only helpful in specific circumstances. We haven’t yet covered the code skills we need to create those circumstances, so we will hold off on covering the use of loc[]
with a callable until the next chapter.
1.9.1 Selecting by label with loc[]
loc[]
is a method used to access a selection from a DataFrame by index (row label) and column name (column label, optional). It may also be used with Boolean arrays or callables (functions) as the selectors, as we will see below. loc[]
may be used to access values in a DataFrame or to set values in a DataFrame.
In the code below we select the first four rows (rows with index 0 to 3). Note that, unlike base python slices, when slicing by label the last label is included in the slice.
# Note that both endpoints are included if you select a slice of index values
boxscores.loc[0:3]
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20231024 | LAL | DEN | A. Davis | PF | 34 | 1 | 7 | 8 | 4 | 0 | 2 | 2 | 3 | -17 | 17 | 6 | 17 | 1 | 2 | 4 | 4 |
1 | 20231024 | LAL | DEN | T. Prince | PF | 30 | 1 | 2 | 3 | 1 | 0 | 1 | 1 | 0 | -14 | 18 | 6 | 8 | 4 | 6 | 2 | 2 |
2 | 20231024 | LAL | DEN | L. James | SF | 29 | 1 | 7 | 8 | 5 | 1 | 0 | 0 | 1 | 7 | 21 | 10 | 16 | 1 | 4 | 0 | 1 |
3 | 20231024 | LAL | DEN | D. Russell | PG | 36 | 0 | 4 | 4 | 7 | 1 | 0 | 3 | 3 | 1 | 11 | 4 | 12 | 2 | 5 | 1 | 2 |
In this next example we use loc[]
to select according to a slice of rows and two specific columns, indicated in the list ['player', 'position']
. The rows are indicated first, then there is a comma, and then the columns are indicated. If there is no comma included loc[]
accesses the specified rows for all the columns.
boxscores.loc[0:3, ['player', 'position']]
player | position | |
---|---|---|
0 | A. Davis | PF |
1 | T. Prince | PF |
2 | L. James | SF |
3 | D. Russell | PG |
In this example we access the first and fourth row, specifying them by index label with the list [0, 3]
.
boxscores.loc[[0, 3]]
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20231024 | LAL | DEN | A. Davis | PF | 34 | 1 | 7 | 8 | 4 | 0 | 2 | 2 | 3 | -17 | 17 | 6 | 17 | 1 | 2 | 4 | 4 |
3 | 20231024 | LAL | DEN | D. Russell | PG | 36 | 0 | 4 | 4 | 7 | 1 | 0 | 3 | 3 | 1 | 11 | 4 | 12 | 2 | 5 | 1 | 2 |
In the next example we use lists to specify both the row and the column labels for the selection.
boxscores.loc[[0, 3], ['player', 'pts']]
player | pts | |
---|---|---|
0 | A. Davis | 17 |
3 | D. Russell | 11 |
In this final example we use loc[]
to access two specific rows, and a slice of columns.
boxscores.loc[[0, 3], 'player':'reb']
player | position | min | oreb | dreb | reb | |
---|---|---|---|---|---|---|
0 | A. Davis | PF | 34 | 1 | 7 | 8 |
3 | D. Russell | PG | 36 | 0 | 4 | 4 |
1.9.2 Selecting by position with iloc[]
iloc[]
selects by row and column position (0-based). It isn’t very useful with the box score data, but could be useful with other types of data. Recall that when using loc[]
slices include both ends. With iloc[]
slices behave more like base python slices; the endpoints are not included.
In the first code example we select the second and third rows (indexes 1 and 2) and the third and fourth columns (indexes 2 and 3).
boxscores.iloc[1:3, 2:4]
opponent | player | |
---|---|---|
1 | DEN | T. Prince |
2 | DEN | L. James |
In the next example we use a list to select the row indexes and another list to select the columns.
boxscores.iloc[[0,3,49], [1,2,4]]
team | opponent | position | |
---|---|---|---|
0 | LAL | DEN | PF |
3 | LAL | DEN | PG |
49 | ATL | CHA | PF |
In the example below we select specific rows by their indexes, and then use a colon (:) in the column position to indicate that all columns are selected. A colon can also be used in the first position to indicate that all rows are selected.
boxscores.iloc[[1,5,7], :]
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 20231024 | LAL | DEN | T. Prince | PF | 30 | 1 | 2 | 3 | 1 | 0 | 1 | 1 | 0 | -14 | 18 | 6 | 8 | 4 | 6 | 2 | 2 |
5 | 20231024 | LAL | DEN | R. Hachimura | PF | 15 | 2 | 1 | 3 | 0 | 0 | 0 | 0 | 2 | -8 | 6 | 3 | 10 | 0 | 3 | 0 | 0 |
7 | 20231024 | LAL | DEN | M. Lewis | F | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1.9.3 Boolean Arrays as selectors with loc[]
This method of selecting a subset from a DataFrame is very powerful and frequently used. We can use a Boolean array to select rows and columns. A Boolean array is an array of True
or False
values. Where the array is True
the row or column is included and where the array is False
the row or column is not included. For example, let’s view all boxscore lines for player-games in which the player scored 60 or more points. First, take a look at the Boolean array we will use. This Boolean array is created by the inequality boxscores['pts'] >= 60
. It will have as many True
and False
values in total as there are values in the pts
column of the boxscores
DataFrame. For every row where pts
is greater than or equal to 60 the Boolean array will have a True
value. For every other row it will have a False
value.
boxscores['pts'] >= 60
0 False
1 False
2 False
3 False
4 False
...
26493 False
26494 False
26495 False
26496 False
26497 False
Name: pts, Length: 26498, dtype: bool
Next, we can use that Boolean array within loc[]
to display only the rows of the boxscores
DataFrame for which pts
is >= 60. Only the rows that are in the same positions (same row index) as the True
values in the Boolean array will be included in the subset. In addition to using the Boolean array to select the rows, we use a list to select a subset of the columns to be included. Finally, we use method chaining to tack on a sort at the end that will sort the result in descending order of points.
boxscores.loc[boxscores['pts'] >= 60, ['date', 'team', 'opponent', 'player', 'pts']].sort_values(by = 'pts', ascending = False)
date | team | opponent | player | pts | |
---|---|---|---|---|---|
14280 | 20240126 | DAL | ATL | L. Doncic | 73 |
13726 | 20240122 | PHI | SA | J. Embiid | 70 |
7396 | 20231213 | MIL | IND | G. Antetokounmpo | 64 |
13768 | 20240122 | MIN | CHA | K. Towns | 62 |
14327 | 20240126 | PHX | IND | D. Booker | 62 |
23635 | 20240329 | NY | SA | J. Brunson | 61 |
15660 | 20240203 | GS | ATL | Steph Curry | 60 |
You might have noticed that the code example above has a long line of code. For readability it is common practice in programming to try to keep code lines short. Ideally they should be 80 characters or less. However, recall that python is very picky about whitespace; line endings and indentation have meaning to the python interpreter. So, when building pandas code that includes a lot of method chaining one common practice to keep the lines shorter is to enclose the code within parentheses, since python ignores the whitespace for code within parentheses, brackets or braces. The code example below is the same code as the example above, but it is enclosed in parentheses to permit shorter lines, which results in more readable code.
(boxscores
 .loc[boxscores['pts'] >= 60, ['date', 'team', 'opponent', 'player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
)
date | team | opponent | player | pts | |
---|---|---|---|---|---|
14280 | 20240126 | DAL | ATL | L. Doncic | 73 |
13726 | 20240122 | PHI | SA | J. Embiid | 70 |
7396 | 20231213 | MIL | IND | G. Antetokounmpo | 64 |
13768 | 20240122 | MIN | CHA | K. Towns | 62 |
14327 | 20240126 | PHX | IND | D. Booker | 62 |
23635 | 20240329 | NY | SA | J. Brunson | 61 |
15660 | 20240203 | GS | ATL | Steph Curry | 60 |
A Note on Style: An alternative to method chaining is to create intermediate objects and perform operations on them step by step. For example, we could accomplish the same thing as we did in the code above with the following code. Which approach you prefer is a matter of style, and each has its advantages and disadvantages. I prefer the method-chaining approach because it is clear and concise, and because it doesn’t create names for the objects created in the intermediate steps. For example, we can see that in the code below we had to create a new name, gt_60
to refer to the intermediate DataFrame. If you need to use that intermediate DataFrame again it is great to give it a name like this, but if you don’t have a need to refer to it again, naming it creates something unneeded in your code, and introduces more opportunity for error. How does it create opportunity for error? As you may already have experienced, writing code is typically an iterative process. You write some code, try it out, fix errors, revise it, try it again, etc. If you have lots of intermediate names created in your code it is very easy to write code using the wrong intermediate names, or write code that uses names that you created in a step that has been removed in one of your revisions. If you refer to a name that has been removed in a subsequent revision your code will run just fine in the current session, but if you close your session and re-run the code there will be an error!
These observations are not just theory, by the way. I’ve spent thousands of hours helping students with code and grading student code, and I have seen many, many errors caused by losing track of intermediate names.
gt_60 = boxscores.loc[boxscores['pts'] >= 60,
                      ['date', 'team', 'opponent', 'player', 'pts']]
gt_60.sort_values(by = 'pts', ascending = False)
date | team | opponent | player | pts | |
---|---|---|---|---|---|
14280 | 20240126 | DAL | ATL | L. Doncic | 73 |
13726 | 20240122 | PHI | SA | J. Embiid | 70 |
7396 | 20231213 | MIL | IND | G. Antetokounmpo | 64 |
13768 | 20240122 | MIN | CHA | K. Towns | 62 |
14327 | 20240126 | PHX | IND | D. Booker | 62 |
23635 | 20240329 | NY | SA | J. Brunson | 61 |
15660 | 20240203 | GS | ATL | Steph Curry | 60 |
When creating the Boolean array we can combine various logical tests with logical and (&
) and logical or (|
). When we do so each criterion has to be enclosed in parentheses. Let’s take a look at statlines for players who scored at least 40 points AND had at least 15 rebounds. Since there are two criteria that both need to be true, we enclose each in parentheses and connect them with the &
character.
boxscores.loc[(boxscores['pts'] >= 40) & (boxscores['reb'] >= 15),
              ['date', 'player', 'pts', 'reb']]
date | player | pts | reb | |
---|---|---|---|---|
3932 | 20231118 | G. Antetokounmpo | 40 | 15 |
7832 | 20231216 | J. Embiid | 42 | 15 |
11342 | 20240106 | G. Antetokounmpo | 48 | 17 |
13726 | 20240122 | J. Embiid | 70 | 18 |
17488 | 20240214 | D. Avdija | 43 | 15 |
19086 | 20240229 | A. Davis | 40 | 15 |
19253 | 20240301 | G. Antetokounmpo | 46 | 16 |
19829 | 20240305 | A. Sengun | 45 | 16 |
23645 | 20240329 | V. Wembanyama | 40 | 20 |
24234 | 20240402 | N. Jokic | 42 | 16 |
1.9.4 Use of the query()
method to subset DataFrames
In addition to using Boolean arrays with loc[]
we can also subset DataFrames with the query()
method and strings, similar to SQL, that specify the conditions we want to apply. The strings can refer to either column names or index names. This can lead to code that is more compact and, arguably, easier to read and understand. Notice that with the query()
method we can use &
or and
for logical and and use |
or or
for logical or.
Let’s make a sample DataFrame named scores_df
to use to investigate the query()
method. Note that this sample DataFrame has some missing values. We will learn how to deal with those later in the chapter. The scores_df
DataFrame shows quiz scores for students in multiple sections of a class.
scores_df = pd.DataFrame(
    {
        'section': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A'],
        'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'q1': [99, 87, 55, 100, 75, None, 88, 90, 92, np.nan],
        'q2': [95, 90, 78, 94, 85, 90, 90, 89, 99, 100],
        'q3': [None, np.nan, None, 75, 93, None, 85, 88, 90, 92]
    }
)

scores_df
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
1 | B | 2 | 87.0 | 90 | NaN |
2 | A | 3 | 55.0 | 78 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
4 | B | 5 | 75.0 | 85 | 93.0 |
5 | B | 6 | NaN | 90 | NaN |
6 | A | 7 | 88.0 | 90 | 85.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
Next, let’s filter scores_df
to only show scores for students in section A who scored 90 or above on the first quiz (the q1
column). The first code example below accomplishes the filtering by using loc[]
with a Boolean array created by two criteria combined with an &
, as we learned to do earlier in this chapter.
scores_df.loc[(scores_df['section'] == 'A') & (scores_df['q1'] >= 90)]
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
This next code example accomplishes the same filtering, but uses query()
instead of loc[]
. The query string is text, so it must be enclosed in single quotes or double quotes. Since one of the criteria refers to the section, which is a string value, we need to use different delimiters for the query string and for the section value. In the code example below single quotes are used to delimit the query string, and double quotes are used around the section value. We could also have used double quotes for the query string and single quotes for the section value and it would work the same way.
'section == "A" and q1 >= 90') scores_df.query(
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
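As noted above, the query string can also use or (or |) for logical or. A sketch (assuming we want students who scored 95 or above on either of the first two quizzes):
# Students who scored 95 or above on quiz 1 or quiz 2
scores_df.query('q1 >= 95 or q2 >= 95')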
If the column names are not legal python identifiers (i.e. they have spaces in them or begin with numbers) we need to add backticks around them in the query string. Let’s change the column names of scores_df
so that they are not legal python identifiers, and then demonstrate how to refer to them with backticks.
scores_df.columns = ['section', 'student number', '1 quiz', '2 quiz', '3 quiz']
scores_df
section | student number | 1 quiz | 2 quiz | 3 quiz | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
1 | B | 2 | 87.0 | 90 | NaN |
2 | A | 3 | 55.0 | 78 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
4 | B | 5 | 75.0 | 85 | 93.0 |
5 | B | 6 | NaN | 90 | NaN |
6 | A | 7 | 88.0 | 90 | 85.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
We see above that the column names have been changed. The code below uses query()
to select the data for students with student number >= 5 and score on the third quiz >=90. Since student number
and 3 quiz
are both invalid python identifiers (the first contains a space and the second begins with a number and contains a space) we need to put backticks around them in the query string.
scores_df.query('`student number` >= 5 and `3 quiz` >= 90')
section | student number | 1 quiz | 2 quiz | 3 quiz | |
---|---|---|---|---|---|
4 | B | 5 | 75.0 | 85 | 93.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
You can refer to a variable within the query string by prefixing the variable name with @
. This gives you the flexibility to make your code more dynamic. For example, in the code below we first restore the column names for scores_df
to the original column names. Then we create a variable named honors_thresh
that contains the threshold score value for receiving honors. Finally, we use query()
to select all the data for students who scored at or above the threshold score on the first quiz.
# First, restore sane column names to the scores_df DataFrame
scores_df.columns = ['section', 'student', 'q1', 'q2', 'q3']

# Assign threshold score for honors to a variable
honors_thresh = 93

quiz_1_honors = scores_df.query('q1 >= @honors_thresh')
quiz_1_honors
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
Arithmetic operators can also be used within the query string.
In the code below we identify students who improved at least 10 points from quiz 1 to quiz 2.
scores_df.query('q2 >= q1 + 10')
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
2 | A | 3 | 55.0 | 78 | NaN |
4 | B | 5 | 75.0 | 85 | 93.0 |
1.9.4.1 The isin()
method
The isin()
method returns a Boolean array based on whether or not each of a Series’s values is in a specified collection of values. It is very useful for specifying multiple criteria without having to write many fully-stated criteria connected with logical or.
For example, let’s look at date
, player
, team
, and pts
for the top scoring performances by Lakers or Clippers, sorted in decreasing order of points scored. Only include rows for player-games in which 40 or more points were scored.
(boxscores
 .loc[(boxscores['team'].isin(['LAL', 'LAC'])) &
      (boxscores['pts'] >= 40), ['date', 'player', 'team', 'pts']]
 .sort_values(by = 'pts', ascending = False)
)
date | player | team | pts | |
---|---|---|---|---|
20318 | 20240308 | D. Russell | LAL | 44 |
6855 | 20231208 | K. Leonard | LAC | 41 |
11779 | 20240109 | A. Davis | LAL | 41 |
23829 | 20240331 | P. George | LAC | 41 |
9174 | 20231223 | L. James | LAL | 40 |
9312 | 20231225 | A. Davis | LAL | 40 |
19086 | 20240229 | A. Davis | LAL | 40 |
21558 | 20240316 | L. James | LAL | 40 |
23807 | 20240331 | L. James | LAL | 40 |
1.10 Operations on columns
In this section we will learn how to create new columns from existing columns, how to remove rows or columns, and how to rename columns.
1.10.1 Adding new columns by assigning to them
You can create a new column simply by assigning to it. For example, in the code example below we create new columns named tres_pct
, fg_pct
, and ft_pct
to represent percentage of three pointers made, percentage of shots made, and percentage of free throws made, respectively.
Operations are performed element-wise, so you don’t need to loop over rows. What does that mean? Take a look at the first line in the code below. What it means is that the value in each row of the new column tres_pct
(three point shooting percentage) is calculated by taking that row’s tres_m
(three pointers made) and dividing it by that row’s tres_a
(three pointers attempted).
boxscores['tres_pct'] = boxscores['tres_m']/boxscores['tres_a']
boxscores['fg_pct'] = boxscores['fg_m']/boxscores['fg_a']
boxscores['ft_pct'] = boxscores['ft_m']/boxscores['ft_a']
Assigning directly to a column that did not previously exist in the DataFrame adds that column to the DataFrame.
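We can confirm that the three new percentage columns were added with a quick look at a few rows (a sketch; note that where a player attempted zero shots the division produces a missing value, NaN):
# Confirm the new columns now exist in boxscores
boxscores[['player', 'tres_pct', 'fg_pct', 'ft_pct']].head()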
1.10.2 Adding new columns with assign()
We can also use the assign()
method to create a copy of the DataFrame with specified columns added. This is useful for method chaining. For example, below we calculate points per minute and view the player name, points per minute, points, rebounds, assists, steals, and minutes for the top 10 games by points per minute among player-games with at least 28 minutes played. Note that with this approach the new column ppm
is not permanently added to the boxscores
DataFrame.
(boxscores
 .assign(ppm = boxscores['pts'] / boxscores['min'])
 .loc[boxscores['min'] >= 28, ['player', 'ppm', 'pts', 'reb', 'ast', 'stl', 'min']]
 .sort_values(by = 'ppm', ascending = False)
 .head(10)
)
player | ppm | pts | reb | ast | stl | min | |
---|---|---|---|---|---|---|---|
13726 | J. Embiid | 1.89 | 70 | 18 | 5 | 1 | 37 |
7396 | G. Antetokounmpo | 1.73 | 64 | 14 | 3 | 4 | 37 |
14327 | D. Booker | 1.63 | 62 | 5 | 4 | 1 | 38 |
13768 | K. Towns | 1.63 | 62 | 8 | 2 | 0 | 38 |
14280 | L. Doncic | 1.62 | 73 | 10 | 7 | 1 | 45 |
2072 | J. Embiid | 1.55 | 48 | 11 | 6 | 1 | 31 |
2560 | G. Antetokounmpo | 1.50 | 54 | 12 | 3 | 0 | 36 |
24354 | M. Flynn | 1.47 | 50 | 6 | 5 | 3 | 34 |
15660 | Steph Curry | 1.46 | 60 | 6 | 4 | 0 | 41 |
7782 | J. Brunson | 1.43 | 50 | 6 | 9 | 5 | 35 |
1.10.3 Using drop()
to remove rows or columns from a DataFrame
With drop()
we specify the labels to drop and the axis to which the labels refer (0 for rows, 1 for columns). drop()
then returns a copy of the DataFrame with the specified rows and/or columns dropped. It does not affect the original DataFrame unless we assign the result back to it or set inplace = True
.
As an example, let’s drop the section
column from the scores_df
DataFrame.
scores_df.drop('section', axis = 1, inplace = True)
scores_df
student | q1 | q2 | q3 | |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
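drop() removes rows in the same way. A quick sketch, still using scores_df; the row labels here are the default integer index values, and neither line changes scores_df because the result is not assigned back and inplace is not set.

scores_df.drop([0, 1], axis = 0)        # a copy without the rows labeled 0 and 1
scores_df.drop(columns = ['q2'])        # the columns parameter is an alternative to axis = 1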
1.10.4 Renaming DataFrame columns
To rename columns in a DataFrame use the rename() method with the columns parameter set to a dict in the form {'old_name': 'new_name'}.
scores_df.rename(columns = {'q1': 'first_quiz',
                            'q2': 'second_quiz',
                            'q3': 'third_quiz'})
 | student | first_quiz | second_quiz | third_quiz |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
Note that your change won’t persist unless you set inplace = True or assign the result of the rename() method back to scores_df. When we display scores_df below we see that it still has the old names for the columns that represent the quiz scores.
scores_df
 | student | q1 | q2 | q3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
To rename columns we can also just assign new column labels over the old ones with the columns property. This operation doesn’t return a copy of the DataFrame; it changes the column labels in place.
scores_df.columns = ['student', 'quiz_1', 'quiz_2', 'quiz_3']

scores_df
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
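If you prefer to avoid inplace = True, the usual alternative is to assign the result of rename() back to a name. A sketch of that pattern, using a hypothetical new name first_quiz and assigning to a separate variable so that scores_df itself is unchanged:

renamed_df = scores_df.rename(columns = {'quiz_1': 'first_quiz'})
renamed_df.columns    # the copy has the new name; scores_df still has quiz_1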
1.11 Vectorized string methods with the str accessor
Series has a set of vectorized string processing methods. Vectorized means that they operate on each element of the Series individually and then return a Series of all the individual results. These methods also exclude missing/NA values automatically. They are accessed via the Series object’s str accessor and generally have names matching python’s built-in string methods. For more information on these string methods see the pandas string methods documentation.
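To see the automatic exclusion of missing values in isolation, here is a minimal sketch with a throwaway Series (the values are made up):

import pandas as pd
import numpy as np

names = pd.Series(['L. James', 'A. Davis', np.nan])
names.str.upper()    # the NaN passes through as NaN instead of raising an error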
As an example, let’s look at the top 10 scoring performances by players whose first initial is R. In the code below the .str after boxscores['player'] lets us access the string method startswith(), which we use to check for player values that start with the character 'R'. The code then uses method chaining to sort the result by pts descending and show the first 10 rows.
(boxscores
    .loc[boxscores['player'].str.startswith('R'), ['player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(10)
)
 | player | pts |
---|---|---|
11513 | R. Barrett | 37 |
17582 | R. Hachimura | 36 |
5369 | R. Jackson | 35 |
25943 | R. Barrett | 35 |
23312 | R. Hachimura | 32 |
25218 | R. Hachimura | 30 |
20036 | R. Hachimura | 29 |
13748 | R. Barrett | 29 |
24136 | R. Barrett | 28 |
7009 | R. Barrett | 27 |
Another useful string method that may be applied to a Series element-wise with the str accessor is contains(). With this method we can check whether a string contains a specific value. For example, the following code shows the unique player names for all players whose name includes the string 'Ham'.
boxscores.loc[boxscores['player'].str.contains('Ham'), 'player'].unique()
array(['R.J. Hampton'], dtype=object)
The contains() method has a parameter, case, that defaults to True, which makes the test case-sensitive. If we don’t want the test to be case-sensitive we can set case = False. The code example below is identical to the code example above, except that we have specified case = False. Many more names match when we remove the case sensitivity.
boxscores.loc[boxscores['player'].str.contains('Ham', case = False), 'player'].unique()
array(['C. Cunningham', 'M. Branham', 'M. Beauchamp', 'R.J. Hampton',
'Julian Champagnie', 'L. Shamet', 'D. Graham', 'Justin Champagnie'],
dtype=object)
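One caution: contains() interprets its pattern as a regular expression by default, so characters like . and | have special meaning (pass regex = False for a literal match). As a sketch of the regex behavior, the following uses alternation to match either of two substrings; the pattern is invented for illustration.

# names containing 'ham' or 'jam', ignoring case
boxscores.loc[boxscores['player'].str.contains('ham|jam', case = False), 'player'].unique()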
1.11.1 Extracting elements from lists or strings with str.get()
When a vectorized string method produces a Series of lists or strings, the str.get() method may be used to extract elements from those lists or strings by position.
For example, to extract the last names of the NBA players we could use str.split() to divide each name into a list of words and then use the str.get() method to extract the second item in each list.
boxscores['player'].str.split().str.get(1)
0 Davis
1 Prince
2 James
3 Russell
4 Reaves
...
26493 Len
26494 Mitchell
26495 Duarte
26496 Jones
26497 Jones
Name: player, Length: 26498, dtype: object
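Because str.get() uses python-style indexing, negative positions also work, which grabs the last word even when a name splits into more than two parts. A one-line sketch:

boxscores['player'].str.split().str.get(-1)    # the last element of each list of name parts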
1.12 Accessing parts of datetime fields with the dt accessor
Base python has a datetime type that is used for storing data that represents dates and times. Pandas has a dt accessor for accessing parts of the datetime objects in a Series; it is similar to the str accessor that we use to apply string methods to each item in a Series. We can use dt to access components of datetime objects, including the following:
- date - the date without the time values
- month - month as an integer (1 - 12)
- day - day as an integer (1 - 31)
- year - year as an integer
- isocalendar().week - week of the year as an integer
- weekday - day of week returned as an integer (0 = Monday; 6 = Sunday)
- day_name() - day of week returned as a string
Before we demonstrate some of these let’s first transform the date column in the boxscores DataFrame to a datetime type. We saw when we ran the info() method on the boxscores DataFrame previously that the date column is of type int64. The pandas to_datetime() function converts a Series of values to datetimes. It tries to infer the string format of the dates in the column it is converting; if it can’t infer the format you can indicate it with the format parameter.
boxscores['date'] = pd.to_datetime(boxscores['date'], format = "%Y%m%d")
After that transformation, when we run the info() method of the boxscores DataFrame we see that the date column is now of type datetime64[ns].
boxscores.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 26498 non-null datetime64[ns]
1 team 26498 non-null object
2 opponent 26498 non-null object
3 player 26498 non-null object
4 position 26498 non-null object
5 min 26498 non-null int64
6 oreb 26498 non-null int64
7 dreb 26498 non-null int64
8 reb 26498 non-null int64
9 ast 26498 non-null int64
10 stl 26498 non-null int64
11 blk 26498 non-null int64
12 to 26498 non-null int64
13 pf 26498 non-null int64
14 plus_m 26498 non-null int64
15 pts 26498 non-null int64
16 fg_m 26498 non-null int64
17 fg_a 26498 non-null int64
18 tres_m 26498 non-null int64
19 tres_a 26498 non-null int64
20 ft_m 26498 non-null int64
21 ft_a 26498 non-null int64
22 tres_pct 20559 non-null float64
23 fg_pct 24895 non-null float64
24 ft_pct 14232 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(17), object(4)
memory usage: 5.1+ MB
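One related note: if some values cannot be parsed, to_datetime() raises an error by default; its errors parameter can be set to 'coerce' to turn unparseable values into NaT (the datetime counterpart of NaN) instead. A small stand-alone sketch with made-up values:

pd.to_datetime(pd.Series(['20240308', 'not a date']), format = "%Y%m%d", errors = 'coerce')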
To see a bit of how the dt accessor works, let’s view 5 random rows of the date column and then use the dt accessor and one of the datetime methods to extract just the day of the week from the dates for those rows.
bs_selection = boxscores.sample(5)

bs_selection['date']
10392 2024-01-01
10465 2024-01-01
19810 2024-03-05
7437 2023-12-13
13121 2024-01-18
Name: date, dtype: datetime64[ns]
Now let’s use the dt accessor and the day_name() method to see which day of the week each of those dates falls on.
bs_selection['date'].dt.day_name()
10392 Monday
10465 Monday
19810 Tuesday
7437 Wednesday
13121 Thursday
Name: date, dtype: object
Let’s look at the top 10 scoring performances in October, the first month of the season.
(boxscores
    .loc[boxscores['date'].dt.month == 10, ['date', 'player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(10)
)
 | date | player | pts |
---|---|---|---|
575 | 2023-10-28 | Z. LaVine | 51 |
504 | 2023-10-27 | L. Doncic | 49 |
444 | 2023-10-27 | Donovan Mitchell | 43 |
1034 | 2023-10-30 | Steph Curry | 42 |
553 | 2023-10-27 | Steph Curry | 41 |
960 | 2023-10-30 | D. Murray | 41 |
563 | 2023-10-27 | D. Fox | 39 |
329 | 2023-10-26 | D. Lillard | 39 |
336 | 2023-10-26 | K. Durant | 39 |
871 | 2023-10-29 | D. Fox | 37 |
Next, let’s look at the top scoring performances that occurred on a Monday.
(boxscores
    .loc[boxscores['date'].dt.day_name() == 'Monday', ['date', 'player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(10)
)
 | date | player | pts |
---|---|---|---|
13726 | 2024-01-22 | J. Embiid | 70 |
13768 | 2024-01-22 | K. Towns | 62 |
24107 | 2024-04-01 | D. Booker | 52 |
9344 | 2023-12-25 | L. Doncic | 50 |
2072 | 2023-11-06 | J. Embiid | 48 |
14975 | 2024-01-29 | L. Doncic | 45 |
4191 | 2023-11-20 | J. Tatum | 45 |
2100 | 2023-11-06 | C. Thomas | 45 |
8209 | 2023-12-18 | C. Cunningham | 43 |
13801 | 2024-01-22 | K. Durant | 43 |
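The other dt components listed earlier work the same way in filters and can be combined with Boolean operators. A sketch combining year and weekday, assuming the converted boxscores DataFrame:

# top 5 scoring performances on weekend days (weekday 5 = Saturday, 6 = Sunday) in 2024
(boxscores
    .loc[(boxscores['date'].dt.year == 2024) & (boxscores['date'].dt.weekday >= 5),
         ['date', 'player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(5)
)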
1.13 Dealing with missing data
Pandas attempts to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default via the skipna = True parameter setting.
For numbers, pandas uses the floating-point value NaN (Not a Number) to represent missing data. The python special value None is also treated as missing data.
Let’s practice working with the scores_df DataFrame, which was created with some missing data.
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
1.13.1 Methods for identifying and working with missing values
The following methods are very useful for dealing with missing values:
- dropna() - Filters axis labels based on whether values for each label have missing data. Use the thresh parameter to set a threshold for how many values must be non-missing for the row or column to be kept. Specify rows or columns with the axis parameter.
- fillna() - Fills in missing data with a specified value or by using an interpolation method such as "ffill" or "bfill" (see the sketch after this list).
- isna() - Returns Boolean values, with True where values are missing. This can be combined with sum() to count missing values or mean() to calculate the proportion of missing values.
- notna() - Negation of isna(). Returns True for non-missing values and False for missing values.
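Here is the fillna() sketch promised above, assuming the scores_df DataFrame. Filling with the column’s own mean is a simple, common imputation, and the ffill() method carries the previous non-missing value forward; neither line below changes scores_df because the results are not assigned back.

scores_df['quiz_1'].fillna(scores_df['quiz_1'].mean())   # fill with the column mean
scores_df['quiz_1'].ffill()                              # carry the previous non-missing value forward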
In the code cell below we count the missing values in each column (the default) of the scores_df DataFrame. We make use of an interesting trick that works in many different programming languages. scores_df.isna() returns a DataFrame of Booleans (True and False values) with the same size and shape as the scores_df DataFrame. The True values occur in the positions of the DataFrame that have missing values. When a math operation such as the sum() method is applied to Booleans, the True values are treated as ones and the False values are treated as zeros. So summing the Booleans gives the count of True values, and because the True values only occur where there are missing values, we get a count of missing values.
scores_df.isna().sum()
student 0
quiz_1 2
quiz_2 0
quiz_3 4
dtype: int64
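You can see the Boolean-as-numbers behavior in isolation with a throwaway Series:

pd.Series([True, False, True, True]).sum()     # 3: each True counts as 1
pd.Series([True, False, True, True]).mean()    # 0.75: the proportion of True values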
If we want to count the missing values by row we can set axis = 1 within the sum() method to count the missing values across the columns.
scores_df.isna().sum(axis = 1)
0 1
1 1
2 1
3 0
4 0
5 2
6 0
7 0
8 0
9 1
dtype: int64
Next we calculate the proportion of missing values by column. When we apply the mean() method to Booleans, the True values are treated as ones and the False values as zeros. The mean() method takes the column means by default. When you take the mean of a set of zeros and ones you end up with the proportion of ones, and because the True values were treated as ones, that is also the proportion of missing values.
Just to illustrate: if we have the values 1, 1, 0, 0, 0 and we take the mean we get 2/5, or 0.40, which is also the proportion of ones in the values (2 ones out of 5 total values).
scores_df.isna().mean()
student 0.0
quiz_1 0.2
quiz_2 0.0
quiz_3 0.4
dtype: float64
We can calculate the proportion of missing values in the rows by setting the axis parameter to 1.
scores_df.isna().mean(axis = 1)
0 0.25
1 0.25
2 0.25
3 0.00
4 0.00
5 0.50
6 0.00
7 0.00
8 0.00
9 0.25
dtype: float64
If we set the axis parameter to None we calculate the proportion of missing values in the entire DataFrame.
scores_df.isna().mean(axis = None)
0.15
We can also use the Boolean Series returned by isna() within loc[] to select rows based on missing values. The code below selects just the rows that have missing values for quiz 1.
scores_df.loc[scores_df['quiz_1'].isna()]
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
5 | 6 | NaN | 90 | NaN |
9 | 10 | NaN | 100 | 92.0 |
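notna() gives the complementary selection. A one-line sketch:

scores_df.loc[scores_df['quiz_1'].notna()]    # keep only the rows that have a quiz 1 score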
The dropna() method can be used to drop rows or columns that have missing values. By default the axis parameter is set to 0, so rows with missing values are dropped.
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
scores_df.dropna()
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
If we set axis = 1, columns with missing values are dropped. Now we only see the columns for student and quiz 2; quiz 1 and quiz 3 both have some missing values.
scores_df.dropna(axis = 1)
 | student | quiz_2 |
---|---|---|
0 | 1 | 95 |
1 | 2 | 90 |
2 | 3 | 78 |
3 | 4 | 94 |
4 | 5 | 85 |
5 | 6 | 90 |
6 | 7 | 90 |
7 | 8 | 89 |
8 | 9 | 99 |
9 | 10 | 100 |
We can use the thresh parameter to specify how many values must be non-missing for the column or row to be kept. In the code below we drop columns that do not have at least 7 non-missing values. The result now retains quiz 1 as well as quiz 2.
scores_df.dropna(axis = 1, thresh = 7)
 | student | quiz_1 | quiz_2 |
---|---|---|---|
0 | 1 | 99.0 | 95 |
1 | 2 | 87.0 | 90 |
2 | 3 | 55.0 | 78 |
3 | 4 | 100.0 | 94 |
4 | 5 | 75.0 | 85 |
5 | 6 | NaN | 90 |
6 | 7 | 88.0 | 90 |
7 | 8 | 90.0 | 89 |
8 | 9 | 92.0 | 99 |
9 | 10 | NaN | 100 |
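dropna() also has a subset parameter that limits the missing-value check to particular labels, which is handy when only some columns matter. A sketch:

scores_df.dropna(subset = ['quiz_1'])    # drop only the rows that are missing a quiz 1 score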
1.13.2 Examples of how missing data affects various calculations
When doing mathematical calculations it is important to consider the effects of missing data. The pandas basic statistical methods ignore missing data by default, because their skipna parameter is set to True by default.
scores_df.loc[:, ['quiz_1', 'quiz_2', 'quiz_3']].mean() # skipna = True is set by default
quiz_1 85.75
quiz_2 91.00
quiz_3 87.17
dtype: float64
Skipping the missing values may lead to misleading summary statistics if there are a lot of missing values. For example, imagine taking the mean score for a quiz for which 20 students have scores and 30 students do not yet have scores. You would get a mean, but it would reflect the scores of only 40% of the students. If we set skipna = False the summary statistic is only calculated when there are no missing values. The code below only produces a mean score for quiz 2, because that is the only quiz for which none of the scores are missing.
scores_df.loc[:, ['quiz_1', 'quiz_2', 'quiz_3']].mean(skipna = False)
quiz_1 NaN
quiz_2 91.0
quiz_3 NaN
dtype: float64
Normal arithmetic operators, unlike the pandas summary methods, do not ignore missing values. The code below adds a new column called quiz_avg to the scores_df DataFrame that holds the calculated average quiz score for each student. Note that it only creates an average for students who have no missing quiz scores.
scores_df['quiz_avg'] = (scores_df['quiz_1'] + scores_df['quiz_2'] + scores_df['quiz_3']) / 3
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 | quiz_avg |
---|---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN | NaN |
1 | 2 | 87.0 | 90 | NaN | NaN |
2 | 3 | 55.0 | 78 | NaN | NaN |
3 | 4 | 100.0 | 94 | 75.0 | 89.67 |
4 | 5 | 75.0 | 85 | 93.0 | 84.33 |
5 | 6 | NaN | 90 | NaN | NaN |
6 | 7 | 88.0 | 90 | 85.0 | 87.67 |
7 | 8 | 90.0 | 89 | 88.0 | 89.00 |
8 | 9 | 92.0 | 99 | 90.0 | 93.67 |
9 | 10 | NaN | 100 | 92.0 | NaN |
To create an average quiz score for all students we could first use the fillna() method to replace the missing values with zeros and then create the new column.
# Here we fill the missing values with zeros first
scores_df.fillna(0, inplace = True)

scores_df['quiz_avg'] = (scores_df['quiz_1'] + scores_df['quiz_2'] + scores_df['quiz_3']) / 3
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 | quiz_avg |
---|---|---|---|---|---|
0 | 1 | 99.0 | 95 | 0.0 | 64.67 |
1 | 2 | 87.0 | 90 | 0.0 | 59.00 |
2 | 3 | 55.0 | 78 | 0.0 | 44.33 |
3 | 4 | 100.0 | 94 | 75.0 | 89.67 |
4 | 5 | 75.0 | 85 | 93.0 | 84.33 |
5 | 6 | 0.0 | 90 | 0.0 | 30.00 |
6 | 7 | 88.0 | 90 | 85.0 | 87.67 |
7 | 8 | 90.0 | 89 | 88.0 | 89.00 |
8 | 9 | 92.0 | 99 | 90.0 | 93.67 |
9 | 10 | 0.0 | 100 | 92.0 | 64.00 |
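As a point of comparison, the mean() method with axis = 1 averages across each row, and because skipna defaults to True it averages only the quizzes a student actually took rather than counting missing quizzes as zeros. Whether that or the fillna(0) approach is appropriate depends on how you want missing quizzes graded. A sketch:

scores_df[['quiz_1', 'quiz_2', 'quiz_3']].mean(axis = 1) # row means, skipping NaN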