1 Pandas Fundamentals
Pandas is a package that provides data structures and tools for practical, real-world data manipulation and analysis in python. It is designed to work with tabular data, that is, data structured with rows representing entities and columns representing variables related to those entities, like most spreadsheet data. Pandas is the most widely-used data manipulation package for python, and is a fundamental part of the business analyst’s toolset. In the Reddit thread linked below a user asks if pandas is used in industry and whether it is worthwhile to learn pandas. Take a look at the responses the user receives.
Reddit Thread: How much is Pandas used in the Industry? Is it worth learning?
This chapter introduces some fundamental features and tools of pandas. We will learn about the fundamental pandas data structures, the DataFrame and the Series, as well as tools for investigating the data in a DataFrame or Series and tools for filtering or subsetting data based on various criteria. This chapter also covers how to sort data, and introduces special methods for working with string data in columns and dates in columns.
It is important to recognize that the pandas skills and knowledge covered in this textbook are a useful and important subset, but we won’t be covering all of pandas, and we won’t present every feature of the pandas tools we do cover! To learn pandas really well you will need to continue to practice pandas skills frequently, and consult references to broaden your skillset from time to time. As you practice using the tools (pandas objects, properties and methods) presented in this textbook it is recommended that you take a look at their descriptions in the pandas documentation below to get a feel for all that they can do.
The pandas website has many good references to help you learn. You can consult the user guide or the API reference there. There is also a nice one-page pandas cheat sheet posted there.
If you like to learn from books I recommend the following book. A link is provided to the free edition, but it is also available on Amazon.com if you want a hard copy:
- Python for Data Analysis, 3rd Edition by Wes McKinney, the original author of pandas
If you prefer to learn from videos, there are also a number of highly regarded YouTube video series for learning pandas.
OK. Let’s start by importing the pandas package into our python session. When working with add-on packages in python you first need to install the package if it is not already installed, and then import the package or specific functions from the package into your python session to make the package and its functions available to your code. Pandas is typically imported with the following statement:
import pandas as pd
We will also import the numpy package in the standard manner to have access to numpy functions.
import numpy as np
1.1 Pandas data structures
Pandas has two primary data structures:
- Series (1-dimensional) - An array of data of a single data type, akin to a column of data in a typical tabular data structure
- DataFrame (2-dimensional) - Made up of multiple Series. Each Series can be a different data type. The DataFrame holds tabular data made up of rows and columns, like we might see in an Excel table or CSV file. Each of the columns is a Series object.
The DataFrame and Series data structures in pandas are implemented as objects. In the object-oriented programming paradigm objects can have properties, which are descriptive elements, and methods, which are functions that operate on the object or that the object can perform. You can think of properties as information about the object and methods as things the object can do. So, as you learn pandas you will be learning lots of different properties and methods that apply to Series or DataFrames. You will also learn pandas functions that can operate on Series or DataFrames.
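As a quick illustration of the difference, a property is accessed without parentheses, while a method is called with parentheses. The sketch below uses a small throwaway Series (the Series type is introduced formally in the next section):
# A small Series just for illustration
s = pd.Series([3, 1, 4, 1, 5])

# shape is a property: descriptive information, accessed without parentheses
s.shape

# mean() is a method: an action the object can perform, called with parentheses
s.mean()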
A pandas DataFrame can be created in many ways. Below we will see how to create a DataFrame from a python dict
object (“dictionary”) and from a CSV file. CSV is a common text representation of tabular data. CSV stands for “comma-separated values”. In a CSV file each row represents a row in a table, and the rows use commas to separate the columns within the rows.
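For example, the student data we will put into a DataFrame in the next section could be stored in a CSV file whose text looks like this (a small illustration; the first line holds the column names):
name,year,score,gender
Hannah,2,98,F
Sam,1,67,NB
Bill,1,75,M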
1.2 Creating a DataFrame
There are many different ways of creating a pandas DataFrame. We will focus on two commonly-used methods: creating a DataFrame from a python dict (“dictionary”) and creating a DataFrame from a comma-separated-values (CSV) text file.
1.2.1 Creating a DataFrame from a python dict
The code below defines a basic python dict, then creates and displays a pandas DataFrame called student_data
from that dict, using the pandas DataFrame()
function. Since the DataFrame()
function is supplied by pandas and is not a part of base python, we need to precede it with pd
. That is because when we imported the pandas package we imported it with the statement import pandas as pd
. The as pd
part establishes pd
as an alias for pandas, and in our code we should type pd
instead of pandas
when we want to access pandas functions.
The keys of the python dict become the column names in the new DataFrame, and the values of the python dict become the column values in the new DataFrame.
scores = {
    'name': ["Hannah", "Sam", "Bill", "Erin", "Latasha"],
    'year': [2, 1, 1, 4, 4],
    'score': [98, 67, 75, 88, 99],
    'gender': ["F", "NB", "M", "F", "F"]
}

student_data = pd.DataFrame(scores)
student_data
name | year | score | gender | |
---|---|---|---|---|
0 | Hannah | 2 | 98 | F |
1 | Sam | 1 | 67 | NB |
2 | Bill | 1 | 75 | M |
3 | Erin | 4 | 88 | F |
4 | Latasha | 4 | 99 | F |
Notice that pandas automatically adds a row index to the DataFrame, in this case the integers from 0 to 4. We can see the row index displayed on the left side of the DataFrame. There is also a column index created from the column names. The index
property of a DataFrame refers to the row index, and the columns
property of a DataFrame
refers to the column index.
Note, also, that row indexes and column indexes are special pandas data types. We will work with them more closely in later chapters.
The properties and methods of a DataFrame or Series object can be accessed with dot notation. The identifier (“name”) student_data
refers to a specific pandas DataFrame object, the one we created above. Because student_data
is an object of type DataFrame
it has the properties and methods shared by all objects of the DataFrame type. To access a property or method, use the DataFrame name followed by a dot (.
), followed by the name of the property or method. For example, the code below uses the index
property of the student_data
DataFrame to display information about the row index. We can see from the code output that the index is of type RangeIndex
that goes from 0 to 5 by 1. Similar to a python range, the stop value, 5, is not included.
student_data.index
RangeIndex(start=0, stop=5, step=1)
Similarly, we can use the columns
property to access information on the DataFrame’s column index.
student_data.columns
Index(['name', 'year', 'score', 'gender'], dtype='object')
We can also specify an index for a DataFrame using the set_index()
method. In the code below we specify that the index for the student_data
DataFrame should be the name
column.
student_data = student_data.set_index('name')
Notice in the code above that we had to assign the result of applying the set_index()
method on the student_data
DataFrame back to the student_data
object for the changes to persist. This is something to watch out for when using pandas: some methods or functions return a copy of the object with the operation applied, while other operations alter the object to which the operation is applied. Many of the functions or methods that return a copy of the object with the operation applied have an inplace
parameter that is by default set to False
. If you set this parameter to True
the operation will alter the object in place rather than returning a copy of the object with the operation applied.
Let’s re-create the student_data
DataFrame and investigate this concept. When we re-create the student_data
DataFrame from scratch with the code below the DataFrame is created using the default index of consecutive integers.
student_data = pd.DataFrame(scores)
student_data
name | year | score | gender | |
---|---|---|---|---|
0 | Hannah | 2 | 98 | F |
1 | Sam | 1 | 67 | NB |
2 | Bill | 1 | 75 | M |
3 | Erin | 4 | 88 | F |
4 | Latasha | 4 | 99 | F |
The code below uses the set_index()
method to change the student_data
DataFrame’s index to the name
column. The set_index()
method, though, is one of those methods that by default only returns a copy of the object with the method applied, rather than altering the object in place. So, here it creates and returns a copy of the student_data
DataFrame that has the name
column as the index, but that altered copy of the DataFrame is not assigned to anything, so when we display the student_data
DataFrame we see that it is unchanged; it still has the integer index.
# This will not actually change the student_data DataFrame's index!
student_data.set_index('name')
student_data
name | year | score | gender | |
---|---|---|---|---|
0 | Hannah | 2 | 98 | F |
1 | Sam | 1 | 67 | NB |
2 | Bill | 1 | 75 | M |
3 | Erin | 4 | 88 | F |
4 | Latasha | 4 | 99 | F |
To make the index change persistent we need to tell the set_index()
method to actually alter the DataFrame in place by setting the inplace
parameter to True
, as in the following code. Look at the left side of the DataFrame in the output and you will see that the index for the DataFrame is now the data that was in the name
column. Note that when a column becomes the index, like this, it is no longer a column and can no longer be accessed in the same way we normally access a DataFrame column.
# This WILL change the student_data DataFrame's index.
student_data.set_index('name', inplace=True)
student_data
year | score | gender | |
---|---|---|---|
name | |||
Hannah | 2 | 98 | F |
Sam | 1 | 67 | NB |
Bill | 1 | 75 | M |
Erin | 4 | 88 | F |
Latasha | 4 | 99 | F |
1.2.2 Creating a DataFrame from a CSV file
Creating a DataFrame from a CSV file is very useful, because organizations typically have a lot of data in spreadsheets, and spreadsheet applications like Microsoft Excel and LibreOffice Calc can easily save their data as a CSV file. Organizations that share free data also often use the CSV format to disseminate the data.
The code below reads a CSV file from a server on the Internet and then creates a new DataFrame object called boxscores
from the CSV file. This data is boxscore data from the 2023-24 NBA season. Each row of the data represents one player’s performance in one game.
# Here we read a csv file from the web
= "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
url = pd.read_csv(url) boxscores
Note that if the file is read successfully there is no feedback provided by pandas.
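If you want quick feedback that the read succeeded, one simple sanity check (a sketch; the shape property is covered later in this chapter) is to look at the new DataFrame's dimensions:
# Quick sanity check: number of rows and columns read from the CSV
boxscores.shape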
After reading the file into the DataFrame boxscores
we should investigate the data. The following section introduces many ways to get more familiar with the data in a DataFrame. These are important because, unlike in GUI applications such as Microsoft Excel, the data is not automatically displayed for us to see.
1.3 Investigating a DataFrame
1.3.1 The info()
method
The info()
method shows summary information about the DataFrame. From the output of the info() method we can see how many columns and rows the DataFrame has, the name and data type of each of the columns, the number of non-missing entries in each column, and the amount of memory used to store the DataFrame.
boxscores.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 26498 non-null int64
1 team 26498 non-null object
2 opponent 26498 non-null object
3 player 26498 non-null object
4 position 26498 non-null object
5 min 26498 non-null int64
6 oreb 26498 non-null int64
7 dreb 26498 non-null int64
8 reb 26498 non-null int64
9 ast 26498 non-null int64
10 stl 26498 non-null int64
11 blk 26498 non-null int64
12 to 26498 non-null int64
13 pf 26498 non-null int64
14 plus_m 26498 non-null int64
15 pts 26498 non-null int64
16 fg_m 26498 non-null int64
17 fg_a 26498 non-null int64
18 tres_m 26498 non-null int64
19 tres_a 26498 non-null int64
20 ft_m 26498 non-null int64
21 ft_a 26498 non-null int64
dtypes: int64(18), object(4)
memory usage: 4.4+ MB
1.3.2 The head(n)
and tail(n)
methods.
You can use these methods to view the first n
rows or the last n
rows of the DataFrame, respectively. n
defaults to 5.
boxscores.head()
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20231024 | LAL | DEN | A. Davis | PF | 34 | 1 | 7 | 8 | 4 | ... | 2 | 3 | -17 | 17 | 6 | 17 | 1 | 2 | 4 | 4 |
1 | 20231024 | LAL | DEN | T. Prince | PF | 30 | 1 | 2 | 3 | 1 | ... | 1 | 0 | -14 | 18 | 6 | 8 | 4 | 6 | 2 | 2 |
2 | 20231024 | LAL | DEN | L. James | SF | 29 | 1 | 7 | 8 | 5 | ... | 0 | 1 | 7 | 21 | 10 | 16 | 1 | 4 | 0 | 1 |
3 | 20231024 | LAL | DEN | D. Russell | PG | 36 | 0 | 4 | 4 | 7 | ... | 3 | 3 | 1 | 11 | 4 | 12 | 2 | 5 | 1 | 2 |
4 | 20231024 | LAL | DEN | A. Reaves | SG | 31 | 4 | 4 | 8 | 4 | ... | 2 | 2 | -14 | 14 | 4 | 11 | 1 | 2 | 5 | 7 |
5 rows × 22 columns
boxscores.tail(2)
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26496 | 20240414 | SAC | POR | M. Jones | G | 6 | 0 | 0 | 0 | 0 | ... | 1 | 2 | -8 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
26497 | 20240414 | SAC | POR | C. Jones | G | 4 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 8 | 3 | 3 | 0 | 0 | 2 | 2 |
2 rows × 22 columns
1.3.3 The sample(n)
method
The sample()
method returns n randomly-selected
rows of the DataFrame. This often gives a better feel for the data than head()
or tail()
, especially if the DataFrame has any inherent order to it.
boxscores.sample(3)
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22653 | 20240323 | PHX | SA | B. Beal | SG | 28 | 0 | 1 | 1 | 12 | ... | 1 | 1 | 15 | 13 | 6 | 10 | 1 | 3 | 0 | 0 |
8764 | 20231221 | MEM | IND | Z. Williams | F | 18 | 0 | 1 | 1 | 2 | ... | 2 | 2 | 6 | 16 | 6 | 9 | 3 | 4 | 1 | 2 |
26115 | 20240412 | UTAH | LAC | D. Bazley | F | 23 | 2 | 3 | 5 | 1 | ... | 2 | 1 | -1 | 12 | 5 | 7 | 1 | 2 | 1 | 1 |
3 rows × 22 columns
1.4 Setting options for the pandas environment
Notice that when we displayed subsets of the DataFrame with the head()
and sample()
methods there were too many columns for us to see, with the default settings for pandas. Some of the columns are left out of the display, and we see ellipses (...
) in the center of the DataFrame output instead. We can use the pandas get_option()
and set_option()
functions to set various environment options. There is also a reset_option()
function to reset an option or options to the default setting, and a describe_option()
function that can be used to show more information about all options available or more information about a specific option.
To view all the available options you can use the describe_option()
method with nothing specified in the parentheses. The code to run to see all the available options is shown below, but we will not run it here because it creates a long list of output!
pd.describe_option()
The option we want to set to control how many columns are displayed is display.max_columns
. Let’s see the maximum number of columns being displayed by default and change the settings so that we can see all of the columns.
pd.get_option('display.max_columns')
20
We know from our investigation of the boxscores
DataFrame above that it has 22 columns. We may also add some new columns later on. So, let’s set the pandas display.max_columns
setting to 30 columns.
pd.set_option('display.max_columns', 30)
After changing that setting, when we display selected rows of the DataFrame we will now be able to see all the columns, because we have 22 columns of data, and the display.max_columns
setting is now set to 30.
boxscores.sample(3)
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22320 | 20240321 | NY | DEN | I. Hartenstein | C | 26 | 3 | 5 | 8 | 3 | 1 | 2 | 3 | 6 | -15 | 20 | 8 | 10 | 0 | 0 | 4 | 5 |
25152 | 20240407 | SA | PHI | V. Wembanyama | C | 43 | 3 | 15 | 18 | 6 | 0 | 7 | 9 | 1 | -1 | 33 | 11 | 20 | 5 | 10 | 6 | 7 |
19765 | 20240305 | MIA | DET | N. Jovic | F | 21 | 0 | 4 | 4 | 2 | 2 | 0 | 0 | 2 | 8 | 8 | 3 | 5 | 1 | 2 | 1 | 2 |
We can also set other options, such as the number of rows displayed and the number of digits to show after the decimal point.
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 2)
1.5 Other properties and methods for investigating data
Other useful properties and methods for investigating your data include shape
, columns
, describe()
, value_counts()
, unique()
, and nunique()
. Some of these methods and properties can be applied to both DataFrame and Series objects, and others apply only to one or the other.
1.5.1 The shape
property
The shape
property returns a tuple that indicates the number of rows and number of columns. This can be useful if you need to programmatically extract the number of rows and columns.
Before going into more details, let’s read in some new data from a CSV file into a new DataFrame named store
.
= pd.read_csv("https://neuronjolt.com/data/superstore.csv") store
Now, let’s use the shape
property of the store
DataFrame to return a tuple that indicates the number of rows and columns in the store
DataFrame.
store.shape
(9994, 20)
Because we know the format of the tuple returned by the shape
property we can extract parts of it with our code. The code below uses indexing to extract the first element of the tuple (the element at index 0, which is the number of rows in the store
DataFrame), store it in a variable named num_rows
, and include that variable in a string that is printed to output.
num_rows = store.shape[0]
print(f"The store DataFrame has {num_rows} rows.")
The store DataFrame has 9994 rows.
1.5.2 Accessing individual columns with []
Recall that the columns
property returns the Index
object with the column names.
boxscores.columns
Index(['date', 'team', 'opponent', 'player', 'position', 'min', 'oreb', 'dreb',
'reb', 'ast', 'stl', 'blk', 'to', 'pf', 'plus_m', 'pts', 'fg_m', 'fg_a',
'tres_m', 'tres_a', 'ft_m', 'ft_a'],
dtype='object')
Individual columns or lists of columns can be accessed through square brackets, similar to values in a python dict. Notice that when one column is indicated in the square brackets a pandas Series is returned. When a list of columns is indicated in the square brackets, even if there is only one column in the list, a pandas DataFrame is returned. This is a common pattern in pandas.
In the code cell below one column is indicated, so a Series is returned.
boxscores['player'].head()
0 A. Davis
1 T. Prince
2 L. James
3 D. Russell
4 A. Reaves
Name: player, dtype: object
In the code cell below we use the type()
function to verify that a Series is returned. The type()
function returns the type or class of its argument.
type(boxscores['player'].head())
pandas.core.series.Series
In the next two examples a list of columns is indicated in the square brackets, so a DataFrame is returned.
boxscores[['player']].head()
player | |
---|---|
0 | A. Davis |
1 | T. Prince |
2 | L. James |
3 | D. Russell |
4 | A. Reaves |
type(boxscores[['player']].head())
pandas.core.frame.DataFrame
boxscores[['player', 'position']].head()
player | position | |
---|---|---|
0 | A. Davis | PF |
1 | T. Prince | PF |
2 | L. James | SF |
3 | D. Russell | PG |
4 | A. Reaves | SG |
type(boxscores[['player', 'position']].head())
pandas.core.frame.DataFrame
1.5.3 The describe()
method
The describe()
method gives summary information about each column (Series) in the DataFrame, such as the number of non-null values, the five-number summary, the mean, and the standard deviation.
boxscores.describe()
date | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.65e+04 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 | 26498.00 |
mean | 2.02e+07 | 22.41 | 0.98 | 3.06 | 4.04 | 2.48 | 0.69 | 0.48 | 1.20 | 1.74 | 0.00 | 10.60 | 3.91 | 8.25 | 1.19 | 3.26 | 1.58 | 2.02 |
std | 4.44e+03 | 11.17 | 1.35 | 2.72 | 3.47 | 2.66 | 0.94 | 0.86 | 1.37 | 1.45 | 11.57 | 8.99 | 3.35 | 6.19 | 1.49 | 3.06 | 2.30 | 2.75 |
min | 2.02e+07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -58.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 2.02e+07 | 14.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | -7.00 | 3.00 | 1.00 | 3.00 | 0.00 | 1.00 | 0.00 | 0.00 |
50% | 2.02e+07 | 23.00 | 1.00 | 2.00 | 3.00 | 2.00 | 0.00 | 0.00 | 1.00 | 2.00 | 0.00 | 9.00 | 3.00 | 7.00 | 1.00 | 3.00 | 1.00 | 1.00 |
75% | 2.02e+07 | 32.00 | 1.00 | 4.00 | 6.00 | 4.00 | 1.00 | 1.00 | 2.00 | 3.00 | 7.00 | 16.00 | 6.00 | 12.00 | 2.00 | 5.00 | 2.00 | 3.00 |
max | 2.02e+07 | 54.00 | 15.00 | 20.00 | 31.00 | 23.00 | 7.00 | 10.00 | 11.00 | 6.00 | 46.00 | 73.00 | 25.00 | 47.00 | 12.00 | 23.00 | 24.00 | 32.00 |
By default the describe()
method will only show the summary for numeric variables. However, if the include="all"
parameter is set it will also show a summary for other types of variables.
="all") store.describe(include
Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994.00 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994.00 | 9994.00 | 9994.00 | 9994.00 |
unique | 5009 | 1237 | 1334 | 4 | 793 | 793 | 3 | 1 | 531 | 49 | NaN | 4 | 1862 | 3 | 17 | 1850 | NaN | NaN | NaN | NaN |
top | CA-2017-100111 | 9/5/2016 | 12/16/2015 | Standard Class | WB-21850 | William Brown | Consumer | United States | New York City | California | NaN | West | OFF-PA-10001970 | Office Supplies | Binders | Staple envelope | NaN | NaN | NaN | NaN |
freq | 14 | 38 | 35 | 5968 | 37 | 37 | 5191 | 9994 | 915 | 2001 | NaN | 3203 | 19 | 6026 | 1523 | 48 | NaN | NaN | NaN | NaN |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 55190.38 | NaN | NaN | NaN | NaN | NaN | 229.86 | 3.79 | 0.16 | 28.66 |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32063.69 | NaN | NaN | NaN | NaN | NaN | 623.25 | 2.23 | 0.21 | 234.26 |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1040.00 | NaN | NaN | NaN | NaN | NaN | 0.44 | 1.00 | 0.00 | -6599.98 |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23223.00 | NaN | NaN | NaN | NaN | NaN | 17.28 | 2.00 | 0.00 | 1.73 |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 56430.50 | NaN | NaN | NaN | NaN | NaN | 54.49 | 3.00 | 0.20 | 8.67 |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 90008.00 | NaN | NaN | NaN | NaN | NaN | 209.94 | 5.00 | 0.20 | 29.36 |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 99301.00 | NaN | NaN | NaN | NaN | NaN | 22638.48 | 14.00 | 0.80 | 8399.98 |
The describe()
method can also be used on a Series
object.
boxscores['pts'].describe()
count 26498.00
mean 10.60
std 8.99
min 0.00
25% 3.00
50% 9.00
75% 16.00
max 73.00
Name: pts, dtype: float64
1.5.4 The value_counts()
method
The value_counts()
method returns the counts of all the unique values, sorted from most frequent to least frequent, as a pandas Series object.
store['Ship Mode'].value_counts()
Ship Mode
Standard Class 5968
Second Class 1945
First Class 1538
Same Day 543
Name: count, dtype: int64
The value_counts()
method can even be used for numeric variables. The code below uses value_counts()
to count how many player-games are in the boxscores data for each unique number of rebounds. The code uses something called method chaining to add a head()
method to limit how many rows are displayed. Method chaining uses the dot (.
) operator. The operations execute from left to right, and the object returned by each operation becomes the object on which the subsequent operation is called.
So, in the code below we start with the reb
column in the boxscores
DataFrame. We apply value_counts()
to it to return a Series of counts for each number of rebounds. Then, to that Series the head()
method is applied, so that only the first 5 elements in the Series are displayed. Chaining things together like this is very common when working with pandas. To understand and use chained methods you need to get into the habit of thinking about each operation and what it returns, then thinking about what other operations can be applied to the object returned by that operation.
boxscores['reb'].value_counts().head()
reb
2 3674
1 3594
3 3425
0 3258
4 2980
Name: count, dtype: int64
1.5.5 The unique()
method
The unique()
method returns the unique values of a Series
. The code below returns all the unique values found in the Region
column of the store
DataFrame, as a numpy array.
store['Region'].unique()
array(['South', 'West', 'Central', 'East'], dtype=object)
1.5.6 The nunique()
method
The nunique()
method returns the number of unique values in a Series. From the code below we can see that there are 4 unique values in the Region
column in the store
DataFrame.
store['Region'].nunique()
4
1.6 Methods to calculate summary statistics
Take a look at the descriptive statistics section of the pandas user guide (https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics) to see some examples and a deeper explanation. These methods can be used on DataFrames or Series objects.
Some commonly-used methods include the following:
- count() (counts the number of non-missing observations)
- sum()
- mean()
- median()
- min()
- max()
- std()
Note that these methods can take an axis
parameter to indicate which axis they work across. If axis = 0
it means they work across the rows and thus apply to the columns. If axis = 1
it means they work across the columns and thus apply to the rows.
Let’s create a simple DataFrame of quiz scores and take a look. Note that we are creating the quiz_df
DataFrame from a python dict with the pandas DataFrame
function, as before, but we are using method chaining to direct the output of the DataFrame
function directly to the set_index
method with which we specify that the name
column should be used as the DataFrame’s row index.
quizzes = {
    'name': ["Hannah", "Sam", "Anjali", "Erin", "Latasha"],
    'quiz_1': [89, 74, 79, 90, 95],
    'quiz_2': [73, 75, 78, 88, 86],
    'quiz_3': [92, 88, 85, 95, 100],
    'quiz_4': [100, 90, 88, 92, 98]
}

quiz_df = pd.DataFrame(quizzes).set_index("name")
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Hannah | 89 | 73 | 92 | 100 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
Calculate the standard deviation of the scores for quiz 2.
quiz_df['quiz_2'].std()
6.670832032063167
To calculate the averages per student we need to calculate the means for the rows. Each row goes across the columns, which are axis 1, so we use the axis = 1
parameter to specify that we want the means calculated across the columns.
quiz_df.mean(axis = 1)
name
Hannah 88.50
Sam 81.75
Anjali 82.50
Erin 91.25
Latasha 94.75
dtype: float64
To calculate the averages per quiz we need to calculate the means for the columns, which go across the rows, which are axis 0. So, we specify axis = 0
to calculate the means across the rows.
quiz_df.mean(axis = 0)
quiz_1 85.4
quiz_2 80.0
quiz_3 92.0
quiz_4 93.6
dtype: float64
1.7 Sorting with the sort_values()
method
This method can apply to a Series or a DataFrame. If applied to a DataFrame
the by
parameter indicates what to sort on (typically a column name or a list of column names). Both take an ascending
parameter that defaults to True. To sort in descending order you need to set ascending = False
.
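When sort_values() is applied to a Series there is no by parameter to supply. As a quick sketch, here is how we could sort a single quiz column of the quiz_df DataFrame created earlier, in descending order:
# Sorting a single column (a Series) in descending order
quiz_df['quiz_2'].sort_values(ascending = False)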
Let’s take another look at the quiz_df
DataFrame, but sort it by the quiz_1
score, in descending order.
quiz_df.sort_values(by = 'quiz_1', ascending = False)
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Latasha | 95 | 86 | 100 | 98 |
Erin | 90 | 88 | 95 | 92 |
Hannah | 89 | 73 | 92 | 100 |
Anjali | 79 | 78 | 85 | 88 |
Sam | 74 | 75 | 88 | 90 |
Note that sort_values()
is another one of those DataFrame methods that returns an altered copy of the DataFrame instead of altering it in place. So, the code above showed us what quiz_df
looks like if sorted in descending order of the score on the first quiz, but didn’t actually change the quiz_df
DataFrame. To actually change the quiz_df
DataFrame we would need to use the inplace = True
parameter in the sort_values()
method or assign the result of the sort back to quiz_df
.
When we display quiz_df
below we can see that it is not sorted.
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Hannah | 89 | 73 | 92 | 100 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
Let’s do the sort again, this time setting the inplace = True
parameter. This time the quiz_df
DataFrame is actually changed.
quiz_df.sort_values(by = 'quiz_1', ascending = False, inplace = True)
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Latasha | 95 | 86 | 100 | 98 |
Erin | 90 | 88 | 95 | 92 |
Hannah | 89 | 73 | 92 | 100 |
Anjali | 79 | 78 | 85 | 88 |
Sam | 74 | 75 | 88 | 90 |
Below we sort the boxscores
DataFrame by position
and pts
in descending order, display only the player
, position
, and pts
columns, and show the first 10 rows. We use a list to indicate the sort columns. Note that we use a list for the sort order as well, to show what sort order to use for each of the columns. Since position
lines up with True
, position
is sorted in ascending order, and ‘C’ comes first. pts
, however, are sorted in descending order within each position because pts
is the second column in the sort order list and False
is the second Boolean in the list for the ascending
parameter.
(boxscores
 .sort_values(by = ['position', 'pts'], ascending = [True, False])
 .loc[:, ['player', 'position', 'pts']]
 .head(10)
)
player | position | pts | |
---|---|---|---|
13726 | J. Embiid | C | 70 |
13768 | K. Towns | C | 62 |
8542 | J. Embiid | C | 51 |
6396 | J. Embiid | C | 50 |
2072 | J. Embiid | C | 48 |
19829 | A. Sengun | C | 45 |
7832 | J. Embiid | C | 42 |
13588 | N. Jokic | C | 42 |
24234 | N. Jokic | C | 42 |
7277 | J. Embiid | C | 41 |
For most of the examples in this textbook we will use a method-chaining approach, as in the code example above. Sometimes, though, you may want to break the operations into steps, particularly when you want to reuse one of the intermediate results. The code below shows what it would look like if we broke the example above into two steps, naming the intermediate result sorted_pos_pts
.
sorted_pos_pts = boxscores.sort_values(by = ['position', 'pts'],
                                       ascending = [True, False])
sorted_pos_pts[['player', 'position', 'pts']].head(10)
player | position | pts | |
---|---|---|---|
13726 | J. Embiid | C | 70 |
13768 | K. Towns | C | 62 |
8542 | J. Embiid | C | 51 |
6396 | J. Embiid | C | 50 |
2072 | J. Embiid | C | 48 |
19829 | A. Sengun | C | 45 |
7832 | J. Embiid | C | 42 |
13588 | N. Jokic | C | 42 |
24234 | N. Jokic | C | 42 |
7277 | J. Embiid | C | 41 |
1.8 Removing duplicate rows from a DataFrame
The drop_duplicates()
method by default returns a copy of a DataFrame with the duplicate rows removed. To remove duplicates in place set the inplace = True
parameter. You can use the subset
parameter to indicate the column(s) to use to determine whether a row is a duplicate or not. By default all columns are included.
The code cell below creates a new version of the quiz_df
DataFrame that has duplicate rows.
quiz_df_with_dupes = quiz_df.sample(10, replace = True)
quiz_df_with_dupes
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Sam | 74 | 75 | 88 | 90 |
Latasha | 95 | 86 | 100 | 98 |
Anjali | 79 | 78 | 85 | 88 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Anjali | 79 | 78 | 85 | 88 |
Hannah | 89 | 73 | 92 | 100 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
Erin | 90 | 88 | 95 | 92 |
Let’s now use drop_duplicates()
to see what the quiz_df_with_dupes
DataFrame looks like with duplicate rows removed.
quiz_df_with_dupes.drop_duplicates()
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Sam | 74 | 75 | 88 | 90 |
Latasha | 95 | 86 | 100 | 98 |
Anjali | 79 | 78 | 85 | 88 |
Hannah | 89 | 73 | 92 | 100 |
Erin | 90 | 88 | 95 | 92 |
Let’s take a look at another example. In the boxscores
DataFrame each row represents a player-game, so we have multiple rows for each player. What if we wanted to see all the unique player-team combinations for the season? If we just select the player
and team
columns the result has lots of duplicates.
(boxscores[['player', 'team']]
 .sort_values(by = ['player', 'team'])
 .head(12)
)
player | team | |
---|---|---|
2888 | A. Bailey | CHA |
7856 | A. Bailey | CHA |
16641 | A. Bailey | CHA |
21311 | A. Bailey | CHA |
21927 | A. Bailey | CHA |
22577 | A. Bailey | CHA |
22839 | A. Bailey | CHA |
23483 | A. Bailey | CHA |
25894 | A. Bailey | CHA |
26178 | A. Bailey | CHA |
136 | A. Black | ORL |
1177 | A. Black | ORL |
We can use drop_duplicates()
to reduce it to one row per unique player-team combination.
(boxscores[['player', 'team']]
 .drop_duplicates()
 .sort_values(by = ['player', 'team'])
 .head(12)
)
player | team | |
---|---|---|
2888 | A. Bailey | CHA |
136 | A. Black | ORL |
1294 | A. Brooks | BKN |
165 | A. Burks | DET |
16867 | A. Burks | NY |
219 | A. Caruso | CHI |
312 | A. Coffey | LAC |
0 | A. Davis | LAL |
220 | A. Dosunmu | CHI |
217 | A. Drummond | CHI |
181 | A. Edwards | MIN |
18060 | A. Flagler | OKC |
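Note that we could keep all of the DataFrame's columns while deduplicating on just those two columns by using the subset parameter described above. A sketch:
# Keep the first row for each unique player-team combination,
# retaining all of the columns
boxscores.drop_duplicates(subset = ['player', 'team']).head()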
1.9 Selecting a subset of a pandas DataFrame
Selecting a subset of data according to some criteria is a very common data operation. For example, a business analyst might want to see the sales for one family of products in one region of the country. A sports analyst might want to look at stats for a particular player or set of players. This section demonstrates several different ways of extracting subsets from pandas DataFrames. The first two methods we cover, selecting by label with loc[]
and selecting by position with iloc[]
are fundamental, so we need to cover them, but they are neither as interesting nor as frequently used as the third method we will present - using a Boolean array as a selector. So, as you read through the first two subsections below, know that there is more interesting material ahead!
loc[]
can also use a callable, such as a function, as a selector, but that is a more advanced usage, and it is only helpful in specific circumstances. We haven’t yet covered the code skills we need to create those circumstances, so we will hold off on covering the use of loc[]
with a callable until the next chapter.
1.9.1 Selecting by label with loc[]
loc[]
is a method used to access a selection from a DataFrame by index (row label) and column name (column label, optional). It may also be used with Boolean arrays or callables (functions) as the selectors, as we will see below. loc[]
may be used to access values in a DataFrame or to set values in a DataFrame.
In the code below we select the first four rows (rows with index 0 to 3). Note that, unlike base python slices, when slicing by label the last label is included in the slice.
# Note that both endpoints are included if you select a slice of index values
boxscores.loc[0:3]
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20231024 | LAL | DEN | A. Davis | PF | 34 | 1 | 7 | 8 | 4 | 0 | 2 | 2 | 3 | -17 | 17 | 6 | 17 | 1 | 2 | 4 | 4 |
1 | 20231024 | LAL | DEN | T. Prince | PF | 30 | 1 | 2 | 3 | 1 | 0 | 1 | 1 | 0 | -14 | 18 | 6 | 8 | 4 | 6 | 2 | 2 |
2 | 20231024 | LAL | DEN | L. James | SF | 29 | 1 | 7 | 8 | 5 | 1 | 0 | 0 | 1 | 7 | 21 | 10 | 16 | 1 | 4 | 0 | 1 |
3 | 20231024 | LAL | DEN | D. Russell | PG | 36 | 0 | 4 | 4 | 7 | 1 | 0 | 3 | 3 | 1 | 11 | 4 | 12 | 2 | 5 | 1 | 2 |
In this next example we use loc[]
to select according to a slice of rows and two specific columns, indicated in the list ['player', 'position']
. The rows are indicated first, then there is a comma, and then the columns are indicated. If there is no comma included loc[]
accesses the specified rows for all the columns.
boxscores.loc[0:3, ['player', 'position']]
player | position | |
---|---|---|
0 | A. Davis | PF |
1 | T. Prince | PF |
2 | L. James | SF |
3 | D. Russell | PG |
In this example we access the first and fourth row, specifying them by index label with the list [0, 3]
.
boxscores.loc[[0, 3]]
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20231024 | LAL | DEN | A. Davis | PF | 34 | 1 | 7 | 8 | 4 | 0 | 2 | 2 | 3 | -17 | 17 | 6 | 17 | 1 | 2 | 4 | 4 |
3 | 20231024 | LAL | DEN | D. Russell | PG | 36 | 0 | 4 | 4 | 7 | 1 | 0 | 3 | 3 | 1 | 11 | 4 | 12 | 2 | 5 | 1 | 2 |
In the next example we use lists to specify both the row and the column labels for the selection.
boxscores.loc[[0, 3], ['player', 'pts']]
player | pts | |
---|---|---|
0 | A. Davis | 17 |
3 | D. Russell | 11 |
In this final example we use loc[]
to access two specific rows, and a slice of columns.
boxscores.loc[[0, 3], 'player':'reb']
player | position | min | oreb | dreb | reb | |
---|---|---|---|---|---|---|
0 | A. Davis | PF | 34 | 1 | 7 | 8 |
3 | D. Russell | PG | 36 | 0 | 4 | 4 |
1.9.2 Selecting by position with iloc[]
iloc[]
selects by row and column position (0-based). It isn’t very useful with the box score data, but could be useful with other types of data. Recall that when using loc[]
slices include both ends. With iloc[]
slices behave more like base python slices; the endpoints are not included.
In the first code example we select the second and third rows (indexes 1 and 2) and the third and fourth columns (indexes 2 and 3).
boxscores.iloc[1:3, 2:4]
opponent | player | |
---|---|---|
1 | DEN | T. Prince |
2 | DEN | L. James |
In the next example we use a list to select the row indexes and another list to select the columns.
boxscores.iloc[[0,3,49], [1,2,4]]
team | opponent | position | |
---|---|---|---|
0 | LAL | DEN | PF |
3 | LAL | DEN | PG |
49 | ATL | CHA | PF |
In the example below we select specific rows by their indexes, and then use a colon (:) in the column position to indicate that all columns are selected. A colon can also be used in the first position to indicate that all rows are selected.
boxscores.iloc[[1,5,7], :]
date | team | opponent | player | position | min | oreb | dreb | reb | ast | stl | blk | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 20231024 | LAL | DEN | T. Prince | PF | 30 | 1 | 2 | 3 | 1 | 0 | 1 | 1 | 0 | -14 | 18 | 6 | 8 | 4 | 6 | 2 | 2 |
5 | 20231024 | LAL | DEN | R. Hachimura | PF | 15 | 2 | 1 | 3 | 0 | 0 | 0 | 0 | 2 | -8 | 6 | 3 | 10 | 0 | 3 | 0 | 0 |
7 | 20231024 | LAL | DEN | M. Lewis | F | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1.9.3 Boolean Arrays as selectors with loc[]
This method of selecting a subset from a DataFrame is very powerful and frequently used. We can use a Boolean array to select rows and columns. A Boolean array is an array of True
or False
values. Where the array is True
the row or column is included and where the array is False
the row or column is not included. For example, let’s view all boxscore lines for player-games in which the player scored 60 or more points. First, take a look at the Boolean array we will use. This Boolean array is created by the inequality boxscores['pts'] >= 60
. It will have as many True
and False
values in total as there are values in the pts
column of the boxscores
DataFrame. For every row where pts
is greater than or equal to 60 the Boolean array will have a True
value. For every other row it will have a False
value.
boxscores['pts'] >= 60
0 False
1 False
2 False
3 False
4 False
...
26493 False
26494 False
26495 False
26496 False
26497 False
Name: pts, Length: 26498, dtype: bool
Next, we can use that Boolean array within loc[]
to display only the rows of the boxscores
DataFrame for which pts
is >= 60. Only the rows that are in the same positions (same row index) as the True
values in the Boolean array will be included in the subset. In addition to using the Boolean array to select the rows, we use a list to select a subset of the columns to be included. Finally, we use method chaining to tack on a sort at the end that will sort the result in descending order of points.
boxscores.loc[boxscores['pts'] >= 60, ['date', 'team', 'opponent', 'player', 'pts']].sort_values(by = 'pts', ascending = False)
date | team | opponent | player | pts | |
---|---|---|---|---|---|
14280 | 20240126 | DAL | ATL | L. Doncic | 73 |
13726 | 20240122 | PHI | SA | J. Embiid | 70 |
7396 | 20231213 | MIL | IND | G. Antetokounmpo | 64 |
13768 | 20240122 | MIN | CHA | K. Towns | 62 |
14327 | 20240126 | PHX | IND | D. Booker | 62 |
23635 | 20240329 | NY | SA | J. Brunson | 61 |
15660 | 20240203 | GS | ATL | Steph Curry | 60 |
You might have noticed that the code example above has a long line of code. For readability it is common practice in programming to try to keep code lines short. Ideally they should be 80 characters or less. However, recall that python is very picky about whitespace; line endings and indentation have meaning to the python interpreter. So, when building pandas code that includes a lot of method chaining one common practice to keep the lines shorter is to enclose the code within parentheses, since python ignores the whitespace for code within parentheses, brackets or braces. The code example below is the same code as the example above, but it is enclosed in parentheses to permit shorter lines, which results in more readable code.
(boxscores
 .loc[boxscores['pts'] >= 60, ['date', 'team', 'opponent', 'player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
)
date | team | opponent | player | pts | |
---|---|---|---|---|---|
14280 | 20240126 | DAL | ATL | L. Doncic | 73 |
13726 | 20240122 | PHI | SA | J. Embiid | 70 |
7396 | 20231213 | MIL | IND | G. Antetokounmpo | 64 |
13768 | 20240122 | MIN | CHA | K. Towns | 62 |
14327 | 20240126 | PHX | IND | D. Booker | 62 |
23635 | 20240329 | NY | SA | J. Brunson | 61 |
15660 | 20240203 | GS | ATL | Steph Curry | 60 |
A Note on Style: An alternative to method chaining is to create intermediate objects and perform operations on them step by step. For example, we could accomplish the same thing as we did in the code above with the following code. Which approach you prefer is a matter of style, and each has its advantages and disadvantages. I prefer the method-chaining approach because it is clear and concise, and because it doesn’t create names for the objects created in the intermediate steps. For example, we can see that in the code below we had to create a new name, gt_60
to refer to the intermediate DataFrame. If you need to use that intermediate DataFrame again it is great to give it a name like this, but if you don’t have a need to refer to it again, naming it creates something unneeded in your code, and introduces more opportunity for error. How does it create opportunity for error? As you may already have experienced, writing code is typically an iterative process. You write some code, try it out, fix errors, revise it, try it again, etc. If you have lots of intermediate names created in your code it is very easy to write code using the wrong intermediate names, or write code that uses names that you created in a step that has been removed in one of your revisions. If you refer to a name that has been removed in a subsequent revision your code will run just fine in the current session, but if you close your session and re-run the code there will be an error!
These observations are not just theory, by the way. I’ve spent thousands of hours helping students with code and grading student code, and I have seen many, many errors caused by losing track of intermediate names.
gt_60 = boxscores.loc[boxscores['pts'] >= 60,
                      ['date', 'team', 'opponent', 'player', 'pts']]
gt_60.sort_values(by = 'pts', ascending = False)
date | team | opponent | player | pts | |
---|---|---|---|---|---|
14280 | 20240126 | DAL | ATL | L. Doncic | 73 |
13726 | 20240122 | PHI | SA | J. Embiid | 70 |
7396 | 20231213 | MIL | IND | G. Antetokounmpo | 64 |
13768 | 20240122 | MIN | CHA | K. Towns | 62 |
14327 | 20240126 | PHX | IND | D. Booker | 62 |
23635 | 20240329 | NY | SA | J. Brunson | 61 |
15660 | 20240203 | GS | ATL | Steph Curry | 60 |
When creating the Boolean array we can combine various logical tests with logical and (&
) and logical or (|
). When we do so each criterion has to be enclosed in parentheses. Let’s take a look at statlines for players who scored at least 40 points AND had at least 15 rebounds. Since there are two criteria that both need to be true, we enclose each in parentheses and connect them with the &
character.
boxscores.loc[(boxscores['pts'] >= 40) & (boxscores['reb'] >= 15),
              ['date', 'player', 'pts', 'reb']]
date | player | pts | reb | |
---|---|---|---|---|
3932 | 20231118 | G. Antetokounmpo | 40 | 15 |
7832 | 20231216 | J. Embiid | 42 | 15 |
11342 | 20240106 | G. Antetokounmpo | 48 | 17 |
13726 | 20240122 | J. Embiid | 70 | 18 |
17488 | 20240214 | D. Avdija | 43 | 15 |
19086 | 20240229 | A. Davis | 40 | 15 |
19253 | 20240301 | G. Antetokounmpo | 46 | 16 |
19829 | 20240305 | A. Sengun | 45 | 16 |
23645 | 20240329 | V. Wembanyama | 40 | 20 |
24234 | 20240402 | N. Jokic | 42 | 16 |
1.9.4 Use of the query()
method to subset DataFrames
In addition to using Boolean arrays with loc[]
we can also subset DataFrames with the query()
method and strings, similar to SQL, that specify the conditions we want to apply. The strings can refer to either column names or index names. This can lead to code that is more compact and, arguably, easier to read and understand. Notice that with the query()
method we can use &
or and
for logical and and use |
or or
for logical or.
Let’s make a sample DataFrame named scores_df
to use to investigate the query()
method. Note that this sample DataFrame has some missing values. We will learn how to deal with those later in the chapter. The scores_df
DataFrame shows quiz scores for students in multiple sections of a class.
scores_df = pd.DataFrame(
    {
        'section': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A'],
        'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'q1': [99, 87, 55, 100, 75, None, 88, 90, 92, np.nan],
        'q2': [95, 90, 78, 94, 85, 90, 90, 89, 99, 100],
        'q3': [None, np.nan, None, 75, 93, None, 85, 88, 90, 92]
    }
)

scores_df
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
1 | B | 2 | 87.0 | 90 | NaN |
2 | A | 3 | 55.0 | 78 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
4 | B | 5 | 75.0 | 85 | 93.0 |
5 | B | 6 | NaN | 90 | NaN |
6 | A | 7 | 88.0 | 90 | 85.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
Next, let’s filter scores_df
to only show scores for students in section A who scored 90 or above on the first quiz (the q1
column). The first code example below accomplishes the filtering by using loc[]
with a Boolean array created by two criteria combined with an &
, as we learned to do earlier in this chapter.
scores_df.loc[(scores_df['section'] == 'A') & (scores_df['q1'] >= 90)]
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
This next code example accomplishes the same filtering, but uses query()
instead of loc[]
. The query string is text, so it must be enclosed in single quotes or double quotes. Since one of the criteria refers to the section, which is a string value, we need to use different delimiters for the query string and for the section value. In the code example below single quotes are used to delimit the query string, and double quotes are used around the section value. We could also have used double quotes for the query string and single quotes for the section value and it would work the same way.
'section == "A" and q1 >= 90') scores_df.query(
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
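As noted above, the query string can also use or (or |) for logical or. A sketch (assuming we want students who scored 95 or above on either of the first two quizzes):
# Students who scored 95 or above on quiz 1 or quiz 2
scores_df.query('q1 >= 95 or q2 >= 95')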
If the column names are not legal python identifiers (i.e. they have spaces in them or begin with numbers) we need to add backticks around them in the query string. Let’s change the column names of scores_df
so that they are not legal python identifiers, and then demonstrate how to refer to them with backticks.
scores_df.columns = ['section', 'student number', '1 quiz', '2 quiz', '3 quiz']
scores_df
section | student number | 1 quiz | 2 quiz | 3 quiz | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
1 | B | 2 | 87.0 | 90 | NaN |
2 | A | 3 | 55.0 | 78 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
4 | B | 5 | 75.0 | 85 | 93.0 |
5 | B | 6 | NaN | 90 | NaN |
6 | A | 7 | 88.0 | 90 | 85.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
We see above that the column names have been changed. The code below uses query()
to select the data for students with student number >= 5 and score on the third quiz >=90. Since student number
and 3 quiz
are both invalid python identifiers (the first contains a space and the second begins with a number and contains a space) we need to put backticks around them in the query string.
scores_df.query('`student number` >= 5 and `3 quiz` >= 90')
section | student number | 1 quiz | 2 quiz | 3 quiz | |
---|---|---|---|---|---|
4 | B | 5 | 75.0 | 85 | 93.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
You can refer to a variable within the query string by prefixing the variable name with @
. This gives you the flexibility to make your code more dynamic. For example, in the code below we first restore the column names for scores_df
to the original column names. Then we create a variable named honors_thresh
that contains the threshold score value for receiving honors. Finally, we use query()
to select all the data for students who scored at or above the threshold score on the first quiz.
# First, restore sane column names to the scores_df DataFrame
scores_df.columns = ['section', 'student', 'q1', 'q2', 'q3']

# Assign threshold score for honors to a variable
honors_thresh = 93

quiz_1_honors = scores_df.query('q1 >= @honors_thresh')
quiz_1_honors
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
Arithmetic operators can also be used within the query string.
In the code below we identify students who improved at least 10 points from quiz 1 to quiz 2.
scores_df.query('q2 >= q1 + 10')
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
2 | A | 3 | 55.0 | 78 | NaN |
4 | B | 5 | 75.0 | 85 | 93.0 |
1.9.4.1 The isin()
method
The isin()
method returns a Boolean array based on whether or not each of a Series’s values is in a specified collection of values. It is very useful for specifying multiple criteria without having to write many fully-stated criteria connected with logical or.
For example, let’s look at date
, player
, team
, and pts
for the top scoring performances by Lakers or Clippers, sorted in decreasing order of points scored. Only include rows for player-games in which 40 or more points were scored.
(boxscores
 .loc[(boxscores['team'].isin(['LAL', 'LAC'])) &
      (boxscores['pts'] >= 40), ['date', 'player', 'team', 'pts']]
 .sort_values(by = 'pts', ascending = False)
)
date | player | team | pts | |
---|---|---|---|---|
20318 | 20240308 | D. Russell | LAL | 44 |
6855 | 20231208 | K. Leonard | LAC | 41 |
11779 | 20240109 | A. Davis | LAL | 41 |
23829 | 20240331 | P. George | LAC | 41 |
9174 | 20231223 | L. James | LAL | 40 |
9312 | 20231225 | A. Davis | LAL | 40 |
19086 | 20240229 | A. Davis | LAL | 40 |
21558 | 20240316 | L. James | LAL | 40 |
23807 | 20240331 | L. James | LAL | 40 |
1.10 Operations on columns
In this section we will learn how to create new columns from existing columns, how to remove rows or columns, and how to rename columns.
1.10.1 Adding new columns by assigning to them
You can create a new column simply by assigning to it. For example, in the code example below we create new columns named tres_pct
, fg_pct
, and ft_pct
to represent percentage of three pointers made, percentage of shots made, and percentage of free throws made, respectively.
Operations are performed element-wise, so you don’t need to loop over rows. What does that mean? Take a look at the first line in the code below. What it means is that the value in each row of the new column tres_pct
(three point shooting percentage) is calculated by taking that row’s tres_m
(three pointers made) and dividing it by that row’s tres_a
(three pointers attempted).
boxscores['tres_pct'] = boxscores['tres_m']/boxscores['tres_a']
boxscores['fg_pct'] = boxscores['fg_m']/boxscores['fg_a']
boxscores['ft_pct'] = boxscores['ft_m']/boxscores['ft_a']
Assigning directly to a column that did not previously exist in the DataFrame adds that column to the DataFrame.
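We can confirm that the three new percentage columns were added with a quick look at a few rows (a sketch; note that where a player attempted zero shots the division produces a missing value, NaN):
# Confirm the new columns now exist in boxscores
boxscores[['player', 'tres_pct', 'fg_pct', 'ft_pct']].head()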
1.10.2 Adding new columns with assign()
We can also use the assign()
method to create a copy of the DataFrame with specified columns added. This is useful for method chaining. For example, below we calculate points per minute and view the player name, points per minute, points, rebounds, assists, steals, and minutes for the top 10 games by points per minute among player-games with at least 28 minutes played. Note that with this approach the new column ppm
is not permanently added to the boxscores
DataFrame.
(boxscores
 .assign(ppm = boxscores['pts'] / boxscores['min'])
 .loc[boxscores['min'] >= 28, ['player', 'ppm', 'pts', 'reb', 'ast', 'stl', 'min']]
 .sort_values(by = 'ppm', ascending = False)
 .head(10)
)
player | ppm | pts | reb | ast | stl | min | |
---|---|---|---|---|---|---|---|
13726 | J. Embiid | 1.89 | 70 | 18 | 5 | 1 | 37 |
7396 | G. Antetokounmpo | 1.73 | 64 | 14 | 3 | 4 | 37 |
14327 | D. Booker | 1.63 | 62 | 5 | 4 | 1 | 38 |
13768 | K. Towns | 1.63 | 62 | 8 | 2 | 0 | 38 |
14280 | L. Doncic | 1.62 | 73 | 10 | 7 | 1 | 45 |
2072 | J. Embiid | 1.55 | 48 | 11 | 6 | 1 | 31 |
2560 | G. Antetokounmpo | 1.50 | 54 | 12 | 3 | 0 | 36 |
24354 | M. Flynn | 1.47 | 50 | 6 | 5 | 3 | 34 |
15660 | Steph Curry | 1.46 | 60 | 6 | 4 | 0 | 41 |
7782 | J. Brunson | 1.43 | 50 | 6 | 9 | 5 | 35 |
1.10.3 Using drop()
to remove rows or columns from a DataFrame
With drop()
we specify the labels to drop and the axis to which the labels refer (0 for rows, 1 for columns). drop()
then returns a copy of the DataFrame with the specified rows and/or columns dropped. It does not affect the original DataFrame unless we assign the result back to it or set inplace = True
.
As an example, let’s drop the section
column from the scores_df
DataFrame.
scores_df.drop('section', axis = 1, inplace = True)
scores_df
student | q1 | q2 | q3 | |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
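drop() removes rows in the same way. A quick sketch, still using scores_df; the row labels here are the default integer index values, and neither line changes scores_df because the result is not assigned back and inplace is not set.

scores_df.drop([0, 1], axis = 0)        # a copy without the rows labeled 0 and 1
scores_df.drop(columns = ['q2'])        # the columns parameter is an alternative to axis = 1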
1.10.4 Renaming DataFrame columns
To rename columns in a DataFrame use the rename() method with the columns parameter set to a dict in the form {'old_name': 'new_name'}.
scores_df.rename(columns = {'q1': 'first_quiz',
                            'q2': 'second_quiz',
                            'q3': 'third_quiz'})
 | student | first_quiz | second_quiz | third_quiz |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
Note that your change won’t persist unless you set inplace = True or assign the result of the rename() method back to scores_df. When we display scores_df below we see that it still has the old names for the columns that represent the quiz scores.
scores_df
 | student | q1 | q2 | q3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
To rename columns we can also just assign new column labels over the old ones with the columns property. This operation doesn’t return a copy of the DataFrame; it changes the column labels in place.
scores_df.columns = ['student', 'quiz_1', 'quiz_2', 'quiz_3']

scores_df
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
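If you prefer to avoid inplace = True, the usual alternative is to assign the result of rename() back to a name. A sketch of that pattern, using a hypothetical new name first_quiz and assigning to a separate variable so that scores_df itself is unchanged:

renamed_df = scores_df.rename(columns = {'quiz_1': 'first_quiz'})
renamed_df.columns    # the copy has the new name; scores_df still has quiz_1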
1.11 Vectorized string methods with the str accessor
Series has a set of vectorized string processing methods. Vectorized means that they operate on each element of the Series individually and then return a Series of all the individual results. These methods also exclude missing/NA values automatically. They are accessed via the Series object’s str accessor and generally have names matching python’s built-in string methods. For more information on these string methods see the pandas string methods documentation.
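To see the automatic exclusion of missing values in isolation, here is a minimal sketch with a throwaway Series (the values are made up):

import pandas as pd
import numpy as np

names = pd.Series(['L. James', 'A. Davis', np.nan])
names.str.upper()    # the NaN passes through as NaN instead of raising an error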
As an example, let’s look at the top 10 scoring performances by players whose first initial is R. In the code below the .str after boxscores['player'] lets us access the string method startswith(), which we use to check for player values that start with the character 'R'. The code then uses method chaining to sort the result by pts descending and show the first 10 rows.
(boxscores
    .loc[boxscores['player'].str.startswith('R'), ['player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(10)
)
 | player | pts |
---|---|---|
11513 | R. Barrett | 37 |
17582 | R. Hachimura | 36 |
5369 | R. Jackson | 35 |
25943 | R. Barrett | 35 |
23312 | R. Hachimura | 32 |
25218 | R. Hachimura | 30 |
20036 | R. Hachimura | 29 |
13748 | R. Barrett | 29 |
24136 | R. Barrett | 28 |
7009 | R. Barrett | 27 |
Another useful string method that may be applied to a Series element-wise with the str accessor is contains(). With this method we can check whether a string contains a specific value. For example, the following code shows the unique player names for all players whose name includes the string 'Ham'.
boxscores.loc[boxscores['player'].str.contains('Ham'), 'player'].unique()
array(['R.J. Hampton'], dtype=object)
The contains() method has a parameter, case, that defaults to True, which makes the test case-sensitive. If we don’t want the test to be case-sensitive we can set case = False. The code example below is identical to the code example above, except that we have specified case = False. Many more names match when we remove the case sensitivity.
boxscores.loc[boxscores['player'].str.contains('Ham', case = False), 'player'].unique()
array(['C. Cunningham', 'M. Branham', 'M. Beauchamp', 'R.J. Hampton',
'Julian Champagnie', 'L. Shamet', 'D. Graham', 'Justin Champagnie'],
dtype=object)
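One caution: contains() interprets its pattern as a regular expression by default, so characters like . and | have special meaning (pass regex = False for a literal match). As a sketch of the regex behavior, the following uses alternation to match either of two substrings; the pattern is invented for illustration.

# names containing 'ham' or 'jam', ignoring case
boxscores.loc[boxscores['player'].str.contains('ham|jam', case = False), 'player'].unique()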
1.11.1 Extracting elements from lists or strings with str.get()
When a vectorized string method produces a Series of lists or strings, the str.get() method may be used to extract elements from those lists or strings by position.
For example, to extract the last names of the NBA players we could use str.split() to divide each name into a list of words and then use the str.get() method to extract the second item in each list.
boxscores['player'].str.split().str.get(1)
0 Davis
1 Prince
2 James
3 Russell
4 Reaves
...
26493 Len
26494 Mitchell
26495 Duarte
26496 Jones
26497 Jones
Name: player, Length: 26498, dtype: object
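Because str.get() uses python-style indexing, negative positions also work, which grabs the last word even when a name splits into more than two parts. A one-line sketch:

boxscores['player'].str.split().str.get(-1)    # the last element of each list of name parts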
1.12 Accessing parts of datetime fields with the dt accessor
Base python has a datetime type that is used for storing data that represents dates and times. Pandas has a dt accessor for accessing parts of the datetime objects in a Series; it is similar to the str accessor that we use to apply string methods to each item in a Series. We can use dt to access components of datetime objects, including the following:
- date - the date without the time values
- month - month as an integer (1 - 12)
- day - day as an integer (1 - 31)
- year - year as an integer
- isocalendar().week - week of the year as an integer
- weekday - day of week returned as an integer (0 = Monday; 6 = Sunday)
- day_name() - day of week returned as a string
Before we demonstrate some of these let’s first transform the date column in the boxscores DataFrame to a datetime type. We saw when we ran the info() method on the boxscores DataFrame previously that the date column is of type int64. The pandas to_datetime() function converts a Series of values to datetimes. It tries to infer the string format of the dates in the column it is converting; if it can’t infer the format you can indicate it with the format parameter.
boxscores['date'] = pd.to_datetime(boxscores['date'], format = "%Y%m%d")
After that transformation, when we run the info() method of the boxscores DataFrame we see that the date column is now of type datetime64[ns].
boxscores.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 26498 non-null datetime64[ns]
1 team 26498 non-null object
2 opponent 26498 non-null object
3 player 26498 non-null object
4 position 26498 non-null object
5 min 26498 non-null int64
6 oreb 26498 non-null int64
7 dreb 26498 non-null int64
8 reb 26498 non-null int64
9 ast 26498 non-null int64
10 stl 26498 non-null int64
11 blk 26498 non-null int64
12 to 26498 non-null int64
13 pf 26498 non-null int64
14 plus_m 26498 non-null int64
15 pts 26498 non-null int64
16 fg_m 26498 non-null int64
17 fg_a 26498 non-null int64
18 tres_m 26498 non-null int64
19 tres_a 26498 non-null int64
20 ft_m 26498 non-null int64
21 ft_a 26498 non-null int64
22 tres_pct 20559 non-null float64
23 fg_pct 24895 non-null float64
24 ft_pct 14232 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(17), object(4)
memory usage: 5.1+ MB
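One related note: if some values cannot be parsed, to_datetime() raises an error by default; its errors parameter can be set to 'coerce' to turn unparseable values into NaT (the datetime counterpart of NaN) instead. A small stand-alone sketch with made-up values:

pd.to_datetime(pd.Series(['20240308', 'not a date']), format = "%Y%m%d", errors = 'coerce')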
To see a bit of how the dt accessor works, let’s view 5 random rows of the date column and then use the dt accessor and one of the datetime methods to extract just the day of the week from the dates for those rows.
bs_selection = boxscores.sample(5)

bs_selection['date']
10392 2024-01-01
10465 2024-01-01
19810 2024-03-05
7437 2023-12-13
13121 2024-01-18
Name: date, dtype: datetime64[ns]
Now let’s use the dt accessor and the day_name() method to see which day of the week each of those dates falls on.
bs_selection['date'].dt.day_name()
10392 Monday
10465 Monday
19810 Tuesday
7437 Wednesday
13121 Thursday
Name: date, dtype: object
Let’s look at the top 10 scoring performances in October, the first month of the season.
(boxscores
    .loc[boxscores['date'].dt.month == 10, ['date', 'player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(10)
)
 | date | player | pts |
---|---|---|---|
575 | 2023-10-28 | Z. LaVine | 51 |
504 | 2023-10-27 | L. Doncic | 49 |
444 | 2023-10-27 | Donovan Mitchell | 43 |
1034 | 2023-10-30 | Steph Curry | 42 |
553 | 2023-10-27 | Steph Curry | 41 |
960 | 2023-10-30 | D. Murray | 41 |
563 | 2023-10-27 | D. Fox | 39 |
329 | 2023-10-26 | D. Lillard | 39 |
336 | 2023-10-26 | K. Durant | 39 |
871 | 2023-10-29 | D. Fox | 37 |
Next, let’s look at the top scoring performances that occurred on a Monday.
(boxscores
    .loc[boxscores['date'].dt.day_name() == 'Monday', ['date', 'player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(10)
)
 | date | player | pts |
---|---|---|---|
13726 | 2024-01-22 | J. Embiid | 70 |
13768 | 2024-01-22 | K. Towns | 62 |
24107 | 2024-04-01 | D. Booker | 52 |
9344 | 2023-12-25 | L. Doncic | 50 |
2072 | 2023-11-06 | J. Embiid | 48 |
14975 | 2024-01-29 | L. Doncic | 45 |
4191 | 2023-11-20 | J. Tatum | 45 |
2100 | 2023-11-06 | C. Thomas | 45 |
8209 | 2023-12-18 | C. Cunningham | 43 |
13801 | 2024-01-22 | K. Durant | 43 |
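The other dt components listed earlier work the same way in filters and can be combined with Boolean operators. A sketch combining year and weekday, assuming the converted boxscores DataFrame:

# top 5 scoring performances on weekend days (weekday 5 = Saturday, 6 = Sunday) in 2024
(boxscores
    .loc[(boxscores['date'].dt.year == 2024) & (boxscores['date'].dt.weekday >= 5),
         ['date', 'player', 'pts']]
    .sort_values(by = 'pts', ascending = False)
    .head(5)
)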
1.13 Dealing with missing data
Pandas attempts to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default via the skipna = True parameter setting.
For numbers, pandas uses the floating-point value NaN (Not a Number) to represent missing data. The python special value None is also treated as missing data.
Let’s practice working with the scores_df DataFrame, which was created with some missing data.
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
1.13.1 Methods for identifying and working with missing values
The following methods are very useful for dealing with missing values:
- dropna() - Filters axis labels based on whether values for each label have missing data. Use the thresh parameter to set a threshold for how many values must be non-missing for the row or column to be kept. Specify rows or columns with the axis parameter.
- fillna() - Fills in missing data with a specified value or by using an interpolation method such as "ffill" or "bfill" (see the sketch after this list).
- isna() - Returns Boolean values, with True where values are missing. This can be combined with sum() to count missing values or mean() to calculate the proportion of missing values.
- notna() - Negation of isna(). Returns True for non-missing values and False for missing values.
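Here is the fillna() sketch promised above, assuming the scores_df DataFrame. Filling with the column’s own mean is a simple, common imputation, and the ffill() method carries the previous non-missing value forward; neither line below changes scores_df because the results are not assigned back.

scores_df['quiz_1'].fillna(scores_df['quiz_1'].mean())   # fill with the column mean
scores_df['quiz_1'].ffill()                              # carry the previous non-missing value forward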
In the code cell below we count the missing values in each column (the default) of the scores_df DataFrame. We make use of an interesting trick that works in many different programming languages. scores_df.isna() returns a DataFrame of Booleans (True and False values) with the same size and shape as the scores_df DataFrame. The True values occur in the positions of the DataFrame that have missing values. When a math operation such as the sum() method is applied to Booleans, the True values are treated as ones and the False values are treated as zeros. So summing the Booleans gives the count of True values, and because the True values only occur where there are missing values, we get a count of missing values.
scores_df.isna().sum()
student 0
quiz_1 2
quiz_2 0
quiz_3 4
dtype: int64
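You can see the Boolean-as-numbers behavior in isolation with a throwaway Series:

pd.Series([True, False, True, True]).sum()     # 3: each True counts as 1
pd.Series([True, False, True, True]).mean()    # 0.75: the proportion of True values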
If we want to count the missing values by row we can set axis = 1 within the sum() method to count the missing values across the columns.
scores_df.isna().sum(axis = 1)
0 1
1 1
2 1
3 0
4 0
5 2
6 0
7 0
8 0
9 1
dtype: int64
Next we calculate the proportion of missing values by column. When we apply the mean() method to Booleans, the True values are treated as ones and the False values as zeros. The mean() method takes the column means by default. When you take the mean of a set of zeros and ones you end up with the proportion of ones, and because the True values were treated as ones, that is also the proportion of missing values.
Just to illustrate: if we have the values 1, 1, 0, 0, 0 and we take the mean we get 2/5, or 0.40, which is also the proportion of ones in the values (2 ones out of 5 total values).
scores_df.isna().mean()
student 0.0
quiz_1 0.2
quiz_2 0.0
quiz_3 0.4
dtype: float64
We can calculate the proportion of missing values in the rows by setting the axis parameter to 1.
scores_df.isna().mean(axis = 1)
0 0.25
1 0.25
2 0.25
3 0.00
4 0.00
5 0.50
6 0.00
7 0.00
8 0.00
9 0.25
dtype: float64
If we set the axis parameter to None we calculate the proportion of missing values in the entire DataFrame.
scores_df.isna().mean(axis = None)
0.15
We can also use the Boolean Series returned by isna() within loc[] to select rows based on missing values. The code below selects just the rows that have missing values for quiz 1.
scores_df.loc[scores_df['quiz_1'].isna()]
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
5 | 6 | NaN | 90 | NaN |
9 | 10 | NaN | 100 | 92.0 |
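notna() gives the complementary selection. A one-line sketch:

scores_df.loc[scores_df['quiz_1'].notna()]    # keep only the rows that have a quiz 1 score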
The dropna() method can be used to drop rows or columns that have missing values. By default the axis parameter is set to 0, so rows with missing values are dropped.
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN |
1 | 2 | 87.0 | 90 | NaN |
2 | 3 | 55.0 | 78 | NaN |
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
5 | 6 | NaN | 90 | NaN |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
9 | 10 | NaN | 100 | 92.0 |
scores_df.dropna()
 | student | quiz_1 | quiz_2 | quiz_3 |
---|---|---|---|---|
3 | 4 | 100.0 | 94 | 75.0 |
4 | 5 | 75.0 | 85 | 93.0 |
6 | 7 | 88.0 | 90 | 85.0 |
7 | 8 | 90.0 | 89 | 88.0 |
8 | 9 | 92.0 | 99 | 90.0 |
If we set axis = 1, columns with missing values are dropped. Now we only see the columns for student and quiz 2; quiz 1 and quiz 3 both have some missing values.
scores_df.dropna(axis = 1)
 | student | quiz_2 |
---|---|---|
0 | 1 | 95 |
1 | 2 | 90 |
2 | 3 | 78 |
3 | 4 | 94 |
4 | 5 | 85 |
5 | 6 | 90 |
6 | 7 | 90 |
7 | 8 | 89 |
8 | 9 | 99 |
9 | 10 | 100 |
We can use the thresh parameter to specify how many values must be non-missing for the column or row to be kept. In the code below we drop columns that do not have at least 7 non-missing values. The result now retains quiz 1 as well as quiz 2.
scores_df.dropna(axis = 1, thresh = 7)
 | student | quiz_1 | quiz_2 |
---|---|---|---|
0 | 1 | 99.0 | 95 |
1 | 2 | 87.0 | 90 |
2 | 3 | 55.0 | 78 |
3 | 4 | 100.0 | 94 |
4 | 5 | 75.0 | 85 |
5 | 6 | NaN | 90 |
6 | 7 | 88.0 | 90 |
7 | 8 | 90.0 | 89 |
8 | 9 | 92.0 | 99 |
9 | 10 | NaN | 100 |
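dropna() also has a subset parameter that limits the missing-value check to particular labels, which is handy when only some columns matter. A sketch:

scores_df.dropna(subset = ['quiz_1'])    # drop only the rows that are missing a quiz 1 score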
1.13.2 Examples of how missing data affects various calculations
When doing mathematical calculations it is important to consider the effects of missing data. The pandas basic statistical methods ignore missing data by default, because their skipna parameter is set to True by default.
scores_df.loc[:, ['quiz_1', 'quiz_2', 'quiz_3']].mean() # skipna = True is set by default
quiz_1 85.75
quiz_2 91.00
quiz_3 87.17
dtype: float64
Skipping the missing values may lead to misleading summary statistics if there are a lot of missing values. For example, imagine taking the mean score for a quiz for which 20 students have scores and 30 students do not yet have scores. You would get a mean, but it would reflect the scores of only 40% of the students. If we set skipna = False the summary statistic is only calculated when there are no missing values. The code below only produces a mean score for quiz 2, because that is the only quiz for which none of the scores are missing.
scores_df.loc[:, ['quiz_1', 'quiz_2', 'quiz_3']].mean(skipna = False)
quiz_1 NaN
quiz_2 91.0
quiz_3 NaN
dtype: float64
Normal arithmetic operators, unlike the pandas summary methods, do not ignore missing values. The code below adds a new column called quiz_avg to the scores_df DataFrame that holds the calculated average quiz score for each student. Note that it only creates an average for students who have no missing quiz scores.
scores_df['quiz_avg'] = (scores_df['quiz_1'] + scores_df['quiz_2'] + scores_df['quiz_3']) / 3
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 | quiz_avg |
---|---|---|---|---|---|
0 | 1 | 99.0 | 95 | NaN | NaN |
1 | 2 | 87.0 | 90 | NaN | NaN |
2 | 3 | 55.0 | 78 | NaN | NaN |
3 | 4 | 100.0 | 94 | 75.0 | 89.67 |
4 | 5 | 75.0 | 85 | 93.0 | 84.33 |
5 | 6 | NaN | 90 | NaN | NaN |
6 | 7 | 88.0 | 90 | 85.0 | 87.67 |
7 | 8 | 90.0 | 89 | 88.0 | 89.00 |
8 | 9 | 92.0 | 99 | 90.0 | 93.67 |
9 | 10 | NaN | 100 | 92.0 | NaN |
To create an average quiz score for all students we could first use the fillna() method to replace the missing values with zeros and then create the new column.
# Here we fill the missing values with zeros first
scores_df.fillna(0, inplace = True)

scores_df['quiz_avg'] = (scores_df['quiz_1'] + scores_df['quiz_2'] + scores_df['quiz_3']) / 3
scores_df
 | student | quiz_1 | quiz_2 | quiz_3 | quiz_avg |
---|---|---|---|---|---|
0 | 1 | 99.0 | 95 | 0.0 | 64.67 |
1 | 2 | 87.0 | 90 | 0.0 | 59.00 |
2 | 3 | 55.0 | 78 | 0.0 | 44.33 |
3 | 4 | 100.0 | 94 | 75.0 | 89.67 |
4 | 5 | 75.0 | 85 | 93.0 | 84.33 |
5 | 6 | 0.0 | 90 | 0.0 | 30.00 |
6 | 7 | 88.0 | 90 | 85.0 | 87.67 |
7 | 8 | 90.0 | 89 | 88.0 | 89.00 |
8 | 9 | 92.0 | 99 | 90.0 | 93.67 |
9 | 10 | 0.0 | 100 | 92.0 | 64.00 |
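As a point of comparison, the mean() method with axis = 1 averages across each row, and because skipna defaults to True it averages only the quizzes a student actually took rather than counting missing quizzes as zeros. Whether that or the fillna(0) approach is appropriate depends on how you want missing quizzes graded. A sketch:

scores_df[['quiz_1', 'quiz_2', 'quiz_3']].mean(axis = 1) # row means, skipping NaN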