Appendix C — Chapter 1 Practice - Solutions

Click here to download this chapter as a Jupyter (.ipynb) file.

These are suggested solutions to the practice exercises for the Pandas Fundamentals chapter. For some of the exercises alternative solutions are presented.

import pandas as pd
import numpy as np

The code below loads the superstore.csv dataset from an online server into a DataFrame named store. This data describes retail sales for a business. Each row of the dataset represents one product on one order. An order that included multiple products will have multiple rows, each with the same order ID but a different product ID.

store = pd.read_csv("https://neuronjolt.com/data/superstore.csv")

C.1 Practice Exercise 1-1

Use the .info() method to learn more about the store DataFrame

store.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Order ID       9994 non-null   object 
 1   Order Date     9994 non-null   object 
 2   Ship Date      9994 non-null   object 
 3   Ship Mode      9994 non-null   object 
 4   Customer ID    9994 non-null   object 
 5   Customer Name  9994 non-null   object 
 6   Segment        9994 non-null   object 
 7   Country        9994 non-null   object 
 8   City           9994 non-null   object 
 9   State          9994 non-null   object 
 10  Postal Code    9994 non-null   int64  
 11  Region         9994 non-null   object 
 12  Product ID     9994 non-null   object 
 13  Category       9994 non-null   object 
 14  Sub-Category   9994 non-null   object 
 15  Product Name   9994 non-null   object 
 16  Sales          9994 non-null   float64
 17  Quantity       9994 non-null   int64  
 18  Discount       9994 non-null   float64
 19  Profit         9994 non-null   float64
dtypes: float64(3), int64(2), object(15)
memory usage: 1.5+ MB

C.2 Practice Exercise 1-2

Use the .sample() method to examine 3 randomly-selected rows of data from the store DataFrame

store.sample(3)

	Order ID	Order Date	Ship Date	Ship Mode	Customer ID	Customer Name	Segment	Country	City	State	Postal Code	Region	Product ID	Category	Sub-Category	Product Name	Sales	Quantity	Discount	Profit
5053	CA-2015-141243	1/3/2015	1/8/2015	Second Class	AH-10465	Amy Hunt	Consumer	United States	Dallas	Texas	75217	Central	TEC-AC-10003198	Technology	Accessories	Enermax Acrylux Wireless Keyboard	398.400	5	0.2	84.6600
4774	CA-2017-119746	11/23/2017	11/27/2017	Standard Class	CM-12385	Christopher Martinez	Consumer	United States	Chicago	Illinois	60610	Central	TEC-PH-10004447	Technology	Phones	Toshiba IPT2010-SD IP Telephone	222.384	2	0.2	16.6788
4204	CA-2014-145387	10/31/2014	11/2/2014	Second Class	AM-10705	Anne McFarland	Consumer	United States	Cranston	Rhode Island	2920	East	OFF-PA-10002195	Office Supplies	Paper	RSVP Cards & Envelopes, Blank White, 8-1/2" X ...	10.160	2	0.0	4.7752

C.3 Practice Exercise 1-3

Display the names of the columns in the store DataFrame.

store.columns

Index(['Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID',
       'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
       'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
       'Sales', 'Quantity', 'Discount', 'Profit'],
      dtype='object')

C.4 Practice Exercise 1-4

Display the count of unique orders (unique Order IDs) in the store DataFrame.

store['Order ID'].nunique()

C.5 Practice Exercise 1-5

Display the counts of rows in the store DataFrame for each unique product ID. Your result should show the unique product IDs with an integer beside each one that represents the count of rows in the dataset that have that product ID.

store['Product ID'].value_counts()

Product ID
OFF-PA-10001970    19
TEC-AC-10003832    18
FUR-FU-10004270    16
FUR-CH-10002647    15
FUR-CH-10001146    15
                   ..
TEC-MA-10004552     1
TEC-MA-10003589     1
OFF-AP-10003099     1
TEC-PH-10002645     1
OFF-ST-10001627     1
Name: count, Length: 1862, dtype: int64

C.6 Practice Exercise 1-6

Using just one pandas method, display descriptive statistics (including count, mean, std deviation, minimum, 25th percentile, median, 75th percentile, and maximum) for the sales amount of the order lines.

store['Sales'].describe()

count     9994.000000
mean       229.858001
std        623.245101
min          0.444000
25%         17.280000
50%         54.490000
75%        209.940000
max      22638.480000
Name: Sales, dtype: float64

The next exercises makes use of the quiz_df DataFrame, which shows quiz scores for several students. The DataFrame is created and displayed in the code cell below.

quizzes = {
    'name': ["Hannah", "Sam", "Anjali", "Erin", "Latasha"], 
    'quiz_1': [89, 74, 79, 90, 95],
    'quiz_2': [73, 75, 78, 88, 86],
    'quiz_3': [92, 88, 85, 95, 100],
    'quiz_4': [100, 90, 88, 92, 98]
}

quiz_df = pd.DataFrame(quizzes).set_index('name')

quiz_df

	quiz_1	quiz_2	quiz_3	quiz_4
name
Hannah	89	73	92	100
Sam	74	75	88	90
Anjali	79	78	85	88
Erin	90	88	95	92
Latasha	95	86	100	98

C.7 Practice Exercise 1-7

Using one pandas method, calculate and display the median quiz score for each student in the quiz_df DataFrame

quiz_df.median(axis = 1)

name
Hannah     90.5
Sam        81.5
Anjali     82.0
Erin       91.0
Latasha    96.5
dtype: float64

C.8 Practice Exercise 1-8

Calculate and display the minimum quiz score for each quiz in the quiz_df DataFrame

quiz_df.min(axis = 0)

quiz_1    74
quiz_2    73
quiz_3    85
quiz_4    88
dtype: int64

C.9 Practice Exercise 1-9

Show the quiz scores for just Anjali. Hint: In quiz_df the names are the row index, so they cannot be accessed as a column.

quiz_df.loc['Anjali']

quiz_1    79
quiz_2    78
quiz_3    85
quiz_4    88
Name: Anjali, dtype: int64

C.10 Practice Exercise 1-10

Show the scores on quizzes 1 through 3 for Sam and Erin.

quiz_df.loc[['Sam', 'Erin'], 'quiz_1':'quiz_3']

	quiz_1	quiz_2	quiz_3
name
Sam	74	75	88
Erin	90	88	95

Now, let’s load the NBA boxscore data from the 2023-24 season so that we can use it in some of the exercises below. Recall that in this dataset each row represents the statistics for one player in one game.

url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)

C.11 Practice Exercise 1-11

Show counts by player of games in which the player scored 50+ points during the season. Hint: Remember that each row represents a player-game, so you should limit the data to rows in which 50 or more points were scored and then count the rows for each player.

boxscores.loc[boxscores['pts'] >= 50, 'player'].value_counts()

player
T. Maxey            3
D. Booker           3
J. Embiid           3
J. Brunson          2
G. Antetokounmpo    2
L. Doncic           2
Z. LaVine           1
K. Towns            1
Steph Curry         1
M. Flynn            1
A. Edwards          1
Name: count, dtype: int64

# Alternative solution using .query()
boxscores.query('pts >= 50')['player'].value_counts()

player
T. Maxey            3
D. Booker           3
J. Embiid           3
J. Brunson          2
G. Antetokounmpo    2
L. Doncic           2
Z. LaVine           1
K. Towns            1
Steph Curry         1
M. Flynn            1
A. Edwards          1
Name: count, dtype: int64

C.12 Practice Exercise 1-12

Let’s look for some amazing guard play. Display player, ast and stl, and pts for player-games in which a player had 20 or more assists or 7 or more steals.

boxscores.loc[(boxscores['ast'] >= 20) | (boxscores['stl'] >= 7),
              ['player', 'ast', 'stl', 'pts']]

	player	ast	stl	pts
3282	S. Gilgeous-Alexander	5	7	28
9726	T. Haliburton	20	2	21
10128	T. Haliburton	23	2	22
17842	H. Jones	4	7	17
20672	A. Davis	5	7	27
25572	Tre Mann	7	7	16

# Alternative solution using .query()
boxscores.query('ast >= 20 or stl >=7').loc[:, ['player', 'ast', 'stl', 'pts']]

	player	ast	stl	pts
3282	S. Gilgeous-Alexander	5	7	28
9726	T. Haliburton	20	2	21
10128	T. Haliburton	23	2	22
17842	H. Jones	4	7	17
20672	A. Davis	5	7	27
25572	Tre Mann	7	7	16

C.13 Practice Exercise 1-13

Show player name and count of games (rows) in which the player had 10 or more assists. Use the head() method to limit the result to 10 rows. In other words, create a top-10 list of players according to how many games in which they had 10 or more assists.

boxscores.loc[boxscores['ast'] >= 10, 'player'].value_counts().head(10)

player
T. Haliburton    47
L. Doncic        37
Trae Young       37
N. Jokic         31
D. Sabonis       29
J. Harden        27
F. VanVleet      24
L. James         18
D. Murray        16
C. Cunningham    16
Name: count, dtype: int64

# Alternative solution using .query()
boxscores.query('ast >= 10')['player'].value_counts().head(10)

player
T. Haliburton    47
L. Doncic        37
Trae Young       37
N. Jokic         31
D. Sabonis       29
J. Harden        27
F. VanVleet      24
L. James         18
D. Murray        16
C. Cunningham    16
Name: count, dtype: int64

C.14 Practice Exercise 1-14

Show player name, points, rebounds, and assists for all the player-games in which the player had at least ten points, ten rebounds, and ten assists (“triple double”). Sort in decreasing order of points.

(boxscores
 .loc[(boxscores['pts'] >=10) & 
      (boxscores['reb'] >= 10) &
      (boxscores['ast'] >= 10), 
 ['player', 'pts', 'reb', 'ast']]
 .sort_values(by = 'pts', ascending = False)
)

	player	pts	reb	ast
7964	L. Doncic	40	12	10
6523	L. Doncic	40	10	11
20389	L. Doncic	39	10	10
9208	L. Doncic	39	12	10
25233	L. Doncic	39	12	10
...	...	...	...	...
22777	K. Middleton	11	10	10
22886	J. Hart	11	14	10
15081	J. Hart	10	10	10
21858	J. Hart	10	11	11
24225	Tre Jones	10	12	11

136 rows × 4 columns

# Alternative solution using .query()
(boxscores
 .query('pts >= 10 & reb >= 10 & ast >= 10')
 .loc[:, ['player', 'pts', 'reb', 'ast']]
 .sort_values(by = 'pts', ascending = False)
)

	player	pts	reb	ast
7964	L. Doncic	40	12	10
6523	L. Doncic	40	10	11
20389	L. Doncic	39	10	10
9208	L. Doncic	39	12	10
25233	L. Doncic	39	12	10
...	...	...	...	...
22777	K. Middleton	11	10	10
22886	J. Hart	11	14	10
15081	J. Hart	10	10	10
21858	J. Hart	10	11	11
24225	Tre Jones	10	12	11

136 rows × 4 columns

C.15 Practice Exercise 1-15

Display the entire stat line for any player who scored 40 or more points and 10 or more rebounds in a game while playing for the Lakers (‘LAL’ in the team column).

(boxscores
 .loc[(boxscores['team'] == 'LAL') &
      (boxscores['pts'] >= 40) &
      (boxscores['reb'] >= 10)]
)

	date	team	opponent	player	position	min	oreb	dreb	reb	ast	...	to	pf	plus_m	pts	fg_m	fg_a	tres_m	tres_a	ft_m	ft_a
9312	20231225	LAL	BOS	A. Davis	PF	39	5	8	13	4	...	2	2	-6	40	15	26	2	4	8	10
11779	20240109	LAL	TOR	A. Davis	PF	40	1	10	11	6	...	3	4	0	41	13	17	2	2	13	14
19086	20240229	LAL	WSH	A. Davis	PF	41	3	12	15	4	...	1	2	6	40	14	22	0	0	12	13

3 rows × 22 columns

# Alternative solution using .query()
boxscores.query('team == "LAL" & pts >= 40 & reb >= 10')

	date	team	opponent	player	position	min	oreb	dreb	reb	ast	...	to	pf	plus_m	pts	fg_m	fg_a	tres_m	tres_a	ft_m	ft_a
9312	20231225	LAL	BOS	A. Davis	PF	39	5	8	13	4	...	2	2	-6	40	15	26	2	4	8	10
11779	20240109	LAL	TOR	A. Davis	PF	40	1	10	11	6	...	3	4	0	41	13	17	2	2	13	14
19086	20240229	LAL	WSH	A. Davis	PF	41	3	12	15	4	...	1	2	6	40	14	22	0	0	12	13

3 rows × 22 columns

C.16 Practice Exercise 1-16

How many players are in the boxscores data?

boxscores['player'].nunique()

C.17 Practice Exercise 1-17

Convert the date column in the boxscore data to a datetime type.

boxscores['date'] = pd.to_datetime(boxscores['date'], format = "%Y%m%d")

C.18 Practice Exercise 1-18

Show the date, player, and points for the top 5 games (top 5 points scored by a player in a game) that occurred in January, 2024.

(boxscores
 .loc[boxscores['date'].dt.month == 1,['date', 'player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
 .head(5)
)

	date	player	pts
14280	2024-01-26	L. Doncic	73
13726	2024-01-22	J. Embiid	70
13768	2024-01-22	K. Towns	62
14327	2024-01-26	D. Booker	62
13307	2024-01-19	D. Booker	52

C.19 Practice Exercise 1-19

Show all the unique player names that include the string “James”. Make the test case-sensitive.

(boxscores
 .loc[boxscores['player'].str.contains('James'), 'player']
 .unique()
)

array(['L. James', 'James Johnson'], dtype=object)

C.20 Practice Exercise 1-20

From the sales data, show descriptive statistics for quantity ordered for product ID ‘FUR-CH-10001146’.

store.loc[store['Product ID'] == 'FUR-CH-10001146', 'Quantity'].describe()

count    15.000000
mean      3.400000
std       1.723783
min       1.000000
25%       2.000000
50%       3.000000
75%       4.500000
max       7.000000
Name: Quantity, dtype: float64

The code below creates the scores_df DataFrame, which has some missing values. It shows course section and quiz scores for some students.

scores_df = pd.DataFrame(
    {
        'section': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A'],
        'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'q1': [99, 87, 55, 100, 75, None, 88, 90, 92, np.nan],
        'q2': [95, 90, 78, 94, 85, 90, 90, 89, 99, 100],
        'q3': [None, np.nan, None, 75, 93, None, 85, 88, 90, 92]
    }
)

scores_df

	section	student	q1	q2	q3
0	A	1	99.0	95	NaN
1	B	2	87.0	90	NaN
2	A	3	55.0	78	NaN
3	A	4	100.0	94	75.0
4	B	5	75.0	85	93.0
5	B	6	NaN	90	NaN
6	A	7	88.0	90	85.0
7	A	8	90.0	89	88.0
8	B	9	92.0	99	90.0
9	A	10	NaN	100	92.0

C.21 Practice Exercise 1-21

Show all rows in the scores DataFrame for which the quiz 3 score is missing.

scores_df.loc[scores_df['q3'].isna()]

	section	student	q1	q2	q3
0	A	1	99.0	95	NaN
1	B	2	87.0	90	NaN
2	A	3	55.0	78	NaN
5	B	6	NaN	90	NaN

C.22 Practice Exercise 1-22

Show all rows in the scores DataFrame for which the quiz 3 score is NOT missing.

scores_df.loc[scores_df['q3'].notna()]

	section	student	q1	q2	q3
3	A	4	100.0	94	75.0
4	B	5	75.0	85	93.0
6	A	7	88.0	90	85.0
7	A	8	90.0	89	88.0
8	B	9	92.0	99	90.0
9	A	10	NaN	100	92.0

C.23 Practice Exercise 1-23

Display the q1 column of the scores_df DataFrame with the missing values in that column replaced by zeros.

scores_df['q1'].fillna(0)

0     99.0
1     87.0
2     55.0
3    100.0
4     75.0
5      0.0
6     88.0
7     90.0
8     92.0
9      0.0
Name: q1, dtype: float64

C.24 Practice Exercise 1-24

Display the quiz_df, then add a new column to the DataFrame that has the quiz average for each student. Display the DataFrame again after adding the new column.

quiz_df

	quiz_1	quiz_2	quiz_3	quiz_4
name
Hannah	89	73	92	100
Sam	74	75	88	90
Anjali	79	78	85	88
Erin	90	88	95	92
Latasha	95	86	100	98

quiz_df['quiz_avg'] = (quiz_df['quiz_1'] + 
                       quiz_df['quiz_2'] + 
                       quiz_df['quiz_3'] + 
                       quiz_df['quiz_4']) / 4

quiz_df

	quiz_1	quiz_2	quiz_3	quiz_4	quiz_avg
name
Hannah	89	73	92	100	88.50
Sam	74	75	88	90	81.75
Anjali	79	78	85	88	82.50
Erin	90	88	95	92	91.25
Latasha	95	86	100	98	94.75

C.25 Practice Exercise 1-25

Delete the new column you created for the quiz average from the quiz_df DataFrame and then display the DataFrame to verify that the column is gone.

quiz_df.drop('quiz_avg', axis = 1, inplace = True)

quiz_df

	quiz_1	quiz_2	quiz_3	quiz_4
name
Hannah	89	73	92	100
Sam	74	75	88	90
Anjali	79	78	85	88
Erin	90	88	95	92
Latasha	95	86	100	98