Appendix C — Chapter 1 Practice - Solutions

Click here to download this chapter as a Jupyter (.ipynb) file.

These are suggested solutions to the practice exercises for the Pandas Fundamentals chapter. For some of the exercises alternative solutions are presented.

import pandas as pd
import numpy as np

The code below loads the superstore.csv dataset from an online server into a DataFrame named store. This data describes retail sales for a business. Each row of the dataset represents one product on one order. An order that included multiple products will have multiple rows, each with the same order ID but a different product ID.

store = pd.read_csv("https://neuronjolt.com/data/superstore.csv")

C.1 Practice Exercise 1-1

Use the .info() method to learn more about the store DataFrame

store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Order ID       9994 non-null   object 
 1   Order Date     9994 non-null   object 
 2   Ship Date      9994 non-null   object 
 3   Ship Mode      9994 non-null   object 
 4   Customer ID    9994 non-null   object 
 5   Customer Name  9994 non-null   object 
 6   Segment        9994 non-null   object 
 7   Country        9994 non-null   object 
 8   City           9994 non-null   object 
 9   State          9994 non-null   object 
 10  Postal Code    9994 non-null   int64  
 11  Region         9994 non-null   object 
 12  Product ID     9994 non-null   object 
 13  Category       9994 non-null   object 
 14  Sub-Category   9994 non-null   object 
 15  Product Name   9994 non-null   object 
 16  Sales          9994 non-null   float64
 17  Quantity       9994 non-null   int64  
 18  Discount       9994 non-null   float64
 19  Profit         9994 non-null   float64
dtypes: float64(3), int64(2), object(15)
memory usage: 1.5+ MB

C.2 Practice Exercise 1-2

Use the .sample() method to examine 3 randomly-selected rows of data from the store DataFrame

store.sample(3)
Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City State Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
5053 CA-2015-141243 1/3/2015 1/8/2015 Second Class AH-10465 Amy Hunt Consumer United States Dallas Texas 75217 Central TEC-AC-10003198 Technology Accessories Enermax Acrylux Wireless Keyboard 398.400 5 0.2 84.6600
4774 CA-2017-119746 11/23/2017 11/27/2017 Standard Class CM-12385 Christopher Martinez Consumer United States Chicago Illinois 60610 Central TEC-PH-10004447 Technology Phones Toshiba IPT2010-SD IP Telephone 222.384 2 0.2 16.6788
4204 CA-2014-145387 10/31/2014 11/2/2014 Second Class AM-10705 Anne McFarland Consumer United States Cranston Rhode Island 2920 East OFF-PA-10002195 Office Supplies Paper RSVP Cards & Envelopes, Blank White, 8-1/2" X ... 10.160 2 0.0 4.7752

C.3 Practice Exercise 1-3

Display the names of the columns in the store DataFrame.

store.columns
Index(['Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID',
       'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
       'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
       'Sales', 'Quantity', 'Discount', 'Profit'],
      dtype='object')

C.4 Practice Exercise 1-4

Display the count of unique orders (unique Order IDs) in the store DataFrame.

store['Order ID'].nunique()
5009

C.5 Practice Exercise 1-5

Display the counts of rows in the store DataFrame for each unique product ID. Your result should show the unique product IDs with an integer beside each one that represents the count of rows in the dataset that have that product ID.

store['Product ID'].value_counts()
Product ID
OFF-PA-10001970    19
TEC-AC-10003832    18
FUR-FU-10004270    16
FUR-CH-10002647    15
FUR-CH-10001146    15
                   ..
TEC-MA-10004552     1
TEC-MA-10003589     1
OFF-AP-10003099     1
TEC-PH-10002645     1
OFF-ST-10001627     1
Name: count, Length: 1862, dtype: int64

C.6 Practice Exercise 1-6

Using just one pandas method, display descriptive statistics (including count, mean, std deviation, minimum, 25th percentile, median, 75th percentile, and maximum) for the sales amount of the order lines.

store['Sales'].describe()
count     9994.000000
mean       229.858001
std        623.245101
min          0.444000
25%         17.280000
50%         54.490000
75%        209.940000
max      22638.480000
Name: Sales, dtype: float64

The next exercises makes use of the quiz_df DataFrame, which shows quiz scores for several students. The DataFrame is created and displayed in the code cell below.

quizzes = {
    'name': ["Hannah", "Sam", "Anjali", "Erin", "Latasha"], 
    'quiz_1': [89, 74, 79, 90, 95],
    'quiz_2': [73, 75, 78, 88, 86],
    'quiz_3': [92, 88, 85, 95, 100],
    'quiz_4': [100, 90, 88, 92, 98]
}

quiz_df = pd.DataFrame(quizzes).set_index('name')

quiz_df
quiz_1 quiz_2 quiz_3 quiz_4
name
Hannah 89 73 92 100
Sam 74 75 88 90
Anjali 79 78 85 88
Erin 90 88 95 92
Latasha 95 86 100 98

C.7 Practice Exercise 1-7

Using one pandas method, calculate and display the median quiz score for each student in the quiz_df DataFrame

quiz_df.median(axis = 1)
name
Hannah     90.5
Sam        81.5
Anjali     82.0
Erin       91.0
Latasha    96.5
dtype: float64

C.8 Practice Exercise 1-8

Calculate and display the minimum quiz score for each quiz in the quiz_df DataFrame

quiz_df.min(axis = 0)
quiz_1    74
quiz_2    73
quiz_3    85
quiz_4    88
dtype: int64

C.9 Practice Exercise 1-9

Show the quiz scores for just Anjali. Hint: In quiz_df the names are the row index, so they cannot be accessed as a column.

quiz_df.loc['Anjali']
quiz_1    79
quiz_2    78
quiz_3    85
quiz_4    88
Name: Anjali, dtype: int64

C.10 Practice Exercise 1-10

Show the scores on quizzes 1 through 3 for Sam and Erin.

quiz_df.loc[['Sam', 'Erin'], 'quiz_1':'quiz_3']
quiz_1 quiz_2 quiz_3
name
Sam 74 75 88
Erin 90 88 95

Now, let’s load the NBA boxscore data from the 2023-24 season so that we can use it in some of the exercises below. Recall that in this dataset each row represents the statistics for one player in one game.

url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)

C.11 Practice Exercise 1-11

Show counts by player of games in which the player scored 50+ points during the season. Hint: Remember that each row represents a player-game, so you should limit the data to rows in which 50 or more points were scored and then count the rows for each player.

boxscores.loc[boxscores['pts'] >= 50, 'player'].value_counts()
player
T. Maxey            3
D. Booker           3
J. Embiid           3
J. Brunson          2
G. Antetokounmpo    2
L. Doncic           2
Z. LaVine           1
K. Towns            1
Steph Curry         1
M. Flynn            1
A. Edwards          1
Name: count, dtype: int64
# Alternative solution using .query()
boxscores.query('pts >= 50')['player'].value_counts()
player
T. Maxey            3
D. Booker           3
J. Embiid           3
J. Brunson          2
G. Antetokounmpo    2
L. Doncic           2
Z. LaVine           1
K. Towns            1
Steph Curry         1
M. Flynn            1
A. Edwards          1
Name: count, dtype: int64

C.12 Practice Exercise 1-12

Let’s look for some amazing guard play. Display player, ast and stl, and pts for player-games in which a player had 20 or more assists or 7 or more steals.

boxscores.loc[(boxscores['ast'] >= 20) | (boxscores['stl'] >= 7),
              ['player', 'ast', 'stl', 'pts']]
player ast stl pts
3282 S. Gilgeous-Alexander 5 7 28
9726 T. Haliburton 20 2 21
10128 T. Haliburton 23 2 22
17842 H. Jones 4 7 17
20672 A. Davis 5 7 27
25572 Tre Mann 7 7 16
# Alternative solution using .query()
boxscores.query('ast >= 20 or stl >=7').loc[:, ['player', 'ast', 'stl', 'pts']]
player ast stl pts
3282 S. Gilgeous-Alexander 5 7 28
9726 T. Haliburton 20 2 21
10128 T. Haliburton 23 2 22
17842 H. Jones 4 7 17
20672 A. Davis 5 7 27
25572 Tre Mann 7 7 16

C.13 Practice Exercise 1-13

Show player name and count of games (rows) in which the player had 10 or more assists. Use the head() method to limit the result to 10 rows. In other words, create a top-10 list of players according to how many games in which they had 10 or more assists.

boxscores.loc[boxscores['ast'] >= 10, 'player'].value_counts().head(10)
player
T. Haliburton    47
L. Doncic        37
Trae Young       37
N. Jokic         31
D. Sabonis       29
J. Harden        27
F. VanVleet      24
L. James         18
D. Murray        16
C. Cunningham    16
Name: count, dtype: int64
# Alternative solution using .query()
boxscores.query('ast >= 10')['player'].value_counts().head(10)
player
T. Haliburton    47
L. Doncic        37
Trae Young       37
N. Jokic         31
D. Sabonis       29
J. Harden        27
F. VanVleet      24
L. James         18
D. Murray        16
C. Cunningham    16
Name: count, dtype: int64

C.14 Practice Exercise 1-14

Show player name, points, rebounds, and assists for all the player-games in which the player had at least ten points, ten rebounds, and ten assists (“triple double”). Sort in decreasing order of points.

(boxscores
 .loc[(boxscores['pts'] >=10) & 
      (boxscores['reb'] >= 10) &
      (boxscores['ast'] >= 10), 
 ['player', 'pts', 'reb', 'ast']]
 .sort_values(by = 'pts', ascending = False)
)
player pts reb ast
7964 L. Doncic 40 12 10
6523 L. Doncic 40 10 11
20389 L. Doncic 39 10 10
9208 L. Doncic 39 12 10
25233 L. Doncic 39 12 10
... ... ... ... ...
22777 K. Middleton 11 10 10
22886 J. Hart 11 14 10
15081 J. Hart 10 10 10
21858 J. Hart 10 11 11
24225 Tre Jones 10 12 11

136 rows × 4 columns

# Alternative solution using .query()
(boxscores
 .query('pts >= 10 & reb >= 10 & ast >= 10')
 .loc[:, ['player', 'pts', 'reb', 'ast']]
 .sort_values(by = 'pts', ascending = False)
)
player pts reb ast
7964 L. Doncic 40 12 10
6523 L. Doncic 40 10 11
20389 L. Doncic 39 10 10
9208 L. Doncic 39 12 10
25233 L. Doncic 39 12 10
... ... ... ... ...
22777 K. Middleton 11 10 10
22886 J. Hart 11 14 10
15081 J. Hart 10 10 10
21858 J. Hart 10 11 11
24225 Tre Jones 10 12 11

136 rows × 4 columns

C.15 Practice Exercise 1-15

Display the entire stat line for any player who scored 40 or more points and 10 or more rebounds in a game while playing for the Lakers (‘LAL’ in the team column).

(boxscores
 .loc[(boxscores['team'] == 'LAL') &
      (boxscores['pts'] >= 40) &
      (boxscores['reb'] >= 10)]
)
date team opponent player position min oreb dreb reb ast ... to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
9312 20231225 LAL BOS A. Davis PF 39 5 8 13 4 ... 2 2 -6 40 15 26 2 4 8 10
11779 20240109 LAL TOR A. Davis PF 40 1 10 11 6 ... 3 4 0 41 13 17 2 2 13 14
19086 20240229 LAL WSH A. Davis PF 41 3 12 15 4 ... 1 2 6 40 14 22 0 0 12 13

3 rows × 22 columns

# Alternative solution using .query()
boxscores.query('team == "LAL" & pts >= 40 & reb >= 10')
date team opponent player position min oreb dreb reb ast ... to pf plus_m pts fg_m fg_a tres_m tres_a ft_m ft_a
9312 20231225 LAL BOS A. Davis PF 39 5 8 13 4 ... 2 2 -6 40 15 26 2 4 8 10
11779 20240109 LAL TOR A. Davis PF 40 1 10 11 6 ... 3 4 0 41 13 17 2 2 13 14
19086 20240229 LAL WSH A. Davis PF 41 3 12 15 4 ... 1 2 6 40 14 22 0 0 12 13

3 rows × 22 columns

C.16 Practice Exercise 1-16

How many players are in the boxscores data?

boxscores['player'].nunique()
584

C.17 Practice Exercise 1-17

Convert the date column in the boxscore data to a datetime type.

boxscores['date'] = pd.to_datetime(boxscores['date'], format = "%Y%m%d")

C.18 Practice Exercise 1-18

Show the date, player, and points for the top 5 games (top 5 points scored by a player in a game) that occurred in January, 2024.

(boxscores
 .loc[boxscores['date'].dt.month == 1,['date', 'player', 'pts']]
 .sort_values(by = 'pts', ascending = False)
 .head(5)
)
date player pts
14280 2024-01-26 L. Doncic 73
13726 2024-01-22 J. Embiid 70
13768 2024-01-22 K. Towns 62
14327 2024-01-26 D. Booker 62
13307 2024-01-19 D. Booker 52

C.19 Practice Exercise 1-19

Show all the unique player names that include the string “James”. Make the test case-sensitive.

(boxscores
 .loc[boxscores['player'].str.contains('James'), 'player']
 .unique()
)
array(['L. James', 'James Johnson'], dtype=object)

C.20 Practice Exercise 1-20

From the sales data, show descriptive statistics for quantity ordered for product ID ‘FUR-CH-10001146’.

store.loc[store['Product ID'] == 'FUR-CH-10001146', 'Quantity'].describe()
count    15.000000
mean      3.400000
std       1.723783
min       1.000000
25%       2.000000
50%       3.000000
75%       4.500000
max       7.000000
Name: Quantity, dtype: float64

The code below creates the scores_df DataFrame, which has some missing values. It shows course section and quiz scores for some students.

scores_df = pd.DataFrame(
    {
        'section': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A'],
        'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'q1': [99, 87, 55, 100, 75, None, 88, 90, 92, np.nan],
        'q2': [95, 90, 78, 94, 85, 90, 90, 89, 99, 100],
        'q3': [None, np.nan, None, 75, 93, None, 85, 88, 90, 92]
    }
)

scores_df
section student q1 q2 q3
0 A 1 99.0 95 NaN
1 B 2 87.0 90 NaN
2 A 3 55.0 78 NaN
3 A 4 100.0 94 75.0
4 B 5 75.0 85 93.0
5 B 6 NaN 90 NaN
6 A 7 88.0 90 85.0
7 A 8 90.0 89 88.0
8 B 9 92.0 99 90.0
9 A 10 NaN 100 92.0

C.21 Practice Exercise 1-21

Show all rows in the scores DataFrame for which the quiz 3 score is missing.

scores_df.loc[scores_df['q3'].isna()]
section student q1 q2 q3
0 A 1 99.0 95 NaN
1 B 2 87.0 90 NaN
2 A 3 55.0 78 NaN
5 B 6 NaN 90 NaN

C.22 Practice Exercise 1-22

Show all rows in the scores DataFrame for which the quiz 3 score is NOT missing.

scores_df.loc[scores_df['q3'].notna()]
section student q1 q2 q3
3 A 4 100.0 94 75.0
4 B 5 75.0 85 93.0
6 A 7 88.0 90 85.0
7 A 8 90.0 89 88.0
8 B 9 92.0 99 90.0
9 A 10 NaN 100 92.0

C.23 Practice Exercise 1-23

Display the q1 column of the scores_df DataFrame with the missing values in that column replaced by zeros.

scores_df['q1'].fillna(0)
0     99.0
1     87.0
2     55.0
3    100.0
4     75.0
5      0.0
6     88.0
7     90.0
8     92.0
9      0.0
Name: q1, dtype: float64

C.24 Practice Exercise 1-24

Display the quiz_df, then add a new column to the DataFrame that has the quiz average for each student. Display the DataFrame again after adding the new column.

quiz_df
quiz_1 quiz_2 quiz_3 quiz_4
name
Hannah 89 73 92 100
Sam 74 75 88 90
Anjali 79 78 85 88
Erin 90 88 95 92
Latasha 95 86 100 98
quiz_df['quiz_avg'] = (quiz_df['quiz_1'] + 
                       quiz_df['quiz_2'] + 
                       quiz_df['quiz_3'] + 
                       quiz_df['quiz_4']) / 4

quiz_df
quiz_1 quiz_2 quiz_3 quiz_4 quiz_avg
name
Hannah 89 73 92 100 88.50
Sam 74 75 88 90 81.75
Anjali 79 78 85 88 82.50
Erin 90 88 95 92 91.25
Latasha 95 86 100 98 94.75

C.25 Practice Exercise 1-25

Delete the new column you created for the quiz average from the quiz_df DataFrame and then display the DataFrame to verify that the column is gone.

quiz_df.drop('quiz_avg', axis = 1, inplace = True)

quiz_df
quiz_1 quiz_2 quiz_3 quiz_4
name
Hannah 89 73 92 100
Sam 74 75 88 90
Anjali 79 78 85 88
Erin 90 88 95 92
Latasha 95 86 100 98