import pandas as pd
import numpy as np
Appendix C — Chapter 1 Practice - Solutions
Click here to download this chapter as a Jupyter (.ipynb) file.
These are suggested solutions to the practice exercises for the Pandas Fundamentals chapter. For some of the exercises alternative solutions are presented.
The code below loads the superstore.csv
dataset from an online server into a DataFrame named store
. This data describes retail sales for a business. Each row of the dataset represents one product on one order. An order that included multiple products will have multiple rows, each with the same order ID but a different product ID.
= pd.read_csv("https://neuronjolt.com/data/superstore.csv") store
C.1 Practice Exercise 1-1
Use the .info()
method to learn more about the store
DataFrame
store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 9994 non-null object
1 Order Date 9994 non-null object
2 Ship Date 9994 non-null object
3 Ship Mode 9994 non-null object
4 Customer ID 9994 non-null object
5 Customer Name 9994 non-null object
6 Segment 9994 non-null object
7 Country 9994 non-null object
8 City 9994 non-null object
9 State 9994 non-null object
10 Postal Code 9994 non-null int64
11 Region 9994 non-null object
12 Product ID 9994 non-null object
13 Category 9994 non-null object
14 Sub-Category 9994 non-null object
15 Product Name 9994 non-null object
16 Sales 9994 non-null float64
17 Quantity 9994 non-null int64
18 Discount 9994 non-null float64
19 Profit 9994 non-null float64
dtypes: float64(3), int64(2), object(15)
memory usage: 1.5+ MB
C.2 Practice Exercise 1-2
Use the .sample()
method to examine 3 randomly-selected rows of data from the store
DataFrame
3) store.sample(
Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5053 | CA-2015-141243 | 1/3/2015 | 1/8/2015 | Second Class | AH-10465 | Amy Hunt | Consumer | United States | Dallas | Texas | 75217 | Central | TEC-AC-10003198 | Technology | Accessories | Enermax Acrylux Wireless Keyboard | 398.400 | 5 | 0.2 | 84.6600 |
4774 | CA-2017-119746 | 11/23/2017 | 11/27/2017 | Standard Class | CM-12385 | Christopher Martinez | Consumer | United States | Chicago | Illinois | 60610 | Central | TEC-PH-10004447 | Technology | Phones | Toshiba IPT2010-SD IP Telephone | 222.384 | 2 | 0.2 | 16.6788 |
4204 | CA-2014-145387 | 10/31/2014 | 11/2/2014 | Second Class | AM-10705 | Anne McFarland | Consumer | United States | Cranston | Rhode Island | 2920 | East | OFF-PA-10002195 | Office Supplies | Paper | RSVP Cards & Envelopes, Blank White, 8-1/2" X ... | 10.160 | 2 | 0.0 | 4.7752 |
C.3 Practice Exercise 1-3
Display the names of the columns in the store
DataFrame.
store.columns
Index(['Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
'Sales', 'Quantity', 'Discount', 'Profit'],
dtype='object')
C.4 Practice Exercise 1-4
Display the count of unique orders (unique Order IDs) in the store
DataFrame.
'Order ID'].nunique() store[
5009
C.5 Practice Exercise 1-5
Display the counts of rows in the store
DataFrame for each unique product ID. Your result should show the unique product IDs with an integer beside each one that represents the count of rows in the dataset that have that product ID.
'Product ID'].value_counts() store[
Product ID
OFF-PA-10001970 19
TEC-AC-10003832 18
FUR-FU-10004270 16
FUR-CH-10002647 15
FUR-CH-10001146 15
..
TEC-MA-10004552 1
TEC-MA-10003589 1
OFF-AP-10003099 1
TEC-PH-10002645 1
OFF-ST-10001627 1
Name: count, Length: 1862, dtype: int64
C.6 Practice Exercise 1-6
Using just one pandas method, display descriptive statistics (including count, mean, std deviation, minimum, 25th percentile, median, 75th percentile, and maximum) for the sales amount of the order lines.
'Sales'].describe() store[
count 9994.000000
mean 229.858001
std 623.245101
min 0.444000
25% 17.280000
50% 54.490000
75% 209.940000
max 22638.480000
Name: Sales, dtype: float64
The next exercises makes use of the quiz_df
DataFrame, which shows quiz scores for several students. The DataFrame is created and displayed in the code cell below.
= {
quizzes 'name': ["Hannah", "Sam", "Anjali", "Erin", "Latasha"],
'quiz_1': [89, 74, 79, 90, 95],
'quiz_2': [73, 75, 78, 88, 86],
'quiz_3': [92, 88, 85, 95, 100],
'quiz_4': [100, 90, 88, 92, 98]
}
= pd.DataFrame(quizzes).set_index('name')
quiz_df
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Hannah | 89 | 73 | 92 | 100 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
C.7 Practice Exercise 1-7
Using one pandas method, calculate and display the median quiz score for each student in the quiz_df
DataFrame
= 1) quiz_df.median(axis
name
Hannah 90.5
Sam 81.5
Anjali 82.0
Erin 91.0
Latasha 96.5
dtype: float64
C.8 Practice Exercise 1-8
Calculate and display the minimum quiz score for each quiz in the quiz_df
DataFrame
min(axis = 0) quiz_df.
quiz_1 74
quiz_2 73
quiz_3 85
quiz_4 88
dtype: int64
C.9 Practice Exercise 1-9
Show the quiz scores for just Anjali. Hint: In quiz_df
the names are the row index, so they cannot be accessed as a column.
'Anjali'] quiz_df.loc[
quiz_1 79
quiz_2 78
quiz_3 85
quiz_4 88
Name: Anjali, dtype: int64
C.10 Practice Exercise 1-10
Show the scores on quizzes 1 through 3 for Sam and Erin.
'Sam', 'Erin'], 'quiz_1':'quiz_3'] quiz_df.loc[[
quiz_1 | quiz_2 | quiz_3 | |
---|---|---|---|
name | |||
Sam | 74 | 75 | 88 |
Erin | 90 | 88 | 95 |
Now, let’s load the NBA boxscore data from the 2023-24 season so that we can use it in some of the exercises below. Recall that in this dataset each row represents the statistics for one player in one game.
= "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
url = pd.read_csv(url) boxscores
C.11 Practice Exercise 1-11
Show counts by player of games in which the player scored 50+ points during the season. Hint: Remember that each row represents a player-game, so you should limit the data to rows in which 50 or more points were scored and then count the rows for each player.
'pts'] >= 50, 'player'].value_counts() boxscores.loc[boxscores[
player
T. Maxey 3
D. Booker 3
J. Embiid 3
J. Brunson 2
G. Antetokounmpo 2
L. Doncic 2
Z. LaVine 1
K. Towns 1
Steph Curry 1
M. Flynn 1
A. Edwards 1
Name: count, dtype: int64
# Alternative solution using .query()
'pts >= 50')['player'].value_counts() boxscores.query(
player
T. Maxey 3
D. Booker 3
J. Embiid 3
J. Brunson 2
G. Antetokounmpo 2
L. Doncic 2
Z. LaVine 1
K. Towns 1
Steph Curry 1
M. Flynn 1
A. Edwards 1
Name: count, dtype: int64
C.12 Practice Exercise 1-12
Let’s look for some amazing guard play. Display player
, ast
and stl
, and pts
for player-games in which a player had 20 or more assists or 7 or more steals.
'ast'] >= 20) | (boxscores['stl'] >= 7),
boxscores.loc[(boxscores['player', 'ast', 'stl', 'pts']] [
player | ast | stl | pts | |
---|---|---|---|---|
3282 | S. Gilgeous-Alexander | 5 | 7 | 28 |
9726 | T. Haliburton | 20 | 2 | 21 |
10128 | T. Haliburton | 23 | 2 | 22 |
17842 | H. Jones | 4 | 7 | 17 |
20672 | A. Davis | 5 | 7 | 27 |
25572 | Tre Mann | 7 | 7 | 16 |
# Alternative solution using .query()
'ast >= 20 or stl >=7').loc[:, ['player', 'ast', 'stl', 'pts']] boxscores.query(
player | ast | stl | pts | |
---|---|---|---|---|
3282 | S. Gilgeous-Alexander | 5 | 7 | 28 |
9726 | T. Haliburton | 20 | 2 | 21 |
10128 | T. Haliburton | 23 | 2 | 22 |
17842 | H. Jones | 4 | 7 | 17 |
20672 | A. Davis | 5 | 7 | 27 |
25572 | Tre Mann | 7 | 7 | 16 |
C.13 Practice Exercise 1-13
Show player name and count of games (rows) in which the player had 10 or more assists. Use the head()
method to limit the result to 10 rows. In other words, create a top-10 list of players according to how many games in which they had 10 or more assists.
'ast'] >= 10, 'player'].value_counts().head(10) boxscores.loc[boxscores[
player
T. Haliburton 47
L. Doncic 37
Trae Young 37
N. Jokic 31
D. Sabonis 29
J. Harden 27
F. VanVleet 24
L. James 18
D. Murray 16
C. Cunningham 16
Name: count, dtype: int64
# Alternative solution using .query()
'ast >= 10')['player'].value_counts().head(10) boxscores.query(
player
T. Haliburton 47
L. Doncic 37
Trae Young 37
N. Jokic 31
D. Sabonis 29
J. Harden 27
F. VanVleet 24
L. James 18
D. Murray 16
C. Cunningham 16
Name: count, dtype: int64
C.14 Practice Exercise 1-14
Show player name, points, rebounds, and assists for all the player-games in which the player had at least ten points, ten rebounds, and ten assists (“triple double”). Sort in decreasing order of points.
(boxscores'pts'] >=10) &
.loc[(boxscores['reb'] >= 10) &
(boxscores['ast'] >= 10),
(boxscores['player', 'pts', 'reb', 'ast']]
[= 'pts', ascending = False)
.sort_values(by )
player | pts | reb | ast | |
---|---|---|---|---|
7964 | L. Doncic | 40 | 12 | 10 |
6523 | L. Doncic | 40 | 10 | 11 |
20389 | L. Doncic | 39 | 10 | 10 |
9208 | L. Doncic | 39 | 12 | 10 |
25233 | L. Doncic | 39 | 12 | 10 |
... | ... | ... | ... | ... |
22777 | K. Middleton | 11 | 10 | 10 |
22886 | J. Hart | 11 | 14 | 10 |
15081 | J. Hart | 10 | 10 | 10 |
21858 | J. Hart | 10 | 11 | 11 |
24225 | Tre Jones | 10 | 12 | 11 |
136 rows × 4 columns
# Alternative solution using .query()
(boxscores'pts >= 10 & reb >= 10 & ast >= 10')
.query('player', 'pts', 'reb', 'ast']]
.loc[:, [= 'pts', ascending = False)
.sort_values(by )
player | pts | reb | ast | |
---|---|---|---|---|
7964 | L. Doncic | 40 | 12 | 10 |
6523 | L. Doncic | 40 | 10 | 11 |
20389 | L. Doncic | 39 | 10 | 10 |
9208 | L. Doncic | 39 | 12 | 10 |
25233 | L. Doncic | 39 | 12 | 10 |
... | ... | ... | ... | ... |
22777 | K. Middleton | 11 | 10 | 10 |
22886 | J. Hart | 11 | 14 | 10 |
15081 | J. Hart | 10 | 10 | 10 |
21858 | J. Hart | 10 | 11 | 11 |
24225 | Tre Jones | 10 | 12 | 11 |
136 rows × 4 columns
C.15 Practice Exercise 1-15
Display the entire stat line for any player who scored 40 or more points and 10 or more rebounds in a game while playing for the Lakers (‘LAL’ in the team
column).
(boxscores'team'] == 'LAL') &
.loc[(boxscores['pts'] >= 40) &
(boxscores['reb'] >= 10)]
(boxscores[ )
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9312 | 20231225 | LAL | BOS | A. Davis | PF | 39 | 5 | 8 | 13 | 4 | ... | 2 | 2 | -6 | 40 | 15 | 26 | 2 | 4 | 8 | 10 |
11779 | 20240109 | LAL | TOR | A. Davis | PF | 40 | 1 | 10 | 11 | 6 | ... | 3 | 4 | 0 | 41 | 13 | 17 | 2 | 2 | 13 | 14 |
19086 | 20240229 | LAL | WSH | A. Davis | PF | 41 | 3 | 12 | 15 | 4 | ... | 1 | 2 | 6 | 40 | 14 | 22 | 0 | 0 | 12 | 13 |
3 rows × 22 columns
# Alternative solution using .query()
'team == "LAL" & pts >= 40 & reb >= 10') boxscores.query(
date | team | opponent | player | position | min | oreb | dreb | reb | ast | ... | to | pf | plus_m | pts | fg_m | fg_a | tres_m | tres_a | ft_m | ft_a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9312 | 20231225 | LAL | BOS | A. Davis | PF | 39 | 5 | 8 | 13 | 4 | ... | 2 | 2 | -6 | 40 | 15 | 26 | 2 | 4 | 8 | 10 |
11779 | 20240109 | LAL | TOR | A. Davis | PF | 40 | 1 | 10 | 11 | 6 | ... | 3 | 4 | 0 | 41 | 13 | 17 | 2 | 2 | 13 | 14 |
19086 | 20240229 | LAL | WSH | A. Davis | PF | 41 | 3 | 12 | 15 | 4 | ... | 1 | 2 | 6 | 40 | 14 | 22 | 0 | 0 | 12 | 13 |
3 rows × 22 columns
C.16 Practice Exercise 1-16
How many players are in the boxscores
data?
'player'].nunique() boxscores[
584
C.17 Practice Exercise 1-17
Convert the date column in the boxscore data to a datetime type.
'date'] = pd.to_datetime(boxscores['date'], format = "%Y%m%d") boxscores[
C.18 Practice Exercise 1-18
Show the date, player, and points for the top 5 games (top 5 points scored by a player in a game) that occurred in January, 2024.
(boxscores'date'].dt.month == 1,['date', 'player', 'pts']]
.loc[boxscores[= 'pts', ascending = False)
.sort_values(by 5)
.head( )
date | player | pts | |
---|---|---|---|
14280 | 2024-01-26 | L. Doncic | 73 |
13726 | 2024-01-22 | J. Embiid | 70 |
13768 | 2024-01-22 | K. Towns | 62 |
14327 | 2024-01-26 | D. Booker | 62 |
13307 | 2024-01-19 | D. Booker | 52 |
C.19 Practice Exercise 1-19
Show all the unique player names that include the string “James”. Make the test case-sensitive.
(boxscores'player'].str.contains('James'), 'player']
.loc[boxscores[
.unique() )
array(['L. James', 'James Johnson'], dtype=object)
C.20 Practice Exercise 1-20
From the sales data, show descriptive statistics for quantity ordered for product ID ‘FUR-CH-10001146’.
'Product ID'] == 'FUR-CH-10001146', 'Quantity'].describe() store.loc[store[
count 15.000000
mean 3.400000
std 1.723783
min 1.000000
25% 2.000000
50% 3.000000
75% 4.500000
max 7.000000
Name: Quantity, dtype: float64
The code below creates the scores_df
DataFrame, which has some missing values. It shows course section and quiz scores for some students.
= pd.DataFrame(
scores_df
{'section': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A'],
'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'q1': [99, 87, 55, 100, 75, None, 88, 90, 92, np.nan],
'q2': [95, 90, 78, 94, 85, 90, 90, 89, 99, 100],
'q3': [None, np.nan, None, 75, 93, None, 85, 88, 90, 92]
}
)
scores_df
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
1 | B | 2 | 87.0 | 90 | NaN |
2 | A | 3 | 55.0 | 78 | NaN |
3 | A | 4 | 100.0 | 94 | 75.0 |
4 | B | 5 | 75.0 | 85 | 93.0 |
5 | B | 6 | NaN | 90 | NaN |
6 | A | 7 | 88.0 | 90 | 85.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
C.21 Practice Exercise 1-21
Show all rows in the scores DataFrame for which the quiz 3 score is missing.
'q3'].isna()] scores_df.loc[scores_df[
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
0 | A | 1 | 99.0 | 95 | NaN |
1 | B | 2 | 87.0 | 90 | NaN |
2 | A | 3 | 55.0 | 78 | NaN |
5 | B | 6 | NaN | 90 | NaN |
C.22 Practice Exercise 1-22
Show all rows in the scores DataFrame for which the quiz 3 score is NOT missing.
'q3'].notna()] scores_df.loc[scores_df[
section | student | q1 | q2 | q3 | |
---|---|---|---|---|---|
3 | A | 4 | 100.0 | 94 | 75.0 |
4 | B | 5 | 75.0 | 85 | 93.0 |
6 | A | 7 | 88.0 | 90 | 85.0 |
7 | A | 8 | 90.0 | 89 | 88.0 |
8 | B | 9 | 92.0 | 99 | 90.0 |
9 | A | 10 | NaN | 100 | 92.0 |
C.23 Practice Exercise 1-23
Display the q1
column of the scores_df
DataFrame with the missing values in that column replaced by zeros.
'q1'].fillna(0) scores_df[
0 99.0
1 87.0
2 55.0
3 100.0
4 75.0
5 0.0
6 88.0
7 90.0
8 92.0
9 0.0
Name: q1, dtype: float64
C.24 Practice Exercise 1-24
Display the quiz_df
, then add a new column to the DataFrame that has the quiz average for each student. Display the DataFrame again after adding the new column.
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Hannah | 89 | 73 | 92 | 100 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |
'quiz_avg'] = (quiz_df['quiz_1'] +
quiz_df['quiz_2'] +
quiz_df['quiz_3'] +
quiz_df['quiz_4']) / 4
quiz_df[
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | quiz_avg | |
---|---|---|---|---|---|
name | |||||
Hannah | 89 | 73 | 92 | 100 | 88.50 |
Sam | 74 | 75 | 88 | 90 | 81.75 |
Anjali | 79 | 78 | 85 | 88 | 82.50 |
Erin | 90 | 88 | 95 | 92 | 91.25 |
Latasha | 95 | 86 | 100 | 98 | 94.75 |
C.25 Practice Exercise 1-25
Delete the new column you created for the quiz average from the quiz_df
DataFrame and then display the DataFrame to verify that the column is gone.
'quiz_avg', axis = 1, inplace = True)
quiz_df.drop(
quiz_df
quiz_1 | quiz_2 | quiz_3 | quiz_4 | |
---|---|---|---|---|
name | ||||
Hannah | 89 | 73 | 92 | 100 |
Sam | 74 | 75 | 88 | 90 |
Anjali | 79 | 78 | 85 | 88 |
Erin | 90 | 88 | 95 | 92 |
Latasha | 95 | 86 | 100 | 98 |