import pandas as pd
import numpy as np
# Import the store data
= pd.read_csv("https://neuronjolt.com/data/superstore.csv")
store
# Convert the Order Date column in the store date to a datetime type
'Order Date'] = pd.to_datetime(store['Order Date'])
store[
# Import the NBA boxscore data
= "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
url = pd.read_csv(url)
boxscores
# Convert the date column in the box score data to a datetime type
'date'] = pd.to_datetime(boxscores['date'], format = '%Y%m%d') boxscores[
Appendix E — Chapter 2 Practice - Solutions
Click here to download this chapter as a Jupyter (.ipynb) file.
These are suggested solutions to the practice exercises for the Grouping and Aggregating Data with Pandas chapter. For some of the exercises multiple alternative solutions are presented.
The code below loads pandas and numpy and reads two datasets into DataFrames. The first DataFrame, store
, describes retail sales for a business. Each row of the dataset represents one product on one order. An order that included multiple products will have multiple rows, each with the same order ID but a different product ID. Now that we have covered grouping in pandas we have the tools to group the rows by order ID so that we can calculate various statistics related to orders.
The boxscores
DataFrame is the box score data from the 2023-24 NBA season. Each row represents a player-game, that is, one player’s statistics in one game. Grouping will also help us gain more insight from this dataset, as we can now look at statistics by player, by team, or by game for various time periods.
E.1 Practice Exercise 2-1
How many orders are there in the store
DataFrame?
'Order ID'].nunique() store[
5009
E.2 Practice Exercise 2-2
What is the date range of the orders in the store
DataFrame?
# Solution 1
'Order Date'].describe() store[
count 9994
mean 2016-04-30 00:07:12.259355648
min 2014-01-03 00:00:00
25% 2015-05-23 00:00:00
50% 2016-06-26 00:00:00
75% 2017-05-14 00:00:00
max 2017-12-30 00:00:00
Name: Order Date, dtype: object
# Solution 2
print(f"Earliest Date: {store['Order Date'].min()}")
print(f"Latest Date: {store['Order Date'].max()}")
Earliest Date: 2014-01-03 00:00:00
Latest Date: 2017-12-30 00:00:00
E.3 Practice Exercise 2-3
Add a column to the boxscores
DataFrame that shows the rank of the player-game according to points scored, with the highest points having rank 1. Use the min
ranking method so that all player-games with the same number of points have the same rank, and that rank is the minimum rank of the group.
Then, use nsmallest()
to show date, player, points, and points rank for the player-games ranked in the top 5.
Explanation of Solution Code:
- We create the new column in the
boxscores
DataFrame by directly assigning values to it - The output shows the smallest 5 ranks, but there is no rank 5. Why is that? Since we set the
method = 'min'
parameter within therank()
method ties all receive the lowest rank for the group. There is a tie between two player-games at rank 4, so they both get rank 4. The next-highest would be rank 6, because there would be 5 player-games higher than it - A new code cell is added below to show 15 smallest ranks, so you can see how the ties are handled
'pts_rank'] = boxscores['pts'].rank(method = 'min',
boxscores[= False)
ascending = 5, columns = 'pts_rank')[['date', 'player',
boxscores.nsmallest(n 'pts', 'pts_rank']]
date | player | pts | pts_rank | |
---|---|---|---|---|
14280 | 2024-01-26 | L. Doncic | 73 | 1.0 |
13726 | 2024-01-22 | J. Embiid | 70 | 2.0 |
7396 | 2023-12-13 | G. Antetokounmpo | 64 | 3.0 |
13768 | 2024-01-22 | K. Towns | 62 | 4.0 |
14327 | 2024-01-26 | D. Booker | 62 | 4.0 |
# Added this code to illustrate how ties are handled with the 'min' method
= 15, columns = 'pts_rank')[['date', 'player',
boxscores.nsmallest(n 'pts', 'pts_rank']]
date | player | pts | pts_rank | |
---|---|---|---|---|
14280 | 2024-01-26 | L. Doncic | 73 | 1.0 |
13726 | 2024-01-22 | J. Embiid | 70 | 2.0 |
7396 | 2023-12-13 | G. Antetokounmpo | 64 | 3.0 |
13768 | 2024-01-22 | K. Towns | 62 | 4.0 |
14327 | 2024-01-26 | D. Booker | 62 | 4.0 |
23635 | 2024-03-29 | J. Brunson | 61 | 6.0 |
15660 | 2024-02-03 | Steph Curry | 60 | 7.0 |
2560 | 2023-11-09 | G. Antetokounmpo | 54 | 8.0 |
13307 | 2024-01-19 | D. Booker | 52 | 9.0 |
24107 | 2024-04-01 | D. Booker | 52 | 9.0 |
25143 | 2024-04-07 | T. Maxey | 52 | 9.0 |
575 | 2023-10-28 | Z. LaVine | 51 | 12.0 |
8542 | 2023-12-20 | J. Embiid | 51 | 12.0 |
15401 | 2024-02-01 | T. Maxey | 51 | 12.0 |
25423 | 2024-04-09 | A. Edwards | 51 | 12.0 |
E.4 Practice Exercise 2-4
Show date, player, points, rebounds, assists, and steals for player-games that rank in the top 10 (using the ‘min’ ranking method) for assists. Sort by assists in descending order.
Explanation of Solution Code:
- Since we aren’t asked to display the rank we don’t need to create a new column for it. We can just use it to create a Boolean array within the
loc[]
method - Notice that there are 16 rows in the top 10. This is because there is a tie with many player-games at 17 assists
(boxscores'ast'].rank(ascending = False, method = 'min') <= 10,
.loc[boxscores['date', 'player', 'pts', 'reb', 'ast', 'stl']]
[= 'ast', ascending = False)
.sort_values(by )
date | player | pts | reb | ast | stl | |
---|---|---|---|---|---|---|
10128 | 2023-12-30 | T. Haliburton | 22 | 5 | 23 | 2 |
9726 | 2023-12-28 | T. Haliburton | 21 | 3 | 20 | 2 |
3787 | 2023-11-17 | N. Jokic | 26 | 16 | 18 | 1 |
6175 | 2023-12-02 | L. Doncic | 36 | 15 | 18 | 2 |
10995 | 2024-01-05 | T. Haliburton | 10 | 8 | 18 | 2 |
20109 | 2024-03-07 | I. Quickley | 21 | 9 | 18 | 1 |
2943 | 2023-11-12 | T. Haliburton | 25 | 1 | 17 | 2 |
3445 | 2023-11-15 | Trae Young | 15 | 1 | 17 | 2 |
7197 | 2023-12-12 | L. Doncic | 33 | 6 | 17 | 1 |
7345 | 2023-12-13 | Trae Young | 35 | 4 | 17 | 0 |
8149 | 2023-12-18 | F. VanVleet | 27 | 8 | 17 | 2 |
13330 | 2024-01-19 | T. Haliburton | 21 | 2 | 17 | 0 |
14658 | 2024-01-27 | L. Doncic | 28 | 10 | 17 | 2 |
17583 | 2024-02-14 | D. Russell | 11 | 9 | 17 | 1 |
18630 | 2024-02-27 | Tyus Jones | 14 | 6 | 17 | 2 |
26360 | 2024-04-14 | L. James | 28 | 11 | 17 | 5 |
E.5 Practice Exercise 2-5
The transform()
method of a GroupBy object creates the result of an aggregating function for every row of the input DataFrame. Use assign()
to add a temporary column to the boxscores
DataFrame that has the player’s total points scored for the season. Then, show player, date, pts, and the total points column for 5 randomly-selected games by L. James (LeBron James).
Explanation of Solution Code:
E.5.0.1 Solution 1
- We use
query()
to avoid having to use a lambda function within theloc[]
- Notice that total points is the same on every row. This is what the
transform()
method does; it creates the aggregate value for every row that has the group key. So, in this example, evert player-game for the player L. James gets the total points he scored in the season.
# Solution 1
(boxscores= boxscores.groupby('player')['pts'].transform('sum'))
.assign(tot_pts 'player == "L. James"')
.query('player', 'date', 'pts', 'tot_pts']]
.loc[:, [5)
.sample( )
player | date | pts | tot_pts | |
---|---|---|---|---|
2 | L. James | 2023-10-24 | 21 | 1822 |
15699 | L. James | 2024-02-03 | 24 | 1822 |
7188 | L. James | 2023-12-12 | 33 | 1822 |
4184 | L. James | 2023-11-19 | 37 | 1822 |
5324 | L. James | 2023-11-27 | 18 | 1822 |
Explanation of Solution Code:
E.5.0.2 Solution 2
- We use a lambda function within the
loc[]
because the DataFrame coming down from the previous step doesn’t have a name. The lambda function gives the incoming DataFrame the namedf
and then creates the Boolean array for the.loc[]
row selector based on values in the player column of the incoming DataFramedf['player']
).
# Solution 2
(boxscores= boxscores.groupby('player')['pts'].transform('sum'))
.assign(tot_pts lambda df: df['player'] == 'L. James',
.loc['player', 'date', 'pts', 'tot_pts']]
[5)
.sample( )
player | date | pts | tot_pts | |
---|---|---|---|---|
2392 | L. James | 2023-11-08 | 18 | 1822 |
4474 | L. James | 2023-11-21 | 17 | 1822 |
24326 | L. James | 2024-04-03 | 25 | 1822 |
347 | L. James | 2023-10-26 | 21 | 1822 |
19685 | L. James | 2024-03-04 | 19 | 1822 |
E.6 Practice Exercise 2-6
Create a new column in the boxscores
DataFrame named superstar
. Put ‘yes’ in this column for any player ranked in the top 5 (by ‘min’ ranking method) for season total points or assists. Next, display player name and superstar status for all the superstars.
Explanation of Solution Code:
- To rank the point and assist totals correctly we need just one set of totals per player, so we will need to break this up into steps:
- First, calculate the total points and assists for each player
- Next, add the ranks for points and assists, using the ‘min’ method
- Keep only rows for which either rank is 5 or less
- Save the list of players from this data
- Use the list of players with
isin()
andnp.where()
to create the new column forboxscores
- Display the names of the superstars and their superstar status from
boxscores
usingquery()
. Group by player so that only one row is shown for each superstar player. Use thefirst()
method to show the first value ofsuperstar
in the group for each of the players. This works because thesuperstar
value for all rows for a specified player will all have the same value
= (boxscores
player_totals 'player', as_index = False)
.groupby('pts', 'ast']]
[[sum()
.
)
'pts_rank'] = player_totals['pts'].rank(method = 'min', ascending = False)
player_totals['ast_rank'] = player_totals['ast'].rank(method = 'min', ascending = False)
player_totals[
= player_totals.query('pts_rank <= 5 or ast_rank <= 5')
superstar_totals
= superstar_totals['player'].unique()
superstars
'superstar'] = np.where(boxscores['player'].isin(superstars), 'yes', 'no')
boxscores[
(boxscores'superstar == "yes"')
.query('player')['superstar'].first()
.groupby( )
player
D. Sabonis yes
G. Antetokounmpo yes
J. Brunson yes
J. Harden yes
L. Doncic yes
N. Jokic yes
S. Gilgeous-Alexander yes
T. Haliburton yes
Name: superstar, dtype: object
E.7 Practice Exercise 2-7
Calculate the total sales per month, in chronological order of months.
Explanation of Solution Code:
- Use
assign()
to create columns for the year and month. - Group by year and month. If we only group by month each month’s total will include that month’s sales from all the years in the data.
- use
.agg()
to sum theSales
column, naming the resultMonthly_Sales
(store= store['Order Date'].dt.year,
.assign(Year = store['Order Date'].dt.month)
Month 'Year', 'Month'])
.groupby([= ('Sales', 'sum'))
.agg(Monthly_Sales )
Monthly_Sales | ||
---|---|---|
Year | Month | |
2014 | 1 | 14236.8950 |
2 | 4519.8920 | |
3 | 55691.0090 | |
4 | 28295.3450 | |
5 | 23648.2870 | |
6 | 34595.1276 | |
7 | 33946.3930 | |
8 | 27909.4685 | |
9 | 81777.3508 | |
10 | 31453.3930 | |
11 | 78628.7167 | |
12 | 69545.6205 | |
2015 | 1 | 18174.0756 |
2 | 11951.4110 | |
3 | 38726.2520 | |
4 | 34195.2085 | |
5 | 30131.6865 | |
6 | 24797.2920 | |
7 | 28765.3250 | |
8 | 36898.3322 | |
9 | 64595.9180 | |
10 | 31404.9235 | |
11 | 75972.5635 | |
12 | 74919.5212 | |
2016 | 1 | 18542.4910 |
2 | 22978.8150 | |
3 | 51715.8750 | |
4 | 38750.0390 | |
5 | 56987.7280 | |
6 | 40344.5340 | |
7 | 39261.9630 | |
8 | 31115.3743 | |
9 | 73410.0249 | |
10 | 59687.7450 | |
11 | 79411.9658 | |
12 | 96999.0430 | |
2017 | 1 | 43971.3740 |
2 | 20301.1334 | |
3 | 58872.3528 | |
4 | 36521.5361 | |
5 | 44261.1102 | |
6 | 52981.7257 | |
7 | 45264.4160 | |
8 | 63120.8880 | |
9 | 87866.6520 | |
10 | 77776.9232 | |
11 | 118447.8250 | |
12 | 83829.3188 |
E.8 Practice Exercise 2-8
Who are the top 10 customers by total sales?
Explanation of Solution Code:
- Group by customer ID, sum the sales for each group, and then find the 10 largest with
nlargest()
- Indicated a list of columns, with just one column,
Sales
, in the list, so that the output will be a DataFrame - Since
.nlargest()
is applied to a DataFrame we need to indicate the column(s) for which we want to find the 10 largest values keep = 'all'
specifies that if there is a tie at the end we keep all the customer IDs with the same sales total
(store'Customer ID')[['Sales']]
.groupby(sum()
.= 10, columns = 'Sales', keep = 'all')
.nlargest(n )
Sales | |
---|---|
Customer ID | |
SM-20320 | 25043.050 |
TC-20980 | 19052.218 |
RB-19360 | 15117.339 |
TA-21385 | 14595.620 |
AB-10105 | 14473.571 |
KL-16645 | 14175.229 |
SC-20095 | 14142.334 |
HL-15040 | 12873.298 |
SE-20110 | 12209.438 |
CC-12370 | 12129.072 |
E.9 Practice Exercise 2-9
Using the box scores data, show the top 10 scoring performances (player-games) in December. Show only the date, player, and points scored. If there is a tie for tenth highest, keep all the ties.
Explanation of Solution Code:
- Use
.loc[]
to limit to player-games in December and indicate the columns to keep - Use
.nlargest()
withkeep = 'all'
to find the 10 largest, keeping all ties at the bottom. Note that there were three players tied at 43 points for the tenth-highest spot.
(boxscores'date'].dt.month == 12, ['date', 'player', 'pts']]
.loc[boxscores[= 10, columns = 'pts', keep = 'all')
.nlargest(n )
date | player | pts | |
---|---|---|---|
7396 | 2023-12-13 | G. Antetokounmpo | 64 |
8542 | 2023-12-20 | J. Embiid | 51 |
6396 | 2023-12-06 | J. Embiid | 50 |
7782 | 2023-12-15 | J. Brunson | 50 |
9344 | 2023-12-25 | L. Doncic | 50 |
6374 | 2023-12-06 | D. Bane | 49 |
7997 | 2023-12-16 | Keegan Murray | 47 |
7360 | 2023-12-13 | J. Jackson Jr. | 44 |
9756 | 2023-12-28 | A. Edwards | 44 |
7580 | 2023-12-14 | S. Gilgeous-Alexander | 43 |
8209 | 2023-12-18 | C. Cunningham | 43 |
9516 | 2023-12-26 | D. Fox | 43 |
E.10 Practice Exercise 2-10
Show the player name and their scoring average on Sundays for the top 10 Sunday scoring averages, sorted in descending order.
Explanation of Solution Code:
- Use
.loc[]
to limit to Sunday games and specify two columns to keep - Group by player and take the mean of points. Note that a lost of columns is indicated in
.groupby('player')[['pts']].mean()
so that the output of that step is a DataFrame - Used
.agg()
to calculate the mean of points to gain ability to name the column with the result of the aggregation - Use
.nlargest()
to find the 10 highest scoring averages
'date'].dt.day_name() == 'Sunday',
(boxscores.loc[boxscores['player', 'pts']]
['player')
.groupby(= ('pts', 'mean'))
.agg(scoring_avg = 10, columns = 'scoring_avg', keep = 'all')
.nlargest(n )
scoring_avg | |
---|---|
player | |
J. Embiid | 34.666667 |
L. Doncic | 33.777778 |
T. Maxey | 31.375000 |
L. James | 30.200000 |
G. Antetokounmpo | 29.666667 |
K. Irving | 29.333333 |
L. Markkanen | 28.333333 |
Trae Young | 27.666667 |
GG Jackson | 27.500000 |
A. Edwards | 26.727273 |
E.11 Practice Exercise 2-11
Show Luka Doncic’s scoring average by month, in chronological order.
Hint: To put the months in chronological order you should group by year and month.
(boxscores'player'] == 'L. Doncic', ['date', 'pts']]
.loc[boxscores[= boxscores['date'].dt.year,
.assign(year = boxscores['date'].dt.month)
month 'year', 'month'])
.groupby([= ('pts', 'mean'))
.agg(scoring_avg )
scoring_avg | ||
---|---|---|
year | month | |
2023 | 10 | 39.000000 |
11 | 29.428571 | |
12 | 37.538462 | |
2024 | 1 | 37.100000 |
2 | 33.363636 | |
3 | 32.500000 | |
4 | 32.000000 |
E.12 Practice Exercise 2-12
Show the top 5 teams in descending order of total points scored in the season.
(boxscores'team')['pts'].sum()
.groupby(= 5, keep = 'all')
.nlargest(n )
team
IND 10110
BOS 9887
OKC 9847
MIL 9756
ATL 9703
Name: pts, dtype: int64
E.13 Practice Exercise 2-13
Create a DataFrame that shows field goal percentage (field goals made / field goals attempted), free throw percentage (free throws made / free throws attempted), and three point percentage (three point shots made / three point shots attempted) by player for the season for players who had at least 100 attempts of each type (field goals, free throws, three pointers) for the season.
Show the rows with the top 5 field goal percentages.
Explanation of Solution Code:
E.13.0.1 Solution 1
- We first group by player and sum each player’s shot attempts and shots made for various types of shots
- Then we apply
query()
to filter the data, limiting it to just those players who have at least 100 shot attempts of each type (field goals, free throws, and three pointers) - Then we use
assign()
to create columns with percentage made for each type of shot. The input of theassign()
step is the output of the previous step. That output is a DataFrame that does not have a name. We need its name in order to refer to its columns to do the calculations to create the new columns, so we need to use lambda expressions within the assign. For example, within theassign()
step, consider the following code:
= lambda df: df['fg_m'] / df['fg_a'] fg_pct
The new column, fg_pct
, gets the result of the lambda expression. The lambda expression takes as its argument the input to the assign()
step, which is the DataFrame output by the query()
step above, and gives this argument the name df
. Then it specifies that the values to be assigned to the fg_pct
column are what you get when you divide the fg_m
column in df
by the fg_a
column in df
. * After the new columns are created nlargest()
is used to show the top 5 by field goal percentage.
# Solution 1, using method chaining
'player')
(boxscores.groupby('fg_m', 'fg_a', 'ft_m', 'ft_a', 'tres_m', 'tres_a']]
[[sum()
.'fg_a >= 100 & ft_a >= 100 & tres_a >= 100')
.query(= lambda df: df['fg_m'] / df['fg_a'],
.assign(fg_pct = lambda df: df['ft_m'] / df['ft_a'],
ft_pct = lambda df: df['tres_m'] / df['tres_a']
tres_pct
)= 5, columns = 'fg_pct', keep = 'all')
.nlargest(n )
fg_m | fg_a | ft_m | ft_a | tres_m | tres_a | fg_pct | ft_pct | tres_pct | |
---|---|---|---|---|---|---|---|---|---|
player | |||||||||
G. Antetokounmpo | 837 | 1369 | 514 | 782 | 34 | 124 | 0.611395 | 0.657289 | 0.274194 |
M. Wagner | 332 | 552 | 166 | 204 | 38 | 115 | 0.601449 | 0.813725 | 0.330435 |
J. Smith | 234 | 395 | 72 | 104 | 61 | 144 | 0.592405 | 0.692308 | 0.423611 |
N. Jokic | 822 | 1411 | 358 | 438 | 83 | 231 | 0.582566 | 0.817352 | 0.359307 |
O. Toppin | 332 | 579 | 77 | 100 | 102 | 253 | 0.573402 | 0.770000 | 0.403162 |
# Alternate solution, breaking the task into parts
= boxscores.groupby('player')
gp_by_player = gp_by_player[['fg_m', 'fg_a', 'ft_m', 'ft_a', 'tres_m', 'tres_a']].sum()
player_sums
# Use .copy() because we are going to assign values to new columns in the next step.
# eligible is a subset of player_sums, so if we try to make new columns for
# eligible we are assigning to a subset of player_sums, which causes an error message.
# To avoid this use .copy() to make eligible a separate DataFrame rather than a
# name of a subset of player_sums.
= player_sums.query('fg_a >= 100 & ft_a >= 100 & tres_a >= 100').copy()
eligible
'fg_pct'] = eligible['fg_m'] / eligible['fg_a']
eligible['ft_pct'] = eligible['ft_m'] / eligible['ft_a']
eligible['tres_pct'] = eligible['tres_m'] / eligible['tres_a']
eligible[
= 5, columns = 'fg_pct', keep = 'all') eligible.nlargest(n
fg_m | fg_a | ft_m | ft_a | tres_m | tres_a | fg_pct | ft_pct | tres_pct | |
---|---|---|---|---|---|---|---|---|---|
player | |||||||||
G. Antetokounmpo | 837 | 1369 | 514 | 782 | 34 | 124 | 0.611395 | 0.657289 | 0.274194 |
M. Wagner | 332 | 552 | 166 | 204 | 38 | 115 | 0.601449 | 0.813725 | 0.330435 |
J. Smith | 234 | 395 | 72 | 104 | 61 | 144 | 0.592405 | 0.692308 | 0.423611 |
N. Jokic | 822 | 1411 | 358 | 438 | 83 | 231 | 0.582566 | 0.817352 | 0.359307 |
O. Toppin | 332 | 579 | 77 | 100 | 102 | 253 | 0.573402 | 0.770000 | 0.403162 |
E.14 Practice Exercise 2-14
Best Products in Best Markets
Show the top 5 product IDs by total sales, sorted in decreasing order, for Cities with sales of at least $100,000.
Explanation of Solution Code:
E.14.0.1 Solution 1
- We use
assign()
to create aCity_Sales
column- The
City_Sales
column shows the city’s total sales for every row of the original DataFrame because we created it with thetransform()
method. - This preserves the data in the other columns
- The
- We can then use
query()
to limit to the cities with sales of 100,000 or more - Then group by product ID and sum sales
- Finally, use
largest()
to show the top 5
# Solution 1 - Uses .transform() since we need to do two separate groupings,
# by city, then by product
(store= store.groupby('City')[['Sales']].transform('sum'))
.assign(City_Sales 'City_Sales >= 100_000')
.query('Product ID')[['Sales']].sum()
.groupby(= 5, columns = 'Sales', keep = 'all')
.nlargest(n )
Sales | |
---|---|
Product ID | |
TEC-CO-10004722 | 33599.904 |
OFF-SU-10000151 | 13100.240 |
FUR-CH-10002024 | 12126.954 |
FUR-BO-10004834 | 8809.800 |
FUR-BO-10002213 | 8036.834 |
Explanation of Solution Code:
E.14.0.2 Solution 2
- Breaks it up into steps, assigning intermediate objects to variables along the way
- What we want from
top_cities
is the row index values, since those are the names of the cities with sales of at least 100,000. We use theindex
property to return the index and theto_numpy()
method to change the index from an Index object to just an array, which is a numpy data structure somewhat similar to a python list. - Next, we use
loc[]
withisin()
to limit the data to just the top cities - Finally, we group by product, sum sales, and use
nlargest()
to show the top 5
# Solution 2 - Selects by Cities
# Calculate sales sums for cities and limit to just those with sales >= 100,000
= store.groupby('City')['Sales'].sum()
top_cities = top_cities[top_cities >= 100_000]
top_cities = top_cities.index.to_numpy()
top_cities
# Limit to top cities
= store.loc[store['City'].isin(top_cities)]
store_subset
# Sum sales by product
= store_subset.groupby('Product ID')[['Sales']].sum()
ss_top_products
# Show top 5
= 5, columns = 'Sales', keep = 'all') ss_top_products.nlargest(n
Sales | |
---|---|
Product ID | |
TEC-CO-10004722 | 33599.904 |
OFF-SU-10000151 | 13100.240 |
FUR-CH-10002024 | 12126.954 |
FUR-BO-10004834 | 8809.800 |
FUR-BO-10002213 | 8036.834 |
E.15 Practice Exercise 2-15
Show NBA players and their ranks for minutes played for the season, with rank 1 being the highest number of minutes played. Show just player name and total minutes, and show all players with rank in the top 10. In the case of ties use the minimum rank method.
Explanation of Solution Code:
- We could have done this with
nlargest()
, but we usedrank()
in order to obtain more control over how ties are handled - Since the result of the first step, grouping by player and summing minutes, is no longer
boxscores
we need to use a lambda function within theassign()
to create the new column based on themin
column of the DataFrame that is passed down from the first step
(boxscores'player')[['min']].sum()
.groupby(= lambda df: df['min'].rank(ascending = False, method = 'min'))
.assign(min_played_rank 'min_played_rank <= 5')
.query(= 'min_played_rank')
.sort_values(by )
min | min_played_rank | |
---|---|---|
player | ||
D. DeRozan | 2995 | 1.0 |
D. Sabonis | 2931 | 2.0 |
C. White | 2881 | 3.0 |
Mikal Bridges | 2855 | 4.0 |
P. Banchero | 2798 | 5.0 |
E.16 Practice Exercise 2-16
Show Steph Curry’s points per minute by month of the season.
Hints: * Calculate points per minute for a month by summing the points and minutes for that month and then dividing total points by total minutes * You could accomplish this task by breaking it into steps * Alternatively, you could use a lambda
function within the .assign
method to define the variable that represents points per minute
# Solution 1 - Breaks it into steps to avoid using a lambda function
= (
steph_months
boxscores'player'] == 'Steph Curry']
.loc[boxscores[= boxscores['date'].dt.year,
.assign(year = boxscores['date'].dt.month)
month 'year', 'month'])
.groupby([= ('pts', 'sum'),
.agg(total_pts = ('min', 'sum'))
total_min
)
= steph_months['total_pts'] / steph_months['total_min']) steph_months.assign(ppm
total_pts | total_min | ppm | ||
---|---|---|---|---|
year | month | |||
2023 | 10 | 134 | 127 | 1.055118 |
11 | 367 | 438 | 0.837900 | |
12 | 318 | 439 | 0.724374 | |
2024 | 1 | 310 | 369 | 0.840108 |
2 | 388 | 455 | 0.852747 | |
3 | 291 | 394 | 0.738579 | |
4 | 148 | 198 | 0.747475 |
# Solution 2 - Uses method chaining and lambda function
(boxscores'player'] == 'Steph Curry']
.loc[boxscores[= boxscores['date'].dt.year,
.assign(year = boxscores['date'].dt.month)
month 'year', 'month'])
.groupby([= ('pts', 'sum'),
.agg(total_pts = ('min', 'sum'))
total_min = lambda df: df['total_pts'] / df['total_min'])
.assign(ppm )
total_pts | total_min | ppm | ||
---|---|---|---|---|
year | month | |||
2023 | 10 | 134 | 127 | 1.055118 |
11 | 367 | 438 | 0.837900 | |
12 | 318 | 439 | 0.724374 | |
2024 | 1 | 310 | 369 | 0.840108 |
2 | 388 | 455 | 0.852747 | |
3 | 291 | 394 | 0.738579 | |
4 | 148 | 198 | 0.747475 |
E.17 Practice Exercise 2-17
Show the player name and count of triple doubles during the season, sorted in descending order of triple doubles. Show only the top 10. If there are ties for tenth place make sure all the player names who tied are included.
# Solution 1 - Uses value_counts()
(boxscores'pts'] >= 10) &
.loc[(boxscores['reb'] >= 10) &
(boxscores['ast'] >= 10), 'player']
(boxscores[
.value_counts()= 10, keep = 'all')
.nlargest(n )
player
D. Sabonis 26
N. Jokic 25
L. Doncic 21
G. Antetokounmpo 10
J. Hart 6
L. James 5
S. Barnes 4
J. Giddey 3
J. Harden 3
P. Banchero 2
A. Sengun 2
A. Davis 2
B. Adebayo 2
J. Embiid 2
Name: count, dtype: int64
# Solution 2 - Uses groupby() and count()
(boxscores'pts'] >= 10) &
.loc[(boxscores['reb'] >= 10) &
(boxscores['ast'] >= 10)]
(boxscores['player')['date'].count()
.groupby(= 10, keep = 'all')
.nlargest(n )
player
D. Sabonis 26
N. Jokic 25
L. Doncic 21
G. Antetokounmpo 10
J. Hart 6
L. James 5
S. Barnes 4
J. Giddey 3
J. Harden 3
A. Davis 2
A. Sengun 2
B. Adebayo 2
J. Embiid 2
P. Banchero 2
Name: date, dtype: int64