Appendix G — Chapter 3 Practice - Solutions

Click here to download this chapter as a Jupyter (.ipynb) file.

These are suggested solutions to the practice exercises for the Concatenating, Merging, and Reshaping Data chapter.

The code below loads pandas and numpy, creates some sample DataFrames for concatenation practice, and reads four datasets into DataFrames.

The first DataFrame, boxscores, has box score data from the 2023-24 NBA season. Each row represents a player-game, that is, one player's statistics in one game.

The second DataFrame, teams, has NBA team records from the 2023-24 NBA season.

The batting DataFrame has batting data for major league baseball in the U.S. since 1871.

The people DataFrame has information on people associated with major league baseball in the U.S.

import pandas as pd
import numpy as np

# Create DataFrames for concatenation practice

east_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [433_721.45, 513_720.77, 522_482.31],
    'profit': [47_712.34, 52_936.29, 61_295.22]         
})

west_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [633_728.45, 713_780.77, 722_488.35],
    'profit': [77_712.34, 84_976.29, 81_275.28]         
})

# Import the NBA boxscore data
url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)

# Convert the date column in the box score data to a datetime type
boxscores['date'] = pd.to_datetime(boxscores['date'], format = '%Y%m%d')

# Import the NBA team records data
url = "https://neuronjolt.com/data/nba_team_recs_23-24.csv"
teams = pd.read_csv(url)

# Import the baseball batting and people data
batting = pd.read_csv('https://neuronjolt.com/data/Batting.csv')
people = pd.read_csv('https://neuronjolt.com/data/People.csv')

G.1 Practice Exercise 3-1

The east_div_q1 DataFrame has the monthly sales and profit for the East division for the first quarter, and the west_div_q1 DataFrame has the monthly sales and profit for the West division for the first quarter. Concatenate those two DataFrames in a way that preserves the information about what division the monthly sales and profit represent.

pd.concat([east_div_q1, west_div_q1],
          keys = ('east', 'west'))
month sales profit
east 0 Jan 433721.45 47712.34
1 Feb 513720.77 52936.29
2 Mar 522482.31 61295.22
west 0 Jan 633728.45 77712.34
1 Feb 713780.77 84976.29
2 Mar 722488.35 81275.28

G.2 Practice Exercise 3-2

Concatenate the east_div_q1 and west_div_q1 DataFrames and then produce from the result a DataFrame named combined that has four columns: division, month, sales and profit and an integer row index with no duplicates.

combined = pd.concat([east_div_q1, west_div_q1],
                       keys = ('east', 'west'))
combined
month sales profit
east 0 Jan 433721.45 47712.34
1 Feb 513720.77 52936.29
2 Mar 522482.31 61295.22
west 0 Jan 633728.45 77712.34
1 Feb 713780.77 84976.29
2 Mar 722488.35 81275.28
combined = combined.reset_index()

combined
level_0 level_1 month sales profit
0 east 0 Jan 433721.45 47712.34
1 east 1 Feb 513720.77 52936.29
2 east 2 Mar 522482.31 61295.22
3 west 0 Jan 633728.45 77712.34
4 west 1 Feb 713780.77 84976.29
5 west 2 Mar 722488.35 81275.28
combined.rename(columns = {'level_0': 'division'}, inplace = True)
combined.drop('level_1', axis = 1, inplace = True)

combined
division month sales profit
0 east Jan 433721.45 47712.34
1 east Feb 513720.77 52936.29
2 east Mar 522482.31 61295.22
3 west Jan 633728.45 77712.34
4 west Feb 713780.77 84976.29
5 west Mar 722488.35 81275.28

Note: There are many ways to accomplish this, so you might have taken a different approach.
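For instance, one shorter route, sketched here with the sample data recreated inline so the snippet runs on its own, is to name the key level directly in concat(); reset_index() then produces the division column without a separate rename step:

```python
import pandas as pd

# Recreate the sample data so this snippet runs on its own
east_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [433_721.45, 513_720.77, 522_482.31],
    'profit': [47_712.34, 52_936.29, 61_295.22]
})
west_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [633_728.45, 713_780.77, 722_488.35],
    'profit': [77_712.34, 84_976.29, 81_275.28]
})

# Naming the key level 'division' up front means reset_index()
# creates a ready-made 'division' column; the leftover integer
# level is then dropped with a second reset_index(drop=True)
combined = (pd.concat([east_div_q1, west_div_q1],
                      keys=['east', 'west'],
                      names=['division', None])
            .reset_index(level='division')
            .reset_index(drop=True))
```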

G.3 Practice Exercise 3-3

Concatenate the east_div_q1 and west_div_q1 DataFrames in a way that produces a DataFrame named east_west with a MultiIndex for the row index, with the first level of the row index indicating the division and the second level of the row index indicating the month.

east_west = pd.concat([east_div_q1.set_index('month'), 
                       west_div_q1.set_index('month')],
                      keys = ('east', 'west'))

east_west
sales profit
month
east Jan 433721.45 47712.34
Feb 513720.77 52936.29
Mar 522482.31 61295.22
west Jan 633728.45 77712.34
Feb 713780.77 84976.29
Mar 722488.35 81275.28

G.4 Practice Exercise 3-4

The east_west DataFrame that you created for the previous exercise has a MultiIndex for the row index. Manipulate the names for the levels of the row index so that the first level is named ‘division’ and the second level is named ‘month’.

east_west.index.set_names('division', level = 0, inplace = True)

east_west
sales profit
division month
east Jan 433721.45 47712.34
Feb 513720.77 52936.29
Mar 522482.31 61295.22
west Jan 633728.45 77712.34
Feb 713780.77 84976.29
Mar 722488.35 81275.28

Note: The index level names can also be specified within the concat() function with the names parameter.

east_west = pd.concat([east_div_q1.set_index('month'), 
                       west_div_q1.set_index('month')],
                      keys = ('east', 'west'),
                      names = ('division', 'month'))

east_west
sales profit
division month
east Jan 433721.45 47712.34
Feb 513720.77 52936.29
Mar 522482.31 61295.22
west Jan 633728.45 77712.34
Feb 713780.77 84976.29
Mar 722488.35 81275.28

G.5 Practice Exercise 3-5

Display from the east_west DataFrame just the rows for the East division January sales and profit and the West division February sales and profit.

east_west.loc[[('east', 'Jan'), ('west', 'Feb')]]
sales profit
division month
east Jan 433721.45 47712.34
west Feb 713780.77 84976.29

G.6 Practice Exercise 3-6

Display from the east_west DataFrame just the rows for January sales and profit for both divisions.

# Solution using query()

east_west.query('month == "Jan"')
sales profit
division month
east Jan 433721.45 47712.34
west Jan 633728.45 77712.34
# Solution using index.get_level_values() with level position specified

east_west.loc[east_west.index.get_level_values(1) == "Jan"]
sales profit
division month
east Jan 433721.45 47712.34
west Jan 633728.45 77712.34
# Solution using index.get_level_values() with level label specified

east_west.loc[east_west.index.get_level_values('month') == "Jan"]
sales profit
division month
east Jan 433721.45 47712.34
west Jan 633728.45 77712.34

G.7 Practice Exercise 3-7

Merge the boxscores data with the teams data in a way that will keep all the rows of the boxscores data, even if the team’s record is not found in the team records data. Create a new DataFrame named boxscores_2 from the merged data. Use the info() method to investigate the boxscores_2 DataFrame.

boxscores_2 = boxscores.merge(teams, on = 'team', how = 'left')

boxscores_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 29 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       26498 non-null  datetime64[ns]
 1   team       26498 non-null  object        
 2   opponent   26498 non-null  object        
 3   player     26498 non-null  object        
 4   position   26498 non-null  object        
 5   min        26498 non-null  int64         
 6   oreb       26498 non-null  int64         
 7   dreb       26498 non-null  int64         
 8   reb        26498 non-null  int64         
 9   ast        26498 non-null  int64         
 10  stl        26498 non-null  int64         
 11  blk        26498 non-null  int64         
 12  to         26498 non-null  int64         
 13  pf         26498 non-null  int64         
 14  plus_m     26498 non-null  int64         
 15  pts        26498 non-null  int64         
 16  fg_m       26498 non-null  int64         
 17  fg_a       26498 non-null  int64         
 18  tres_m     26498 non-null  int64         
 19  tres_a     26498 non-null  int64         
 20  ft_m       26498 non-null  int64         
 21  ft_a       26498 non-null  int64         
 22  wins       26498 non-null  int64         
 23  losses     26498 non-null  int64         
 24  3_pt_g     26498 non-null  object        
 25  10_pt_g    26498 non-null  object        
 26  vs_ge_500  26498 non-null  object        
 27  vs_lt_500  26498 non-null  object        
 28  ot_g       26498 non-null  object        
dtypes: datetime64[ns](1), int64(19), object(9)
memory usage: 5.9+ MB
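Because a left merge quietly fills NaN for keys with no match, it is worth checking whether any rows failed to match. A sketch with small hypothetical stand-ins for boxscores and teams (the real check would run on the full DataFrames):

```python
import pandas as pd

# Hypothetical stand-ins: 'XXX' deliberately has no record in teams
boxscores = pd.DataFrame({'player': ['A. Guard', 'B. Center', 'C. Forward'],
                          'team': ['MIL', 'MIL', 'XXX']})
teams = pd.DataFrame({'team': ['MIL', 'BOS'],
                      'wins': [49, 64]})

merged = boxscores.merge(teams, on='team', how='left', indicator=True)

# Any team code with no match in teams shows up as 'left_only'
unmatched = merged.loc[merged['_merge'] == 'left_only', 'team'].unique()
```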

G.8 Practice Exercise 3-8

Which player playing for which team had the highest scoring average while playing in at least 50 games for a team with a losing record?

(boxscores_2
 .query('losses > wins')
 .groupby(['player', 'team'])
 .agg(games_for_team = ('date', 'count'),
      score_avg = ('pts', 'mean'))
 .query('games_for_team >= 50')
 .nlargest(n = 1, columns = 'score_avg', keep = 'all')
)
games_for_team score_avg
player team
Trae Young ATL 54 25.722222

G.9 Practice Exercise 3-9

Investigating merge types

Create a DataFrame named bs_sub by randomly sampling 12 rows from the boxscores DataFrame and keeping only the player, team, and position columns. Create another DataFrame named teams_sub by randomly sampling 12 rows from the teams DataFrame, keeping only the team, wins, and losses columns.

After creating the two DataFrames, run value_counts() on the team column of each DataFrame. You should see that the relationship between bs_sub and teams_sub has the potential to be many-to-one, since there may be multiple rows in bs_sub for the same team, while teams_sub will only have one row for each team. In a merge, each row from one input DataFrame is merged with every row in the other input DataFrame that has the same key value.

Next, show the results of four different merges with bs_sub as the left DataFrame and teams_sub as the right DataFrame: an inner merge, an outer merge, a right merge, and a left merge. Set the indicator = True parameter to help you investigate the results of the various merge types.

For each merge note how many rows are in the result and examine where missing values are created.

See if you can find evidence of a many-to-one relationship between bs_sub and teams_sub.

Since the two input DataFrames are created through random sampling, you can run your code multiple times to study a variety of merge results.
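As an aside, merge() can enforce the expected relationship via its validate parameter; if the data violates it, pandas raises a MergeError. A sketch with toy stand-ins for bs_sub and teams_sub:

```python
import pandas as pd

# Toy many-to-one data: two box-score rows for MIL, one record per team
bs_sub = pd.DataFrame({'player': ['A', 'B', 'C'],
                       'team': ['MIL', 'MIL', 'IND']})
teams_sub = pd.DataFrame({'team': ['MIL', 'IND'],
                          'wins': [49, 47]})

# validate='m:1' asserts that the right DataFrame's keys are unique
ok = pd.merge(bs_sub, teams_sub, on='team', how='left', validate='m:1')

# Duplicating a right-hand key violates many-to-one and raises
bad_teams = pd.concat([teams_sub, teams_sub.head(1)])
try:
    pd.merge(bs_sub, bad_teams, on='team', how='left', validate='m:1')
    violated = False
except pd.errors.MergeError:
    violated = True
```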

bs_sub = boxscores.sample(12).loc[:, ['player', 'team', 'position']]
teams_sub = teams.sample(12).loc[:, ['team', 'wins', 'losses']]
bs_sub['team'].value_counts()
team
IND    2
MIL    2
OKC    1
DEN    1
MIN    1
PHX    1
CLE    1
NO     1
LAC    1
MEM    1
Name: count, dtype: int64
teams_sub['team'].value_counts()
team
SAC    1
NO     1
MEM    1
CHA    1
MIL    1
HOU    1
ATL    1
NY     1
BOS    1
BKN    1
LAL    1
POR    1
Name: count, dtype: int64
# Inner merge - only keeps rows for which a matching team is found in both DataFrames
pd.merge(bs_sub, teams_sub, on = 'team', how = 'inner', indicator = True)
player team position wins losses _merge
0 J. Crowder MIL PF 49 33 both
1 K. Lewis Jr. NO PG 49 33 both
2 J. Jackson Jr. MEM PF 27 55 both
3 A. Green MIL G 49 33 both
# Outer merge - all rows are kept from both DataFrames
pd.merge(bs_sub, teams_sub, on = 'team', how = 'outer', indicator = True)
player team position wins losses _merge
0 NaN ATL NaN 36.0 46.0 right_only
1 NaN BKN NaN 32.0 50.0 right_only
2 NaN BOS NaN 64.0 18.0 right_only
3 NaN CHA NaN 21.0 61.0 right_only
4 I. Okoro CLE SF NaN NaN left_only
5 A. Gordon DEN PF NaN NaN left_only
6 NaN HOU NaN 41.0 41.0 right_only
7 M. Turner IND C NaN NaN left_only
8 A. Nembhard IND PG NaN NaN left_only
9 N. Powell LAC G NaN NaN left_only
10 NaN LAL NaN 47.0 35.0 right_only
11 J. Jackson Jr. MEM PF 27.0 55.0 both
12 J. Crowder MIL PF 49.0 33.0 both
13 A. Green MIL G 49.0 33.0 both
14 M. Conley MIN PG NaN NaN left_only
15 K. Lewis Jr. NO PG 49.0 33.0 both
16 NaN NY NaN 50.0 32.0 right_only
17 O. Dieng OKC F NaN NaN left_only
18 B. Bol PHX C NaN NaN left_only
19 NaN POR NaN 21.0 61.0 right_only
20 NaN SAC NaN 46.0 36.0 right_only
# Left merge - Keeps all rows from left DataFrame
pd.merge(bs_sub, teams_sub, on = 'team', how = 'left', indicator = True)
player team position wins losses _merge
0 A. Gordon DEN PF NaN NaN left_only
1 O. Dieng OKC F NaN NaN left_only
2 J. Crowder MIL PF 49.0 33.0 both
3 M. Conley MIN PG NaN NaN left_only
4 B. Bol PHX C NaN NaN left_only
5 M. Turner IND C NaN NaN left_only
6 I. Okoro CLE SF NaN NaN left_only
7 K. Lewis Jr. NO PG 49.0 33.0 both
8 N. Powell LAC G NaN NaN left_only
9 A. Nembhard IND PG NaN NaN left_only
10 J. Jackson Jr. MEM PF 27.0 55.0 both
11 A. Green MIL G 49.0 33.0 both
# Right merge - keeps all rows from right DataFrame
pd.merge(bs_sub, teams_sub, on = 'team', how = 'right', indicator = True)
player team position wins losses _merge
0 NaN SAC NaN 46 36 right_only
1 K. Lewis Jr. NO PG 49 33 both
2 J. Jackson Jr. MEM PF 27 55 both
3 NaN CHA NaN 21 61 right_only
4 J. Crowder MIL PF 49 33 both
5 A. Green MIL G 49 33 both
6 NaN HOU NaN 41 41 right_only
7 NaN ATL NaN 36 46 right_only
8 NaN NY NaN 50 32 right_only
9 NaN BOS NaN 64 18 right_only
10 NaN BKN NaN 32 50 right_only
11 NaN LAL NaN 47 35 right_only
12 NaN POR NaN 21 61 right_only

G.10 Practice Exercise 3-10

Create a subset of the boxscores DataFrame, named bucks, that only has the statlines for players playing for the Milwaukee Bucks (“MIL” is the team code), and only includes the following columns: date, player, team, pts, reb, ast, stl, blk.

Display a random sample of 5 rows of the bucks DataFrame. This format is called “wide” format because each statistic has its own column, which makes the DataFrame wide.

bucks = (boxscores
         .query('team == "MIL"')
         .loc[:, ['date', 'player', 'team', 'pts', 'reb', 'ast', 'stl', 'blk']]
        )

bucks.sample(5)
date player team pts reb ast stl blk
802 2023-10-29 P. Connaughton MIL 5 3 2 2 0
5119 2023-11-26 B. Lopez MIL 10 6 3 3 4
17625 2024-02-15 G. Antetokounmpo MIL 35 4 12 0 0
8987 2023-12-23 K. Middleton MIL 20 5 5 0 0
19984 2024-03-06 B. Portis MIL 20 6 1 0 0

Make a new DataFrame named bucks_long that is the result of transforming the bucks DataFrame to “long” format. All the columns that represent statistics should be transformed into two columns: stat and value. Display several rows of the bucks_long DataFrame.

bucks_long = bucks.melt(id_vars = ['date', 'player', 'team'],
                        var_name = 'stat')

bucks_long.sample(7)
date player team stat value
1435 2024-01-29 P. Connaughton MIL reb 1
675 2024-03-01 B. Lopez MIL pts 12
3502 2024-03-20 P. Beverley MIL stl 0
713 2024-03-08 D. Gallinari MIL pts 0
3832 2023-11-26 B. Lopez MIL blk 4
2158 2023-12-25 C. Payne MIL ast 1
1778 2024-04-09 D. Lillard MIL reb 3

G.11 Practice Exercise 3-11

Make a new DataFrame named bucks_wide by transforming the bucks_long DataFrame back to wide format. Verify your transformation by displaying several randomly selected rows from bucks_wide.

bucks_wide = pd.pivot(bucks_long, 
                      index = ['date', 'player', 'team'],
                      columns = 'stat',
                      values = 'value')
bucks_wide.sample(7)
stat ast blk pts reb stl
date player team
2023-11-17 K. Middleton MIL 1 0 12 6 1
2023-11-24 G. Antetokounmpo MIL 3 1 31 9 3
D. Lillard MIL 10 0 31 6 0
2024-02-13 M. Beauchamp MIL 0 0 8 2 1
2024-01-20 B. Portis MIL 0 0 4 3 0
2024-01-03 K. Middleton MIL 7 2 19 6 0
2024-02-03 G. Antetokounmpo MIL 10 1 48 6 5
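The melt/pivot round trip can be checked on a toy DataFrame: pivot back to wide, flatten the MultiIndex with reset_index(), and compare against the original (the player names are reused from above, but the point and rebound values are invented for the sketch):

```python
import pandas as pd

# Toy wide-format data (statistic values are made up for the sketch)
wide = pd.DataFrame({'date': ['2023-10-29', '2023-10-29'],
                     'player': ['G. Antetokounmpo', 'D. Lillard'],
                     'team': ['MIL', 'MIL'],
                     'pts': [23, 39],
                     'reb': [10, 4]})

# Wide -> long: one row per (player-game, statistic)
long_form = wide.melt(id_vars=['date', 'player', 'team'], var_name='stat')

# Long -> wide: pivot back, then flatten the MultiIndex into columns
round_trip = (long_form
              .pivot(index=['date', 'player', 'team'],
                     columns='stat', values='value')
              .reset_index()
              .rename_axis(columns=None))

# Identical content, up to row order
same = (wide.sort_values('player').reset_index(drop=True)
        .equals(round_trip.sort_values('player').reset_index(drop=True)))
```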

G.12 Practice Exercise 3-12

In the teams DataFrame several columns have the teams' records for various contexts. The data is more useful if we split out the wins and losses into their own columns. Based on the ot_g column, create three new columns in the DataFrame:

  • ot_win_pct - win percentage in overtime games
  • ot_wins - number of overtime games won
  • ot_losses - number of overtime games lost
teams['ot_wins'] = teams['ot_g'].str.split('-').str.get(0).astype('int')
teams['ot_losses'] = teams['ot_g'].str.split('-').str.get(1).astype('int')
teams['ot_win_pct'] = teams['ot_wins'] / (teams['ot_wins'] + teams['ot_losses'])

teams.sample(5)
team wins losses 3_pt_g 10_pt_g vs_ge_500 vs_lt_500 ot_g ot_wins ot_losses ot_win_pct
10 HOU 41 41 8-5 23-17 20-32 21-9 2-5 2 5 0.285714
29 WSH 15 67 4-5 6-35 4-47 11-20 0-2 0 2 0.000000
15 MIA 46 36 8-4 21-15 19-27 27-9 2-0 2 0 1.000000
28 UTAH 31 51 5-8 15-33 15-43 16-8 2-2 2 2 0.500000
26 SA 22 60 5-7 12-31 12-45 10-15 3-2 3 2 0.600000
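An alternative sketch: str.split() with expand=True returns both halves of each record as a two-column DataFrame, so the wins and losses can be assigned in one pass (toy data inline so the snippet runs standalone):

```python
import pandas as pd

# Toy stand-in for the teams DataFrame
teams = pd.DataFrame({'team': ['HOU', 'MIA'],
                      'ot_g': ['2-5', '2-0']})

# expand=True returns a two-column DataFrame rather than a Series
# of lists, so both halves are assigned in a single operation
teams[['ot_wins', 'ot_losses']] = (teams['ot_g']
                                   .str.split('-', expand=True)
                                   .astype('int'))
teams['ot_win_pct'] = teams['ot_wins'] / (teams['ot_wins'] + teams['ot_losses'])
```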

G.13 Practice Exercise 3-13

Examine the batting and people DataFrames. The batting DataFrame has batting statistics for major league baseball players in the U.S. from 1871 to 2022. The people DataFrame has detailed information on major league baseball players. Both DataFrames have a playerID column that may be used to merge them if you want to answer a question that requires data from both DataFrames.

Next, compare the collective batting averages (total hits / total at bats) for players named “Jim” vs players named “Tim.” You should calculate one batting average for each name that reflects the hitting performance of players with that first name across history.

batting.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112184 entries, 0 to 112183
Data columns (total 22 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   playerID  112184 non-null  object 
 1   yearID    112184 non-null  int64  
 2   stint     112184 non-null  int64  
 3   teamID    112184 non-null  object 
 4   lgID      111447 non-null  object 
 5   G         112184 non-null  int64  
 6   AB        112184 non-null  int64  
 7   R         112184 non-null  int64  
 8   H         112184 non-null  int64  
 9   2B        112184 non-null  int64  
 10  3B        112184 non-null  int64  
 11  HR        112184 non-null  int64  
 12  RBI       111428 non-null  float64
 13  SB        109816 non-null  float64
 14  CS        88642 non-null   float64
 15  BB        112184 non-null  int64  
 16  SO        110084 non-null  float64
 17  IBB       75533 non-null   float64
 18  HBP       109368 non-null  float64
 19  SH        106116 non-null  float64
 20  SF        76080 non-null   float64
 21  GIDP      86742 non-null   float64
dtypes: float64(9), int64(10), object(3)
memory usage: 18.8+ MB
people.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20676 entries, 0 to 20675
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   playerID      20676 non-null  object 
 1   birthYear     20567 non-null  float64
 2   birthMonth    20398 non-null  float64
 3   birthDay      20256 non-null  float64
 4   birthCountry  20617 non-null  object 
 5   birthState    20136 non-null  object 
 6   birthCity     20508 non-null  object 
 7   deathYear     10098 non-null  float64
 8   deathMonth    10097 non-null  float64
 9   deathDay      10096 non-null  float64
 10  deathCountry  10094 non-null  object 
 11  deathState    10038 non-null  object 
 12  deathCity     10089 non-null  object 
 13  nameFirst     20639 non-null  object 
 14  nameLast      20676 non-null  object 
 15  nameGiven     20639 non-null  object 
 16  weight        19864 non-null  float64
 17  height        19944 non-null  float64
 18  bats          19498 non-null  object 
 19  throws        19699 non-null  object 
 20  debut         20463 non-null  object 
 21  finalGame     20463 non-null  object 
 22  retroID       20627 non-null  object 
 23  bbrefID       20665 non-null  object 
dtypes: float64(8), object(16)
memory usage: 3.8+ MB
# Confirm that there is one row per playerID in the people DataFrame
# There are 20,676 rows and 20,676 unique player IDs, so there must be one player ID per row.
people['playerID'].nunique()
20676
# Are there any player IDs in batting that aren't in people?
set(batting['playerID'].unique()) - set(people['playerID'].unique())
set()
(batting
 .merge(people, on = 'playerID', how = 'left')
 .query('nameFirst == "Jim" or nameFirst == "Tim"')
 .groupby('nameFirst')
 .agg(total_hits = ('H', 'sum'),
      total_at_bats = ('AB', 'sum'))
 .assign(bat_avg = lambda df: df['total_hits'] / df['total_at_bats'])
)
total_hits total_at_bats bat_avg
nameFirst
Jim 65555 260699 0.251459
Tim 19916 78868 0.252523
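The two-clause name filter can also be written with query()'s in operator, which behaves like Series.isin(). A toy sketch with made-up player IDs and statistics:

```python
import pandas as pd

# Made-up player IDs and statistics for the sketch
people = pd.DataFrame({'playerID': ['p1', 'p2', 'p3'],
                       'nameFirst': ['Jim', 'Tim', 'Bob']})
batting = pd.DataFrame({'playerID': ['p1', 'p1', 'p2', 'p3'],
                        'H': [10, 20, 15, 5],
                        'AB': [40, 60, 50, 25]})

# `in` inside query() is equivalent to Series.isin()
result = (batting
          .merge(people, on='playerID', how='left')
          .query('nameFirst in ["Jim", "Tim"]')
          .groupby('nameFirst')
          .agg(total_hits=('H', 'sum'),
               total_at_bats=('AB', 'sum'))
          .assign(bat_avg=lambda df: df['total_hits'] / df['total_at_bats']))
```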

G.14 Practice Exercise 3-14

For years 1990 and later, show average height and weight of professional baseball players in the U.S., by year. Be careful not to double-count any players in any one year!

(batting
 [['yearID', 'playerID']].drop_duplicates()  
 .query('yearID >= 1990')
 .merge(people, on = 'playerID', how = 'left')
 .groupby('yearID')
 [['height', 'weight']].mean()
)
height weight
yearID
1990 73.367961 188.076699
1991 73.328820 188.611219
1992 73.375372 189.064548
1993 73.442935 190.302536
1994 73.453078 189.909183
1995 73.366284 190.443954
1996 73.464505 191.544259
1997 73.399822 192.233304
1998 73.428331 193.239460
1999 73.474773 194.637717
2000 73.471545 195.591057
2001 73.454098 197.180328
2002 73.467980 198.798030
2003 73.473984 199.674797
2004 73.492382 201.206095
2005 73.489086 203.257074
2006 73.586957 206.064412
2007 73.659624 208.138498
2008 73.619675 209.405112
2009 73.676145 210.467615
2010 73.668535 212.336269
2011 73.656371 213.244015
2012 73.663551 213.104361
2013 73.625767 213.030675
2014 73.637121 213.452273
2015 73.674332 213.088279
2016 73.737620 213.121951
2017 73.722386 212.988954
2018 73.727339 212.629442
2019 73.727660 212.746809
2020 73.775795 212.696664
2021 73.777188 211.017905
2022 73.712375 208.642809
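The drop_duplicates() step matters because a player traded mid-season appears once per stint within a year; without it such players would be counted more than once. A toy illustration with made-up heights:

```python
import pandas as pd

# Toy data: p1 was traded, so 1995 has two stint rows for p1
batting = pd.DataFrame({'playerID': ['p1', 'p1', 'p2'],
                        'yearID': [1995, 1995, 1995],
                        'stint': [1, 2, 1]})
people = pd.DataFrame({'playerID': ['p1', 'p2'],
                       'height': [72.0, 76.0]})

# drop_duplicates() keeps one row per player per year, so p1's
# height is averaged in once rather than twice
avg_height = (batting[['yearID', 'playerID']].drop_duplicates()
              .merge(people, on='playerID', how='left')
              .groupby('yearID')['height'].mean())

# Without the de-duplication the 1995 mean would be (72 + 72 + 76) / 3
```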