Appendix G — Chapter 3 Practice - Solutions

Click here to download this chapter as a Jupyter (.ipynb) file.

These are suggested solutions to the practice exercises for the Concatenating, Merging, and Reshaping Data chapter.

The code below loads pandas and numpy, creates some sample DataFrames for concatenation practice, and reads four datasets into DataFrames.

import pandas as pd
import numpy as np

# Create DataFrames for concatenation practice
east_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [433_721.45, 513_720.77, 522_482.31],
    'profit': [47_712.34, 52_936.29, 61_295.22]
})

west_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [633_728.45, 713_780.77, 722_488.35],
    'profit': [77_712.34, 84_976.29, 81_275.28]
})

# Import the NBA boxscore data
url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)

# Convert the date column in the box score data to a datetime type
boxscores['date'] = pd.to_datetime(boxscores['date'], format = '%Y%m%d')

# Import the NBA team records data
url = "https://neuronjolt.com/data/nba_team_recs_23-24.csv"
teams = pd.read_csv(url)

# Import the baseball batting and people data
batting = pd.read_csv('https://neuronjolt.com/data/Batting.csv')
people = pd.read_csv('https://neuronjolt.com/data/People.csv')
The first DataFrame, boxscores, is the box score data from the 2023-24 NBA season. Each row represents a player-game, that is, one player’s statistics in one game.

The second DataFrame, teams, has NBA team records from the 2023-24 NBA season. Following are the column definitions in the teams DataFrame:
wins - number of wins for the season
losses - number of losses for the season
3_pt_g - the team’s record (wins-losses) in games decided by 3 points or fewer
10_pt_g - the team’s record (wins-losses) in games decided by 10 points or more
vs_ge_500 - the team’s record (wins-losses) vs teams that won more than half their games
vs_lt_500 - the team’s record (wins-losses) vs teams that lost more than half their games
ot_g - the team’s record (wins-losses) in games that were decided in overtime
The batting DataFrame has batting data for major league baseball in the U.S. since 1871.

The people DataFrame has information on people associated with major league baseball in the U.S.
G.1 Practice Exercise 3-1
The east_div_q1 DataFrame has the monthly sales and profit for the East division for the first quarter, and the west_div_q1 DataFrame has the monthly sales and profit for the West division for the first quarter. Concatenate those two DataFrames in a way that preserves the information about what division the monthly sales and profit represent.
pd.concat([east_div_q1, west_div_q1], keys = ('east', 'west'))
|      |   | month | sales     | profit   |
|------|---|-------|-----------|----------|
| east | 0 | Jan   | 433721.45 | 47712.34 |
|      | 1 | Feb   | 513720.77 | 52936.29 |
|      | 2 | Mar   | 522482.31 | 61295.22 |
| west | 0 | Jan   | 633728.45 | 77712.34 |
|      | 1 | Feb   | 713780.77 | 84976.29 |
|      | 2 | Mar   | 722488.35 | 81275.28 |
G.2 Practice Exercise 3-2
Concatenate the east_div_q1 and west_div_q1 DataFrames and then produce from the result a DataFrame named combined that has four columns: division, month, sales, and profit, and an integer row index with no duplicates.
combined = pd.concat([east_div_q1, west_div_q1],
                     keys = ('east', 'west'))

combined
|      |   | month | sales     | profit   |
|------|---|-------|-----------|----------|
| east | 0 | Jan   | 433721.45 | 47712.34 |
|      | 1 | Feb   | 513720.77 | 52936.29 |
|      | 2 | Mar   | 522482.31 | 61295.22 |
| west | 0 | Jan   | 633728.45 | 77712.34 |
|      | 1 | Feb   | 713780.77 | 84976.29 |
|      | 2 | Mar   | 722488.35 | 81275.28 |
combined = combined.reset_index()

combined
|   | level_0 | level_1 | month | sales | profit |
|---|---------|---------|-------|-------|--------|
0 | east | 0 | Jan | 433721.45 | 47712.34 |
1 | east | 1 | Feb | 513720.77 | 52936.29 |
2 | east | 2 | Mar | 522482.31 | 61295.22 |
3 | west | 0 | Jan | 633728.45 | 77712.34 |
4 | west | 1 | Feb | 713780.77 | 84976.29 |
5 | west | 2 | Mar | 722488.35 | 81275.28 |
combined.rename(columns = {'level_0': 'division'}, inplace = True)
combined.drop('level_1', axis = 1, inplace = True)

combined
|   | division | month | sales | profit |
|---|----------|-------|-------|--------|
0 | east | Jan | 433721.45 | 47712.34 |
1 | east | Feb | 513720.77 | 52936.29 |
2 | east | Mar | 522482.31 | 61295.22 |
3 | west | Jan | 633728.45 | 77712.34 |
4 | west | Feb | 713780.77 | 84976.29 |
5 | west | Mar | 722488.35 | 81275.28 |
Note: There are many ways to accomplish this, so you might have taken a different approach.
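One alternative, sketched below with the illustrative name combined_alt, labels each input DataFrame with an ordinary division column before concatenating, which avoids creating and then flattening a MultiIndex:

# Alternative: add a division column to each DataFrame, then concatenate
combined_alt = pd.concat([east_div_q1.assign(division = 'east'),
                          west_div_q1.assign(division = 'west')],
                         ignore_index = True)
combined_alt = combined_alt[['division', 'month', 'sales', 'profit']]
combined_alt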
G.3 Practice Exercise 3-3
Concatenate the east_div_q1 and west_div_q1 DataFrames in a way that produces a DataFrame named east_west with a MultiIndex for the row index, with the first level of the row index indicating the division and the second level of the row index indicating the month.
east_west = pd.concat([east_div_q1.set_index('month'),
                       west_div_q1.set_index('month')],
                      keys = ('east', 'west'))

east_west
|      | month | sales     | profit   |
|------|-------|-----------|----------|
| east | Jan   | 433721.45 | 47712.34 |
|      | Feb   | 513720.77 | 52936.29 |
|      | Mar   | 522482.31 | 61295.22 |
| west | Jan   | 633728.45 | 77712.34 |
|      | Feb   | 713780.77 | 84976.29 |
|      | Mar   | 722488.35 | 81275.28 |
G.4 Practice Exercise 3-4
The east_west DataFrame that you created for the previous exercise has a MultiIndex for the row index. Manipulate the names for the levels of the row index so that the first level is named ‘division’ and the second level is named ‘month’.
east_west.index.set_names('division', level = 0, inplace = True)

east_west
| division | month | sales     | profit   |
|----------|-------|-----------|----------|
| east     | Jan   | 433721.45 | 47712.34 |
|          | Feb   | 513720.77 | 52936.29 |
|          | Mar   | 522482.31 | 61295.22 |
| west     | Jan   | 633728.45 | 77712.34 |
|          | Feb   | 713780.77 | 84976.29 |
|          | Mar   | 722488.35 | 81275.28 |
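Note: Both level names can also be set in a single step by assigning a list to the index’s names attribute; a sketch of this equivalent alternative:

# Alternative: assign both level names at once
east_west.index.names = ['division', 'month']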
Note: The index level names can also be specified within the concat() function with the names parameter.
east_west = pd.concat([east_div_q1.set_index('month'),
                       west_div_q1.set_index('month')],
                      keys = ('east', 'west'),
                      names = ('division', 'month'))

east_west
| division | month | sales     | profit   |
|----------|-------|-----------|----------|
| east     | Jan   | 433721.45 | 47712.34 |
|          | Feb   | 513720.77 | 52936.29 |
|          | Mar   | 522482.31 | 61295.22 |
| west     | Jan   | 633728.45 | 77712.34 |
|          | Feb   | 713780.77 | 84976.29 |
|          | Mar   | 722488.35 | 81275.28 |
G.5 Practice Exercise 3-5
Display from the east_west DataFrame just the rows for the East division January sales and profit and the West division February sales and profit.
east_west.loc[[('east', 'Jan'), ('west', 'Feb')]]
| division | month | sales     | profit   |
|----------|-------|-----------|----------|
| east     | Jan   | 433721.45 | 47712.34 |
| west     | Feb   | 713780.77 | 84976.29 |
G.6 Practice Exercise 3-6
Display from the east_west DataFrame just the rows for January sales and profit for both divisions.
# Solution using query()
east_west.query('month == "Jan"')
| division | month | sales     | profit   |
|----------|-------|-----------|----------|
| east     | Jan   | 433721.45 | 47712.34 |
| west     | Jan   | 633728.45 | 77712.34 |
# Solution using index.get_level_values() with level position specified
east_west.loc[east_west.index.get_level_values(1) == "Jan"]
| division | month | sales     | profit   |
|----------|-------|-----------|----------|
| east     | Jan   | 433721.45 | 47712.34 |
| west     | Jan   | 633728.45 | 77712.34 |
# Solution using index.get_level_values() with level label specified
east_west.loc[east_west.index.get_level_values('month') == "Jan"]
| division | month | sales     | profit   |
|----------|-------|-----------|----------|
| east     | Jan   | 433721.45 | 47712.34 |
| west     | Jan   | 633728.45 | 77712.34 |
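Another option, not required by the exercise, is the xs() cross-section method, which selects all rows with a given value at one index level. By default it drops that level from the result, so pass drop_level = False to keep the output shaped like the solutions above:

# Solution using xs() to take a cross-section at the 'month' level
east_west.xs('Jan', level = 'month', drop_level = False)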
G.7 Practice Exercise 3-7
Merge the boxscores data with the teams data in a way that will keep all the rows of the boxscores data, even if the team’s record is not found in the team records data. Create a new DataFrame named boxscores_2 from the merged data. Use the info() method to investigate the boxscores_2 DataFrame.
boxscores_2 = boxscores.merge(teams, on = 'team', how = 'left')

boxscores_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26498 entries, 0 to 26497
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 26498 non-null datetime64[ns]
1 team 26498 non-null object
2 opponent 26498 non-null object
3 player 26498 non-null object
4 position 26498 non-null object
5 min 26498 non-null int64
6 oreb 26498 non-null int64
7 dreb 26498 non-null int64
8 reb 26498 non-null int64
9 ast 26498 non-null int64
10 stl 26498 non-null int64
11 blk 26498 non-null int64
12 to 26498 non-null int64
13 pf 26498 non-null int64
14 plus_m 26498 non-null int64
15 pts 26498 non-null int64
16 fg_m 26498 non-null int64
17 fg_a 26498 non-null int64
18 tres_m 26498 non-null int64
19 tres_a 26498 non-null int64
20 ft_m 26498 non-null int64
21 ft_a 26498 non-null int64
22 wins 26498 non-null int64
23 losses 26498 non-null int64
24 3_pt_g 26498 non-null object
25 10_pt_g 26498 non-null object
26 vs_ge_500 26498 non-null object
27 vs_lt_500 26498 non-null object
28 ot_g 26498 non-null object
dtypes: datetime64[ns](1), int64(19), object(9)
memory usage: 5.9+ MB
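Since teams has one row per team, this merge should be many-to-one. The merge() method can assert that expectation with its validate parameter, which raises an error if the relationship is violated; a sketch:

# Same merge, but assert a many-to-one relationship between boxscores and teams
boxscores_2 = boxscores.merge(teams, on = 'team', how = 'left', validate = 'm:1')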
G.8 Practice Exercise 3-8
Which player playing for which team had the highest scoring average while playing in at least 50 games for a team with a losing record?
(boxscores_2
 .query('losses > wins')
 .groupby(['player', 'team'])
 .agg(games_for_team = ('date', 'count'),
      score_avg = ('pts', 'mean'))
 .query('games_for_team >= 50')
 .nlargest(n = 1, columns = 'score_avg', keep = 'all')
)
| player     | team | games_for_team | score_avg |
|------------|------|----------------|-----------|
| Trae Young | ATL  | 54             | 25.722222 |
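For comparison, the same result can be reached by sorting on score_avg and taking the top row; a sketch, with the caveat that head(1) would hide ties that nlargest() with keep = 'all' would show:

# Alternative: sort descending by scoring average and take the top row
(boxscores_2
 .query('losses > wins')
 .groupby(['player', 'team'])
 .agg(games_for_team = ('date', 'count'),
      score_avg = ('pts', 'mean'))
 .query('games_for_team >= 50')
 .sort_values('score_avg', ascending = False)
 .head(1)
)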
G.9 Practice Exercise 3-9
Investigating merge types
Create a DataFrame named bs_sub by randomly sampling 12 rows from the boxscores DataFrame and keeping only the player, team, and position columns. Create another DataFrame named teams_sub by randomly sampling 12 rows from the teams DataFrame, keeping only the team, wins, and losses columns.

After creating the two DataFrames, run value_counts() on the team column of each DataFrame. You should see that the relationship between bs_sub and teams_sub has the potential to be many-to-one, since there may be multiple rows in bs_sub for the same team, while teams_sub will only have one row for each team. In a merge each row from one input DataFrame is merged with every row in the other input DataFrame that has the same key value.

Next, show the results of four different merges with bs_sub as the left DataFrame and teams_sub as the right DataFrame: an inner merge, an outer merge, a right merge, and a left merge. Set the indicator = True parameter to help you investigate the results of the various merge types.

For each merge note how many rows are in the result and examine where missing values are created.

See if you can find evidence of a many-to-one relationship between bs_sub and teams_sub.

Since the two input DataFrames are created through random sampling you can run your code multiple times to get more chances to study the merges.
bs_sub = boxscores.sample(12).loc[:, ['player', 'team', 'position']]
teams_sub = teams.sample(12).loc[:, ['team', 'wins', 'losses']]
bs_sub['team'].value_counts()
team
IND 2
MIL 2
OKC 1
DEN 1
MIN 1
PHX 1
CLE 1
NO 1
LAC 1
MEM 1
Name: count, dtype: int64
teams_sub['team'].value_counts()
team
SAC 1
NO 1
MEM 1
CHA 1
MIL 1
HOU 1
ATL 1
NY 1
BOS 1
BKN 1
LAL 1
POR 1
Name: count, dtype: int64
# Inner merge - only keeps rows for which a matching team is found in both DataFrames
pd.merge(bs_sub, teams_sub, on = 'team', how = 'inner', indicator = True)
|   | player | team | position | wins | losses | _merge |
|---|--------|------|----------|------|--------|--------|
0 | J. Crowder | MIL | PF | 49 | 33 | both |
1 | K. Lewis Jr. | NO | PG | 49 | 33 | both |
2 | J. Jackson Jr. | MEM | PF | 27 | 55 | both |
3 | A. Green | MIL | G | 49 | 33 | both |
# Outer merge - all rows are kept from both DataFrames
pd.merge(bs_sub, teams_sub, on = 'team', how = 'outer', indicator = True)
|   | player | team | position | wins | losses | _merge |
|---|--------|------|----------|------|--------|--------|
0 | NaN | ATL | NaN | 36.0 | 46.0 | right_only |
1 | NaN | BKN | NaN | 32.0 | 50.0 | right_only |
2 | NaN | BOS | NaN | 64.0 | 18.0 | right_only |
3 | NaN | CHA | NaN | 21.0 | 61.0 | right_only |
4 | I. Okoro | CLE | SF | NaN | NaN | left_only |
5 | A. Gordon | DEN | PF | NaN | NaN | left_only |
6 | NaN | HOU | NaN | 41.0 | 41.0 | right_only |
7 | M. Turner | IND | C | NaN | NaN | left_only |
8 | A. Nembhard | IND | PG | NaN | NaN | left_only |
9 | N. Powell | LAC | G | NaN | NaN | left_only |
10 | NaN | LAL | NaN | 47.0 | 35.0 | right_only |
11 | J. Jackson Jr. | MEM | PF | 27.0 | 55.0 | both |
12 | J. Crowder | MIL | PF | 49.0 | 33.0 | both |
13 | A. Green | MIL | G | 49.0 | 33.0 | both |
14 | M. Conley | MIN | PG | NaN | NaN | left_only |
15 | K. Lewis Jr. | NO | PG | 49.0 | 33.0 | both |
16 | NaN | NY | NaN | 50.0 | 32.0 | right_only |
17 | O. Dieng | OKC | F | NaN | NaN | left_only |
18 | B. Bol | PHX | C | NaN | NaN | left_only |
19 | NaN | POR | NaN | 21.0 | 61.0 | right_only |
20 | NaN | SAC | NaN | 46.0 | 36.0 | right_only |
# Left merge - keeps all rows from the left DataFrame
pd.merge(bs_sub, teams_sub, on = 'team', how = 'left', indicator = True)
|   | player | team | position | wins | losses | _merge |
|---|--------|------|----------|------|--------|--------|
0 | A. Gordon | DEN | PF | NaN | NaN | left_only |
1 | O. Dieng | OKC | F | NaN | NaN | left_only |
2 | J. Crowder | MIL | PF | 49.0 | 33.0 | both |
3 | M. Conley | MIN | PG | NaN | NaN | left_only |
4 | B. Bol | PHX | C | NaN | NaN | left_only |
5 | M. Turner | IND | C | NaN | NaN | left_only |
6 | I. Okoro | CLE | SF | NaN | NaN | left_only |
7 | K. Lewis Jr. | NO | PG | 49.0 | 33.0 | both |
8 | N. Powell | LAC | G | NaN | NaN | left_only |
9 | A. Nembhard | IND | PG | NaN | NaN | left_only |
10 | J. Jackson Jr. | MEM | PF | 27.0 | 55.0 | both |
11 | A. Green | MIL | G | 49.0 | 33.0 | both |
# Right merge - keeps all rows from the right DataFrame
pd.merge(bs_sub, teams_sub, on = 'team', how = 'right', indicator = True)
|   | player | team | position | wins | losses | _merge |
|---|--------|------|----------|------|--------|--------|
0 | NaN | SAC | NaN | 46 | 36 | right_only |
1 | K. Lewis Jr. | NO | PG | 49 | 33 | both |
2 | J. Jackson Jr. | MEM | PF | 27 | 55 | both |
3 | NaN | CHA | NaN | 21 | 61 | right_only |
4 | J. Crowder | MIL | PF | 49 | 33 | both |
5 | A. Green | MIL | G | 49 | 33 | both |
6 | NaN | HOU | NaN | 41 | 41 | right_only |
7 | NaN | ATL | NaN | 36 | 46 | right_only |
8 | NaN | NY | NaN | 50 | 32 | right_only |
9 | NaN | BOS | NaN | 64 | 18 | right_only |
10 | NaN | BKN | NaN | 32 | 50 | right_only |
11 | NaN | LAL | NaN | 47 | 35 | right_only |
12 | NaN | POR | NaN | 21 | 61 | right_only |
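In the sample output above, for example, the single MIL row from teams_sub matched two MIL rows from bs_sub, which is evidence of a many-to-one relationship. The _merge column added by indicator = True can also be tabulated to quantify how the rows matched; a sketch, with outer as an illustrative name:

# Count how many rows came from each source in the outer merge
outer = pd.merge(bs_sub, teams_sub, on = 'team', how = 'outer', indicator = True)
outer['_merge'].value_counts()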
G.10 Practice Exercise 3-10
Create a subset of the boxscores DataFrame, named bucks, that only has the statlines for players playing for the Milwaukee Bucks (“MIL” is the team code), and only includes the following columns: date, player, team, pts, reb, ast, stl, blk.

Display a random sample of 5 rows of the bucks DataFrame. This format is called “wide” format because each statistic has its own column, which makes the DataFrame wide.
bucks = (boxscores
         .query('team == "MIL"')
         .loc[:, ['date', 'player', 'team', 'pts', 'reb', 'ast', 'stl', 'blk']]
        )

bucks.sample(5)
|   | date | player | team | pts | reb | ast | stl | blk |
|---|------|--------|------|-----|-----|-----|-----|-----|
802 | 2023-10-29 | P. Connaughton | MIL | 5 | 3 | 2 | 2 | 0 |
5119 | 2023-11-26 | B. Lopez | MIL | 10 | 6 | 3 | 3 | 4 |
17625 | 2024-02-15 | G. Antetokounmpo | MIL | 35 | 4 | 12 | 0 | 0 |
8987 | 2023-12-23 | K. Middleton | MIL | 20 | 5 | 5 | 0 | 0 |
19984 | 2024-03-06 | B. Portis | MIL | 20 | 6 | 1 | 0 | 0 |
Make a new DataFrame named bucks_long that is the result of transforming the bucks DataFrame to “long” format. All the columns that represent statistics should be transformed into two columns: stat and value. Display several rows of the bucks_long DataFrame.
bucks_long = bucks.melt(id_vars = ['date', 'player', 'team'],
                        var_name = 'stat')

bucks_long.sample(7)
|   | date | player | team | stat | value |
|---|------|--------|------|------|-------|
1435 | 2024-01-29 | P. Connaughton | MIL | reb | 1 |
675 | 2024-03-01 | B. Lopez | MIL | pts | 12 |
3502 | 2024-03-20 | P. Beverley | MIL | stl | 0 |
713 | 2024-03-08 | D. Gallinari | MIL | pts | 0 |
3832 | 2023-11-26 | B. Lopez | MIL | blk | 4 |
2158 | 2023-12-25 | C. Payne | MIL | ast | 1 |
1778 | 2024-04-09 | D. Lillard | MIL | reb | 3 |
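Note: the value column is named value here because that is melt()’s default; it could also be set explicitly with the value_name parameter. A sketch of the equivalent call:

# Equivalent melt() call with the value column named explicitly
bucks_long = bucks.melt(id_vars = ['date', 'player', 'team'],
                        var_name = 'stat', value_name = 'value')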
G.11 Practice Exercise 3-11
Make a new DataFrame named bucks_wide by transforming the bucks_long DataFrame back to wide format. Verify your transformation by displaying several randomly-selected rows from bucks_wide.
bucks_wide = pd.pivot(bucks_long,
                      index = ['date', 'player', 'team'],
                      columns = 'stat',
                      values = 'value')

bucks_wide.sample(7)
| date       | player           | team | ast | blk | pts | reb | stl |
|------------|------------------|------|-----|-----|-----|-----|-----|
| 2023-11-17 | K. Middleton     | MIL  | 1   | 0   | 12  | 6   | 1   |
| 2023-11-24 | G. Antetokounmpo | MIL  | 3   | 1   | 31  | 9   | 3   |
|            | D. Lillard       | MIL  | 10  | 0   | 31  | 6   | 0   |
| 2024-02-13 | M. Beauchamp     | MIL  | 0   | 0   | 8   | 2   | 1   |
| 2024-01-20 | B. Portis        | MIL  | 0   | 0   | 4   | 3   | 0   |
| 2024-01-03 | K. Middleton     | MIL  | 7   | 2   | 19  | 6   | 0   |
| 2024-02-03 | G. Antetokounmpo | MIL  | 10  | 1   | 48  | 6   | 5   |
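To get back a flat DataFrame with the same layout as the original bucks, the row MultiIndex can be reset and the leftover columns-axis name removed; a sketch, with bucks_wide_flat as an illustrative name:

# Flatten the pivoted result back to ordinary columns
bucks_wide_flat = bucks_wide.reset_index()
bucks_wide_flat.columns.name = None  # drop the 'stat' axis label left by pivot()
bucks_wide_flat.sample(5)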
G.12 Practice Exercise 3-12
In the teams DataFrame several columns have the teams’ records for various contexts. The data is more useful if we split out the wins and losses into their own columns. Based on the ot_g column create three new columns in the DataFrame:

ot_win_pct - win percentage in overtime games
ot_wins - number of overtime games won
ot_losses - number of overtime games lost
teams['ot_wins'] = teams['ot_g'].str.split('-').str.get(0).astype('int')
teams['ot_losses'] = teams['ot_g'].str.split('-').str.get(1).astype('int')
teams['ot_win_pct'] = teams['ot_wins'] / (teams['ot_wins'] + teams['ot_losses'])

teams.sample(5)
|   | team | wins | losses | 3_pt_g | 10_pt_g | vs_ge_500 | vs_lt_500 | ot_g | ot_wins | ot_losses | ot_win_pct |
|---|------|------|--------|--------|---------|-----------|-----------|------|---------|-----------|------------|
10 | HOU | 41 | 41 | 8-5 | 23-17 | 20-32 | 21-9 | 2-5 | 2 | 5 | 0.285714 |
29 | WSH | 15 | 67 | 4-5 | 6-35 | 4-47 | 11-20 | 0-2 | 0 | 2 | 0.000000 |
15 | MIA | 46 | 36 | 8-4 | 21-15 | 19-27 | 27-9 | 2-0 | 2 | 0 | 1.000000 |
28 | UTAH | 31 | 51 | 5-8 | 15-33 | 15-43 | 16-8 | 2-2 | 2 | 2 | 0.500000 |
26 | SA | 22 | 60 | 5-7 | 12-31 | 12-45 | 10-15 | 3-2 | 3 | 2 | 0.600000 |
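The record strings could also be split once with expand = True, which returns the two pieces as DataFrame columns rather than lists; a sketch, with ot as an illustrative name:

# Alternative: split once and expand into two integer columns
ot = teams['ot_g'].str.split('-', expand = True).astype('int')
teams['ot_wins'] = ot[0]
teams['ot_losses'] = ot[1]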
G.13 Practice Exercise 3-13
Examine the batting and people DataFrames. The batting DataFrame has batting statistics for major league baseball players in the U.S. from 1871 to 2022. The people DataFrame has lots of information on major league baseball players. Both DataFrames have a playerID column that may be used to merge them if you want to answer a question that requires data from both DataFrames.

Next, compare the collective batting averages (total hits / total at bats) for players named “Jim” vs players named “Tim.” You should calculate one batting average for each name that reflects the hitting performance of players with that first name across history.
batting.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112184 entries, 0 to 112183
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 playerID 112184 non-null object
1 yearID 112184 non-null int64
2 stint 112184 non-null int64
3 teamID 112184 non-null object
4 lgID 111447 non-null object
5 G 112184 non-null int64
6 AB 112184 non-null int64
7 R 112184 non-null int64
8 H 112184 non-null int64
9 2B 112184 non-null int64
10 3B 112184 non-null int64
11 HR 112184 non-null int64
12 RBI 111428 non-null float64
13 SB 109816 non-null float64
14 CS 88642 non-null float64
15 BB 112184 non-null int64
16 SO 110084 non-null float64
17 IBB 75533 non-null float64
18 HBP 109368 non-null float64
19 SH 106116 non-null float64
20 SF 76080 non-null float64
21 GIDP 86742 non-null float64
dtypes: float64(9), int64(10), object(3)
memory usage: 18.8+ MB
people.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20676 entries, 0 to 20675
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 playerID 20676 non-null object
1 birthYear 20567 non-null float64
2 birthMonth 20398 non-null float64
3 birthDay 20256 non-null float64
4 birthCountry 20617 non-null object
5 birthState 20136 non-null object
6 birthCity 20508 non-null object
7 deathYear 10098 non-null float64
8 deathMonth 10097 non-null float64
9 deathDay 10096 non-null float64
10 deathCountry 10094 non-null object
11 deathState 10038 non-null object
12 deathCity 10089 non-null object
13 nameFirst 20639 non-null object
14 nameLast 20676 non-null object
15 nameGiven 20639 non-null object
16 weight 19864 non-null float64
17 height 19944 non-null float64
18 bats 19498 non-null object
19 throws 19699 non-null object
20 debut 20463 non-null object
21 finalGame 20463 non-null object
22 retroID 20627 non-null object
23 bbrefID 20665 non-null object
dtypes: float64(8), object(16)
memory usage: 3.8+ MB
# Confirm that there is one row per playerID in the people DataFrame
# There are 20,676 rows and 20,676 unique player IDs, so there must be one row per player ID.
people['playerID'].nunique()
20676
# Are there any player IDs in batting that aren't in people?
# The empty set below shows that every playerID in batting also appears in people.
set(batting['playerID'].unique()) - set(people['playerID'].unique())
set()
(batting
 .merge(people, on = 'playerID', how = 'left')
 .query('nameFirst == "Jim" or nameFirst == "Tim"')
 .groupby('nameFirst')
 .agg(total_hits = ('H', 'sum'),
      total_at_bats = ('AB', 'sum'))
 .assign(bat_avg = lambda df: df['total_hits'] / df['total_at_bats'])
)
| nameFirst | total_hits | total_at_bats | bat_avg |
|-----------|------------|---------------|---------|
Jim | 65555 | 260699 | 0.251459 |
Tim | 19916 | 78868 | 0.252523 |
G.14 Practice Exercise 3-14
For years 1990 and later, show average height and weight of professional baseball players in the U.S., by year. Be careful not to double-count any players in any one year!
(batting
 [['yearID', 'playerID']].drop_duplicates()
 .query('yearID >= 1990')
 .merge(people, on = 'playerID', how = 'left')
 .groupby('yearID')
 [['height', 'weight']].mean()
)
| yearID | height | weight |
|--------|--------|--------|
1990 | 73.367961 | 188.076699 |
1991 | 73.328820 | 188.611219 |
1992 | 73.375372 | 189.064548 |
1993 | 73.442935 | 190.302536 |
1994 | 73.453078 | 189.909183 |
1995 | 73.366284 | 190.443954 |
1996 | 73.464505 | 191.544259 |
1997 | 73.399822 | 192.233304 |
1998 | 73.428331 | 193.239460 |
1999 | 73.474773 | 194.637717 |
2000 | 73.471545 | 195.591057 |
2001 | 73.454098 | 197.180328 |
2002 | 73.467980 | 198.798030 |
2003 | 73.473984 | 199.674797 |
2004 | 73.492382 | 201.206095 |
2005 | 73.489086 | 203.257074 |
2006 | 73.586957 | 206.064412 |
2007 | 73.659624 | 208.138498 |
2008 | 73.619675 | 209.405112 |
2009 | 73.676145 | 210.467615 |
2010 | 73.668535 | 212.336269 |
2011 | 73.656371 | 213.244015 |
2012 | 73.663551 | 213.104361 |
2013 | 73.625767 | 213.030675 |
2014 | 73.637121 | 213.452273 |
2015 | 73.674332 | 213.088279 |
2016 | 73.737620 | 213.121951 |
2017 | 73.722386 | 212.988954 |
2018 | 73.727339 | 212.629442 |
2019 | 73.727660 | 212.746809 |
2020 | 73.775795 | 212.696664 |
2021 | 73.777188 | 211.017905 |
2022 | 73.712375 | 208.642809 |
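Note: the drop_duplicates() step matters because batting can have several rows for the same player in the same year (one per stint, recorded in the stint column, for players who changed teams mid-season); without it those players would be counted more than once in a year’s average. A quick check of how often that happens, as a sketch:

# Count player-years since 1990 that have more than one batting row
(batting
 .query('yearID >= 1990')
 .groupby(['yearID', 'playerID'])
 .size()
 .gt(1)
 .sum()
)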