import pandas as pd
import numpy as np
# Create DataFrames for concatenation practice
= pd.DataFrame({
east_div_q1 'month': 'Jan Feb Mar'.split(),
'sales': [433_721.45, 513_720.77, 522_482.31],
'profit': [47_712.34, 52_936.29, 61_295.22]
})
= pd.DataFrame({
west_div_q1 'month': 'Jan Feb Mar'.split(),
'sales': [633_728.45, 713_780.77, 722_488.35],
'profit': [77_712.34, 84_976.29, 81_275.28]
})
# Import the NBA boxscore data
= "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
url = pd.read_csv(url)
boxscores
# Convert the date column in the box score data to a datetime type
'date'] = pd.to_datetime(boxscores['date'], format = '%Y%m%d')
boxscores[
# Import the NBA team records data
= "https://neuronjolt.com/data/nba_team_recs_23-24.csv"
url = pd.read_csv(url)
teams
# Import the baseball batting and people data
= pd.read_csv('https://neuronjolt.com/data/Batting.csv')
batting = pd.read_csv('https://neuronjolt.com/data/People.csv') people
Appendix F — Chapter 3 Practice
Click here to download this chapter as a Jupyter (.ipynb) file.
The code below loads pandas and numpy, creates some sample DataFrames for concatenation practice and reads four datasets into DataFrames.
The first DataFrame, boxscores
DataFrame is the box score data from the 2023-24 NBA season. Each row represents a player-game, that is, one player’s statistics in one game.
The second DataFrame, teams
, has NBA team records from the 2023-24 NBA season. Following are the column definitions in the teams
DataFrame:
wins
- number of wins for the seasonlosses
- number of losses for the season3_pt_g
- the team’s record (wins-losses) in games decided by 3 points or fewer10_pt_g
- the team’s record (wins-losses) in games decided by 10 points or morevs_ge_500
- the team’s record (wins-losses) vs teams that won more than half their gamesvs_lt_500
- the team’s record (wins-losses) vs teams that lost more than half their gamesot_g
- the team’s record (wins-losses) in games that were decided in overtime
The batting
DataFrame has batting data for major league baseball in the U.S. since 1871.
The people
DataFrame has information on people associated with major league baseball in the U.S..
F.1 Practice Exercise 3-1
The east_div_q1
DataFrame has the monthly sales and profit for the East division for the first quarter, and the west_div_q1
DataFrame has the monthly sales and profit for the West division for the first quarter. Concatenate those two DataFrames in a way that preserves the information about what division the monthly sales and profit represent.
F.2 Practice Exercise 3-2
Concatenate the east_div_q1
and west_div_q1
DataFrames and then produce from the result a DataFrame named combined
that has four columns: division
, month
, sales
and profit
and an integer row index with no duplicates.
F.3 Practice Exercise 3-3
Concatenate the east_div_q1
and west_div_q1
DataFrames in a way that produces a DataFrame named east_west
with a MultiIndex for the row index, with the first level of the row index indicating the division and the second level of the row index indicating the month.
F.4 Practice Exercise 3-4
The east_west
DataFrame that you created for the previous exercise has a MultiIndex for the row index. Manipulate the names for the levels of the row index so that the first level is named ‘division’ and the second level is named ‘month’.
F.5 Practice Exercise 3-5
Display from the east_west
DataFrame just the rows for the East division January sales and profit and the West division February sales and profit.
F.6 Practice Exercise 3-6
Display from the east_west
DataFrame just the rows for January sales and profit for both divisions.
F.7 Practice Exercise 3-7
Merge the boxscores data with the teams data in a way that will keep all the rows of the boxscores data, even if the team’s record is not found in the team records data. Create a new DataFrame named boxscores_2
from the merged data. Use the info()
method to investigate the boxscores_2
DataFrame.
F.8 Practice Exercise 3-8
Which player playing for which team had the highest scoring average while playing in at least 50 games for a team with a losing record?
F.9 Practice Exercise 3-9
Investigating merge types
Create a DataFrame named bs_sub
by randomly sampling 12 rows from the boxscores
DataFrame and keeping only the player
, team
, and position
columns. Create another DataFrame named teams_sub
by randomly sampling 12 rows from the teams
DataFrame, keeping only the team
, wins
, and losses
columns.
After creating the two DataFrames, run value_counts()
on the team
column of each DataFrame. You should see that the relationship between bs_sub
and teams_sub
has the potential to be many-to-one, since there may be multiple rows in bs_sub
for the same team, while teams_sub
will only have one row for each team. In a merge each row from one input DataFrame is merged with every row in the other input DataFrame that has the same key value.
Next, show the results of four different merges with bs_sub
as the left DataFrame and teams_sub
as the right DataFrame: an inner merge, an outer merge, a right merge, and a left merge. Set the indicator = True
parameter to help you investigate the results of the various merge types.
For each merge note how many rows are in the result and examine where missing values are created.
See if you can evidence of a many-to-one relationship between bs_sub
and teams_sub
.
Since the two input DataFrames are created through random sampling you can run your code multiple times to have more chance to study the merges.
F.10 Practice Exercise 3-10
Create a subset of the boxscores DataFrame, named bucks
, that only has the statlines for players playing for the Milwaukee Bucks (“MIL” is the team code), and only includes the following columns: date
, player
, team
, pts
, reb
, ast
, stl
, blk
.
Display a random sample of 5 rows of the bucks
DataFrame. This format is called “wide” format because each statistic has its own column, which makes the DataFrame wide.
Make a new DataFrame named bucks_long
that is the result of transforming the bucks
DataFrame to “long” format. All the columns that represent statistics should be transformed into two columns: stat
and value
. Display several rows of the bucks_long
DataFrame.
F.11 Practice Exercise 3-11
Make a new DataFrame named bucks_wide
by transforming the bucks_long
DataFrame back to wide format. Verify your tranformation by displaying several randomly-selected rows from bucks_wide
.
F.12 Practice Exercise 3-12
In the teams
DataFrame several columns have the teams’ records for various contexts. The data is more useful if we split out the wins and losses into their own columns. Based on the ot_g
column create three new columns in the DataFrame:
ot_win_pct
- win percentage in overtime gamesot_wins
- number of overtime games wonot_losses
- number of overtime games lost
F.13 Practice Exercise 3-13
Examine the batting
and people
DataFrames. Batting has batting statistics for major league baseball players in the U.S. from 1871 to 2022. The people
DataFrame has lots of information on major league baseball players. Both DataFrames have a playerID
column that may be used to merge them if you want to answer a question that requires data from both DataFrames.
Next, compare the collective batting averages (total hits / total at bats) for players named “Jim” vs players named “Tim.” You should calculate one batting average for each name that reflects the hitting performance of players with that first name across history.
F.14 Practice Exercise 3-14
For years 1990 and later, show average height and weight of professional baseball players in the U.S., by year. Be careful not to double-count any players in any one year!