Appendix F — Chapter 3 Practice

Click here to download this chapter as a Jupyter (.ipynb) file.

The code below loads pandas and numpy, creates some sample DataFrames for concatenation practice and reads four datasets into DataFrames.

The first DataFrame, boxscores DataFrame is the box score data from the 2023-24 NBA season. Each row represents a player-game, that is, one player’s statistics in one game.

The second DataFrame, teams, has NBA team records from the 2023-24 NBA season. Following are the column definitions in the teams DataFrame:

The batting DataFrame has batting data for major league baseball in the U.S. since 1871.

The people DataFrame has information on people associated with major league baseball in the U.S..

import pandas as pd
import numpy as np

# Create DataFrames for concatenation practice

east_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [433_721.45, 513_720.77, 522_482.31],
    'profit': [47_712.34, 52_936.29, 61_295.22]         
})

west_div_q1 = pd.DataFrame({
    'month': 'Jan Feb Mar'.split(),
    'sales': [633_728.45, 713_780.77, 722_488.35],
    'profit': [77_712.34, 84_976.29, 81_275.28]         
})

# Import the NBA boxscore data
url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)

# Convert the date column in the box score data to a datetime type
boxscores['date'] = pd.to_datetime(boxscores['date'], format = '%Y%m%d')

# Import the NBA team records data
url = "https://neuronjolt.com/data/nba_team_recs_23-24.csv"
teams = pd.read_csv(url)

# Import the baseball batting and people data
batting = pd.read_csv('https://neuronjolt.com/data/Batting.csv')
people = pd.read_csv('https://neuronjolt.com/data/People.csv')

F.1 Practice Exercise 3-1

The east_div_q1 DataFrame has the monthly sales and profit for the East division for the first quarter, and the west_div_q1 DataFrame has the monthly sales and profit for the West division for the first quarter. Concatenate those two DataFrames in a way that preserves the information about what division the monthly sales and profit represent.

F.2 Practice Exercise 3-2

Concatenate the east_div_q1 and west_div_q1 DataFrames and then produce from the result a DataFrame named combined that has four columns: division, month, sales and profit and an integer row index with no duplicates.

F.3 Practice Exercise 3-3

Concatenate the east_div_q1 and west_div_q1 DataFrames in a way that produces a DataFrame named east_west with a MultiIndex for the row index, with the first level of the row index indicating the division and the second level of the row index indicating the month.

F.4 Practice Exercise 3-4

The east_west DataFrame that you created for the previous exercise has a MultiIndex for the row index. Manipulate the names for the levels of the row index so that the first level is named ‘division’ and the second level is named ‘month’.

F.5 Practice Exercise 3-5

Display from the east_west DataFrame just the rows for the East division January sales and profit and the West division February sales and profit.

F.6 Practice Exercise 3-6

Display from the east_west DataFrame just the rows for January sales and profit for both divisions.

F.7 Practice Exercise 3-7

Merge the boxscores data with the teams data in a way that will keep all the rows of the boxscores data, even if the team’s record is not found in the team records data. Create a new DataFrame named boxscores_2 from the merged data. Use the info() method to investigate the boxscores_2 DataFrame.

F.8 Practice Exercise 3-8

Which player playing for which team had the highest scoring average while playing in at least 50 games for a team with a losing record?

F.9 Practice Exercise 3-9

Investigating merge types

Create a DataFrame named bs_sub by randomly sampling 12 rows from the boxscores DataFrame and keeping only the player, team, and position columns. Create another DataFrame named teams_sub by randomly sampling 12 rows from the teams DataFrame, keeping only the team, wins, and losses columns.

After creating the two DataFrames, run value_counts() on the team column of each DataFrame. You should see that the relationship between bs_sub and teams_sub has the potential to be many-to-one, since there may be multiple rows in bs_sub for the same team, while teams_sub will only have one row for each team. In a merge each row from one input DataFrame is merged with every row in the other input DataFrame that has the same key value.

Next, show the results of four different merges with bs_sub as the left DataFrame and teams_sub as the right DataFrame: an inner merge, an outer merge, a right merge, and a left merge. Set the indicator = True parameter to help you investigate the results of the various merge types.

For each merge note how many rows are in the result and examine where missing values are created.

See if you can evidence of a many-to-one relationship between bs_sub and teams_sub.

Since the two input DataFrames are created through random sampling you can run your code multiple times to have more chance to study the merges.

F.10 Practice Exercise 3-10

Create a subset of the boxscores DataFrame, named bucks, that only has the statlines for players playing for the Milwaukee Bucks (“MIL” is the team code), and only includes the following columns: date, player, team, pts, reb, ast, stl, blk.

Display a random sample of 5 rows of the bucks DataFrame. This format is called “wide” format because each statistic has its own column, which makes the DataFrame wide.

Make a new DataFrame named bucks_long that is the result of transforming the bucks DataFrame to “long” format. All the columns that represent statistics should be transformed into two columns: stat and value. Display several rows of the bucks_long DataFrame.

F.11 Practice Exercise 3-11

Make a new DataFrame named bucks_wide by transforming the bucks_long DataFrame back to wide format. Verify your tranformation by displaying several randomly-selected rows from bucks_wide.

F.12 Practice Exercise 3-12

In the teams DataFrame several columns have the teams’ records for various contexts. The data is more useful if we split out the wins and losses into their own columns. Based on the ot_g column create three new columns in the DataFrame:

  • ot_win_pct - win percentage in overtime games
  • ot_wins - number of overtime games won
  • ot_losses - number of overtime games lost

F.13 Practice Exercise 3-13

Examine the batting and people DataFrames. Batting has batting statistics for major league baseball players in the U.S. from 1871 to 2022. The people DataFrame has lots of information on major league baseball players. Both DataFrames have a playerID column that may be used to merge them if you want to answer a question that requires data from both DataFrames.

Next, compare the collective batting averages (total hits / total at bats) for players named “Jim” vs players named “Tim.” You should calculate one batting average for each name that reflects the hitting performance of players with that first name across history.

F.14 Practice Exercise 3-14

For years 1990 and later, show average height and weight of professional baseball players in the U.S., by year. Be careful not to double-count any players in any one year!