2  Grouping and Aggregating Data with Pandas

Grouping refers to splitting data into separate groups according to the values in a column or set of columns, which makes it possible to calculate aggregate statistics for each group. For example, if you have sales data divided into the regions North, South, East, and West, you could group by region and then sum sales to calculate total sales per region. This is a very common operation in data analysis. In this chapter we will learn how to use the groupby() method in pandas to group our data, and how to then calculate aggregate statistics on the groups.

This chapter also covers a few more useful topics. We will learn about the rank() method to rank values in a Series and the nlargest() and nsmallest() methods to identify the n largest or n smallest values in a Series, respectively. We will also learn how to use a callable within loc[], which allows us to create longer chains of methods, and how to create a column based on if-then logic applied to another column or Series using the numpy where() function. Finally, we will see how the transform() method of a GroupBy object can calculate an aggregate by group, but produce the aggregate value for each row of the input DataFrame, according to the group of each row.

We will start by learning about grouping. First, let’s import pandas and numpy, set our pandas options, and import the retail store data and NBA boxscore data.

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:,.2f}'.format
# Read the retail store data into a DataFrame
store = pd.read_csv("https://neuronjolt.com/data/superstore.csv")

# Read the NBA boxscores data into a DataFrame
url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)

2.1 Aggregating data by groups with .groupby()

Data analysts often need to divide a dataset into subsets, or groups, according to the values in one or more columns and then calculate some aggregate statistics on each of the groups. For example, we might want to take the sales data and calculate total sales per product ID. To do so we essentially divide the dataset into one group for each unique product ID and then sum the sales for each group. Another example would be if we wanted to calculate the total number of points allowed by each NBA team during the season. We would divide the data into groups according to the value in the opponent column, and then sum the points for each group. These are both examples of grouping by one column and then calculating an aggregate statistic.

Sometimes we might want to group by a combination of columns. For example, instead of calculating the overall total sales per product ID we might want to calculate monthly total sales per product ID to see if we can identify any sales trends. To do so we would need to divide the data into groups, one group for each unique combination of month and product ID, and then calculate the total sales for each group.

A generic term that data analysts sometimes use for this type of data operation is “Split - Apply - Combine”. This is because the data is first split apart, then some function is applied to calculate a statistic or set of statistics for each group, and then the results of the group calculations are combined into a result.

Pandas provides the groupby() method, which can be used to divide a DataFrame into groups according to the values in a column or set of columns, called the group keys. The object that holds the DataFrame divided into groups according to the group keys is called the GroupBy object. After grouping, we can apply various methods to a column or columns selected by using [] on the GroupBy object, similar to how we use [] to indicate a column or columns in a DataFrame. In this way we can calculate aggregate statistics for the groups. The sections below show different ways of calculating aggregate statistics after a groupby() operation in pandas.
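The GroupBy object itself does not compute anything until we ask it to; it simply records how the rows will be split. One way to get a feel for what it holds is to iterate over it, which yields each group key together with the sub-DataFrame of rows belonging to that group. Here is a minimal sketch, assuming the boxscores DataFrame loaded above:

# Each iteration yields a group key and the rows belonging to that group
for position, group_df in boxscores.groupby('position'):
    print(position, group_df.shape)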

2.1.1 Simple aggregation of one or more columns

This data operation involves grouping by one or more columns, applying an aggregating function to a selected column or columns in each of the groups, and then combining the group results for display. The output is a pandas Series if the aggregating function is applied to a single column and a DataFrame if the aggregating function is applied to a list of columns (even if there is only one column in the list). The groupby column or columns become the row index by default, and the values are the results from applying the aggregating function(s) to the selected column(s) for each group. By default, the final results are presented in sorted order of the group keys (the unique values in the groupby columns). Some data analysts set the sort = False parameter within the groupby() to gain efficiency. When sort = False is set, the group keys in the results will be presented in the order they appear in the DataFrame.

Let’s begin by looking at the simplest case: grouping by one column, selecting one column to aggregate, and specifying the aggregating function. The code below calculates total rebounds by position. Note that since only one column is specified to aggregate the result is a Series.

boxscores.groupby('position')['reb'].sum()
position
C     24595
F     11843
G      7383
PF    19771
PG    11729
SF    15819
SG    15972
Name: reb, dtype: int64

We might want to sort this result in descending order of total rebounds. Because we are applying a sort operation after the aggregation it doesn’t make sense to sort the group keys, so we will set the sort = False parameter here.

Since the result is a Series it only has one column and we can sort it without indicating the sort column.

(boxscores
 .groupby('position', sort = False)['reb']
 .sum()
 .sort_values(ascending = False)
)
position
C     24595
PF    19771
SG    15972
SF    15819
F     11843
PG    11729
G      7383
Name: reb, dtype: int64

If we specify a list of columns to aggregate, even if only one column is in the list, the output will be a DataFrame. In the code below notice the double brackets around 'reb'. The outer square brackets are used to specify the column or columns within each group that you want to aggregate. The inner square brackets indicate that 'reb' is a list. Notice that the output is now a DataFrame. Notice, also, that the unique values of the groupby column, position, have now become the row index values. This is the default behavior for groupby().

boxscores.groupby('position')[['reb']].sum()
reb
position
C 24595
F 11843
G 7383
PF 19771
PG 11729
SF 15819
SG 15972

If we don’t want the groupby column values to become the row index we can set the as_index parameter to False within the groupby() method, as in the code below. Notice that when we set as_index = False in the groupby() method position will remain a column in the output instead of becoming the row index.

boxscores.groupby('position', as_index = False)[['reb']].sum()
position reb
0 C 24595
1 F 11843
2 G 7383
3 PF 19771
4 PG 11729
5 SF 15819
6 SG 15972

Since this result is a DataFrame the sort column must be indicated if we want to sort it.

(boxscores
 .groupby('position', as_index = False)[['reb']]
 .sum()
 .sort_values(by = 'reb', ascending = False)
)
position reb
0 C 24595
3 PF 19771
6 SG 15972
5 SF 15819
1 F 11843
4 PG 11729
2 G 7383

We can have more than one level of grouping. To specify more than one level of grouping use a list of column names as the argument for the groupby() method. Notice that when we group by two columns, team and position in the code below, the aggregates are calculated for each unique combination of team and position, and team and position become a multi-level index. We will cover multi-level indexes in more detail in a later chapter.

boxscores.groupby(['team', 'position'])[['reb']].sum()
reb
team position
ATL C 776
F 40
G 155
PF 569
PG 169
... ... ...
WSH G 111
PF 400
PG 179
SF 1282
SG 647

206 rows × 1 columns

We can also specify more than one column to aggregate after the groupby. The following code groups the sales data in the store DataFrame by product category and region, and then calculates total sales and total profit for each group.

(store
 .groupby(['Category', 'Region'])
 [['Sales', 'Profit']]
 .sum()
)
Sales Profit
Category Region
Furniture Central 163,797.16 -2,871.05
East 208,291.20 3,046.17
South 117,298.68 6,771.21
West 252,612.74 11,504.95
Office Supplies Central 167,026.42 8,879.98
East 205,516.05 41,014.58
South 125,651.31 19,986.39
West 220,853.25 52,609.85
Technology Central 170,416.31 33,697.43
East 264,973.98 47,462.04
South 148,771.91 19,991.83
West 251,991.83 44,303.65

In the examples above we have applied the sum() method. To see all the methods that can be applied to a GroupBy object consult the pandas documentation. Some of the methods most commonly used to calculate aggregate statistics on specified columns of a GroupBy object include the following (a short sketch contrasting count() and size() appears after the list):

  • sum()
  • count() - count of the group, excluding missing data
  • describe() - generate descriptive statistics for each group
  • first() - first value in the group
  • idxmax() - index of first occurrence of maximum
  • idxmin() - index of first occurrence of minimum
  • max()
  • min()
  • mean()
  • median()
  • quantile() - can return values at a specified quantile for each group
  • size() - size of the group, including missing data
  • std() - standard deviation
  • value_counts() - applies value_counts() within each group
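To make the difference between count() and size() concrete, here is a minimal sketch using a small, hypothetical DataFrame (named tiny here) that contains a missing value:

tiny = pd.DataFrame({'grp': ['a', 'a', 'b'],
                     'val': [1.0, None, 3.0]})

# count() excludes the missing value: group a -> 1, group b -> 1
tiny.groupby('grp')['val'].count()

# size() counts every row in the group: group a -> 2, group b -> 1
tiny.groupby('grp').size()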

2.1.2 Aggregates on multiple columns with multiple functions

If we want to apply different aggregating functions to different columns we can use the agg() method, and then specify the columns and aggregating functions in one of several ways:

  • with a dictionary, which doesn’t give us control over the names of the resulting columns
  • with named aggregation, which does give us control over the names of the resulting columns and can be accomplished in two different ways: a concise format, or using NamedAgg objects

Let’s look at some examples of each of these methods of applying different aggregating functions to different columns.

2.1.2.1 Using a python dict

The code below uses a python dict to map columns to the function or functions to apply to those columns. In the code below, the dict specifies that the mean and standard deviation should be calculated for both rebounds and assists by position. Note that the functions to be applied are indicated with strings. Note, also, that in the result the original column names become the first level of a multi-level index (called a MultiIndex in pandas), and the function names become the second level of the MultiIndex.

df = boxscores.groupby('position').agg({'reb': ['mean', 'std'], 
                                        'ast': ['mean', 'std']})
df
reb ast
mean std mean std
position
C 6.57 4.60 1.72 2.12
F 3.46 3.10 1.31 1.62
G 2.18 2.22 1.73 2.07
PF 5.41 3.83 2.28 2.43
PG 3.30 2.51 4.88 3.38
SF 3.97 3.01 2.30 2.32
SG 3.36 2.55 2.94 2.65

We will cover multiple-level indexes in more detail in a later chapter. For now, let’s take a look at how to work with them when selecting data. In the code above we saved the result to a DataFrame named df. Let’s take a look at some examples of how to access the various columns.

First, let’s look at the column that represents the standard deviation of assists.

df.loc[:, [('ast', 'std')]]
ast
std
position
C 2.12
F 1.62
G 2.07
PF 2.43
PG 3.38
SF 2.32
SG 2.65

Next, let’s look at both stats for assists.

df.loc[:, 'ast']
mean std
position
C 1.72 2.12
F 1.31 1.62
G 1.73 2.07
PF 2.28 2.43
PG 4.88 3.38
SF 2.30 2.32
SG 2.94 2.65

To provide more control over the names of the result columns, and to avoid creating a MultiIndex, we can use a named aggregation. Below we will see two ways to do a named aggregation.

2.1.2.2 Named aggregation example 1 - concise format

Within the agg() method, keywords are the output column names, and the values assigned to those keywords are tuples whose first element is the column to aggregate and whose second element is the aggregating function to apply to that column. Note here that built-in pandas functions are quoted as strings.

boxscores.groupby('position').agg(mean_reb = ('reb', 'mean'),
                                  std_reb = ('reb', 'std'),
                                  mean_ast = ('ast', 'mean'),
                                  st_ast = ('ast', 'std'))
mean_reb std_reb mean_ast st_ast
position
C 6.57 4.60 1.72 2.12
F 3.46 3.10 1.31 1.62
G 2.18 2.22 1.73 2.07
PF 5.41 3.83 2.28 2.43
PG 3.30 2.51 4.88 3.38
SF 3.97 3.01 2.30 2.32
SG 3.36 2.55 2.94 2.65

2.1.2.3 Named aggregation example 2 - using NamedAgg objects

Another way to do a named aggregation is to use a NamedAgg object, as in the example below. Within the agg() method the keywords are the names for the result columns, but a pandas NamedAgg object is created to indicate the column and the aggregating function to be applied to that column. This is similar to the example above, but more verbose.

(boxscores
 .groupby('team')
 .agg(Number_of_Players=pd.NamedAgg(column = 'player', 
                                    aggfunc = 'nunique'), 
      Total_Rebounds=pd.NamedAgg(column = 'reb', 
                                 aggfunc = 'sum'))
)
Number_of_Players Total_Rebounds
team
ATL 19 3663
BKN 21 3613
BOS 19 3799
CHA 26 3303
CHI 18 3593
CLE 19 3550
DAL 22 3521
DEN 17 3643
DET 31 3553
GS 18 3830
HOU 19 3727
IND 22 3405
LAC 22 3523
LAL 21 3535
MEM 36 3494
MIA 21 3468
MIL 21 3622
MIN 19 3577
NO 21 3605
NY 27 3704
OKC 22 3447
ORL 18 3469
PHI 30 3522
PHX 22 3614
POR 22 3505
SA 20 3627
SAC 20 3607
TOR 30 3498
UTAH 23 3727
WSH 27 3368

2.1.3 Using agg() with a custom aggregation function

The agg() method of a GroupBy object can also take a function to use to do the aggregation. When used in this way its functionality overlaps with the basic groupby, but agg() also lets us provide a custom aggregation function, which we can define in place using a type of callable called a lambda function. A lambda function is a temporary, anonymous (unnamed) function defined in-place with the lambda keyword. A lambda function can evaluate and return only one expression. It has the following form:

lambda input: expression

The lambda function is an unnamed function object that behaves like a function with the following definition:

def <lambda>(input):
    return expression
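For example, a lambda that computes the range (maximum minus minimum) of a Series behaves just like an equivalent named function. A minimal sketch, where the name spread is hypothetical:

# Named version
def spread(s):
    return s.max() - s.min()

# Anonymous, in-place version of the same calculation, passed directly to agg()
boxscores.groupby('position')['pts'].agg(lambda s: s.max() - s.min())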

Let’s define a simple DataFrame that has some missing values to illustrate this use of agg(). This DataFrame is called scores and contains student IDs, sections, and their scores on two quizzes.

scores = pd.DataFrame({
    'student': 's1 s2 s3 s4 s5 s6 s7 s8 s9 s10'.split(),
    'section': 'A A B B A B B A A B'.split(),
    'q1': [99, 97, 75, None, 86, None, 79, 91, 90, 88],
    'q2': [None, 89, None, None, 88, 60, 85, 81, 80, None]
})

scores
student section q1 q2
0 s1 A 99.00 NaN
1 s2 A 97.00 89.00
2 s3 B 75.00 NaN
3 s4 B NaN NaN
4 s5 A 86.00 88.00
5 s6 B NaN 60.00
6 s7 B 79.00 85.00
7 s8 A 91.00 81.00
8 s9 A 90.00 80.00
9 s10 B 88.00 NaN

Let’s use a basic groupby syntax to calculate the mean score on each quiz for each section. Recall that the mean() method by default ignores missing values.

scores.groupby('section')[['q1', 'q2']].mean()
q1 q2
section
A 92.60 84.50
B 80.67 72.50

We could also use the agg() method to calculate the mean for each quiz. This functionality overlaps with the functionality of the basic groupby.

scores.groupby('section')[['q1', 'q2']].agg('mean')
q1 q2
section
A 92.60 84.50
B 80.67 72.50

Within agg() the function that is specified is applied to each of the input Series. In the code example above the mean() function is applied to q1 in section A, q2 in section A, q1 in section B, and q2 in section B to produce the output.

With agg() we can also specify a custom function to use to do the aggregation. Recall that we can calculate the proportion of a Series that is missing by applying isna() to the Series to create an array of Boolean values, and then applying mean() to the array of Boolean values to calculate the proportion of them that are True. In the code below we use the lambda keyword to define a custom function to operate on each of the input Series. This custom function applies isna() to the Series and then chains that result to the mean() method using the dot operator .. The x after the lambda keyword is the name we give to the input Series so that we can refer to it in the expression that comes after the :. The expression that comes after the : is what the lambda function returns.

So, the code below will calculate the proportion of missing values for each quiz for each section!

scores.groupby('section')[['q1', 'q2']].agg(lambda x: x.isna().mean())
q1 q2
section
A 0.00 0.20
B 0.40 0.60

2.2 Ranking values with rank()

It is common to want to rank data values, or to want to see the largest or smallest values. In this section we will see how to do various types of ranking with the rank() method, and how to find a specified number of largest or smallest values with the nlargest() and nsmallest() methods.

The rank() method returns the rank of a value, and provides several options for dealing with ties in ranking. Let’s use the rank() method to investigate the top NBA players according to total points, rebounds, and assists over the season. First, we use groupby() to calculate totals by player.

player_totals = (boxscores
                 .groupby('player', as_index = False)
                 .agg(total_points = ('pts', 'sum'),
                      total_rebounds = ('reb', 'sum'),
                      total_assists = ('ast', 'sum'))
                )
player_totals.sample(6)
player total_points total_rebounds total_assists
560 Tre Mann 382 149 165
64 Bojan Bogdanovic 868 152 97
386 M. Bamba 253 238 41
574 X. Moon 33 18 21
90 C. Porter 285 109 118
462 P. Beverley 451 239 212

Next, let’s add columns for the players’ ranks on each of those statistics, in descending order, so that the player with rank one is the player with the highest total. We will also add a column for the sum of the player’s ranks in each of those statistics.

player_totals['pts_rank'] = player_totals['total_points'].rank(ascending = False)
player_totals['reb_rank'] = player_totals['total_rebounds'].rank(ascending = False)
player_totals['ast_rank'] = player_totals['total_assists'].rank(ascending = False)
player_totals['rank_sum'] = (player_totals['pts_rank'] 
                             + player_totals['reb_rank'] 
                             + player_totals['ast_rank'])

So, the player with the best average rank across the three statistics is the player with the lowest number in the rank_sum column. Let’s sort the player_totals DataFrame by that column, in ascending order, and take a look at the top 5 players.

player_totals.sort_values(by = 'rank_sum').head(5)
player total_points total_rebounds total_assists pts_rank reb_rank ast_rank rank_sum
434 N. Jokic 2085 976 708 5.00 3.00 2.00 10.00
369 L. Doncic 2370 647 686 1.00 18.00 3.00 22.00
179 G. Antetokounmpo 2222 841 476 3.00 5.00 16.00 24.00
145 D. Sabonis 1593 1120 673 26.00 1.00 4.00 31.00
372 L. James 1822 518 589 14.00 34.00 6.50 54.50

Notice that LeBron James is ranked 6.50 for assists. What is going on there? Let’s take a look at the top 8 players in total assists to investigate. We sort the player_totals DataFrame by total assists, in descending order, and then use the head() method to display the first 8 rows.

(player_totals
 .sort_values(by = 'total_assists', ascending = False)
 .loc[:, ['player', 'total_assists']]
 .head(8)
)
player total_assists
529 T. Haliburton 752
434 N. Jokic 708
369 L. Doncic 686
145 D. Sabonis 673
241 J. Harden 614
176 F. VanVleet 589
372 L. James 589
558 Trae Young 583

Notice that there is a tie in the 6th and 7th positions. Fred VanVleet and LeBron James both totaled 589 assists on the season. This brings up an important issue to consider when ranking: how should we handle ties? Fortunately, the rank() method has several ways of dealing with ties, specified by the argument to the method parameter. Following are the arguments and what they mean:

  • 'average' - average rank of all the ties in the group. So, for example, if there is a tie for the 6th and 7th positions, as with VanVleet and James, both get the average rank, 6.5
  • 'min' - the lowest rank of all the ties in the group
  • 'max' - the highest rank of all the ties in the group
  • 'first' - ranks are assigned in the order the values appear in the input array
  • 'dense' - like 'min', but the rank always increases by 1 between groups of tied values

LeBron James and Fred VanVleet both have rank 6.5 for assists because by default rank() uses the 'average' method.

Let’s take a look at an example to see these different ranking methods in action. Below a DataFrame named df_1 is created, with several names. Each name has a total and is part of a group.

df_1 = pd.DataFrame({
    'name': 'Bob Frank Jim Lisa Steph Ava Aria Mel Biff Sofia'.split(),
    'total': [1, 2, 3, 3, 4, 4, 4, 4, 5, 6],
    'group': 'a b a a a b a b a b'.split()
})

df_1
name total group
0 Bob 1 a
1 Frank 2 b
2 Jim 3 a
3 Lisa 3 a
4 Steph 4 a
5 Ava 4 b
6 Aria 4 a
7 Mel 4 b
8 Biff 5 a
9 Sofia 6 b

In the code example below we create new columns to show the results of ranking according to the various methods. Take a look at the various ranks, compared to the values in the total column. The data is presented in increasing order of the total column to make the ranks easier to understand.

df_1['rank_avg'] = df_1['total'].rank(method = 'average')
df_1['rank_min'] = df_1['total'].rank(method = 'min')
df_1['rank_max'] = df_1['total'].rank(method = 'max')
df_1['rank_first'] = df_1['total'].rank(method = 'first')
df_1['rank_dense'] = df_1['total'].rank(method = 'dense')

df_1
name total group rank_avg rank_min rank_max rank_first rank_dense
0 Bob 1 a 1.00 1.00 1.00 1.00 1.00
1 Frank 2 b 2.00 2.00 2.00 2.00 2.00
2 Jim 3 a 3.50 3.00 4.00 3.00 3.00
3 Lisa 3 a 3.50 3.00 4.00 4.00 3.00
4 Steph 4 a 6.50 5.00 8.00 5.00 4.00
5 Ava 4 b 6.50 5.00 8.00 6.00 4.00
6 Aria 4 a 6.50 5.00 8.00 7.00 4.00
7 Mel 4 b 6.50 5.00 8.00 8.00 4.00
8 Biff 5 a 9.00 9.00 9.00 9.00 5.00
9 Sofia 6 b 10.00 10.00 10.00 10.00 6.00

We can also rank within groups. Because each individual value within each group is ranked, the rank() method produces a Series of the same length as the input DataFrame, with the same row index.

df_1.groupby('group')['total'].rank()
0   1.00
1   1.00
2   2.50
3   2.50
4   4.50
5   2.50
6   4.50
7   2.50
8   6.00
9   4.00
Name: total, dtype: float64

Let’s make a new column, named rnk_within_group, to hold the rank within each group (according to the default 'average' method), and then display the names, along with their total, their group, and their rank within their group.

df_1['rnk_within_group'] = df_1.groupby('group')['total'].rank()

(df_1
 .loc[:, ['name', 'total', 'group', 'rnk_within_group']]
 .sort_values(by = 'group')
)
name total group rnk_within_group
0 Bob 1 a 1.00
2 Jim 3 a 2.50
3 Lisa 3 a 2.50
4 Steph 4 a 4.50
6 Aria 4 a 4.50
8 Biff 5 a 6.00
1 Frank 2 b 1.00
5 Ava 4 b 2.50
7 Mel 4 b 2.50
9 Sofia 6 b 4.00

2.3 Using the nlargest() and nsmallest() methods

Instead of sorting and then using the head() method as we have been doing to show top performances, we can use the nlargest() method. It has a more compact syntax than sorting and then selecting a number of rows, and it has a keep parameter that specifies how to handle ties on the last value that makes the cut (the smallest of the n largest values). keep = 'all' keeps all rows tied on that value, even if that means returning more than n rows.

Let’s try to use nlargest() to find the top 5 scoring games by players on the Memphis Grizzlies (‘MEM’ in the team column) for the season. Show just the player, date, opponent, and point total.

(boxscores
 .loc[boxscores['team'] == 'MEM']
 .nlargest(n = 5, columns = 'pts', keep = 'all')
 .loc[:, ['player', 'date', 'opponent', 'pts']]
)
player date opponent pts
6374 D. Bane 20231206 DET 49
7360 J. Jackson Jr. 20231213 HOU 44
26321 GG Jackson 20240414 DEN 44
7045 J. Jackson Jr. 20231211 DAL 41
24018 J. Jackson Jr. 20240401 DET 40
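The nsmallest() method works the same way, but returns the n smallest values. As a sketch (output not shown), the following would find the 5 Memphis player-games with the fewest minutes played; because keep = 'all' is set, more than 5 rows could be returned if there are ties on the 5th value.

(boxscores
 .loc[boxscores['team'] == 'MEM']
 .nsmallest(n = 5, columns = 'min', keep = 'all')
 .loc[:, ['player', 'date', 'opponent', 'min']]
)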

2.4 Using a callable as a selector with loc[]

We have seen that loc[] may be used to select rows and columns of a DataFrame by label. It can also be used to select rows and columns using a callable. A ‘callable’ is a python object that can be called, such as a function. Both loc[] and iloc[] indexing can accept a callable as an indexer. The callable must be a function with one argument (the calling Series or DataFrame). One of the key uses of indexing by callable is that it lets us chain data selection operations without using a temporary variable.

For example, let’s take a look at average minutes, points, rebounds, assists, and steals for players who have played at least 50 games and average at least 25 points per game. Sort the result by average points descending. To apply the criteria (played at least 50 games and average at least 25 points per game) we need to first group by player and aggregate, because the criteria to be applied don’t exist in the original DataFrame; we need to calculate total games played and scoring average. In the example code below we first do the grouping and aggregating, assigning the result to a DataFrame named player_avgs, and then apply the criteria and sort to get our final result.

# Step 1 - Make new DataFrame with the aggregated fields
player_avgs = (boxscores
               .groupby('player')
               .agg(tot_games = ('date', 'nunique'),
                    avg_min = ('min', 'mean'),
                    avg_pts = ('pts', 'mean'),
                    avg_reb = ('reb', 'mean'),
                    avg_ast = ('ast', 'mean'),
                    avg_stl = ('stl', 'mean')
                   )
              )

# Step 2 - Apply the criteria to the new DataFrame
(player_avgs.loc[(player_avgs['tot_games'] >= 50) &
                 (player_avgs['avg_pts'] >= 25)]
 .sort_values(by = 'avg_pts', ascending = False)
)
tot_games avg_min avg_pts avg_reb avg_ast avg_stl
player
L. Doncic 70 37.49 33.86 9.24 9.80 1.41
G. Antetokounmpo 73 35.25 30.44 11.52 6.52 1.19
S. Gilgeous-Alexander 75 34.05 30.05 5.53 6.20 2.00
J. Brunson 77 35.44 28.73 3.61 6.74 0.91
K. Durant 75 37.20 27.09 6.60 5.04 0.92
D. Booker 68 35.96 27.07 4.53 6.94 0.90
J. Tatum 74 35.78 26.85 8.12 4.92 1.01
Donovan Mitchell 55 35.35 26.60 5.09 6.05 1.84
D. Fox 74 35.93 26.57 4.59 5.65 2.03
Steph Curry 74 32.70 26.43 4.46 5.12 0.73
N. Jokic 79 34.65 26.39 12.35 8.96 1.37
T. Maxey 70 37.49 25.94 3.69 6.19 0.99
A. Edwards 79 35.13 25.94 5.44 5.13 1.28
Trae Young 54 35.89 25.72 2.76 10.80 1.33
L. James 71 35.32 25.66 7.30 8.30 1.25
K. Irving 58 34.98 25.64 5.00 5.16 1.28

The two-step approach illustrated above has a potential disadvantage: it requires creating an intermediate DataFrame. This uses memory and creates unnecessary clutter if you don’t plan to use that intermediate DataFrame in further analysis.

If we don’t want to create the intermediate DataFrame, player_avgs, we can use method chaining. There is a challenge to the method chaining, though, because we need to refer to the calculated total games and scoring average in a step that applies the criteria (played at least 50 games and scoring average of 25 or higher). How do we refer to these values if they are not part of the boxscores DataFrame? We can use a type of callable called a “lambda function,” which we introduced in a previous section in this chapter, to apply the criteria within the .loc[]. As we learned earlier, a lambda function is a temporary, anonymous (unnamed) function defined in-place with the lambda keyword. A lambda function can evaluate and return only one expression. It has the following form:

lambda input: expression 

and it behaves like a function with the following definition:

def <lambda>(input):
    return expression
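The callable passed to loc[] does not have to be a lambda; any function that takes one argument works. As a minimal sketch using the player_avgs DataFrame created in the two-step approach above (the function name meets_criteria is hypothetical):

# A named function used as a callable indexer
def meets_criteria(df):
    return (df['tot_games'] >= 50) & (df['avg_pts'] >= 25)

# Equivalent to the Boolean-mask selection shown earlier
player_avgs.loc[meets_criteria]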

In the code below the lambda function is in the loc[] method, which is part of the method chain, so its argument is the result of the previous step. The result of the previous step is a DataFrame that has the total games, average minutes, average points, etc. for each player. This DataFrame becomes the input for the lambda function, and is temporarily named df, since df is what follows the lambda keyword. Now that the DataFrame coming into that step has a temporary name we can refer to its columns to state the criteria that create the Boolean array within the loc[] method.

(boxscores.groupby('player')
 .agg(tot_games = ('date', pd.Series.nunique),
      avg_min = ('min', 'mean'),
      avg_pts = ('pts', 'mean'),
      avg_reb = ('reb', 'mean'),
      avg_ast = ('ast', 'mean'),
      avg_stl = ('stl', 'mean')
     )
 .loc[lambda df: (df['tot_games'] >= 50) &
                 (df['avg_pts'] >= 25)]
 .sort_values(by = 'avg_pts', ascending = False)
)
tot_games avg_min avg_pts avg_reb avg_ast avg_stl
player
L. Doncic 70 37.49 33.86 9.24 9.80 1.41
G. Antetokounmpo 73 35.25 30.44 11.52 6.52 1.19
S. Gilgeous-Alexander 75 34.05 30.05 5.53 6.20 2.00
J. Brunson 77 35.44 28.73 3.61 6.74 0.91
K. Durant 75 37.20 27.09 6.60 5.04 0.92
D. Booker 68 35.96 27.07 4.53 6.94 0.90
J. Tatum 74 35.78 26.85 8.12 4.92 1.01
Donovan Mitchell 55 35.35 26.60 5.09 6.05 1.84
D. Fox 74 35.93 26.57 4.59 5.65 2.03
Steph Curry 74 32.70 26.43 4.46 5.12 0.73
N. Jokic 79 34.65 26.39 12.35 8.96 1.37
T. Maxey 70 37.49 25.94 3.69 6.19 0.99
A. Edwards 79 35.13 25.94 5.44 5.13 1.28
Trae Young 54 35.89 25.72 2.76 10.80 1.33
L. James 71 35.32 25.66 7.30 8.30 1.25
K. Irving 58 34.98 25.64 5.00 5.16 1.28

2.5 Using np.where() to create new columns

np.where() is a function very similar to the IF function in Excel, except that it is vectorized. The term vectorized means that it operates on each element of an array individually. So, np.where() can be applied to each element of a Series and returns an array of the same length. np.where() has as its input parameters a condition that produces a Boolean array, and then an x parameter and a y parameter. The function returns an array with x wherever the condition was True and y wherever it was False.

Let’s make a new column in the boxscores DataFrame that has the value 'yes' if the player is a center or power forward and 'no' otherwise. We will call the new column big_man, because that is the traditional term for players who play those positions. In the code below the expression boxscores['position'].isin(['C', 'PF']) will create a Boolean array (an array of True and False values) of the same length as the boxscores DataFrame. The np.where() function will create another array that has ‘yes’ where that Boolean array has True and ‘no’ where that Boolean array has False.

boxscores['big_man'] = np.where(boxscores['position'].isin(['C', 'PF']),
                                'yes', 'no')

Let’s take a look at the player name, team, and position for a random sample of 5 “big men”.

(boxscores
 .loc[boxscores['big_man'] == 'yes', ['player', 'team', 'position']]
 .sample(5)
)
player team position
4443 D. Reath POR C
25155 Z. Collins SA PF
2747 T. Prince LAL PF
3518 Jaden McDaniels MIN PF
13303 K. Durant PHX PF

2.6 The transform() method of a GroupBy object

The transform() method is used to calculate an aggregate statistic by group while keeping all the original rows (not collapsing to one row per group). This is useful when we want to combine group-level aggregates with the row-level data, for example to filter rows based on a statistic calculated for their group.

Let’s create a small DataFrame named survey_df to work with. survey_df has information on survey respondents, including their county, their gender, and the rating they gave in their survey.

survey_df = pd.DataFrame({
    'county': ['Bucks', 'Allegheny', 'Allegheny', 'Allegheny', 
               'Bucks', 'Allegheny', 'Allegheny', 'Bucks', 'Cambria'],
    'gender': ['M', 'NB', 'F', 'F', 'NB', 'M', 'M', 'F', 'NB'],
    'rating': [1, 2, 5, 9, 9, 8, 4, 3, 7]
}
)

survey_df
county gender rating
0 Bucks M 1
1 Allegheny NB 2
2 Allegheny F 5
3 Allegheny F 9
4 Bucks NB 9
5 Allegheny M 8
6 Allegheny M 4
7 Bucks F 3
8 Cambria NB 7

Imagine that we want to display average rating by gender, but only for counties with at least 5 responses. In order to count responses by county we need to group by county, but to calculate average rating by gender we need to group by gender. If we first group by county and count responses, gender is no longer available in the result, because a normal groupby operation that produces a count for each group collapses the result down to one row for each group, as does the code in the cell below.

survey_df.groupby('county')['rating'].count()
county
Allegheny    5
Bucks        3
Cambria      1
Name: rating, dtype: int64

We can use the result above to count responses by county, but we can’t go on from there to show the average rating for counties with 5 or more responses, because the ratings are no longer part of the DataFrame.

The transform() method of a GroupBy object can help us solve this problem. It creates an aggregate statistic according to the function named in its argument, but in its result it reports that aggregate statistic for every row in the input DataFrame. For example, in the code below we create a new column named county_responses that holds the results of counting rows (with non-null values) by group. Because we use the transform() method, the result has the same number of rows (and the same row index) as the input DataFrame. For each row, the count of responses for the county of that row is displayed. Each row where the county is Allegheny shows 5 county responses. Each row where the county is Bucks shows 3 county responses, etc. Now we can apply the criteria on responses per county, but we still have the gender and rating columns available to us for further analysis.

survey_df['county_responses'] = (survey_df
                                 .groupby('county')['county']
                                 .transform('count')
                                )
survey_df
county gender rating county_responses
0 Bucks M 1 3
1 Allegheny NB 2 5
2 Allegheny F 5 5
3 Allegheny F 9 5
4 Bucks NB 9 3
5 Allegheny M 8 5
6 Allegheny M 4 5
7 Bucks F 3 3
8 Cambria NB 7 1

Now we can accomplish our original goal of displaying average rating by gender, but only for counties with at least 5 responses.

(survey_df
 .loc[survey_df['county_responses'] >= 5]
 .groupby('gender')['rating']
 .mean()
)
gender
F    7.00
M    6.00
NB   2.00
Name: rating, dtype: float64

Above we created a new column to show how transform() works. We don’t need to create the new column. We can use the .transform result to create a Boolean array to use within loc[], as in the following example. First, let’s delete the county_responses column we created above.

survey_df.drop('county_responses', inplace = True, axis = 1)

survey_df
county gender rating
0 Bucks M 1
1 Allegheny NB 2
2 Allegheny F 5
3 Allegheny F 9
4 Bucks NB 9
5 Allegheny M 8
6 Allegheny M 4
7 Bucks F 3
8 Cambria NB 7

Next, we write the code to show average rating by gender for counties with 5 or more responses. We put the result column into a list so that the result is a DataFrame.

(survey_df
 .loc[survey_df.groupby('county')['rating'].transform('count') >= 5]
 .groupby('gender')[['rating']].mean()
)
rating
gender
F 7.00
M 6.00
NB 2.00

Let’s try another example that requires aggregation at more than one level. Let’s try to make a new column for the boxscores DataFrame named star. The column will have ‘yes’ if the player averages at least 25 points per game and ‘no’ if the player does not.

boxscores['star'] = np.where(boxscores
                             .groupby('player')['pts']
                             .transform('mean') >= 25, 'yes', 'no')

Now, let’s take a look at the player, team, and star columns from a random sample of 10 rows from the boxscores DataFrame and see if we get any stars in our sample.

boxscores.loc[:, ['player', 'team', 'star']].sample(10)
player team star
4650 D. Fox SAC yes
20635 S. Merrill CLE no
21403 D. Hunter ATL no
4415 Bogdan Bogdanovic ATL no
4841 N. Vucevic CHI no
12269 K. Towns MIN no
8665 P. Pritchard BOS no
23811 M. Lewis LAL no
15604 Tre Jones SA no
17336 J. Hayes LAL no