import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:,.2f}'.format
2 Grouping and Aggregating Data with Pandas
Grouping refers to splitting data into separate groups according to the values in a column or set of columns, which then makes it possible for us to calculate various aggregate statistics for each group. For example, if you have sales data that is divided into the regions North, South, East, and West you could group by region and then sum sales to calculate total sales per region. This is a very common operation in data analysis. In this chapter we will learn how to use the groupby()
method in pandas to group our data, and how to then calculate aggregate statistics on the groups.
This chapter also covers a few more useful topics. We will learn about the rank()
method to rank values in a Series and the nlargest()
and nsmallest()
methods to identify the n largest or n smallest values in a Series, respectively. We will also learn how to use a callable within loc[]
, which allows us to create longer chains of methods, and how to create a column based on if-then logic applied to another column or Series using the numpy where()
function. Finally, we will see how the transform()
method of a GroupBy object can calculate an aggregate by group, but produce the aggregate value for each row of the input DataFrame, according to the group of each row.
We will start by learning about grouping. First, let’s import pandas and numpy, set our pandas options, and import the retail store data and NBA boxscore data.
# Read the retail store data into a DataFrame
store = pd.read_csv("https://neuronjolt.com/data/superstore.csv")

# Read the NBA boxscores data into a DataFrame
url = "https://neuronjolt.com/data/nba_bs_2023-24_season_cleaned.csv"
boxscores = pd.read_csv(url)
2.1 Aggregating data by groups with .groupby()
Data analysts often need to divide a dataset into subsets, or groups, according to the values in one or more columns and then calculate some aggregate statistics on each of the groups. For example, we might want to take the sales data and calculate total sales per product ID. To do so we essentially divide the dataset into one group for each unique product ID and then sum the sales for each group. Another example would be if we wanted to calculate the total number of points allowed by each NBA team during the season. We would divide the data into groups according to the value in the opponent
column, and then sum the points for each group. These are both examples of grouping by one column and then calculating an aggregate statistic.
Sometimes we might want to group by a combination of columns. For example, instead of calculating the overall total sales per product ID we might want to calculate monthly total sales per product ID to see if we can identify any sales trends. To do so we would need to divide the data into groups, one group for each unique combination of month and product ID, and then calculate the total sales for each group.
A generic term that data analysts sometimes use for this type of data operation is “Split - Apply - Combine”. This is because the data is first split apart, then some function is applied to calculate a statistic or set of statistics for each group, and then the results of the group calculations are combined into a result.
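The pattern can be sketched in a few lines of code. The example below uses a tiny invented sales DataFrame (not the chapter's datasets) to perform the split, apply, and combine steps explicitly, then confirms that groupby() produces the same result in one expression:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South'],
    'sales': [100, 200, 150, 300, 50]
})

# Split: iterate over one sub-DataFrame per unique region
# Apply: sum the sales within each sub-DataFrame
# Combine: collect the per-group sums into a single Series
manual = pd.Series(
    {region: grp['sales'].sum() for region, grp in sales.groupby('region')}
)

# groupby() performs all three steps in one expression
auto = sales.groupby('region')['sales'].sum()

print(manual.equals(auto))  # True
```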
Pandas provides the groupby()
method, which can be used to divide a DataFrame into groups according to the values in a column or set of columns, called the group keys. The object that holds the DataFrame divided into groups according to the group keys is called a GroupBy object. After grouping we can apply various methods to a column or columns selected by using []
on the GroupBy object similar to how we use []
to indicate a column or columns in a DataFrame. In this way we can calculate aggregate statistics for the groups. The sections below show different ways of calculating aggregate statistics after a groupby()
operation in pandas.
2.1.1 Simple aggregation of one or more columns
This data operation involves grouping by one or more columns, applying an aggregating function to a selected column or columns in each of the groups, and then combining the group results for display. The output is a pandas Series if the aggregating function is applied to a single column and a DataFrame if the aggregating function is applied to a list of columns (even if there is only one column in the list). The groupby column or columns become the row index by default and the values are the results from applying the aggregating function(s) to the selected column(s) for each group. By default, the final results are presented in sorted order of the group keys (the unique values in the groupby columns). Some data analysts set the sort = False
parameter within the groupby()
to gain efficiency. When sort = False
is set, the group keys in the results will be presented in the order they appear in the DataFrame.
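The difference is easy to see on a small invented example: with the default sort = True the group keys come out in sorted order, while sort = False preserves the order in which each key first appears in the DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'c', 'a', 'b'],
                   'val': [1, 2, 3, 4, 5]})

# Default: group keys are sorted in the result
by_sorted_keys = df.groupby('key')['val'].sum()

# sort = False: group keys appear in order of first appearance
by_appearance = df.groupby('key', sort = False)['val'].sum()

print(list(by_sorted_keys.index))  # ['a', 'b', 'c']
print(list(by_appearance.index))   # ['b', 'a', 'c']
```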
Let’s begin by looking at the simplest case: grouping by one column, selecting one column to aggregate, and specifying the aggregating function. The code below calculates total rebounds by position. Note that since only one column is specified to aggregate the result is a Series.
boxscores.groupby('position')['reb'].sum()
position
C 24595
F 11843
G 7383
PF 19771
PG 11729
SF 15819
SG 15972
Name: reb, dtype: int64
We might want to sort this result in descending order of total rebounds. Because we are applying a sort operation after the aggregation it doesn’t make sense to sort the group keys, so we will set the sort = False
parameter here.
Since the result is a Series it only has one column and we can sort it without indicating the sort column.
(boxscores
    .groupby('position', sort = False)['reb']
    .sum()
    .sort_values(ascending = False)
)
position
C 24595
PF 19771
SG 15972
SF 15819
F 11843
PG 11729
G 7383
Name: reb, dtype: int64
If we specify a list of columns to aggregate, even if only one column is in the list, the output will be a DataFrame. In the code below notice the double brackets around 'reb'
. The outer square brackets are used to specify the column or columns within each group that you want to aggregate. The inner square brackets indicate that 'reb'
is a list. Notice that the output is now a DataFrame. Notice, also, that the unique values of the groupby column, position
, have now become the row index values. This is the default behavior for groupby()
.
boxscores.groupby('position')[['reb']].sum()
position | reb |
---|---|
C | 24595 |
F | 11843 |
G | 7383 |
PF | 19771 |
PG | 11729 |
SF | 15819 |
SG | 15972 |
If we don’t want the groupby column values to become the row index we can set the as_index
parameter to False
within the groupby()
method, as in the code below. Notice that when we set as_index = False
in the groupby()
method position
will remain a column in the output instead of becoming the row index.
boxscores.groupby('position', as_index = False)[['reb']].sum()
 | position | reb |
---|---|---|
0 | C | 24595 |
1 | F | 11843 |
2 | G | 7383 |
3 | PF | 19771 |
4 | PG | 11729 |
5 | SF | 15819 |
6 | SG | 15972 |
Since this result is a DataFrame the sort column must be indicated if we want to sort it.
(boxscores
    .groupby('position', as_index = False)[['reb']]
    .sum()
    .sort_values(by = 'reb', ascending = False)
)
 | position | reb |
---|---|---|
0 | C | 24595 |
3 | PF | 19771 |
6 | SG | 15972 |
5 | SF | 15819 |
1 | F | 11843 |
4 | PG | 11729 |
2 | G | 7383 |
We can have more than one level of grouping. To specify more than one level of grouping use a list of column names as the argument for the groupby()
method. Notice that when we group by two columns, team
and position
in the code below, the aggregates are calculated for each unique combination of team and position, and team
and position
become a multi-level index. We will cover multi-level indexes in more detail in a later chapter.
boxscores.groupby(['team', 'position'])[['reb']].sum()
team | position | reb |
---|---|---|
ATL | C | 776 |
 | F | 40 |
 | G | 155 |
 | PF | 569 |
 | PG | 169 |
... | ... | ... |
WSH | G | 111 |
 | PF | 400 |
 | PG | 179 |
 | SF | 1282 |
 | SG | 647 |
206 rows × 1 columns
We can also specify more than one column to aggregate after the groupby. The following code groups the sales data in the store
DataFrame by product category and region, and then calculates total sales and total profit for each group.
(store
    .groupby(['Category', 'Region'])
    [['Sales', 'Profit']]
    .sum()
)
Category | Region | Sales | Profit |
---|---|---|---|
Furniture | Central | 163,797.16 | -2,871.05 |
 | East | 208,291.20 | 3,046.17 |
 | South | 117,298.68 | 6,771.21 |
 | West | 252,612.74 | 11,504.95 |
Office Supplies | Central | 167,026.42 | 8,879.98 |
 | East | 205,516.05 | 41,014.58 |
 | South | 125,651.31 | 19,986.39 |
 | West | 220,853.25 | 52,609.85 |
Technology | Central | 170,416.31 | 33,697.43 |
 | East | 264,973.98 | 47,462.04 |
 | South | 148,771.91 | 19,991.83 |
 | West | 251,991.83 | 44,303.65 |
In the examples above we have applied the sum()
method. To see all the methods that can be applied to a GroupBy object consult the pandas documentation. Some of the methods most commonly used to calculate aggregate statistics on specified columns of a GroupBy object include the following:
- sum()
- count() - count of the group, excluding missing data
- describe() - generate descriptive statistics for each group
- first() - first value in the group
- idxmax() - index of first occurrence of maximum
- idxmin() - index of first occurrence of minimum
- max()
- min()
- mean()
- median()
- quantile() - can return values at a specified quantile for each group
- size() - size of the group, including missing data
- std() - standard deviation
- value_counts() - applies value_counts() within each group
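Two of these methods are easy to confuse: count() excludes missing data while size() includes it. A small sketch with invented data shows the difference:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'val': [1.0, np.nan, 2.0, 3.0]})

counts = df.groupby('grp')['val'].count()  # non-missing values per group
sizes = df.groupby('grp').size()           # all rows per group, missing included

print(counts['a'], sizes['a'])  # 1 2
```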
2.1.2 Aggregates on multiple columns with multiple functions
If we want to apply different aggregating functions to different columns we can use the agg()
method, and then specify the columns and aggregating functions in one of several ways:

- with a dictionary, which doesn’t give us control over the names of the resulting columns
- with named aggregation, which does give control over the names of the resulting columns and can be accomplished in two different ways: a concise format, or using NamedAgg
objects
Let’s look at some examples of each of these methods of applying different aggregating functions to different columns.
2.1.2.1 Using a python dict
The code below uses a python dict to map columns to the function or functions to apply to those columns. In the code below, the dict specifies that the mean and standard deviation should be calculated for both rebounds and assists by position. Note that the functions to be applied are indicated with strings. Note, also, that in the result the original column names become the first level of a multi-level index (called a MultiIndex in pandas), and the function names become the second level of the MultiIndex.
df = boxscores.groupby('position').agg({'reb': ['mean', 'std'],
                                        'ast': ['mean', 'std']})
df
position | reb mean | reb std | ast mean | ast std |
---|---|---|---|---|
C | 6.57 | 4.60 | 1.72 | 2.12 |
F | 3.46 | 3.10 | 1.31 | 1.62 |
G | 2.18 | 2.22 | 1.73 | 2.07 |
PF | 5.41 | 3.83 | 2.28 | 2.43 |
PG | 3.30 | 2.51 | 4.88 | 3.38 |
SF | 3.97 | 3.01 | 2.30 | 2.32 |
SG | 3.36 | 2.55 | 2.94 | 2.65 |
We will cover multiple-level indexes in more detail in a later chapter. For now, let’s take a look at how to work with them when selecting data. In the code above we saved the result to a DataFrame named df
. Let’s take a look at some examples of how to access the various columns.
First, let’s look at the column that represents the standard deviation of assists.
df.loc[:, [('ast', 'std')]]
position | ast std |
---|---|
C | 2.12 |
F | 1.62 |
G | 2.07 |
PF | 2.43 |
PG | 3.38 |
SF | 2.32 |
SG | 2.65 |
Next, let’s look at both stats for assists.
df.loc[:, 'ast']
position | mean | std |
---|---|---|
C | 1.72 | 2.12 |
F | 1.31 | 1.62 |
G | 1.73 | 2.07 |
PF | 2.28 | 2.43 |
PG | 4.88 | 3.38 |
SF | 2.30 | 2.32 |
SG | 2.94 | 2.65 |
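If the MultiIndex columns become awkward, one common option (an aside, not used elsewhere in this chapter) is to flatten them into single-level string names by joining the two levels of each column tuple:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'],
                   'x': [1, 2, 3],
                   'y': [4, 5, 6]})

agg = df.groupby('grp').agg({'x': ['mean', 'std'], 'y': ['mean', 'std']})

# Each column label is a tuple like ('x', 'mean'); join the levels with '_'
agg.columns = ['_'.join(col) for col in agg.columns]

print(list(agg.columns))  # ['x_mean', 'x_std', 'y_mean', 'y_std']
```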
To provide more control over the names of the result columns, and to avoid creating a MultiIndex, we can use a named aggregation. Below we will see two ways to do a named aggregation.
2.1.2.2 Named aggregation example 1 - concise format
Within the .agg()
method, the keywords are the output column names, and the values assigned to those keywords are tuples whose first element is the column to use and whose second element is the aggregation to apply to that column. Note here that built-in pandas functions are indicated as strings.
boxscores.groupby('position').agg(mean_reb = ('reb', 'mean'),
                                  std_reb = ('reb', 'std'),
                                  mean_ast = ('ast', 'mean'),
                                  st_ast = ('ast', 'std'))
position | mean_reb | std_reb | mean_ast | st_ast |
---|---|---|---|---|
C | 6.57 | 4.60 | 1.72 | 2.12 |
F | 3.46 | 3.10 | 1.31 | 1.62 |
G | 2.18 | 2.22 | 1.73 | 2.07 |
PF | 5.41 | 3.83 | 2.28 | 2.43 |
PG | 3.30 | 2.51 | 4.88 | 3.38 |
SF | 3.97 | 3.01 | 2.30 | 2.32 |
SG | 3.36 | 2.55 | 2.94 | 2.65 |
2.1.2.3 Named aggregation example 2 - using NamedAgg
objects
Another way to do a named aggregation is to use a NamedAgg
object, as in the example below. Within the agg()
method the keywords are the names for the result columns, but a pandas NamedAgg
object is created to indicate the column and the aggregating function to be applied to that column. This is similar to the example above, but more verbose.
(boxscores
    .groupby('team')
    .agg(Number_of_Players = pd.NamedAgg(column = 'player',
                                         aggfunc = 'nunique'),
         Total_Rebounds = pd.NamedAgg(column = 'reb',
                                      aggfunc = 'sum'))
)
team | Number_of_Players | Total_Rebounds |
---|---|---|
ATL | 19 | 3663 |
BKN | 21 | 3613 |
BOS | 19 | 3799 |
CHA | 26 | 3303 |
CHI | 18 | 3593 |
CLE | 19 | 3550 |
DAL | 22 | 3521 |
DEN | 17 | 3643 |
DET | 31 | 3553 |
GS | 18 | 3830 |
HOU | 19 | 3727 |
IND | 22 | 3405 |
LAC | 22 | 3523 |
LAL | 21 | 3535 |
MEM | 36 | 3494 |
MIA | 21 | 3468 |
MIL | 21 | 3622 |
MIN | 19 | 3577 |
NO | 21 | 3605 |
NY | 27 | 3704 |
OKC | 22 | 3447 |
ORL | 18 | 3469 |
PHI | 30 | 3522 |
PHX | 22 | 3614 |
POR | 22 | 3505 |
SA | 20 | 3627 |
SAC | 20 | 3607 |
TOR | 30 | 3498 |
UTAH | 23 | 3727 |
WSH | 27 | 3368 |
2.1.3 Using agg()
with a custom aggregation function
The agg()
method of a GroupBy object can also designate a function to use to do the aggregation. When used in this way its functionality overlaps with the basic groupby, but we can also supply a custom aggregation by defining it in place with a type of callable called a lambda function. A lambda function is a temporary, anonymous (unnamed) function defined in-place with the lambda
keyword. A lambda function can evaluate and return only one expression. It has the following form:
lambda input: expression
The lambda function is an unnamed function object that behaves like a function with the following definition:
def <lambda>(input):
return expression
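As a quick illustration of this equivalence, the named function and the lambda below behave identically (a trivial example, unrelated to the chapter's data):

```python
# A named function ...
def add_ten(x):
    return x + 10

# ... and an equivalent lambda, assigned to a name here only for comparison
add_ten_lambda = lambda x: x + 10

print(add_ten(5), add_ten_lambda(5))  # 15 15
```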
Let’s define a simple DataFrame that has some missing values to illustrate this use of agg()
. This DataFrame is called scores
and contains student IDs, sections, and their scores on two quizzes.
scores = pd.DataFrame({
    'student': 's1 s2 s3 s4 s5 s6 s7 s8 s9 s10'.split(),
    'section': 'A A B B A B B A A B'.split(),
    'q1': [99, 97, 75, None, 86, None, 79, 91, 90, 88],
    'q2': [None, 89, None, None, 88, 60, 85, 81, 80, None]
})

scores
 | student | section | q1 | q2 |
---|---|---|---|---|
0 | s1 | A | 99.00 | NaN |
1 | s2 | A | 97.00 | 89.00 |
2 | s3 | B | 75.00 | NaN |
3 | s4 | B | NaN | NaN |
4 | s5 | A | 86.00 | 88.00 |
5 | s6 | B | NaN | 60.00 |
6 | s7 | B | 79.00 | 85.00 |
7 | s8 | A | 91.00 | 81.00 |
8 | s9 | A | 90.00 | 80.00 |
9 | s10 | B | 88.00 | NaN |
Let’s use a basic groupby syntax to calculate the mean score on each quiz for each section. Recall that the mean()
method by default ignores missing values.
scores.groupby('section')[['q1', 'q2']].mean()
section | q1 | q2 |
---|---|---|
A | 92.60 | 84.50 |
B | 80.67 | 72.50 |
We could also use the agg()
method to calculate the mean for each quiz. This functionality overlaps with the functionality of the basic groupby.
scores.groupby('section')[['q1', 'q2']].agg('mean')
section | q1 | q2 |
---|---|---|
A | 92.60 | 84.50 |
B | 80.67 | 72.50 |
Within agg()
the function that is specified is applied to each of the input Series. In the code example above the mean()
function is applied to q1
in section A, q2
in section A, q1
in section B, and q2
in section B to produce the output.
With agg()
we can also specify a custom function to use to do the aggregation. Recall that we can calculate the proportion of a Series that is missing by applying isna()
to the Series to create an array of Boolean values, and then applying mean()
to the array of Boolean values to calculate the proportion of them that are True
. In the code below we use the lambda
keyword to define a custom function to operate on each of the input Series. This custom function applies isna()
to the Series and then chains that result to the mean()
method using the dot operator .
. The x
after the lambda
keyword is the name we give to the input Series so that we can refer to them in the expression that comes after the :
. The expression that comes after the :
is what the lambda function returns.
So, the code below will calculate the proportion of missing values for each quiz for each section!
scores.groupby('section')[['q1', 'q2']].agg(lambda x: x.isna().mean())
section | q1 | q2 |
---|---|---|
A | 0.00 | 0.20 |
B | 0.40 | 0.60 |
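A lambda can also be used inside the named-aggregation syntax we saw earlier, which lets us give the result columns descriptive names (the column names pct_missing_q1 and pct_missing_q2 below are our own choice). This sketch rebuilds the same scores DataFrame so it can run on its own:

```python
import pandas as pd

scores = pd.DataFrame({
    'student': 's1 s2 s3 s4 s5 s6 s7 s8 s9 s10'.split(),
    'section': 'A A B B A B B A A B'.split(),
    'q1': [99, 97, 75, None, 86, None, 79, 91, 90, 88],
    'q2': [None, 89, None, None, 88, 60, 85, 81, 80, None]
})

# Named aggregation with lambdas; the output column names are our own choice
result = scores.groupby('section').agg(
    pct_missing_q1 = ('q1', lambda x: x.isna().mean()),
    pct_missing_q2 = ('q2', lambda x: x.isna().mean())
)

print(result)
```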
2.2 Ranking values with rank()
It is common to want to rank data values, or to want to see the largest or smallest values. In this section we will see how to do various types of ranking with the rank()
method, and how to find a specified number of largest or smallest values with the nlargest()
and nsmallest()
methods.
The rank()
method returns the rank of a value, and provides several options for dealing with ties in ranking. Let’s use the rank()
method to investigate the top NBA players according to total points, rebounds, and assists over the season. First, we use groupby()
to calculate totals by player.
player_totals = (boxscores
    .groupby('player', as_index = False)
    .agg(total_points = ('pts', 'sum'),
         total_rebounds = ('reb', 'sum'),
         total_assists = ('ast', 'sum'))
)

player_totals.sample(6)
 | player | total_points | total_rebounds | total_assists |
---|---|---|---|---|
560 | Tre Mann | 382 | 149 | 165 |
64 | Bojan Bogdanovic | 868 | 152 | 97 |
386 | M. Bamba | 253 | 238 | 41 |
574 | X. Moon | 33 | 18 | 21 |
90 | C. Porter | 285 | 109 | 118 |
462 | P. Beverley | 451 | 239 | 212 |
Next, let’s add columns for the players’ ranks on each of those statistics, in descending order, so that the player with rank one is the player with the highest total. We will also add a column for the sum of the player’s ranks in each of those statistics.
player_totals['pts_rank'] = player_totals['total_points'].rank(ascending = False)
player_totals['reb_rank'] = player_totals['total_rebounds'].rank(ascending = False)
player_totals['ast_rank'] = player_totals['total_assists'].rank(ascending = False)
player_totals['rank_sum'] = (player_totals['pts_rank']
                             + player_totals['reb_rank']
                             + player_totals['ast_rank'])
So, the player with the best average rank across the three statistics is the player with the lowest number in the rank_sum
column. Let’s sort the player_totals
DataFrame by that column, in ascending order, and take a look at the top 5 players.
player_totals.sort_values(by = 'rank_sum').head(5)
 | player | total_points | total_rebounds | total_assists | pts_rank | reb_rank | ast_rank | rank_sum |
---|---|---|---|---|---|---|---|---|
434 | N. Jokic | 2085 | 976 | 708 | 5.00 | 3.00 | 2.00 | 10.00 |
369 | L. Doncic | 2370 | 647 | 686 | 1.00 | 18.00 | 3.00 | 22.00 |
179 | G. Antetokounmpo | 2222 | 841 | 476 | 3.00 | 5.00 | 16.00 | 24.00 |
145 | D. Sabonis | 1593 | 1120 | 673 | 26.00 | 1.00 | 4.00 | 31.00 |
372 | L. James | 1822 | 518 | 589 | 14.00 | 34.00 | 6.50 | 54.50 |
Notice that LeBron James is ranked 6.50 for assists. What is going on there? Let’s take a look at the top 8 players in total assists to investigate. We sort the player_totals
DataFrame by total assists, in descending order, and then use the head()
method to display the first 8 rows.
(player_totals
    .sort_values(by = 'total_assists', ascending = False)
    .loc[:, ['player', 'total_assists']]
    .head(8)
)
 | player | total_assists |
---|---|---|
529 | T. Haliburton | 752 |
434 | N. Jokic | 708 |
369 | L. Doncic | 686 |
145 | D. Sabonis | 673 |
241 | J. Harden | 614 |
176 | F. VanVleet | 589 |
372 | L. James | 589 |
558 | Trae Young | 583 |
Notice that there is a tie in the 6th and 7th positions. Fred VanVleet and LeBron James both totaled 589 assists on the season. This brings up an important issue to consider when ranking: how should we handle ties? Fortunately, the rank()
method has several ways of dealing with ties, specified by the argument to the method
parameter. Following are the arguments and what they mean:
- 'average' - average rank of all the ties in the group. So, for example, if there is a tie in the 6th and 7th positions, like VanVleet and James, they both get the average rank, 6.5
- 'min' - the lowest rank of all the ties in the group
- 'max' - the highest rank of all the ties in the group
- 'first' - ranks are assigned in the order they appear in the input array
- 'dense' - like 'min', but the rank always increases by 1 between groups
LeBron James and Fred VanVleet both have rank 6.5 for assists because by default rank()
uses the 'average'
method.
Let’s take a look at an example to see these different ranking methods in action. Below a DataFrame named df_1
is created, with several names. Each name has a total and is part of a group.
df_1 = pd.DataFrame({
    'name': 'Bob Frank Jim Lisa Steph Ava Aria Mel Biff Sofia'.split(),
    'total': [1, 2, 3, 3, 4, 4, 4, 4, 5, 6],
    'group': 'a b a a a b a b a b'.split()
})

df_1
 | name | total | group |
---|---|---|---|
0 | Bob | 1 | a |
1 | Frank | 2 | b |
2 | Jim | 3 | a |
3 | Lisa | 3 | a |
4 | Steph | 4 | a |
5 | Ava | 4 | b |
6 | Aria | 4 | a |
7 | Mel | 4 | b |
8 | Biff | 5 | a |
9 | Sofia | 6 | b |
In the code example below we create new columns to show the results of ranking according to the various methods. Take a look at the various ranks, compared to the values in the total
column. The data is presented in increasing order of the total
column to make the ranks easier to understand.
df_1['rank_avg'] = df_1['total'].rank(method = 'average')
df_1['rank_min'] = df_1['total'].rank(method = 'min')
df_1['rank_max'] = df_1['total'].rank(method = 'max')
df_1['rank_first'] = df_1['total'].rank(method = 'first')
df_1['rank_dense'] = df_1['total'].rank(method = 'dense')

df_1
 | name | total | group | rank_avg | rank_min | rank_max | rank_first | rank_dense |
---|---|---|---|---|---|---|---|---|
0 | Bob | 1 | a | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
1 | Frank | 2 | b | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 |
2 | Jim | 3 | a | 3.50 | 3.00 | 4.00 | 3.00 | 3.00 |
3 | Lisa | 3 | a | 3.50 | 3.00 | 4.00 | 4.00 | 3.00 |
4 | Steph | 4 | a | 6.50 | 5.00 | 8.00 | 5.00 | 4.00 |
5 | Ava | 4 | b | 6.50 | 5.00 | 8.00 | 6.00 | 4.00 |
6 | Aria | 4 | a | 6.50 | 5.00 | 8.00 | 7.00 | 4.00 |
7 | Mel | 4 | b | 6.50 | 5.00 | 8.00 | 8.00 | 4.00 |
8 | Biff | 5 | a | 9.00 | 9.00 | 9.00 | 9.00 | 5.00 |
9 | Sofia | 6 | b | 10.00 | 10.00 | 10.00 | 10.00 | 6.00 |
We can also rank within groups. Because each individual value within each group is ranked, the result of the rank()
method produces a Series of the same length as the input DataFrame, with the same row index.
df_1.groupby('group')['total'].rank()
0 1.00
1 1.00
2 2.50
3 2.50
4 4.50
5 2.50
6 4.50
7 2.50
8 6.00
9 4.00
Name: total, dtype: float64
Let’s make a new column named rnk_within_group
to hold the rank within each group (according to the default 'average'
method), and then display the names, along with their total, their group, and their rank within their group.
df_1['rnk_within_group'] = df_1.groupby('group')['total'].rank()

(df_1
    .loc[:, ['name', 'total', 'group', 'rnk_within_group']]
    .sort_values(by = 'group')
)
 | name | total | group | rnk_within_group |
---|---|---|---|---|
0 | Bob | 1 | a | 1.00 |
2 | Jim | 3 | a | 2.50 |
3 | Lisa | 3 | a | 2.50 |
4 | Steph | 4 | a | 4.50 |
6 | Aria | 4 | a | 4.50 |
8 | Biff | 5 | a | 6.00 |
1 | Frank | 2 | b | 1.00 |
5 | Ava | 4 | b | 2.50 |
7 | Mel | 4 | b | 2.50 |
9 | Sofia | 6 | b | 4.00 |
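One more option worth knowing: rank() accepts a pct = True parameter that expresses each rank as a fraction of the number of values, which is handy for percentile-style rankings. A minimal sketch on invented data:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

ranks = s.rank()                # 1.0, 2.0, 3.0, 4.0
pct_ranks = s.rank(pct = True)  # each rank divided by the count

print(list(pct_ranks))  # [0.25, 0.5, 0.75, 1.0]
```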
2.3 Using the nlargest()
and nsmallest()
methods
Instead of sorting and then using the head()
method as we have been doing to show top performances, we can use the .nlargest()
method. It has a more compact syntax than sorting and then selecting a number of rows, and it has a keep
parameter that allows you to specify how ties on the smallest value that qualifies for the top n are handled. keep = 'all'
will keep all the rows tied for that last position.
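The effect of keep = 'all' is easiest to see on a small invented Series with a tie at the cutoff position:

```python
import pandas as pd

s = pd.Series([10, 9, 9, 8, 7])

top2_default = s.nlargest(2)            # exactly 2 rows; the tie at 9 is cut
top2_all = s.nlargest(2, keep = 'all')  # all rows tied with the last kept value

print(len(top2_default), len(top2_all))  # 2 3
```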
Let’s try to use nlargest()
to find the top 5 scoring games by players on the Memphis Grizzlies (‘MEM’ in the team
column) for the season. Show just the player, date, opponent, and point total.
(boxscores
    .loc[boxscores['team'] == 'MEM']
    .nlargest(n = 5, columns = 'pts', keep = 'all')
    .loc[:, ['player', 'date', 'opponent', 'pts']]
)
 | player | date | opponent | pts |
---|---|---|---|---|
6374 | D. Bane | 20231206 | DET | 49 |
7360 | J. Jackson Jr. | 20231213 | HOU | 44 |
26321 | GG Jackson | 20240414 | DEN | 44 |
7045 | J. Jackson Jr. | 20231211 | DAL | 41 |
24018 | J. Jackson Jr. | 20240401 | DET | 40 |
2.4 Using a callable as a selector with loc[]
We have seen that loc[]
may be used to select rows and columns of a DataFrame by label. It can also be used to select rows and columns using a callable. A ‘callable’ is a Python object that can be called, such as a function. Both loc[]
and iloc[]
indexing can accept a callable as an indexer. The callable must be a function with one argument (the calling Series or DataFrame). One of the key uses of indexing by callable is that it lets us chain data selection operations without using a temporary variable.
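Here is the idea in miniature, on an invented DataFrame: the callable receives the calling DataFrame and returns a Boolean mask, so a filter can sit in the middle of a chain without a temporary variable:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})

# The callable receives the DataFrame itself and returns a Boolean mask
result = df.loc[lambda d: d['x'] > 2]

print(list(result['y']))  # [30, 40]
```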
For example, let’s take a look at average minutes, points, rebounds, assists, and steals for players who have played at least 50 games and average at least 25 points per game. Sort the result by average points descending. To apply the criteria (played at least 50 games and average at least 25 points per game) we need to first group by player and aggregate, because the criteria to be applied don’t exist in the original DataFrame; we need to calculate total games played and scoring average. In the example code below we first do the grouping and aggregating, assigning the result to a DataFrame named player_avgs
, and then apply the criteria and sort to get our final result.
# Step 1 - Make new DataFrame with the aggregated fields
player_avgs = (boxscores
    .groupby('player')
    .agg(tot_games = ('date', 'nunique'),
         avg_min = ('min', 'mean'),
         avg_pts = ('pts', 'mean'),
         avg_reb = ('reb', 'mean'),
         avg_ast = ('ast', 'mean'),
         avg_stl = ('stl', 'mean')
    )
)
# Step 2 - Apply the criteria to the new DataFrame
(player_avgs
    .loc[(player_avgs['tot_games'] >= 50) &
         (player_avgs['avg_pts'] >= 25)]
    .sort_values(by = 'avg_pts', ascending = False)
)
player | tot_games | avg_min | avg_pts | avg_reb | avg_ast | avg_stl |
---|---|---|---|---|---|---|
L. Doncic | 70 | 37.49 | 33.86 | 9.24 | 9.80 | 1.41 |
G. Antetokounmpo | 73 | 35.25 | 30.44 | 11.52 | 6.52 | 1.19 |
S. Gilgeous-Alexander | 75 | 34.05 | 30.05 | 5.53 | 6.20 | 2.00 |
J. Brunson | 77 | 35.44 | 28.73 | 3.61 | 6.74 | 0.91 |
K. Durant | 75 | 37.20 | 27.09 | 6.60 | 5.04 | 0.92 |
D. Booker | 68 | 35.96 | 27.07 | 4.53 | 6.94 | 0.90 |
J. Tatum | 74 | 35.78 | 26.85 | 8.12 | 4.92 | 1.01 |
Donovan Mitchell | 55 | 35.35 | 26.60 | 5.09 | 6.05 | 1.84 |
D. Fox | 74 | 35.93 | 26.57 | 4.59 | 5.65 | 2.03 |
Steph Curry | 74 | 32.70 | 26.43 | 4.46 | 5.12 | 0.73 |
N. Jokic | 79 | 34.65 | 26.39 | 12.35 | 8.96 | 1.37 |
T. Maxey | 70 | 37.49 | 25.94 | 3.69 | 6.19 | 0.99 |
A. Edwards | 79 | 35.13 | 25.94 | 5.44 | 5.13 | 1.28 |
Trae Young | 54 | 35.89 | 25.72 | 2.76 | 10.80 | 1.33 |
L. James | 71 | 35.32 | 25.66 | 7.30 | 8.30 | 1.25 |
K. Irving | 58 | 34.98 | 25.64 | 5.00 | 5.16 | 1.28 |
The two-step approach illustrated above has a potential disadvantage: it requires creating an intermediate DataFrame. This uses memory and creates unnecessary clutter if you don’t plan to use that intermediate DataFrame in further analysis.
If we don’t want to create the intermediate DataFrame, player_avgs
, we can use method chaining. There is a challenge to the method chaining, though, because we need to refer to the calculated total games and scoring average in a step that applies the criteria (played at least 50 games and scoring average of 25 or higher). How do we refer to these values if they are not part of the boxscores
DataFrame? We can use a type of callable called a “lambda function,” which we introduced in a previous section in this chapter, to apply the criteria within the .loc[]
. As we learned earlier, a lambda function is a temporary, anonymous (unnamed) function defined in-place with the lambda
keyword. A lambda function can evaluate and return only one expression. It has the following form:
lambda input: expression
and it behaves like a function with the following definition:
def <lambda>(input):
return expression
In the code below the lambda function is in the loc[]
method, which is part of the method chain, so its argument is the result of the previous step. The result of the previous step is a DataFrame that has the total games, average minutes, average points, etc. for each player. This DataFrame becomes the input for the lambda function, and is temporarily named df
, since df
is what follows the lambda
keyword. Now that the DataFrame coming into that step has a temporary name we can refer to its columns to state the criteria that create the Boolean array within the loc[]
method.
(boxscores
    .groupby('player')
    .agg(tot_games = ('date', pd.Series.nunique),
         avg_min = ('min', 'mean'),
         avg_pts = ('pts', 'mean'),
         avg_reb = ('reb', 'mean'),
         avg_ast = ('ast', 'mean'),
         avg_stl = ('stl', 'mean')
    )
    .loc[lambda df: (df['tot_games'] >= 50) &
                    (df['avg_pts'] >= 25)]
    .sort_values(by = 'avg_pts', ascending = False)
)
player | tot_games | avg_min | avg_pts | avg_reb | avg_ast | avg_stl |
---|---|---|---|---|---|---|
L. Doncic | 70 | 37.49 | 33.86 | 9.24 | 9.80 | 1.41 |
G. Antetokounmpo | 73 | 35.25 | 30.44 | 11.52 | 6.52 | 1.19 |
S. Gilgeous-Alexander | 75 | 34.05 | 30.05 | 5.53 | 6.20 | 2.00 |
J. Brunson | 77 | 35.44 | 28.73 | 3.61 | 6.74 | 0.91 |
K. Durant | 75 | 37.20 | 27.09 | 6.60 | 5.04 | 0.92 |
D. Booker | 68 | 35.96 | 27.07 | 4.53 | 6.94 | 0.90 |
J. Tatum | 74 | 35.78 | 26.85 | 8.12 | 4.92 | 1.01 |
Donovan Mitchell | 55 | 35.35 | 26.60 | 5.09 | 6.05 | 1.84 |
D. Fox | 74 | 35.93 | 26.57 | 4.59 | 5.65 | 2.03 |
Steph Curry | 74 | 32.70 | 26.43 | 4.46 | 5.12 | 0.73 |
N. Jokic | 79 | 34.65 | 26.39 | 12.35 | 8.96 | 1.37 |
T. Maxey | 70 | 37.49 | 25.94 | 3.69 | 6.19 | 0.99 |
A. Edwards | 79 | 35.13 | 25.94 | 5.44 | 5.13 | 1.28 |
Trae Young | 54 | 35.89 | 25.72 | 2.76 | 10.80 | 1.33 |
L. James | 71 | 35.32 | 25.66 | 7.30 | 8.30 | 1.25 |
K. Irving | 58 | 34.98 | 25.64 | 5.00 | 5.16 | 1.28 |
2.5 Using np.where()
to create new columns
np.where
is a function very similar to the IF function in Excel, except that it is vectorized. The term vectorized means that it applies to each element in an array individually. So, np.where()
can be applied to each element of a Series, returning an array of the same length. np.where()
has as its input parameters a condition that produces a Boolean array, and then an x
parameter and a y
parameter. The function returns an array with x
wherever the condition was True and y
wherever the condition was False.
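A minimal sketch of np.where() on an invented Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 8, 5, 10])

# 'high' wherever the condition is True, 'low' wherever it is False
labels = np.where(s > 5, 'high', 'low')

print(list(labels))  # ['low', 'high', 'low', 'high']
```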
Let’s make a new column in the boxscores
DataFrame that has the value 'yes'
if the player is a center or power forward and 'no' otherwise. We will call the new column big_man
, because that is the traditional term for players who play those positions. In the code below the expression boxscores['position'].isin(['C', 'PF'])
will create a Boolean array (an array of True
and False
values) of the same length as the boxscores
DataFrame. The np.where()
method will create another array that has ‘yes’ where that Boolean array has True
and ‘no’ where that Boolean array has False
.
boxscores['big_man'] = np.where(boxscores['position'].isin(['C', 'PF']),
                                'yes', 'no')
Let’s take a look at the player name, team, and position for a random sample of 5 “big men”.
(boxscores
 .loc[boxscores['big_man'] == 'yes', ['player', 'team', 'position']]
 .sample(5)
)
player | team | position | |
---|---|---|---|
4443 | D. Reath | POR | C |
25155 | Z. Collins | SA | PF |
2747 | T. Prince | LAL | PF |
3518 | Jaden McDaniels | MIN | PF |
13303 | K. Durant | PHX | PF |
2.6 The transform() method of a GroupBy object
The transform() method is used to calculate an aggregate statistic by group while keeping all the original rows, rather than collapsing to one row per group. This is useful when we want to use aggregates calculated on different groupings at the same time.
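The key difference is the shape of the result. A quick sketch on a tiny hypothetical DataFrame contrasts an ordinary aggregation with transform():

```python
import pandas as pd

# Tiny illustrative DataFrame (hypothetical data)
df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1, 3, 10]})

# A regular aggregation collapses to one row per group
collapsed = df.groupby('grp')['val'].mean()
print(collapsed.tolist())   # → [2.0, 10.0]  (one value per group)

# transform() broadcasts each group's aggregate back to every input row
broadcast = df.groupby('grp')['val'].transform('mean')
print(broadcast.tolist())   # → [2.0, 2.0, 10.0]  (one value per original row)
```

Because the transform() result has the same length and index as the input DataFrame, it can be assigned as a new column or used directly as a filter.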
Let’s create a small DataFrame named survey_df
to work with. survey_df
has information on survey respondents, including their county, their gender, and the rating they gave in their survey.
survey_df = pd.DataFrame({
    'county': ['Bucks', 'Allegheny', 'Allegheny', 'Allegheny',
               'Bucks', 'Allegheny', 'Allegheny', 'Bucks', 'Cambria'],
    'gender': ['M', 'NB', 'F', 'F', 'NB', 'M', 'M', 'F', 'NB'],
    'rating': [1, 2, 5, 9, 9, 8, 4, 3, 7]
})

survey_df
county | gender | rating | |
---|---|---|---|
0 | Bucks | M | 1 |
1 | Allegheny | NB | 2 |
2 | Allegheny | F | 5 |
3 | Allegheny | F | 9 |
4 | Bucks | NB | 9 |
5 | Allegheny | M | 8 |
6 | Allegheny | M | 4 |
7 | Bucks | F | 3 |
8 | Cambria | NB | 7 |
Imagine that we want to display average rating by gender, but only for counties with at least 5 responses. In order to count responses by county we need to group by county, but to calculate average rating by gender we need to group by gender. If we first group by county and count responses, gender is no longer available in the result, because a normal groupby operation that produces a count for each group collapses the result down to one row for each group, as does the code in the cell below.
survey_df.groupby('county')['rating'].count()
county
Allegheny 5
Bucks 3
Cambria 1
Name: rating, dtype: int64
We can use the result above to count responses by county, but we can’t go on from there to show the average rating for counties with 5 or more responses, because the ratings are no longer part of the DataFrame.
The transform()
method of a GroupBy object can help us solve this problem. It creates an aggregate statistic according to the function named in its argument, but in its result it reports that aggregate statistic for every row in the input DataFrame. For example, in the code below we create a new column named county_responses
that holds the results of counting rows (with non-null values) by group. Because we use the transform()
method, the result has the same number of rows (and the same row index) as the input DataFrame. For each row, the count of responses for the county of that row is displayed. Each row where the county is Allegheny shows 5 county responses. Each row where the county is Bucks shows 3 county responses, etc. Now we can apply the criteria on responses per county, but we still have the gender
and rating
columns available to us for further analysis.
survey_df['county_responses'] = (survey_df
                                 .groupby('county')['county']
                                 .transform('count')
                                )

survey_df
county | gender | rating | county_responses | |
---|---|---|---|---|
0 | Bucks | M | 1 | 3 |
1 | Allegheny | NB | 2 | 5 |
2 | Allegheny | F | 5 | 5 |
3 | Allegheny | F | 9 | 5 |
4 | Bucks | NB | 9 | 3 |
5 | Allegheny | M | 8 | 5 |
6 | Allegheny | M | 4 | 5 |
7 | Bucks | F | 3 | 3 |
8 | Cambria | NB | 7 | 1 |
Now we can accomplish our original goal of displaying average rating by gender, but only for counties with at least 5 responses.
(survey_df
 .loc[survey_df['county_responses'] >= 5]
 .groupby('gender')['rating']
 .mean()
)
gender
F 7.00
M 6.00
NB 2.00
Name: rating, dtype: float64
Above we created a new column to show how transform() works, but creating the column isn’t necessary. We can use the transform() result directly to create a Boolean array within loc[], as in the following example. First, let’s delete the county_responses column we created above.
survey_df.drop('county_responses', inplace = True, axis = 1)

survey_df
county | gender | rating | |
---|---|---|---|
0 | Bucks | M | 1 |
1 | Allegheny | NB | 2 |
2 | Allegheny | F | 5 |
3 | Allegheny | F | 9 |
4 | Bucks | NB | 9 |
5 | Allegheny | M | 8 |
6 | Allegheny | M | 4 |
7 | Bucks | F | 3 |
8 | Cambria | NB | 7 |
Next, we write the code to show average rating by gender for counties with 5 or more responses. We put the result column into a list so that the result is a DataFrame.
(survey_df
 .loc[survey_df.groupby('county')['rating'].transform('count') >= 5]
 .groupby('gender')[['rating']]
 .mean()
)
rating | |
---|---|
gender | |
F | 7.00 |
M | 6.00 |
NB | 2.00 |
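Beyond filtering, another common use of transform(), not covered in this chapter, is comparing each row to its group’s aggregate. As a sketch, we could add a hypothetical vs_county_avg column showing how far each respondent’s rating sits from their county’s average:

```python
import pandas as pd

# Recreate the chapter's survey_df
survey_df = pd.DataFrame({
    'county': ['Bucks', 'Allegheny', 'Allegheny', 'Allegheny',
               'Bucks', 'Allegheny', 'Allegheny', 'Bucks', 'Cambria'],
    'gender': ['M', 'NB', 'F', 'F', 'NB', 'M', 'M', 'F', 'NB'],
    'rating': [1, 2, 5, 9, 9, 8, 4, 3, 7]
})

# Each respondent's rating minus their county's average rating;
# transform('mean') lines the county average up with every row
survey_df['vs_county_avg'] = (survey_df['rating']
                              - survey_df.groupby('county')['rating'].transform('mean'))

print(survey_df[['county', 'rating', 'vs_county_avg']])
```

This works because transform() returns a Series aligned with the original index, so the subtraction happens row by row.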
Let’s try another example that requires aggregation at more than one level. Let’s try to make a new column for the boxscores
DataFrame named star
. The column will have ‘yes’ if the player averages at least 25 points per game and ‘no’ if the player does not.
boxscores['star'] = np.where(boxscores
                             .groupby('player')['pts']
                             .transform('mean') >= 25,
                             'yes', 'no')
Now, let’s take a look at the player
, team
, and star
columns from a random sample of 10 rows from the boxscores
DataFrame and see if we get any stars in our sample.
boxscores.loc[:, ['player', 'team', 'star']].sample(10)
player | team | star | |
---|---|---|---|
4650 | D. Fox | SAC | yes |
20635 | S. Merrill | CLE | no |
21403 | D. Hunter | ATL | no |
4415 | Bogdan Bogdanovic | ATL | no |
4841 | N. Vucevic | CHI | no |
12269 | K. Towns | MIN | no |
8665 | P. Pritchard | BOS | no |
23811 | M. Lewis | LAL | no |
15604 | Tre Jones | SA | no |
17336 | J. Hayes | LAL | no |