3 Concatenating, Merging and Reshaping Data
In the first two chapters we have worked with data in a single DataFrame or Series. We have seen how to investigate various properties of the data, how to sort the data, and how to filter or subset the data by position, label, or criteria. In this chapter we will learn how to put data from two or more different DataFrames together into one DataFrame by concatenating them along an axis or merging on a common key or set of keys. We will also learn how to reshape the data in a DataFrame in various ways. When we concatenate, merge, or reshape data we will have to pay attention to index values, and sometimes we will need to manipulate the indexes in some way. So, in this chapter we will also learn a bit more about pandas indexes, particularly indexes that have multiple levels.
Let’s start by importing pandas and numpy.
import pandas as pd
import numpy as np
Next we will read the boxscores data into a DataFrame.
baseurl = "https://neuronjolt.com/data/"
boxscores = pd.read_csv(baseurl + 'nba_bs_2023-24_season_cleaned.csv')
3.1 Concatenating DataFrames with concat()
Pandas has a concat() function that can be used to concatenate multiple Series or DataFrame objects along an axis. concat() can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the axis of concatenation.
Below we create three simple DataFrames and use them to illustrate how concat() works. These DataFrames include store numbers and counts of units sold. There is a DataFrame for the East region for January, one for the West region for January, and one for the East region for February.
east_jan = pd.DataFrame({
'store_no': [1, 2, 3, 4, 5],
'units_sold': [12, 15, 9, 17, 21]
})
west_jan = pd.DataFrame({
'store_no': [6, 7, 8, 9],
'units_sold': [31, 15, 13, 19]
})
east_feb = pd.DataFrame({
'store_no': [1, 2, 3, 4, 5],
'units_sold': [11, 17, 10, 14, 23]
})
east_jan
| store_no | units_sold | |
|---|---|---|
| 0 | 1 | 12 |
| 1 | 2 | 15 |
| 2 | 3 | 9 |
| 3 | 4 | 17 |
| 4 | 5 | 21 |
west_jan
| store_no | units_sold | |
|---|---|---|
| 0 | 6 | 31 |
| 1 | 7 | 15 |
| 2 | 8 | 13 |
| 3 | 9 | 19 |
east_feb
| store_no | units_sold | |
|---|---|---|
| 0 | 1 | 11 |
| 1 | 2 | 17 |
| 2 | 3 | 10 |
| 3 | 4 | 14 |
| 4 | 5 | 23 |
Below is the function signature for the concat() function from the pandas documentation:
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False,
keys=None, levels=None, names=None,
verify_integrity=False, sort=False, copy=None)
Let’s concatenate the January data from the East and West regions into one DataFrame. Notice in the east_jan and west_jan DataFrames that the stores in the East region are numbered from 1 to 5 and the stores in the West region are numbered from 6 to 9. In the concatenated DataFrame the store number can thus provide us with information about which region each row represents. The indexes from the input DataFrames, which have no meaning for this set of data, are simply concatenated, leading to duplicate index values in the concatenated DataFrame.
pd.concat([east_jan, west_jan])
| store_no | units_sold | |
|---|---|---|
| 0 | 1 | 12 |
| 1 | 2 | 15 |
| 2 | 3 | 9 |
| 3 | 4 | 17 |
| 4 | 5 | 21 |
| 0 | 6 | 31 |
| 1 | 7 | 15 |
| 2 | 8 | 13 |
| 3 | 9 | 19 |
To avoid duplicate index values we can set the parameter ignore_index = True. This results in a new index being created for the concatenated DataFrame.
pd.concat([east_jan, west_jan],
ignore_index = True)
| store_no | units_sold | |
|---|---|---|
| 0 | 1 | 12 |
| 1 | 2 | 15 |
| 2 | 3 | 9 |
| 3 | 4 | 17 |
| 4 | 5 | 21 |
| 5 | 6 | 31 |
| 6 | 7 | 15 |
| 7 | 8 | 13 |
| 8 | 9 | 19 |
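The signature above also shows a verify_integrity parameter, which is not used again in this chapter. As a minimal sketch, setting it to True asks concat() to raise an error instead of silently producing duplicate labels on the concatenation axis:
# Sketch: verify_integrity=True raises an error when the result would have duplicate index labels
try:
    pd.concat([east_jan, west_jan], verify_integrity = True)
except ValueError as err:
    print(err)   # reports the overlapping index values 0 through 3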
We can use the keys parameter to add an index level to the concatenated DataFrame with information that we specify. In the code below we add index keys ‘east’ for the data from the first input DataFrame, which represents the East region values, and ‘west’ for the data from the second input DataFrame, which represents the West region values.
pd.concat([east_jan, west_jan],
keys = ('east', 'west'))
| store_no | units_sold | ||
|---|---|---|---|
| east | 0 | 1 | 12 |
| 1 | 2 | 15 | |
| 2 | 3 | 9 | |
| 3 | 4 | 17 | |
| 4 | 5 | 21 | |
| west | 0 | 6 | 31 |
| 1 | 7 | 15 | |
| 2 | 8 | 13 | |
| 3 | 9 | 19 |
Below is another example of adding an index level after concatenation. In this case we are concatenating the two DataFrames for the East region. The first is the data from January and the second is the data from February. In the concatenated DataFrame we add index keys to preserve the information about what month each row of data represents.
east_combined = pd.concat([east_jan, east_feb],
keys = ('jan', 'feb'))
east_combined
| store_no | units_sold | ||
|---|---|---|---|
| jan | 0 | 1 | 12 |
| 1 | 2 | 15 | |
| 2 | 3 | 9 | |
| 3 | 4 | 17 | |
| 4 | 5 | 21 | |
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
We can use the names parameter to specify names for the levels of the row MultiIndex created for the concatenated DataFrame. For example, the east_combined DataFrame has a 2-level row index. The first level of the index, which we added with the keys parameter when we did the concatenation, represents the month. The second level of the index has the original index values. Below we use the names parameter during the concatenation to give these index levels meaningful names.
east_combined = pd.concat([east_jan, east_feb],
keys = ('jan', 'feb'),
names = ['month', 'original_idx'])
east_combined
| store_no | units_sold | ||
|---|---|---|---|
| month | original_idx | ||
| jan | 0 | 1 | 12 |
| 1 | 2 | 15 | |
| 2 | 3 | 9 | |
| 3 | 4 | 17 | |
| 4 | 5 | 21 | |
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
The values that represent the month would probably make more sense as a column. We can move an index level or levels into columns with the reset_index() method. In the code example below both index levels are moved into new columns and a new row index is created. Since the row index levels that became the new columns had names, month and original_idx, those names are used for the column names. If the row index levels didn’t have names the new columns would have been named level_0 and level_1.
east_combined.reset_index()
| month | original_idx | store_no | units_sold | |
|---|---|---|---|---|
| 0 | jan | 0 | 1 | 12 |
| 1 | jan | 1 | 2 | 15 |
| 2 | jan | 2 | 3 | 9 |
| 3 | jan | 3 | 4 | 17 |
| 4 | jan | 4 | 5 | 21 |
| 5 | feb | 0 | 1 | 11 |
| 6 | feb | 1 | 2 | 17 |
| 7 | feb | 2 | 3 | 10 |
| 8 | feb | 3 | 4 | 14 |
| 9 | feb | 4 | 5 | 23 |
If the levels of the row MultiIndex were not named we can always rename the columns that are created when we reset the index. For example, the code below concatenates the DataFrames for January and February for the East region, then resets the index, which moves the two row index levels to columns with names level_0 and level_1. Then, it renames the column created from the index level with the months, and drops the column with the original index values from before the concatenation, which we no longer need. We add parentheses at the beginning and end of the code so that we can break it into separate lines.
(pd.concat([east_jan, east_feb],
keys = ('jan', 'feb'))
.reset_index()
.rename(columns = {'level_0': 'month'})
.drop(labels = 'level_1', axis = 1)
)
| month | store_no | units_sold | |
|---|---|---|---|
| 0 | jan | 1 | 12 |
| 1 | jan | 2 | 15 |
| 2 | jan | 3 | 9 |
| 3 | jan | 4 | 17 |
| 4 | jan | 5 | 21 |
| 5 | feb | 1 | 11 |
| 6 | feb | 2 | 17 |
| 7 | feb | 3 | 10 |
| 8 | feb | 4 | 14 |
| 9 | feb | 5 | 23 |
We can also concatenate Series objects. Below we concatenate the units sold from both the East and West regions for January. We use the keys parameter to add a level to the index to indicate the region to which each value applies.
pd.concat([east_jan['units_sold'], west_jan['units_sold']],
keys = ('east', 'west'))
east 0 12
1 15
2 9
3 17
4 21
west 0 31
1 15
2 13
3 19
Name: units_sold, dtype: int64
The concat() function can also be used to combine DataFrames along the columns, but that same functionality can be accomplished with the merge() function, which is more powerful and provides more flexibility. Later in this chapter we will learn more about the merge() function and the .join() method, which provides a subset of the functionality of the merge() function.
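As a minimal sketch of that column-wise behavior (not part of the running example), passing axis = 1 lines the rows up by index label, and keys then labels the outer level of a column MultiIndex rather than the row index:
# Sketch: concatenate along the columns; rows are aligned on their index labels
pd.concat([east_jan, east_feb],
          axis = 1,
          keys = ('jan', 'feb'))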
3.2 Working with pandas MultiIndex objects
Pandas uses a type called a MultiIndex to implement indexes with multiple levels. We are starting to see more of them, so we should take some time here to go over some techniques that we can use to work with Indexes and MultiIndexes.
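As a quick, illustrative sketch (the values here are made up), a MultiIndex can also be constructed directly, for example from a list of tuples:
mi = pd.MultiIndex.from_tuples([('jan', 0), ('jan', 1), ('feb', 0), ('feb', 1)],
                               names = ['month', 'original_idx'])
pd.Series([12, 15, 11, 17], index = mi)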
3.2.1 Labels in a MultiIndex
To refer to a specific row or column when that axis has a MultiIndex we use a tuple for the label, with each position of the tuple corresponding to a level of the index. Let’s display the row index of the east_combined DataFrame as an example.
east_combined.index
MultiIndex([('jan', 0),
('jan', 1),
('jan', 2),
('jan', 3),
('jan', 4),
('feb', 0),
('feb', 1),
('feb', 2),
('feb', 3),
('feb', 4)],
names=['month', 'original_idx'])
If we want to look at the row for February with original index 2 in the east_combined DataFrame we would use the tuple ('feb', 2) to indicate that specific row, as in the code below.
east_combined.loc[('feb', 2)]
store_no 3
units_sold 10
Name: (feb, 2), dtype: int64
To see more than one specific row we could use a list of tuples to indicate the labels.
east_combined.loc[[('jan', 1), ('feb', 2)]]
| store_no | units_sold | ||
|---|---|---|---|
| month | original_idx | ||
| jan | 1 | 2 | 15 |
| feb | 2 | 3 | 10 |
3.2.2 Providing Names for Index levels with set_names()
We can name or rename index levels with the set_names() index method. For row index names use index.set_names() and for column index names use columns.set_names(). Let’s provide names for the index levels in the east_combined DataFrame.
east_combined.index.set_names(('month','original_index'),
inplace = True)
east_combined
| store_no | units_sold | ||
|---|---|---|---|
| month | original_index | ||
| jan | 0 | 1 | 12 |
| 1 | 2 | 15 | |
| 2 | 3 | 9 | |
| 3 | 4 | 17 | |
| 4 | 5 | 21 | |
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
In the example above we provided names for all levels of the row index. To set a name for just one of the levels we can use set_names() with the level parameter set to the desired level of the MultiIndex.
east_combined.index.set_names('source_idx',
level = 1,
inplace = True)
east_combined
| store_no | units_sold | ||
|---|---|---|---|
| month | source_idx | ||
| jan | 0 | 1 | 12 |
| 1 | 2 | 15 | |
| 2 | 3 | 9 | |
| 3 | 4 | 17 | |
| 4 | 5 | 21 | |
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
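The same approach works for the column index via columns.set_names(). As a minimal sketch (the name measure is just an illustrative choice; east_combined has a single, unnamed column index level), note that set_names() returns a new Index object rather than changing the DataFrame unless we assign the result back or use inplace:
# Sketch: name the single level of the column index; returns a new Index object
east_combined.columns.set_names('measure')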
3.2.3 Resetting Indexes with reset_index()
The reset_index() method resets the row index and, by default, moves the row index levels into new columns. If drop = True is set, new columns are not created from the index levels. With a MultiIndex the level parameter may be used to indicate the level or levels of the index to reset. If only a subset of the index levels is reset, the remaining level(s) become the new row index.
Let’s look at east_combined with the row MultiIndex reset. Note how both levels of the index (month and source_idx) are moved to new columns, and a new consecutive-integer row index is created.
east_combined.reset_index()
| month | source_idx | store_no | units_sold | |
|---|---|---|---|---|
| 0 | jan | 0 | 1 | 12 |
| 1 | jan | 1 | 2 | 15 |
| 2 | jan | 2 | 3 | 9 |
| 3 | jan | 3 | 4 | 17 |
| 4 | jan | 4 | 5 | 21 |
| 5 | feb | 0 | 1 | 11 |
| 6 | feb | 1 | 2 | 17 |
| 7 | feb | 2 | 3 | 10 |
| 8 | feb | 3 | 4 | 14 |
| 9 | feb | 4 | 5 | 23 |
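If we do not want to keep the index levels at all, setting drop = True discards them instead of turning them into columns. A minimal sketch (we do not use this form elsewhere in the chapter):
# Sketch: discard both row index levels rather than moving them into columns
east_combined.reset_index(drop = True).head()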
To reset just one part of the MultiIndex specify the level parameter. Note that in the code below only one level of the original MultiIndex is moved to the columns. The other level remains as the row index.
east_combined.reset_index(level = 1)
| source_idx | store_no | units_sold | |
|---|---|---|---|
| month | |||
| jan | 0 | 1 | 12 |
| jan | 1 | 2 | 15 |
| jan | 2 | 3 | 9 |
| jan | 3 | 4 | 17 |
| jan | 4 | 5 | 21 |
| feb | 0 | 1 | 11 |
| feb | 1 | 2 | 17 |
| feb | 2 | 3 | 10 |
| feb | 3 | 4 | 14 |
| feb | 4 | 5 | 23 |
3.2.4 Moving Columns to the Row Index with set_index()
In Chapter 1 we used set_index() to specify which column or columns to use as the new row index for a DataFrame. Now that we have learned about reset_index() we have the tools to move columns to the row index and move row index levels to columns. In one of the examples above we used reset_index() to move the two levels of the row MultiIndex of the east_combined DataFrame, month and source_idx, to the column index (i.e. transformed the row index levels into columns). Let’s review that operation, and then use set_index() to move month and source_idx back to the row MultiIndex. First, let’s display east_combined to review its structure.
east_combined
| store_no | units_sold | ||
|---|---|---|---|
| month | source_idx | ||
| jan | 0 | 1 | 12 |
| 1 | 2 | 15 | |
| 2 | 3 | 9 | |
| 3 | 4 | 17 | |
| 4 | 5 | 21 | |
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
We can see that month and source_idx are the two levels of the row MultiIndex. Let’s move them to the column index (i.e. transform them into columns) with reset_index(). We use inplace = True to actually change the east_combined DataFrame, rather than simply returning a copy of it with the change implemented.
east_combined.reset_index(inplace = True)
east_combined
| month | source_idx | store_no | units_sold | |
|---|---|---|---|---|
| 0 | jan | 0 | 1 | 12 |
| 1 | jan | 1 | 2 | 15 |
| 2 | jan | 2 | 3 | 9 |
| 3 | jan | 3 | 4 | 17 |
| 4 | jan | 4 | 5 | 21 |
| 5 | feb | 0 | 1 | 11 |
| 6 | feb | 1 | 2 | 17 |
| 7 | feb | 2 | 3 | 10 |
| 8 | feb | 3 | 4 | 14 |
| 9 | feb | 4 | 5 | 23 |
Now we will use set_index() to undo that change and move month and source_idx back to the row index.
east_combined.set_index(['month', 'source_idx'], inplace = True)
east_combined
| store_no | units_sold | ||
|---|---|---|---|
| month | source_idx | ||
| jan | 0 | 1 | 12 |
| 1 | 2 | 15 | |
| 2 | 3 | 9 | |
| 3 | 4 | 17 | |
| 4 | 5 | 21 | |
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
3.2.5 Using get_level_values() to Access MultiIndex Values
In Chapter 1 we used query() and loc[] to subset a DataFrame according to the values in a column or columns. How can we subset a DataFrame according to the values in a level of the row MultiIndex? To do that we can use query(), which can recognize labels for levels of the row MultiIndex, or we can use the get_level_values() method within loc[] to create a Boolean array used for selection. Note that the get_level_values() method is applied to an index, rather than to a DataFrame or a Series. That is why it is preceded by .index when it is to be applied to the row index. It can also be used to access the values of a column MultiIndex if it is preceded with .columns.
For example, if we wanted to see the data from the east_combined DataFrame just for February we could use query(), as in the code below.
east_combined.query('month == "feb"')
| store_no | units_sold | ||
|---|---|---|---|
| month | source_idx | ||
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
We could also use get_level_values() to create a Boolean selection array based on the values in the month level of the row MultiIndex, as in the code below.
east_combined.loc[east_combined.index.get_level_values('month') == "feb"]
| store_no | units_sold | ||
|---|---|---|---|
| month | source_idx | ||
| feb | 0 | 1 | 11 |
| 1 | 2 | 17 | |
| 2 | 3 | 10 | |
| 3 | 4 | 14 | |
| 4 | 5 | 23 |
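The same technique works on a column MultiIndex through .columns. Since the DataFrames in this section have single-level columns, here is a minimal sketch using a small hypothetical DataFrame:
# Sketch: select columns whose outer level equals 'pts' in a 2-level column MultiIndex
cols = pd.MultiIndex.from_tuples([('pts', 'mean'), ('pts', 'max'), ('ast', 'mean')])
demo = pd.DataFrame([[10.0, 25, 4.2]], columns = cols)
demo.loc[:, demo.columns.get_level_values(0) == 'pts']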
3.3 Removing Levels of a MultiIndex with droplevel()
The droplevel() method may be applied to a DataFrame or a Series to remove levels of a row or column MultiIndex. It takes an axis parameter to specify the axis from which to remove the level. To remove a level of a row MultiIndex specify axis = 0 and to remove a level of the column MultiIndex specify axis = 1. The level parameter is used to specify which level(s) to drop. The level may be specified by position or label.
Below let’s create a new DataFrame from the box scores data that shows scoring statistics by player. Notice that this code creates a DataFrame with a 3-level MultiIndex for the columns, and the column labels aren’t ideal.
player_scoring = (boxscores
.groupby('player')
.agg({'pts': ['sum', 'describe']})
)
player_scoring
| pts | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| sum | describe | ||||||||
| pts | count | mean | std | min | 25% | 50% | 75% | max | |
| player | |||||||||
| A. Bailey | 23 | 10.0 | 2.300000 | 3.497618 | 0.0 | 0.0 | 1.0 | 2.75 | 11.0 |
| A. Black | 316 | 69.0 | 4.579710 | 4.606280 | 0.0 | 1.0 | 4.0 | 7.00 | 23.0 |
| A. Brooks | 42 | 10.0 | 4.200000 | 5.977736 | 0.0 | 0.0 | 1.0 | 5.75 | 17.0 |
| A. Burks | 689 | 66.0 | 10.439394 | 7.722353 | 0.0 | 5.0 | 8.5 | 15.00 | 34.0 |
| A. Caruso | 715 | 71.0 | 10.070423 | 4.492292 | 2.0 | 7.0 | 10.0 | 12.00 | 23.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Z. Nnaji | 186 | 58.0 | 3.206897 | 3.177545 | 0.0 | 0.0 | 2.0 | 5.75 | 14.0 |
| Z. Simpson | 42 | 7.0 | 6.000000 | 5.416026 | 0.0 | 2.0 | 4.0 | 10.00 | 14.0 |
| Z. Smith | 0 | 1.0 | 0.000000 | NaN | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| Z. Williams | 420 | 51.0 | 8.235294 | 6.211564 | 0.0 | 4.0 | 8.0 | 13.00 | 27.0 |
| Z. Williamson | 1601 | 70.0 | 22.871429 | 7.243030 | 4.0 | 19.0 | 23.5 | 27.75 | 36.0 |
584 rows × 9 columns
First, let’s remove the first two levels of the column MultiIndex. Within the droplevel() method we specify axis = 1 to indicate that we want to work with the column index. This method always returns a copy of the DataFrame with the operation applied, so we need to assign the result back to player_scoring for the changes to persist.
player_scoring = player_scoring.droplevel(level = [0, 1], axis = 1)
player_scoring
| pts | count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|
| player | |||||||||
| A. Bailey | 23 | 10.0 | 2.300000 | 3.497618 | 0.0 | 0.0 | 1.0 | 2.75 | 11.0 |
| A. Black | 316 | 69.0 | 4.579710 | 4.606280 | 0.0 | 1.0 | 4.0 | 7.00 | 23.0 |
| A. Brooks | 42 | 10.0 | 4.200000 | 5.977736 | 0.0 | 0.0 | 1.0 | 5.75 | 17.0 |
| A. Burks | 689 | 66.0 | 10.439394 | 7.722353 | 0.0 | 5.0 | 8.5 | 15.00 | 34.0 |
| A. Caruso | 715 | 71.0 | 10.070423 | 4.492292 | 2.0 | 7.0 | 10.0 | 12.00 | 23.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Z. Nnaji | 186 | 58.0 | 3.206897 | 3.177545 | 0.0 | 0.0 | 2.0 | 5.75 | 14.0 |
| Z. Simpson | 42 | 7.0 | 6.000000 | 5.416026 | 0.0 | 2.0 | 4.0 | 10.00 | 14.0 |
| Z. Smith | 0 | 1.0 | 0.000000 | NaN | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| Z. Williams | 420 | 51.0 | 8.235294 | 6.211564 | 0.0 | 4.0 | 8.0 | 13.00 | 27.0 |
| Z. Williamson | 1601 | 70.0 | 22.871429 | 7.243030 | 4.0 | 19.0 | 23.5 | 27.75 | 36.0 |
584 rows × 9 columns
We should also change the column names so that they actually reflect the values that are in the columns. For example, the first column, pts, shows total points for the season. The second column shows number of games, and the other columns show per-game averages. The current column names don’t clearly communicate the column values. Below we rename the columns with more descriptive names, using the rename() method.
player_scoring.rename(columns = {
'pts': 'season_total_pts',
'count': 'games_played',
'mean': 'per_game_mean',
'std': 'per_game_std',
'min': 'per_game_min',
'25%': 'per_game_25%',
'50%': 'per_game_50%',
'75%': 'per_game_75%',
'max': 'per_game_max'
}, inplace = True)
player_scoring
| season_total_pts | games_played | per_game_mean | per_game_std | per_game_min | per_game_25% | per_game_50% | per_game_75% | per_game_max | |
|---|---|---|---|---|---|---|---|---|---|
| player | |||||||||
| A. Bailey | 23 | 10.0 | 2.300000 | 3.497618 | 0.0 | 0.0 | 1.0 | 2.75 | 11.0 |
| A. Black | 316 | 69.0 | 4.579710 | 4.606280 | 0.0 | 1.0 | 4.0 | 7.00 | 23.0 |
| A. Brooks | 42 | 10.0 | 4.200000 | 5.977736 | 0.0 | 0.0 | 1.0 | 5.75 | 17.0 |
| A. Burks | 689 | 66.0 | 10.439394 | 7.722353 | 0.0 | 5.0 | 8.5 | 15.00 | 34.0 |
| A. Caruso | 715 | 71.0 | 10.070423 | 4.492292 | 2.0 | 7.0 | 10.0 | 12.00 | 23.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Z. Nnaji | 186 | 58.0 | 3.206897 | 3.177545 | 0.0 | 0.0 | 2.0 | 5.75 | 14.0 |
| Z. Simpson | 42 | 7.0 | 6.000000 | 5.416026 | 0.0 | 2.0 | 4.0 | 10.00 | 14.0 |
| Z. Smith | 0 | 1.0 | 0.000000 | NaN | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| Z. Williams | 420 | 51.0 | 8.235294 | 6.211564 | 0.0 | 4.0 | 8.0 | 13.00 | 27.0 |
| Z. Williamson | 1601 | 70.0 | 22.871429 | 7.243030 | 4.0 | 19.0 | 23.5 | 27.75 | 36.0 |
584 rows × 9 columns
3.4 Merging DataFrames with merge() and .join()
Pandas DataFrames can be merged based on the values in their indexes or the values in specified columns using the .join() method and merge() function. We will call such operations “merges,” regardless of whether they are accomplished with .join() or merge(). In a merge the rows of the input DataFrames are combined horizontally, resulting in a DataFrame with wider rows, because the rows have the columns from all the DataFrames that were combined. The rows are lined up for combination based on like values in indexes or one or more columns that are specified within the .join() method or merge() function.
While they do similar things, .join() is only implemented as a method and is a bit simpler. merge() is more powerful and is implemented both as a pandas function and a DataFrame method. .join() actually uses merge() internally, but provides a somewhat simplified interface. Their default behavior also differs: merge() by default looks for a common (same name) column in the two DataFrames on which to combine the data. join() by default combines the columns based on their indexes. I recommend using merge(), and explicitly setting the parameters rather than relying on default behavior.
3.4.0.1 Creating example DataFrames
The code below creates two sample DataFrames that we will use to investigate various types of merge operations. The first DataFrame, named students, represents a student data table for which StudentID is the unique identifier for each row. Other data included in the table are the major and name for each student.
The second DataFrame, named members, represents a table with data on a student organization. This table also uses student ID as the unique identifier, but the student ID is in a column named Member. Other columns in this table represent the student’s role in the organization, year in school (“First Year,” “Sophomore,” etc.) and name. The student names in the members DataFrame, however, are the students’ preferred names or nicknames, rather than the official names that are in the students DataFrame.
A useful merge of these two DataFrames would determine which rows from each DataFrame share the same student ID, and combine them, so that the resulting rows include the data from both input DataFrames.
Let’s go ahead and create the DataFrames and take a look.
students = pd.DataFrame({'StudentID': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
'Major': ['MGMT', 'ECON', 'MUSIC', 'BIO', 'HIST', 'BIO'],
'Name': ['Jennifer', 'Robert', 'James',
'Hailey', 'Su-mei', 'Khartik']})
members = pd.DataFrame({'Member': ['S2', 'S3', 'S5', 'S6'],
'Role': ['President', 'Treasurer', 'Media Director', 'Secretary'],
'Year': ['Sophomore', 'Senior', 'Junior', 'Junior'],
'Name': ['Rob', 'Jimmy', 'Mei', 'Khartik']})
The two DataFrames are displayed below. Note that when we created the DataFrames we didn’t specify an index, so they were automatically indexed with integers. However, we can see that the two DataFrames are related logically by student IDs. In the students DataFrame the student IDs are in a column named StudentID and in the members DataFrame they are in a column named Member.
students
| StudentID | Major | Name | |
|---|---|---|---|
| 0 | S1 | MGMT | Jennifer |
| 1 | S2 | ECON | Robert |
| 2 | S3 | MUSIC | James |
| 3 | S4 | BIO | Hailey |
| 4 | S5 | HIST | Su-mei |
| 5 | S6 | BIO | Khartik |
members
| Member | Role | Year | Name | |
|---|---|---|---|---|
| 0 | S2 | President | Sophomore | Rob |
| 1 | S3 | Treasurer | Senior | Jimmy |
| 2 | S5 | Media Director | Junior | Mei |
| 3 | S6 | Secretary | Junior | Khartik |
3.4.1 Merging DataFrames with merge()
Here is the function header for merge() from the pandas documentation:
pandas.merge(left, right, how='inner', on=None, left_on=None,
right_on=None, left_index=False, right_index=False,
sort=False, suffixes=('_x', '_y'), copy=None,
indicator=False, validate=None)
The default merge() will not work well with our two DataFrames. Since merge() looks for a commonly-named column to use as the key it will merge on Name by default, but the names don’t all match correctly. In the merge below we see that there was only one row from each table that matched based on the Name column, the row for a student named Khartik.
pd.merge(students, members)
| StudentID | Major | Name | Member | Role | Year | |
|---|---|---|---|---|---|---|
| 0 | S6 | BIO | Khartik | S6 | Secretary | Junior |
In the previous merge only one row is in the result, because the default type of merge is an inner merge (from the how = 'inner' default parameter). That means that only the rows for which the key value is equal in both DataFrames are included in the result DataFrame. Each row in one DataFrame is matched with every row in the other DataFrame that has a matching key.
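To make that last point concrete, here is a minimal sketch with two tiny hypothetical DataFrames in which a key value appears more than once; an inner merge produces one output row per matching pair of rows and drops keys that appear in only one DataFrame:
left = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'c'], 'y': [10, 20]})
pd.merge(left, right, on = 'key')   # two rows for 'a'; 'b' and 'c' are dropped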
We have seen that relying on the merge() function’s default behavior of merging on a same-named column doesn’t work well with this particular data. It is also a good idea to be explicit when using pandas functions and not simply rely on the default parameter settings. So, let’s specify the columns to use for the merge.
We can merge the two DataFrames and specify the two columns to use with the left_on and right_on parameters. Since the members DataFrame is first in the list of DataFrames to be merged it is considered the left DataFrame and the students DataFrame is considered the right DataFrame. We will use the Member column from the members DataFrame and the StudentID column from the students DataFrame for the merge, because both those columns hold the student’s student ID.
Note that merge() can combine in 4 different ways, which coincide with the SQL join types. These are specified with the how parameter. Possible values for the how parameter are ‘inner’, ‘outer’, ‘left’, and ‘right’. ‘inner’ is the default. It keeps all rows for which the value in the column on which the DataFrames are joined is the same in both DataFrames.
Note, also, that since both DataFrames have a column named Name the name columns in the result DataFrame will be suffixed with _x (for the left DataFrame) and _y (for the right DataFrame) by default.
pd.merge(members, students,
left_on='Member',
right_on='StudentID')
| Member | Role | Year | Name_x | StudentID | Major | Name_y | |
|---|---|---|---|---|---|---|---|
| 0 | S2 | President | Sophomore | Rob | S2 | ECON | Robert |
| 1 | S3 | Treasurer | Senior | Jimmy | S3 | MUSIC | James |
| 2 | S5 | Media Director | Junior | Mei | S5 | HIST | Su-mei |
| 3 | S6 | Secretary | Junior | Khartik | S6 | BIO | Khartik |
The suffixes appended to same-named columns from the merged DataFrames may be set with the suffixes parameter. Let’s use that to make the source of the name information in each column more explicit. Let’s also be explicit about the type of merge performed by setting the how parameter to 'inner'.
pd.merge(members, students, how = 'inner',
left_on='Member', right_on='StudentID',
suffixes=('_from_members', '_from_students'))
| Member | Role | Year | Name_from_members | StudentID | Major | Name_from_students | |
|---|---|---|---|---|---|---|---|
| 0 | S2 | President | Sophomore | Rob | S2 | ECON | Robert |
| 1 | S3 | Treasurer | Senior | Jimmy | S3 | MUSIC | James |
| 2 | S5 | Media Director | Junior | Mei | S5 | HIST | Su-mei |
| 3 | S6 | Secretary | Junior | Khartik | S6 | BIO | Khartik |
3.4.1.1 We can also use merge() as a DataFrame method
In this case the DataFrame before the dot is the left DataFrame and the DataFrame inside the parentheses is the right DataFrame.
members.merge(students,
left_on='Member',
right_on='StudentID')
| Member | Role | Year | Name_x | StudentID | Major | Name_y | |
|---|---|---|---|---|---|---|---|
| 0 | S2 | President | Sophomore | Rob | S2 | ECON | Robert |
| 1 | S3 | Treasurer | Senior | Jimmy | S3 | MUSIC | James |
| 2 | S5 | Media Director | Junior | Mei | S5 | HIST | Su-mei |
| 3 | S6 | Secretary | Junior | Khartik | S6 | BIO | Khartik |
3.4.1.2 The indicator parameter
If we set indicator=True a column called _merge is added to the resulting DataFrame with information on the source of each row’s key value. We will use it as we investigate the various types of merges.
members.merge(students, how = 'inner',
left_on='Member',
right_on='StudentID',
indicator=True)
| Member | Role | Year | Name_x | StudentID | Major | Name_y | _merge | |
|---|---|---|---|---|---|---|---|---|
| 0 | S2 | President | Sophomore | Rob | S2 | ECON | Robert | both |
| 1 | S3 | Treasurer | Senior | Jimmy | S3 | MUSIC | James | both |
| 2 | S5 | Media Director | Junior | Mei | S5 | HIST | Su-mei | both |
| 3 | S6 | Secretary | Junior | Khartik | S6 | BIO | Khartik | both |
3.4.1.3 Dropping unwanted columns from the merged DataFrame
We can use drop(), specifying axis=1, to drop columns by name from the resulting DataFrame. This gets rid of the columns with duplicate data. Let’s do the merge again, but drop the StudentID and Name_x columns from the result.
members.merge(students,
how = 'inner',
left_on='Member',
right_on='StudentID').drop(['StudentID', 'Name_x'], axis=1)
| Member | Role | Year | Major | Name_y | |
|---|---|---|---|---|---|
| 0 | S2 | President | Sophomore | ECON | Robert |
| 1 | S3 | Treasurer | Senior | MUSIC | James |
| 2 | S5 | Media Director | Junior | HIST | Su-mei |
| 3 | S6 | Secretary | Junior | BIO | Khartik |
Now let’s look at some of the other types of merges: how='outer', how='left', and how='right'.
3.4.1.4 Outer merge with how='outer'
An outer merge keeps all the key values from both merged DataFrames and merges each row in one DataFrame with every row in the other DataFrame that has a matching key, if any. Where a key value doesn’t exist in one of the input DataFrames the values for that DataFrame’s columns are missing in the merged DataFrame. Below we merge the members and students DataFrames with an outer merge. Students with IDs ‘S1’ and ‘S4’ are not in the members DataFrame, so the values in the rows for those two student IDs come only from the right input DataFrame, students. We can see this indicated in the _merge column on the far right. Since the values for those rows come only from the students DataFrame the columns that would be populated with data from the members DataFrame are filled with missing value indicators.
members.merge(students,
left_on='Member',
right_on='StudentID',
how='outer',
indicator=True)
| Member | Role | Year | Name_x | StudentID | Major | Name_y | _merge | |
|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | S1 | MGMT | Jennifer | right_only |
| 1 | S2 | President | Sophomore | Rob | S2 | ECON | Robert | both |
| 2 | S3 | Treasurer | Senior | Jimmy | S3 | MUSIC | James | both |
| 3 | NaN | NaN | NaN | NaN | S4 | BIO | Hailey | right_only |
| 4 | S5 | Media Director | Junior | Mei | S5 | HIST | Su-mei | both |
| 5 | S6 | Secretary | Junior | Khartik | S6 | BIO | Khartik | both |
3.4.1.5 Right merge with how='right'
- A right merge keeps all the keys from the right side and merges each row from the right DataFrame with every row in the left DataFrame that has a matching key. If there is no match missing values are placed in the columns that would be populated by values from the left DataFrame.
- With the particular data we are using the result is the same as the outer merge.
members.merge(students,
left_on='Member',
right_on='StudentID',
how='right',
indicator=True)
| Member | Role | Year | Name_x | StudentID | Major | Name_y | _merge | |
|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | S1 | MGMT | Jennifer | right_only |
| 1 | S2 | President | Sophomore | Rob | S2 | ECON | Robert | both |
| 2 | S3 | Treasurer | Senior | Jimmy | S3 | MUSIC | James | both |
| 3 | NaN | NaN | NaN | NaN | S4 | BIO | Hailey | right_only |
| 4 | S5 | Media Director | Junior | Mei | S5 | HIST | Su-mei | both |
| 5 | S6 | Secretary | Junior | Khartik | S6 | BIO | Khartik | both |
Note that order matters with a right merge. Let’s try a right merge in the opposite order, with the students DataFrame as the left DataFrame and the members DataFrame as the right DataFrame.
students.merge(members,
left_on='StudentID',
right_on='Member',
how='right',
indicator=True)
| StudentID | Major | Name_x | Member | Role | Year | Name_y | _merge | |
|---|---|---|---|---|---|---|---|---|
| 0 | S2 | ECON | Robert | S2 | President | Sophomore | Rob | both |
| 1 | S3 | MUSIC | James | S3 | Treasurer | Senior | Jimmy | both |
| 2 | S5 | HIST | Su-mei | S5 | Media Director | Junior | Mei | both |
| 3 | S6 | BIO | Khartik | S6 | Secretary | Junior | Khartik | both |
3.4.1.6 Left merge with how='left'
Let’s try a left merge with the students DataFrame as the left DataFrame. All the keys are kept from the left DataFrame, and each row from the left DataFrame is merged with every row in the right DataFrame that has a matching key. If the right DataFrame doesn’t have one of the keys, missing values are placed in the columns that would have come from the right DataFrame.
students.merge(members,
left_on='StudentID',
right_on='Member',
how='left',
indicator=True)
| StudentID | Major | Name_x | Member | Role | Year | Name_y | _merge | |
|---|---|---|---|---|---|---|---|---|
| 0 | S1 | MGMT | Jennifer | NaN | NaN | NaN | NaN | left_only |
| 1 | S2 | ECON | Robert | S2 | President | Sophomore | Rob | both |
| 2 | S3 | MUSIC | James | S3 | Treasurer | Senior | Jimmy | both |
| 3 | S4 | BIO | Hailey | NaN | NaN | NaN | NaN | left_only |
| 4 | S5 | HIST | Su-mei | S5 | Media Director | Junior | Mei | both |
| 5 | S6 | BIO | Khartik | S6 | Secretary | Junior | Khartik | both |
3.4.2 Merging DataFrames with join()
The join() method provides a subset of the functionality of the merge() function, but has different defaults. While merge() defaults to an inner merge, join() defaults to a left merge. join() also defaults to matching on the indexes rather than same-named columns. Since join() only provides a subset of the functionality of merge() I recommend just using merge() for all merges in your code. Information on join() is provided here, though, because you may see it used in others’ code.
The method header for join() is as follows:
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
The default join() joins on the indexes. In our two DataFrames the indexes are just arbitrary integers, so the join doesn’t make sense. Note that since the Name column is in both input DataFrames we need to specify the lsuffix and/or rsuffix parameters to provide a suffix for one or both of the same-named columns.
The default join on indexes results in nonsense for our data.
members.join(students, lsuffix='_members')
| Member | Role | Year | Name_members | StudentID | Major | Name | |
|---|---|---|---|---|---|---|---|
| 0 | S2 | President | Sophomore | Rob | S1 | MGMT | Jennifer |
| 1 | S3 | Treasurer | Senior | Jimmy | S2 | ECON | Robert |
| 2 | S5 | Media Director | Junior | Mei | S3 | MUSIC | James |
| 3 | S6 | Secretary | Junior | Khartik | S4 | BIO | Hailey |
join() has an on parameter, but doesn’t allow us to specify different columns for left and right like merge() does. One way of dealing with this is to set the indexes to the appropriate columns so that the join works correctly. The set_index() method is used to specify the column to use for the index. Note below how the numeric index is replaced by the student IDs when we set the index to the Member column.
members.set_index('Member')
| Role | Year | Name | |
|---|---|---|---|
| Member | |||
| S2 | President | Sophomore | Rob |
| S3 | Treasurer | Senior | Jimmy |
| S5 | Media Director | Junior | Mei |
| S6 | Secretary | Junior | Khartik |
3.4.2.1 We can join the members and students DataFrames by giving them a common index
In the code below, before performing the join we set the indexes of both input DataFrames to the column that has the student ID value.
joined_DataFrame = (members.set_index('Member')
.join(students.set_index('StudentID'),
lsuffix='_nickname'))
joined_DataFrame
| Role | Year | Name_nickname | Major | Name | |
|---|---|---|---|---|---|
| Member | |||||
| S2 | President | Sophomore | Rob | ECON | Robert |
| S3 | Treasurer | Senior | Jimmy | MUSIC | James |
| S5 | Media Director | Junior | Mei | HIST | Su-mei |
| S6 | Secretary | Junior | Khartik | BIO | Khartik |
If we want to make Member a regular column again instead of the index we can use reset_index() to do so.
joined_DataFrame.reset_index(inplace=True)
joined_DataFrame
| Member | Role | Year | Name_nickname | Major | Name | |
|---|---|---|---|---|---|---|
| 0 | S2 | President | Sophomore | Rob | ECON | Robert |
| 1 | S3 | Treasurer | Senior | Jimmy | MUSIC | James |
| 2 | S5 | Media Director | Junior | Mei | HIST | Su-mei |
| 3 | S6 | Secretary | Junior | Khartik | BIO | Khartik |
3.5 Reshaping DataFrames
With tabular data it is sometimes necessary to reshape the data by moving the values between rows and columns. Similarly, with pandas DataFrames we sometimes want to move index levels from the rows to the columns or vice versa. In this section we will investigate several functions and methods that we can use to do such reshaping operations. The functions that we will cover include the following:
- pivot() - used to move values from one column to the column index (column names)
- melt() - inverse of pivot(). Combines multiple columns into a set of two columns
- stack() - used to move an index level from the column index to the row index
- unstack() - inverse of stack(). Used to move an index level from the row index to the column index
First, we will read in the NBA box score data and use it to create some sample DataFrames in “wide format” and “long format”. You will gain an understanding of what “wide format” and “long format” mean when you look at the sample DataFrames. They show the same data, selected statistics for three NBA superstars, but in different configurations.
boxscores = boxscores.assign(
date = pd.to_datetime(boxscores['date'], format="%Y%m%d")
)
The code below creates the wide format DataFrame. It has a separate column for each statistic.
big3_wide = (boxscores
.loc[boxscores['player'].isin(['G. Antetokounmpo',
'J. Embiid',
'N. Jokic']),
['date', 'player', 'min', 'pts', 'ast', 'reb', 'stl', 'to']]
.reset_index(drop=True)
)
big3_wide.head(7)
| date | player | min | pts | ast | reb | stl | to | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-10-24 | N. Jokic | 36 | 29 | 11 | 13 | 1 | 2 |
| 1 | 2023-10-26 | J. Embiid | 36 | 24 | 6 | 7 | 0 | 7 |
| 2 | 2023-10-26 | G. Antetokounmpo | 35 | 23 | 3 | 12 | 2 | 7 |
| 3 | 2023-10-27 | N. Jokic | 35 | 22 | 7 | 12 | 1 | 9 |
| 4 | 2023-10-28 | J. Embiid | 35 | 34 | 8 | 9 | 1 | 1 |
| 5 | 2023-10-29 | N. Jokic | 30 | 28 | 5 | 14 | 0 | 4 |
| 6 | 2023-10-29 | G. Antetokounmpo | 29 | 26 | 3 | 11 | 1 | 3 |
The code below puts the same data into a format called “long,” “stacked,” or “record” format. This format puts all the values for the statistics in one column, and puts an indicator of which statistic each value represents in another column.
big3_long = pd.melt(big3_wide,
id_vars = ['player', 'date'],
value_name = 'value',
var_name = 'stat')
big3_long.sort_values(by=['player','date']).head(7)
| player | date | stat | value | |
|---|---|---|---|---|
| 2 | G. Antetokounmpo | 2023-10-26 | min | 35 |
| 193 | G. Antetokounmpo | 2023-10-26 | pts | 23 |
| 384 | G. Antetokounmpo | 2023-10-26 | ast | 3 |
| 575 | G. Antetokounmpo | 2023-10-26 | reb | 12 |
| 766 | G. Antetokounmpo | 2023-10-26 | stl | 2 |
| 957 | G. Antetokounmpo | 2023-10-26 | to | 7 |
| 6 | G. Antetokounmpo | 2023-10-29 | min | 29 |
3.5.1 Using pivot() to go from long to wide format
We can use the .pivot() method or pivot() top-level function to transform long-format data to wide format. It moves values from a specified column into the column index (i.e. makes them column names). Take a look at the sample of the big3_long DataFrame below. It has a stat column that indicates which statistic the row represents and a value column that holds the value of that statistic.
big3_long.sample(8)
| player | date | stat | value | |
|---|---|---|---|---|
| 30 | N. Jokic | 2023-11-14 | min | 36 |
| 404 | N. Jokic | 2023-11-08 | ast | 5 |
| 1024 | G. Antetokounmpo | 2023-12-16 | to | 0 |
| 278 | G. Antetokounmpo | 2023-12-29 | pts | 34 |
| 340 | N. Jokic | 2024-02-29 | pts | 18 |
| 601 | G. Antetokounmpo | 2023-11-13 | reb | 11 |
| 394 | J. Embiid | 2023-11-02 | ast | 7 |
| 629 | J. Embiid | 2023-12-06 | reb | 13 |
The pivot() method or function can be used to transform the big3_long data into wide format by using the values in the stat column as new column names. The index parameter indicates which column or columns are to be used to create the index for the resulting DataFrame. The columns parameter indicates the column or columns for which the values will become column index values, and the values parameter indicates the column or columns that hold the data values.
big3_pivoted_wide = pd.pivot(big3_long,
index = ['date', 'player'],
columns = 'stat',
values = 'value')
big3_pivoted_wide
| stat | ast | min | pts | reb | stl | to | |
|---|---|---|---|---|---|---|---|
| date | player | ||||||
| 2023-10-24 | N. Jokic | 11 | 36 | 29 | 13 | 1 | 2 |
| 2023-10-26 | G. Antetokounmpo | 3 | 35 | 23 | 12 | 2 | 7 |
| J. Embiid | 6 | 36 | 24 | 7 | 0 | 7 | |
| 2023-10-27 | N. Jokic | 7 | 35 | 22 | 12 | 1 | 9 |
| 2023-10-28 | J. Embiid | 8 | 35 | 34 | 9 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-04-09 | N. Jokic | 7 | 34 | 28 | 13 | 3 | 5 |
| 2024-04-10 | N. Jokic | 7 | 38 | 41 | 11 | 3 | 2 |
| 2024-04-12 | J. Embiid | 7 | 32 | 32 | 13 | 0 | 4 |
| N. Jokic | 12 | 39 | 22 | 7 | 2 | 3 | |
| 2024-04-14 | N. Jokic | 5 | 31 | 15 | 15 | 4 | 4 |
191 rows × 6 columns
The columns used to create the new index have created a MultiIndex for the rows. Recall that the row index labels for a DataFrame with a MultiIndex are tuples, with each position in the tuple corresponding to a level of the MultiIndex. Below, the row index of the big3_pivoted_wide DataFrame is displayed. It is an array of tuples, with one tuple corresponding to each row of the DataFrame.
big3_pivoted_wide.index
MultiIndex([('2023-10-24', 'N. Jokic'),
('2023-10-26', 'G. Antetokounmpo'),
('2023-10-26', 'J. Embiid'),
('2023-10-27', 'N. Jokic'),
('2023-10-28', 'J. Embiid'),
('2023-10-29', 'G. Antetokounmpo'),
('2023-10-29', 'J. Embiid'),
('2023-10-29', 'N. Jokic'),
('2023-10-30', 'G. Antetokounmpo'),
('2023-10-30', 'N. Jokic'),
...
('2024-04-06', 'J. Embiid'),
('2024-04-06', 'N. Jokic'),
('2024-04-07', 'G. Antetokounmpo'),
('2024-04-09', 'G. Antetokounmpo'),
('2024-04-09', 'J. Embiid'),
('2024-04-09', 'N. Jokic'),
('2024-04-10', 'N. Jokic'),
('2024-04-12', 'J. Embiid'),
('2024-04-12', 'N. Jokic'),
('2024-04-14', 'N. Jokic')],
names=['date', 'player'], length=191)
We can filter the DataFrame by label, using these tuples as the labels. Below we examine the statistics from two particular games for Nikola Jokic.
big3_pivoted_wide.loc[[('2024-04-06', 'N. Jokic'), ('2024-04-09', 'N. Jokic')]]
| stat | ast | min | pts | reb | stl | to | |
|---|---|---|---|---|---|---|---|
| date | player | ||||||
| 2024-04-06 | N. Jokic | 11 | 31 | 19 | 14 | 2 | 4 |
| 2024-04-09 | N. Jokic | 7 | 34 | 28 | 13 | 3 | 5 |
Recall that if we want to move the new index values back to columns we can do so with the reset_index() method.
big3_pivoted_wide.reset_index()
| stat | date | player | ast | min | pts | reb | stl | to |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-10-24 | N. Jokic | 11 | 36 | 29 | 13 | 1 | 2 |
| 1 | 2023-10-26 | G. Antetokounmpo | 3 | 35 | 23 | 12 | 2 | 7 |
| 2 | 2023-10-26 | J. Embiid | 6 | 36 | 24 | 7 | 0 | 7 |
| 3 | 2023-10-27 | N. Jokic | 7 | 35 | 22 | 12 | 1 | 9 |
| 4 | 2023-10-28 | J. Embiid | 8 | 35 | 34 | 9 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 186 | 2024-04-09 | N. Jokic | 7 | 34 | 28 | 13 | 3 | 5 |
| 187 | 2024-04-10 | N. Jokic | 7 | 38 | 41 | 11 | 3 | 2 |
| 188 | 2024-04-12 | J. Embiid | 7 | 32 | 32 | 13 | 0 | 4 |
| 189 | 2024-04-12 | N. Jokic | 12 | 39 | 22 | 7 | 2 | 3 |
| 190 | 2024-04-14 | N. Jokic | 5 | 31 | 15 | 15 | 4 | 4 |
191 rows × 8 columns
3.5.2 Using melt() to go from wide to long format
The top-level melt() function and the corresponding DataFrame.melt() method are used to transform data from wide to long format, where one or more columns are identifier variables, while all other columns, considered measured variables, are “unpivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”. The names of those columns can be customized by supplying the var_name and value_name parameters. When transforming a DataFrame using melt(), the row index will be ignored by default and a new row index created.
big3_wide.head()
| date | player | min | pts | ast | reb | stl | to | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-10-24 | N. Jokic | 36 | 29 | 11 | 13 | 1 | 2 |
| 1 | 2023-10-26 | J. Embiid | 36 | 24 | 6 | 7 | 0 | 7 |
| 2 | 2023-10-26 | G. Antetokounmpo | 35 | 23 | 3 | 12 | 2 | 7 |
| 3 | 2023-10-27 | N. Jokic | 35 | 22 | 7 | 12 | 1 | 9 |
| 4 | 2023-10-28 | J. Embiid | 35 | 34 | 8 | 9 | 1 | 1 |
big3_wide.melt(id_vars = ['date', 'player'],
var_name = 'stat')
| date | player | stat | value | |
|---|---|---|---|---|
| 0 | 2023-10-24 | N. Jokic | min | 36 |
| 1 | 2023-10-26 | J. Embiid | min | 36 |
| 2 | 2023-10-26 | G. Antetokounmpo | min | 35 |
| 3 | 2023-10-27 | N. Jokic | min | 35 |
| 4 | 2023-10-28 | J. Embiid | min | 35 |
| ... | ... | ... | ... | ... |
| 1141 | 2024-04-09 | N. Jokic | to | 5 |
| 1142 | 2024-04-10 | N. Jokic | to | 2 |
| 1143 | 2024-04-12 | J. Embiid | to | 4 |
| 1144 | 2024-04-12 | N. Jokic | to | 3 |
| 1145 | 2024-04-14 | N. Jokic | to | 4 |
1146 rows × 4 columns
3.5.3 Pivoting an index level from column to row with stack()
The stack() and unstack() methods are designed to work together with MultiIndex objects. stack() is used to “pivot” a level of the column labels, moving them to the row index and thus returning a DataFrame or Series whose row index has a new innermost level of labels.
With both stack() and unstack() we can specify which level (by name or index number) to unstack or stack. The default is the innermost level (for rows) and lowest level (for columns).
3.5.3.1 Create a Sample DataFrame to Illustrate
The purpose of the code below is to create a sample DataFrame that we can work with. You don’t need to understand all the code used to do so.
tuples = list(
zip( ["north", "north", "south", "south", "east", "east", "west", "west"],
["red", "blue", "red", "blue", "red", "blue", "red", "blue"],
)
)
index = pd.MultiIndex.from_tuples(tuples, names=["region", "team"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["score_1", "score_2"])
df
| score_1 | score_2 | ||
|---|---|---|---|
| region | team | ||
| north | red | -0.779907 | -0.032462 |
| blue | 1.800812 | -0.895245 | |
| south | red | -0.804839 | 1.529518 |
| blue | 0.997004 | -1.240219 | |
| east | red | -0.144024 | 0.273820 |
| blue | -1.904137 | -0.698416 | |
| west | red | 0.535852 | -1.501737 |
| blue | 0.430522 | -0.346292 |
We can use stack() to move the column index to a third level of the row index, as in the code below. The result is a pandas Series object, because it just has one column of values, with a MultiIndex as the row index.
stacked = df.stack()
stacked
region team
north red score_1 -0.779907
score_2 -0.032462
blue score_1 1.800812
score_2 -0.895245
south red score_1 -0.804839
score_2 1.529518
blue score_1 0.997004
score_2 -1.240219
east red score_1 -0.144024
score_2 0.273820
blue score_1 -1.904137
score_2 -0.698416
west red score_1 0.535852
score_2 -1.501737
blue score_1 0.430522
score_2 -0.346292
dtype: float64
stacked.index
MultiIndex([('north', 'red', 'score_1'),
('north', 'red', 'score_2'),
('north', 'blue', 'score_1'),
('north', 'blue', 'score_2'),
('south', 'red', 'score_1'),
('south', 'red', 'score_2'),
('south', 'blue', 'score_1'),
('south', 'blue', 'score_2'),
( 'east', 'red', 'score_1'),
( 'east', 'red', 'score_2'),
( 'east', 'blue', 'score_1'),
( 'east', 'blue', 'score_2'),
( 'west', 'red', 'score_1'),
( 'west', 'red', 'score_2'),
( 'west', 'blue', 'score_1'),
( 'west', 'blue', 'score_2')],
names=['region', 'team', None])
3.5.4 Pivot index level from row to column with unstack()
unstack() is the inverse of stack().
unstack() is used to “pivot” a level of the row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels. Recall the stacked Series we created in the previous section. It has a three-level MultiIndex as its row index.
stacked
region team
north red score_1 -0.779907
score_2 -0.032462
blue score_1 1.800812
score_2 -0.895245
south red score_1 -0.804839
score_2 1.529518
blue score_1 0.997004
score_2 -1.240219
east red score_1 -0.144024
score_2 0.273820
blue score_1 -1.904137
score_2 -0.698416
west red score_1 0.535852
score_2 -1.501737
blue score_1 0.430522
score_2 -0.346292
dtype: float64
Let’s use unstack() to pivot the inner row index (the default) of the stacked Series to a new level of the column index, which becomes the column index for the resulting DataFrame.
stacked.unstack()
| score_1 | score_2 | ||
|---|---|---|---|
| region | team | ||
| east | blue | -1.904137 | -0.698416 |
| red | -0.144024 | 0.273820 | |
| north | blue | 1.800812 | -0.895245 |
| red | -0.779907 | -0.032462 | |
| south | blue | 0.997004 | -1.240219 |
| red | -0.804839 | 1.529518 | |
| west | blue | 0.430522 | -0.346292 |
| red | 0.535852 | -1.501737 |
We can also specify which level of the row MultiIndex to move to the column index, by name or by level, as in the examples below. When specifying the level, remember that the levels are numbered from outside to inside (for rows) or from top to bottom (for columns) beginning with zero.
# Move the second row index level to the column index
stacked.unstack(level = 1)
| team | blue | red | |
|---|---|---|---|
| region | |||
| east | score_1 | -1.904137 | -0.144024 |
| score_2 | -0.698416 | 0.273820 | |
| north | score_1 | 1.800812 | -0.779907 |
| score_2 | -0.895245 | -0.032462 | |
| south | score_1 | 0.997004 | -0.804839 |
| score_2 | -1.240219 | 1.529518 | |
| west | score_1 | 0.430522 | 0.535852 |
| score_2 | -0.346292 | -1.501737 |
# Move the region level to the column index
stacked.unstack(level = 'region')
| region | east | north | south | west | |
|---|---|---|---|---|---|
| team | |||||
| blue | score_1 | -1.904137 | 1.800812 | 0.997004 | 0.430522 |
| score_2 | -0.698416 | -0.895245 | -1.240219 | -0.346292 | |
| red | score_1 | -0.144024 | -0.779907 | -0.804839 | 0.535852 |
| score_2 | 0.273820 | -0.032462 | 1.529518 | -1.501737 |
stack() or unstack() can also take a list of levels (labels or indices) to pivot.
stacked.unstack(['region', 'team'])
| region | north | south | east | west | ||||
|---|---|---|---|---|---|---|---|---|
| team | red | blue | red | blue | red | blue | red | blue |
| score_1 | -0.779907 | 1.800812 | -0.804839 | 0.997004 | -0.144024 | -1.904137 | 0.535852 | 0.430522 |
| score_2 | -0.032462 | -0.895245 | 1.529518 | -1.240219 | 0.273820 | -0.698416 | -1.501737 | -0.346292 |
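stack() accepts a list of column index levels in the same way. As a minimal sketch, stacking the region and team levels of the result above moves them back to the row index as the innermost levels (the level order differs from the original stacked Series, but the values are the same):
# Sketch: move the region and team column levels back to the row index
stacked.unstack(['region', 'team']).stack(['region', 'team'])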
3.6 Changing the Type of a Series with astype()
The astype() method can operate on a Series to change its data type. One common use is to transform strings that represent numbers into numeric types. For example, below a small DataFrame is created to represent teams and their win-loss records. The record column contains strings. If we split them apart to create wins and losses columns we need to transform the substrings into integers so that we can do math with them, such as calculating a win percentage.
records = pd.DataFrame({
'team': 'red green blue orange'.split(),
'record': ['12-8', '16-4', '7-13', '10-10']
})
records
| team | record | |
|---|---|---|
| 0 | red | 12-8 |
| 1 | green | 16-4 |
| 2 | blue | 7-13 |
| 3 | orange | 10-10 |
records['wins'] = records['record'].str.split('-').str.get(0).astype(int)
records['losses'] = records['record'].str.split('-').str.get(1).astype(int)
records['win_pct'] = records['wins'] / (records['wins'] + records['losses'])
records
| team | record | wins | losses | win_pct | |
|---|---|---|---|---|---|
| 0 | red | 12-8 | 12 | 8 | 0.60 |
| 1 | green | 16-4 | 16 | 4 | 0.80 |
| 2 | blue | 7-13 | 7 | 13 | 0.35 |
| 3 | orange | 10-10 | 10 | 10 | 0.50 |
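astype() can also be called on a whole DataFrame. As a minimal sketch (the int16 dtype here is just an illustrative choice), passing a dictionary converts several columns in one call:
# Sketch: convert multiple columns at once by passing a dict of column -> dtype
records.astype({'wins': 'int16', 'losses': 'int16'}).dtypes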