Advanced DataFrame Manipulation and Joins with Python and Pandas

This post will cover some of the more advanced DataFrames manipulations, such as sorting, subsetting, grouping and joining.

It is assumed that you have a basic knowledge of DataFrames and the pandas library. For a quick introduction to DataFrames, you can read my introductory post .

Let’s get to it!

DataFrame Manipulation Essentials

Sorting and Subsetting Data

Sorting and subsetting are the first tools to consider when starting DataFrame analysis and looking for interesting trends.

Sorting is pretty straightforward. You can sort a DataFrame or a Series by using the sort_values method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [24, 35, 18, 29],
        'Salary': [70000, 80000, 50000, 60000]}

df = pd.DataFrame(data)

# Sort by the 'Age' column in ascending order
sorted_df = df.sort_values(by='Age')

Here we are sorting the DataFrame by the Age column in ascending order, which is the default. DataFrame also supports sorting by multiple columns. To achieve this, provide a list of column names to the by argument:

1

sorted_df = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])

Now the DataFrame will be sorted by the Age and Salary columns, in the order they are listed, using multi-level sorting. Notice how the method also supports specifying the sorting direction for each of the columns provided via the ascending argument.

Subsetting Data

In data analysis, subsetting is just another word for filtering.

The simplest way to subset rows is by using a bracket notation on a DataFrame with a boolean condition within:

1

df_subset = df[df['Age'] > 10]

In the code above, we are creating a new DataFrame that is a subset of the original. This DataFrame will contain only the rows from the original DataFrame with the value in the Age column greater than 10.

For more complex scenarios, you can combine multiple boolean conditions using logical operators. For instance, here’s how to select rows where the age is between 20 and 30:

1

df_subset = df[(df['Age'] > 20) & (df['Age'] < 30)]

You might also have a need to select rows with values contained in a certain list. This can be done using the isin() method from the Numpy package:

1

df_subset = df[np.isin(df['Name'], ['Alice', 'David'])]

Data Aggregation

Now, let’s cover some of the ways you can aggregate data and calculate various summaries.

Summary statistics

Summary statistics are calculations that give you an overview of a dataset and help you identify its key properties. Some of the common summary statistics are mean, median, mode, standard deviation, and quantiles.

Pandas provides an easy way to quickly get the common summary statistics for a DataFrame using the describe method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


import pandas as pd

# Example DataFrame
data = {
    'Age': [22, 25, 29, 35, 40, 50],
    'Salary': [40000, 50000, 55000, 70000, 75000, 90000]
}

df = pd.DataFrame(data)

# Get summary statistics
summary_stats = df.describe()

print(summary_stats)

Here’s what the code above would produce:

1
2
3
4
5
6
7
8
9


             Age        Salary
count   6.000000      6.000000
mean   33.500000  63333.333333
std    10.753025  19364.917931
min    22.000000  40000.000000
25%    25.250000  51250.000000
50%    32.000000  62500.000000
75%    39.250000  73750.000000
max    50.000000  90000.000000

As you can see, using a single line of code, we were able to get an overview of the main summary statistics for the DataFrame.

However, if describe doesn’t provide the calculations you need and you need a more custom solution, you can use the agg method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'Age': [22, 25, 29, 35, 40, 50],
    'Salary': [40000, 50000, 55000, 70000, 75000, 90000],
    'Experience': [1, 3, 4, 5, 8, 10]
}

df = pd.DataFrame(data)

# Apply aggregation to all columns
result = df.agg([np.mean, np.min, np.max, np.std])

In the example above, we specified a custom list of statistics we need by combining the agg method with the list of Numpy utility functions. The result is a DataFrame with the calculations of mean, maximum value, minimum value and standard deviation on each of the columns of the original dataset:

1
2
3
4
5


             Age        Salary    Experience
mean    33.500000  63333.333333    5.166667
amin    22.000000  40000.000000    1.000000
amax    50.000000  90000.000000   10.000000
std     10.850469  18855.980157    3.070498

Cumulative statistics

Now let’s move on to cumulative statistics. Cumulative statistics are metrics that provide the running totals of data. They track how data builds up over time.

Pandas provides a way to easily calculate most common cumulative statistics using methods such as cumsum(), cummax() cumprod() and others:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


import pandas as pd

# Sample data
data = [5, 3, 7, 1, 4, 8]
df = pd.DataFrame({'Values': data})

# Calculate cumulative statistics
df['Cumulative Sum'] = df['Values'].cumsum()
df['Cumulative Max'] = df['Values'].cummax()
df['Cumulative Product'] = df['Values'].cumprod()
df['Cumulative Min'] = df['Values'].cummin()

print(df)

Here’s what the code above will output:

1
2
3
4
5
6
7


   Values  Cumulative Sum  Cumulative Max  Cumulative Product  Cumulative Min
0       5               5               5                   5               5
1       3               8               5                  15               3
2       7              15               7                 105               3
3       1              16               7                 105               1
4       4              20               7                 420               1
5       8              28               8                3360               1

And here’s what each of these newly created columns mean:

cumsum(): Adds up values progressively. Each row contains the sum of all the previous rows up to each point.
cumprod(): Computes the product progressively. Each row contains the product of multiplying all of the previous rows up to each point.
cummax(): Tracks the highest value up to each point.
cummin(): Tracks the lowest value up to each point.

Grouped summary

In data analysis, it is a common practice to analyze data based on categories or subsets within the data. In order to do that, you can first group the data by one or more columns using the groupBy method and then applying the calculation to the grouped object:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


import pandas as pd

# Sample DataFrame
data = {
    'Department': ['HR', 'HR', 'IT', 'IT', 'Sales', 'Sales'],
    'Salary': [50000, 52000, 70000, 72000, 65000, 68000]
}
df = pd.DataFrame(data)

# Group by Department and sum Salary (without using reset_index)
grouped_series = df.groupby('Department')['Salary'].sum()

print(grouped_series)

The code above will produce a Series object with the following content:

1
2
3
4
5


Department
HR       102000
IT       142000
Sales    133000
Name: Salary, dtype: int64

First, we grouped the data by the Department column and then calculated the total sum of the salary for each department by calling the sum method on the Salary column of each group. All with a single line of code.

Joining Data Frames: An Introduction to Data Merging

Now, let’s discuss another important data manipulation operation: merging. When analysing data, you will often find yourself needing to merge multiple DataFrames or their parts. To do that, Pandas provides the .merge() method.

Let’s start with a basic example of using the method to merge two DataFrames:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'Sales']
})

df2 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'Salary': [50000, 60000, 70000, 80000]
})

# Merging DataFrames on the 'Employee' column
merged_df = pd.merge(df1, df2, on='Employee')

print(merged_df)

The code above will produce a new DataFrame that will contain the combined data from the first two:

1
2
3
4


  Employee Department  Salary
0    Alice         HR   50000
1      Bob         IT   60000
2  Charlie         IT   70000

Here the merge method takes two DataFrames and the on argument. The on argument specifies the common column from both DataFrames that will be used for merging.

Often times, when merging two DataFrames you will find that they contain columns with overlapping names. So how can you differentiate them? That’s when you can use the suffixes argument of the merge method to separate columns with the same name:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


import pandas as pd

# Sample DataFrames with overlapping column names
df1 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'Sales'],
    'Salary': [45000, 55000, 60000, 70000]
})

df2 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'Department': ['Finance', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, 70000, 80000]
})

# Merging with suffixes to handle overlapping columns
merged_df = pd.merge(df1, df2, on='Employee', suffixes=('_df1', '_df2'))

print(merged_df)

Here’s the output of the code:

1
2
3
4


Employee Department_df1  Salary_df1 Department_df2  Salary_df2
0    Alice             HR       45000        Finance       50000
1      Bob             IT       55000             IT       60000
2  Charlie             IT       60000          Sales       70000

The Department and Salary columns are present in both DataFrames. But because we specified the suffixes argument, these columns were separated and given suffixes at the end of their names in order to differentiate them. Note that the Employee column was not given a suffix since it is the key column used for merging.

You might’ve noticed that employee record for Eve from the second DataFrame was omitted from the final result. That’s because the default behavior of the merge method is to only include rows present in both DataFrames. But do not fret, we will cover modifying this behavior shortly.

The merge method also allows merging on multiple columns. To do that, all you need to do is provide a list to the on argument:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


import pandas as pd

# Sample DataFrames with matching columns
df1 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'Sales'],
    'Salary': [45000, 55000, 60000, 70000]
})

df2 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'Department': ['HR', 'IT', 'Sales', 'HR'],
    'Bonus': [5000, 6000, 7000, 8000]
})

# Merging on 'Employee' and 'Department' (default inner join)
merged_df = pd.merge(df1, df2, on=['Employee', 'Department'])

print(merged_df)

Here’s the result we will get:

1
2
3


   Employee Department  Salary  Bonus
0     Alice         HR   45000   5000
1       Bob         IT   55000   6000

Now the final result contains even fewer rows and Charlie was omitted. That’s because while both of the DataFrames contain an employee with the name Charlie, the Department values for each of them is different.

Inner, Left, Right, and Outer Joins in Pandas

At this point, we’ve covered the basics of merging DataFrames. So I think now is a good time to cover different types of joints you can apply.

Inner join

By default, the merge method uses inner join for merging. If you have worked with databases before, this might sound familiar. As a quick refresher, inner join is the join operation where the result will contain only the rows with matching values present in both the left and right tables of the merging operation.

Left and Right joins

Another types of joins available are left and right joins.

Left join will keep all of the rows from left DataFrame and only matching rows from right DataFrame. It’s like saying “I want to keep everything from the left data set, regardless of whether there’s a match in the right one.” To specify which type of join you want to use, you can use the how argument of the merge method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'Sales'],
    'Salary': [45000, 55000, 60000, 70000]
})

df2 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'Bonus': [5000, 6000, 7000, 8000]
})

# Perform a left join on 'Employee'
left_merged = pd.merge(df1, df2, on='Employee', how='left')

print(left_merged)

Here’s the output:

1
2
3
4
5


   Employee Department  Salary  Bonus
0     Alice         HR   45000  5000.0
1       Bob         IT   55000  6000.0
2   Charlie         IT   60000  7000.0
3     David      Sales   70000     NaN

Notice that the output contains a record for David from Sales even though the second DataFrame does not have a matching record. Missing values are handled using NaN.

Right join is the mirror image of left join. It will keep all of the rows in the right DataFrame, regardless of whether there is a match in the left one.

Outer join

Let’s move on to the last join type we will cover - outer join. Outer join preserves the rows from both DataFrames, regardless of whether there is a match. As mentioned before, Pandas fills the missing values with NaN:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'Sales'],
    'Salary': [45000, 55000, 60000, 70000]
})

df2 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'Department': ['HR', 'IT', 'Sales', 'HR'],
    'Bonus': [5000, 6000, 7000, 8000]
})

# Perform an outer join
outer_merged = pd.merge(df1, df2, on=['Employee', 'Department'], how='outer')

print(outer_merged)

Here’s the final output:

1
2
3
4
5
6
7


   Employee Department  Salary   Bonus
0     Alice         HR   45000  5000.0
1       Bob         IT   55000  6000.0
2   Charlie         IT   60000     NaN
3     David      Sales   70000     NaN
4   Charlie      Sales     NaN  7000.0
5       Eve         HR     NaN  8000.0

Merging on different column names

Up to this point, both DataFrames in our merging examples had matching names for the key column. However, that is not a prerequisite for merging and you can still merge DataFrames with different key column names. To do that, you can specify the name of the key column for each of the DataFrames using left_on and right_on arguments:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


import pandas as pd

# Sample DataFrames with different key column names
df1 = pd.DataFrame({
    'Emp_Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'Sales'],
    'Salary': [45000, 55000, 60000, 70000]
})

df2 = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Eve'],
    'Bonus': [5000, 6000, 7000, 8000]
})

# Merge using different key column names (default inner join)
merged_df = pd.merge(df1, df2, left_on='Emp_Name', right_on='Employee')

print(merged_df)

And here’s the output:

1
2
3
4


   Emp_Name Department  Salary Employee  Bonus
0     Alice         HR   45000    Alice   5000
1       Bob         IT   55000      Bob   6000
2   Charlie         IT   60000  Charlie   7000

The code above demonstrates merging DataFrames with different key column names. The merge method uses the Emp_Name as the key from the left DataFrame and Employee from the right one. The result contains both columns.

Conclusion

In this post, we covered important topics such as sorting and subsetting. We also talked about advanced aggregation techniques, including summary statistics, cumulative statistics, and grouping. Finally, we did a deep dive into merging DataFrames using the merge method as well as different types of joins you can apply.

While we covered a lot of theoretical ground, in order to master and grasp these concepts better, it’s important to practice these operations and techniques on your own.

If you’d like to get more web development, React and TypeScript tips consider following me on Twitter, where I share things as I learn them.

Happy coding!