This post will cover some of the more advanced DataFrames manipulations, such as sorting, subsetting, grouping and joining.
It is assumed that you have a basic knowledge of DataFrames and the pandas library. For a quick introduction to DataFrames, you can read my introductory post .
Let’s get to it!
DataFrame Manipulation Essentials
Sorting and Subsetting Data
Sorting and subsetting are the first tools to consider when starting DataFrame analysis and looking for interesting trends.
Sorting is pretty straightforward. You can sort a DataFrame or a Series by using the sort_values
method:
|
|
Here we are sorting the DataFrame by the Age
column in ascending order, which is the default. DataFrame also supports sorting by multiple columns. To achieve this, provide a list of column names to the by
argument:
|
|
Now the DataFrame will be sorted by the Age
and Salary
columns, in the order they are listed, using multi-level sorting. Notice how the method also supports specifying the sorting direction for each of the columns provided via the ascending
argument.
Subsetting Data
In data analysis, subsetting is just another word for filtering.
The simplest way to subset rows is by using a bracket notation on a DataFrame with a boolean condition within:
|
|
In the code above, we are creating a new DataFrame that is a subset of the original. This DataFrame will contain only the rows from the original DataFrame with the value in the Age
column greater than 10.
For more complex scenarios, you can combine multiple boolean conditions using logical operators. For instance, here’s how to select rows where the age is between 20 and 30:
|
|
You might also have a need to select rows with values contained in a certain list. This can be done using the isin()
method from the
Numpy
package:
|
|
Data Aggregation
Now, let’s cover some of the ways you can aggregate data and calculate various summaries.
Summary statistics
Summary statistics are calculations that give you an overview of a dataset and help you identify its key properties. Some of the common summary statistics are mean, median, mode, standard deviation, and quantiles.
Pandas provides an easy way to quickly get the common summary statistics for a DataFrame using the describe
method:
|
|
Here’s what the code above would produce:
|
|
As you can see, using a single line of code, we were able to get an overview of the main summary statistics for the DataFrame.
However, if describe
doesn’t provide the calculations you need and you need a more custom solution, you can use the agg
method:
|
|
In the example above, we specified a custom list of statistics we need by combining the agg
method with the list of Numpy utility functions. The result is a DataFrame with the calculations of mean, maximum value, minimum value and standard deviation on each of the columns of the original dataset:
|
|
Cumulative statistics
Now let’s move on to cumulative statistics. Cumulative statistics are metrics that provide the running totals of data. They track how data builds up over time.
Pandas provides a way to easily calculate most common cumulative statistics using methods such as cumsum()
, cummax()
cumprod()
and others:
|
|
Here’s what the code above will output:
|
|
And here’s what each of these newly created columns mean:
cumsum()
: Adds up values progressively. Each row contains the sum of all the previous rows up to each point.cumprod()
: Computes the product progressively. Each row contains the product of multiplying all of the previous rows up to each point.cummax()
: Tracks the highest value up to each point.cummin()
: Tracks the lowest value up to each point.
Grouped summary
In data analysis, it is a common practice to analyze data based on categories or subsets within the data. In order to do that, you can first group the data by one or more columns using the groupBy
method and then applying the calculation to the grouped object:
|
|
The code above will produce a Series
object with the following content:
|
|
First, we grouped the data by the Department
column and then calculated the total sum of the salary for each department by calling the sum
method on the Salary
column of each group. All with a single line of code.
Joining Data Frames: An Introduction to Data Merging
Now, let’s discuss another important data manipulation operation: merging. When analysing data, you will often find yourself needing to merge multiple DataFrames or their parts. To do that, Pandas provides the .merge()
method.
Let’s start with a basic example of using the method to merge two DataFrames:
|
|
The code above will produce a new DataFrame that will contain the combined data from the first two:
|
|
Here the merge
method takes two DataFrames and the on
argument. The on
argument specifies the common column from both DataFrames that will be used for merging.
Often times, when merging two DataFrames you will find that they contain columns with overlapping names. So how can you differentiate them? That’s when you can use the suffixes
argument of the merge
method to separate columns with the same name:
|
|
Here’s the output of the code:
|
|
The Department
and Salary
columns are present in both DataFrames. But because we specified the suffixes
argument, these columns were separated and given suffixes at the end of their names in order to differentiate them. Note that the Employee
column was not given a suffix since it is the key column used for merging.
You might’ve noticed that employee record for Eve
from the second DataFrame was omitted from the final result. That’s because the default behavior of the merge
method is to only include rows present in both DataFrames. But do not fret, we will cover modifying this behavior shortly.
The merge
method also allows merging on multiple columns. To do that, all you need to do is provide a list to the on
argument:
|
|
Here’s the result we will get:
|
|
Now the final result contains even fewer rows and Charlie
was omitted. That’s because while both of the DataFrames contain an employee with the name Charlie
, the Department
values for each of them is different.
Inner, Left, Right, and Outer Joins in Pandas
At this point, we’ve covered the basics of merging DataFrames. So I think now is a good time to cover different types of joints you can apply.
Inner join
By default, the merge
method uses inner join for merging. If you have worked with databases before, this might sound familiar. As a quick refresher, inner join is the join operation where the result will contain only the rows with matching values present in both the left and right tables of the merging operation.
Left and Right joins
Another types of joins available are left and right joins.
Left join will keep all of the rows from left DataFrame and only matching rows from right DataFrame. It’s like saying “I want to keep everything from the left data set, regardless of whether there’s a match in the right one.” To specify which type of join you want to use, you can use the how
argument of the merge
method:
|
|
Here’s the output:
|
|
Notice that the output contains a record for David from Sales even though the second DataFrame does not have a matching record. Missing values are handled using NaN
.
Right join is the mirror image of left join. It will keep all of the rows in the right DataFrame, regardless of whether there is a match in the left one.
Outer join
Let’s move on to the last join type we will cover - outer join. Outer join preserves the rows from both DataFrames, regardless of whether there is a match. As mentioned before, Pandas fills the missing values with NaN
:
|
|
Here’s the final output:
|
|
Merging on different column names
Up to this point, both DataFrames in our merging examples had matching names for the key column. However, that is not a prerequisite for merging and you can still merge DataFrames with different key column names. To do that, you can specify the name of the key column for each of the DataFrames using left_on
and right_on
arguments:
|
|
And here’s the output:
|
|
The code above demonstrates merging DataFrames with different key column names. The merge
method uses the Emp_Name
as the key from the left DataFrame and Employee
from the right one. The result contains both columns.
Conclusion
In this post, we covered important topics such as sorting and subsetting. We also talked about advanced aggregation techniques, including summary statistics, cumulative statistics, and grouping. Finally, we did a deep dive into merging DataFrames using the merge
method as well as different types of joins you can apply.
While we covered a lot of theoretical ground, in order to master and grasp these concepts better, it’s important to practice these operations and techniques on your own.
If you’d like to get more web development, React and TypeScript tips consider following me on Twitter, where I share things as I learn them.
Happy coding!