Intro to DataFrames with Pandas

This tutorial is a brief introduction to the DataFrame, a powerful data structure from the Pandas library designed for efficiently handling and analyzing tabular data.

Why Pandas Is Essential for Data Analysis

Pandas is one of the most popular Python libraries for scientific computations. It is designed for efficient data manipulation, especially when working with large and complex datasets.

The most commonly used data structure in Pandas is the DataFrame, which is well suited to analyzing and computing on tabular data.

Here are some of the benefits of using DataFrame:

DataFrame’s simplified syntax allows for easy cleaning, transformation, and exploration of data.
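DataFrame supports heterogeneous data types: each column has its own type, so you can mix strings, numbers, booleans, and other types within the same table.
DataFrame is compatible with other Python libraries like NumPy and Matplotlib, which enhances its utility in scientific computing and visualization.
Many DataFrame operations are vectorized and run on NumPy arrays under the hood, so they are usually much faster than equivalent loops in plain Python.
DataFrame represents missing values explicitly (as NaN, for example) and provides methods such as isna(), fillna(), and dropna() for dealing with them.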
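
As a rough illustration of the points about mixed types and missing data, here is a minimal sketch. The people table and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three column types and one missing value
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],   # strings
    "age": [25, np.nan, 35],               # numbers, one of them missing
    "active": [True, False, True],         # booleans
})

print(people.dtypes)          # each column keeps its own dtype
print(people["age"].isna())   # flags the missing entry
print(people["age"].mean())   # missing values are skipped by default

# Fill the gap with the column mean; fillna returns a new DataFrame
filled = people.fillna({"age": people["age"].mean()})
print(filled)
```

Filling with the mean is only one possible strategy; dropna() would remove the incomplete row instead.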

Popular use cases for DataFrame are exploratory data analysis tasks such as data wrangling, preprocessing, and massaging of data for machine learning.

The Basic Structure of DataFrames

Now, let’s take a closer look at the basic structure of DataFrames. At a high level, a DataFrame is made up of three components, each available as an attribute of a DataFrame object:

values: A two-dimensional NumPy array of values.
columns: An index for storing column names.
index: An index for storing either row numbers or row labels.

Unlike a typical NumPy array, which has a single data type for all of its elements, a DataFrame can give each column its own type. This makes DataFrames much more versatile and better suited to real-world scenarios where you need to analyze and manipulate complex data sets.
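
To make the three components concrete, the sketch below builds a small DataFrame and prints each attribute. The sample names and ages are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

print(df.columns)  # column labels
print(df.index)    # row labels, a RangeIndex when none is given
print(df.values)   # two-dimensional NumPy array of the data
print(df.dtypes)   # one data type per column
```

Because the columns here have different types, the array returned by values falls back to the generic object dtype.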

Series

When working with DataFrames, you will inevitably come across Series as well. A Series is another data structure provided by Pandas. It is a one-dimensional array-like object that can hold any data type. It is essentially a column in a DataFrame, with the following key characteristics:

Index. Just like DataFrame, a Series has an index which acts like row labels, and can be customized.
Homogeneous Data. A Series typically holds homogeneous data, meaning all the values in the Series are of the same data type.
Flexible Data Types. That single data type can be integer, float, string, or the generic object type, which lets a Series hold arbitrary Python objects.
Similar to a NumPy Array. Series is built on top of NumPy, so it inherits many NumPy-like properties and operations.

Here is how you would create a Series from a list.

import pandas as pd
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

In the example above, we are initializing a new Series and providing both the values and the customized index.
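
Here is a short sketch of what you can do with such a Series, and of how a DataFrame column relates to it. The small df at the end is a made-up example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

print(s["b"])    # look up a value by its label
print(s * 10)    # arithmetic is applied element by element

# Selecting one column from a DataFrame gives back a Series
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
ages = df["Age"]
print(type(ages))
```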

Initializing DataFrames

There are several ways to create DataFrames. The two most common methods are using Python data structures like dictionaries and importing data from external files such as CSVs.

Let’s take a look at some examples. Here’s how you would create a DataFrame from a dictionary:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

In this example, keys represent column names and values are lists containing column data.

However, a simple dictionary like that isn’t the only data structure you can use to initialize a DataFrame. You have several other options at your disposal, including lists of lists, dictionaries of dictionaries, and even NumPy arrays.
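
The alternatives mentioned above could look roughly like this. All variable names, row labels, and values are hypothetical.

```python
import numpy as np
import pandas as pd

# List of lists: each inner list is a row; column names are passed separately
from_rows = pd.DataFrame([["Alice", 25], ["Bob", 30]], columns=["Name", "Age"])

# Dictionary of dictionaries: outer keys become columns, inner keys become row labels
from_nested = pd.DataFrame({
    "Name": {"r1": "Alice", "r2": "Bob"},
    "Age": {"r1": 25, "r2": 30},
})

# NumPy array: a 2-D array maps directly onto rows and columns
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

print(from_rows)
print(from_nested)
print(from_array)
```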

But a more realistic scenario is to import a large amount of data from a file such as a CSV. Here’s how you would do that using the read_csv function:

import pandas as pd

# Assuming you have a CSV file named 'data.csv' in the same directory
df = pd.read_csv('data.csv')

read_csv also accepts a number of parameters that control how the DataFrame is built. For example, you can use the usecols parameter to load only the columns you’re interested in, or the names parameter to define your own column names. For the full list of parameters, check the official documentation.
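
The following sketch shows usecols and names in action. To keep it self-contained, it reads from an in-memory string instead of a file on disk; the CSV contents are made up.

```python
import io
import pandas as pd

# Stand-in for a file on disk (hypothetical contents)
csv_text = "name,age,city\nAlice,25,Paris\nBob,30,Berlin\n"

# Load only two of the three columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=["name", "age"])

# Replace the existing header row with custom column names
renamed = pd.read_csv(
    io.StringIO(csv_text),
    header=0,                        # row 0 is the header being replaced
    names=["Name", "Age", "City"],
)

print(subset)
print(renamed)
```

With a real file, you would pass its path (for example 'data.csv') in place of the StringIO object.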

Common Utility Methods Provided by DataFrames

As we discussed earlier, one of the major advantages of using DataFrame is the variety of utility methods it offers. Let’s explore some of the most popular ones that help us understand the structure of data within a DataFrame:

.head() method returns the first few rows of the DataFrame (five by default). This is typically my go-to method for taking a first quick look at the data I’m working with.
.info() method prints a summary of the DataFrame: the column names, the data type of each column, and the number of non-null values, which shows whether any values are missing.
.shape is an attribute that returns the number of rows and columns.
.describe() method returns summary statistics for the numeric columns, such as count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median).
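
A quick sketch of these four on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

print(df.head(2))        # first two rows
df.info()                # prints its summary directly and returns None
print(df.shape)          # (3, 2)

stats = df.describe()    # covers only the numeric column here
print(stats)
```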

Accessing Data in DataFrame

Now let’s cover accessing data in a DataFrame. Pandas makes it extremely easy to access and manipulate a DataFrame. The syntax is not that different from how you would work with data in Python dictionaries.

Here’s how you would access a single column in a DataFrame.

# Single brackets - returns a Series
df['Name']

Using single brackets [] returns a Series object. If you want to get a DataFrame instead, you need to use double brackets.

# Double brackets - returns a DataFrame
df[['Name']]

You can access multiple columns by specifying column names in a list.

# Access multiple columns
print(df[['Name', 'Age']])

To select specific rows, however, you need to use slicing on DataFrames.

Slicing in DataFrames

Slicing in DataFrames refers to selecting a subset of rows based on their start and end indices. Pandas makes row slicing incredibly straightforward, generally following Python’s slicing syntax.

df[1:3]  # Slices rows from index 1 to 2

The code above selects the rows at positions 1 and 2. Note that the start position is inclusive while the end position is exclusive, meaning the row at the end position will not be included in the final result.
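
To see the exclusive end in practice, here is a small sketch. The four-row table, including the extra name Dana, is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 40],
})

middle = df[1:3]      # rows at positions 1 and 2; position 3 is left out
first_two = df[:2]    # an omitted start means "from the beginning"
last = df[-1:]        # negative positions count from the end

print(middle)
print(first_two)
print(last)
```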

For more advanced slicing operations, DataFrame provides two indexers:

.loc allows selecting rows and columns in a DataFrame by their labels, rather than by numerical position. It is particularly useful when the DataFrame is indexed by a non-numerical column, such as dates or strings. Unlike positional slicing, a .loc slice includes both the start and the end label. Note, however, that to successfully slice a DataFrame by an index that contains dates, you need to make sure the index is sorted to avoid unexpected results.
.iloc allows selecting rows and columns by their numerical position, regardless of the label values.

Here’s how you would use them in your code.

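# Assumes df is indexed by sorted dates (a DatetimeIndex)
# Selecting rows between two date labels (both endpoints are included)
df.loc['2023-09-02':'2023-09-04']

# Using iloc to select rows by position (the end position is excluded)
df.iloc[0:2]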
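
The date-based example assumes a DataFrame whose index holds dates. A self-contained sketch with such an index might look like this; the sales table and its units column are hypothetical.

```python
import pandas as pd

# Five consecutive days used as the row labels
dates = pd.date_range("2023-09-01", periods=5, freq="D")
sales = pd.DataFrame({"units": [10, 12, 9, 15, 11]}, index=dates)

# Label-based slice: both endpoints are included
by_label = sales.loc["2023-09-02":"2023-09-04"]

# Position-based slice: the end position is excluded
by_position = sales.iloc[0:2]

# Both indexers also accept a row and a column to pick out a single cell
value = sales.loc[pd.Timestamp("2023-09-03"), "units"]
cell = sales.iloc[0, 0]

print(by_label)
print(by_position)
print(value, cell)
```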

Conclusion

In conclusion, the DataFrame provided by Pandas is a powerful tool for analyzing and working with large sets of tabular data.

In this post, we only scratched the surface of what you can do with DataFrames. But hopefully it gave you a basic understanding of the data structure and its capabilities.

Happy coding!