Optimizing Pandas Performance

Pandas is a powerful tool for data analysis in Python, but when working with large datasets, performance can become an issue. Here are some tips and sample code to help you optimize your Pandas code for better speed and efficiency.

import pandas as pd
import numpy as np

# Creating a random DataFrame
df = pd.DataFrame({
    'float_column': np.random.rand(1000),
    'int_column': np.random.randint(0, 1000, size=(1000))
})

Data Type Optimization

Optimizing data types can significantly reduce memory usage and improve performance.

# Assume df is your DataFrame
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
float_column    1000 non-null float64
int_column      1000 non-null int64
dtypes: float64(1), int64(1)
memory usage: 15.7 KB

>>> df['float_column'] = df['float_column'].astype('float32')
>>> df['int_column'] = pd.to_numeric(df['int_column'], downcast='integer')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
float_column    1000 non-null float32
int_column      1000 non-null int16
dtypes: float32(1), int16(1)
memory usage: 5.9 KB

More on data type optimization:
  • Use category for string columns with a limited set of values.

  • Convert object types to more specific types like int, float, or datetime when possible.
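The category suggestion above can be sketched as follows. This is a minimal illustration with a made-up column of repeated labels; the exact savings depend on string length and cardinality:

```python
import pandas as pd
import numpy as np

# Hypothetical column with a limited set of repeated string values
s = pd.Series(np.random.choice(['low', 'medium', 'high'], size=1000))

mem_object = s.memory_usage(deep=True)

# category stores small integer codes plus one lookup table of unique labels
s_cat = s.astype('category')
mem_category = s_cat.memory_usage(deep=True)

print(mem_category < mem_object)  # True: far less memory for repeated labels
```

The fewer distinct values a column has relative to its length, the bigger the win.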

Removing Unnecessary Data

Eliminate columns and rows that are not needed for your analysis to reduce the size of the dataset.

# CPU times: user 14 s
df = (df.rank(axis=1, pct=True, method='max') >= 0.95).astype(float)

# CPU times: user 0.5 s
df = (
    df.dropna(how="all", axis=1)
      .dropna(how="all", axis=0)
      .rank(axis=1, pct=True, method='max') >= 0.95
).astype(float)
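To see what the dropna calls above remove, here is a small self-contained sketch with a toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with one entirely empty column and one empty row
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, np.nan],  # all-NaN column
    'c': [4.0, np.nan, 6.0],        # row 1 is all-NaN across columns
})

# Drop columns, then rows, that contain only missing values
trimmed = df.dropna(how='all', axis=1).dropna(how='all', axis=0)

print(trimmed.shape)  # (2, 2): column 'b' and row 1 are gone
```

Shrinking the frame before an expensive operation like rank() means less data for that operation to touch.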

Vectorized Operations

Vectorized operations are much faster than applying functions in a loop. Here's an example of a vectorized operation compared to its loop-based counterpart:

# Vectorized operation using Pandas built-in function
df['discounted_price'] = df['price'] * (1 - df['discount_rate'])

# Loop-based operation (slower)
df['discounted_price'] = 0.0  # float, so element-wise writes keep the dtype
for i in range(len(df)):
    df.at[i, 'discounted_price'] = df['price'].iat[i] * (1 - df['discount_rate'].iat[i])

Some examples of vectorized operations in Pandas include functions like mean(), sum(), std(), and many others that are applied directly to Pandas Series or DataFrame objects without the need for explicit loops.
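Conditional logic can also be vectorized instead of looped, for example with NumPy's where(). A minimal sketch, using a made-up price column:

```python
import pandas as pd
import numpy as np

# Hypothetical price data
sales = pd.DataFrame({'price': [10.0, 250.0, 40.0, 900.0]})

# Vectorized if/else over the whole column, no Python-level loop
sales['tier'] = np.where(sales['price'] > 100, 'premium', 'standard')

print(sales['tier'].tolist())  # ['standard', 'premium', 'standard', 'premium']
```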

Function Optimization

Avoid apply() where possible; it runs a Python function once per element, which is usually slower than an equivalent vectorized operation.

# Vectorized NumPy function applied to the whole column (fastest)
# CPU times: user 742 µs
df['log_column'] = np.log(df['int_column'])

# CPU times: user 1.63 ms
df['log_column'] = df['int_column'].map(lambda x: np.log(x))

# CPU times: user 1.77 ms
df['log_column'] = df['int_column'].apply(lambda x: np.log(x))
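The same pattern applies to string columns: Pandas' built-in .str accessor typically replaces an apply() with a lambda. A small sketch with a made-up Series:

```python
import pandas as pd

# Hypothetical string column
s = pd.Series(['alpha', 'beta', 'gamma'])

# apply() with a Python lambda: one Python call per element
upper_apply = s.apply(lambda x: x.upper())

# Built-in .str accessor: same result, usually faster
upper_vec = s.str.upper()

print(upper_vec.equals(upper_apply))  # True
```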

Processing in Chunks

For very large datasets, process data in chunks.

chunk_size = 10000
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

for chunk in chunks:
    # Process each chunk here (placeholder)
    pass
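A common per-chunk pattern is to keep a running aggregate so the full file never sits in memory at once. The sketch below simulates the CSV with an in-memory buffer; real code would pass a file path such as 'large_file.csv':

```python
import io
import pandas as pd

# Simulated "large" CSV: a single 'value' column with numbers 0..99
csv_data = io.StringIO('value\n' + '\n'.join(str(i) for i in range(100)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=25):
    # Aggregate each chunk and fold it into a running total
    total += chunk['value'].sum()

print(total)  # 4950, the same as summing the whole file at once
```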

By following these tips and using the sample code provided, you can make your Pandas code run faster and handle larger datasets more efficiently.
