Pandas is a powerful tool for data analysis in Python, but when working with large datasets, performance can become an issue. Here are some tips and sample code to help you optimize your Pandas code for better speed and efficiency.
import pandas as pd
import numpy as np

# Create a random DataFrame
df = pd.DataFrame({
    'float_column': np.random.rand(1000),
    'int_column': np.random.randint(0, 1000, size=1000)
})
Data Type Optimization
Optimizing data types can significantly reduce memory usage and improve performance.
# Assume df is your DataFrame
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
float_column    1000 non-null float64
int_column      1000 non-null int64
dtypes: float64(1), int64(1)
memory usage: 15.7 KB
>>> df['float_column'] = df['float_column'].astype('float32')
>>> df['int_column'] = pd.to_numeric(df['int_column'], downcast='integer')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
float_column    1000 non-null float32
int_column      1000 non-null int16
dtypes: float32(1), int16(1)
memory usage: 5.9 KB  # memory optimization
More on data type optimization
Use category for string columns with a limited set of values.
Convert object types to more specific types like int, float, or datetime when possible.
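To illustrate both conversions, here is a small sketch; the `city` and `joined` columns and their values are made up for the example:

```python
import pandas as pd

# A string column with few distinct values wastes memory as plain object dtype
df = pd.DataFrame({
    'city': ['Paris', 'London', 'Tokyo', 'Paris'] * 250,
    'joined': ['2021-01-15'] * 1000,
})

df['city'] = df['city'].astype('category')   # limited set of values -> category
df['joined'] = pd.to_datetime(df['joined'])  # object strings -> datetime64

print(df.dtypes)
```

Categorical columns store each distinct string once plus small integer codes, so the savings grow with the number of repeated values.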
Removing Unnecessary Data
Eliminate columns and rows that are not needed for your analysis to reduce the size of the dataset.
# CPU times: user 14 s
df = (df.rank(axis=1, pct=True, method='max') >= 0.95).astype(float)

# CPU times: user 0.5 s
df = (
    df.dropna(how="all", axis=1)
      .dropna(how="all", axis=0)
      .rank(axis=1, pct=True, method='max') >= 0.95
).astype(float)
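When the data comes from a CSV, another option is to drop unneeded columns at load time with `read_csv`'s `usecols` parameter, so they are never read into memory at all. A minimal sketch, using an in-memory CSV with made-up column names:

```python
import io
import pandas as pd

# Simulate a file with more columns than the analysis needs
csv_data = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8\n")

# Load only the columns required for the analysis
df = pd.read_csv(csv_data, usecols=['a', 'c'])
print(df.columns.tolist())  # ['a', 'c']
```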
Vectorized Operations
Vectorized operations are much faster than applying functions in a loop. Here's an example of a vectorized operation compared to its loop-based counterpart:
# Vectorized operation using Pandas built-in arithmetic
df['discounted_price'] = df['price'] * (1 - df['discount_rate'])

# Loop-based operation (slower)
df['discounted_price'] = 0
for i in range(len(df)):
    df['discounted_price'].iat[i] = df['price'].iat[i] * (1 - df['discount_rate'].iat[i])
Some examples of vectorized operations in Pandas include functions like mean(), sum(), std(), and many others that are applied directly to Pandas Series or DataFrame objects without the need for explicit loops.
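For instance, each of these reductions runs over the whole Series in one call, with no Python-level loop (the numbers are just small example data):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 6))  # 1, 2, 3, 4, 5

print(s.sum())   # 15
print(s.mean())  # 3.0
print(s.std())   # sample standard deviation
```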
Function Optimization
Avoid using apply() when possible as it can be slower than vectorized operations.
# Vectorized np.log vs map() vs apply()
# CPU times: user 742 µs
df['log_column'] = np.log(df['int_column'])

# CPU times: user 1.63 ms
df['log_column'] = df['int_column'].map(lambda x: np.log(x))

# CPU times: user 1.77 ms
df['log_column'] = df['int_column'].apply(lambda x: np.log(x))
Processing in Chunks
For very large datasets, process data in chunks.
chunk_size = 10000
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
for chunk in chunks:
    # Process each chunk here (e.g. filter or aggregate),
    # keeping only the per-chunk result in memory
    pass
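As a concrete sketch of the pattern, each iteration keeps only a small per-chunk result and the results are combined at the end; an in-memory CSV with a made-up `value` column stands in for `large_file.csv`:

```python
import io
import pandas as pd

# Stand-in for a large on-disk file
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(100)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=10):
    # Keep only the per-chunk aggregate, not the chunk itself
    total += chunk['value'].sum()

print(total)  # 4950, the sum of 0..99
```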
By following these tips and using the sample code provided, you can make your Pandas code run faster and handle larger datasets more efficiently.