pandas Guide 4: Data Manipulation & Aggregation

4. Data Manipulation & Aggregation

pandas provides tools for transforming columns, summarizing data by group, and combining DataFrames. This section covers each in turn.

4.1. Applying Functions

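Use .apply() to call a function on each element of a Series, or along an axis of a DataFrame (axis=0 passes each column, axis=1 passes each row). Use .map() for element-wise transformation of a Series, typically with a dictionary or a function.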

# Apply a custom function to a column
df['Age_Group'] = df['Age'].apply(lambda x: 'Adult' if x >= 18 else 'Minor')
print(df.head())

# Map values in a column (values with no key in gender_map become NaN)
gender_map = {'M': 'Male', 'F': 'Female'}
df['Gender_Full'] = df['Gender'].map(gender_map)
print(df.head())
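
The snippets above assume an existing df. As a minimal self-contained sketch, the following uses a small hypothetical DataFrame named people (the data and the Gender_Filled and Label columns are illustrative assumptions, not part of the guide's dataset) to show how .apply() and .map() behave, including what happens when a value is missing from the mapping.

```python
import pandas as pd

# Hypothetical sample data; column names mirror those used above.
people = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy'],
    'Age': [34, 15, 18],
    'Gender': ['F', 'M', 'X'],
})

# Series.apply calls the function once per element.
people['Age_Group'] = people['Age'].apply(lambda x: 'Adult' if x >= 18 else 'Minor')

# Series.map with a dict: values that are not keys in the dict become NaN.
gender_map = {'M': 'Male', 'F': 'Female'}
people['Gender_Full'] = people['Gender'].map(gender_map)

# One way to keep the original value where the mapping has no entry.
people['Gender_Filled'] = people['Gender_Full'].fillna(people['Gender'])

# DataFrame.apply with axis=1 passes each row to the function as a Series.
people['Label'] = people.apply(
    lambda row: f"{row['Name']} ({row['Age_Group']})", axis=1
)

print(people)
```

Row-wise .apply() is convenient but runs a Python function per row, so for large DataFrames a vectorized expression is usually faster where one exists.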

4.2. Grouping and Aggregation

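.groupby() splits the rows into groups that share the same value(s) in one or more key columns, so that an aggregation function can be applied to each group separately. Common aggregations include .sum(), .mean(), .count(), .max(), and .min(). Note that .count() counts only non-null values.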

# Group by 'City' and calculate the average 'Age'
avg_age_by_city = df.groupby('City')['Age'].mean()
print(avg_age_by_city)

# Group by multiple columns and get multiple aggregations
agg_data = df.groupby(['City', 'Gender']).agg(
    Total_Count=('Name', 'count'),  # number of non-null 'Name' values per group
    Average_Age=('Age', 'mean')     # mean 'Age' per group
)
print(agg_data)
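
To make the grouping behavior concrete, here is a small sketch on hypothetical data (the people DataFrame, its values, and the extra Rows column are assumptions for illustration). It shows the difference between 'count', which skips missing values, and 'size', which counts every row, and how to turn the grouped index back into ordinary columns.

```python
import pandas as pd

# Hypothetical sample data; note the missing Name in the last row.
people = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy', 'Dee', None],
    'City': ['Oslo', 'Oslo', 'Rome', 'Rome', 'Rome'],
    'Gender': ['F', 'M', 'M', 'F', 'F'],
    'Age': [30, 40, 20, 50, 26],
})

# One key, one aggregation: the result is a Series indexed by City.
avg_age_by_city = people.groupby('City')['Age'].mean()

# Several keys, several named aggregations: the result has a MultiIndex.
agg_data = people.groupby(['City', 'Gender']).agg(
    Total_Count=('Name', 'count'),  # non-null names only
    Rows=('Name', 'size'),          # all rows in the group
    Average_Age=('Age', 'mean'),
)

# reset_index() moves the group keys back into regular columns.
flat = agg_data.reset_index()

print(avg_age_by_city)
print(flat)
```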

4.3. Merging and Concatenating DataFrames

Combine multiple DataFrames using .merge() for SQL-style joins on one or more key columns, or .concat() for stacking DataFrames along rows or columns.

Merging DataFrames

# Assuming 'df' and a 'df_orders' DataFrame both have a 'CustomerID' column
# merged_df = pd.merge(df, df_orders, on='CustomerID', how='inner')
# print(merged_df.head())

# Example with dummy data
import pandas as pd

data1 = {'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']}
data2 = {'ID': [1, 2, 4], 'Score': [90, 85, 95]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Inner join on 'ID': keeps only IDs present in both DataFrames (1 and 2)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
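
The how parameter controls which keys survive the join. As a brief sketch using the same dummy data, the following compares inner, left, and outer joins; the indicator=True option adds a _merge column recording where each row came from. The variable names inner, left, and outer are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [90, 85, 95]})

# inner: only IDs found in both DataFrames.
inner = pd.merge(df1, df2, on='ID', how='inner')

# left: every row of df1; Score is NaN where df2 has no matching ID.
left = pd.merge(df1, df2, on='ID', how='left')

# outer: every ID from either side; _merge shows the source of each row.
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

Because a left or outer join introduces NaN into Score, pandas converts that column from integer to float.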

Concatenating DataFrames

# Assuming 'df_new_entries' DataFrame with same columns as 'df'
# combined_df = pd.concat([df, df_new_entries], ignore_index=True)
# print(combined_df.tail())

# Example with dummy data
df_part1 = pd.DataFrame({'Col1': [1, 2], 'Col2': ['A', 'B']})
df_part2 = pd.DataFrame({'Col1': [3, 4], 'Col2': ['C', 'D']})
concatenated_df = pd.concat([df_part1, df_part2], ignore_index=True)
print(concatenated_df)
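
A few variations on .concat() are worth seeing side by side. The sketch below reuses the dummy DataFrames and adds a hypothetical df_extra with a different column set (an assumption for illustration) to show what ignore_index changes, how mismatched columns are handled, and how axis=1 places DataFrames next to each other.

```python
import pandas as pd

df_part1 = pd.DataFrame({'Col1': [1, 2], 'Col2': ['A', 'B']})
df_part2 = pd.DataFrame({'Col1': [3, 4], 'Col2': ['C', 'D']})

# Without ignore_index, the original row labels are kept (0, 1, 0, 1).
kept = pd.concat([df_part1, df_part2])

# With ignore_index=True, rows are relabeled 0..n-1.
renumbered = pd.concat([df_part1, df_part2], ignore_index=True)

# Hypothetical DataFrame with a different column set.
df_extra = pd.DataFrame({'Col1': [5], 'Col3': [True]})

# Columns are unioned; cells with no source value are filled with NaN.
mixed = pd.concat([df_part1, df_extra], ignore_index=True)

# axis=1 aligns on the row index and places the DataFrames side by side.
side_by_side = pd.concat([df_part1, df_part2.add_suffix('_b')], axis=1)

print(kept)
print(mixed)
print(side_by_side)
```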