3. Data Cleaning & Preprocessing
Real-world data is often messy. Data cleaning is a crucial step to ensure the quality and accuracy of your analysis.
3.1. Handling Missing Values
Missing values (NaN) are common. pandas provides methods to identify, fill, or drop them.
df.isna().sum(): Count missing values per column.
df.fillna(value): Fill missing values with a specified value (e.g., 0, mean, median, mode).
df.dropna(): Remove rows or columns with missing values.
# Check for missing values
print(df.isna().sum())
# Fill missing 'Age' values with the mean
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Drop rows where 'City' is missing
df.dropna(subset=['City'], inplace=True)
print(df.isna().sum())
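To try the steps above end to end, here is a minimal sketch on a small made-up DataFrame. The sample rows and the names missing_before and missing_after are assumptions for illustration only; the column names simply mirror the ones used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing 'Age' and one missing 'City'.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [30.0, np.nan, 40.0, 50.0],
    "City": ["Lima", "Oslo", None, "Pune"],
})

# Count missing values per column before cleaning.
missing_before = df.isna().sum()

# Fill the missing 'Age' with the column mean and assign the result back.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Drop rows that have no 'City'.
df = df.dropna(subset=["City"])

# Count again to confirm nothing is missing.
missing_after = df.isna().sum()

print(missing_before)
print(missing_after)
```

Whether to fill or drop depends on the column: filling keeps the row but introduces an estimated value, while dropping loses the row entirely.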
3.2. Changing Data Types
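Columns are often loaded with the wrong data type, for example numbers or dates stored as strings. Convert them to appropriate types so that numeric and date operations work as expected.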
# Convert 'Age' to integer type
# (raises an error if missing values remain; any fractional part is truncated)
df['Age'] = df['Age'].astype(int)
# Convert 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'])
df.info()  # info() prints its summary directly and returns None, so no print() is needed
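The conversions above raise an error when a value cannot be parsed. One possible way to handle dirty input, sketched here on made-up data, is to pass errors="coerce" so that unparseable entries become missing values, and to use the nullable "Int64" dtype so an integer column can still hold them. The DataFrame raw and its contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw data: everything arrives as strings, some of it invalid.
raw = pd.DataFrame({
    "Age": ["25", "31", "unknown"],
    "Date": ["2023-01-15", "2023-02-30", "2023-03-01"],  # Feb 30 does not exist
})

# errors="coerce" replaces unparseable values with NaN instead of raising.
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

# The same option on to_datetime yields NaT for invalid dates.
raw["Date"] = pd.to_datetime(raw["Date"], format="%Y-%m-%d", errors="coerce")

# The nullable integer dtype keeps whole numbers while allowing missing entries.
raw["Age"] = raw["Age"].astype("Int64")

print(raw.dtypes)
print(raw)
```

Coercion hides bad values rather than fixing them, so it is worth checking raw.isna().sum() afterwards to see how many entries failed to parse.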
3.3. Removing Duplicates
Duplicate rows can skew counts and averages. Use .drop_duplicates() to remove them; by default it keeps the first occurrence of each duplicate.
# Remove duplicate rows based on all columns
df.drop_duplicates(inplace=True)
# Remove duplicates based on specific columns
df.drop_duplicates(subset=['Name', 'Age'], inplace=True)
print(df.shape)
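Before dropping anything, it can help to count how many rows would be affected. The sketch below uses .duplicated() for that, then compares keep="first" with keep="last". The people DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0 exactly; row 3 repeats only Name and Age.
people = pd.DataFrame({
    "Name": ["Ana", "Ana", "Ben", "Ana"],
    "Age": [30, 30, 25, 30],
    "City": ["Lima", "Lima", "Oslo", "Quito"],
})

# Count rows that repeat an earlier row across all columns.
full_dupes = people.duplicated().sum()

# Count rows that repeat an earlier row on the chosen key columns only.
key_dupes = people.duplicated(subset=["Name", "Age"]).sum()

# keep controls which occurrence survives when duplicates are dropped.
deduped_first = people.drop_duplicates(subset=["Name", "Age"], keep="first")
deduped_last = people.drop_duplicates(subset=["Name", "Age"], keep="last")

print(full_dupes, key_dupes)
print(deduped_first)
print(deduped_last)
```

Deduplicating on a subset discards the differing values in the other columns (here, one of Ana's cities), so the choice of keep matters.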
3.4. Renaming Columns
Rename columns for clarity and consistency.
df.rename(columns={'old_name': 'new_name', 'Another Old': 'Another New'}, inplace=True)
print(df.columns)
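When many columns need the same cleanup, renaming them one by one gets tedious. A possible alternative, sketched below with hypothetical column names, is to normalize all names at once through the string methods on df.columns. The sketch also shows that rename() silently ignores keys that match no column unless errors="raise" is passed.

```python
import pandas as pd

# Hypothetical frame with inconsistent column names (stray spaces, mixed case).
messy = pd.DataFrame(columns=[" First Name", "Age ", "Home City"])

# Strip whitespace, lowercase, and replace inner spaces with underscores.
messy.columns = (
    messy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# Rename a single column; rename() returns a new DataFrame.
renamed = messy.rename(columns={"home_city": "city"})

# By default a key that matches no column is ignored; errors="raise" surfaces it.
try:
    messy.rename(columns={"no_such_column": "x"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns))
print(typo_detected)
```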