3. Data Cleaning & Preprocessing

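Real-world data is often messy: values go missing, columns arrive with the wrong type, rows are duplicated, and column names are inconsistent. Cleaning the data before analysis keeps these problems from distorting your results.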

3.1. Handling Missing Values

Missing values (shown as NaN) are common. pandas provides isna() to identify them, fillna() to fill them, and dropna() to drop the rows that contain them.

# Check for missing values
print(df.isna().sum())

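# Fill missing 'Age' values with the mean.
# Assign the result back: fillna(inplace=True) on a selected column may not update df in newer pandas.
df['Age'] = df['Age'].fillna(df['Age'].mean())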

# Drop rows where 'City' is missing
df.dropna(subset=['City'], inplace=True)
print(df.isna().sum())
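
The steps above can be tried end to end on a small table. This is a minimal sketch, assuming a made-up DataFrame whose names and values are purely illustrative; only the 'Age' and 'City' column names come from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Age": [30.0, np.nan, 40.0, 20.0],
    "City": ["Lima", "Oslo", None, "Pune"],
})

# Count missing values per column before cleaning.
missing_before = df.isna().sum()

# Fill missing ages with the column mean (assignment form, no inplace).
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Drop rows that have no city, then renumber the index from 0.
df = df.dropna(subset=["City"]).reset_index(drop=True)

# Count missing values again to confirm the cleanup.
missing_after = df.isna().sum()

print(missing_before)
print(missing_after)
```

Filling with the mean is only one option. A median or a fixed placeholder may suit skewed or categorical columns better, so treat the choice here as an assumption of the example.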

3.2. Changing Data Types

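Make sure each column has a data type that suits the analysis: numbers stored as text cannot be averaged, and dates stored as text do not support date arithmetic.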

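# Convert 'Age' to integer type.
# This raises an error if any NaN remain, so fill or drop them first; fractional values are truncated.
df['Age'] = df['Age'].astype(int)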

# Convert 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'])
df.info()  # info() prints its summary directly and returns None
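
Conversions like the ones above fail when a column contains entries that cannot be parsed. The following sketch shows one way to handle that, assuming a small made-up table in which every value arrives as text. It uses errors="coerce" to turn bad entries into missing values and the nullable "Int64" dtype to keep integers alongside them.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a CSV: everything is text.
df = pd.DataFrame({
    "Age": ["34", "27", "unknown"],
    "Date": ["2023-01-15", "2023-02-01", "not a date"],
})

# errors="coerce" turns unparseable entries into missing values instead of raising.
age_numeric = pd.to_numeric(df["Age"], errors="coerce")

# The nullable "Int64" dtype stores integers alongside missing values (<NA>).
df["Age"] = age_numeric.astype("Int64")

# Unparseable dates become NaT (the datetime equivalent of NaN).
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

print(df.dtypes)
```

Coercion hides bad input silently, so it is worth checking df.isna().sum() afterwards to see how many values were lost.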

3.3. Removing Duplicates

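Duplicate rows can skew analysis, for example by inflating counts and averages. Use .drop_duplicates() to remove them; by default the first occurrence of each duplicated row is kept.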

# Remove duplicate rows based on all columns
df.drop_duplicates(inplace=True)

# Remove duplicates based on specific columns
df.drop_duplicates(subset=['Name', 'Age'], inplace=True)
print(df.shape)
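
Before dropping anything, it can help to see which rows pandas treats as duplicates. The sketch below, built on made-up sample data, uses duplicated() to flag repeats and the keep parameter of drop_duplicates() to choose which occurrence survives.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "Name": ["Ana", "Ana", "Ben", "Ana"],
    "Age": [30, 30, 25, 30],
    "City": ["Lima", "Lima", "Oslo", "Quito"],
})

# duplicated() marks every repeat of an earlier row as True.
full_dupes = df.duplicated()

# Row 1 repeats row 0 in every column, so it is the only row dropped here.
deduped_all = df.drop_duplicates()

# Match on 'Name' and 'Age' only; keep="last" retains the final occurrence.
deduped_subset = df.drop_duplicates(subset=["Name", "Age"], keep="last")

print(full_dupes.sum(), deduped_all.shape, deduped_subset.shape)
```

Note that deduplicating on a subset discards rows that differ in other columns (here, two different cities for the same name and age), so the subset should identify a record uniquely.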

3.4. Renaming Columns

Rename columns for clarity and consistency by passing rename() a dictionary that maps each old name to its new name.

df.rename(columns={'old_name': 'new_name', 'Another Old': 'Another New'}, inplace=True)
print(df.columns)
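
When many columns need the same kind of fix, renaming them one by one becomes tedious. As a rough sketch, assuming made-up column names with stray spaces and mixed case, the headers can be normalized in one pass with the string methods on df.columns and then adjusted individually with rename().

```python
import pandas as pd

# Hypothetical table with inconsistent headers (stray spaces, mixed case).
df = pd.DataFrame({" First Name": ["Ana"], "Age ": [30], "Home City": ["Lima"]})

# Normalize every header: trim whitespace, lowercase, spaces to underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# rename() returns a new DataFrame; keys not found in the columns are ignored by default.
df = df.rename(columns={"home_city": "city"})

print(list(df.columns))
```

Because rename() ignores keys it cannot find, a typo in an old name passes silently; printing the columns afterwards, as above, is a simple check.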