Monday, 30 June 2025

Data Preprocessing Techniques

1. Handling Missing Values

Why it matters: Missing data can lead to inaccurate models or errors during training.

Techniques:

  • Remove rows/columns with missing values
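  • Impute values using mean, median, mode, or advanced methods

import pandas as pd

# Toy data with one missing value per column
df = pd.DataFrame({
    'Age': [25, 30, None, 40],
    'Salary': [50000, 60000, 52000, None]
})

# Assign the result back: fillna(inplace=True) on a selected column is chained
# assignment, which warns in recent pandas and does not update df under Copy-on-Write
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())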
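
The first technique in the list, removing rows or columns, can be sketched with dropna(). The snippet below is a minimal illustration, not a recommended pipeline: the extra 'Years' column is a hypothetical addition so that one column survives the column-wise drop, and scikit-learn's SimpleImputer is shown as one possible alternative to fillna().

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# 'Years' is a hypothetical, fully populated column added for illustration
df = pd.DataFrame({
    'Age': [25, 30, None, 40],
    'Salary': [50000, 60000, 52000, None],
    'Years': [1, 3, 5, 7]
})

# Drop every row that contains at least one missing value
rows_dropped = df.dropna()

# Drop every column that contains at least one missing value
cols_dropped = df.dropna(axis=1)

# Impute with the column median instead of dropping anything
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Dropping is simplest but discards data (here half of the rows), so imputation is often preferred when missing values are spread across many rows.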

2. Handling Outliers

Why it matters: Outliers can skew results and harm model performance.

Techniques: Z-score, IQR, or data transformation

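import numpy as np
from scipy import stats

data = np.array([10, 12, 14, 15, 13, 1000])
z_scores = np.abs(stats.zscore(data))

# With only 6 points, |z| cannot exceed sqrt(n - 1), about 2.24, so the usual
# cutoff of 3 would keep the outlier; a cutoff of 2 is used for this small sample
filtered_data = data[z_scores < 2]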
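
The IQR and transformation techniques named above can be sketched on the same array. This is a minimal example assuming the conventional 1.5 × IQR fences and a log transform via np.log1p; both are common defaults chosen here for illustration rather than fixed rules.

```python
import numpy as np

data = np.array([10, 12, 14, 15, 13, 1000])

# Quartiles and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Keep only points within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_filtered = data[(data >= lower) & (data <= upper)]

# Alternative: compress the range with a log transform instead of dropping points
log_data = np.log1p(data)
```

Unlike the Z-score, the IQR fences are computed from quartiles, so a single extreme value has little effect on the cutoffs themselves.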

3. Normalization / Scaling

Why it matters: Puts features on comparable scales, which matters for models that rely on distance calculations (e.g. k-NN, k-means) or gradient-based optimization.

Techniques: Min-Max Scaling, Z-score Standardization

from sklearn.preprocessing import MinMaxScaler

data = [[100], [200], [300]]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
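
The second technique, Z-score standardization, can be sketched the same way with scikit-learn's StandardScaler, which rescales each feature to zero mean and unit variance. The variable name `standardized` is just an illustrative choice.

```python
from sklearn.preprocessing import StandardScaler

data = [[100], [200], [300]]

# Subtract the column mean and divide by the column standard deviation
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
```

In a real workflow the scaler is usually fit on the training split only and then reused with transform() on the test split, so that test data does not influence the scaling parameters.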

4. Encoding Categorical Variables

Why it matters: Most ML models require numerical input, so text categories must be converted to numbers first.

Techniques: Label Encoding (ordinal), One-Hot Encoding (nominal)

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
df_encoded = pd.get_dummies(df, columns=['Color'])
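
For ordinal categories, where the order carries meaning, a sketch using scikit-learn's OrdinalEncoder is shown below. The 'Size' column and its ordering are hypothetical examples; the order is passed explicitly because the encoder would otherwise sort the categories alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: Small < Medium < Large
df = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# State the category order explicitly so the integer codes reflect it
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_encoded'] = encoder.fit_transform(df[['Size']]).ravel()
```

scikit-learn's LabelEncoder produces similar integer codes, but its documentation describes it as intended for target labels rather than input features, which is why OrdinalEncoder is used in this sketch.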

5. Feature Extraction

Why it matters: Derived features can expose patterns and improve accuracy.

Techniques: Use domain knowledge or extract from text, date, image, etc.

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2024-01-01', '2024-06-15'])})
df['Month'] = df['Date'].dt.month      # 1 = January
df['Weekday'] = df['Date'].dt.weekday  # 0 = Monday, 6 = Sunday

Summary Table

Technique                 | Goal                        | Example Tool
--------------------------|-----------------------------|-----------------------------
Handling Missing Values   | Fill/remove gaps in data    | fillna(), Imputation
Handling Outliers         | Reduce skewed data          | Z-score, IQR
Normalization/Scaling     | Equalize feature ranges     | MinMaxScaler, StandardScaler
Encoding Categorical Vars | Convert text to numbers     | get_dummies(), LabelEncoder
Feature Extraction        | Create informative features | TF-IDF, .dt.month etc.
