1. Handling Missing Values
Why it matters: Missing data can lead to inaccurate models or errors during training.
Techniques:
- Remove rows/columns with missing values
- Impute values using mean, median, mode, or advanced methods
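import pandas as pd
df = pd.DataFrame({
    'Age': [25, 30, None, 40],
    'Salary': [50000, 60000, 52000, None]
})
# Assign the result back: chained fillna(inplace=True) on a column warns in
# pandas 2.x and does not update df under copy-on-write
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())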
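The snippet above covers imputation only. As a rough sketch of the other technique listed, removing rows or columns, pandas `dropna` can be used as follows. The `Notes` column and the 50% threshold are assumptions made purely for illustration.

```python
import pandas as pd

# Same toy data as above, plus a hypothetical mostly-empty column
df = pd.DataFrame({
    'Age': [25, 30, None, 40],
    'Salary': [50000, 60000, 52000, None],
    'Notes': [None, None, None, 'n/a'],  # illustrative only
})

# Keep only columns with at least 50% non-missing values (assumed threshold)
min_non_missing = int(len(df) * 0.5)
df_cols = df.dropna(axis=1, thresh=min_non_missing)

# Then drop any row that still contains a missing value
df_rows = df_cols.dropna(axis=0, how='any')
```

Dropping is simplest when only a small share of the data is affected; on a small dataset like this one it discards half the rows, which is why imputation is often preferred.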
2. Handling Outliers
Why it matters: Outliers can skew results and harm model performance.
Techniques: Z-score, IQR, or data transformation
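import numpy as np
from scipy import stats
data = np.array([10, 12, 14, 15, 13, 1000])
z_scores = np.abs(stats.zscore(data))
# With n samples the largest possible |z| is sqrt(n - 1), about 2.24 here,
# so the usual cutoff of 3 would keep the outlier; use 2 for this tiny sample
filtered_data = data[z_scores < 2]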
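The IQR technique mentioned above can be sketched on the same array. This is a minimal example that assumes the conventional 1.5 × IQR fences; the multiplier is a rule of thumb, not a fixed requirement.

```python
import numpy as np

data = np.array([10, 12, 14, 15, 13, 1000])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles (assumed multiplier)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the points that fall inside the fences
filtered_data = data[(data >= lower) & (data <= upper)]
```

Because quartiles are barely affected by a single extreme value, this approach tends to behave better than z-scores on small samples.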
3. Normalization / Scaling
Why it matters: Puts features on comparable scales, which matters for models that rely on distances or gradient-based optimization.
Techniques: Min-Max Scaling, Z-score Standardization
from sklearn.preprocessing import MinMaxScaler
data = [[100], [200], [300]]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
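For comparison, Z-score standardization, the second technique listed, might look like the sketch below. It uses scikit-learn's `StandardScaler` on the same toy data, so each column ends up with mean 0 and unit variance.

```python
from sklearn.preprocessing import StandardScaler

data = [[100], [200], [300]]

# Subtract the column mean and divide by the column standard deviation
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
```

In practice a scaler is usually fit on the training split only and then applied to other data with `scaler.transform`, so that test data does not influence the learned statistics.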
4. Encoding Categorical Variables
Why it matters: ML models require numerical input.
Techniques: Label Encoding (ordinal), One-Hot Encoding (nominal)
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
df_encoded = pd.get_dummies(df, columns=['Color'])
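The one-hot example suits nominal data such as colors. For ordinal data, one possible sketch maps each category to an integer that respects its rank, here using an ordered pandas `Categorical`. The `Size` column and its ordering are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical ordinal feature
df = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Small']})

# Explicit order, assumed for illustration: Small < Medium < Large
size_order = ['Small', 'Medium', 'Large']

# .codes gives each value's position in size_order
df['Size_encoded'] = pd.Categorical(
    df['Size'], categories=size_order, ordered=True
).codes
```

Spelling out the order avoids the alphabetical ordering that automatic label encoders fall back on, which would rank Large below Medium.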
5. Feature Extraction
Why it matters: Derived features can expose patterns and improve accuracy.
Techniques: Use domain knowledge, or derive features from text, dates, images, and other raw data.
import pandas as pd
df = pd.DataFrame({'Date': pd.to_datetime(['2024-01-01', '2024-06-15'])})
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.weekday  # Monday=0, Sunday=6
Summary Table
| Technique | Goal | Example Tool |
|---|---|---|
| Handling Missing Values | Fill/remove gaps in data | fillna(), Imputation |
| Handling Outliers | Limit the influence of extreme values | Z-score, IQR |
| Normalization/Scaling | Equalize feature ranges | MinMaxScaler, StandardScaler |
| Encoding Categorical Vars | Convert categories to numbers | get_dummies(), LabelEncoder |
| Feature Extraction | Create informative features | TF-IDF, .dt.month etc. |