Friday, 27 June 2025

Essential Data Transformation Techniques for EDA

| Transformation | Purpose | Applies To | Description | Example |
|---|---|---|---|---|
| One-Hot Encoding | Convert categorical data into numeric form | Categorical variables | Creates one binary column per category | Color: Red, Blue → Red: [1, 0], Blue: [0, 1] |
| Label Encoding | Encode categories as integers | Ordinal categories; nominal categories with care | Assigns an integer to each category; the integers imply an order | Low, Medium, High → 0, 1, 2 |
| Normalization (Min-Max) | Scale values to the range [0, 1] | Continuous numerical data | (x - min) / (max - min) | Age: 20–60 → 0.0–1.0 |
| Standardization (Z-score) | Center around 0 with unit variance | Continuous numerical data | (x - mean) / std | Height: mean = 170, std = 10; 180 → 1.0 |
| Log Transformation | Reduce right skew | Skewed numerical data | Applies a log function to compress large differences in magnitude | Sales: 1000 → log1p(1000) ≈ 6.91 |
| Binning | Convert continuous to categorical | Continuous data | Divides values into intervals | Age: 0–18 (Child), 19–64 (Adult), 65+ (Senior) |
| Handling Missing Values | Deal with NaNs | Any data type | Impute or remove null values | Salary: NaN → impute with median (e.g., 50000) |
| Outlier Handling | Reduce the effect of extreme values | Numerical data | Detect using Z-score, IQR, etc., then remove or cap | Income > Q3 + 1.5 × IQR → outlier |
| Feature Scaling | Ensure a consistent scale across features | Numerical features | Umbrella term, often used as a synonym for normalization/standardization | Feature A: 0–1000, Feature B: 0–1 → scale A to [0, 1] |
| Target Encoding | Encode categories using the target mean | Categorical variables with a target variable | Replaces each category with the average of the target variable | City A (avg income 60K), City B (avg income 40K) → A: 60K, B: 40K |
| Date/Time Feature Extraction | Leverage datetime information | Date/time variables | Extracts year, month, day, weekday, etc. | 2025-06-27 → Month: 6, Weekday: Friday |
| Text Vectorization | Convert text to numeric form | Text data | TF-IDF, CountVectorizer, word embeddings | "happy sad happy" → {'happy': 2, 'sad': 1} (count vector) |
| Dimensionality Reduction | Reduce the number of features | High-dimensional data | PCA, t-SNE, or UMAP to reduce noise/complexity | 100 features → 10 PCA components retaining 95% of variance |
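
Several rows of the table can be reproduced with nothing but the Python standard library. The sketch below is illustrative only: the sample lists, the city/income pairs, and helper names such as `z_score` and `age_group` are made up for this example, and the binning rule assumes non-overlapping bins (0–18, 19–64, 65+). In practice, libraries such as pandas and scikit-learn provide equivalent operations; dimensionality reduction is left out because it needs a numerical library.

```python
import math
import statistics
from collections import Counter, defaultdict
from datetime import date

# Hypothetical sample data, invented for illustration only.
colors = ["Red", "Blue", "Red"]
levels = ["Low", "High", "Medium"]
ages = [20, 30, 60]
salaries = [40000, None, 50000, 60000]
incomes = [30, 32, 35, 38, 40, 42, 45, 300]
city_income = [("A", 70000), ("A", 50000), ("B", 40000)]

# One-hot encoding: one binary column per category (column order: Red, Blue).
categories = ["Red", "Blue"]
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Label encoding: an explicit mapping keeps the ordinal order Low < Medium < High.
level_codes = {"Low": 0, "Medium": 1, "High": 2}
labels = [level_codes[lv] for lv in levels]

# Min-max normalization: (x - min) / (max - min).
lo, hi = min(ages), max(ages)
ages_scaled = [(a - lo) / (hi - lo) for a in ages]

# Standardization: (x - mean) / std, using the table's height example.
def z_score(x, mean, std):
    return (x - mean) / std

height_z = z_score(180, mean=170, std=10)

# Log transformation: log1p(x) = ln(1 + x), which is defined at x = 0.
sales_log = math.log1p(1000)

# Binning with non-overlapping intervals (assumed cut points).
def age_group(age):
    if age <= 18:
        return "Child"
    if age <= 64:
        return "Adult"
    return "Senior"

age_bins = [age_group(a) for a in (10, 40, 65)]

# Missing values: impute None with the median of the observed values.
observed = [s for s in salaries if s is not None]
median_salary = statistics.median(observed)
salaries_imputed = [median_salary if s is None else s for s in salaries]

# Outliers: flag values above Q3 + 1.5 * IQR, then cap them at that bound.
q1, _, q3 = statistics.quantiles(incomes, n=4)
upper = q3 + 1.5 * (q3 - q1)
outliers = [x for x in incomes if x > upper]
incomes_capped = [min(x, upper) for x in incomes]

# Target encoding: replace each category with the mean of the target.
by_city = defaultdict(list)
for city, income in city_income:
    by_city[city].append(income)
city_means = {city: statistics.mean(vals) for city, vals in by_city.items()}

# Date feature extraction: month number and weekday name.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
d = date(2025, 6, 27)
month, weekday = d.month, WEEKDAYS[d.weekday()]

# Text vectorization (plain counts): word -> number of occurrences.
word_counts = dict(Counter("happy sad happy".split()))
```

One caveat on the target-encoding step: computing category means on the full dataset lets the target leak into the feature, so for modeling the means are normally computed on training data only.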
