| Transformation | Purpose | Applies To | Description | Example |
|---|---|---|---|---|
| One-Hot Encoding | Convert categorical data into numeric | Categorical variables | Creates binary columns for each category | Color: Red, Blue → Red: [1, 0], Blue: [0, 1] |
| Label Encoding | Encode categories as integers | Ordinal categories (use with care on nominal ones) | Assigns an integer to each category; the integers imply an order, so this suits ordinal data best | Low, Medium, High → 0, 1, 2 |
| Normalization (Min-Max) | Scale values to range [0,1] | Continuous numerical data | (x - min) / (max - min) | Age: 20–60 → 0.0–1.0 |
| Standardization (Z-score) | Center around 0 with unit variance | Continuous numerical data | (x - mean) / std deviation | Height 180 (mean=170, std=10) → z = 1.0 |
| Log Transformation | Reduce right skew | Skewed numerical data | Applies log function to reduce magnitude difference | Sales: 1000 → log1p(1000) ≈ 6.91 |
| Binning | Convert continuous to categorical | Continuous data | Divide values into non-overlapping intervals | Age: 0–18 (Child), 19–64 (Adult), 65+ (Senior) |
| Handling Missing Values | Deal with NaNs | Any data type | Impute or remove null values | Salary: NaN → impute with median (e.g., 50000) |
| Outlier Handling | Reduce extreme values' effect | Numerical data | Detect using Z-score, IQR, etc. and remove or cap | Income > Q3 + 1.5*IQR → outlier |
| Feature Scaling | Ensure consistent scale across features | Numerical features | Often a synonym for normalization/standardization | Feature A: 0–1000, Feature B: 0–1 → scale A to [0,1] |
| Target Encoding | Encode categorical using target mean | Categorical variables with a target variable | Replace category with average of target variable | City A (avg income 60K), B (40K) → A: 60K, B: 40K |
| Date/Time Feature Extraction | Leverage datetime info | Date/Time variables | Extract year, month, day, weekday, etc. | 2025-06-27 → Month: 6, Weekday: Friday |
| Text Vectorization | Convert text to numeric form | Text data | TF-IDF, CountVectorizer, word embeddings | "happy sad happy" → {'happy': 2, 'sad': 1} |
| Dimensionality Reduction | Reduce number of features | High-dimensional data | PCA, t-SNE, UMAP to reduce noise/complexity | 100 features → reduce to 10 with 95% variance retained |
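
Several rows of the table can be reproduced in a few lines of code. The following is a minimal sketch in plain Python using only the standard library; in practice a library such as pandas or scikit-learn would usually do this work. The helper names (`min_max`, `z_score`, `one_hot`, `bin_age`, `impute_median`, `iqr_upper_fence`) and the sample values are assumptions made for illustration, and the binning cut points assume that 65 belongs to the Senior group.

```python
import math
from collections import Counter
from datetime import date
from statistics import median, quantiles


def min_max(values):
    """Min-max normalization: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def z_score(x, mean, std):
    """Standardization: (x - mean) / std deviation."""
    return (x - mean) / std


def one_hot(value, categories):
    """One binary indicator per category, in the order given."""
    return [1 if value == c else 0 for c in categories]


def bin_age(age):
    """Binning into non-overlapping intervals (illustrative cut points)."""
    if age <= 18:
        return "Child"
    if age < 65:
        return "Adult"
    return "Senior"


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_upper_fence(values):
    """Upper outlier threshold: Q3 + 1.5 * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 + 1.5 * (q3 - q1)


ages_scaled = min_max([20, 40, 60])            # [0.0, 0.5, 1.0]
height_z = z_score(180, mean=170, std=10)      # 1.0
sales_log = math.log1p(1000)                   # about 6.91
red = one_hot("Red", ["Red", "Blue"])          # [1, 0]
counts = Counter("happy sad happy".split())    # happy: 2, sad: 1

d = date(2025, 6, 27)
month, weekday = d.month, d.weekday()          # 6 and 4 (Monday is 0, so 4 is Friday)

salaries = impute_median([40000, None, 60000, 50000])  # None becomes 50000
fence = iqr_upper_fence([30, 35, 40, 45, 50, 500])     # 500 lies above this fence
```

Note that the exact quartile values, and therefore the fence, depend on the quantile method used; `statistics.quantiles` defaults to the "exclusive" method, which may differ slightly from what other tools report.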
Friday, 27 June 2025
Essential Data Transformation Techniques for EDA