Target Encoding in Data Science
Target encoding is a technique for encoding categorical variables by replacing each category with the mean (or another statistic) of the target variable for that category. It is particularly useful when a categorical feature has high cardinality, where one-hot encoding would produce a large number of sparse columns.
How Target Encoding Works
- Group Data by Category: For a given categorical feature, group the data based on unique categories.
- Calculate the Target Mean: Compute the mean of the target variable for each category.
- Replace Categories with Their Mean: Assign the calculated mean to all occurrences of that category.
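The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the function name `target_encode` and the `colors`/`clicked` data are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target of its group (hypothetical helper)."""
    # Step 1: group the target values by category.
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)

    # Step 2: compute the mean of the target for each category.
    means = {cat: sum(ys) / len(ys) for cat, ys in groups.items()}

    # Step 3: assign the category mean to every occurrence of that category.
    encoded = [means[cat] for cat in categories]
    return encoded, means


# Made-up illustrative data (not from the article's example).
colors = ["red", "blue", "red", "green", "blue", "red"]
clicked = [1, 0, 0, 1, 1, 1]

encoded, means = target_encode(colors, clicked)
print(means)
print(encoded)
```

Returning the `means` dictionary alongside the encoded values is a design choice for this sketch: the same mapping can then be applied to new data without recomputing it from that data's targets.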
Example
Imagine a dataset with a categorical feature City and a binary target variable Purchase (0 = No, 1 = Yes).
Sample Data
| City | Purchase |
|---|---|
| New York | 1 |
| New York | 0 |
| Los Angeles | 1 |
| Los Angeles | 1 |
| Chicago | 0 |
Step 1: Compute Mean Purchase for Each City
- New York: (1 + 0) / 2 = 0.5
- Los Angeles: (1 + 1) / 2 = 1.0
- Chicago: (0) / 1 = 0.0
Step 2: Replace Cities with Target Mean
| City | Purchase | Target Encoded City |
|---|---|---|
| New York | 1 | 0.5 |
| New York | 0 | 0.5 |
| Los Angeles | 1 | 1.0 |
| Los Angeles | 1 | 1.0 |
| Chicago | 0 | 0.0 |
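The table above can be reproduced with pandas. The sketch below assumes pandas is installed and uses the sample data exactly as given; the variable name `df` is arbitrary.

```python
import pandas as pd

# The sample data from the example.
df = pd.DataFrame({
    "City": ["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"],
    "Purchase": [1, 0, 1, 1, 0],
})

# groupby + transform("mean") computes the mean Purchase per City and
# broadcasts it back to every row of that city.
df["Target Encoded City"] = df.groupby("City")["Purchase"].transform("mean")

print(df)
```

Note that this computes the means on all rows at once, which is fine for illustrating the arithmetic but is exactly the setup that causes leakage when the same rows are later used for training and evaluation.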
Why Use Target Encoding?
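- Handles High Cardinality: The encoding stays compact no matter how many unique categories the feature has.
- Reduces Dimensionality: Unlike one-hot encoding, which adds one column per category, it replaces a categorical feature with a single numerical column.
- Captures Information About Target: Each encoded value summarizes how the target behaves for that category, so the feature carries predictive signal directly.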
Challenges & Considerations
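- Data Leakage: If the encoding is computed on the entire dataset before splitting, each row's own target value leaks into its feature, which leads to overfitting and overly optimistic validation scores. Compute the means on training data only, for example with K-fold (out-of-fold) mean encoding.
- Bias with Small Sample Sizes: If a category has few instances, its mean target value is unreliable. In the example above, Chicago's encoding of 0.0 rests on a single row. Smoothing the category mean toward the global target mean reduces this effect.
- Not Suitable for Unsupervised Learning: The method relies on a target variable, so it cannot be used when no target exists.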
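The K-fold and smoothing ideas mentioned above can be combined in a short sketch. The smoothing formula used here, (category sum + m × global mean) / (category count + m), is one common form and an assumption of this sketch; the prior weight `m`, the function names, and the simple unshuffled fold assignment (row index modulo `n_folds`) are likewise hypothetical choices rather than a prescribed method.

```python
from collections import defaultdict


def fit_smoothed_means(categories, targets, m=2.0):
    """Per-category means shrunk toward the global mean.

    m is a hypothetical prior weight: larger m pulls small categories
    more strongly toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return means, global_mean


def out_of_fold_encode(categories, targets, n_folds=5, m=2.0):
    """Encode each row using statistics from the other folds only."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Rows with index % n_folds == fold are held out (simple, unshuffled split).
        train_idx = [i for i in range(n) if i % n_folds != fold]
        held_idx = [i for i in range(n) if i % n_folds == fold]
        means, global_mean = fit_smoothed_means(
            [categories[i] for i in train_idx],
            [targets[i] for i in train_idx],
            m,
        )
        for i in held_idx:
            # Categories unseen in the training folds fall back to the global mean.
            encoded[i] = means.get(categories[i], global_mean)
    return encoded


cities = ["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"]
purchase = [1, 0, 1, 1, 0]

encoded = out_of_fold_encode(cities, purchase, n_folds=5, m=2.0)
print(encoded)
```

With five rows and five folds, every row is encoded from the other four rows, so no row ever sees its own target. Chicago appears only once, so when its row is held out the category is unseen and it receives the global mean of the remaining rows instead of its own target value.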
Alternative Encoding Methods
- One-Hot Encoding: Converts each category into a binary column.
- Label Encoding: Assigns an arbitrary integer to each category, which can imply an ordering that does not actually exist.
- Frequency Encoding: Replaces categories with their occurrence count.
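For comparison, the three alternatives listed above can be sketched with pandas on the same City column. This assumes pandas is installed; the output column names `City_label` and `City_freq` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["City"], prefix="City")

# Label encoding: an integer code per category. With a pandas categorical
# built from strings, codes follow the sorted category order.
df["City_label"] = df["City"].astype("category").cat.codes

# Frequency encoding: each category replaced by its occurrence count.
df["City_freq"] = df["City"].map(df["City"].value_counts())

print(one_hot)
print(df)
```

One-hot encoding yields three columns for three cities, while the label and frequency encodings each produce a single column, as target encoding does.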