Tuesday, 18 March 2025

Target Encoding in Data Science

Target encoding is a technique used to encode categorical variables by replacing each category with the mean (or another statistic) of the target variable for that category. It is useful when categorical features have high cardinality.

How Target Encoding Works

  1. Group Data by Category: For a given categorical feature, group the data based on unique categories.
  2. Calculate the Target Mean: Compute the mean of the target variable for each category.
  3. Replace Categories with Their Mean: Assign the calculated mean to all occurrences of that category.
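The three steps above can be sketched in a few lines of pandas. The feature and target names here are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: a categorical feature and a binary target
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "blue"],
    "target": [1, 0, 1, 1, 0],
})

# Steps 1-3 in one call: group by category, compute the target mean,
# and broadcast that mean back to every row of the category
df["color_encoded"] = df.groupby("color")["target"].transform("mean")

print(df)
```

Using `transform("mean")` performs the group-and-replace in a single step; the equivalent two-step form is `df["color"].map(df.groupby("color")["target"].mean())`.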

Example

Imagine a dataset with a categorical feature City and a binary target variable Purchase (0 = No, 1 = Yes).

Sample Data

City         Purchase
New York     1
New York     0
Los Angeles  1
Los Angeles  1
Chicago      0

Step 1: Compute Mean Purchase for Each City

  • New York: (1 + 0) / 2 = 0.5
  • Los Angeles: (1 + 1) / 2 = 1.0
  • Chicago: (0) / 1 = 0.0

Step 2: Replace Cities with Target Mean

City         Purchase  Target Encoded City
New York     1         0.5
New York     0         0.5
Los Angeles  1         1.0
Los Angeles  1         1.0
Chicago      0         0.0
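The worked example above can be reproduced directly with pandas, using the sample data from the first table:

```python
import pandas as pd

# The sample data from the table above
df = pd.DataFrame({
    "City": ["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"],
    "Purchase": [1, 0, 1, 1, 0],
})

# Step 1: mean Purchase per City
means = df.groupby("City")["Purchase"].mean()

# Step 2: replace each city with its target mean
df["Target Encoded City"] = df["City"].map(means)
```

The resulting column matches the table: New York maps to 0.5, Los Angeles to 1.0, and Chicago to 0.0.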

Why Use Target Encoding?

  • Handles High Cardinality: Works well with many unique categories.
  • Reduces Dimensionality: Unlike one-hot encoding, it replaces a categorical feature with a single numerical column.
  • Captures Information About Target: Directly relates to the target variable.

Challenges & Considerations

  • Data Leakage: If the encoding is computed on the entire dataset before the train/test split, target information leaks into the features and the model overfits. Compute encodings only from training data, and use K-fold mean encoding or smoothing to mitigate this.
  • Bias with Small Sample Sizes: If a category has few instances, its mean target value might be unreliable.
  • Not Suitable for Unsupervised Learning: Because it relies on the target variable, it cannot be applied when no target exists.
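Smoothing addresses the small-sample problem by blending each category's mean with the global mean. The sketch below is one common formulation; the weight `m` is an assumed hyperparameter (not from the original post) that controls how strongly rare categories are pulled toward the global mean:

```python
import pandas as pd

def smoothed_target_encode(train, col, target, m=2):
    """Smoothed target encoding: blend each category's mean with the
    global mean. Categories with fewer than ~m rows are pulled toward
    the global mean, making their encodings less noisy."""
    global_mean = train[target].mean()
    agg = train.groupby(col)[target].agg(["sum", "count"])
    smooth = (agg["sum"] + m * global_mean) / (agg["count"] + m)
    return train[col].map(smooth), global_mean  # global_mean: fallback for unseen categories

df = pd.DataFrame({
    "city": ["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"],
    "purchase": [1, 0, 1, 1, 0],
})
encoded, fallback = smoothed_target_encode(df, "city", "purchase", m=2)
```

With `m=2` and a global mean of 0.6, Chicago's single observation encodes to 0.4 rather than the raw 0.0, illustrating how smoothing tempers unreliable category means. For K-fold mean encoding, the same function would be applied per fold, encoding each fold using only the other folds' rows.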

Alternative Encoding Methods

  • One-Hot Encoding: Converts each category into a binary column.
  • Label Encoding: Assigns an arbitrary number to each category.
  • Frequency Encoding: Replaces categories with their occurrence count.
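For comparison, the first and last alternatives above can be sketched with pandas built-ins (the data is the City sample from earlier):

```python
import pandas as pd

cities = pd.Series(["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"])

# One-hot encoding: one binary column per category (3 columns for 3 cities)
one_hot = pd.get_dummies(cities, prefix="city")

# Frequency encoding: each category replaced by its occurrence count
freq = cities.map(cities.value_counts())
```

Note how one-hot encoding widens the feature space with cardinality, while frequency encoding, like target encoding, keeps a single numeric column.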
