Feature Engineering

Feature engineering is one of the most important topics for machine learning enthusiasts.

Goals of Feature Engineering

  • Improve the efficiency and performance of machine learning models
  • Clean and prepare the dataset

Major Feature Engineering Techniques

1. Imputation of Missing Data

  • Numerical Imputation:
    • Replace missing values with the mean or median of the remaining values in the column (see the sketch after this list). Simple, but not very accurate, and it cannot be applied to categorical features.
    • SMOTE (Synthetic Minority Oversampling Technique): uses KNN to artificially generate new samples of the minority class from its nearest neighbors. Strictly speaking, it addresses class imbalance rather than missing values.
  • Categorical Imputation: use deep learning models, or KNN with Hamming distance, for categorical data.
  • Random sample imputation: fill missing values with values drawn at random from the observed data.
  • The best way is often simply to “get more training data”.
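
A minimal scikit-learn sketch of simple numerical and categorical imputation (the toy age and city columns are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 31],
    "city": ["NY", "SF", np.nan, "NY"],
})

# Numerical imputation: fill NaNs with the column median.
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical imputation: fill missing values with the most frequent category.
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

print(df)
```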

2. Handling Outliers

Let's first understand variance and standard deviation.

  • Variance: a measure of how spread out the data is
  • Standard deviation: the square root of the variance, used to identify outliers

Example:

Given Set = {1,4,5,4,8}

Let's work out the standard deviation:

  • Mean = (1+4+5+4+8)/5 = 4.4
  • Differences from the mean: [-3.4, -0.4, 0.6, -0.4, 3.6]
  • Squared differences: [11.56, 0.16, 0.36, 0.16, 12.96]
  • Average of the squared differences, i.e. the variance: (11.56+0.16+0.36+0.16+12.96)/5 = 5.04

Standard deviation = sqrt(variance) = sqrt(5.04) ≈ 2.24

Taking the mean ± one standard deviation, 4.4 ± 2.24, gives the range (2.16, 6.64).

Clearly, from the given set {1, 4, 5, 4, 8}, the values 1 and 8 fall outside this range, so they are flagged as outliers.
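
A minimal NumPy sketch of this mean ± k·std rule (k = 1 here to match the worked example; in practice a wider threshold such as 2 or 3 standard deviations is more common):

```python
import numpy as np

data = np.array([1, 4, 5, 4, 8])

mean = data.mean()
std = data.std()  # population std: sqrt(5.04) ≈ 2.24

k = 1  # k = 1 matches the example above; 2-3 is more typical
outliers = data[np.abs(data - mean) > k * std]

print(mean, std, outliers)  # 4.4 2.244... [1 8]
```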

Random Cut Forest can be utilized to detect outliers.

3. Binning

  • Applies to both numerical and categorical data
  • Binning helps prevent overfitting of the model
  • Categorizes and regularizes the data in a sensible way
  • Can be a costly operation (see the sketch below)
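
A small pandas sketch of binning a numerical column (the age column and bin edges are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [3, 17, 25, 42, 61, 88]})

# Fixed-width bins with human-readable labels.
df["age_bin"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 120],
    labels=["child", "young adult", "middle age", "senior"],
)

# Quantile bins: each bin gets roughly the same number of rows.
df["age_qbin"] = pd.qcut(df["age"], q=3, labels=["low", "mid", "high"])

print(df)
```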

4. Transformation

Apply a function to a feature to make it better suited for training.

  • Handles skewed data
  • Decreases the effect of outliers by normalizing the range of values

For example, in YouTube recommendations, a feature represented as, say, x can be supplemented with sqrt(x) and x^2. That way, the model can learn both super-linear and sub-linear functions of the feature.
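
A minimal sketch of such transformations (the watch_time feature name is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"watch_time": [1.0, 4.0, 9.0, 100.0, 2500.0]})

# Sub-linear and super-linear versions of the same feature,
# as in the YouTube-style example above.
df["watch_time_sqrt"] = np.sqrt(df["watch_time"])
df["watch_time_sq"] = df["watch_time"] ** 2

# Log transform: a common fix for right-skewed data
# (log1p also handles zeros gracefully).
df["watch_time_log"] = np.log1p(df["watch_time"])

print(df)
```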

5. Encoding

  • More specifically, “one-hot encoding”
  • Creates one bucket (a binary column, 1 or 0) per category
  • The binary values express which category each encoded row belongs to (see the sketch below)
  • Common in deep learning, where categories are often represented by individual output neurons
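
A quick pandas sketch of one-hot encoding (the color column is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary (1/0) column per category value.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```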

6. Scaling / Normalization

  • Most neural networks prefer feature data to be normally distributed around 0
  • For example, modelling raw age and income together is a poor choice, as income values will be much larger than age values; in this case, standardization is required
  • In scikit-learn, we generally use “MinMaxScaler” (or “StandardScaler” for z-score standardization); see the sketch below
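
A minimal scikit-learn sketch contrasting the two common scalers (the toy age/income values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: age and income.
X = np.array([[25,  30_000],
              [40,  90_000],
              [60, 250_000]], dtype=float)

# MinMaxScaler squeezes each column into [0, 1].
print(MinMaxScaler().fit_transform(X))

# StandardScaler centers each column at 0 with unit variance.
print(StandardScaler().fit_transform(X))
```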

7. Grouping Operations

  • Categorical grouping: use a pivot table, or group on a key and apply aggregate functions (including lambdas); see the sketch below.
  • Numerical grouping: numerical columns are grouped using sum and mean functions in most cases; the median can be used instead of the mean for robustness to outliers.
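
A short pandas sketch of both kinds of grouping (the city, product, and sales columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city":    ["NY", "NY", "SF", "SF", "SF"],
    "product": ["A",  "B",  "A",  "A",  "B"],
    "sales":   [10,   20,   5,    7,    30],
})

# Numerical grouping: aggregate a numeric column per group.
print(df.groupby("city")["sales"].agg(["sum", "mean", "median"]))

# Categorical grouping: a pivot table over two categorical keys.
print(df.pivot_table(values="sales", index="city",
                     columns="product", aggfunc="sum", fill_value=0))
```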