Feature Engineering
Feature engineering is one of the most important topics for machine learning enthusiasts.
Goals of Feature Engineering
- Improve efficiency and performance of machine learning models
- Clean and prepare the dataset
Major Feature Engineering Techniques
1. Imputation of Missing Data
- Numerical Imputation :
- Replace missing values with the mean or median of the rest of the column (see the sketch after this list). Simple, but not very accurate, and it cannot be used for categorical features.
- SMOTE (Synthetic Minority Oversampling Technique) : uses KNN to artificially generate new samples of the minority class from its nearest neighbors.
- Categorical Imputation : use deep learning for categorical data, or KNN with Hamming distance.
- Random sample imputation
- The best approach is often simply to “get more training data”.
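As a minimal sketch of the first two imputation styles, scikit-learn’s SimpleImputer covers both the numerical and the categorical case; the columns and values below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative toy data; "age" and "city" are hypothetical columns
df = pd.DataFrame({
    "age": [22, np.nan, 35, 41, np.nan],
    "city": ["NY", "SF", np.nan, "NY", "SF"],
})

# Numerical imputation: replace missing values with the column median
num_imputer = SimpleImputer(strategy="median")
df[["age"]] = num_imputer.fit_transform(df[["age"]])

# Categorical imputation: replace missing values with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)
```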
2. Handling Outliers
Let's first understand variance and standard deviation:
- Variance : how spread out the data is
- Standard Deviation : the square root of variance; used to identify outliers
Example:
Given set = {1, 4, 5, 4, 8}. Let's find the standard deviation:
- Mean = (1+4+5+4+8)/5 = 4.4
- Differences from the mean: [-3.4, -0.4, 0.6, -0.4, 3.6]
- Squared differences: [11.56, 0.16, 0.36, 0.16, 12.96]
- Variance = average of the squared differences = (11.56+0.16+0.36+0.16+12.96)/5 = 5.04
Standard deviation = sqrt(variance) = sqrt(5.04) ≈ 2.24
Taking mean ± one standard deviation, 4.4 ± 2.24, gives the range (2.16, 6.64).
Clearly, from the given set {1, 4, 5, 4, 8}, 1 and 8 fall outside this range and are outliers.
Random Cut Forest can also be utilized to detect outliers.
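The worked example above translates directly into code. This sketch flags points more than one standard deviation from the mean, matching the example; in practice a 2- or 3-sigma threshold is more common:

```python
import numpy as np

data = np.array([1, 4, 5, 4, 8])
mean = data.mean()  # 4.4
std = data.std()    # sqrt(5.04) ≈ 2.24 (population standard deviation)

# Flag points farther than `threshold` standard deviations from the mean
threshold = 1
outliers = data[np.abs(data - mean) > threshold * std]
print(outliers)  # [1 8]
```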
3. Binning
- Applies to both numerical and categorical data
- Binning helps prevent overfitting of the model
- Categorizes and regularizes the data (sketched below)
- Can be a costly operation
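A minimal binning sketch with pandas; the bin edges and labels here are arbitrary choices for illustration, not from a real dataset:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 63, 80])

# Fixed-width numerical binning with hand-picked edges
age_bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                  labels=["child", "young", "middle", "senior"])

# Quantile binning: each bin gets roughly the same number of samples
age_halves = pd.qcut(ages, q=2, labels=["below_median", "above_median"])

print(age_bins.tolist())
print(age_halves.tolist())
```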
4. Transformation
Apply some function to a feature to make it better suited for training.
- Handles skewed data
- Decreases the effect of outliers due to normalization
For example : for a YouTube recommendation feature, say “x”, we also calculate sqrt(x) and x^2.
That way, the model can learn both super-linear and sub-linear functions of the feature.
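A sketch of that idea: derive super- and sub-linear versions of a feature x, plus a log transform for skewed data. The "views" column is a made-up example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"views": [1, 100, 10_000, 1_000_000]})  # hypothetical feature x

df["views_sqrt"] = np.sqrt(df["views"])  # sub-linear: compresses large values
df["views_sq"] = df["views"] ** 2        # super-linear: amplifies large values
df["views_log"] = np.log1p(df["views"])  # log transform, common for skewed data

print(df)
```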
5. Encoding
- More specifically : “One-Hot Encoding”
- Create buckets (columns) for the categories, each holding 1 or 0
- The binary value expresses whether the encoded category applies to a given row
- Common in deep learning, where categories are represented by individual output neurons (see the sketch below)
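A one-hot encoding sketch using pandas; get_dummies creates one 0/1 column (“bucket”) per category. The "color" column is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```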
6. Scaling / Normalization
- Most neural nets prefer their input features to be normally distributed around 0
- For example : modelling age and income on their raw scales is a poor choice, as income values will be much larger than age values and will dominate.
In this case, standardization is required.
- In scikit-learn, we generally use “MinMaxScaler” (or “StandardScaler” for standardization); both are sketched below.
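A sketch contrasting the two scalers on the age/income example; the numbers are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: age, income (illustrative values)
X = np.array([[25, 40_000],
              [35, 90_000],
              [50, 250_000]], dtype=float)

minmax = MinMaxScaler().fit_transform(X)      # rescales each column to [0, 1]
standard = StandardScaler().fit_transform(X)  # zero mean, unit variance per column

print(minmax)
print(standard)
```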
7. Grouping Operations
- Categorical grouping : use a pivot table, or group based on aggregate functions applied with a lambda.
- Numerical grouping : numerical columns are grouped using sum and mean functions in most cases.
The median can be used instead of the mean for robustness to outliers (see the sketch below).
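A grouping sketch with pandas: a pivot table for the categorical case, and mean/median aggregation for the numerical case. The data and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["NY", "NY", "SF", "SF", "SF"],
    "gender": ["F", "M", "F", "M", "F"],
    "income": [70, 80, 95, 90, 400],  # note the outlier: 400
})

# Categorical grouping via a pivot table
pivot = df.pivot_table(index="city", columns="gender",
                       values="income", aggfunc="mean")

# Numerical grouping: mean vs. median (the median resists the outlier)
agg = df.groupby("city")["income"].agg(["sum", "mean", "median"])

print(pivot)
print(agg)
```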