Feature Engineering

Feature engineering is one of the most important topics for machine learning enthusiasts.

Goals of Feature Engineering

  • Improve the efficiency and performance of machine learning models
  • Clean and prepare the dataset

Major Feature Engineering Techniques

1. Imputation of Missing Data

  • Numerical Imputation:
    • Replace missing values with the mean or median of the remaining values in the column (see the sketch after this list). Simple, but not very accurate, and it cannot be applied to categorical features.
    • SMOTE (Synthetic Minority Oversampling Technique): uses KNN to artificially generate new samples of the minority class from its nearest neighbors. Strictly speaking, it addresses class imbalance rather than missing values.
  • Categorical Imputation: use deep learning models, or KNN with Hamming distance, for categorical data.
  • Random sample imputation: fill missing values with values drawn at random from the observed data.
  • The best way is often simply to “get more training data”.
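
A minimal scikit-learn sketch of simple numerical and categorical imputation (the toy age and city columns are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 31],
    "city": ["NY", "SF", np.nan, "NY"],
})

# Numerical imputation: fill NaNs with the column median.
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical imputation: fill missing values with the most frequent category.
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

print(df)
```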

2. Handling Outliers

Let's first understand variance and standard deviation.

  • Variance: a measure of how spread out the data is
  • Standard deviation: the square root of the variance, used to identify outliers

Example:

Given Set = {1,4,5,4,8}

Let's work out the standard deviation:

  • Mean = (1+4+5+4+8)/5 = 4.4
  • Differences from the mean: [-3.4, -0.4, 0.6, -0.4, 3.6]
  • Squared differences: [11.56, 0.16, 0.36, 0.16, 12.96]
  • Average of the squared differences, i.e. the variance: (11.56+0.16+0.36+0.16+12.96)/5 = 5.04

Standard deviation = sqrt(variance) = sqrt(5.04) ≈ 2.24

Taking the mean ± one standard deviation, 4.4 ± 2.24, gives the range (2.16, 6.64).

Clearly, from the given set {1, 4, 5, 4, 8}, the values 1 and 8 fall outside this range, so they are flagged as outliers.
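
A minimal NumPy sketch of this mean ± k·std rule (k = 1 here to match the worked example; in practice a wider threshold such as 2 or 3 standard deviations is more common):

```python
import numpy as np

data = np.array([1, 4, 5, 4, 8])

mean = data.mean()
std = data.std()  # population std: sqrt(5.04) ≈ 2.24

k = 1  # k = 1 matches the example above; 2-3 is more typical
outliers = data[np.abs(data - mean) > k * std]

print(mean, std, outliers)  # 4.4 2.244... [1 8]
```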

Random Cut Forest can be utilized to detect outliers.

3. Binning

  • Applies to both numerical and categorical data
  • Binning helps prevent overfitting of the model
  • Categorizes and regularizes the data in a sensible way
  • Can be a costly operation (see the sketch below)
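
A small pandas sketch of binning a numerical column (the age column and bin edges are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [3, 17, 25, 42, 61, 88]})

# Fixed-width bins with human-readable labels.
df["age_bin"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 120],
    labels=["child", "young adult", "middle age", "senior"],
)

# Quantile bins: each bin gets roughly the same number of rows.
df["age_qbin"] = pd.qcut(df["age"], q=3, labels=["low", "mid", "high"])

print(df)
```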

4. Transformation

Apply a function to a feature to make it better suited for training.

  • Handles skewed data
  • Decreases the effect of outliers by normalizing the range of values

For example, in YouTube recommendations, a feature represented as, say, x can be supplemented with sqrt(x) and x^2. That way, the model can learn both super-linear and sub-linear functions of the feature.
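
A minimal sketch of such transformations (the watch_time feature name is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"watch_time": [1.0, 4.0, 9.0, 100.0, 2500.0]})

# Sub-linear and super-linear versions of the same feature,
# as in the YouTube-style example above.
df["watch_time_sqrt"] = np.sqrt(df["watch_time"])
df["watch_time_sq"] = df["watch_time"] ** 2

# Log transform: a common fix for right-skewed data
# (log1p also handles zeros gracefully).
df["watch_time_log"] = np.log1p(df["watch_time"])

print(df)
```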

5. Encoding

  • More specifically, “one-hot encoding”
  • Creates one bucket (a binary column, 1 or 0) per category
  • The binary values express which category each encoded row belongs to (see the sketch below)
  • Common in deep learning, where categories are often represented by individual output neurons
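
A quick pandas sketch of one-hot encoding (the color column is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary (1/0) column per category value.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```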

6. Scaling / Normalization

  • Most neural networks prefer feature data to be normally distributed around 0
  • For example, modelling raw age and income together is a poor choice, as income values will be much larger than age values; in this case, standardization is required
  • In scikit-learn, we generally use “MinMaxScaler” (or “StandardScaler” for z-score standardization); see the sketch below
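
A minimal scikit-learn sketch contrasting the two common scalers (the toy age/income values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: age and income.
X = np.array([[25,  30_000],
              [40,  90_000],
              [60, 250_000]], dtype=float)

# MinMaxScaler squeezes each column into [0, 1].
print(MinMaxScaler().fit_transform(X))

# StandardScaler centers each column at 0 with unit variance.
print(StandardScaler().fit_transform(X))
```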

7. Grouping Operations

  • Categorical grouping: use a pivot table, or group on a key and apply aggregate functions (including lambdas); see the sketch below.
  • Numerical grouping: numerical columns are grouped using sum and mean functions in most cases; the median can be used instead of the mean for robustness to outliers.
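
A short pandas sketch of both kinds of grouping (the city, product, and sales columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city":    ["NY", "NY", "SF", "SF", "SF"],
    "product": ["A",  "B",  "A",  "A",  "B"],
    "sales":   [10,   20,   5,    7,    30],
})

# Numerical grouping: aggregate a numeric column per group.
print(df.groupby("city")["sales"].agg(["sum", "mean", "median"]))

# Categorical grouping: a pivot table over two categorical keys.
print(df.pivot_table(values="sales", index="city",
                     columns="product", aggfunc="sum", fill_value=0))
```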