One of the key factors a data analyst focuses on to improve the accuracy of a machine learning model is choosing the correct predictors that are fed into the system for prediction.

Including non-impacting features unnecessarily increases the complexity of the model; most algorithms also suffer from the curse of dimensionality and start underperforming, with reduced accuracy, as features pile up. It therefore becomes crucial to include the appropriate predictors and to eliminate the non-impacting features for better results.

Feature engineering is an area of machine learning that focuses on selecting the correct features for the model being developed. Below are some of the questions that come to the mind of a data analyst, and statistics has a way of answering them.

- How do I make sure that a chosen feature (variable) creates an impact on the prediction outcome?
- Will the accuracy of my model increase if I eliminate the non-impacting features?
- How do I understand the correlation between the features in the model and use it to improve accuracy?

Let’s try to analyze these challenges and some options to address them. I have used the R language to showcase the solutions.

- How do I make sure that a chosen feature (variable) impacts the prediction outcome?

Let us take the Advertising sales dataset (from the referenced ISLR book), with media spend on TV, Radio and Newspaper, and see whether these features impact sales.

Here we load the data and fit a linear model predicting the sales value from the expense spent on TV, Radio and Newspaper marketing.

*sales = read.csv(file='Advertising.csv', header = TRUE)*

*fix(sales)*

*attach(sales)*

*lm.fit1 = lm(Sales~TV+Radio+Newspaper, data=sales)*

The p-value tests the null hypothesis that a coefficient is equal to zero. A low p-value (< 0.05) indicates that the feature makes a meaningful addition to the model, whereas a high p-value shows that the feature has little impact on the result. R’s summary command, given a fitted model, shows many details, including the p-value of each feature, as below:

*summary(lm.fit1)*

##

## Call:

## lm(formula = Sales ~ TV + Radio + Newspaper, data = sales)

##

## Residuals:

## Min 1Q Median 3Q Max

## -8.8277 -0.8908 0.2418 1.1893 2.8292

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***

## TV 0.045765 0.001395 32.809 <2e-16 ***

## Radio 0.188530 0.008611 21.893 <2e-16 ***

## Newspaper -0.001037 0.005871 -0.177 0.86

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## Residual standard error: 1.686 on 196 degrees of freedom

## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956

## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

The output above shows that the features TV and Radio have p-values less than 2e-16, whereas the feature Newspaper has a p-value of 0.86, which is very high. We can conclude that the Newspaper feature has the least impact on the prediction result.
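As a programmatic variant, the p-value column can be pulled straight out of summary()’s coefficient table. The sketch below uses simulated stand-in data (the Advertising file may not be at hand), where Sales depends on TV and Radio but not on Newspaper:

```r
# Simulated stand-in for the Advertising dataset: Sales is driven by
# TV and Radio; Newspaper is pure noise by construction.
set.seed(1)
n <- 200
sim <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50),
                  Newspaper = runif(n, 0, 100))
sim$Sales <- 3 + 0.046 * sim$TV + 0.19 * sim$Radio + rnorm(n)

fit <- lm(Sales ~ TV + Radio + Newspaper, data = sim)
# Extract the "Pr(>|t|)" column of the coefficient table.
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
# Predictors whose p-value exceeds the 5% threshold are removal candidates.
names(pvals)[pvals > 0.05]
```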

- How do I make sure that my model’s accuracy increases when I eliminate the non-impacting features?

- We have just found that the Newspaper feature does not meaningfully impact the prediction result. Let’s exclude Newspaper, refit the model, and validate the statistics of the new model. The adjusted R-squared value denotes how well the model fits the training data while penalizing extra predictors. We can see the adjusted R-squared improved from 0.8956 to 0.8962 with the removal of Newspaper from the model.

*lm.fit2 = lm(Sales~TV+Radio, data=sales)*

*summary(lm.fit2)*

##

## Call:

## lm(formula = Sales ~ TV + Radio, data = sales)

##

## Residuals:

## Min 1Q Median 3Q Max

## -8.7977 -0.8752 0.2422 1.1708 2.8328

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 2.92110 0.29449 9.919 <2e-16 ***

## TV 0.04575 0.00139 32.909 <2e-16 ***

## Radio 0.18799 0.00804 23.382 <2e-16 ***

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## Residual standard error: 1.681 on 197 degrees of freedom

## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962

## F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
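The same before/after comparison can also be made formally with an F-test via anova(). The sketch below uses simulated stand-in data, so the exact numbers will differ from the Advertising output above:

```r
# anova() compares a reduced model against the full model and reports an
# F-test of whether the dropped feature adds explanatory power. In this
# simulated stand-in data, Newspaper is pure noise by construction.
set.seed(42)
n <- 200
d <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50),
                Newspaper = runif(n, 0, 100))
d$Sales <- 3 + 0.046 * d$TV + 0.19 * d$Radio + rnorm(n)

full    <- lm(Sales ~ TV + Radio + Newspaper, data = d)
reduced <- lm(Sales ~ TV + Radio, data = d)
anova(reduced, full)  # a large p-value supports dropping Newspaper
```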

Having too many non-impacting features makes the model cluttered and more complex and harms the fit of the model. It is always recommended to eliminate the non-impacting features.

- How do features that are highly correlated impact the prediction?

- Features that are highly correlated carry redundant information and can unduly influence the result, inflating the variance of the coefficient estimates. It is recommended to identify highly correlated features and eliminate the redundancy. A simple plot can help visualize the correlation between features, as below:

*pairs(~TV+Radio+Newspaper,sales)*
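As a numeric complement to the pairs() plot, cor() prints the pairwise correlation matrix directly. This sketch again uses simulated, independently generated stand-in predictors:

```r
# cor() gives the pairwise correlation matrix of the predictors.
# These simulated columns are independent, so entries off the
# diagonal should land near 0.
set.seed(7)
n <- 200
pred <- data.frame(TV = runif(n), Radio = runif(n), Newspaper = runif(n))
round(cor(pred), 2)  # values near 0 suggest little pairwise correlation
```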

The Variance Inflation Factor (VIF) is a statistical measure of correlation among the feature columns. The smallest possible value of VIF (close to 1) denotes a complete absence of collinearity, while a high VIF, exceeding 5 or 10, indicates the presence of collinearity.

*fmsb::VIF(lm(TV~Newspaper, data=sales))*

## [1] 1.003219

Based on the value of VIF, we can conclude the features TV and Newspaper are not correlated.
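If the fmsb package is not available, VIF can be computed by hand as 1 / (1 - R²), where R² comes from regressing each predictor on all the others. The sketch below uses base R only; the helper name vif_one is our own, and the simulated independent predictors stand in for the Advertising columns, so every VIF should land near 1:

```r
# Manual VIF: regress each predictor on the others and apply 1/(1 - R^2).
set.seed(3)
n <- 200
x <- data.frame(TV = runif(n), Radio = runif(n), Newspaper = runif(n))

vif_one <- function(var, data) {
  others <- setdiff(names(data), var)                 # remaining predictors
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}
sapply(names(x), vif_one, data = x)  # values near 1 => little collinearity
```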

Generally it is easy to analyze each column for feature selection when the dataset has only a few features, but when the dataset is high-dimensional, with hundreds of features, it becomes difficult to analyze the combinatorial number of collinearity pairings.

Subset selection is a field of machine learning that defines best practices for feature selection on a high-dimensional dataset. Some approaches for eliminating correlated features are Principal Component Analysis (PCA), dimensionality reduction, forward selection, backward selection, etc. I will detail the process of subset selection for high-dimensional datasets in a separate post.
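As a small taste of those approaches, base R’s step() automates backward selection by AIC, removing one feature at a time. This is a sketch on the same simulated stand-in data, where Newspaper is pure noise and is therefore the natural candidate to be dropped:

```r
# Backward elimination by AIC with step(): start from the full model
# and let R drop features whose removal lowers the AIC.
set.seed(9)
n <- 200
d <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50),
                Newspaper = runif(n, 0, 100))
d$Sales <- 3 + 0.046 * d$TV + 0.19 * d$Radio + rnorm(n)  # Newspaper unused

full <- lm(Sales ~ TV + Radio + Newspaper, data = d)
best <- step(full, direction = "backward", trace = 0)
formula(best)  # the surviving predictors after backward elimination
```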

Reference: The Elements of Statistical Learning