
MLOps - Day 9

Today's learning :-
    • Started with a quick revision of the things studied in the last few days.
    • Dimensionality Reduction - Feature Elimination
    • Feature Selection (FS) - Filter (Correlation), Embedded (Lasso/L1 Regression), Wrapper (OLS - Ordinary Least Squares)
    • The Filter method is faster but less accurate.
    • The Embedded method (coefficient-based) is accurate but a slower approach.
    • The Wrapper method helps us achieve both accuracy and speed, with comparatively high performance. That is the reason we use the wrapper method so much in feature selection. Deep Learning (DL) and Neural Networks (NN) also use a wrapper-style approach in the background.
    • Feature Extraction (FE) - For this we use PCA (Principal Component Analysis). Why do we do feature extraction? One reason is performance: by doing this we can improve performance. (See the sketch below for all three approaches.)
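    • A minimal sketch of the three approaches above. The synthetic data, the 0.1 correlation cutoff, and the Lasso alpha are illustrative assumptions, not values from this dataset:
    Code :-
    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.decomposition import PCA

    # Synthetic numeric data standing in for a real dataset (assumption)
    X_arr, y_arr = make_regression(n_samples=100, n_features=6,
                                   n_informative=3, noise=10.0, random_state=42)
    X_df = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])
    y_s = pd.Series(y_arr, name='target')

    # Filter: keep features whose absolute correlation with y crosses a threshold
    corr = X_df.corrwith(y_s).abs()
    filter_keep = corr[corr > 0.1].index.tolist()

    # Embedded: Lasso (L1) shrinks unhelpful coefficients to exactly zero
    lasso = Lasso(alpha=1.0).fit(X_df, y_s)
    embedded_keep = X_df.columns[lasso.coef_ != 0].tolist()

    # Extraction: PCA builds new components instead of picking original columns
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_df)

    print('Filter keeps   :', filter_keep)
    print('Embedded keeps :', embedded_keep)
    print('PCA explained variance ratio:', pca.explained_variance_ratio_)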
    Feature selection using Wrapper method :-
    • If any variable you want to use as a feature (X) is a categorical variable, then first of all you have to convert (encode) it into dummy variables. For this we can use one-hot encoding.
    Code :-
    import pandas as pd
    dataset = pd.read_csv('50_Startups.csv')
    dataset.head()

    dataset.columns
    y = dataset['Profit']
    X = dataset[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]

    y.head()
    X.head()
    # Encode the categorical 'State' feature into dummy variables (one-hot)
    X = pd.get_dummies(X, drop_first=True)

    from sklearn.model_selection import train_test_split
    # Hold out 20% of the rows as a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred
    Output:- 
    array([126362.87908255,  84608.45383634,  99677.49425147,  46357.46068582,
           128750.48288504,  50912.4174188 , 109741.35032702, 100643.24281647,
            97599.27574594, 113097.42524432])
     
    y_test.head()
    Output:-
    13    134307.35
    39     81005.76
    30     99937.59
    45     64926.08
    17    125370.37
     
    model.coef_
    Output:- 
    array([ 8.05630064e-01, -6.87878823e-02,  2.98554429e-02,  9.38793006e+02,
            6.98775997e+00])
    • In this example the coefficients help us know where we should optimize (spend more or reduce investment) so that we can get the highest profit. The sketch below pairs each coefficient with its feature name.
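    • A quick way to read the coefficient array, reusing model and X from the code above:
    Code:-
    # Pair each coefficient with its column name so the array above is readable
    coef_table = pd.Series(model.coef_, index=X.columns)
    print(coef_table.sort_values(ascending=False))
    print('Intercept (bias):', model.intercept_)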
    • Now we will follow the steps below to improve performance by doing dimensionality reduction with the Wrapper feature-selection method.
    • So let's find out which feature is not important (or has less importance) and then remove it.
    • For this we have a general standard: if the value of P>|t| is greater than 0.05 (the Significance Level), we can remove that feature.
    Code:-
    import statsmodels.api as sm
    # Fit OLS so we can read the P>|t| column from the summary table
    model_ols = sm.OLS(endog=y, exog=X).fit()
    model_ols.summary()
    • But here we can see that there are only 5 features in the summary, which means we don't have the Bias term (the b in y = b + c1x1 + c2x2 + c3x3). That should not happen: if we don't have the bias, we can't predict accurately.
    • statsmodels' OLS doesn't add the bias automatically, hence we have to add a constant column ourselves.
    Code:-
    import numpy as np 
    ones = np.ones((50, 1))  # one constant column per row (the dataset has 50 rows)
    # np.append returns a new array, so assign the result; cast the dummy
    # columns to float so the combined matrix stays numeric
    X_new = np.append(arr=ones, values=X.astype(float), axis=1)
    model_ols = sm.OLS(endog=y, exog=X_new).fit()
    model_ols.summary()
    • Now in this summary we can see that for the last feature, x5, the value of P>|t| is greater than 0.05 (SL), hence we can remove it.
    Code:-
    X_new = X_new[:, 0:5]  # keep the constant and the first four features, dropping x5
    model_ols = sm.OLS(endog=y, exog=X_new).fit()
    model_ols.summary()

    • We follow the same process for every feature whose P>|t| value is greater than 0.05 (see the sketch below for the full loop).
    • When you eliminate (remove) a feature, your Adj. R-squared value should increase; if it decreases instead, that means you should not remove that feature.
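    • A minimal sketch of that full backward-elimination loop, reusing X_new and y from above. The Adj. R-squared print lets you watch the value mentioned in the previous bullet; note this simple version does not protect the constant column from being dropped:
    Code:-
    import numpy as np
    import statsmodels.api as sm

    def backward_elimination(X_mat, y_vec, sl=0.05):
        # Repeatedly drop the feature with the highest p-value until every
        # remaining P>|t| is at or below the significance level
        X_mat = np.array(X_mat, dtype=float)
        while True:
            model = sm.OLS(endog=y_vec, exog=X_mat).fit()
            worst = int(np.argmax(model.pvalues))  # feature with highest P>|t|
            if model.pvalues[worst] <= sl:
                return model, X_mat                # everything is significant
            print('dropping column', worst,
                  '| p =', round(float(model.pvalues[worst]), 3),
                  '| Adj. R2 before drop =', round(model.rsquared_adj, 4))
            X_mat = np.delete(X_mat, worst, axis=1)

    final_model, X_final = backward_elimination(X_new, y, sl=0.05)
    final_model.summary()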
