
MLOps - Day 9

Today's learning :-
    • Started with a quick revision of the things studied in the last few days.
    • Dimensionality Reduction - Feature Elimination
    • Feature Selection (FS) - Filter (Correlation), Embedded (Lasso/L1 Regression), Wrapper (OLS - Ordinary Least Squares)
    • The Filter method is faster but less accurate.
    • The Embedded method (coefficient-based) is accurate but a slower approach.
    • The Wrapper method helps us achieve both accuracy and speed, with comparatively high performance. That is the reason we use the wrapper method so much in feature selection. Deep Learning (DL) and Neural Networks (NN) also use a wrapper-style approach in the background.
    • Feature Extraction (FE) - For this we use PCA (Principal Component Analysis). Why do we do feature extraction? One reason is performance: by doing this we can improve performance. (See the sketch below for all three approaches.)
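    • A minimal sketch of the three approaches above. The synthetic data, the 0.1 correlation cutoff, and the Lasso alpha are illustrative assumptions, not values from this dataset:
    Code :-
    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.decomposition import PCA

    # Synthetic numeric data standing in for a real dataset (assumption)
    X_arr, y_arr = make_regression(n_samples=100, n_features=6,
                                   n_informative=3, noise=10.0, random_state=42)
    X_df = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])
    y_s = pd.Series(y_arr, name='target')

    # Filter: keep features whose absolute correlation with y crosses a threshold
    corr = X_df.corrwith(y_s).abs()
    filter_keep = corr[corr > 0.1].index.tolist()

    # Embedded: Lasso (L1) shrinks unhelpful coefficients to exactly zero
    lasso = Lasso(alpha=1.0).fit(X_df, y_s)
    embedded_keep = X_df.columns[lasso.coef_ != 0].tolist()

    # Extraction: PCA builds new components instead of picking original columns
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_df)

    print('Filter keeps   :', filter_keep)
    print('Embedded keeps :', embedded_keep)
    print('PCA explained variance ratio:', pca.explained_variance_ratio_)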
    Feature selection using Wrapper method :-
    • If any variable you want to use as a feature (X) is a categorical variable, then first of all you have to convert (encode) it into dummy variables. For this we can use one-hot encoding.
    Code :-
    import pandas as pd
    dataset = pd.read_csv('50_Startups.csv')
    dataset.head()

    dataset.columns
    y = dataset['Profit']
    X = dataset[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]

    y.head()
    X.head()
    # Encode the categorical 'State' feature into dummy variables (one-hot)
    X = pd.get_dummies(X, drop_first=True)

    from sklearn.model_selection import train_test_split
    # Hold out 20% of the rows as a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred
    Output:- 
    array([126362.87908255,  84608.45383634,  99677.49425147,  46357.46068582,
           128750.48288504,  50912.4174188 , 109741.35032702, 100643.24281647,
            97599.27574594, 113097.42524432])
     
    y_test.head()
    Output:-
    13    134307.35
    39     81005.76
    30     99937.59
    45     64926.08
    17    125370.37
     
    model.coef_
    Output:- 
    array([ 8.05630064e-01, -6.87878823e-02,  2.98554429e-02,  9.38793006e+02,
            6.98775997e+00])
    • In this example the coefficients help us know where we should optimize (spend more or reduce investment) so that we can get the highest profit. The sketch below pairs each coefficient with its feature name.
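    • A quick way to read the coefficient array, reusing model and X from the code above:
    Code:-
    # Pair each coefficient with its column name so the array above is readable
    coef_table = pd.Series(model.coef_, index=X.columns)
    print(coef_table.sort_values(ascending=False))
    print('Intercept (bias):', model.intercept_)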
    • Now we will follow the steps below to improve performance by doing dimensionality reduction with the Wrapper feature-selection method.
    • So let's find out which feature is not important (or has less importance) and then remove it.
    • For this we have a general standard: if the value of P>|t| is greater than 0.05 (the Significance Level), we can remove that feature.
    Code:-
    import statsmodels.api as sm
    # Fit OLS so we can read the P>|t| column from the summary table
    model_ols = sm.OLS(endog=y, exog=X).fit()
    model_ols.summary()
    • But here we can see that there are only 5 features in the summary, which means we don't have the Bias term (the b in y = b + c1x1 + c2x2 + c3x3). That should not happen: if we don't have the bias, we can't predict accurately.
    • statsmodels' OLS doesn't add the bias automatically, hence we have to add a constant column ourselves.
    Code:-
    import numpy as np 
    ones = np.ones((50, 1))  # one constant column per row (the dataset has 50 rows)
    # np.append returns a new array, so assign the result; cast the dummy
    # columns to float so the combined matrix stays numeric
    X_new = np.append(arr=ones, values=X.astype(float), axis=1)
    model_ols = sm.OLS(endog=y, exog=X_new).fit()
    model_ols.summary()
    • Now in this summary we can see that for the last feature, x5, the value of P>|t| is greater than 0.05 (SL), hence we can remove it.
    Code:-
    X_new = X_new[:, 0:5]  # keep the constant and the first four features, dropping x5
    model_ols = sm.OLS(endog=y, exog=X_new).fit()
    model_ols.summary()

    • We follow the same process for every feature whose P>|t| value is greater than 0.05 (see the sketch below for the full loop).
    • When you eliminate (remove) a feature, your Adj. R-squared value should increase; if it decreases instead, that means you should not remove that feature.
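    • A minimal sketch of that full backward-elimination loop, reusing X_new and y from above. The Adj. R-squared print lets you watch the value mentioned in the previous bullet; note this simple version does not protect the constant column from being dropped:
    Code:-
    import numpy as np
    import statsmodels.api as sm

    def backward_elimination(X_mat, y_vec, sl=0.05):
        # Repeatedly drop the feature with the highest p-value until every
        # remaining P>|t| is at or below the significance level
        X_mat = np.array(X_mat, dtype=float)
        while True:
            model = sm.OLS(endog=y_vec, exog=X_mat).fit()
            worst = int(np.argmax(model.pvalues))  # feature with highest P>|t|
            if model.pvalues[worst] <= sl:
                return model, X_mat                # everything is significant
            print('dropping column', worst,
                  '| p =', round(float(model.pvalues[worst]), 3),
                  '| Adj. R2 before drop =', round(model.rsquared_adj, 4))
            X_mat = np.delete(X_mat, worst, axis=1)

    final_model, X_final = backward_elimination(X_new, y, sl=0.05)
    final_model.summary()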
