Today's learning:
- Started with a quick revision of the topics studied over the last few days.
- Dimensionality Reduction - Feature Elimination
- Feature Selection (FS) - Filter (correlation), Embedded (Lasso/L1 regression), Wrapper (OLS - Ordinary Least Squares)
- The filter method is faster but less accurate.
- The embedded method (coefficient-based) is accurate but a slower approach.
- The wrapper method gives us both accuracy and speed with comparatively high performance, which is the main reason it is used so much for feature selection. Deep Learning (DL) and Neural Networks (NN) also use the wrapper method in the background.
- Feature Extraction (FE) - For this we use PCA (Principal Component Analysis). Why do we do feature extraction? One of the reasons is performance: by doing this we can increase performance. (A quick sketch of the filter, embedded, and extraction approaches follows this list.)
- If any variable you want to use as a feature (X) is categorical, then first of all you have to convert (encode) it into dummy variables. For this we can use one-hot encoding.
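Before the main walkthrough, here is a minimal sketch (an addition, not from the original post) of the filter, embedded, and extraction ideas above on the same 50_Startups data. The Lasso alpha and the 2 PCA components are arbitrary choices for illustration.
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.decomposition import PCA

dataset = pd.read_csv('50_Startups.csv')
y = dataset['Profit']
X = pd.get_dummies(dataset[['R&D Spend', 'Administration', 'Marketing Spend', 'State']], drop_first=True)

# Filter: rank features by absolute correlation with the target (fast, rough)
print(X.astype(float).corrwith(y).abs().sort_values(ascending=False))

# Embedded: Lasso (L1) shrinks weak features' coefficients towards zero
lasso = Lasso(alpha=1000.0, max_iter=10000)  # alpha picked arbitrarily for illustration
lasso.fit(X, y)
print(pd.Series(lasso.coef_, index=X.columns))

# Feature Extraction: PCA projects the numeric columns onto 2 components
pca = PCA(n_components=2)
pca.fit(dataset[['R&D Spend', 'Administration', 'Marketing Spend']])
print(pca.explained_variance_ratio_)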
import pandas as pd
dataset = pd.read_csv('50_Startups.csv')
dataset.head()
dataset.columns
y = dataset['Profit']
X = dataset[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]
y.head()
X.head()
# Encode the 'State' feature into dummy variables
X = pd.get_dummies(X, drop_first=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred
Output:-
array([126362.87908255, 84608.45383634, 99677.49425147, 46357.46068582, 128750.48288504, 50912.4174188 , 109741.35032702, 100643.24281647, 97599.27574594, 113097.42524432])
y_test.head()
Output:-
13    134307.35
39     81005.76
30     99937.59
45     64926.08
17    125370.37
Name: Profit, dtype: float64
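To put a number on how close these predictions are to y_test, one option (an addition, not in the original post) is sklearn's r2_score:
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))  # 1.0 would be a perfect fit on the test set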
model.coef_
array([ 8.05630064e-01, -6.87878823e-02, 2.98554429e-02, 9.38793006e+02, 6.98775997e+00])
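The raw coefficient array doesn't say which number belongs to which column; a small convenience sketch (an addition, not in the original post) lines them up:
import pandas as pd
# pair each coefficient with its feature column name
print(pd.Series(model.coef_, index=X.columns))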
- In this example the coefficients help us see where we should increase, optimize, or reduce investment so that we can get the highest profit.
- Now we will go through the steps below to improve performance through dimensionality reduction, using the wrapper feature selection method.
- So let's find out which features are unimportant or less important, and then remove them.
- For this we have a general standard: if a feature's P>|t| value is greater than 0.05 (the significance level, SL), we can remove that feature.
import statsmodels.api as sm
model_ols = sm.OLS(endog=y, exog=X).fit()
model_ols.summary()
- But here we can see that there are only 5 features, which means there is no bias term (the b in y = b + c1x1 + c2x2 + c3x3), and that is not right. If we don't have the bias, we can't predict accurately.
- statsmodels' OLS doesn't add the bias automatically, hence we have to add a constant column ourselves.
import numpy as np
ones = np.ones((50, 1))  # one constant column, 50 rows (one per sample)
X_new = np.append(arr=ones, values=X, axis=1).astype(float)  # cast so OLS gets plain floats (dummies may be bool)
model_ols = sm.OLS(endog=y, exog=X_new).fit()
model_ols.summary()
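As a side note (an addition, not from the original post), statsmodels can add the constant for us: sm.add_constant is equivalent to the manual np.append above and keeps the pandas column names.
X_const = sm.add_constant(X.astype(float))  # prepends a 'const' column of ones
model_ols = sm.OLS(endog=y, exog=X_const).fit()
model_ols.summary()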
- Now in this summary we can see that for the last feature (x5) the P>|t| value is greater than 0.05 (SL), hence we can remove it.
X_new = X_new[:, 0:5]  # keep the constant and the first 4 features, drop x5
model_ols = sm.OLS(endog=y, exog=X_new).fit()
model_ols.summary()
- We follow the same procedure for every feature whose P>|t| value is greater than 0.05 (a looped sketch of this backward elimination follows below).
- When you eliminate (remove) a feature, your Adj. R-squared value should increase; if it decreases instead, that means you should not remove that feature.
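Putting the whole elimination procedure into a loop, here is a minimal backward-elimination sketch (an illustration built on the steps above, not code from the original post); it drops one feature at a time by p-value using the 0.05 SL, leaving out the manual Adj. R-squared check.
import numpy as np
import statsmodels.api as sm

X_be = np.append(arr=np.ones((50, 1)), values=X, axis=1).astype(float)
while X_be.shape[1] > 1:
    model_ols = sm.OLS(endog=y, exog=X_be).fit()
    pvalues = model_ols.pvalues[1:]            # feature p-values (skip the constant)
    worst = int(np.argmax(pvalues))
    if pvalues[worst] <= 0.05:                 # everything left is significant - stop
        break
    X_be = np.delete(X_be, worst + 1, axis=1)  # +1 offsets the constant column
model_ols.summary()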