Today's learning :-
- If a column/field helps you predict the target, then and only then is it a feature; otherwise it is not a feature/predictor.
- In this example we can see that marks do not depend on Name or College_id. That means these columns cannot be my features.
- Using the correlation method, we check whether College_id is a feature for marks or not.
- Feature selection using correlation is known as the filter approach to feature selection.
- But sometimes the filter approach does not give proper results, so we have another option: the embedded technique of feature selection.
- Using model coefficients is one way of doing feature selection with the embedded technique.
- Feature selection using coefficients is a slow technique, because first we have to create a model, then train it, and only then read the model coefficients. That is a long process (create model > train model > find model coefficients = coefficient feature-selection technique).
- If you want to do feature selection using the embedded technique (because it is more accurate than the filter technique) but don't want the coefficient technique because it is slow, then we have one more technique: Lasso / L1 regularization.
- Lasso is a faster and still accurate technique of feature selection.
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Lasso (L1) shrinks the coefficients of weak features to zero;
# SelectFromModel keeps only the features whose coefficients survive
sel = SelectFromModel(Lasso())
sel.fit(X, y)
sel.get_support()   # boolean mask: True = feature selected
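The filter (correlation) and embedded (coefficient) approaches described above can be sketched side by side on synthetic data. The column names Study_Hours, College_id, and Marks below are made up for illustration, echoing the marks example:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
# Synthetic data: Marks depends on Study_Hours, but not on College_id
study_hours = rng.uniform(0, 10, n)
college_id = rng.integers(1, 100, n).astype(float)
marks = 5 * study_hours + rng.normal(0, 2, n)

df = pd.DataFrame({'Study_Hours': study_hours,
                   'College_id': college_id,
                   'Marks': marks})

# Filter approach: correlation of each candidate column with the target.
# Study_Hours correlates strongly with Marks; College_id does not.
print(df.corr()['Marks'])

# Embedded approach: train a model, then inspect its coefficients.
# A near-zero coefficient means the feature can be dropped.
model = Lasso(alpha=1.0)
model.fit(df[['Study_Hours', 'College_id']], df['Marks'])
print(model.coef_)
```

Note that the embedded route needs the full create > train > inspect cycle, which is exactly why the notes above call it slower than the filter approach.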
- If we do experiments (hit and trial) on data and try to find a formula/output, this kind of research is known as data science, and the person who does this process is known as a data scientist.
- If we change a string to a number (One -> 1), the process is known as encoding/transformation, and it is one example of feature engineering.
- If you want your model to support your data, you have to do feature engineering.
- The one-hot encoding technique can be used to transform your categorical data into new variables.
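A minimal one-hot encoding sketch with pandas, on toy State values rather than the startup data:

```python
import pandas as pd

state = pd.Series(['California', 'Florida', 'New York', 'Florida'], name='State')

# Each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(state)
print(dummies)

# drop_first=True removes one redundant column (the dummy-variable trap):
# if all other indicators are 0, the dropped category is implied
print(pd.get_dummies(state, drop_first=True))
```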
Code :- To predict startup company profit.
import pandas as pd
dataset = pd.read_csv('50_Startups.csv')
dataset.columns

X = dataset[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]
y = dataset['Profit']

# One-hot encode the categorical State column
state = X['State']
state_dummy = pd.get_dummies(state)

# Keep the numeric columns (.copy() avoids SettingWithCopyWarning on assignment below)
X_new = X.iloc[:, 0:3].copy()

# Keep only two of the three state dummies to avoid the dummy-variable trap
f_state = state_dummy.iloc[:, :2]
X_new[['California', 'Florida']] = f_state

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.20, random_state=42)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
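After predicting, it helps to score the model on the held-out split. Since 50_Startups.csv isn't included here, this sketch fits the same pipeline on synthetic spend columns (made-up coefficients and noise) and reports R^2 on the test set:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 50
# Synthetic stand-in for the startup data (assumed ranges, not the real CSV)
X_demo = pd.DataFrame({'R&D Spend': rng.uniform(0, 2e5, n),
                       'Administration': rng.uniform(5e4, 2e5, n),
                       'Marketing Spend': rng.uniform(0, 5e5, n)})
y_demo = 0.8 * X_demo['R&D Spend'] + 0.05 * X_demo['Marketing Spend'] + rng.normal(0, 5000, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)

# R^2 on unseen data: closer to 1 means a better fit
print(r2_score(y_te, model.predict(X_te)))
```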