Binary Classification :-
- If you want to predict whether something will happen or not (0/1), the problem is a binary classification problem. For this we use a model based on the sigmoid function.
- To solve binary classification problems we use sklearn: we call its logistic regression, and logistic regression internally uses the sigmoid function.
- Hypothesis - A model is also known as a hypothesis. Today I am going to analyze the 'Titanic' passenger data set and try to create a model that predicts survival, to see what could be done in future to avoid such casualties.
- Any data whose values fall into categories is categorical data; it doesn't matter whether the values are stored as integers or strings.
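To make the sigmoid idea above concrete, here is a minimal sketch in plain Python. The weight `w`, the bias `b` and the 0.5 threshold are illustrative assumptions, not values fitted to the Titanic data; logistic regression learns such parameters from the data.

```python
import math

def sigmoid(z):
    """Map a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Return 1 if sigmoid(w*x + b) reaches the threshold, else 0."""
    probability = sigmoid(w * x + b)
    return 1 if probability >= threshold else 0

# Hypothetical parameters, chosen only for illustration
w, b = 2.0, -1.0
print(sigmoid(0.0))        # 0.5
print(predict(1.0, w, b))  # linear score is 1.0, so the class is 1
print(predict(0.0, w, b))  # linear score is -1.0, so the class is 0
```

The sigmoid output can be read as the probability of class 1, and the threshold turns that probability into a 0/1 answer.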
import pandas as pd
dataset = pd.read_csv('train.csv')
dataset.head()
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
dataset.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
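As a small illustration of the earlier point that integer columns can still be categorical, the sketch below uses a tiny made-up frame (not rows from train.csv) that borrows two of the column names. Pclass is stored as integers but only takes the values 1, 2 and 3, so it can be treated as a category.

```python
import pandas as pd

# Made-up sample rows; only the column names match the Titanic data
sample = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Count distinct values per column: a small count suggests a category
print(sample.nunique())

# Convert the integer-coded column to the pandas categorical dtype
sample["Pclass"] = sample["Pclass"].astype("category")
print(sample["Pclass"].cat.categories.tolist())
```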
import seaborn as sns
sns.set()
sns.countplot(x='Survived', data=dataset)
sns.countplot(x='Survived', hue='Sex', data=dataset)
sns.countplot(x='Survived', hue='Pclass', data=dataset)
sns.heatmap(dataset.isnull(), cbar=False, yticklabels=False, cmap='viridis')
age = dataset['Age']
sns.histplot(age.dropna(), kde=True)
sns.countplot(x='SibSp', hue='Survived', data=dataset)
- If a column has null values and you still want to use it as a feature because it carries a lot of weight, we have to do feature engineering on it; this type of feature engineering is known as imputation.
- Imputation is the process of replacing missing values with substitute values.
- We can estimate a typical age for each passenger class from a boxplot like the one below (the line inside each box is the median, not the mean).
sns.boxplot(data=dataset, y='Age',x='Pclass')
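The boxplot gives a visual estimate; the same per-class summary can also be computed directly. This is a sketch on a small made-up frame (the ages are invented, not taken from train.csv), so the numbers will differ on the real data set.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing age
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, 30.0, None, 20.0, 30.0, None],
})

# Mean and median age per passenger class; missing ages are skipped
summary = sample.groupby("Pclass")["Age"].agg(["mean", "median"])
print(summary)
```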
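def n_age(cols):
    # cols is one row holding the 'Age' and 'Pclass' values
    age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(age):
        # Fill a missing age with an approximate typical age for the class
        if Pclass == 1:
            return 38
        elif Pclass == 2:
            return 30
        elif Pclass == 3:
            return 25
        else:
            return 30
    else:
        return age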
dataset['Age'] = dataset[['Age', 'Pclass']].apply(n_age,axis=1)
dataset['Age']
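The same per-class fill can also be written without a row-wise function. The sketch below is an alternative, not the approach used in this post: it fills each missing age with the mean age of that passenger's class, computed from the data rather than hard-coded. The frame is made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing age
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, 30.0, None, 20.0, 30.0, None],
})

# Mean age of each row's class, aligned to the original index
class_mean = sample.groupby("Pclass")["Age"].transform("mean")

# Keep existing ages and fill only the missing ones
sample["Age"] = sample["Age"].fillna(class_mean)
print(sample)
```

Computing the fill values from the data keeps them in sync if the data set changes, at the cost of being less explicit than the hard-coded numbers.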
sns.heatmap(dataset.isnull(), cbar=False, yticklabels=False, cmap='viridis')
dataset.drop('Cabin', axis=1, inplace=True)
sns.heatmap(dataset.isnull(), cbar=False, yticklabels=False, cmap='viridis')
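The heatmap is a visual check; a numeric check is a useful complement. Here is a minimal sketch on a made-up frame (on the real data the same calls would be made on `dataset`):

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing value
sample = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Embarked": ["S", "C", None],
    "Fare": [7.25, 71.28, 8.05],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)

# Columns that still contain at least one missing value
print(null_counts[null_counts > 0].index.tolist())
```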
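- We have now handled the null values in Age and Cabin; this process is known as data cleaning. Embarked still has 2 missing values (889 non-null out of 891 in the info() output).
- Please check the next post for the practical part on model creation.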