Sunday, December 22, 2019

Exploratory Data Analysis (EDA)


Exploratory Data Analysis in Data Science

 

This post is about data plotting, or visualization, methods for data analysis; each technique is explained with an example using the iris dataset. You can also find other important posts in this blog, such as the machine learning modelling process, list of regression techniques, confusion matrix, and root-mean-square error. If you are interested in reading research papers, kindly refer to the post on classifier and regression models, where you can find the code of 77 regression models. In this post, you will discover some important data visualization techniques, mainly:

  • 2-D scatter plot
  • 3-D scatter plot
  • Pair plot
  • Histogram
  • Box-plot 

Let's first talk about the simple iris dataset. It has 4 features/independent variables or predictors (sepal length, sepal width, petal length, petal width), which means it is a 4-dimensional dataset. The response/dependent variable takes the class labels setosa, versicolor and virginica. The dataset has 150 data points and it is balanced, as the number of data points for each class is the same, i.e. 50 data points per class. You can download the dataset from here or load it using sklearn.datasets in Python. For more information about the iris dataset, click on this link. To implement the visualization techniques mentioned above in Python, you must have the pandas, seaborn, matplotlib and numpy libraries.
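As a starting point, the dataset can be loaded into a pandas DataFrame; a minimal sketch (the column names `sepal_length`, etc. are my own choice, matching the names used in the figures below):

```python
# Load the iris dataset into a pandas DataFrame
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.shape)                      # (150, 5)
print(df["species"].value_counts())  # 50 data points per class
```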

1. 2-D scatter plot:

Fig. 1 shows a 2-D scatter plot of sepal_length and sepal_width and reveals that the blue points of the setosa class are easily separable from the green and orange data points by drawing a straight line. However, the class labels versicolor and virginica are not easily separable with this 2-D feature combination (sepal_length and sepal_width). In this case, we can try other combinations, for instance petal_length and petal_width.




Fig. 1 2-D scatter plot of sepal_length and sepal_width
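A plot like Fig. 1 can be produced with seaborn; a minimal sketch, assuming the DataFrame column names `sepal_length`, `sepal_width` and `species` (the `Agg` backend is only there so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# One colour per class: setosa separates cleanly, versicolor/virginica overlap
ax = sns.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="species")
plt.savefig("scatter_2d.png")
```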

2. 3-D scatter plot:

It plots data points in 3-dimensional space. The disadvantage of a 3-D plot is that it requires many interactions (rotating, zooming) with the plot for interpretation, so it is a less convenient technique.
Fig. 2 3-D scatter plot of petal_length, sepal_length and petal_width from iris dataset
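A sketch of such a 3-D plot with matplotlib's 3-D axes, using the same three features as Fig. 2 (column indices 0, 2 and 3 correspond to sepal_length, petal_length and petal_width in sklearn's iris data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target  # columns: 0 sepal_length, 1 sepal_width, 2 petal_length, 3 petal_width

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3-D axes from mpl_toolkits
ax.scatter(X[:, 2], X[:, 0], X[:, 3], c=y)  # colour by class label
ax.set_xlabel("petal_length")
ax.set_ylabel("sepal_length")
ax.set_zlabel("petal_width")
plt.savefig("scatter_3d.png")
```

In an interactive backend you would rotate this figure with the mouse to inspect the class clusters, which is exactly the inconvenience mentioned above.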

3. Pair plot:

We cannot draw a 4-D scatter plot; instead, we use a pair plot. It is a good solution to avoid checking many feature combinations with 2-D plots and many mouse interactions with 3-D scatter plots. Datasets with 4, 5, 6 or 7 dimensions can easily be interpreted with a pair plot; however, it is not a good option if there are more dimensions than that. Fig. 3 shows that petal_width and petal_length are the two most influential predictors for identifying class labels: setosa is linearly separable from the classes versicolor and virginica. The diagonal elements are the Probability Density Functions (PDFs) of each feature.

Fig. 3 Pair plot of iris dataset

4. Histogram:

A histogram is a representation of the probability distribution of the data. It is the better way to visualize a single feature (1-D). Let's take sepal_length as an example, shown in Fig. 4. The x-axis is sepal_length and the y-axis is the count of data points in each sepal_length bin. Light blue, orange and green are the histograms of sepal_length for the setosa, versicolor and virginica flower types, respectively (see Fig. 4). A histogram tells us how many data points fall within a window, for example from 4 to 6. It shows that the most setosa flowers (around 15) occur when sepal_length is about 5. The height of the histogram shows how often we find a particular flower type given sepal_length. The smooth line is called the PDF and is a smoothed form of the histogram.

Fig. 4 Histogram of sepal_length

5. Box-plot:

It is another technique for visualizing 1-D data. A box plot puts the median, percentiles and quartiles into a plot. By looking at Fig. 4, we cannot tell what the 25th, 50th or 75th percentile of setosa sepal_length is. To know that, we use a box plot, which is built from percentiles. In Fig. 5, the x-axis shows the flower types (3 boxes, one per class label) and the y-axis is sepal_length. Consider the green box: it tells us the 25th, 50th and 75th percentile values of sepal_length for virginica. The whiskers generally mark the minimum and maximum values of the feature for each class; however, there is no standard way to draw them. Besides, box plots help us in writing rules and finding mis-classifications or errors.




Fig. 5 Box-plot of sepal_length
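A sketch of Fig. 5 with seaborn's `boxplot`, one box per class:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Box edges mark the 25th and 75th percentiles; the line inside is the median
ax = sns.boxplot(data=df, x="species", y="sepal_length")
plt.savefig("boxplot.png")
```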



Kindly follow my blog and stay tuned for more advanced posts on ML. 

Thank you!

Saturday, November 23, 2019

Machine Learning Overview



For an easy understanding of ML, this post shows a cheat sheet of the types of ML, with some algorithms as well as examples.



Kindly follow my blog and stay tuned for more advanced posts on ML. 

Thank you!


Reference:

Fernández-Delgado, M., Sirsat, M., Cernadas, E., Alawadi, S., Barro, S., Febrero-Bande, M., 2019. An extensive experimental survey of regression methods. Neural Networks 111, 11–34. doi:10.1016/j.neunet.2018.12.010

Saturday, May 11, 2019

Root Mean Square Error (RMSE)


How to calculate Root-Mean-Square Error?



This post will cover one of the most common ways to evaluate a regression model. The idea of a regression model is to predict real (continuous) values, not categorical or ordinal values. There are several ways to measure the error rate of a regression model, but for now we will just consider the RMSE measure. Let's look at the major points covered by the post.

  • What is residual?
  • Root Mean Square Error (RMSE)
  • How to calculate RMSE?

We will take a dataset which contains x and y values, where x is the input value and y is the output. Let's take the input values as 1, 2, 2, 3 and the output values as 1, 2, 3, 4, respectively. Figure 1 shows the regression model ŷ = 1.5x − 0.5, where 1.5 is the slope of the regression line and −0.5 is the intercept. In this case, it is a linear model.


Figure 1


Let's move on Residual.

A residual is the difference between the actual and predicted values; mathematically, it can be represented as residual = yᵢ − ŷᵢ for the ith data point. Our goal is always to minimize such errors. A residual can be negative or positive. Let's see RMSE.

Root Mean Square Error (RMSE) can be considered the standard deviation of the residuals, and it is a commonly used measure. It can only be compared between models whose errors are measured in the same units. Let's define the mathematical equation of RMSE.


The equation, RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² ), tells us: take the error of each one of the data points, square it, add them together, take the average, and then take the square root of that. Let's move on to how to calculate this measure. Moreover, you can easily determine it using functions available in R or Python.

How to compute RMSE? First, we will compute the residual for each of the data points. The following regression plots will give you a better understanding of how to calculate the residual for the given four data points.
For the first data point, x=1 and y=1.
The plot below shows that ŷ is calculated by placing x = 1 into the equation ŷ = 1.5x − 0.5, which gives 1. Here, the actual and predicted values are the same; thus, residual = yᵢ − ŷᵢ = (1 − 1) = 0.

Figure 2

For the second data point, x=2 and y=2.
The plot below shows that ŷ is calculated by placing x = 2 into the equation ŷ = 1.5x − 0.5, which gives the predicted output 2.5; therefore, residual = yᵢ − ŷᵢ = (2 − 2.5) = −0.5. As you can see, the residual is negative here. When a point is below the regression line, you will have a negative residual.


Figure 3

For the third data point, x=2 and y=3.
The plot below shows that ŷ is calculated by placing x = 2 into the equation ŷ = 1.5x − 0.5, which gives the predicted value 2.5. Hence, residual = yᵢ − ŷᵢ = (3 − 2.5) = 0.5. Here, we have a positive residual.


Figure 4


For the fourth data point, x=3 and y=4.
The plot below shows that ŷ is calculated by placing x = 3 into the equation ŷ = 1.5x − 0.5, which gives the predicted outcome 4; then residual = yᵢ − ŷᵢ = (4 − 4) = 0. For this data point, the actual and predicted values are the same, that is, 4.


Figure 5

Up to this point we have taken the error of each one of the data points. Now, we will square them, add them together, and then take the average: (0² + (−0.5)² + 0.5² + 0²)/4 = 0.125.



Let's take the square root of 0.125: RMSE = √0.125 ≈ 0.354.


This way, we can know how much our machine learning model disagrees with the actual data.
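The whole worked example above can be checked with a few lines of NumPy:

```python
import numpy as np

x = np.array([1, 2, 2, 3])
y = np.array([1, 2, 3, 4])

y_hat = 1.5 * x - 0.5          # predictions of the regression line
residuals = y - y_hat          # [0.0, -0.5, 0.5, 0.0]
mse = np.mean(residuals ** 2)  # (0 + 0.25 + 0.25 + 0) / 4 = 0.125
rmse = np.sqrt(mse)

print(rmse)  # ≈ 0.354
```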


Kindly follow my blog and stay tuned for more advanced post on list of regression measures.

Thank you!


Monday, May 6, 2019

Research Paper on Machine Learning

 

  Research Papers on Classifiers and Regression Models


In this article, I am going to write about two of the most important research papers, which compare a large collection of classification and regression techniques. If you are interested in learning several supervised techniques, then you should refer to the following two papers.

 

These research findings are very useful for machine learning fans. The 1st paper is a comparison of classification techniques and the 2nd is a comparison of a large collection of popular regression techniques. I have also given URLs to download these papers. Besides, you can easily access our 77 regression models.

 

  • Paper 1: Do we need hundreds of classifiers to solve real world classification problems?

This research paper evaluates 179 classifiers from 17 machine learning families over 121 datasets.


Figure 1 - Machine Learning Classification Families

 

The classification techniques are implemented in R, Weka, Matlab and C. According to the study, the random forest classifier is the most likely to be the best classifier. Download this paper from https://bit.ly/1yAuJa9


  • Paper 2: An extensive experimental survey of regression methods.

The second paper is on machine learning regression techniques and was published in Neural Networks. It explains and compares the 77 most important models, which belong to 19 machine learning families. The techniques are evaluated on 83 UCI regression datasets. Most of the techniques are implemented in R. I also listed the regression techniques with their R packages and references in my earlier article. Figure 2 shows the 19 regression families.


Figure 2 - Machine Learning Regression Families


You can download the above paper from https://bit.ly/2J2OmTV

Our code of the 77 regression models is now available. Download it from https://bit.ly/2Y9LyI5

You can try the downloaded code on your own regression problem.
  
Kindly follow my blog and stay tuned for more advanced post on dataset splitting. 

Thank you!


Monday, April 29, 2019

Confusion Matrix


What is Confusion Matrix and Advanced Classification Metrics?

 

After data preparation and model training comes the model evaluation phase, which I mentioned in my earlier article Simple Picture of Machine Learning Modelling Process.

Once a model is developed, the next phase is to calculate its performance using some evaluation metrics. In this article, you will discover the confusion matrix, though there are many other classification metrics out there.


Mainly, it focuses on below points:

  • What is confusion matrix? 
  • Four outputs in confusion matrix
  • Advanced classification metrics 


    Table 1. Confusion matrix with advanced classification metrics


A confusion matrix is a tool to determine the performance of a classifier. It contains information about the actual and predicted classifications. The table below shows the confusion matrix of a two-class (spam vs. non-spam) classifier.
 
 Table 2. Confusion matrix of email classification 

Let’s understand four outputs in confusion matrix.


1. True Positive (TP) is the number of correct predictions that an example is positive, which means a positive class example is correctly identified as positive.
Example: The given class is spam and the classifier correctly predicted it as spam.

2. False Negative (FN) is the number of incorrect predictions that an example is negative, which means a positive class example is incorrectly identified as negative.
Example: The given class is spam; however, the classifier incorrectly predicted it as non-spam.

3. False Positive (FP) is the number of incorrect predictions that an example is positive, which means a negative class example is incorrectly identified as positive.
Example: The given class is non-spam; however, the classifier incorrectly predicted it as spam.

4. True Negative (TN) is the number of correct predictions that an example is negative, which means a negative class example is correctly identified as negative.
Example: The given class is non-spam and the classifier correctly predicted it as non-spam.

Now, let's see some advanced classification metrics based on the confusion matrix. These metrics are mathematically expressed in Table 1, with the example of email classification shown in Table 2. The classification problem has spam and non-spam classes, and the dataset contains 100 examples: 65 spam and 35 non-spam.

Sensitivity is also referred to as True Positive Rate or Recall. It is the proportion of positive examples labeled as positive by the classifier, and it should be high. For instance, the proportion of actual spam emails that are identified as spam.

Table 3. Sensitivity in confusion matrix

Sensitivity = 45/(45+20) = 69.23%

That is, 69.23% of spam emails are correctly classified as spam.

Specificity is also known as the True Negative Rate. It is the proportion of negative examples labeled as negative by the classifier; it should also be high. For instance, the proportion of actual non-spam emails that are identified as non-spam.

Table 4. Specificity in confusion matrix

Specificity = 30/(30+5) = 85.71%

That is, 85.71% of non-spam emails are correctly classified as non-spam.

Precision is the ratio of the number of correctly classified positive examples to the total number of examples predicted as positive. It shows the correctness achieved in positive predictions.

Table 5. Precision in confusion matrix

Precision = 45/(45+5) = 90%

That is, 90% of the examples classified as spam are actually spam.

Accuracy is the proportion of the total number of predictions that are correct.

 Table 6. Accuracy in confusion matrix

Accuracy = (45+30)/(45+20+5+30) = 75%

That is, 75% of the examples are correctly classified by the classifier.

F1 score is the harmonic mean of recall (sensitivity) and precision. The F1 score can be a good choice when you seek a balance between precision and recall.


It combines recall and precision in one equation, which solves the problem of distinguishing between models with low recall and high precision, or vice versa.
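All five metrics can be computed directly from the four counts of the spam example above (TP=45, FN=20, FP=5, TN=30, as read from Tables 3–6):

```python
# Counts from the worked spam example: TP=45, FN=20, FP=5, TN=30
TP, FN, FP, TN = 45, 20, 5, 30

sensitivity = TP / (TP + FN)                # 45/65  ≈ 0.6923
specificity = TN / (TN + FP)                # 30/35  ≈ 0.8571
precision = TP / (TP + FP)                  # 45/50  = 0.90
accuracy = (TP + TN) / (TP + FN + FP + TN)  # 75/100 = 0.75

# Harmonic mean of precision and recall
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ≈ 0.7826

print(sensitivity, specificity, precision, accuracy, f1)
```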

Kindly follow my blog by email and stay tuned for more advanced post on regression measures. 

Thank you!
