Data Science and Machine Learning : May 2019

Saturday, May 11, 2019

Root Mean Square Error (RMSE)

How to calculate Root-Mean-Square Error?

This post covers one of the most common ways to evaluate a regression model. A regression model predicts real-valued (continuous) outputs, not categorical or ordinal values. There are several ways to measure the error of a regression model, but here we will consider only the RMSE measure. Let's look at the major points covered by the post.

What is residual?
Root Mean Square Error (RMSE)
How to calculate RMSE?

We will take a dataset that contains x and y values, where x is the input and y is the output. Let's take the input values 1, 2, 2, 3 and the output values 1, 2, 3, 4, respectively. Figure 1 shows the regression model ŷ = 1.5 x - 0.5. The value 1.5 is the slope of the regression line and -0.5 is the intercept. In this case, it is a linear model.
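As a quick check of the line above, here is a minimal sketch in plain Python. It assumes the regression line was obtained by ordinary least squares, which the post does not state explicitly; the helper name `fit_line` is made up for illustration.

```python
# Fit y = slope * x + intercept by ordinary least squares (assumed method).
xs = [1, 2, 2, 3]  # input values from the post
ys = [1, 2, 3, 4]  # output values from the post


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 1.5 -0.5
```

Under that assumption, the fitted slope and intercept match the model ŷ = 1.5 x - 0.5.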

Figure 1

Let's move on to the residual.

A residual is the difference between the actual and the predicted value. Mathematically, it can be written as Residual = yi - ŷi for the ith data point. It can be negative or positive. Our goal is always to keep such errors as small as possible. Let's see RMSE.

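Root Mean Square Error (RMSE) can be thought of as the standard deviation of the residuals, and it is a commonly used measure. It is expressed in the same units as y, so it can only be compared between models whose errors are measured in the same units. Let's define the mathematical equation of RMSE:

RMSE = sqrt( (1/n) * Σ (yi - ŷi)² ), where n is the number of data points.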

The equation tells us to take the error of each data point, square it, add the squares together, take the average, and then take the square root of that. You can also compute it easily with functions available in R or Python. Let's move on to how to calculate this measure.
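The recipe just described can be sketched as a small Python function. This is only an illustration using the standard library; the function name `rmse` and the sample numbers in the usage line are hypothetical and not taken from the post.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and the same length")
    # Square each error (actual - predicted), average the squares, take the root.
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, only to show the call.
print(rmse([3, 5], [2, 7]))
```

Because the errors are squared before averaging, positive and negative residuals cannot cancel each other out.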

How to compute RMSE? First, we will compute the residual for each data point. The following regression plots will give you a better understanding of how the residual is calculated for each of the four data points.

For the first data point, x=1 and y=1.

The plot below shows ŷ calculated by substituting x = 1 into the equation ŷ = 1.5 x - 0.5, which gives 1. Here the actual and predicted values are the same, so residual = yi - ŷi = (1 - 1) = 0.

Figure 2

For the second data point, x=2 and y=2.

The plot below shows ŷ calculated by substituting x = 2 into the equation ŷ = 1.5 x - 0.5, which gives a predicted output of 2.5. Therefore, residual = yi - ŷi = (2 - 2.5) = -0.5. As you can see, the residual is negative here. When a data point lies below the regression line, its residual is negative.

Figure 3

For the third data point, x=2 and y=3.

The plot below shows ŷ calculated by substituting x = 2 into the equation ŷ = 1.5 x - 0.5, which gives a predicted value of 2.5. Hence, residual = yi - ŷi = (3 - 2.5) = 0.5. Here we have a positive residual, because the data point lies above the regression line.

Figure 4

For the fourth data point, x=3 and y=4.

The plot below shows ŷ calculated by substituting x = 3 into the equation ŷ = 1.5 x - 0.5, which gives a predicted outcome of 4. Then residual = yi - ŷi = (4 - 4) = 0. For this data point, the actual and predicted values are the same, namely 4.

Figure 5
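The four residuals worked out above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `predict` is introduced here for illustration only.

```python
xs = [1, 2, 2, 3]  # inputs from the post
ys = [1, 2, 3, 4]  # actual outputs from the post


def predict(x):
    """Regression model from the post: y_hat = 1.5 * x - 0.5."""
    return 1.5 * x - 0.5


# Residual = actual - predicted, one value per data point.
residuals = [y - predict(x) for x, y in zip(xs, ys)]
print(residuals)  # [0.0, -0.5, 0.5, 0.0]
```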

So far we have taken the error of each data point. Now we will square the residuals, add them together, and then take the average: (0² + (-0.5)² + 0.5² + 0²) / 4 = 0.5 / 4 = 0.125.

Let's take the square root of 0.125: RMSE = √0.125 ≈ 0.354.
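The last steps (square, sum, average, square root) can be traced in Python as well. This sketch starts from the four residuals computed earlier in the post; the variable names are chosen here for readability.

```python
import math

residuals = [0.0, -0.5, 0.5, 0.0]  # residuals of the four data points

squared = [r ** 2 for r in residuals]  # [0.0, 0.25, 0.25, 0.0]
sse = sum(squared)                     # sum of squared errors: 0.5
mse = sse / len(squared)               # mean squared error: 0.125
rmse_value = math.sqrt(mse)            # root mean square error

print(round(rmse_value, 4))  # 0.3536
```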

This way we can know how much our machine learning model disagrees with the actual data.

Kindly follow my blog and stay tuned for more advanced posts on other regression measures.

Thank you!

Monday, May 6, 2019

Research Paper on Machine Learning

Research Papers on Classifiers and Regression Models

In this article, I am going to write about two important research papers that compare large collections of classification and regression techniques. If you are interested in learning several supervised techniques, then you should refer to the following two papers.

These research findings are very useful for machine learning fans. The first paper compares classification techniques, and the second compares a large collection of popular regression techniques. I have also given the URLs from which these papers can be downloaded. Besides, you can easily access our code for the 77 regression models.

Paper 1: Do we need hundreds of classifiers to solve real world classification problems?

This research paper evaluates 179 classifiers from 17 machine learning families on 121 datasets.

Figure 1 - Machine Learning Classification Families

The classification techniques are implemented in R, Weka, Matlab and C. According to the study, the random forest classifier is the most likely to be the best classifier. Download this paper from the link https://bit.ly/1yAuJa9

Paper 2: An extensive experimental survey of regression methods.

The second paper is on machine learning regression techniques and was published in the journal Neural Networks. It explains and compares 77 of the most important models, which belong to 19 machine learning families. The techniques are evaluated on 83 UCI regression datasets. Most of the techniques are implemented in R. I also listed the regression techniques with their R packages and references in my earlier article. Figure 2 shows the 19 regression families.

Figure 2 - Machine Learning Regression Families

You can download the above paper from the link https://bit.ly/2J2OmTV

Our code for the 77 regression models is now available. Download it from https://bit.ly/2Y9LyI5

You can try the downloaded code on your own regression problem.

Kindly follow my blog and stay tuned for a more advanced post on dataset splitting.

Thank you!