Saturday, May 11, 2019

Root Mean Square Error (RMSE)


How to calculate Root-Mean-Square Error?



This post covers one of the most common ways to evaluate a regression model. The idea of a regression model is to predict real-valued (continuous) outputs, not categorical or ordinal values. There are several ways to measure the error of a regression model, but here we will consider only the RMSE measure. Let's look at the major points covered in this post.

  • What is residual?
  • Root Mean Square Error (RMSE)
  • How to calculate RMSE?

We will take a dataset which contains x and y values, where x is the input and y is the output. Let's take the input values as 1, 2, 2, 3 and the output values as 1, 2, 3, 4 respectively. Figure 1 shows the regression model ŷ = 1.5x - 0.5, where 1.5 is the slope of the regression line and -0.5 is the intercept. In this case, it is a linear model.


Figure 1
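The slope and intercept shown in Figure 1 can be reproduced with an ordinary least-squares fit. Here is a minimal sketch in plain Python (the variable names are mine, and I am assuming the line was fit by ordinary least squares, which these four points are consistent with):

```python
# Sketch: closed-form ordinary least squares for a simple linear
# regression y = slope*x + intercept, using the post's four data points.
x = [1, 2, 2, 3]
y = [1, 2, 3, 4]

n = len(x)
x_mean = sum(x) / n  # 2.0
y_mean = sum(y) / n  # 2.5

# slope = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / \
        sum((xi - x_mean) ** 2 for xi in x)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # 1.5 -0.5
```

This recovers exactly the line ŷ = 1.5x - 0.5 used in the rest of the post.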


Let's move on to the residual.

It is the difference between the actual and predicted values; mathematically, it can be represented as residual = yi - ŷi for the ith data point. Our goal is always to minimize such errors. A residual can be negative or positive. Let's see RMSE.

Root Mean Square Error (RMSE) can be thought of as the standard deviation of the residuals, and it is a commonly used measure. RMSE values can only be compared between models whose errors are measured in the same units. Let's define the mathematical equation of RMSE:

RMSE = √( (1/n) Σ (yi - ŷi)² ), where the sum runs over the i = 1, …, n data points.

The equation tells us: take the error of each data point, square it, add the squares together, take the average, and then take the square root of that. Let's move on to how to calculate this measure. You can also easily compute it using functions available in R or Python.
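The steps just described can be written as a short Python function (a minimal sketch; the function name `rmse` is mine, not from any particular library):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Predictions from the line y_hat = 1.5*x - 0.5 for x = 1, 2, 2, 3.
y_actual = [1, 2, 3, 4]
y_pred = [1.5 * xi - 0.5 for xi in [1, 2, 2, 3]]  # [1.0, 2.5, 2.5, 4.0]

print(round(rmse(y_actual, y_pred), 4))  # 0.3536
```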

How do we compute RMSE? First, we compute the residual for each data point. The following regression plots will give you a better understanding of how to calculate the residual for the given four data points.
For the first data point, x = 1 and y = 1.
The plot below shows that ŷ is calculated by placing x = 1 into the equation ŷ = 1.5x - 0.5, which gives ŷ = 1. Here the actual and predicted values are the same, so residual = yi - ŷi = 1 - 1 = 0.

Figure 2

For the second data point, x = 2 and y = 2.
The plot below shows that ŷ is calculated by placing x = 2 into the equation ŷ = 1.5x - 0.5, which gives a predicted output of 2.5, so residual = yi - ŷi = 2 - 2.5 = -0.5. As you can see, the residual is negative here. When the actual value lies below the regression line, the residual is negative.


Figure 3

For the third data point, x = 2 and y = 3.
The plot below shows that ŷ is calculated by placing x = 2 into the equation ŷ = 1.5x - 0.5, which again gives a predicted value of 2.5. Hence residual = yi - ŷi = 3 - 2.5 = 0.5. Here we have a positive residual.


Figure 4


For the fourth data point, x = 3 and y = 4.
The plot below shows that ŷ is calculated by placing x = 3 into the equation ŷ = 1.5x - 0.5, which gives a predicted outcome of 4, so residual = yi - ŷi = 4 - 4 = 0. For this data point, the actual and predicted values are the same, namely 4.


Figure 5
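The four residuals walked through above can also be computed in one short Python sketch (the variable names are mine):

```python
# Sketch: residual y - y_hat for each data point under y_hat = 1.5*x - 0.5.
data = [(1, 1), (2, 2), (2, 3), (3, 4)]  # (x, y) pairs from the post

residuals = [y - (1.5 * x - 0.5) for x, y in data]
print(residuals)  # [0.0, -0.5, 0.5, 0.0]
```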

Up to this point we have taken the error of each data point. Now we square the residuals, add them together, and take the average:

(0² + (-0.5)² + 0.5² + 0²) / 4 = (0 + 0.25 + 0.25 + 0) / 4 = 0.5 / 4 = 0.125

Finally, let's take the square root of 0.125:

RMSE = √0.125 ≈ 0.3536
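The same arithmetic as a short Python check, step by step (variable names are mine):

```python
import math

residuals = [0.0, -0.5, 0.5, 0.0]      # residuals from the four plots above

squared = [r ** 2 for r in residuals]  # [0.0, 0.25, 0.25, 0.0]
total = sum(squared)                   # 0.5
mean = total / len(residuals)          # 0.125
rmse = math.sqrt(mean)

print(round(rmse, 4))  # 0.3536
```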


This way we can quantify how much our machine learning model disagrees with the actual data.


Kindly follow my blog and stay tuned for more advanced posts on regression measures.

Thank you!

