Showing posts with label Machine Learning measure. Show all posts

Monday, March 27, 2023

What is Random Forests?




Random Forests is a popular machine learning algorithm used for both regression and classification tasks. It is an ensemble method that combines multiple decision trees to make more accurate predictions.


How the algorithm works:

  1. Data Preparation: Random Forests requires a labeled dataset with both input features and output labels. The algorithm can work with both categorical and continuous data, although some implementations (scikit-learn, for example) expect categorical features to be encoded as numbers first.
  2. Feature Selection: Random Forests considers only a random subset of the features when choosing each split in a decision tree. This makes the trees less similar to one another, which helps to avoid overfitting and improves the performance of the algorithm.
  3. Build Decision Trees: Random Forests builds multiple decision trees. Each tree is trained on a bootstrap sample of the data (a random sample drawn with replacement) and uses the random feature subsets described in step 2.
  4. Voting: When making a prediction, Random Forests takes the input features and runs them through each decision tree in the forest. Each tree returns a prediction. For classification, the final prediction is the majority vote of the individual tree predictions; for regression, it is their average.
  5. Evaluation: The performance of a Random Forests model is evaluated with a metric that is appropriate for the problem at hand. For example, for a regression problem, one could use mean squared error (MSE), while for a classification problem, one could use accuracy or F1 score.
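
The steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is only a minimal illustration, not how RandomForestClassifier is implemented internally; the toy dataset, the number of trees (25), and the max_features="sqrt" setting are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary dataset (an assumption for illustration; any labeled data would do).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt": consider a random subset of features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Run the inputs through every tree, then take the majority vote per sample.
all_preds = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
votes = (all_preds.mean(axis=0) > 0.5).astype(int)   # labels are 0 or 1

print("Training accuracy of the hand-rolled forest:", (votes == y).mean())
```

For a regression problem, the same loop would use a regression tree and replace the vote with the mean of the tree predictions.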

Advantages of Random Forests:

  1. Random Forests can handle both categorical and continuous data.
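  2. It can handle missing data, although how (and whether) this is supported depends on the implementation.
  3. Random Forests are resistant to overfitting because of random feature selection and bagging (training each tree on a bootstrap sample of the data).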
  4. It can be used for both classification and regression tasks.
  5. It can handle high dimensional data with a large number of features.
  6. It provides an estimate of feature importance.
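
As a rough illustration of the last point, scikit-learn exposes the estimate through the fitted model's feature_importances_ attribute. The synthetic dataset below is an assumption made for the example; shuffle=False is used so that the two informative features are known to be columns 0 and 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the informative features come first,
# and the remaining columns are random noise.
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = rf.feature_importances_
for i, importance in enumerate(importances):
    print(f"feature {i}: {importance:.3f}")
```

On data like this, the two informative columns would be expected to receive most of the importance.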

Disadvantages of Random Forests:

  1. Random Forests can be slow to train on large datasets with a large number of trees.
  2. The model can be difficult to interpret because of the large number of decision trees.
  3. Random Forests can be biased towards features with many categories; in particular, impurity-based feature importance estimates tend to favor such features.

Random Forests is a powerful machine learning algorithm that is widely used for both classification and regression tasks. It combines multiple decision trees to make more accurate predictions and is resistant to overfitting. However, it can be slow to train on large datasets, and the model can be difficult to interpret.


An example of building a simple random forest model using Python's scikit-learn library:


1. First, let's import the necessary libraries:


from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


2. Next, let's generate a sample dataset using make_classification:


X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=42)


3. Here, we generate a dataset with 1000 samples and 4 features, of which 2 are informative and none are redundant (the remaining 2 are random noise). Now, let's split the dataset into training and testing sets:


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


4. Here, we use 20% of the dataset for testing. Now, let's create a random forest classifier and fit it to the training data:


rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)


5. Here, we create a random forest classifier with 100 trees and fit it to the training data. Finally, let's evaluate the performance of the model on the testing data:


print("Accuracy:", rf.score(X_test, y_test)) 


This will print the accuracy of the model on the testing data.


And that's it! You've built a simple random forest model using scikit-learn. Of course, you can modify the parameters of the random forest classifier to improve its performance or adapt it to your specific needs.
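
As one possible way to try different parameters, the sketch below runs a small grid search with scikit-learn's GridSearchCV on the same kind of synthetic data. The grid values and the choice of 3-fold cross-validation are assumptions made for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Same synthetic setup as in the walkthrough above.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Small, arbitrary grid: number of trees and maximum tree depth.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
# score() uses the best model, refit on the whole training set.
print("Test accuracy:", search.score(X_test, y_test))
```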


Saturday, May 11, 2019

Root Mean Square Error (RMSE)


How to calculate Root-Mean-Square Error?



This post covers one of the most common ways to evaluate a regression model. A regression model predicts real-valued (continuous) numbers, not categorical or ordinal values. There are several ways to measure the error of a regression model, but here we will consider only the RMSE measure. Let's look at the major points covered in this post.

  • What is residual?
  • Root Mean Square Error (RMSE)
  • How to calculate RMSE?

We will take a dataset which contains x and y values, where x is the input and y is the output. Let the input values be 1, 2, 2, 3 and the output values be 1, 2, 3, 4, respectively. Figure 1 shows the regression model ŷ = 1.5 x - 0.5. The value 1.5 is the slope of the regression line and -0.5 is the intercept. In this case, it is a linear model.


Figure 1
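
The post does not say how the line was obtained. Assuming it is an ordinary least-squares fit, a quick check with NumPy's polyfit on the four data points should give the same slope and intercept:

```python
import numpy as np

# The four data points from the post.
x = np.array([1, 2, 2, 3], dtype=float)
y = np.array([1, 2, 3, 4], dtype=float)

# Fit a degree-1 polynomial (a straight line) by least squares.
# polyfit returns the coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# slope = 1.50, intercept = -0.50
```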


Let's move on to the residual.

It is the difference between the actual and predicted values. Mathematically, it can be represented as Residual = yi - ŷi for the ith data point. It can be negative or positive, and our goal is always to make such errors as small as possible. Let's see RMSE.

Root Mean Square Error (RMSE) can be thought of as the standard deviation of the residuals, and it is a commonly used measure. It can only be compared between models whose errors are measured in the same units. Let's define the mathematical equation of RMSE.

RMSE = √( (1/n) Σ (yi - ŷi)² ), where n is the number of data points

The equation tells us to take the error of each one of the data points, square it, add the squares together, take the average, and take the square root of that. Let's move on to how to calculate this measure by hand. You can also easily compute it using functions available in R or Python.

How to compute RMSE? First, we will compute the residual for each of the data points. The following regression plots will give you a better understanding of how to calculate the residual for each of the four data points.
For the first data point, x=1 and y=1.
The plot below shows that ŷ is calculated by substituting x = 1 into the equation ŷ = 1.5 x - 0.5, which gives 1. Here, the actual and predicted values are the same, so residual = yi - ŷi = (1 - 1) = 0.

Figure 2

For the second data point, x=2 and y=2.
The plot below shows that ŷ is calculated by substituting x = 2 into the equation ŷ = 1.5 x - 0.5, which gives a predicted output of 2.5. Therefore, residual = yi - ŷi = (2 - 2.5) = -0.5. As you can see, the residual is negative here. When the actual data point lies below the regression line, the residual is negative.


Figure 3

For the third data point, x=2 and y=3.
The plot below shows that ŷ is calculated by substituting x = 2 into the equation ŷ = 1.5 x - 0.5, which gives a predicted value of 2.5. Hence, residual = yi - ŷi = (3 - 2.5) = 0.5. Here, we have a positive residual because the actual data point lies above the regression line.


Figure 4


For the fourth data point, x=3 and y=4.
The plot below shows that ŷ is calculated by substituting x = 3 into the equation ŷ = 1.5 x - 0.5, which gives a predicted outcome of 4. Then residual = yi - ŷi = (4 - 4) = 0. For this data point, the actual and predicted values are the same, namely 4.


Figure 5
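
The four residuals worked out above can be reproduced with a few lines of Python. The helper name predict is only an illustrative choice; it applies the regression line ŷ = 1.5 x - 0.5 from the post.

```python
# The four data points from the post.
x = [1, 2, 2, 3]
y = [1, 2, 3, 4]

def predict(xi):
    """Predicted value from the regression line y_hat = 1.5 * x - 0.5."""
    return 1.5 * xi - 0.5

# Residual = actual value - predicted value, for each data point.
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

print(residuals)  # [0.0, -0.5, 0.5, 0.0]
```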

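Up to this point, we have computed the error of each one of the data points. Now, we will square the errors, add them together, and then take the average.

Mean of squared errors = (0² + (-0.5)² + 0.5² + 0²) / 4 = 0.5 / 4 = 0.125

Let's take the square root of 0.125:

RMSE = √0.125 ≈ 0.354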


This way we can know how much our machine learning model disagrees with the actual data.
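
The hand calculation can be checked in Python. The sketch below follows the same steps (square, average, square root) in a small function whose name, rmse, is chosen here for illustration, and then cross-checks the result against scikit-learn's mean_squared_error.

```python
import math

from sklearn.metrics import mean_squared_error

def rmse(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mean_of_squares = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_of_squares)

y = [1, 2, 3, 4]              # actual values
y_hat = [1.0, 2.5, 2.5, 4.0]  # predictions from y_hat = 1.5 * x - 0.5

manual = rmse(y, y_hat)
# scikit-learn returns the mean squared error; its square root is the RMSE.
library = math.sqrt(mean_squared_error(y, y_hat))

print(round(manual, 4))   # 0.3536
print(round(library, 4))  # 0.3536
```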


Kindly follow my blog and stay tuned for more advanced posts on other regression measures.

Thank you!


What is Support Vector Machine?

  What is Support Vector Machine? Support Vector Machine (SVM) is a supervised machine learning algorithm that is widely used in classificat...