Data Science and Machine Learning

Tuesday, March 28, 2023

What is Support Vector Machine?

Support Vector Machine (SVM) is a supervised machine learning algorithm that is widely used in classification, regression, and outlier detection problems. SVM is based on the concept of finding the optimal hyperplane that separates different classes in the feature space.

In detail, an SVM finds the hyperplane that maximizes the margin between the two classes. The margin is the distance between the hyperplane and the nearest data points from each class (the support vectors), and the hyperplane with the largest margin is chosen as the optimal one. For data that is not linearly separable, the SVM uses a kernel function to work implicitly in a higher-dimensional feature space where a separating hyperplane is easier to find, so it can handle both linear and non-linear classification problems.

How the algorithm works:

First, the algorithm takes the input data and, when a non-linear kernel is used, implicitly maps it into a higher-dimensional space. The kernel function compares points as if they had been transformed into a new space where the classes are easier to separate with a hyperplane, without constructing that space explicitly (the kernel trick).
Next, the algorithm finds the hyperplane that maximizes the margin between the two classes. The margin is the distance between the hyperplane and the closest data points from each class.
The algorithm then predicts the class of new data points by determining which side of the hyperplane they fall on.
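
The steps above can be sketched on a tiny example. The six 2-D points below are made up for illustration, and the large C value is an assumption chosen to approximate a hard margin on separable data; this is a minimal sketch, not a recommended configuration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (invented for this illustration).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

# The support vectors are the training points closest to the hyperplane.
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)

# Prediction: which side of the hyperplane does a new point fall on?
new_points = np.array([[1.0, 2.0], [5.0, 5.0]])
scores = clf.decision_function(new_points)  # signed score; the sign gives the side
predictions = clf.predict(new_points)
print(scores, predictions)
```

A positive decision score corresponds to the second class (label 1 here) and a negative score to the first, which is the "side of the hyperplane" rule described above.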

Advantages of SVM include:

SVM can handle both linear and non-linear classification problems by using different kernel functions such as linear, polynomial, radial basis function (RBF), and sigmoid.
SVM can handle high-dimensional data and can perform well even when the number of features is greater than the number of samples.
SVM has a regularization parameter that helps to avoid overfitting and improve the generalization performance of the model.
SVM can handle both binary and multi-class classification problems by using different strategies such as one-vs-one and one-vs-all.

Disadvantages of SVM include:

SVM can be sensitive to the choice of kernel function and its parameters. Choosing the right kernel function and its parameters can be a challenging task.
SVM can be computationally expensive to train, especially for large datasets: training time for kernel SVMs grows faster than linearly with the number of samples.
SVM can be sensitive to outliers in the data and may result in a suboptimal solution.
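
One common way to deal with the sensitivity to kernel and parameter choice is a cross-validated grid search. The sketch below is only an illustration: the feature scaling step and the particular grid values are assumptions, not settings taken from this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale features first (an assumption here): SVMs are sensitive to feature scale.
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; gamma is ignored by the linear kernel.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": ["scale", 0.01],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print("best parameters:", best_params)
print("test accuracy: {:.2f}".format(test_accuracy))
```

The grid search only helps with the first disadvantage; it also multiplies the training cost, so the grid should stay small on large datasets.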

An example of building a simple SVM model using Python's scikit-learn library:

First, let's load the dataset and split it into training and testing sets:

from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split

# Load data

cancer = load_breast_cancer()

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=42)

Next, let's create an SVM model with a radial basis function (RBF) kernel and a regularization parameter of 1.0:

from sklearn.svm import SVC

# Create SVM model

svc = SVC(kernel='rbf', C=1.0)

We can train the model on the training data using the fit method:

# Train SVM model on training data

svc.fit(X_train, y_train)

We can then use the model to make predictions on the testing data using the predict method:

# Make predictions on testing data

y_pred = svc.predict(X_test)

Finally, we can evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1-score:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate evaluation metrics

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)

# Print evaluation metrics

print('Accuracy: {:.2f}'.format(accuracy))

print('Precision: {:.2f}'.format(precision))

print('Recall: {:.2f}'.format(recall))

print('F1-score: {:.2f}'.format(f1))

This will output the evaluation metrics for the SVM model on the testing data. Because random_state=42 fixes the split into training and testing sets, the values are reproducible from run to run; they will change if you change or remove random_state.

In this example, we first load the breast cancer dataset from scikit-learn's built-in datasets. We split the data into training and testing sets using the train_test_split function. We create an SVM model with an RBF kernel and a regularization parameter of 1.0. We train the SVM model on the training data using the fit method. We then use the trained model to predict the classes of the testing data using the predict method. Finally, we calculate the accuracy, precision, recall, and F1-score of the model on the testing data and print the results.

Monday, March 27, 2023

What is Random Forests?

Random Forests is a popular machine learning algorithm used for both regression and classification tasks. It is an ensemble method that combines multiple decision trees to make more accurate predictions.

How the algorithm works:

Data Preparation: Random Forests can handle both categorical and continuous data. It requires a labeled dataset with both input features and output labels.
Feature Selection: Random Forests considers only a random subset of the features when building each decision tree (in most implementations, a fresh random subset at every split). This makes the trees less correlated with each other, which helps to avoid overfitting and improves the performance of the algorithm.
Build Decision Trees: Random Forests builds multiple decision trees. Each tree is trained on a random sample of the data, drawn with replacement, and uses the random feature subsets described in the previous step.
Voting: When making a prediction, Random Forests takes the input features and runs them through each decision tree in the forest. Each tree returns a prediction, and the final prediction is made by a majority vote of the individual tree predictions for classification, or by averaging them for regression.
Evaluation: The performance of a Random Forest is evaluated using a metric that is appropriate for the problem at hand. For example, for a regression problem, one could use mean squared error (MSE), while for a classification problem, one could use accuracy or F1 score.
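
To make the sampling and voting steps concrete, here is a hand-rolled sketch built from plain decision trees. It is an illustration under stated assumptions (a synthetic dataset, 25 trees, hard majority voting) and not how scikit-learn implements it internally: RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt": each split looks at a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Every tree votes on every test point; the majority class wins (labels are 0 and 1).
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)

ensemble_accuracy = (majority == y_test).mean()
print("ensemble accuracy:", ensemble_accuracy)
```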

Advantages of Random Forests:

Random Forests can handle both categorical and continuous data.
Some implementations can handle missing data directly; with others, missing values need to be imputed first.
Random Forests are resistant to overfitting because of feature selection and bagging.
It can be used for both classification and regression tasks.
It can handle high dimensional data with a large number of features.
It provides an estimate of feature importance.
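
As a small illustration of the last point, scikit-learn exposes the importance estimate through the feature_importances_ attribute of a fitted forest. The synthetic dataset below is an assumption chosen so that only two of the four features carry signal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 4 features, of which only 2 are informative (synthetic, for illustration).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = rf.feature_importances_
for i, value in enumerate(importances):
    print("feature {}: {:.3f}".format(i, value))
```

These impurity-based scores are a rough guide rather than a definitive ranking; they are the same estimate that can be biased towards features with many distinct values.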

Disadvantages of Random Forests:

Random Forests can be slow to train on large datasets with a large number of trees.
The model can be difficult to interpret because of the large number of decision trees.
Random Forests can be biased towards features with many categories.

Random Forests is a powerful machine learning algorithm that is widely used for both classification and regression tasks. It combines multiple decision trees to make more accurate predictions and is resistant to overfitting. However, it can be slow to train on large datasets, and the model can be difficult to interpret.

An example of building a simple random forest model using Python's scikit-learn library:

1. First, let's import the necessary libraries:

from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

2. Next, let's generate a sample dataset using make_classification:

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=42)

3. Here, we generate a dataset with 1000 samples, 4 features, 2 informative features, and 0 redundant features. Now, let's split the dataset into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Here, we use 20% of the dataset for testing. Now, let's create a random forest classifier and fit it to the training data:

rf = RandomForestClassifier(n_estimators=100, random_state=42)

rf.fit(X_train, y_train)

5. Here, we create a random forest classifier with 100 trees and fit it to the training data. Finally, let's evaluate the performance of the model on the testing data:

print("Accuracy:", rf.score(X_test, y_test))

This will print the accuracy of the model on the testing data.

And that's it! You've built a simple random forest model using scikit-learn. Of course, you can modify the parameters of the random forest classifier to improve its performance or adapt it to your specific needs.

How Machine Learning can be used with Blockchain Technology?

The integration of blockchain technology with machine learning has become an emerging topic in recent years. Blockchain technology, known for its security and immutability, can provide a secure and transparent way to store and manage large amounts of data. Machine learning, on the other hand, is a subset of artificial intelligence that allows computers to learn from data and make predictions or decisions without being explicitly programmed. The integration of these two technologies can provide a powerful tool for solving various problems related to data management, privacy, and security.

Here are some of the ways in which blockchain technology can be integrated with machine learning:

Data Management: One of the key challenges in machine learning is the management of large amounts of data. Blockchain technology can help by providing a secure and decentralized way to store, access, and share data. This can be especially useful in scenarios where data privacy and security are critical, such as in the healthcare industry or financial sector. By using blockchain technology, machine learning models can access data from multiple sources without compromising privacy or security.
Data Verification and Auditability: Blockchain technology is known for its transparency and immutability, which makes it a good fit for data verification and auditability. This is particularly useful in scenarios where the authenticity and integrity of data are critical. By using blockchain technology, machine learning models can verify the authenticity of data before using it for training or making predictions.
Decentralized Machine Learning: The traditional machine learning approach involves training models on centralized servers, which can be vulnerable to attacks or data breaches. By using blockchain technology, machine learning can be decentralized, which means that training and inference can be done on multiple nodes, making the system more secure and resilient to attacks. This approach also allows for a collaborative learning environment where different parties can contribute to the training process while maintaining data privacy and security.
Smart Contracts are self-executing contracts with the terms of the agreement between buyer and seller being directly written into lines of code. These contracts can be used to automate the execution of certain tasks in the machine learning process, such as data acquisition, pre-processing, and model training. By using smart contracts, the machine learning process can be automated and made more efficient.
Tokenization is the process of converting real-world assets or data into digital tokens that can be traded on blockchain networks. In the context of machine learning, tokenization can be used to incentivize data sharing and collaboration. By using tokens, data providers can be rewarded for sharing their data with others, which can lead to the creation of a more collaborative and decentralized machine learning ecosystem.
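
The data verification idea above can be sketched with a toy hash-linked ledger. This is not a real blockchain (there is no network, consensus, or smart contract), and every name in it (fingerprint, add_block, chain_is_valid, data_is_authentic, the dataset labels) is hypothetical and invented for this illustration. It only shows the core mechanism: each record stores a hash of a dataset and a hash of the previous record, so tampering can be detected before the data is used for training.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def add_block(chain, dataset_name, data):
    """Append a record that links to the previous record by its hash."""
    prev_hash = chain[-1]["block_hash"] if chain else "0" * 64
    record = {
        "dataset": dataset_name,
        "data_hash": fingerprint(data),
        "prev_hash": prev_hash,
    }
    # The block hash covers the record contents, including the link backwards.
    record["block_hash"] = fingerprint(json.dumps(record, sort_keys=True).encode())
    chain.append(record)
    return record


def chain_is_valid(chain):
    """Recompute every hash and check every backward link."""
    prev_hash = "0" * 64
    for block in chain:
        body = {k: v for k, v in block.items() if k != "block_hash"}
        if block["prev_hash"] != prev_hash:
            return False
        if block["block_hash"] != fingerprint(json.dumps(body, sort_keys=True).encode()):
            return False
        prev_hash = block["block_hash"]
    return True


def data_is_authentic(chain, dataset_name, data):
    """True if this exact data was registered under this name."""
    digest = fingerprint(data)
    return any(b["dataset"] == dataset_name and b["data_hash"] == digest for b in chain)


chain = []
training_csv = b"x,y\n1,1\n2,2\n"
add_block(chain, "train_v1", training_csv)
add_block(chain, "train_v2", b"x,y\n1,1\n2,3\n")

print(chain_is_valid(chain))                                     # ledger is intact
print(data_is_authentic(chain, "train_v1", training_csv))        # registered data
print(data_is_authentic(chain, "train_v1", b"x,y\n1,1\n2,9\n"))  # altered data
```

A training pipeline could call data_is_authentic before fitting a model and refuse to train on data whose hash is not on the ledger.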

The integration of blockchain technology with machine learning can provide a secure, transparent, and decentralized way to manage data, verify its authenticity, and train models. This can lead to the creation of more efficient and collaborative machine learning ecosystems that can address a wide range of real-world problems.

Tuesday, January 21, 2020

Regression Methods

List of Regression Methods

This post walks you through the machine learning families and lists the regression methods with their references. It covers almost all methods, so it should be useful for anyone who wants to know the names of ML or statistical regression techniques and the family each one belongs to.

Thank you!
For more details, please refer to the research paper below.

Reference:

Fernandez-Delgado, M., Sirsat, M., Cernadas, E., Alawadi, S., Barro, S., Febrero-Bande, M., 2019. An extensive experimental survey of regression methods. Neural Networks 111, 11–34.

Sunday, December 22, 2019

Exploratory Data Analysis (EDA)

Exploratory Data Analysis in Data Science

This post is about data plotting and visualization methods for data analysis. Each technique is explained with an example using the iris dataset. You can also find other posts on this blog covering the machine learning modelling process, the list of regression techniques, the confusion matrix, and root-mean-squared error. If you are interested in reading research papers, please refer to the post on classifier and regression models, where you can find code for 77 regression models. In this post, you will learn about some important data visualization techniques, mainly:

2-D scatter plot

3-D scatter plot

Pair plot

Histogram

Box-plot

Let's first talk about the simple iris dataset. It has 4 features (independent variables or predictors): sepal length, sepal width, petal length, and petal width, so each data point is 4-dimensional. The response (dependent variable or class label) is one of virginica, setosa, and versicolor. The dataset has 150 data points and is balanced, because each class has the same number of data points, i.e. 50 per class. The dataset is available through sklearn.datasets in Python. To implement the visualization techniques mentioned above in Python, you need the pandas, seaborn, matplotlib, and numpy libraries.

1. 2-D scatter plot:

Fig. 1 shows a 2-D scatter plot of sepal_length and sepal_width. The blue points of the setosa class are easily separable from the green and orange data points by drawing a straight line. However, the versicolor and virginica classes are not easily separable with this 2-D feature combination (sepal_length and sepal_width). In this case, we can try other combinations, for instance petal_length and petal_width.

Fig. 1 2-D scatter plot of sepal_length and sepal_width
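
A plot like Fig. 1 can be sketched with matplotlib as follows. This is not the code behind the original figure: it loads iris from sklearn.datasets, and the output file name iris_sepal_scatter.png is an arbitrary choice for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window instead
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # columns: sepal length, sepal width, petal length, petal width

fig, ax = plt.subplots()
# One scatter call per class so that each class gets its own color and legend entry.
for label, name in enumerate(iris.target_names):
    mask = y == label
    ax.scatter(X[mask, 0], X[mask, 1], label=name)

ax.set_xlabel(iris.feature_names[0])  # sepal length (cm)
ax.set_ylabel(iris.feature_names[1])  # sepal width (cm)
ax.legend()
fig.savefig("iris_sepal_scatter.png")
```

Changing the column indices 0 and 1 to 2 and 3 gives the petal_length and petal_width combination suggested above.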

2. 3-D scatter plot:

It plots data points in 3-dimensional space. The disadvantage of a 3-D plot is that it takes a lot of interaction with the plot (rotating it) to interpret, so it is not a very convenient technique.

Fig. 2 3-D scatter plot of petal_length, sepal_length and petal_width from iris dataset

3. Pair plot:

We cannot draw a 4-D scatter plot, so we use a pair plot instead. It is a good way to avoid checking many combinations with 2-D scatter plots or making many mouse interactions with a 3-D scatter plot. A dataset with 4, 5, 6, or 7 dimensions can easily be interpreted with a pair plot; however, it is not a good option if there are more dimensions than that. Fig. 3 shows that petal_width and petal_length are the two most influential predictors for identifying the class labels: with them, setosa is linearly separable from versicolor and virginica. The diagonal elements are the Probability Density Functions (PDF) of each feature.

Fig. 3 Pair plot of iris dataset
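
A pair plot similar to Fig. 3 can be sketched with pandas. This is an assumption-laden stand-in for the original figure, which may well have been drawn with seaborn: here pandas.plotting.scatter_matrix is used, points are colored by class index, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window instead
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.data  # DataFrame with the 4 feature columns

# 4 x 4 grid: scatter plots off the diagonal, a smoothed density (KDE) on the diagonal.
axes = pd.plotting.scatter_matrix(
    features, c=iris.target, diagonal="kde", figsize=(8, 8)
)
axes[0, 0].get_figure().savefig("iris_pair_plot.png")
```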

4. Histogram:

A histogram represents the distribution of the data points, and it is a good way to visualize a single feature (1-D). Let's take sepal_length as an example, shown in Fig. 4. The x-axis is sepal_length and the y-axis is the count of data points with that sepal_length. Light blue, orange, and green are the histograms of sepal_length for the setosa, versicolor, and virginica flower types, respectively. The histogram tells us how many data points fall in a given window, for example between 4 and 6. It shows that the largest number of setosa flowers (around 15) occurs when sepal_length is about 5. The height of a bar shows how often we find a particular flower type at a given sepal_length. The smooth line is called the PDF, and it is a smoothed form of the histogram.

Fig. 4 Histogram of sepal_length
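
The counts behind a histogram like Fig. 4 can be computed without plotting. The sketch below uses numpy; the choice of 8 bins is an assumption, so the exact counts per bin will not necessarily match the figure.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]
setosa = sepal_length[iris.target == 0]  # class index 0 is setosa

# counts[i] is the number of setosa flowers whose sepal length lies in bin i.
counts, edges = np.histogram(setosa, bins=8)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print("{:.2f} to {:.2f}: {}".format(left, right, count))
```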

5. Box-plot:

A box plot is another technique for visualizing 1-D data. It summarizes the data using the median and other percentiles. By looking at Fig. 4, we cannot tell what the 25th, 50th, or 75th percentile of setosa sepal_length is. To find that out, we use a box plot. In Fig. 5, the x-axis shows the flower types, with one box for each class label, and the y-axis is sepal_length. Take the green box: it tells us the 25th, 50th, and 75th percentile values of sepal length for virginica. Whiskers often show the minimum and maximum value of the feature for each class; however, there is no single standard way to draw them. Besides, a box plot helps us write simple rules and find misclassifications or errors.

Fig. 5 Box-plot of sepal_length
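
The three values that each box in Fig. 5 encodes can be computed directly. This is a minimal sketch with numpy; the quartiles dictionary is just a name chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]

quartiles = {}
for label, name in enumerate(iris.target_names):
    values = sepal_length[iris.target == label]
    # Bottom edge of the box, line inside the box (median), top edge of the box.
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    quartiles[name] = (q25, q50, q75)
    print("{}: 25th={:.2f}, 50th={:.2f}, 75th={:.2f}".format(name, q25, q50, q75))
```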

Kindly follow my blog and stay tuned for more advanced posts on ML.

Thank you!

Saturday, November 23, 2019

Machine Learning Overview

For an easy overview of ML, this post shows a cheat sheet of the types of ML, with some algorithms and examples.

Kindly follow my blog and stay tuned for more advanced posts on ML.

Thank you!

Reference:

Fernandez-Delgado, M., Sirsat, M., Cernadas, E., Alawadi, S., Barro, S., Febrero-Bande, M., 2019. An extensive experimental survey of regression methods. Neural Networks 111, 11–34. URL: 10.1016/j.neunet.2018.12.010

Saturday, May 11, 2019

Root Mean Square Error (RMSE)

How to calculate Root-Mean-Square Error?

This post covers one of the most common ways to evaluate a regression model. The idea of a regression model is to predict real-valued (continuous) outputs, not categorical or ordinal values. There are several ways to measure the error of a regression model, but here we will only consider the RMSE measure. Let's look at the major points covered by the post.

What is residual?
Root Mean Square Error (RMSE)
How to calculate RMSE?

We will take a dataset that contains x and y values, where x is the input and y is the output. Let's take the input and output values as 1, 2, 2, 3 and 1, 2, 3, 4, respectively. Figure 1 shows the regression model ŷ = 1.5 x - 0.5. The value 1.5 is the slope of the regression line and -0.5 is the intercept. In this case, it is a linear model.

Figure 1

Let's move on to the residual.

It is the difference between the actual and predicted values; mathematically, it can be represented as Residual = yi - ŷi for the ith data point. Our goal is always to minimize this error. It can be negative or positive. Let's see RMSE.

Root Mean Square Error (RMSE) can be thought of as the standard deviation of the residuals, and it is a commonly used measure. It can only be compared between models whose errors are measured in the same units. Let's define the mathematical equation of RMSE.
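
Written in LaTeX, the standard definition of RMSE over n data points is given below, using the same yi and ŷi as in the residual definition above. The symbol n for the number of data points is introduced here for the formula.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```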

The equation tells us to take the error of each data point, square it, add the squares together, take the average, and then take the square root. You can also compute it easily using functions available in R or Python. Let's move on to how to calculate this measure by hand.

How to compute RMSE? First, we will compute the residual for each data point. The following regression plots show how the residual is calculated for each of the four data points.

For first data point, x=1 and y=1.

The plot below shows that ŷ is calculated by substituting x = 1 into the equation ŷ = 1.5 x - 0.5, which gives 1. Here, the actual and predicted values are the same, so residual = yi - ŷi = (1 - 1) = 0.

Figure 2

For second data point, x=2 and y=2.

The plot below shows that ŷ is calculated by substituting x = 2 into the equation ŷ = 1.5 x - 0.5, which gives a predicted output of 2.5; therefore, residual = yi - ŷi = (2 - 2.5) = -0.5. As you can see, the residual is negative here. When the actual data point lies below the regression line, the residual is negative.

Figure 3

For third data point, x=2 and y=3.

The plot below shows that ŷ is calculated by substituting x = 2 into the equation ŷ = 1.5 x - 0.5, which gives a predicted value of 2.5. Hence residual = yi - ŷi = (3 - 2.5) = 0.5. Here, we have a positive residual because the data point lies above the regression line.

Figure 4

For fourth data point, x=3 and y=4.

The plot below shows that ŷ is calculated by substituting x = 3 into the equation ŷ = 1.5 x - 0.5, which gives a predicted outcome of 4, so residual = yi - ŷi = (4 - 4) = 0. For this data point, the actual and predicted values are the same, namely 4.

Figure 5

So far, we have computed the error of each data point. Now, we square each error, add the squares together, and take the average: (0² + (-0.5)² + 0.5² + 0²) / 4 = 0.5 / 4 = 0.125.

Finally, we take the square root of 0.125, which gives RMSE ≈ 0.354.

This way we can know how much our machine learning model disagrees with the actual data.
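
The hand calculation above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the helper name predict is introduced here for illustration.

```python
import math

# The four data points from this post.
x = [1, 2, 2, 3]
y = [1, 2, 3, 4]


def predict(x_value):
    """Regression line from Figure 1: y_hat = 1.5 * x - 0.5."""
    return 1.5 * x_value - 0.5


# Residual = actual - predicted, for each data point.
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

# Square the residuals, average them, then take the square root.
mse = sum(r ** 2 for r in residuals) / len(residuals)
rmse = math.sqrt(mse)

print(residuals)       # [0.0, -0.5, 0.5, 0.0]
print(mse)             # 0.125
print(round(rmse, 4))  # 0.3536
```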

Kindly follow my blog and stay tuned for a more advanced post on the list of regression measures.

Thank you!