Data Science and Machine Learning : April 2019

Monday, April 29, 2019

Confusion Matrix

What Is a Confusion Matrix and What Are Advanced Classification Metrics?

After data preparation and model training comes the model evaluation phase, which I mentioned in my earlier article, Simple Picture of Machine Learning Modelling Process.

Once a model is developed, the next phase is to measure its performance using evaluation metrics. Although there are many classification metrics out there, this article covers only the confusion matrix and the metrics derived from it.

It focuses mainly on the following points:

What is a confusion matrix?
The four outputs in a confusion matrix
Advanced classification metrics

Table 1. Confusion matrix with advanced classification metrics

A confusion matrix is a tool for measuring the performance of a classifier. It contains information about the actual and predicted classifications. The table below shows the confusion matrix of a two-class (spam and non-spam) classifier.

Table 2. Confusion matrix of email classification

Let’s understand the four outputs in a confusion matrix.

1. True Positive (TP) is the number of correct predictions that an example is positive, meaning an example of the positive class is correctly identified as positive.
Example: The actual class is spam, and the classifier correctly predicted it as spam.

2. False Negative (FN) is the number of incorrect predictions that an example is negative, meaning an example of the positive class is incorrectly identified as negative.
Example: The actual class is spam, but the classifier incorrectly predicted it as non-spam.

3. False Positive (FP) is the number of incorrect predictions that an example is positive, meaning an example of the negative class is incorrectly identified as positive.
Example: The actual class is non-spam, but the classifier incorrectly predicted it as spam.

4. True Negative (TN) is the number of correct predictions that an example is negative, meaning an example of the negative class is correctly identified as negative.
Example: The actual class is non-spam, and the classifier correctly predicted it as non-spam.
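To make the four outputs concrete, here is a minimal Python sketch that counts them from a list of actual labels and a list of predicted labels. The function name `confusion_counts` and the six toy emails are my own illustrative assumptions; they are not the dataset used in this article.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, FN, FP and TN for a two-class problem."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if truth == positive:
            # Positive class: predicted correctly (TP) or missed (FN).
            counts["TP" if guess == positive else "FN"] += 1
        else:
            # Negative class: raised a false alarm (FP) or predicted correctly (TN).
            counts["FP" if guess == positive else "TN"] += 1
    return {name: counts[name] for name in ("TP", "FN", "FP", "TN")}

# Hypothetical toy labels, only for illustration.
actual    = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]
predicted = ["spam", "non-spam", "spam", "spam", "non-spam", "non-spam"]

toy_counts = confusion_counts(actual, predicted)
print(toy_counts)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 2}
```

Every example lands in exactly one of the four cells, so the four counts always add up to the size of the dataset.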

Now, let’s look at some advanced classification metrics based on the confusion matrix. These metrics are expressed mathematically in Table 1 and illustrated with the email classification example shown in Table 2. The classification problem has spam and non-spam classes, and the dataset contains 100 examples: 65 spam and 35 non-spam.

Sensitivity is also referred to as True Positive Rate or Recall. It measures the proportion of positive examples that the classifier labels as positive, so higher is better. For instance, it is the proportion of spam emails correctly classified as spam among all spam emails.

Table 3. Sensitivity in confusion matrix

Sensitivity = 45/(45+20) = 69.23%

That is, 69.23% of the spam emails are correctly classified as spam.

Specificity is also known as True Negative Rate. It measures the proportion of negative examples that the classifier labels as negative, so it should also be high. For instance, it is the proportion of non-spam emails correctly classified as non-spam among all non-spam emails.

Table 4. Specificity in confusion matrix

Specificity = 30/(30+5) = 85.71%

That is, 85.71% of the non-spam emails are correctly classified as non-spam.
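The sensitivity and specificity calculations can be reproduced with a few lines of Python. This is only a sketch: the counts are the ones used in the formulas of this article (45 true positives, 20 false negatives, 5 false positives, 30 true negatives), while the function names are my own.

```python
# Counts from the email example in this article.
TP, FN, FP, TN = 45, 20, 5, 30

def sensitivity(tp, fn):
    """True Positive Rate: share of actual positives predicted as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True Negative Rate: share of actual negatives predicted as negative."""
    return tn / (tn + fp)

print(f"Sensitivity: {sensitivity(TP, FN):.2%}")  # Sensitivity: 69.23%
print(f"Specificity: {specificity(TN, FP):.2%}")  # Specificity: 85.71%
```

Both rates are computed within a single actual class: sensitivity only looks at the 65 spam emails, and specificity only looks at the 35 non-spam emails.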

Precision is the ratio of the number of correctly classified positive examples to the total number of predicted positive examples. It shows how correct the positive predictions are.

Table 5. Precision in confusion matrix

Precision = 45/(45+5) = 90%

That is, 90% of the examples classified as spam are actually spam.

Accuracy is the proportion of the total number of predictions that are correct.

Table 6. Accuracy in confusion matrix

Accuracy = (45+30)/(45+20+5+30) = 75%

That is, 75% of the examples are correctly classified by the classifier.
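Precision and accuracy can be checked the same way. The sketch below uses the counts from the formulas in this article; the function names are my own.

```python
# Counts from the email example in this article.
TP, FN, FP, TN = 45, 20, 5, 30

def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)

def accuracy(tp, fn, fp, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)

print(f"Precision: {precision(TP, FP):.2%}")        # Precision: 90.00%
print(f"Accuracy: {accuracy(TP, FN, FP, TN):.2%}")  # Accuracy: 75.00%
```

Note that precision is computed over the predicted spam emails (45 + 5), whereas accuracy is computed over all 100 emails.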

F1 score is the harmonic mean of recall (sensitivity) and precision. It can be a good choice when you want to balance precision and recall.

It combines recall and precision in one equation, which makes it easier to compare models that have low recall and high precision, or vice versa.
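The standard formula is F1 = 2 × Precision × Recall / (Precision + Recall). The sketch below applies it to the email example and to a second, purely hypothetical model with high precision and low recall, to show how F1 penalizes a lopsided pair far more than a simple average would. The function and variable names are my own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Email example from this article: precision = 45/50, recall = 45/65.
spam_precision = 45 / (45 + 5)
spam_recall = 45 / (45 + 20)
spam_f1 = f1_score(spam_precision, spam_recall)
print(f"F1 (email example): {spam_f1:.2%}")  # F1 (email example): 78.26%

# Hypothetical lopsided model: very precise, but it misses most positives.
lopsided_f1 = f1_score(0.95, 0.10)
simple_average = (0.95 + 0.10) / 2
print(f"F1: {lopsided_f1:.3f}, simple average: {simple_average:.3f}")
```

For the lopsided model the simple average looks respectable, while F1 stays close to the weaker of the two values, which is exactly why it is useful for comparing such models.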

Kindly follow my blog by email and stay tuned for more advanced posts on regression measures.

Thank you!

Saturday, April 27, 2019

Machine Learning Process

Simple Picture of Machine Learning Modelling Process

A learning machine is a computer algorithm that searches for patterns in massive data. This article walks you through the Machine Learning (ML) process step by step.

It covers the steps involved in the ML process from scratch and is most useful for beginners who want a complete picture of ML modelling.

Fig. 1: Overview of ML process

Figure 1 depicts the main components of the ML modelling process: data preparation, model selection, model development, model evaluation and deployment. Figure 2 shows a detailed view of all the components.

Fig. 2: Detail View of ML Modelling Process
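For readers who prefer code to diagrams, the components named above can be sketched end to end in plain Python. Everything here is a toy assumption of mine: the twelve (feature, label) pairs are made up, and the "model" is just a single threshold on one feature. It is meant to show the order of the phases, not a realistic modelling setup.

```python
# 1. Data preparation: hypothetical (feature, label) pairs, split into train and test.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1),
        (2, 0), (8, 1), (3, 0), (7, 1)]
train, test = data[:8], data[8:]

# 2. Model selection and development: a one-threshold classifier, keeping
#    the candidate threshold with the best accuracy on the training data.
def predict(threshold, x):
    return 1 if x >= threshold else 0

def accuracy(threshold, rows):
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)

candidates = sorted({x for x, _ in train})
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# 3. Model evaluation: measure performance on data the model has not seen.
test_accuracy = accuracy(best_threshold, test)
print(best_threshold, test_accuracy)  # 6 1.0

# 4. Deployment: expose the trained model as a plain function.
def deployed_model(x):
    return predict(best_threshold, x)
```

In a real project each phase is far more involved, but the hand-off between phases (prepared data, a chosen model, a measured score, a callable model) stays the same.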

Kindly follow my blog and stay tuned for more advanced posts on confusion matrix and advanced classification measures.

Thank you!

Regression Techniques

Regression Techniques By Their Machine Learning Families

Several Machine Learning (ML) algorithms and families are out there, and it is really interesting to know which algorithms come under each ML family. Knowing a variety of techniques can also speed up your modelling process.

In this post, I have tried to classify regression techniques according to their ML families. The following chart lists the name of each regression technique with its R package and reference.

Kindly follow my blog and stay tuned for more advanced posts on machine learning modelling.

Thank you!

Reference

An extensive experimental survey of regression methods. Neural Networks.

MapReduce Process

It's Easy to Learn the MapReduce Process

In this article, I have tried to cover the MapReduce process by explaining the Map and Reduce cycle. MapReduce is one of the core components of Hadoop and is designed to process huge amounts of data in parallel on commodity machines. Map turns traditional query, decomposition and data analysis tasks into distributed processing by allocating the work to different nodes.

The MapReduce model parallelizes the execution of many kinds of problems. Reduce combines the information coming from Map, computes the result sets and produces the reduced answer. MapReduce is the programming model of the Hadoop system and is used to process data-intensive workloads in less time.

Process flow of MapReduce

The figure shows that each of the map and reduce phases has key-value pairs as input and output. The shuffle phase uses the MapReduce library to distribute the outputs of the map phase evenly to the inputs of the reduce phase. The map phase runs a user-defined mapper function on a set of key-value pairs [kj, vj] taken as input and generates a set of intermediate key-value pairs. Each major (reduce) step reduces the number of data objects (key-value pairs) by an order of magnitude or more.
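The flow of key-value pairs through the map, shuffle and reduce phases can be imitated in a few lines of plain Python. This is a single-machine sketch of the idea using the classic word count, not Hadoop code; the function names and the two input records are my own assumptions.

```python
from collections import defaultdict

def mapper(key, value):
    """Map: emit an intermediate (word, 1) pair for each word in a line."""
    for word in value.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group the intermediate values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce: collapse all values of one key into a single pair."""
    return key, sum(values)

# Hypothetical input records as (line number, line text) pairs.
records = [(0, "big data big clusters"), (1, "data in parallel")]

intermediate = [pair for key, value in records for pair in mapper(key, value)]
word_counts = dict(reducer(key, values) for key, values in shuffle(intermediate))
print(word_counts)
# {'big': 2, 'clusters': 1, 'data': 2, 'in': 1, 'parallel': 1}
```

Here seven intermediate pairs shrink to five output pairs; on real data with many repeated keys, the reduction is far larger.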

The “org.apache.hadoop.mapreduce.InputFormat” class is responsible for dividing the given input data into multiple input splits. The execution time of a Hadoop application is greatly affected by the shuffle phase, in which data is transferred from map tasks to reduce tasks.

The “org.apache.hadoop.mapreduce.RecordReader” class is responsible for converting each split into (key, value) pairs that the mapper phase can accept. Below are the steps that happen as part of the MapReduce cycle.

Step 1: The input request should be in the form of a .JAR file that contains the Driver code, Mapper code and Reducer code.

Step 2: The Job Tracker assigns the mapper tasks, using the business logic from the .JAR file, to all the available Task Trackers.

Step 3: Once all the Task Trackers are done with the mapper processes, they send their status back to the Job Tracker.

Step 4: Once all the Task Trackers are done with the mapper phase, the Job Tracker initiates the sort and shuffle phase on all the mapper outputs.

Step 5: When the sort and shuffle phase is done, the Job Tracker initiates the reducer phase on all available Task Trackers.

Step 6: Once all the Task Trackers are done with the reducer phase, they send their status back to the Job Tracker.

Mapper and Reducer are user-driven phases. The Mapper class output filename is “part-m-00000” and the Reducer class output filename is “part-r-00000”. Job Tracker and Task Tracker are the two daemons that are entirely responsible for MapReduce processing.
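The cycle above can be mimicked on one machine with a thread pool standing in for the Task Trackers. This is a toy simulation under my own assumptions, not how Hadoop is implemented: `run_job`, `map_task` and `reduce_task` are hypothetical names, and the rule that routes a key to a reducer (based on its first character) is made up so that the result is easy to follow.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_task(split):
    """Mapper code: emit (word, 1) for each word in the split's lines."""
    return [(word, 1) for line in split for word in line.split()]

def reduce_task(grouped):
    """Reducer code: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped}

def run_job(splits, num_reducers=2):
    # Steps 2-3: run one mapper task per split and wait until all are done.
    with ThreadPoolExecutor() as trackers:
        map_outputs = list(trackers.map(map_task, splits))

    # Step 4: sort and shuffle, routing every key to exactly one reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            # Toy routing rule (an assumption, chosen to be deterministic).
            partitions[ord(key[0]) % num_reducers][key].append(value)

    # Steps 5-6: run the reducer tasks and collect one output per reducer.
    reducer_inputs = [sorted(partition.items()) for partition in partitions]
    with ThreadPoolExecutor() as trackers:
        results = list(trackers.map(reduce_task, reducer_inputs))
    return {f"part-r-{index:05d}": result for index, result in enumerate(results)}

job_output = run_job([["a b a"], ["b c"]])
print(job_output)
# {'part-r-00000': {'b': 2}, 'part-r-00001': {'a': 2, 'c': 1}}
```

Each reducer writes its own numbered output, which is why the reducer output names follow the “part-r-00000” pattern with an increasing number.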

Kindly follow my blog and stay tuned for more advanced posts on regression techniques.

Thank you!