Exploratory Data Analysis in Data Science
This post is about data plotting or visualization methods for data analysis and each technique has been explained with an example using iris dataset. You can also find some important posts in this blog like machine learning modelling process, list of regression techniques, confusion matrix, root-mean-squared-error. If you are interested in reading research papers then kindly refer post on classifier and regression models. There you can find code of 77 regression models. In this post, you will discover about some important data visualization techniques, mainly:
- 2-D scatter plot
- 3-D scatter plot
- Pair plot
- Histogram
- Box-plot
Let's first talk about simple iris dataset, it has 4 features/independent variables or predictors (sepal length, sepal width, petal length, petal width), that means it is 4 dimensional array. Response/dependent variables or class labels are virginica, setosa and versicolor. Dataset has 150 data points and it is balanced as number of data points for each class is the same i.e. 50 data points for each class. You can download dataset from here or see it using sklearn.datasets in python. For more information about iris dataset click on this link. To implement above mentioned visualization techniques in python, you must have pandas, seaborn, matplotlib and numpy libraries.
1. 2-D scatter plot:
Fig. 1 shows 2-D scatter plot of sepal_length and sepal_width and reports that blue points of setosa class are easily separable by green and orange data points by drawing a linear line. However, class labels versicolor and virginca are not easily separable with this 2-D feature combination (sepal_length and sepal_width). In this case, we can try for the other combinations for instance, petal_length and petal_width.
2. 3-D scatter plot:
It plots data points into 3 dimensional space. Disadvantage of 3-D plot is that it requires many interactions with plot for interpretation so it is not more convenient technique.
Fig. 2 3-D scatter plot of petal_length, sepal_length and petal_width from iris dataset |
3. Pair plot:
We can not do 4-D scatter plot instead, we use pair plot. It would be good solution in order to avoid checking lot of combinations using 2-D and many mouse interactions using 3-D scatter plot. Dataset with 4, 5, 6 or 7 dimensions, can easily interpret by pair plot however, it can not be good option if dimensions are more than that. To identify class labels, Fig. 3 presents petal_width and petal_length are two highly influential predictors where, setosa are linearly separable from class versicolor and virginica. The diagonal elements are Probability Density Functions (PDF) of each feature.