Exploratory Data Analysis (EDA): Ultimate Guide
Yankho Chimpesa
Posted on February 24, 2023
Exploratory data analysis (EDA) is an important phase in data analysis and data science: it involves examining and visualizing data to understand its properties and the relationships between its variables.
It aids in the discovery of patterns, outliers, and potential data issues. This article is the ultimate guide to exploratory data analysis, covering its definition, steps, and techniques.
Definition
Exploratory data analysis is the process of studying data to highlight its key features using quantitative and visual techniques.
It entails comprehending the structure of the data, spotting trends and connections, and looking for probable outliers or abnormalities.
Gaining insights into the data, spotting potential issues, and getting the data ready for further analysis are the key objectives of EDA. It is arguably the most important step in a data science project, and data exploration and preparation are commonly estimated to take up 70-80% of a project's total time.
Since EDA is an iterative process, the analysis may be honed or expanded in response to the findings of earlier analysis.
Types
Univariate data analysis
Univariate analysis is a type of exploratory data analysis (EDA) that examines the distribution and characteristics of a single variable at a time.
The primary goal of univariate analysis is to comprehend the data's central tendency, variability, and distribution (see https://www.geeksforgeeks.org/exploratory-data-analysis-eda-types-and-tools/).
Some common techniques used in univariate analysis include:
Descriptive Statistics: Descriptive statistics, such as mean, median, mode, range, and standard deviation, provide a summary of the central tendency, dispersion, and shape of the distribution of a variable.
To calculate descriptive statistics such as mean, median, and standard deviation, we can use the NumPy library. Here's an example:
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)  # population standard deviation by default (ddof=0)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)

# Output
Mean: 5.5
Median: 5.5
Standard Deviation: 2.8722813232690143
Frequency Distributions: Frequency distributions show how many times each value or range of values occurs in a variable. This helps to understand the shape of the distribution, such as whether it is symmetric or skewed.
To create a frequency distribution, we can use the pandas library. Here's an example:
import pandas as pd
data = [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8, 9, 10]
freq_dist = pd.Series(data).value_counts()
print(freq_dist)
#Output
5 4
3 3
6 2
4 2
1 2
8 2
2 1
7 1
9 1
10 1
dtype: int64
Histograms: Histograms are graphical representations of frequency distributions that use bars to show the frequency of each value or range of values in a variable. Histograms provide a visual representation of the distribution of the data.
Box Plots: Box plots, also known as box-and-whisker plots, provide a graphical summary of the distribution of a variable. They show the median, quartiles, and outliers of the data (a short code sketch of histograms and box plots follows this list).
Probability Distributions: Probability distributions, such as the normal distribution, provide a mathematical model for the distribution of the data. They can be used to make predictions about the data and to test hypotheses.
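To illustrate the histogram, box plot, and probability distribution techniques above, here is a minimal sketch using Matplotlib, Seaborn, and SciPy on the same toy list used earlier; the variable names are ours, chosen purely for illustration:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

data = [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8, 9, 10]

# Histogram: bar heights show how often each range of values occurs
sns.histplot(data, bins=5)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

# Box plot: the box marks the quartiles, the line inside is the median,
# and points beyond the whiskers are flagged as outliers
sns.boxplot(x=data)
plt.show()

# Probability distribution: fit a normal model to the data
mu, sigma = stats.norm.fit(data)
print("Fitted mean:", mu, "Fitted std:", sigma)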
Bivariate Analysis
Bivariate analysis is a type of exploratory data analysis (EDA) in which the relationship between two variables is examined.
The goal of bivariate analysis is to identify any patterns or trends in the data and to understand how the two variables are related to each other (see https://www.analyticsvidhya.com/blog/2022/02/a-quick-guide-to-bivariate-analysis-in-python/).
There are several techniques that can be used to perform bivariate analysis, including:
Scatter plots - Scatter plots are a visual way to explore the relationship between two variables. A scatter plot displays the values of two variables as points on a two-dimensional graph, with one variable represented on the x-axis and the other on the y-axis. The pattern of the points can provide insights into the relationship between the two variables. For example, if the points are clustered around a straight line, it suggests a linear relationship between the variables.
Correlation analysis - Correlation analysis is a statistical technique used to measure the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 to +1, with a value of -1 indicating a perfect negative correlation, a value of +1 indicating a perfect positive correlation, and a value of 0 indicating no correlation. Correlation analysis can help to identify the strength and direction of the relationship between two variables.
Covariance analysis - Covariance is a statistical measure that describes how two variables vary together. Covariance is similar to correlation, but unlike correlation it is not standardized, so its magnitude depends on the scale of the variables. A positive covariance indicates that the two variables tend to move together, while a negative covariance indicates that the two variables tend to move in opposite directions.
Heat maps - Heat maps are graphical representations of data that use color coding to represent the value of a variable. Heat maps can be used to explore the relationship between two variables by displaying the correlation matrix in a color-coded format. This allows us to quickly identify patterns and trends in the data.
Regression analysis - Regression analysis is a statistical technique used to model the relationship between two variables. Regression analysis can be used to predict the value of one variable based on the value of another variable. For example, we could use regression analysis to predict the sales of a product based on the advertising spend.
By using these techniques, we can gain insights into the relationship between two variables and use this information to inform further analysis and modeling.
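To make these techniques concrete, here is a minimal sketch combining a scatter plot, a correlation coefficient, a covariance, and a simple regression fit. The advertising_spend and sales columns are made-up illustrative values, not from a real dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: advertising spend vs. sales
sales_df = pd.DataFrame({
    "advertising_spend": [10, 20, 30, 40, 50, 60],
    "sales": [25, 40, 58, 70, 95, 110],
})

# Scatter plot of the two variables
plt.scatter(sales_df["advertising_spend"], sales_df["sales"])
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.show()

# Pearson correlation coefficient (between -1 and +1) and covariance
print("Correlation:", sales_df["advertising_spend"].corr(sales_df["sales"]))
print("Covariance:", sales_df["advertising_spend"].cov(sales_df["sales"]))

# Simple linear regression: slope and intercept of the best-fit line
slope, intercept = np.polyfit(sales_df["advertising_spend"], sales_df["sales"], 1)
print("Slope:", slope, "Intercept:", intercept)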
Multivariate analysis
Multivariate analysis is a type of exploratory data analysis (EDA) that involves analyzing the relationship between three or more variables. The goal of multivariate analysis is to understand how multiple variables are related to each other and to identify any patterns or trends in the data (see https://towardsdatascience.com/multivariate-analysis-going-beyond-one-variable-at-a-time-5d341bd4daca).
There are several techniques that can be used to perform multivariate analysis, including:
Factor analysis - Factor analysis is a statistical technique used to identify patterns in the relationship between multiple variables. Factor analysis reduces the number of variables by grouping them into a smaller number of factors, based on their correlation with each other.
Cluster analysis - Cluster analysis is a statistical technique used to group similar objects or individuals based on their characteristics. Cluster analysis can be used to identify patterns in the data and to identify subgroups of individuals or objects.
Principal component analysis - Principal component analysis (PCA) is a statistical technique used to transform a large number of variables into a smaller number of principal components. PCA can be used to reduce the dimensionality of the data and to identify the most important variables.
Discriminant analysis - Discriminant analysis is a statistical technique used to classify individuals or objects into two or more groups based on their characteristics. Discriminant analysis can be used to identify the variables that are most important in distinguishing between the groups.
Canonical correlation analysis - Canonical correlation analysis is a statistical technique used to identify the relationship between two sets of variables. Canonical correlation analysis can be used to identify the variables that are most important in explaining the relationship between the two sets of variables.
By using these techniques, we can gain insights into the relationship between multiple variables and use this information to inform further analysis and modeling. Multivariate analysis is particularly useful when working with large datasets or when exploring complex relationships between variables.
Example
# Correlation coefficient heat map
# (assumes df is the DataFrame loaded in the Data Collection step below)
import matplotlib.pyplot as plt
import seaborn as sns

corr_df = df.corr(numeric_only=True)  # numeric_only skips non-numeric columns
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr_df, annot=True, fmt=".2f", ax=ax, linewidths=0.5, linecolor="yellow")
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.title("Correlation coefficients of the data")
plt.show()
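To complement the heat map, here is a minimal sketch of principal component analysis with scikit-learn, assuming df holds the dataset used above; the column selection and missing-value handling are deliberately simplified:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Keep numeric columns only and drop rows with missing values
numeric_df = df.select_dtypes(include="number").dropna()

# PCA is sensitive to scale, so standardize first
scaled = StandardScaler().fit_transform(numeric_df)

# Project the data onto its first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Proportion of variance captured by each component
print(pca.explained_variance_ratio_)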
Steps of Exploratory Data Analysis
EDA is typically carried out in several steps, which include:
Data Collection: This involves gathering relevant data for analysis. Data can be collected from various sources, including public datasets, surveys, and databases.
Data Cleaning: This step involves checking for missing data, errors, and outliers. The data is cleaned by removing duplicates, correcting data entry errors, and filling in missing values.
Data Visualization: This step involves creating visualizations to identify patterns and relationships in the data. Common visualization techniques include scatter plots, histograms, and box plots.
Data Transformation: This step involves transforming the data to make it more suitable for analysis. This can include normalization, scaling, and standardization.
Data Modeling: This step involves creating models to describe the relationships between variables. Models can be simple, such as linear regression, or complex, such as decision trees or neural networks.
Data Collection
The first step in EDA is to collect relevant data for analysis. The data can be collected from various sources, such as public datasets, surveys, and databases. In Python, you can use libraries like pandas to read and manipulate data.
See the Example below:
# The pyforest library lazy-imports common data science libraries,
# which reduces the need to list multiple import statements
import pyforest

# Read data from a CSV file into a pandas DataFrame
df = pd.read_csv('IT Salary Survey EU 2020.csv')
df.head()
Once the data has been loaded into a pandas DataFrame, you can start exploring it with a variety of tools and functions.
Data Cleaning
The second step in EDA is cleaning the data.
Data cleaning is a crucial step in exploratory data analysis because it helps to ensure that the data is accurate, complete, and reliable.
It is the process of identifying and correcting errors, inconsistencies, and missing values in a dataset.
Libraries such as pandas and dask can be used to clean the data.
The process of data cleaning in EDA typically involves the following:
Data inspection: In this step, the data is visually inspected to identify any obvious errors or inconsistencies, such as missing values, outliers, or incorrect data types.
Handling missing values: Missing values can be handled by either removing the rows or filling in the missing values with an appropriate estimate, such as the mean or median.
Handling outliers: Outliers are data points that are significantly different from the other data points in the dataset. Outliers can be handled by removing them from the dataset or by transforming the data to reduce the impact of outliers.
Normalization: Normalization is the process of rescaling the data to a common scale or distribution. This can help to reduce the impact of outliers and make it easier to compare data points.
Validation: Data validation involves checking the data to ensure that it meets the requirements of the analysis. This includes checking for errors, inconsistencies, and other issues that could affect the validity of the results.
Transformation: Data transformation involves converting the data into a form that is suitable for analysis. This can include aggregating data, creating new variables, or converting variables into different formats.
Overall, data cleaning ensures that the data is accurate, reliable, and fit for analysis, so that analysts and data scientists can draw valid insights and make informed decisions based on the data.
Example:
# Display the last rows and a summary of the dataset
df.tail()
df.info()

# Check for missing values
df.isnull().sum()

# Fill in missing values in numeric columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)
In the above code, we first inspect the data with tail() and info(), then check for missing values using the isnull() method, which returns a boolean DataFrame indicating which cells are null or missing. We then use the fillna() method to replace missing values in the numeric columns with the column mean. Finally, we use the drop_duplicates() method to remove any duplicate rows in the DataFrame.
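The cleaning checklist above also mentions handling outliers. As one common approach, here is a minimal sketch of the 1.5 * IQR rule applied to the Age column used elsewhere in this post; any numeric column would work the same way:
# Flag outliers in a numeric column with the 1.5 * IQR rule
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose Age falls within the whiskers
df_no_outliers = df[df["Age"].between(lower, upper)]
print(len(df), len(df_no_outliers))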
Data Visualization
The third step in EDA is to create visualizations of the data.
Data visualization is an essential part of exploratory data analysis (EDA). It involves the creation of graphical representations of the data that make it easier to understand and interpret.
Data visualization helps to identify patterns, trends, and relationships in the data that may be difficult to discern from raw data alone. Visualizations in Python can be created using libraries such as Matplotlib and Seaborn.
There are many types of data visualizations that can be used in EDA, including:
Scatterplots: Scatterplots are used to visualize the relationship between two continuous variables. They show how the values of one variable are related to the values of another variable.
Histograms: Histograms are used to visualize the distribution of a single continuous variable. They show how the values of the variable are spread across a range of values.
Bar charts: Bar charts are used to visualize the distribution of a categorical variable. They show the frequency or proportion of each category.
Box plots: Box plots are used to visualize the distribution of a continuous variable. They show the median, quartiles, and outliers of the variable.
Heat maps: Heat maps are used to visualize the relationship between two categorical variables. They show the frequency or proportion of each combination of categories.
Line charts: Line charts are used to visualize trends in a continuous variable over time.
When creating data visualizations, it is important to choose the right type of visualization for the data being analyzed. The visualization should be clear and easy to understand, and the axes and labels should be clearly marked.
Example:
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot
plt.scatter(df["Years of experience in Germany"], df["Age"])
plt.xlabel("Years of experience in Germany")
plt.ylabel("Age")
plt.show()

# Histogram
sns.histplot(df["Years of experience in Germany"], bins=10)
plt.xlabel("Years of experience in Germany")
plt.ylabel("Frequency")
plt.show()

# Box plot ("group" and "value" are placeholder column names;
# substitute columns from your own dataset)
sns.boxplot(x=df["group"], y=df["value"])
plt.show()
Data Transformation
The fourth step in EDA is to transform the data to make it more suitable for analysis. This can include normalization, scaling, and standardization. You can use libraries like Scikit-learn to transform the data.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Scaling only applies to numeric data, so select the numeric columns first
numeric_data = df.select_dtypes(include="number").dropna()

# Standardize the data (zero mean, unit variance)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Normalize the data to the [0, 1] range with MinMaxScaler, one common choice
normalizer = MinMaxScaler()
normalized_data = normalizer.fit_transform(numeric_data)
Conclusion
Exploratory Data Analysis is an essential process in data analysis: it provides insights into the data, identifies potential problems, and prepares the data for further analysis. In this article, we have covered the steps involved in EDA, including data collection, cleaning, visualization, transformation, and modeling.
Data cleaning involves identifying and correcting errors in the data, while data visualization enables the identification of patterns and relationships in the data using techniques such as histograms, box plots, and scatter plots. Data transformation, including normalization, scaling, and standardization, prepares the data for modeling.
EDA is a crucial step in data analysis that can help identify potential problems in the data, such as missing values, outliers, and anomalies, and provide insights into relationships between variables. This enables researchers and analysts to make informed decisions and gain insights that can be used to solve problems or make predictions.
Overall, the EDA process is iterative, meaning that the analysis may be refined or expanded based on the results of previous analysis. EDA is an essential step in the data analysis process, and it is critical to ensure that the data is clean and ready for further analysis.
For more information about EDA, check out my repository on GitHub, where I performed exploratory data analysis on the "IT Salary Survey EU" dataset that I found on Kaggle.
Here is the link to the GitHub repo: https://github.com/Yankho817/MyProjects.