Understanding Data For Data Analytics, Data Science, and Machine Learning – Part-2
Shubham Singh
Posted on May 13, 2022
Things to know beforehand
- What is Variability?
  It is a measure of how spread out the data is.
[1] Central Tendency
[2] Median
When your data is strongly influenced by outliers, the median is a good choice because it is not affected by them.
To calculate the median, sort your data (ascending or descending does not matter) and then find the middle point.
The middle point depends on whether n (the number of values) is even or odd.
[a] when n is even
When n is even, there are 2 middle values, and the median is the average of the two.
[b] when n is odd
When n is odd, the median is simply the single middle value.
In R, both cases are handled by the same function:
median()
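As a small sketch of the even and odd cases, the snippet below compares R's median() with a hand-written version. The vectors and the helper manual_median are made up for illustration and are not part of the original post.

```r
# Odd n: the single middle value of the sorted data
odd <- c(7, 1, 5, 3, 9)
median(odd)    # sorted: 1 3 5 7 9, so the median is 5

# Even n: the average of the two middle values
even <- c(7, 1, 5, 3, 9, 11)
median(even)   # sorted: 1 3 5 7 9 11, so the median is (5 + 7) / 2 = 6

# Hypothetical helper that spells out both cases
manual_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                  # odd n: one middle value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2   # even n: mean of the two middle values
  }
}

# One extreme value pulls the mean far more than the median
with_outlier <- c(odd, 1000)
mean(with_outlier)     # much larger than any typical value
median(with_outlier)   # sorted: 1 3 5 7 9 1000, so the median is 6
```

The last two lines show why the median is preferred when outliers are present: the mean jumps, while the median barely moves.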
[3] Mode
The mode is the value that occurs most often in the data.
Calculating the mode by hand means counting the occurrences of each value in the data.
R doesn't have a built-in function for the statistical mode (the base mode() function reports an object's storage mode instead), so we can use this function:
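# Named get_mode so that it does not mask base R's mode()
get_mode <- function(v) {
  uniqv <- unique(v)
  # count how often each unique value occurs and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
get_mode(data)  # data is assumed to be a vector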
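One thing to keep in mind is that which.max() returns only the first maximum, so tied values are silently dropped. A possible variant that returns every tied value is sketched below; the name all_modes and the sample vectors are hypothetical.

```r
# Hypothetical helper: return every value that shares the highest count
all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # occurrences of each unique value
  uniqv[counts == max(counts)]
}

all_modes(c(2, 3, 3, 5, 5, 7))   # 3 and 5 both occur twice
all_modes(c("a", "b", "a"))      # "a"
```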
[2] Measures of Spread
Understanding the spread of your data is important for understanding the data itself: two data sets can have the same mean but different spreads, and ignoring this can lead to low-quality estimates.
[1] Range
It is one of the simplest measures of variability: the difference between the largest and the smallest value. To calculate the range:
diff(range(data))
# or
print(max(data) - min(data))
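To illustrate the point about equal means with different spreads, here is a minimal sketch with two made-up vectors (narrow and wide are assumptions for illustration only):

```r
narrow <- c(48, 49, 50, 51, 52)
wide   <- c(10, 30, 50, 70, 90)

mean(narrow)          # 50
mean(wide)            # 50, same centre as narrow

diff(range(narrow))   # 52 - 48 = 4
diff(range(wide))     # 90 - 10 = 80
```

Because the range uses only the two most extreme values, a single outlier can change it drastically.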
[2] Inter Quartile Range (IQR) and Whiskers Plot
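By dividing your sorted data into 4 equal parts, quartiles are generated; each part contains 25% of the data, i.e.,
the 1st quartile (Q1) is the value below which 25% of the data falls (25th percentile); the 2nd quartile (Q2) is the 50th percentile, which is the median; the 3rd quartile (Q3) is the 75th percentile; and the 4th quartile is the 100th percentile, which is the maximum. The IQR is the distance between the 3rd and 1st quartiles, Q3 - Q1, so it covers the middle 50% of the data.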
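As a rough illustration, base R's quantile(), IQR() and fivenum() can be applied to a small made-up vector (x is an assumption for illustration, not data from this post). quantile() interpolates between values, and its default method (type 7) may give slightly different results from other quartile conventions.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

quantile(x)   # 0%, 25%, 50%, 75%, 100%: 1 3 5 7 9

# IQR by hand: third quartile minus first quartile
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])   # 7 - 3 = 4

IQR(x)        # built-in equivalent, also 4
fivenum(x)    # min, lower hinge, median, upper hinge, max: 1 3 5 7 9
```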
A Box and Whiskers Plot is very useful for seeing the five-point summary and for understanding spread and outliers:
library(ggplot2)
data <- iris
# One vertical box per species, showing the distribution of Sepal.Length
ggplot(data) +
  geom_boxplot(mapping = aes(x = Species, y = Sepal.Length))
The five-point summary in a box plot includes the minimum value, the first quartile, the median, the third quartile, and the maximum value.
Each of these can be located in the plot:
- Minimum value : lower end of the vertical line (whisker), not counting outliers.
- First Quartile : bottom edge of the box.
- Median : bold horizontal line inside the box.
- Third Quartile : top edge of the box.
- Maximum value : upper end of the vertical line (whisker), not counting outliers.
And if you are wondering what that isolated point beyond the whisker for virginica is, it is an outlier.
To identify outliers mathematically, you need a threshold: any point that falls beyond the threshold is considered an outlier.
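The post does not state which threshold it uses. A common convention, and the default in geom_boxplot() (coef = 1.5), is to flag points that lie more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below applies that rule to a made-up vector; the variable names are illustrative.

```r
# Assumption: the 1.5 * IQR rule is used as the outlier threshold
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q1  <- unname(quantile(x, 0.25))   # 3.25
q3  <- unname(quantile(x, 0.75))   # 7.75
iqr <- q3 - q1                     # 4.5

lower_fence <- q1 - 1.5 * iqr      # -3.5
upper_fence <- q3 + 1.5 * iqr      # 14.5

# Anything outside the fences is treated as an outlier
outliers <- x[x < lower_fence | x > upper_fence]
outliers                           # 100
```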
[3] Variance
Variance is the expectation of the squared deviation of a random variable from its mean (the population mean, or the sample mean when working with a sample). Variance is a measure of dispersion: it describes how far a set of numbers is spread out from their average value.
Why is it S^2?
Because the deviations (xi - x bar) always sum to zero, since positive and negative deviations cancel out,
so we square each deviation to make it non-negative before adding them up.
var(data)
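To make the reasoning concrete, the sketch below computes the sample variance by hand on a made-up vector and compares it with var(). Note that R's var() divides by n - 1 (sample variance), not by n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

deviations <- x - mean(x)   # mean(x) is 5
sum(deviations)             # 0: raw deviations cancel out

# Squaring removes the sign, so the deviations no longer cancel
sample_var <- sum(deviations^2) / (length(x) - 1)   # 32 / 7

sample_var
var(x)                      # same value as sample_var
```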
[4] Standard Deviation
The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
It is closely related to the variance, being its square root; the difference is that the SD is in the same unit as the data, whereas the variance is in squared units.
sd(data)
Normal distributions with standard deviations of 5 and 10.
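A comparison like the one in the caption can be approximated by simulation. The sketch below draws two samples with rnorm(), one with a standard deviation of 5 and one with 10; the sample size, the seed and the variable names are arbitrary choices for illustration.

```r
set.seed(42)   # arbitrary seed so the draws are reproducible

narrow <- rnorm(10000, mean = 0, sd = 5)
wide   <- rnorm(10000, mean = 0, sd = 10)

sd(narrow)   # close to 5
sd(wide)     # close to 10, so values are spread over a wider range

# The standard deviation is the square root of the variance
all.equal(sd(narrow), sqrt(var(narrow)))
```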
For Part-3 go here