Machine Learning Overview
Victor Alando
Posted on October 5, 2023
Welcome to Machine Learning. Machine Learning is a way of taking data and turning it into insights. We use computer power to analyze examples from the past and build a model that can predict the result for new examples.
Course Basics
We will use several Python packages that are helpful in solving Machine Learning problems. We will be using pandas, numpy, matplotlib, and scikit-learn.
1. Pandas is used for reading and manipulating data.
2. Numpy is used for computation on numerical data.
3. Matplotlib is used for graphing data.
4. Scikit-learn is used for machine learning models.
Each of these packages is quite extensive, but we will review the functions we will be using. We will also review some basic statistics, as it is the foundation of Machine Learning.
What this Course Entails
We will be focusing on classification problems. These are problems where we are predicting which class something belongs to.
Examples will include:
- Predicting who would survive the sinking of the Titanic
- Determining a handwritten digit from an image
- Using biopsy data to classify whether a lump is cancerous
We'll be using a number of popular techniques to tackle these problems. We'll get into each of them in detail in the upcoming modules.
- Logistic Regression
- Decision Trees
- Random Forests
- Neural Networks
By the end of this course, you'll be able to take a classification dataset and use Python to build several different models to determine the best model for the given problem.
For example:
- Predicting if a credit card charge is fraudulent
- Determining if an image is of a car, bus, or bike
Machine Learning can be used to solve a broad range of problems. This course will focus on *Supervised Learning* and *Classification*.
Averages
When dealing with data, we often need to calculate some simple statistics.
Let's say we have a list of ages of people in a class. We have them in ascending order since it will be easier to do the calculations.
15, 16, 18, 19, 22, 24, 29, 30, 34
The Mean is the most commonly known average.
Add up all the values and divide by the total number of values.
(15 + 16 + 18 + 19 + 22 + 24 + 29 + 30 + 34) / 9 = 207 / 9 = *23*
The Median is the value in the middle. In this case, since there are 9 values, the middle value is the 5th, which is 22.
In statistics, both the mean and the median are called averages. The layman's average is the mean.
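The two averages above can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library's statistics module so it runs without installing anything; the variable names (ages, mean_age, median_age) are assumptions chosen for illustration.

```python
import statistics

# Ages from the example above, already in ascending order.
ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]

# Mean: add up all the values and divide by how many there are.
mean_age = sum(ages) / len(ages)

# Median: the middle value of the sorted list.
median_age = statistics.median(ages)

print(mean_age)    # 23.0
print(median_age)  # 22
```

Numpy offers equivalent functions (np.mean and np.median) that work the same way on arrays.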
Percentiles
The median can also be thought of as the 50th Percentile. This means that 50% of the data is less than the median and 50% of the data is greater than the median. This tells us where the middle of the data is, but we often want more of an understanding of the distribution of the data. We'll often look at the 25th percentile and the 75th percentile.
The 25th percentile is the value that is one quarter of the way through the data. This is the value where 25% of the data is less than it (and 75% of the data is greater than it).
Similarly, the 75th percentile is three quarters of the way through the data. This is the value where 75% of the data is less than it (and 25% of the data is greater than it).
15, 16, 18, 19, 22, 24, 29, 30, 34
We have 9 values, so 25% of the data would be approximately 2 datapoints. So, the 3rd datapoint is greater than 25% of the data. Thus, the 25th percentile is 18 (the 3rd datapoint).
Similarly, 75% of the data is approximately 6 datapoints. So, the 7th datapoint is greater than 75% of the data. Thus, the 75th percentile is 29 (the 7th datapoint).
The full range of our data is between 15 and 34.
The 25th and 75th percentiles tell us that half our data is between 18 and 29. This helps us gain an understanding of how the data is distributed.
If there is an even number of datapoints, to find the median (or 50th percentile), you take the mean of the two values in the middle.
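The percentile calculations can be sketched in Python as well. This example uses statistics.quantiles from the standard library. The choice of method="inclusive" is an assumption: it happens to reproduce the hand calculation for this list, while other percentile conventions can give slightly different values. The shortened list even_ages is hypothetical and only there to show the even-count case.

```python
import statistics

ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th, and 75th percentiles.
# method="inclusive" treats the smallest and largest values as the
# 0th and 100th percentiles (an assumption; other conventions exist).
q25, q50, q75 = statistics.quantiles(ages, n=4, method="inclusive")
print(q25, q50, q75)  # 18.0 22.0 29.0

# Even number of datapoints: the median is the mean of the two middle values.
# Hypothetical list: the same ages with the oldest person removed.
even_ages = ages[:-1]
print(statistics.median(even_ages))  # 20.5
```

In the even case the two middle values are 19 and 22, so the median is (19 + 22) / 2 = 20.5.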
Standard Deviation & Variance
We can get a deeper understanding of the distribution of our data with the standard deviation and variance. These are measures of how dispersed or spread out the data is.
We measure how far each datapoint is from the mean.
Let's look at our group of ages again:
15, 16, 18, 19, 22, 24, 29, 30, 34
Recall that the mean is 23.
Let's calculate how far each value is from the mean. 15 is 8 away from the mean (since 23 - 15 = 8).
Here's a list of all these distances.
8, 7, 5, 4, 1, 1, 6, 7, 11
We square these values and add them together.
8² + 7² + 5² + 4² + 1² + 1² + 6² + 7² + 11²
= 64 + 49 + 25 + 16 + 1 + 1 + 36 + 49 + 121
= 362
We divide this value by the total number of values and that gives us the variance.
362 / 9 = 40.22
To get the standard deviation, we just take the square root of this number and get: 6.34
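The steps above translate directly into Python. This is a minimal sketch using only the standard library; the variable names are assumptions chosen for illustration. Note that it divides by the number of values, as the text does, which is usually called the population variance.

```python
import math
import statistics

ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]
mean_age = sum(ages) / len(ages)  # 23.0

# Distance of each age from the mean.
distances = [abs(age - mean_age) for age in ages]

# Square each distance and add the squares together.
sum_of_squares = sum(d ** 2 for d in distances)

# Variance: divide by the number of values.
variance = sum_of_squares / len(ages)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(sum_of_squares)      # 362.0
print(round(variance, 2))  # 40.22
print(round(std_dev, 2))   # 6.34

# The standard library's population variance agrees with the manual steps.
print(math.isclose(variance, statistics.pvariance(ages)))  # True
```

The statistics module also has variance and stdev functions, which divide by one less than the number of values (the sample versions) and so return slightly larger numbers.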
If our data is normally distributed like the graph below, 68% of the population is within one standard deviation of the mean. In the graph, we've highlighted the area within one standard deviation of the mean. You can see that the shaded area is about two thirds (more precisely, 68%) of the total area under the curve. If we assume that our data is normally distributed, we can say that 68% of the data is within 1 standard deviation of the mean.
In our age example, while the ages are likely not exactly normally distributed, we assume that they are and say that approximately 68% of the population has an age within one standard deviation of the mean. Since the mean is 23 and the standard deviation is 6.34, we can say that approximately 68% of the ages in our population are between 16.66 (23 - 6.34) and 29.34 (23 + 6.34).
Even though data is never a perfect normal distribution, we can still use the standard deviation to gain insight into how the data is distributed.
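To see how rough the 68% rule can be on a small dataset, the range of one standard deviation around the mean can be computed and compared against the actual ages. This is a sketch under the same assumptions as the text (population standard deviation, ages treated as roughly normal); the variable names are illustrative.

```python
import statistics

ages = [15, 16, 18, 19, 22, 24, 29, 30, 34]
mean_age = statistics.mean(ages)
std_dev = statistics.pstdev(ages)  # population standard deviation

# Range covering one standard deviation on either side of the mean.
lower = mean_age - std_dev
upper = mean_age + std_dev
print(round(lower, 2), round(upper, 2))  # 16.66 29.34

# Ages that actually fall inside that range, and their share of the data.
within = [age for age in ages if lower <= age <= upper]
share = len(within) / len(ages)
print(within)           # [18, 19, 22, 24, 29]
print(round(share, 2))  # 0.56
```

With only nine values, 5 of 9 (about 56%) fall inside the range. That is in the neighborhood of 68% but not equal to it, which is expected when the data is small and only approximately normal.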