Intro to Machine Learning in Python: Part I
Brett Hammit
Posted on November 11, 2020
After spending a while really getting to know the ins and outs of data frame management and other sides of data science in Python, I had been reluctant to get into Machine Learning, worried that I would not have the time to commit to it and get as good as I would like. Like everything, sometimes you just gotta do it. So here we go.
Starting Point:
Where I am starting is supervised learning, which basically means the inputs and outputs are known, and you are just adjusting the parameters of your model so it can predict future outcomes.
-An example of this would be classifying movie reviews as positive vs. negative
I am doing this work in Jupyter with the Python library scikit-learn, which has algorithms already built in and makes it much easier to fit models, split test and training data, etc.
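To give a feel for what scikit-learn handles for you, here is a minimal sketch of splitting data into training and test sets. The arrays are made-up toy data (an assumption for illustration, not my actual data set), and the 30% test size is just a common choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

The model only ever sees the training rows, so the test rows give an honest check of how well it predicts data it has not seen.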
Linear Regression
Linear Regression is the step up after correlation: it is when we try to model the relationship between two variables by fitting a line to the data, which we can then use to predict a value.
Within Machine Learning there are some base algorithms, and it can be hard to decide which model is best for your data. This cheat sheet gives a pretty good guide to which model to try based on your data.
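As a rough sketch of what "fitting a model to predict a value" looks like in scikit-learn, the example below fits a line to made-up data that follows y = 2x + 1 exactly. The data is an assumption chosen so the result is easy to check by hand. Real data will be noisier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature, so shape (5, 1)
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)  # estimate the slope and intercept from the data

slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[6.0]]))[0]  # predict y for x = 6

print(slope, intercept, prediction)
```

Because the toy data sits perfectly on a line, the fitted slope and intercept come out at (approximately) 2 and 1.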
Working With Our Data
So first off, we need data to work with in order to build a model. Once you have your data readily available, the first thing to do is to analyze what you are working with.
The first step is loading your data into a data frame. We can do this with "pd.read_csv('YourData')", or the matching pandas reader for whatever type of file you are working with. Creating this data frame allows us to dig deeper and see what we need to do with our model.
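Here is a small sketch of that loading step. So that it runs on its own, it reads a made-up CSV from a string instead of a file on disk. The column names (sqft, bedrooms, price) are hypothetical and only there for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """sqft,bedrooms,price
850,2,150000
1200,3,210000
1500,3,255000
2000,4,340000
"""

# With a real file you would pass its path instead, e.g. pd.read_csv('YourData').
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows, to confirm the file parsed as expected
print(df.shape)   # (number of rows, number of columns)
```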
Analyzing Our Data
A good place to start is to call the .describe() method and check the .columns attribute of your data frame to see your column names and some summary statistics about them.
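A quick sketch of that first look, using a tiny made-up data frame (the columns are hypothetical):

```python
import pandas as pd

# Hypothetical data frame standing in for your real data.
df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 2000],
    "bedrooms": [2, 3, 3, 4],
    "price": [150000, 210000, 255000, 340000],
})

print(df.columns)        # the column names (an attribute, so no parentheses)

summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)
```

The mean, min and max rows are usually where odd values or obvious data-entry problems show up first.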
With Seaborn imported (as sns), we can use "sns.pairplot(YourDataFrame)" to plot every numeric column against every other one, which gives us a good idea of the distribution of our data.
An example of Normally Distributed Data vs. Not Normally Distributed Data
After that we can look at the correlation of our data by using "sns.heatmap(df.corr(), annot=True)" to see a heat map with the correlation values printed on each cell. A value of 1 means two columns are perfectly positively correlated, -1 means they are perfectly negatively correlated, and 0 means there is no linear correlation.
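Here is a sketch of the correlation heat map on made-up data where price is deliberately built from sqft, so a strong correlation should show up between those two. The column names, the numbers and the color settings are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: price depends on sqft plus some noise; bedrooms is independent.
rng = np.random.default_rng(1)
sqft = rng.normal(1500, 300, 100)
df = pd.DataFrame({
    "sqft": sqft,
    "bedrooms": rng.integers(1, 6, 100),
    "price": 150 * sqft + rng.normal(0, 20000, 100),
})

corr = df.corr()  # pairwise Pearson correlations, each between -1 and 1

# annot=True prints the number on each cell; vmin/vmax pin the color scale to [-1, 1].
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

The diagonal is always 1 because every column is perfectly correlated with itself, so the interesting cells are the off-diagonal ones.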
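Lastly, we need to pick what we would like to predict. Choose the column you want to predict and use "sns.distplot(YourDataFrame['ColumnName'])" to pull up a distribution plot of that column (newer Seaborn releases deprecate distplot in favor of sns.histplot and sns.displot). Ideally it looks roughly normally distributed like I talked about above, although strictly speaking linear regression assumes normally distributed residuals rather than a normally distributed target.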
Conclusion
In this post I mainly talked about my first day in Machine Learning, primarily working with Linear Regression and analyzing your data to get it ready for fitting. My next post should be more about actual ML: training, testing and fitting our model!