Going into a Data Science Python course!
Lina Dias
Posted on March 19, 2020
Hi guys!
As you know from my previous posts (on the Portuguese blog), I've been taking a mini-course on Udemy about data visualization in Python. This course was recommended to help me learn Python, so I can join the AI lab at my university ASAP.
First, I was formally introduced to matplotlib.pyplot with the import command:
import matplotlib.pyplot as mpl
or as plt, which is the alias I use in the rest of this post.
Next, I met Google Colab, an online Python notebook platform, which helped me a lot, since I can't install a Python IDE on the computer I'm currently using.
I made a line graph with the following code (please try it, it was a cute experience):
import matplotlib.pyplot as plt
x = [1, 2] #giving x and y some values so I can plot something
y = [2, 3]
plt.plot(x, y) #plotting the graph
plt.show() #show the graph when I hit "Run"
After creating my first graph, I added a title and axis labels, so I could identify things inside what I created.
import matplotlib.pyplot as plt
x = [1, 2, 5] #I added one more value in each variable
y = [2, 3, 7]
plt.title("My first graph") #this is a title for my graph
plt.xlabel("Axis X") #creating labels for each axis
plt.ylabel("Axis Y")
plt.plot(x, y)
plt.show()
Now that we've learned how to make line graphs, shall we make bar graphs?
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5] #now, x represents each one of the bars
y = [2, 3, 7, 1, 0] #and y, their sizes
titulo = "Bar graph"
eixox = "Axis X" #creating variables for the axis labels
eixoy = "Axis Y"
plt.title(titulo)
plt.xlabel(eixox)
plt.ylabel(eixoy)
plt.bar(x, y) #plotting the bar graph
plt.show()
Now, with what we know about these two types of graphs, we can do at least two things: compare graphs and/or combine them!
import matplotlib.pyplot as plt
x1 = [1, 3, 5, 7, 9] #odd numbers for the bars!
y1 = [2, 3, 7, 1, 0] #random numbers for their sizes: unaltered
x2 = [2, 4, 6, 8, 10] #even numbers for other bars
y2 = [5, 1, 3, 7, 4] #more random numbers, but they weren't here before
titulo = "Bar graphs"
eixox = "Axis X" #"eixo" is Portuguese for "axis"
eixoy = "Axis Y"
#this part you already know
plt.title(titulo)
plt.xlabel(eixox)
plt.ylabel(eixoy)
plt.bar(x1, y1)
plt.bar(x2, y2) #one and then another, but are shown together!
plt.show()
We can also combine any number of graph types! You can add a plt.plot(x, y) call to the bar graph code, for example. But we have one more type of graph: the scatter plot, or dispersion graph. Call it with plt.scatter(x, y).
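To make the "combine them" idea concrete, here is a small sketch that draws the three graph types on the same figure. The values and the names bars, lines and dots are just my choices for this example; they are not from the course.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 7, 1, 0]

bars = plt.bar(x, y)      # one bar for each (x, y) pair
lines = plt.plot(x, y)    # a line connecting the same points
dots = plt.scatter(x, y)  # a dot on each point
plt.title("Bar, line and scatter together")
plt.show()
```

Each call draws on the same figure, so everything appears together when plt.show() runs.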
Quick note for this English version: in the Portuguese one, I showed some images of the graphs I made, so you could see what they look like if you can't access Google Colab right now. So I wrote it with those images in mind, and now I'm adapting it to DEV.
You may have noticed that the colors change in the comparative graphs. These are the default colors, but you can change them to any hue you want (mainly using the color codes of plt, which you can find here) with the color argument, as in plt.scatter(x, y, color="r"), where I'm changing the color of the dots to red. You can also use the label argument to give captions to the graph, as in plt.plot(x, y, label="My line"), but you need to call plt.legend() after it so the caption shows up in the image.
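Here is a minimal sketch that puts color, label and plt.legend() together. The label "My dots" and the green line are my own picks for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 5]
y = [2, 3, 7]

plt.plot(x, y, color="g", label="My line")     # "g" is the short code for green
plt.scatter(x, y, color="r", label="My dots")  # "r" is the short code for red
legend = plt.legend()  # without this call, the labels are not drawn
plt.show()
```

If you forget plt.legend(), the graph still appears, just without the caption box.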
You can save your figures with plt.savefig("figurename.png"), where "png" can be changed to "pdf" if we want a vector image, which has (really) good print quality. There is also the dpi argument, which sets the resolution of the image in dots per inch. A good dpi value is, apparently, 300, and a so-so one is 72. You can use plt.savefig("figurename.png", dpi=300).
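A small sketch of saving the same graph in both formats. It assumes you can write files in the current folder (in Google Colab they should appear in the file panel). The os.path.getsize lines are only there to check that the files were really created.

```python
import os
import matplotlib.pyplot as plt

plt.plot([1, 2, 5], [2, 3, 7])
plt.title("My first graph")

plt.savefig("figurename.png", dpi=300)  # pixel image, 300 dots per inch
plt.savefig("figurename.pdf")           # vector image

png_size = os.path.getsize("figurename.png")  # file size in bytes
pdf_size = os.path.getsize("figurename.pdf")
print(png_size, pdf_size)
```

One detail worth knowing: call plt.savefig() before plt.show(). After plt.show(), the figure may already be closed, and the saved file can come out blank.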
The course has a small case study with info from 1980 to 2016 about the increase in the Brazilian population, and then we're introduced to the boxplot. Boxplots are box-shaped diagrams that show how data is spread across quartiles. This is my current subject in Probability and Statistics, so I was very interested.
Basically, if you enter code like this:
import matplotlib.pyplot as plt
import random #a Python library for generating random numbers!
vetor = [] #a small vector (list) to put values in
for i in range(100): #for each i value from 0 to 99...
    numAleatorio = random.randint(0, 50) #random number ("número aleatório") somewhere between 0 and 50, both included
    vetor.append(numAleatorio) #vector receives this number so we can create the boxplot
plt.boxplot(vetor) #and then it plots the boxplot with the vector values
plt.show()
...and then press Run, it generates another image, but it's not like the other graphs, so let me explain what I know about it.
If there were something above the top line (or below the bottom one), it would represent outliers: values that are very far from the rest of the data. When there are no outliers, the top line is the highest value in the data, which here can be at most 50. Talking about 50, the main rectangle in the figure represents the middle 50% of the obtained data. The bottom line is the minimum, which here can be as low as zero.
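To see what "something over the top" looks like, here is a sketch with a made-up list (valores is a hypothetical name and the numbers are invented) where one value is far from the others. plt.boxplot returns a dictionary, and its "fliers" entry holds the points drawn beyond the lines.

```python
import matplotlib.pyplot as plt

# ten ordinary values and one that is very far from them
valores = [10, 12, 13, 15, 16, 18, 20, 21, 23, 25, 90]

result = plt.boxplot(valores)
outliers = result["fliers"][0].get_ydata()  # values drawn as separate points
print(outliers)
plt.show()
```

By default, matplotlib treats a value as an outlier when it is more than 1.5 times the height of the rectangle away from the rectangle, which is why 90 ends up as a lone point above the top line.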
But where are Statistics in this? See: quartiles are the points that divide the sorted data into 4 equal parts. So the rectangle contains two of these parts, since 2/4 = 50%. The bottom line is 0%. The bottom of the rectangle is 25% (1st quartile). The median (the colored line inside the rectangle) represents 50% (2nd quartile). The top of the rectangle represents 75% (3rd quartile). The top line, the maximum, is 100% (the fourth quartile).
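If you want to see the numbers behind the drawing, here is a sketch that computes the quartiles with Python's statistics module (Python 3.8 or newer). The fixed seed and the name inside are my own choices. I'm also assuming that method="inclusive" gives the same quartiles the boxplot draws, which it should, since both interpolate linearly between data points.

```python
import random
import statistics

random.seed(42)  # fixed seed (my choice) so the numbers repeat on every run
vetor = [random.randint(0, 50) for _ in range(100)]

# the three cut points that split the sorted data into 4 parts
q1, median, q3 = statistics.quantiles(vetor, n=4, method="inclusive")

# how many values fall inside the rectangle (between Q1 and Q3)
inside = sum(q1 <= v <= q3 for v in vetor)

print("minimum (0%):", min(vetor))
print("1st quartile (25%):", q1)
print("median (50%):", median)
print("3rd quartile (75%):", q3)
print("maximum (100%):", max(vetor))
print("values inside the rectangle:", inside)
```

The count at the end should be at least half of the 100 values. It can be a bit more than 50, because repeated values that sit exactly on a quartile are counted too.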
To complete the course, there was a case study about Bioinformatics, but for some reason I still haven't completely figured out, my code was producing many errors. I posted the code on StackOverflow and I hope someone, someday, will help me find the error. Also, if you could take a look at it, I would be very thankful. Edit: I've figured it out somehow. The link is now deactivated.
I recommend this course for people who, like me, are starting to learn Python and Data Science things.
If you tried it, please leave your feedback in the comments :)