Intro to SQL for Election Results Analysis
Ken Hudak
Posted on May 19, 2020
This short tutorial will use SQL to perform an analysis to determine the winner of the 2018 Mayoral race in Windsor, Ontario. After all the debates and campaigning the votes have been cast and now it's time to count them up.
I started off by importing the 2018 Election CSV flat file from the City of Windsor's Open Data Catalogue into Microsoft SQL Server, and used SQL Server syntax to go through the steps here in this article.
It's a good idea to start off by conducting some initial data reconnaissance, formally known as exploratory data analysis (EDA), to see what we're dealing with. The table of election information, we'll call it Election2018, consists of 3057 rows made up of the following six columns of data:
Ward - a unique ID for a geographic area in the city
Poll - the voting location ID within a Ward
Poll_Name - name of the voting location
Contest_Title - title of the office the candidate is seeking
Candidate_Name - name of the individual candidates
Total_Votes - how many votes each candidate received at each poll
Part 1: What does the Election2018 table look like using Microsoft SQL Server?
We can use SQL to display a table of information that looks a lot like the familiar format we've grown to love/hate in Excel -- which is all about columns and rows. I'll need to use two key components of SQL to get started: the SELECT statement and the FROM command. With SELECT we'll be telling the database what columns we would like to work with. SELECT is pretty much mandatory, hence it's what we start almost every basic query with.
Using the SELECT command, I could ask the database to show me all of the columns by typing each of their individual names:
SELECT Ward, Poll, Poll_Name, Contest_Title, Candidate_Name, Total_Votes
And I also need to specify the table in which these columns are located. In this example since we're using the Election2018 table, we'll refer to it as that name using the FROM command. End the query with a semicolon. (Not every SQL platform requires the semicolon, but many do, so just make it standard practice to use it.)
So the complete query to look at all the columns and rows from table Election2018 would look like this:
SELECT Ward, Poll, Poll_Name, Contest_Title, Candidate_Name, Total_Votes
FROM Election2018;
While typing in all of the column names may sound like fun, SQL gives us a break from that with the asterisk (*). Using SELECT * means essentially the same thing as HEY, SELECT ALL THE COLUMNS. Much easier for our data analysis phase and it'll save some time.
The revised query looks like:
SELECT *
FROM Election2018;
And here is a small extract of the returned data that we get from the query:
Ward | Poll | Poll_Name | Contest_Title | Candidate_Name | Total_Votes |
---|---|---|---|---|---|
7 | 8 | 7-8 Forest Glade Arena | MAYOR | Drew DILKENS | 468 |
7 | 8 | 7-8 Forest Glade Arena | MAYOR | Ernie LAMONT | 7 |
7 | 8 | 7-8 Forest Glade Arena | MAYOR | Franz (Frank) DYCK | 16 |
7 | 8 | 7-8 Forest Glade Arena | MAYOR | Matt MARCHAND | 195 |
7 | 8 | 7-8 Forest Glade Arena | MAYOR | Tom HENSEL | 10 |
1 | 5 | 1-5 Roseland Golf and Curling Club | COUNCILLOR WARD 1 | Darcie RENAUD | 380 |
1 | 5 | 1-5 Roseland Golf and Curling Club | COUNCILLOR WARD 1 | Fred FRANCIS | 438 |
1 | 5 | 1-5 Roseland Golf and Curling Club | COUNCILLOR WARD 1 | Krysta GLOVASKY-RIDSDALE | 33 |
Are the results as you expected? Kinda. We see Wards and Total Votes and Candidate Names, but it turns out the CSV file Election2018 also contains many more ballot races such as Ward Councillors and Trustees. They sure are important races but we're not concerned with those right now. So let's get rid of them.
Part 2: How do we show just the Mayoral race?
The WHERE clause in SQL works like a filter. We can use it to return only records that we're interested in. In this case it'll be the column Contest_Title and within that column we'd like to only see records returned that are part of the MAYOR contest.
Tack the WHERE clause to the query (always after the FROM statement) and we get a query that looks like this:
SELECT *
FROM Election2018
WHERE Contest_Title = 'MAYOR';
Mayor is in single quotes because it's value is a text string and the quotes communicate that to SQL, otherwise the field could be interpreted as a column, which will produce a "Invalid Column Name" error.
Part 3: Which Mayoral Candidate Won the Most Votes?
Just a couple of more lines of code and we'll know who became the next Mayor way back in 2018! Let's create a tidy table with only specific columns and vote totals from all the polling locations in order to publish the final results.
Start with SELECT. Use the Candidate_Name, Contest_Title, and Total_Votes columns. Name and title are easy, just type those two in separated by commas.
For calculating the total votes we'll need to also include the SUM() function in the SELECT statement. Inside the parenthesis of the function we'll need to enter the name of the numerical field to be summed -- in this case it will be Total_Votes. We'll be asking SQL to look at each polling location and sum up how many votes each candidate received.
Utilizing SUM(Total_Votes) in the SELECT statement is how to approach this. The query will add up all the votes from each Ward and return them in a column named something like 'Column3.' We can customize the name of this new column by using an alias. To give a field an alias just type "as" (no quotes) and then the new column name of your choosing. It'll look like this: SUM(Total_Votes) as TotalMayoralVotes.
Oh, and don't forget that FROM Election2018 command!!
SELECT Candidate_Name, Contest_Title, SUM(Total_Votes) as TotalMayoralVotes
FROM Election2018;
Next we add the WHERE filter described above to get Mayoral candidates only, and below that enter in a GROUP BY statement. GROUP BY will lump together all the votes for individual polling stations for each Mayoral candidate and list the total number of votes in the results next to their name.
SELECT Candidate_Name, Contest_Title, SUM(Total_Votes) as TotalMayoralVotes
FROM Election2018
WHERE Contest_Title = 'Mayor'
GROUP BY Candidate_Name, Contest_Title;
Lastly, I want to see the results in order from who had the most votes in the city to who had the least. Here we'll be using ORDER BY as the last line in the query. SQL returns ORDER BY values in ascending (lowest to highest) order by default, so I'll specify that the TotalMayoralVotes is descending by using DESC in the query (yes, you can ORDER BY on an alias). The final SQL code get stitched together just like this:
SELECT Candidate_Name, Contest_Title, SUM(Total_Votes) as TotalMayoralVotes
FROM Election2018
WHERE Contest_Title = 'Mayor'
GROUP BY Candidate_Name, Contest_Title
ORDER BY TotalMayoralVotes DESC;
And the results are in!
Candidate_Name | Contest_Title | TotalMayoralVotes |
---|---|---|
Drew DILKENS | MAYOR | 31201 |
Matt MARCHAND | MAYOR | 18626 |
Ernie LAMONT | MAYOR | 1121 |
Tom HENSEL | MAYOR | 797 |
Franz (Frank) DYCK | MAYOR | 785 |
Drew Dilkens was re-elected as Windsor's Mayor in 2018 by a 12,500 vote margin over the second place candidate Matt Marchand. Thanks SQL!
Posted on May 19, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.