Outlier Detection in Election Data Using Geospatial Analysis - AKWA IBOM
mwang-cmn
Posted on July 5, 2024
Introduction
The aim of this project is to uncover potential election irregularities so that the electoral commission can ensure the transparency of election results. In this project, I identify outlier polling units whose voting results deviate significantly from those of neighbouring units.
Data Understanding
The dataset used in this analysis represents polling units in the state of Akwa Ibom only. The data used can be found here. I conducted this analysis in Python as follows:
from google.colab import drive, files
drive.mount('/content/drive')
# Import libraries
import pandas as pd
from geopy.geocoders import OpenCage
# Folder on the mounted Drive that holds the CSV file
path = '/content/drive/MyDrive/Colab Notebooks/Nigeria_Elections/'
data = pd.read_csv(path + "AKWA_IBOM_crosschecked.csv")
Here is a summary of the columns in the dataset:
- State: The name of the Nigerian state where the election took place (e.g., “AKWA IBOM”).
- LGA (Local Government Area): The specific local government area within the state (e.g., “ABAK”).
- Ward: The electoral ward within the local government area (e.g., “ABAK URBAN 1”).
- PU-Code (Polling Unit Code): A unique identifier for the polling unit (e.g., “03-05-11-009”).
- PU-Name (Polling Unit Name): The name or location of the polling unit (e.g., “VILLAGE SQUARE, IKOT AKWA EBOM” or “PRY SCH, IKOT OKU UBARA”).
- Accredited Voters: The number of voters accredited to participate in the election at that polling unit.
- Registered Voters: The total number of registered voters in that polling unit.
- Results Found: Indicates whether results were found for this polling unit (usually TRUE or FALSE).
- Transcription Count: The count of how many times the results were transcribed (may be -1 if not applicable).
- Result Sheet Stamped: Indicates whether the result sheet was stamped (TRUE or FALSE).
- Result Sheet Corrected: Indicates whether any corrections were made to the result sheet (TRUE or FALSE).
- Result Sheet Invalid: Indicates whether the result sheet was deemed invalid (TRUE or FALSE).
- Result Sheet Unclear: Indicates whether the result sheet was unclear (TRUE or FALSE).
- Result Sheet Unsigned: Indicates whether the result sheet was unsigned (TRUE or FALSE).
- APC: The number of votes received by the All Progressives Congress (APC) party.
- LP: The number of votes received by the Labour Party (LP).
- PDP: The number of votes received by the People’s Democratic Party (PDP).
- NNPP: The number of votes received by the New Nigeria People’s Party (NNPP).
I then created an Address column by concatenating the polling unit name, ward, local government area and state; this column is used for geocoding:
data['Address'] = data['PU-Name'] + ',' + data['Ward'] + ',' + data['LGA'] + ',' + data['State']
To obtain the Latitude and Longitude columns, I used geocoding.
I generated an API key for the OpenCage Geocoding API and defined a function, geocode_address, that geocodes the new Address column to produce the Latitude and Longitude columns:
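# Create the geocoder; the key below is a placeholder for your own OpenCage API key
geolocator = OpenCage(api_key='YOUR_OPENCAGE_API_KEY')

def geocode_address(Address):
    """Return (latitude, longitude) for an address, or (None, None) on failure."""
    try:
        location = geolocator.geocode(Address)
        return location.latitude, location.longitude
    except Exception:
        # Covers request errors and addresses with no match (location is None)
        return None, None

data[['Latitude', 'Longitude']] = data['Address'].apply(lambda x: pd.Series(geocode_address(x)))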
A quick look at our dataset:
The function works, and I was able to obtain the Latitude and Longitude columns.
As there are still null values in these two columns, I impute them using scikit-learn's SimpleImputer, which replaces the missing values with the column mean.
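Since every call to geocode_address is a network request, repeated addresses need not be looked up more than once. The sketch below shows one way to do that, under stated assumptions: geocode_unique and fake_geocoder are hypothetical names that are not part of the original analysis, and the stub (with made-up coordinates) stands in for the real OpenCage lookup so the example runs offline.

```python
def geocode_unique(addresses, geocode_fn):
    """Geocode each distinct address once and map results back to every row.

    `geocode_fn` is any callable returning a (latitude, longitude) tuple,
    for example the geocode_address function defined above.
    """
    cache = {}
    results = []
    for address in addresses:
        if address not in cache:
            cache[address] = geocode_fn(address)  # only called for new addresses
        results.append(cache[address])
    return results


# Hypothetical stub with made-up coordinates, used instead of a real API call.
calls = []

def fake_geocoder(address):
    calls.append(address)
    return (5.0, 7.8) if "ABAK" in address else (None, None)


sample = [
    "VILLAGE SQUARE,ABAK URBAN 1,ABAK,AKWA IBOM",
    "VILLAGE SQUARE,ABAK URBAN 1,ABAK,AKWA IBOM",
    "UNKNOWN PLACE",
]
coords = geocode_unique(sample, fake_geocoder)
```

Note that this sketch also caches failed lookups, so an address that returns (None, None) is not retried.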
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
data[['Latitude', 'Longitude']] = imputer.fit_transform(data[['Latitude', 'Longitude']])
data.to_csv('AKWA_IBOM_geocode.csv', index = False)
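To make the effect of mean imputation concrete, here is a small plain-Python sketch of what the 'mean' strategy does to a single column. The function name impute_mean and the sample latitudes are illustrative assumptions, not values from the dataset, and raising an error for an all-missing column is a choice made for this sketch only.

```python
def impute_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute a column with no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


latitudes = [4.0, None, 6.0, None]  # made-up values for illustration
filled = impute_mean(latitudes)
```

One consequence worth keeping in mind is that every filled-in polling unit receives the same coordinates, so such units will all fall within any radius of one another.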
Identifying Neighbours
I defined a radius of 1 km to identify which polling units are considered neighbours:
# Calculate distances and find neighbours
from geopy.distance import geodesic

def neighbouring_pu(data, radius=1.0):
    """Map each row index to the indices of the rows within `radius` km."""
    neighbours = {}
    for i, row in data.iterrows():
        neighbours[i] = []
        for j, row2 in data.iterrows():
            if i != j:
                distance = geodesic((row['Latitude'], row['Longitude']), (row2['Latitude'], row2['Longitude'])).km
                if distance <= radius:
                    neighbours[i].append(j)
    return neighbours

neighbours = neighbouring_pu(data, radius=1.0)
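The double loop above calls geodesic once per ordered pair, so each pair of polling units is evaluated twice. As a sketch of an alternative, the version below computes each pair once with the haversine great-circle formula, using only the standard library. It assumes a spherical Earth with a mean radius of 6371 km, so distances can differ slightly from geodesic, which uses an ellipsoidal model. The function names and sample coordinates are illustrative and not part of the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def find_neighbours(points, radius_km=1.0):
    """points: list of (lat, lon) tuples. Returns {index: [neighbour indices]}."""
    neighbours = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):  # visit each unordered pair once
            if haversine_km(*points[i], *points[j]) <= radius_km:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


# Made-up coordinates: the first two points are about 0.56 km apart,
# the third is roughly 11 km away from both.
points = [(5.0000, 7.8000), (5.0050, 7.8000), (5.1000, 7.8000)]
result = find_neighbours(points)
```

This is still quadratic in the number of polling units; it only halves the work and avoids the heavier ellipsoidal computation.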
Outlier Calculation - Score
I will define a function, get_outlier_scores, that calculates the outlier scores for the voting data in this dataset. It does so by comparing the votes each row received for each party (APC, LP, PDP, NNPP) to the average votes received by its neighbouring rows, which are specified in the dictionary neighbours.
For each row, the function computes the absolute difference between the votes in that row and the average votes of its neighbours for each party, and stores these differences as outlier scores. For a row with no neighbours, the neighbour average is taken as 0, so the score equals the row's own vote count. Finally, the function returns a new DataFrame that combines the original voting data with the calculated outlier scores. This allows for the identification of rows whose voting patterns differ significantly from those of their neighbours.
def get_outlier_scores(data, neighbours):
    """Append one <party>_outlier_score column per party to the data."""
    outlier_scores = []
    parties = ['APC', 'LP', 'PDP', 'NNPP']
    for i, row in data.iterrows():
        scores = {}
        for party in parties:
            votes = row[party]
            # Mean votes of the neighbouring units; 0 when there are none
            neighbour_votes = data.loc[neighbours[i], party].mean() if neighbours[i] else 0
            scores[party + '_outlier_score'] = abs(votes - neighbour_votes)
        outlier_scores.append(scores)
    # Reuse the original index so the scores align row by row in concat
    outlier_scores_data = pd.DataFrame(outlier_scores, index=data.index)
    return pd.concat([data, outlier_scores_data], axis=1)

outlier_scores_df = get_outlier_scores(data, neighbours)
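As a worked example of the score, consider a unit with 100 votes whose two neighbours recorded 20 and 40 votes: the neighbour mean is 30, so the score is |100 - 30| = 70. The sketch below reproduces that arithmetic in plain Python. The helper outlier_score and the toy numbers are illustrative and not taken from the dataset.

```python
def outlier_score(votes, neighbour_votes):
    """Absolute gap between a unit's votes and its neighbours' mean.

    Mirrors the rule used above: with no neighbours the mean is taken as 0.
    """
    if neighbour_votes:
        mean = sum(neighbour_votes) / len(neighbour_votes)
    else:
        mean = 0
    return abs(votes - mean)


# Toy data: votes per unit and each unit's neighbour list (made up)
toy_votes = {0: 100, 1: 20, 2: 40, 3: 15}
toy_neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}

scores = {
    i: outlier_score(toy_votes[i], [toy_votes[j] for j in toy_neighbours[i]])
    for i in toy_votes
}
```

Unit 3 has no neighbours, so its score is simply its own vote count, which is worth remembering when reading high scores for isolated polling units.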
Sorting and Reporting
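I sorted the data by outlier score and compiled the report below, which lists the 'PU-Code', number of votes, and outlier score of five polling units for each party. Note that the same five polling units, ordered by APC outlier score, appear in all four tables, so the LP, PDP and NNPP tables are not ranked by their own scores.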
All Progressives Congress (APC)
PU-Code | APC | APC_outlier_score |
---|---|---|
03-05-11-009 | 324 | 228.52 |
03-29-05-013 | 194 | 167.334 |
03-30-07-001 | 180 | 153.325 |
03-05-09-014 | 194 | 152.149 |
03-28-05-003 | 180 | 138.132 |
Labour Party (LP)
PU-Code | LP | LP_outlier_score |
---|---|---|
03-05-11-009 | 59 | 45.451 |
03-29-05-013 | 42 | 6.65894 |
03-30-07-001 | 29 | 6.34942 |
03-05-09-014 | 3 | 26.5831 |
03-28-05-003 | 91 | 61.5261 |
People’s Democratic Party (PDP)
PU-Code | PDP | PDP_outlier_score |
---|---|---|
03-05-11-009 | 7 | 27.3627 |
03-29-05-013 | 181 | 145.232 |
03-30-07-001 | 17 | 18.8739 |
03-05-09-014 | 36 | 24.2221 |
03-28-05-003 | 12 | 48.2519 |
New Nigeria People’s Party (NNPP)
PU-Code | NNPP | NNPP_outlier_score |
---|---|---|
03-05-11-009 | 0 | 0.27451 |
03-29-05-013 | 6 | 4.14865 |
03-30-07-001 | 0 | 1.85521 |
03-05-09-014 | 0 | 2.36104 |
03-28-05-003 | 0 | 2.36104 |
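The sorting step itself is not shown in code. One way it could be done is sketched below with the standard library's heapq.nlargest on a list of row dictionaries; with a DataFrame, pandas' DataFrame.nlargest offers the same kind of selection. The helper name top_outliers and the sample rows are hypothetical and do not come from the dataset.

```python
import heapq


def top_outliers(rows, party, n=5):
    """Return (PU-Code, votes, score) for the n highest scores of one party."""
    score_col = party + '_outlier_score'
    top = heapq.nlargest(n, rows, key=lambda r: r[score_col])
    return [(r['PU-Code'], r[party], r[score_col]) for r in top]


# Made-up rows with placeholder polling unit codes
rows = [
    {'PU-Code': 'PU-A', 'APC': 10, 'APC_outlier_score': 2.5},
    {'PU-Code': 'PU-B', 'APC': 90, 'APC_outlier_score': 60.0},
    {'PU-Code': 'PU-C', 'APC': 40, 'APC_outlier_score': 12.0},
]
report = top_outliers(rows, 'APC', n=2)
```

Calling the helper once per party would give each party its own ranking, rather than reusing one party's ordering for all tables.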
Visualize the neighbours
I generated scatterplots to visualize the geographical distribution of polling units based on their outlier scores for the four political parties (APC, LP, PDP, NNPP).
Each point represents a polling unit plotted by its latitude and longitude.
Each plot provides a clear visual representation of how the outlier scores are geographically distributed, making it easier to identify patterns or anomalies in the data.
import matplotlib.pyplot as plt
import seaborn as sns

parties = ['APC', 'LP', 'PDP', 'NNPP']
for party in parties:
    plt.figure(figsize=(10, 6))
    # Longitude on the x-axis and latitude on the y-axis, as on a map
    sns.scatterplot(data=outlier_scores_df, x='Longitude', y='Latitude', hue=party + '_outlier_score', palette='viridis')
    plt.title(f'Polling Units by {party} Outlier Score')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.legend(title=party + ' Outlier Score')
    plt.savefig(f'polling_units_{party}_outlier_score.png')
    plt.show()
Deliverables