Introduction

The aim of this project is to uncover potential election irregularities so that the electoral commission can ensure the transparency of election results. In this project, I identify outlier polling units whose voting results deviate significantly from those of neighbouring units.

Data Understanding

The dataset used in this analysis represents polling units in the state of Akwa Ibom only. The data used can be found here. I conducted the analysis in Python as follows:

# Mount Google Drive (Colab only)
from google.colab import drive, files
drive.mount('/content/drive')

# Import libraries
import pandas as pd
from geopy.geocoders import OpenCage

# Folder that holds the dataset
path = '/content/drive/MyDrive/Colab Notebooks/Nigeria_Elections/'
data = pd.read_csv(path + "AKWA_IBOM_crosschecked.csv")

Here is a summary of the columns in the dataset:

State: The name of the Nigerian state where the election took place (e.g., “AKWA IBOM”).
LGA (Local Government Area): The specific local government area within the state (e.g., “ABAK”).
Ward: The electoral ward within the local government area (e.g., “ABAK URBAN 1”).
PU-Code (Polling Unit Code): A unique identifier for the polling unit (e.g., “3/1/2001 0:00”).
PU-Name (Polling Unit Name): The name or location of the polling unit (e.g., “VILLAGE SQUARE, IKOT AKWA EBOM” or “PRY SCH, IKOT OKU UBARA”).
Accredited Voters: The number of voters accredited to participate in the election at that polling unit.
Registered Voters: The total number of registered voters in that polling unit.
Results Found: Indicates whether results were found for this polling unit (usually TRUE or FALSE).
Transcription Count: The count of how many times the results were transcribed (may be -1 if not applicable).
Result Sheet Stamped: Indicates whether the result sheet was stamped (TRUE or FALSE).
Result Sheet Corrected: Indicates whether any corrections were made to the result sheet (TRUE or FALSE).
Result Sheet Invalid: Indicates whether the result sheet was deemed invalid (TRUE or FALSE).
Result Sheet Unclear: Indicates whether the result sheet was unclear (TRUE or FALSE).
Result Sheet Unsigned: Indicates whether the result sheet was unsigned (TRUE or FALSE).
APC: The number of votes received by the All Progressives Congress (APC) party.
LP: The number of votes received by the Labour Party (LP).
PDP: The number of votes received by the People’s Democratic Party (PDP).
NNPP: The number of votes received by the New Nigeria People’s Party (NNPP).

I then created the Address column by concatenating the polling unit name, ward, local government area and state, which will be useful during geocoding:

data['Address'] = data['PU-Name'] + ',' + data['Ward'] + ',' + data['LGA'] + ',' + data['State']

To obtain the Latitude and Longitude columns, I used geocoding.
I generated an API key for the OpenCage Geocoding API and defined a function, geocode_address, that geocodes the new Address column:

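# Replace the placeholder with your own OpenCage API key
geolocator = OpenCage(api_key='YOUR_OPENCAGE_API_KEY')

def geocode_address(address):
  # Return (latitude, longitude), or (None, None) when there is no match or the request fails
  try:
    location = geolocator.geocode(address)
    return location.latitude, location.longitude
  except Exception:
    return None, None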

data[['Latitude', 'Longitude']] = data['Address'].apply(lambda x: pd.Series(geocode_address(x)))
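
Some full addresses will not resolve, which is where the null coordinates come from. One way to reduce them, offered here as a sketch rather than something that was done in this analysis, is to retry with a coarser address (ward, then LGA, then state) when the full one fails. The helper name geocode_with_fallback, the fake_geocode stand-in and its coordinates are all hypothetical; the stand-in only exists so that the example runs offline.

```python
from typing import Callable, Optional, Sequence, Tuple

Coordinates = Tuple[Optional[float], Optional[float]]


def geocode_with_fallback(
    parts: Sequence[str],
    geocode_fn: Callable[[str], Coordinates],
) -> Coordinates:
    """Try the full address first, then drop the most specific part each time.

    `parts` is ordered from most to least specific, e.g.
    [PU-Name, Ward, LGA, State]. `geocode_fn` is any function that returns
    (latitude, longitude), or (None, None) when the lookup fails.
    """
    for start in range(len(parts)):
        address = ",".join(parts[start:])
        latitude, longitude = geocode_fn(address)
        if latitude is not None and longitude is not None:
            return latitude, longitude
    return None, None


# Stand-in for the real geocoder (hypothetical lookup table, made-up coordinates).
def fake_geocode(address: str) -> Coordinates:
    known = {"ABAK,AKWA IBOM": (4.98, 7.79)}
    return known.get(address, (None, None))


result = geocode_with_fallback(
    ["VILLAGE SQUARE, IKOT AKWA EBOM", "ABAK URBAN 1", "ABAK", "AKWA IBOM"],
    fake_geocode,
)
print(result)
```

In the notebook, geocode_address could be passed as geocode_fn. A coarser match places the unit at the centre of its ward or LGA, so it is less precise than a full-address match, but it keeps the unit in roughly the right area.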

A quick look at our dataset:

The function works: the Latitude and Longitude columns are now populated.
As there are still null values in these two columns, I will impute them using scikit-learn's SimpleImputer, which replaces the missing values with the column mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
data[['Latitude', 'Longitude']] = imputer.fit_transform(data[['Latitude', 'Longitude']])
data.to_csv('AKWA_IBOM_geocode.csv', index = False)
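
One thing to keep in mind with mean imputation is that every unit without coordinates ends up at the same point, the mean of all geocoded units, so those units will all count as neighbours of each other in the next step. A possible alternative, sketched below and not what was used in this analysis, is to fill a missing coordinate with the mean of the other units in the same LGA and to keep a flag recording which rows were filled. The function name impute_by_group, the Geocode_Missing flag and the sample rows are assumptions made for illustration, and the sketch uses plain dictionaries so that it runs without the dataset.

```python
from collections import defaultdict
from statistics import mean


def impute_by_group(rows, group_key="LGA", fields=("Latitude", "Longitude")):
    """Fill missing coordinates with the mean of rows sharing the same group.

    Rows are plain dicts and a missing value is represented by None. Groups
    with no known coordinates at all are left unfilled.
    """
    # Collect the known values per group and per field.
    known = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for field in fields:
            if row[field] is not None:
                known[row[group_key]][field].append(row[field])

    filled_rows = []
    for row in rows:
        filled = dict(row)
        # Record whether the original row lacked any coordinate.
        filled["Geocode_Missing"] = any(row[f] is None for f in fields)
        for field in fields:
            values = known[row[group_key]][field]
            if filled[field] is None and values:
                filled[field] = mean(values)
        filled_rows.append(filled)
    return filled_rows


# Hypothetical rows, invented for illustration only.
sample = [
    {"PU-Code": "PU-A", "LGA": "ABAK", "Latitude": 5.0, "Longitude": 7.5},
    {"PU-Code": "PU-B", "LGA": "ABAK", "Latitude": 5.5, "Longitude": 8.0},
    {"PU-Code": "PU-C", "LGA": "ABAK", "Latitude": None, "Longitude": None},
    {"PU-Code": "PU-D", "LGA": "UYO", "Latitude": 5.0, "Longitude": 7.9},
]
filled_rows = impute_by_group(sample)
print(filled_rows[2])
```

Keeping the flag makes it possible to exclude or separately review imputed units later, since their neighbour sets reflect the imputation rather than their true location.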

Identifying Neighbours

I defined a radius of 1 km to identify which polling units are considered neighbours:

# Calculate distances and find neighbours
from geopy.distance import geodesic

def neighbouring_pu(data, radius=1.0):
  # Map each row index to the indices of all rows within `radius` km
  neighbours = {}
  for i, row in data.iterrows():
    neighbours[i] = []
    for j, row2 in data.iterrows():
      if i != j:
        distance = geodesic((row['Latitude'], row['Longitude']), (row2['Latitude'], row2['Longitude'])).km
        if distance <= radius:
          neighbours[i].append(j)
  return neighbours

neighbours = neighbouring_pu(data, radius =1.0)
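
The double loop above measures every pair of units twice, once in each direction. Because distance is symmetric, each pair only needs to be measured once. The sketch below illustrates that idea with plain Python; it is not the code used in this analysis. It swaps geodesic for the haversine formula, which treats the Earth as a sphere and so gives slightly different distances (typically well under one percent). The names haversine_km and find_neighbours and the sample points are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def find_neighbours(coords, radius_km=1.0):
    """Map each position in `coords` to the positions within radius_km.

    Each pair is measured once and recorded in both directions, which halves
    the number of distance calculations compared with a full double loop.
    """
    neighbours = {i: [] for i in range(len(coords))}
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if haversine_km(a[0], a[1], b[0], b[1]) <= radius_km:
            neighbours[i].append(j)
            neighbours[j].append(i)
    return neighbours


# Made-up (latitude, longitude) pairs: the first two are about 0.56 km apart,
# the third is about 11 km away from both.
points = [(5.000, 7.900), (5.005, 7.900), (5.100, 7.900)]
result = find_neighbours(points)
print(result)
```

This is still quadratic in the number of polling units, so for much larger datasets a spatial index would be the more usual choice.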

Outlier Calculation - Score
I will define a function, get_outlier_scores, that calculates outlier scores for the voting data. It compares the votes recorded in each row for each party (APC, LP, PDP, NNPP) with the average votes recorded in its neighbouring rows, which are specified in the neighbours dictionary.
For each row, the function computes the absolute difference between the row's votes and the average votes of its neighbours for each party, and stores these differences as outlier scores. A row with no neighbours is compared against 0, so its score equals its own vote count. Finally, the function returns a new DataFrame that combines the original voting data with the calculated outlier scores, which makes it possible to identify rows whose voting patterns differ significantly from those of their neighbours.

def get_outlier_scores(data, neighbours):
  outlier_scores = []
  parties = ['APC', 'LP', 'PDP', 'NNPP']
  for i, row in data.iterrows():
    scores = {}
    for party in parties:
      votes = row[party]
      # Mean votes across neighbours; 0 when the unit has no neighbours
      neighbour_votes = data.loc[neighbours[i], party].mean() if neighbours[i] else 0
      scores[party + '_outlier_score'] = abs(votes - neighbour_votes)
    outlier_scores.append(scores)
  # Build the score table once, aligned to the original row index
  outlier_scores_data = pd.DataFrame(outlier_scores, index=data.index)
  return pd.concat([data, outlier_scores_data], axis=1)

outlier_scores_df = get_outlier_scores(data, neighbours)
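
To make the score concrete, here is a small worked example with made-up numbers. It restates the per-party calculation from get_outlier_scores in plain Python; the helper name outlier_score is hypothetical and exists only for this illustration.

```python
from statistics import mean


def outlier_score(votes, neighbour_votes):
    """Absolute gap between a unit's votes and its neighbours' mean votes.

    With no neighbours the mean is taken as 0, matching the fallback in
    get_outlier_scores, so the score equals the unit's own vote count.
    """
    neighbour_mean = mean(neighbour_votes) if neighbour_votes else 0
    return abs(votes - neighbour_mean)


# A unit with 100 votes whose three neighbours recorded 10, 20 and 30 votes:
# the neighbour mean is 20, so the score is |100 - 20|.
with_neighbours = outlier_score(100, [10, 20, 30])

# A unit with 15 votes and no neighbours within the radius.
isolated = outlier_score(15, [])

print(with_neighbours, isolated)
```

The second case shows a side effect of the fallback: an isolated unit with many votes receives a high score even though there is nothing to compare it with, so it may be worth reviewing units without neighbours separately.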

Sorting and Reporting
I sorted the data by the outlier scores for each party and obtained the following detailed report that includes the top five outliers for each party, with the 'PU-Code', number of votes, and the outlier score.

All Progressives Congress (APC)

PU-Code	APC	APC_outlier_score
03-05-11-009	324	228.52
03-29-05-013	194	167.334
03-30-07-001	180	153.325
03-05-09-014	194	152.149
03-28-05-003	180	138.132

Labour Party (LP)

PU-Code	LP	LP_outlier_score
03-05-11-009	59	45.451
03-29-05-013	42	6.65894
03-30-07-001	29	6.34942
03-05-09-014	3	26.5831
03-28-05-003	91	61.5261

People’s Democratic Party (PDP)

PU-Code	PDP	PDP_outlier_score
03-05-11-009	7	27.3627
03-29-05-013	181	145.232
03-30-07-001	17	18.8739
03-05-09-014	36	24.2221
03-28-05-003	12	48.2519

New Nigeria People’s Party (NNPP)

PU-Code	NNPP	NNPP_outlier_score
03-05-11-009	0	0.27451
03-29-05-013	6	4.14865
03-30-07-001	0	1.85521
03-05-09-014	0	2.36104
03-28-05-003	0	2.36104
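
The sorting step is not shown above, so here is a minimal sketch of how a per-party top-five report could be produced. It is not the code used for the tables above. It works on plain dictionaries so that it runs without the dataset; the helper name top_outliers and the sample rows are invented for illustration.

```python
def top_outliers(rows, party, n=5):
    """Return the n rows with the largest outlier score for one party."""
    score_key = f"{party}_outlier_score"
    ranked = sorted(rows, key=lambda row: row[score_key], reverse=True)
    return [
        {"PU-Code": row["PU-Code"], party: row[party], score_key: row[score_key]}
        for row in ranked[:n]
    ]


# Hypothetical rows, invented for illustration only.
rows = [
    {"PU-Code": "PU-A", "APC": 50, "APC_outlier_score": 12.5},
    {"PU-Code": "PU-B", "APC": 300, "APC_outlier_score": 210.0},
    {"PU-Code": "PU-C", "APC": 90, "APC_outlier_score": 47.25},
]
report = top_outliers(rows, "APC", n=2)
for entry in report:
    print(entry)
```

With the DataFrame from the previous step, a call such as outlier_scores_df.nlargest(5, 'APC_outlier_score') should give a similar result. Ranking each party on its own score column can surface different polling units for each party.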

Visualize the neighbours

I generated scatter plots to visualize the geographical distribution of polling units by outlier score for each of the four political parties (APC, LP, PDP, NNPP).
Each point represents a polling unit plotted by its latitude and longitude and coloured by its outlier score.
Each plot shows how the outlier scores are distributed geographically, making it easier to spot patterns or anomalies in the data.

import matplotlib.pyplot as plt
import seaborn as sns

parties = ['APC', 'LP', 'PDP', 'NNPP']
for party in parties:
  plt.figure(figsize=(10, 6))
  # Longitude on the x-axis and latitude on the y-axis, so the plot is oriented like a map
  sns.scatterplot(data=outlier_scores_df, x='Longitude', y='Latitude', hue=party + '_outlier_score', palette='viridis')
  plt.title(f'Polling Units by {party} Outlier Score')
  plt.xlabel('Longitude')
  plt.ylabel('Latitude')
  plt.legend(title=party + ' Outlier Score')
  plt.savefig(f'polling_units_{party}_outlier_score.png')
  plt.show()