Despite remarkable advances in technology, basic needs such as access to clean drinking water remain among the most pressing problems facing humanity. In some parts of the world, finding clean water, pumping it up and transporting it to people are genuinely difficult tasks. Tanzania is the largest country in East Africa, with a population of 59 million. Of these, 25 million lack access to clean water and 40 million lack access to improved sanitation. The Tanzanian Water Ministry partnered with Taarifa to launch a competition, hosted by DrivenData, aimed at tackling this problem by improving clean water sources. I worked on this problem with Mark Subra as our Module 3 project for the Flatiron School Data Science Bootcamp. I chose this project because I am interested in solving the major problems that concern humanity. I have done NGO work for many years to help people. I also have two close friends who live in Tanzania and often tell me about the country's water shortage, so this type of problem matters to me.

The water ministry divided water points into three classes: functional, non-functional, and functional but needs repair. Our aim in this project is to build a model that predicts the functionality of water points. The data comes from DrivenData. There are four files: the submission format, the training set, the test set, and the training labels, which contain the status of each well. Using the training set and its labels, competitors are asked to build a predictive model, apply it to the test set to determine the status of the wells, and submit the predictions. The training set contains 59400 water points with 40 features.

Challenges of This Project

Importance of Data Cleaning

There are two main challenges in this data. The first is cleaning it. Many columns carry the same information, which causes multicollinearity in the model. There are also many null, zero and otherwise missing values. Most features are categorical, and some have more than 2000 unique values. Spelling mistakes in some columns inflate the number of unique values further. Lastly, some columns hold discrete values. We therefore dropped columns that duplicated information, and either replaced null and missing values with the mean or grouped them into an "unknown" category. For feature engineering, we created new columns from some features and manually regrouped their categories. The cleaning took a long time, but it reminded us how much data cleaning matters: our first baseline, a simple logistic regression on the binary target, reached a ROC-AUC score of 0.83.
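To make the cleaning steps more concrete, here is a minimal sketch of two of them: mapping missing or placeholder values to an "unknown" category, and collapsing rarely seen categories so that a column does not end up with thousands of unique values. The function names, the sample funder values, the list of placeholders and the threshold are all assumptions made for illustration, not the code from our project. The sketch only merges case and whitespace variants; genuine misspellings would still need manual mapping.

```python
from collections import Counter

def clean_category(value):
    """Normalise one raw categorical value; missing entries become 'unknown'."""
    if value is None:
        return "unknown"
    text = str(value).strip().lower()
    # Treat empty strings and zero-like placeholders as missing (assumed list).
    if text in {"", "0", "none", "nan"}:
        return "unknown"
    return text

def group_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'other'."""
    counts = Counter(values)
    return [v if v == "unknown" or counts[v] >= min_count else "other"
            for v in values]

# Hypothetical raw funder values with case variants and gaps.
raw_funders = ["Danida", "danida ", "DANIDA", None, "0", "Rwssp", "rwssp", "Unicef"]
cleaned = [clean_category(v) for v in raw_funders]
grouped = group_rare(cleaned, min_count=2)
print(grouped)
```

Keeping "unknown" as its own category, instead of dropping those rows, lets the model learn whether missing information is itself related to well status.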

Imbalanced Ternary Class Problem

Our data has three highly imbalanced target labels, and all three need to be predicted correctly, so we want balanced results for each label in the confusion matrix. To simplify the problem, we first merged the functional and functional but needs repair wells into one class and looked for the best model for the binary target. This showed us which modeling approach suits the data and how to set the model parameters. When we moved to the three-class target, we faced the imbalance problem: one of the target labels has far fewer samples than the others. To address this, we used the oversampling technique SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates synthetic samples of the minority class by interpolating between a sample and its nearest neighbors. It is one of the most popular oversampling techniques.
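The idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration of the nearest-neighbor interpolation step only, not the implementation we used; in practice a library version such as the SMOTE class in imbalanced-learn would normally be preferred. The function name `smote_sketch`, its parameters and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority-class samples, e.g. two scaled features of
# "functional but needs repair" wells.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a line segment between two real minority samples, the new data stays inside the region the minority class already occupies instead of simply duplicating existing rows.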

Models and Metric

To prepare the data for machine learning, we applied encoding and scaling, which also helped with the first challenge. For the ternary target model, we used Target Encoder and Robust Scaler, and tried Random Forest, LGBM and XGBoost. To handle the second challenge, the imbalanced target, we applied SMOTE oversampling. The competition metric was balanced accuracy, so we used it as our main metric and occasionally double-checked results with the ROC-AUC score.
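To show why balanced accuracy suits an imbalanced target, here is a small sketch that computes it by hand as the mean of per-class recall and compares it with plain accuracy. The labels below are made up for illustration; scikit-learn offers the same metric as `balanced_accuracy_score`, and the helper here is only a stand-in to show the arithmetic.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)

# Hypothetical labels: the "repair" class is rare, as in the well data.
y_true = ["functional"] * 6 + ["non functional"] * 3 + ["repair"]
y_pred = ["functional"] * 6 + ["non functional"] * 3 + ["functional"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 9 of 10 correct
print(balanced_accuracy(y_true, y_pred))  # (1 + 1 + 0) / 3
```

A model that never predicts the rare class still gets 9 of 10 predictions right in this toy example, but its balanced accuracy drops to two thirds because the recall for the rare class is zero.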

Explorations of Problems for Solution

The graph shows that many wells have enough water but are non-functional. We also observed that 4272 wells have dried up even though their water quality is good. If a way is found to resupply these wells, they can become functional again. Finding clean water sources is not the only problem; keeping those sources fed is equally important. 2226 wells (7%) have enough soft water but need repair. Authorities must invest in repairs, or these wells will become non-functional. 8035 wells (27%) have enough good-quality water but are non-functional. This shows that authorities must invest in the technology needed to pump from these good sources.
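Counts like these come from cross-tabulating water quantity, water quality and well status. The sketch below shows the idea on a handful of made-up rows. The field values and the choice of denominator for the percentage are assumptions for illustration, and the real column names and categories in the dataset may differ.

```python
from collections import Counter

# Hypothetical rows of (quantity, water_quality, status).
wells = [
    ("enough", "soft", "functional"),
    ("enough", "soft", "non functional"),
    ("enough", "soft", "functional needs repair"),
    ("dry", "soft", "non functional"),
    ("enough", "salty", "functional"),
    ("enough", "soft", "non functional"),
]

# Count wells per (quantity, quality, status) combination.
counts = Counter(wells)

# Keep only wells with enough soft water, split by status.
good_source = {status: n for (qty, qual, status), n in counts.items()
               if qty == "enough" and qual == "soft"}

# Share of each status among wells with enough soft water (assumed denominator).
subset_total = sum(good_source.values())
for status, n in sorted(good_source.items()):
    print(f"{status}: {n} wells ({n / subset_total:.0%})")
```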

3500 water wells need repair; otherwise they could easily become non-functional.

This graph shows the funders with the highest ratio of functional to non-functional wells. Danida, the cooperation between Tanzania and Denmark on wells, has many functional wells. RWSSP is the Rural Water Supply and Sanitation Program. Most of the wells funded by Germany Republic are also functional.

Most of the wells funded by the government are non-functional. Most of the water points installed by the central government and district councils are also non-functional.
Dar es Salaam is one of the most populous cities, yet 35% of its water points with good water quality are non-functional.
Iringa is an important area, but it has many non-functional water points with soft water.

The most common extraction type is gravity, followed by hand pumps. Many non-functional water points use gravity as their extraction type. Because gravity is a natural force that requires no expensive equipment, these wells should not need much investment, so many of them could be made functional again easily.
Wells constructed in recent years are more likely to be functional than older ones. Recent years also show some functional but needs repair wells. If these are not repaired soon, they could easily become non-functional.

This map shows the locations of functional but needs repair wells. There are some clusters around highly populated areas. With regular maintenance of these wells, more people could have access to clean water.
The water basin is another important factor in the functionality of wells. Areas close to a good water basin have a higher probability of having clean water.

Wells with no fee are more likely to be non-functional, and wells with some form of payment are more likely to be functional.

Solutions

Our model predicts the functionality of water wells with 86% accuracy. With good predictions of functionality, possible solutions include:

prioritizing functional wells that need repair and yield clean water
targeting repairs at clusters of wells, especially those in highly populated areas
introducing payments of some kind to provide an incentive to keep wells functional
allocating funds and resources to effective organizations with a track record

All the details of the cleaning process, data preprocessing and modeling, along with further explorations, solutions and future improvements, can be found in this GitHub repo.

Pump image courtesy of Flickr user christophercjensen; GIF from Giphy.

Blog

Pump it Up: Data Mining the Water Table

ezgigm