Tabular Data Binary Classification: All Tips and Tricks from 5 Kaggle Competitions

kamil_k7k

Kamil A. Kaczmarek

Posted on July 22, 2020

This article was originally written by Shahul Es and posted on the Neptune blog.


In this article, I will discuss some great tips and tricks to improve the performance of your structured data binary classification model. These tricks come from solutions to some of Kaggle’s top tabular data competitions. Without further ado, let’s begin.

These are the five competitions that I have gone through to create this article:

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB or more for Kaggle kernels and more basic laptops), you could find it difficult to load and process with limited resources. Here are links to some of the articles and kernels that I have found useful in such situations.
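One common way to fit a large table into limited memory is to downcast numeric columns to smaller dtypes. The sketch below is only an illustration, not necessarily what the linked kernels do: the `reduce_memory` helper and the toy DataFrame are hypothetical, and downcasting floats to `float32` trades some precision for memory.

```python
import numpy as np
import pandas as pd


def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Picks the smallest integer dtype that can hold every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Downcasts to float32 when the values survive the conversion.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical toy frame standing in for a large competition file.
df = pd.DataFrame({
    "user_id": np.arange(1000, dtype="int64"),
    "amount": np.linspace(0.0, 1.0, 1000),
})
small = reduce_memory(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"memory: {before} bytes -> {after} bytes")
```

For files that do not fit in memory at all, reading them in pieces with the `chunksize` argument of `pd.read_csv` is another option worth considering.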

Data exploration

Data exploration always helps you understand the data better and gain insights from it. Before starting to develop machine learning models, top competitors always read and do a lot of exploratory data analysis. This helps with feature engineering and data cleaning.

Data preparation

After data exploration, the first thing to do is to use those insights to prepare the data and tackle issues like class imbalance, encoding categorical data, etc. Let’s see the methods used to do it.
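As a small illustration of the two issues named above, the sketch below frequency-encodes a categorical column and computes "balanced" class weights by hand. These are only two of many possible techniques and are not necessarily the ones used in the competitions; the column names and data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and an imbalanced binary target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "C", "A", "B", "A", "C"],
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Frequency encoding: replace each category with its share of rows.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# "Balanced" class weights: n_samples / (n_classes * class_count),
# so the rare class gets a proportionally larger weight.
counts = df["target"].value_counts()
class_weight = (len(df) / (len(counts) * counts)).to_dict()
df["sample_weight"] = df["target"].map(class_weight)

print(df)
```

The resulting `sample_weight` column can be passed to estimators that accept per-row weights when fitting, which is one alternative to resampling the data.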

Feature engineering

Next, you can check the most popular features and feature engineering techniques used in these top Kaggle competitions. The feature engineering part varies from problem to problem depending on the domain.
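Since the right features depend on the domain, the following is only a generic sketch of one widely used pattern: group-level aggregation features. The `customer` and `amount` columns are hypothetical and stand in for whatever entity and measurement your dataset has.

```python
import pandas as pd

# Hypothetical transactions table.
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
})

grp = df.groupby("customer")["amount"]

# Broadcast per-customer statistics back onto every row.
df["amount_mean_by_customer"] = grp.transform("mean")
df["amount_count_by_customer"] = grp.transform("count")

# Relative feature: how unusual is this row for its customer?
df["amount_ratio_to_mean"] = df["amount"] / df["amount_mean_by_customer"]

print(df)
```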

Feature selection

After generating many features from your data, you need to decide which features to use to get the maximum performance out of your model. This step also includes identifying the impact each feature has on your model. Let’s see some of the most popular feature selection methods.
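One way to measure the impact of each feature is permutation importance: shuffle one column of the validation data and see how much the score drops. The sketch below uses scikit-learn on synthetic data, with a random forest chosen purely for illustration; it is an example of the idea, not a method taken from a specific competition solution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the first column determines the label, the second is noise.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Score drop on held-out data when each column is shuffled in turn.
result = permutation_importance(
    model, X_valid, y_valid, scoring="roc_auc", n_repeats=5, random_state=0
)
importances = result.importances_mean
for name, value in zip(["signal", "noise"], importances):
    print(f"{name}: {value:.3f}")
```

Features whose importance stays near zero across repeats are candidates for removal, though it is worth confirming with cross-validation that dropping them does not hurt the score.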

Modeling

After handcrafting and selecting your features, you should choose the right machine learning algorithm to make your predictions. This is a collection of some of the most used ML models in structured data classification challenges.

Hyperparameter tuning

Evaluation

Choosing a suitable validation strategy is very important to avoid huge shake-ups or poor performance of the model in the private test set.

The traditional 80:20 split doesn’t work in many cases. Cross-validation usually estimates model performance more reliably than a single train-validation split.

There are different variations of k-fold cross-validation, such as group k-fold, and you should choose the one that matches the structure of your data.
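To make the difference between the variations concrete, here is a minimal sketch using scikit-learn's `GroupKFold` and `StratifiedKFold` on hypothetical data. It assumes rows belong to groups (for example, several rows per customer) that should never be split between training and validation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Hypothetical data: 12 rows, balanced binary target, 4 groups of 3 rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.array([0, 1] * 6)
groups = np.repeat(np.arange(4), 3)

# Group k-fold: all rows of a group land on the same side of each split.
gkf = GroupKFold(n_splits=4)
group_folds = list(gkf.split(X, y, groups=groups))
for train_idx, valid_idx in group_folds:
    print("validation groups:", sorted(set(groups[valid_idx])))

# Stratified k-fold: each fold keeps the class ratio of the full dataset.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
strat_folds = list(skf.split(X, y))
for train_idx, valid_idx in strat_folds:
    print("validation positives:", int(y[valid_idx].sum()), "of", len(valid_idx))
```

Group k-fold is a reasonable default when the test set contains entities not seen in training, while stratified k-fold helps when the positive class is rare.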

Note:

There are various metrics that you can use to evaluate the performance of your tabular models. A bunch of useful classification metrics are listed and explained here.

Other training tricks

Ensemble

In a competitive environment, you won’t get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models.

Let’s see some of the popular ensembling techniques used in Kaggle competitions:
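As one simple illustration, rank averaging blends models whose predicted probabilities sit on different scales, which suits rank-based metrics such as AUC. The sketch below is an assumption-laden example rather than a specific winning solution: the `rank_average` helper and the two prediction vectors are hypothetical.

```python
import numpy as np


def rank_average(preds):
    """Average per-model ranks, each scaled to the range [0, 1]."""
    scaled = []
    for p in preds:
        p = np.asarray(p, dtype=float)
        # Double argsort gives the 0-based rank of each prediction
        # (ties are broken arbitrarily).
        ranks = p.argsort().argsort()
        scaled.append(ranks / (len(p) - 1))
    return np.mean(scaled, axis=0)


# Hypothetical predictions from two models on the same four rows.
model_a = np.array([0.10, 0.40, 0.35, 0.80])
model_b = np.array([0.02, 0.03, 0.01, 0.90])

blend = rank_average([model_a, model_b])
print(blend)
```

Because only the ordering of each model's predictions is used, a model with poorly calibrated probabilities cannot dominate the blend simply by producing more extreme values.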

Final thoughts

In this article, you saw many popular and effective ways to improve the performance of your tabular data binary classification model. Hopefully, you will find them useful in your projects.


This article was originally written by Shahul Es and posted on the Neptune blog, where you can find more in-depth articles for machine learning practitioners.
