Machine Learning and Tabular Data

Introduction

Machine learning can be quite simple and quite complicated at the same time. Neural networks are a fundamental part of machine learning. They are a series of algorithms that work together to identify underlying patterns in data. They loosely mimic the way neurons in the human brain work, which is where the name comes from.

Data comes in various forms: structured, semi-structured, and unstructured. Structured data can be found in spreadsheets and similar sources, while unstructured data exists in log files, images, audio, and so on. Tabular data is a form of structured data. It is organized into rows and columns, where each row contains information about something.

The Problems

Using neural networks on tabular data has not always been ideal. Models built with neural networks typically perform worse on tabular data than traditional machine learning models. Tabular data typically does not have the highly non-linear relationships found in image recognition and NLP datasets, so there is less information for neural networks to capitalize on to increase their performance.

Data quality is another major concern with tabular data. There are often outliers and missing values. It is also difficult to find spatial correlations between the variables in tabular datasets, which means that methods like Convolutional Neural Networks, which rely on such correlations, are a poor fit for tabular data. Another important problem is the conversion of categorical attributes. This is usually done with one-hot encoding, but that creates one new column per category and so increases the dimensionality of the data. Data augmentation is a very important part of machine learning because it can help a model become more accurate, but it is very challenging to apply to tabular data. All of these combine to show the complexity of using neural networks with tabular data.
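To make the dimensionality point concrete, here is a minimal sketch of one-hot encoding in plain Python. The table, the column names, and the `one_hot_encode` helper are all made up for illustration. In practice a library encoder would normally be used instead.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with one 0/1 column per distinct value."""
    # Collect the distinct categories seen in this column.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category.
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical toy table: one numeric column and one categorical column.
rows = [
    {"age": 31, "city": "Lagos"},
    {"age": 45, "city": "Abuja"},
    {"age": 27, "city": "Kano"},
    {"age": 52, "city": "Lagos"},
]

encoded = one_hot_encode(rows, "city")
print(len(rows[0]), "columns before,", len(encoded[0]), "columns after")
print(encoded[0])
```

With only three distinct cities the table grows from two columns to four. A column with thousands of distinct values would add thousands of mostly-zero columns, which is the dimensionality problem described above.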

Models that perform very well on tabular data, such as gradient boosted trees, random forests, and linear models, do well at mapping “shallow” non-linear relationships, and they do that mapping in an efficient and simple way. So neural networks are not inherently bad for tabular data. Rather, the amount of data a neural network needs to perform well is not typically found in tabular datasets, which explains the underperformance.
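As a rough illustration of how a boosted tree model captures a “shallow” non-linear relationship, the sketch below fits a step-shaped target with gradient boosting over depth-1 trees (stumps), using squared error. It is a toy written for this article, not how any particular library is implemented. The data and every function name are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    # Try every threshold that leaves at least one point on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data: a step-shaped target that no straight line fits well.
xs = list(range(10))
ys = [10, 10, 10, 20, 20, 20, 20, 5, 5, 5]

model = fit_boosted_stumps(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
print("MSE predicting the mean:", baseline_mse)
print("MSE after boosting:", boosted_mse)
```

Each stump is a very weak model on its own, but adding many of them, each correcting what the previous ones missed, is enough to recover the steps in the data without any feature scaling or architecture tuning.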

The time and resources needed to tune neural networks and deep learning models for tabular data are also hard to justify, given how well gradient boosting algorithms work on the same type of data.

What ML algorithms work instead

As alluded to earlier, gradient boosting algorithms have been shown to work best on problems involving tabular data. The best bets for accurate modelling of these problems are LightGBM, XGBoost, and CatBoost. These three can be considered the holy grail of tabular data and should be the first port of call in tabular data problems. Linear models such as Logistic Regression and ElasticNet also perform admirably.

If a deep learning model is still needed for tabular data, there is TabNet. TabNet is a deep neural network for working with structured, tabular data. It has been reported to outperform the previously mentioned decision tree-based models on multiple benchmark datasets and can be used in practice. A simple guide for implementation in solving a problem can be found here.

Understanding what your problem needs and knowing what to prioritize will aid in choosing the right machine learning method to use. Hopefully, this article helps you understand the available options. Thank you.

Some of the content used to gain an understanding of this issue includes:

iyissa
