This article was originally posted by Shahul ES on the Neptune blog.

In this article, I discuss tips and tricks for improving the performance of your text classification model. They are drawn from solutions to some of Kaggle’s top NLP competitions.

Namely, I’ve gone through:

Jigsaw Unintended Bias in Toxicity Classification – $65,000
Toxic Comment Classification Challenge – $35,000
Quora Insincere Questions Classification – $25,000
Google QUEST Q&A Labeling – $25,000
TensorFlow 2.0 Question Answering – $50,000

and found a ton of great ideas.

Without further ado, let’s begin.

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3GB+ is already large for Kaggle kernels and basic laptops), you may find it difficult to load and process with limited resources. Here are some articles and kernels that I have found useful in such situations.

Optimize the memory by reducing the size of some attributes
Use open-source libraries such as Dask to read and manipulate the data; it performs parallel computing and saves memory
Use cudf
Convert data to parquet format
Convert data to feather format
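
As a rough illustration of the first tip above, here is a minimal sketch that shrinks numeric columns of a pandas DataFrame to the smallest dtype that fits. The helper name `downcast_numeric` and the toy columns are made up for this example; they are not from any of the referenced kernels. Note that downcasting floats to 32 bits loses some precision, so check that your features tolerate it.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int16 when all values fit
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 (lossy for high-precision values)
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy data standing in for a real competition file (hypothetical columns).
df = pd.DataFrame({
    "comment_id": np.arange(1000, dtype="int64"),
    "toxicity": np.linspace(0.0, 1.0, 1000),
})
small = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"memory: {before} bytes -> {after} bytes")
```

The same idea applies to string columns with few distinct values, which can often be stored as the pandas `category` dtype.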

Small datasets and external data

But, what can one do if the dataset is small? Let’s see some techniques to tackle this situation.

One way to improve the performance of any machine learning model is to use external data that contains variables that influence the target variable.

Let’s see some of the external datasets.

Use of SQuAD data for Question Answering tasks
Other datasets for QA tasks
WikiText long term dependency language modeling dataset
Stack Exchange data
Prepare a dictionary of commonly misspelled words and corrected words.
Use of helper datasets for cleaning
Pseudo labeling is the process of adding confidently predicted test data to your training data
Use different data sampling methods
Text augmentation by exchanging words with synonyms
Text augmentation by noising in RNN
Text augmentation by translation to other languages and back
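
To make the pseudo labeling item in the list above more concrete, here is a minimal sketch for a binary task. It assumes you already have predicted probabilities for the test set; the function name `select_pseudo_labels` and the 0.95 threshold are illustrative choices, not values taken from any competition solution.

```python
import numpy as np


def select_pseudo_labels(test_probs, threshold=0.95):
    """Pick test rows whose predicted probability is confidently near 0 or 1.

    Returns the indices of the selected rows and their hard 0/1 labels.
    """
    test_probs = np.asarray(test_probs, dtype=float)
    confident = (test_probs >= threshold) | (test_probs <= 1.0 - threshold)
    indices = np.flatnonzero(confident)
    labels = (test_probs[indices] >= 0.5).astype(int)
    return indices, labels


# Hypothetical model outputs on five test comments.
probs = [0.99, 0.50, 0.02, 0.80, 0.97]
idx, pseudo = select_pseudo_labels(probs, threshold=0.95)
print(idx.tolist(), pseudo.tolist())
```

The selected rows, together with their pseudo labels, would then be appended to the training set before retraining. A stricter threshold adds fewer rows but with less label noise.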

Data Exploration and Gaining insights

Data exploration always helps you understand the data better and gain insights from it. Before they start developing machine learning models, top competitors read and do a lot of exploratory data analysis (EDA). This helps with feature engineering and data cleaning.

Twitter data exploration methods
Simple EDA for tweets
EDA for Quora data
EDA in R for Quora data
Complete EDA with stack exchange data
My previous article on EDA for natural language processing

Data Cleaning

Data cleaning is an integral part of any NLP problem. Text data always needs some preprocessing and cleaning before we can represent it in a suitable form.

Use this notebook to clean social media data
Data cleaning for BERT
Use TextBlob to correct misspellings
Cleaning for pre-trained embeddings
Language detection and translation for multilingual tasks
Preprocessing for GloVe part 1 and part 2
Increasing word coverage to get more from pre-trained word embeddings
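
The last item above, increasing word coverage, usually starts with measuring how much of your text the pre-trained embeddings actually cover. The sketch below is one simple way to do that with whitespace tokenization; the function name `embedding_coverage` and the toy vocabulary are assumptions made for illustration.

```python
from collections import Counter


def embedding_coverage(texts, embedding_vocab):
    """Measure how much of the corpus is covered by an embedding vocabulary.

    Returns (share of unique words covered, share of all tokens covered,
    out-of-vocabulary words sorted by frequency).
    """
    counts = Counter(word for text in texts for word in text.split())
    if not counts:
        return 0.0, 0.0, []
    known_tokens = sum(c for w, c in counts.items() if w in embedding_vocab)
    known_words = sum(1 for w in counts if w in embedding_vocab)
    oov = [(w, c) for w, c in counts.items() if w not in embedding_vocab]
    oov.sort(key=lambda item: -item[1])
    return known_words / len(counts), known_tokens / sum(counts.values()), oov


texts = ["why is the sky blue", "why isn't the sky green", "the sky"]
vocab = {"why", "is", "the", "sky", "blue", "green"}  # stand-in for real embedding keys
vocab_cov, text_cov, oov = embedding_coverage(texts, vocab)
print(f"unique words covered: {vocab_cov:.2%}, tokens covered: {text_cov:.2%}")
print("most frequent OOV words:", oov[:10])
```

Inspecting the most frequent out-of-vocabulary words tells you which cleaning step (contractions, punctuation, casing, misspellings) is likely to pay off first.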

Text Representations

Before we feed our text data to a neural network or ML model, the text input needs to be represented in a suitable format. These representations determine the performance of the model to a large extent.

Pretrained GloVe vectors
Pretrained fastText vectors
Pretrained word2vec vectors
My previous article on these 3 embeddings
Combining pre-trained vectors. This can give a better representation of the text and reduce the number of out-of-vocabulary (OOV) words
Paragram embeddings
Universal Sentence Encoder
Use USE to generate sentence-level features
3 methods to combine embeddings
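
As a hedged illustration of combining pre-trained vectors, the sketch below shows two common options, concatenation and averaging, on toy lookup tables. The tables, the function name `combine_embeddings`, and the assumption that all tables share the same dimension are made up for this example; the linked kernels may do it differently.

```python
import numpy as np


def combine_embeddings(word, tables, dim, mode="concat"):
    """Merge the vectors of `word` from several embedding tables.

    concat: stack the vectors end to end; a table that lacks the word
            contributes zeros, so the word is OOV only if every table lacks it.
    mean:   average over the tables that contain the word.
    """
    if mode == "concat":
        parts = [np.asarray(t.get(word, np.zeros(dim)), dtype=float) for t in tables]
        return np.concatenate(parts)
    if mode == "mean":
        found = [np.asarray(t[word], dtype=float) for t in tables if word in t]
        return np.mean(found, axis=0) if found else np.zeros(dim)
    raise ValueError(f"unknown mode: {mode}")


# Toy 2-dimensional tables standing in for two pre-trained embedding files.
table_a = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
table_b = {"cat": [0.5, 0.5], "kitteh": [1.0, 1.0]}

print(combine_embeddings("cat", [table_a, table_b], dim=2, mode="concat"))
print(combine_embeddings("kitteh", [table_a, table_b], dim=2, mode="mean"))
```

Concatenation keeps all information but grows the embedding matrix, while averaging keeps the dimension fixed and requires the tables to have equal width.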

Contextual embedding models

BERT: Bidirectional Encoder Representations from Transformers
GPT
RoBERTa: a Robustly Optimized BERT
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
DistilBERT: a lighter version of BERT
XLNet

Modeling

Model architecture

Choosing the right architecture is important for developing a good machine learning model. Recurrent sequence models such as LSTMs and GRUs perform well on NLP problems and are always worth trying. Stacking 2 layers of LSTM/GRU networks is a common approach.

Loss functions

Choosing a proper loss function for your NN model can really improve its performance by letting it optimize well on the loss surface.

You can try different loss functions or even write a custom loss function that matches your problem. Some of the popular loss functions are:

Binary cross-entropy for binary classification
Categorical cross-entropy for multi-class classification
Focal loss used for unbalanced datasets
Weighted focal loss for multilabel classification
Weighted kappa for multiclass classification
BCE with logits loss, which applies a sigmoid and binary cross-entropy in one step
Custom mimic loss used in the Jigsaw unintended bias classification competition
MTL custom loss used in the Jigsaw unintended bias classification competition
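
To show how focal loss from the list above differs from plain binary cross-entropy, here is a small NumPy sketch of the binary case. It follows the usual formulation, -alpha_t * (1 - p_t)^gamma * log(p_t); the default values gamma=2.0 and alpha=0.25 are common choices rather than values taken from a specific competition solution, and a real model would use the framework's differentiable version.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))


def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: down-weights examples the model already gets right."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)      # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)    # class balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


easy = binary_focal_loss([1], [0.9])   # confident and correct
hard = binary_focal_loss([1], [0.1])   # confident and wrong
print(f"easy example: {easy:.5f}, hard example: {hard:.5f}")
```

With gamma=0 the modulating factor disappears and the loss reduces to class-weighted cross-entropy, which is a handy sanity check when writing your own version.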

Optimizers

Stochastic gradient descent
RMSprop
Adagrad allows the learning rate to adapt based on parameters
Adam for fast and easy convergence
Adam with warmup, which adds a learning-rate warmup phase to Adam
BertAdam for BERT-based models
Rectified Adam for stabilizing training and accelerating convergence
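
The warmup idea mentioned in the optimizer list can be sketched independently of any framework. The function below is one possible schedule, assuming a linear ramp-up followed by a linear decay to zero; the name `warmup_linear_decay` and the step counts are illustrative, and real implementations differ in the exact shape.

```python
def warmup_linear_decay(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


schedule = [warmup_linear_decay(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(101)]
print(schedule[0], schedule[9], schedule[55], schedule[100])
```

In practice you would pass the returned value to the optimizer at every step, for example through the learning rate scheduler hook of your framework.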

Callback methods

Callbacks are always useful to monitor the performance of your model while training and trigger some necessary actions that can enhance the performance of your model.

Model checkpoint for monitoring and saving weights
Learning rate scheduler to change the learning rate based on model performance to help converge easily
Simple custom callbacks using lambda callbacks
Custom Checkpointing
Building your custom callbacks for various use cases
Reduce on plateau to reduce the learning rate when a metric has stopped improving
Early Stopping to stop training when the model stops improving
Snapshot ensembling to get a variety of model checkpoints in one training
Fast geometric ensembling
Stochastic Weight Averaging (SWA)
Dynamic learning rate decay
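
The logic behind two of the callbacks above, reduce on plateau and early stopping, is simple enough to sketch in plain Python. The class `PlateauMonitor` below is a hypothetical, framework-free stand-in; frameworks such as Keras ship their own callbacks for this, and the patience values here are arbitrary.

```python
class PlateauMonitor:
    """Track validation loss, shrink the learning rate on plateaus, and signal when to stop."""

    def __init__(self, lr, patience=2, factor=0.5, stop_patience=4, min_delta=0.0):
        self.lr = lr
        self.patience = patience            # bad epochs before each LR reduction
        self.factor = factor                # multiplicative LR decay
        self.stop_patience = stop_patience  # bad epochs before stopping
        self.min_delta = min_delta          # minimum improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        if self.bad_epochs % self.patience == 0:
            self.lr *= self.factor
        return self.bad_epochs >= self.stop_patience


# Simulated validation losses; the last value is never reached.
losses = [0.9, 0.7, 0.71, 0.72, 0.73, 0.74, 0.5]
monitor = PlateauMonitor(lr=1e-3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if monitor.update(loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best loss {monitor.best}, final lr {monitor.lr}")
```

A model checkpoint callback would typically save the weights whenever `best` improves, so that the best epoch can be restored after stopping.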

Evaluation and cross-validation

Choosing a suitable validation strategy is very important to avoid huge shake-ups or poor model performance on the private test set.

A traditional single 80:20 train-validation split doesn’t work in many cases. Cross-validation usually gives a more reliable estimate of model performance than a single split.

There are different variations of k-fold cross-validation, such as group k-fold, and you should choose the one that fits the structure of your data.

K-fold cross-validation
Stratified KFold cross-validation
Group KFold
Adversarial validation to check if train and test distributions are similar or not
CV analysis of different strategies
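
To illustrate why the choice of k-fold variant matters, here is a hand-rolled sketch of group k-fold, where all rows that share a group (for example, answers to the same question) stay on the same side of the split. This is a simplified greedy version written for illustration; scikit-learn provides a `GroupKFold` class that you would normally use instead. The function name and toy groups are assumptions.

```python
from collections import defaultdict


def group_kfold_indices(groups, n_splits):
    """Yield (train_idx, val_idx) pairs in which no group appears on both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    if len(by_group) < n_splits:
        raise ValueError("need at least as many groups as splits")

    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(by_group, key=lambda g: -len(by_group[g])):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[g])

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train_idx, val_idx


# Hypothetical question ids: several rows can belong to the same question.
groups = ["q1", "q1", "q2", "q3", "q3", "q3", "q4"]
splits = list(group_kfold_indices(groups, n_splits=2))
for train_idx, val_idx in splits:
    print("train:", train_idx, "val:", val_idx)
```

With a plain k-fold split, rows from the same question could land in both train and validation, which tends to make validation scores look better than the leaderboard score.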

Runtime tricks

You can use some tricks to decrease the runtime and, in some cases, improve model performance as well.

Sequence bucketing to save runtime and improve performance
Keep the head and tail of the input when it is longer than 512 tokens
Use the GPU efficiently
Free Keras memory
Save and load models to save runtime and memory
Don’t Save Embedding in RNN Solutions
Load word2vec vectors without key vectors
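
Two of the tricks above, head and tail truncation and sequence bucketing, can be sketched in a few lines of plain Python. The function names, the default `head_len=128`, and the toy sequences are assumptions for illustration; the referenced kernels may choose different split points and batching details.

```python
def head_tail_truncate(tokens, max_len=512, head_len=128):
    """Keep the first head_len tokens and the last (max_len - head_len) tokens."""
    if not 0 < head_len < max_len:
        raise ValueError("head_len must be between 1 and max_len - 1")
    if len(tokens) <= max_len:
        return list(tokens)
    tail_len = max_len - head_len
    return list(tokens[:head_len]) + list(tokens[-tail_len:])


def bucket_batches(sequences, batch_size, pad_value=0):
    """Sort by length, then pad each batch only to its own longest sequence."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        width = max(len(sequences[i]) for i in batch_idx)
        padded = [list(sequences[i]) + [pad_value] * (width - len(sequences[i]))
                  for i in batch_idx]
        yield batch_idx, padded


long_input = list(range(600))  # stand-in for 600 token ids
truncated = head_tail_truncate(long_input, max_len=512, head_len=128)

sequences = [[1, 2, 3, 4, 5], [6], [7, 8], [9, 10, 11, 12]]
batches = list(bucket_batches(sequences, batch_size=2))
for batch_idx, padded in batches:
    print(batch_idx, padded)
```

Bucketing saves time because short sequences are no longer padded to the global maximum length. The returned indices let you restore the original order of the predictions afterwards.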

Model ensembling

In a competitive environment, you won’t get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling/stacking method is very important to get the maximum performance out of your models.

Let’s see some of the popular ensembling techniques used in Kaggle competitions:

Weighted average ensemble
Stacked generalization ensemble
Out of folds predictions
Blending with linear regression
Use Optuna to determine blending weights
Power average ensemble
Power 3.5 blending strategy
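
The two simplest blends in the list above can be written in a few lines of NumPy. This sketch assumes each model outputs a probability per sample and, for the power average, that predictions are raised to a power before being averaged, which is my reading of the technique; the weights and toy predictions are made up.

```python
import numpy as np


def weighted_average(preds, weights):
    """Blend predictions of shape (n_models, n_samples) with per-model weights."""
    preds = np.asarray(preds, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ preds / weights.sum()


def power_average(preds, power=3.5):
    """Average predictions after raising them to a power, which favors confident scores."""
    preds = np.asarray(preds, dtype=float)
    return np.mean(preds ** power, axis=0)


# Two hypothetical models scoring the same two samples.
preds = [[0.9, 0.2],
         [0.8, 0.4]]
blend = weighted_average(preds, weights=[3, 1])
powered = power_average(preds, power=2)
print(blend, powered)
```

The power average no longer yields calibrated probabilities, so it is better suited to rank-based metrics such as AUC. Blending weights are usually tuned on out-of-fold predictions rather than on the public leaderboard.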

Final thoughts

In this article, you saw many popular and effective ways to improve the performance of your NLP classification model. Hopefully, you will find them useful in your projects.

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

Text Classification: All Tips and Tricks from 5 Kaggle Competitions

Jakub Czakon