cflearn - a minimal Automatic Machine Learning (AutoML) solution for tabular datasets based on PyTorch
carefree0910
Posted on July 6, 2020
Introduction
I've been working on tabular datasets for the past few years, and managed to build a rough AutoML system that beat the 'auto sklearn' solution to some extent. After I met PyTorch, I was deeply attracted by its simplicity and power, but I failed to find a satisfying solution for tabular datasets which was 'carefree' enough. So I decided to take advantage of my knowledge and build one myself, and here comes carefree-learn, which aims to provide out-of-the-box tools to train neural networks on tabular datasets with PyTorch.
carefree-learn provides high-level APIs for PyTorch to simplify training on tabular datasets. It features:
- A scikit-learn-like interface with much more 'carefree' usage. In fact, carefree-learn provides an end-to-end pipeline on tabular datasets, which AUTOMATICALLY deals with:
  - Detection of redundant feature columns which can be excluded (all SAME, all DIFFERENT, etc.).
  - Detection of feature column types (whether a feature column is a string column / numerical column / categorical column).
  - Imputation of missing values.
  - Encoding of string columns and categorical columns (Embedding or One Hot Encoding).
  - Pre-processing of numerical columns (Normalize, Min Max, etc.).
  - And much more...
- Can either fit / predict directly from some numpy arrays, or fit / predict indirectly from some files located on your machine.
- Easy-to-use saving and loading. By default, everything will be wrapped into a zip file!
- Distributed Training, which means hyper-parameter tuning can be very efficient in carefree-learn.
- Supports many convenient functionalities in deep learning, including:
  - Early stopping.
  - Model persistence.
  - Learning rate schedulers.
  - And more...
- Some 'translated' machine learning algorithms, including:
  - Trainable (Neural) Naive Bayes
  - Trainable (Neural) Decision Tree
- Some brand new techniques which may boost vanilla Neural Network (NN) performance on tabular datasets, including:
  - TreeDNN with Dynamic Soft Pruning, which makes NN less sensitive to hyper-parameters.
  - Deep Distribution Regression (DDR), which is capable of modeling the entire conditional distribution with one single NN model.
- Highly customizable for developers. We have already wrapped (almost) every single functionality / process into a single module (a Python class), and they can be replaced or enhanced either directly from the source code or from local code with the help of some pre-defined registration functions provided by carefree-learn.
- Full utilization of the WIP ecosystem cf*, such as:
  - carefree-toolkit: provides a lot of utility classes & functions which are 'stand alone' and can be leveraged in your own projects.
  - carefree-data: a lightweight tool to read -> convert -> process ANY tabular datasets. It also utilizes cython to accelerate critical procedures.
To try carefree-learn, you can install it with pip install carefree-learn.
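To give a feeling for what the automatic column analysis in the feature list involves, here is a minimal sketch in plain Python. It is not carefree-learn's implementation: the function names, the rules and the `max_categories` threshold are assumptions made up for illustration only.

```python
from typing import Sequence


def is_number(value: str) -> bool:
    """Return True if the raw string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def describe_column(values: Sequence[str], max_categories: int = 10) -> str:
    """Classify one raw column. The rules here are illustrative assumptions."""
    unique = set(values)
    if len(unique) == 1:
        # Every row holds the same value, so the column carries no information.
        return "redundant (all same)"
    if not all(is_number(v) for v in values):
        # Non-numeric column: a distinct value per row looks like an ID.
        if len(unique) == len(values):
            return "redundant (all different)"
        return "string"
    if len(unique) <= max_categories and all(float(v).is_integer() for v in values):
        # Few distinct integer values: treat them as category codes.
        return "categorical"
    return "numerical"


# Columns: id, constant, colour, measurement, small integer code.
rows = [
    ["u1", "1", "red", "0.5", "3"],
    ["u2", "1", "blue", "1.25", "1"],
    ["u3", "1", "red", "2.75", "2"],
    ["u4", "1", "blue", "3.1", "3"],
]
columns = list(zip(*rows))
report = [describe_column(column) for column in columns]
print(report)
```

A real detector needs more care (missing values, float-coded categories, very small datasets), which is exactly the kind of work the library tries to take off your hands.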
Details
I structured the carefree-learn backend in three modules: Model, Pipeline and Wrapper:
- Model: In carefree-learn, a Model should implement the core algorithms.
  - It assumes that the input data in the training process is already 'batched, processed, nice and clean', but not yet 'encoded'.
    - Fortunately, carefree-learn pre-defines some useful methods which can encode categorical columns easily.
  - It does not care about how to train a model; it only focuses on how to make predictions with the input, and how to calculate losses with them.
- Pipeline: In carefree-learn, a Pipeline should implement the high-level parts, as listed below:
  - It assumes that the input data is already 'processed, nice and clean', but it should take care of getting input data into batches, because in real applications batching is essential for performance.
  - It should take care of the training loop, which includes updating parameters with an optimizer, reporting metrics, checkpointing, early stopping, logging, etc.
- Wrapper: In carefree-learn, a Wrapper should implement the preparation and API part.
  - It should not make any assumptions about the input data: it could already be 'nice and clean', but it could also be 'dirty and messy'. Therefore, it needs to transform the original data into 'nice and clean' data and then feed it to the Pipeline. The data transformations include:
    - Imputation of missing values.
    - Transforming string columns into categorical columns.
    - Processing numerical columns.
    - Processing label column (if needed).
  - It should implement some algorithm-agnostic methods (e.g. predict, save, load, etc.).
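The division of labour between the three modules can be sketched in a few dozen lines of plain Python. This is only an illustration of the idea, not carefree-learn's code: the class bodies, the one-feature linear model, the hand-written gradient (PyTorch's autograd would normally do this) and all parameter names are assumptions.

```python
from typing import List, Optional, Sequence, Tuple, Union

Batch = List[Tuple[float, float]]  # (feature, target) pairs
Raw = Optional[Union[float, str]]  # messy input: numbers, strings or missing


class LinearModel:
    """Model: core algorithm only (predictions and loss)."""

    def __init__(self) -> None:
        self.w, self.b = 0.0, 0.0

    def predict(self, xs: Sequence[float]) -> List[float]:
        return [self.w * x + self.b for x in xs]

    def loss(self, batch: Batch) -> float:
        preds = self.predict([x for x, _ in batch])
        return sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)


class Pipeline:
    """Pipeline: batching, parameter updates and early stopping."""

    def __init__(self, model: LinearModel, lr: float = 0.1,
                 batch_size: int = 2, patience: int = 5) -> None:
        self.model, self.lr = model, lr
        self.batch_size, self.patience = batch_size, patience

    def _step(self, batch: Batch) -> None:
        # Gradient of the mean squared error with respect to w and b.
        preds = self.model.predict([x for x, _ in batch])
        errors = [p - y for p, (_, y) in zip(preds, batch)]
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = 2 * sum(errors) / len(batch)
        self.model.w -= self.lr * grad_w
        self.model.b -= self.lr * grad_b

    def fit(self, data: Batch, max_epochs: int = 2000) -> int:
        best, bad_epochs, epoch = float("inf"), 0, 0
        for epoch in range(1, max_epochs + 1):
            for i in range(0, len(data), self.batch_size):
                self._step(data[i:i + self.batch_size])
            current = self.model.loss(data)
            if current < best - 1e-12:
                best, bad_epochs = current, 0
            else:
                bad_epochs += 1
                if bad_epochs >= self.patience:  # early stopping
                    break
        return epoch


class Wrapper:
    """Wrapper: accepts messy input, cleans it, exposes fit / predict."""

    def __init__(self) -> None:
        self.model = LinearModel()
        self.pipeline = Pipeline(self.model)
        self.fill = 0.0

    def _clean(self, xs: Sequence[Raw]) -> List[float]:
        # Parse strings and replace missing values with the training mean.
        return [self.fill if x is None else float(x) for x in xs]

    def fit(self, xs: Sequence[Raw], ys: Sequence[float]) -> "Wrapper":
        known = [float(x) for x in xs if x is not None]
        self.fill = sum(known) / len(known)
        self.pipeline.fit(list(zip(self._clean(xs), ys)))
        return self

    def predict(self, xs: Sequence[Raw]) -> List[float]:
        return self.model.predict(self._clean(xs))


# 'Dirty' input: a numeric string and a missing value; targets follow y = 2x + 1.
m = Wrapper().fit([0.0, "0.5", None, 1.0], [1.0, 2.0, 2.0, 3.0])
prediction = m.predict(["1.0", None])
print([round(p, 3) for p in prediction])
```

Note how only the Wrapper ever sees strings or missing values, only the Pipeline knows about batches and stopping criteria, and the Model stays a small, easily replaceable piece.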
It is worth mentioning that carefree-learn uses registrations to manage the code structure.
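For readers unfamiliar with the registration pattern, here is a generic sketch of how it usually works in Python. The names `register_model`, `model_registry` and `make_model` are hypothetical and are not carefree-learn's actual API; only the model names 'fcnn' and TreeDNN come from this post.

```python
from typing import Callable, Dict

# Maps a string name to the class registered under it.
model_registry: Dict[str, type] = {}


def register_model(name: str) -> Callable[[type], type]:
    """Class decorator that stores a class under `name` (hypothetical helper)."""

    def decorator(cls: type) -> type:
        if name in model_registry:
            raise ValueError(f"'{name}' is already registered")
        model_registry[name] = cls
        return cls

    return decorator


@register_model("fcnn")
class FCNN:
    def describe(self) -> str:
        return "fully-connected network"


@register_model("tree_dnn")
class TreeDNN:
    def describe(self) -> str:
        return "tree-structured network"


def make_model(name: str) -> object:
    """Build a registered model from its string name."""
    return model_registry[name]()


print(make_model("fcnn").describe())
```

The benefit is that user code can add or replace a component simply by registering a class under a name, without touching the library's source.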
Although the demand for working with tabular datasets is not that large, I'll be very happy if carefree-learn could help someone who needs it.
Examples
For detailed information, please visit the documentation.
Quick Start
import cflearn
from cfdata.tabular import TabularDataset
x, y = TabularDataset.iris().xy
m = cflearn.make().fit(x, y)
# Make label predictions
m.predict(x)
# Make probability predictions
m.predict_prob(x)
# Estimate performance
cflearn.estimate(x, y, wrappers=m)
""" Then you will see something like this:
================================================================================================================================
| metrics | acc | auc |
--------------------------------------------------------------------------------------------------------------------------------
| | mean | std | score | mean | std | score |
--------------------------------------------------------------------------------------------------------------------------------
| fcnn | 0.946667 | 0.000000 | 0.946667 | 0.993200 | 0.000000 | 0.993200 |
================================================================================================================================
"""
# `carefree-learn` models can be saved easily, into a zip file!
# For example, a `cflearn^_^fcnn.zip` file will be created with this line of code:
cflearn.save(m)
# And loading `carefree-learn` models is easy too!
m = cflearn.load()
# You will see exactly the same result as above!
cflearn.estimate(x, y, wrappers=m)
# `carefree-learn` can also easily fit / predict / estimate directly on files!
# `delim` refers to 'delimiter', and `skip_first` refers to skipping first line or not.
# * Please refer to https://github.com/carefree0910/carefree-data/blob/dev/README.md if you're interested in more details.
""" Suppose we have an 'xor.txt' file with following contents:
0,0,0
0,1,1
1,0,1
1,1,0
"""
m = cflearn.make(delim=",", skip_first=False).fit("xor.txt", x_cv="xor.txt")
cflearn.estimate("xor.txt", wrappers=m)
""" Then you will see something like this:
================================================================================================================================
| metrics | acc | auc |
--------------------------------------------------------------------------------------------------------------------------------
| | mean | std | score | mean | std | score |
--------------------------------------------------------------------------------------------------------------------------------
| fcnn | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 |
================================================================================================================================
"""
# When we fit from files, we can predict on either files or lists:
print(m.predict([[0, 0]])) # [[0]]
print(m.predict([[0, 1]])) # [[1]]
print(m.predict("xor.txt")) # [ [0] [1] [1] [0] ]
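To make the roles of `delim` and `skip_first` concrete, here is a small stand-alone sketch using only the standard library. It is not how carefree-data reads files; `read_tabular` is a hypothetical helper, and treating the last column as the label is an assumption based on the xor example above.

```python
import csv
import io
from typing import List, Tuple


def read_tabular(
    text: str, delim: str = ",", skip_first: bool = False
) -> Tuple[List[List[str]], List[str]]:
    """Split delimited text into feature rows and a label column.

    Assumption: the last column holds the label, as in the xor example.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delim) if row]
    if skip_first:
        rows = rows[1:]  # drop the header line
    features = [row[:-1] for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels


xor_text = "0,0,0\n0,1,1\n1,0,1\n1,1,0\n"
x_raw, y_raw = read_tabular(xor_text, delim=",", skip_first=False)
print(x_raw)
print(y_raw)
```

With a header line such as `a,b,label` at the top of the file, you would pass `skip_first=True` instead.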
Distributed
In carefree-learn, Distributed Training doesn't mean training your model on multiple GPUs or multiple machines, because carefree-learn focuses on tabular datasets (or, structured datasets) which are often not as large as unstructured datasets. Instead, Distributed Training in carefree-learn means training multiple models at the same time. This is important because:
- Deep Learning models suffer from randomness, so we need to train multiple models with the same algorithm and calculate the mean / std of the performances to estimate the algorithm's capacity and stability.
- Ensembling these models (which are trained with the same algorithm) can boost the algorithm's performance without making any changes to the algorithm itself.
- Parameter searching will be easier & faster.
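The first two points can be illustrated without any deep learning at all. The toy sketch below 'trains' three threshold classifiers concurrently, reports the mean / std of their accuracies and combines them. Everything in it is an assumption for illustration: threads stand in for worker processes, a bootstrap sample stands in for random initialisation, and the ensemble is a plain majority vote rather than the stacking that carefree-learn applies.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Toy 1-D dataset: the label is 1 when the feature exceeds 0.5.
DATA: List[Tuple[float, int]] = [(i / 20, int(i / 20 > 0.5)) for i in range(21)]


def train_one(seed: int) -> float:
    """'Train' a threshold classifier; the seed mimics training randomness."""
    rng = random.Random(seed)
    sample = [rng.choice(DATA) for _ in range(10)]  # bootstrap sample
    negatives = [x for x, y in sample if y == 0]
    positives = [x for x, y in sample if y == 1]
    if not negatives or not positives:
        return 0.5  # degenerate sample: fall back to a default threshold
    return (max(negatives) + min(positives)) / 2


def accuracy(predict: Callable[[float], int]) -> float:
    return sum(predict(x) == y for x, y in DATA) / len(DATA)


# Train three models at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    thresholds = list(pool.map(train_one, [0, 1, 2]))

scores = [accuracy(lambda x, t=t: int(x > t)) for t in thresholds]
mean, std = statistics.mean(scores), statistics.pstdev(scores)


def ensemble_predict(x: float) -> int:
    """Majority vote over the individual models."""
    votes = sum(int(x > t) for t in thresholds)
    return int(votes * 2 > len(thresholds))


print(f"single models: {mean:.4f} +- {std:.4f}")
print(f"ensemble:      {accuracy(ensemble_predict):.4f}")
```

With carefree-learn itself, the same workflow looks like this: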
import cflearn
from cfdata.tabular import TabularDataset
# It is necessary to wrap code under '__main__' on the WINDOWS platform when running distributed code
if __name__ == '__main__':
    x, y = TabularDataset.iris().xy
    # Notice that 3 fcnn were trained simultaneously with this line of code
    _, patterns = cflearn.repeat_with(x, y, num_repeat=3, num_parallel=3)
    # And it is fairly straightforward to apply stacking ensemble
    ensemble = cflearn.ensemble(patterns)
    patterns_dict = {"fcnn_3": patterns, "fcnn_3_ensemble": ensemble}
    cflearn.estimate(x, y, metrics=["acc", "auc"], other_patterns=patterns_dict)
""" Then you will see something like this:
================================================================================================================================
| metrics | acc | auc |
--------------------------------------------------------------------------------------------------------------------------------
| | mean | std | score | mean | std | score |
--------------------------------------------------------------------------------------------------------------------------------
| fcnn_3 | 0.937778 | 0.017498 | 0.920280 | -- 0.993911 -- | 0.000274 | 0.993637 |
--------------------------------------------------------------------------------------------------------------------------------
| fcnn_3_ensemble | -- 0.953333 -- | -- 0.000000 -- | -- 0.953333 -- | 0.993867 | -- 0.000000 -- | -- 0.993867 -- |
================================================================================================================================
"""
You might notice that the best result in each column is 'highlighted' with a pair of '--'.
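Judging from the numbers in the table, the score column appears to be mean minus std (for example 0.937778 - 0.017498 = 0.920280), so unstable results are penalised. The sketch below reproduces the acc columns under that assumption; the scoring rule and the 'lower std is better' convention are inferred from the output, not taken from carefree-learn's source.

```python
from typing import Dict

# acc statistics copied from the table above.
results: Dict[str, Dict[str, float]] = {
    "fcnn_3": {"mean": 0.937778, "std": 0.017498},
    "fcnn_3_ensemble": {"mean": 0.953333, "std": 0.000000},
}

for stats in results.values():
    # Assumed scoring rule, inferred from the numbers: score = mean - std.
    stats["score"] = round(stats["mean"] - stats["std"], 6)


def best_of(column: str) -> str:
    """Lower is better for std, higher is better for mean and score."""
    pick = min if column == "std" else max
    return pick(results, key=lambda name: results[name][column])


for name, stats in results.items():
    cells = []
    for column in ("mean", "std", "score"):
        text = f"{stats[column]:.6f}"
        # Mark the best entry of each column with a pair of '--'.
        cells.append(f"-- {text} --" if best_of(column) == name else text)
    print(f"{name:>16} | " + " | ".join(cells))
```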
Hyper Parameter Optimization (HPO)
import cflearn
from cfdata.tabular import *
if __name__ == '__main__':
    x, y = TabularDataset.iris().xy
    # Bayesian Optimization (BO) will be used by default
    hpo = cflearn.tune_with(
        x, y,
        task_type=TaskTypes.CLASSIFICATION,
        num_repeat=2, num_parallel=0, num_search=10
    )
    # We can further train our model with the best hyper-parameters we've obtained:
    m = cflearn.make(**hpo.best_param).fit(x, y)
    cflearn.estimate(x, y, wrappers=m)
""" Then you will see something like this:
~~~ [ info ] Results
================================================================================================================================
| metrics | acc | auc |
--------------------------------------------------------------------------------------------------------------------------------
| | mean | std | score | mean | std | score |
--------------------------------------------------------------------------------------------------------------------------------
| 0659e09f | 0.943333 | 0.016667 | 0.926667 | 0.995500 | 0.001967 | 0.993533 |
--------------------------------------------------------------------------------------------------------------------------------
| 08a0a030 | 0.796667 | 0.130000 | 0.666667 | 0.969333 | 0.012000 | 0.957333 |
--------------------------------------------------------------------------------------------------------------------------------
| 1962285c | 0.950000 | 0.003333 | 0.946667 | 0.997467 | 0.000533 | 0.996933 |
--------------------------------------------------------------------------------------------------------------------------------
| 1eb7f2a0 | 0.933333 | 0.020000 | 0.913333 | 0.994833 | 0.003033 | 0.991800 |
--------------------------------------------------------------------------------------------------------------------------------
| 4ed5bb3b | 0.973333 | 0.013333 | 0.960000 | 0.998733 | 0.000467 | 0.998267 |
--------------------------------------------------------------------------------------------------------------------------------
| 5a652f3c | 0.953333 | -- 0.000000 -- | 0.953333 | 0.997400 | 0.000133 | 0.997267 |
--------------------------------------------------------------------------------------------------------------------------------
| 82c35e77 | 0.940000 | 0.020000 | 0.920000 | 0.995467 | 0.002133 | 0.993333 |
--------------------------------------------------------------------------------------------------------------------------------
| a9ef52d0 | -- 0.986667 -- | 0.006667 | -- 0.980000 -- | -- 0.999200 -- | -- 0.000000 -- | -- 0.999200 -- |
--------------------------------------------------------------------------------------------------------------------------------
| ba2e179a | 0.946667 | 0.026667 | 0.920000 | 0.995633 | 0.001900 | 0.993733 |
--------------------------------------------------------------------------------------------------------------------------------
| ec8c0837 | 0.973333 | -- 0.000000 -- | 0.973333 | 0.998867 | 0.000067 | 0.998800 |
================================================================================================================================
~~~ [ info ] Best Parameters
----------------------------------------------------------------------------------------------------
acc (a9ef52d0) (0.986667 ± 0.006667)
----------------------------------------------------------------------------------------------------
{'optimizer': 'rmsprop', 'optimizer_config': {'lr': 0.005810863965757382}}
----------------------------------------------------------------------------------------------------
auc (a9ef52d0) (0.999200 ± 0.000000)
----------------------------------------------------------------------------------------------------
{'optimizer': 'rmsprop', 'optimizer_config': {'lr': 0.005810863965757382}}
----------------------------------------------------------------------------------------------------
best (a9ef52d0)
----------------------------------------------------------------------------------------------------
{'optimizer': 'rmsprop', 'optimizer_config': {'lr': 0.005810863965757382}}
----------------------------------------------------------------------------------------------------
~~ [ info ] Results
================================================================================================================================
| metrics | acc | auc |
--------------------------------------------------------------------------------------------------------------------------------
| | mean | std | score | mean | std | score |
--------------------------------------------------------------------------------------------------------------------------------
| fcnn | 0.980000 | 0.000000 | 0.980000 | 0.998867 | 0.000000 | 0.998867 |
================================================================================================================================
"""
You might notice that:
- The final results obtained by HPO are even better than the stacking ensemble results mentioned above.
- We search for optimizer and lr by default. In fact, we can manually pass params into cflearn.tune_with. If not, then carefree-learn will execute the following code:
from cftool.ml.param_utils import *
params = {
    "optimizer": String(Choice(values=["sgd", "rmsprop", "adam"])),
    "optimizer_config": {
        "lr": Float(Exponential(1e-5, 0.1))
    }
}
It is also worth mentioning that we can pass file datasets into cflearn.tune_with as well. See tests/usages/test_basic.py for more details.
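To show what drawing candidates from such a search space might look like, here is a stand-alone sketch using only the standard library. Two things are assumptions: that Exponential(1e-5, 0.1) means sampling the learning rate uniformly on a log scale between the two bounds, and the helper names themselves. The sketch also uses plain random search, which is simpler than the Bayesian Optimization carefree-learn uses by default.

```python
import math
import random
from typing import Any, Dict, List


def sample_choice(rng: random.Random, values: List[str]) -> str:
    """Pick one of the listed values with equal probability."""
    return rng.choice(values)


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_params(rng: random.Random) -> Dict[str, Any]:
    """Draw one candidate with the same shape as the default search space."""
    return {
        "optimizer": sample_choice(rng, ["sgd", "rmsprop", "adam"]),
        "optimizer_config": {"lr": sample_log_uniform(rng, 1e-5, 0.1)},
    }


rng = random.Random(0)
candidates = [sample_params(rng) for _ in range(10)]
for candidate in candidates[:3]:
    print(candidate)
```

Sampling on a log scale matters for learning rates: a uniform draw on [1e-5, 0.1] would almost never try values below 1e-3.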
What's next
The next step is to do some benchmark testing and optimize carefree-learn's performance. I'm pretty sure it can reach a satisfying level with some tuned default settings.
And, as always, bug fixing XD