Elevate Your Python Skills: Machine Learning Packages That Transformed My Journey as an ML Engineer

Discover packages and tools that have been pivotal in my coding journey as an ML engineer. They not only enhance efficiency but also introduce innovative solutions, reshaping how I tackle problems using Python.

In this series, we will explore five or fewer packages from each of several categories: ML, Data Engineering Pipelines, Frameworks & DL, Visualization, API & Deployment, Developer Tools, and Other Packages I Adore.

This installment is centered on Machine Learning packages. Each package comes with a succinct description, its main advantages, and a sample use case to highlight its code design. Where relevant, I'll also provide alternatives or complementary packages, giving you a holistic perspective on the tools available.

Machine Learning

1. scikit-learn

Description: A comprehensive library for machine learning algorithms.
Advantage: User-friendly with a consistent API and thorough documentation.
When to use: It's the package of choice for standard machine learning tasks, including classification, regression, and clustering.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])


text_feature = 'SelfDescription'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('txt', text_transformer, text_feature),
        ('cat', categorical_transformer, categorical_features),
        ])

predictor = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', LogisticRegression(solver='lbfgs'))])


# train
predictor.fit(X_train, y_train)

#evaluate and predict
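
The evaluate-and-predict step left as a comment above could look roughly like this. It is a minimal sketch, not the only way to do it: the toy DataFrame, the made-up target, and the shortened pipeline (imputers left out) are assumptions for illustration, and only the column names come from the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; only the column names come from the example above.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Salary": rng.normal(50_000, 15_000, n),
    "SelfDescription": rng.choice(
        ["loves data and python", "enjoys hiking and cooking"], n),
    "Age": rng.choice(["18-30", "31-50", "51+"], n),
    "Country": rng.choice(["DK", "SE", "NO"], n),
})
# Made-up target: 1 when the description mentions python.
y = X["SelfDescription"].str.contains("python").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Shortened version of the pipeline above (imputers left out for brevity).
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Salary"]),
    ("txt", TfidfVectorizer(stop_words="english"), "SelfDescription"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Age", "Country"]),
])
predictor = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs")),
])
predictor.fit(X_train, y_train)

# Evaluate on the held-out rows.
y_pred = predictor.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")

# Class probabilities for a few unseen rows (columns follow predictor.classes_).
probabilities = predictor.predict_proba(X_test.head(3))
print(probabilities.round(2))
```

Because preprocessing lives inside the pipeline, the same raw DataFrame columns go into fit, predict, and predict_proba, and nothing is fitted on the test rows.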

Complementary: river and skorch

river complements scikit-learn by offering tools specifically designed for online learning, ideal for scenarios where data is streaming in real-time. While scikit-learn is optimized for batch learning, river provides a solution for incrementally updating models with new data points as they arrive.

skorch seamlessly integrates the deep learning capabilities of PyTorch into the scikit-learn ecosystem. It allows developers to use PyTorch-based neural networks as if they were scikit-learn estimators, making it easier to incorporate deep learning models into workflows that already leverage scikit-learn tools, such as grid search and pipelines.

2. PyMC

Description: Specialized in Bayesian modeling and probabilistic machine learning.
Advantage: Equips you with the tools to define probabilistic models in code.
When to use: Ideal for white-box ML using probabilistic programming.

import pymc as pm


with pm.Model() as model:

    # Priors (`sigma` is the keyword used by current PyMC releases)
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10, shape=X_train.shape[1])

    # Linear combination (X_train is assumed to be a 2-D NumPy array)
    mu = alpha + pm.math.dot(X_train, beta)

    # Logistic link function
    p = pm.invlogit(mu)

    # Likelihood
    y_obs = pm.Bernoulli('y_obs', p=p, observed=y_train)

    # Sample/train
    trace = pm.sample(3000)

# Evaluation with posterior predictive checks
# Prediction by drawing samples from the posterior predictive distribution
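
To show what predicting from the posterior predictive distribution means for this logistic model, here is a NumPy-only sketch. It does not call PyMC: the alpha and beta draws are simulated stand-ins for the draws you would pull out of the trace, and X_new is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_features = 1000, 3

# Stand-ins for posterior draws of alpha and beta (simulated, NOT from a real fit).
alpha_draws = rng.normal(0.2, 0.1, size=n_draws)
beta_draws = rng.normal([1.0, -0.5, 0.0], 0.1, size=(n_draws, n_features))

# Two hypothetical new observations to predict.
X_new = np.array([[1.0, 0.0, 2.0],
                  [-1.0, 1.0, 0.5]])

# One success probability per posterior draw and per new row.
logits = alpha_draws[:, None] + beta_draws @ X_new.T   # shape (n_draws, n_new)
p = 1.0 / (1.0 + np.exp(-logits))

# Posterior predictive draws: simulate an outcome for every draw.
y_rep = rng.binomial(1, p)

# Summaries: predicted probability and a 95% interval for p.
p_mean = y_rep.mean(axis=0)
lower, upper = np.percentile(p, [2.5, 97.5], axis=0)
print(p_mean.round(2), lower.round(2), upper.round(2))
```

The spread between lower and upper is what a point-estimate classifier does not give you: a statement of how uncertain each predicted probability is.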

Alternatives: stan and edward

While Stan offers its own modeling language and provides MCMC sampling, and Edward integrates with TensorFlow/Keras to offer Variational Inference, PyMC stands out for its ease of use within the Python environment, user-friendly API, and active community.

3. darts

Description: My preferred package for time series forecasting and anomaly detection.
Advantage: It offers comprehensive tools for time series analysis and a unified interface for various forecasting models.
When to use: Essential when you work with time series data and need forecasting, anomaly detection, or other analyses using classical, deep learning, Prophet, and other models.

from darts.models import RNNModel

model_config = {
    "model_name": "Sales_LSTM",
    "hidden_dim": 20,
    "dropout": 0,
    "batch_size": 16,
    "n_epochs": 200,
    "random_state": 42,
    "training_length": 20,
    "input_chunk_length": 14,
    "force_reset": True,
    "save_checkpoints": True,
}

model = RNNModel(
    model="LSTM",
    optimizer_kwargs={"lr": 1e-3},
    **model_config
)

# train (TimeSeriesData is assumed to be a darts TimeSeries)
model.fit(TimeSeriesData)

# forecast the next 3 time steps
forecast = model.predict(3)

Alternatives: Merlion and kats

While Merlion and Kats offer their own sets of capabilities in time series analysis, Darts shines as a comprehensive choice for time series forecasting and processing, catering to a wide range of requirements with its extensive toolkit. Both Merlion and Kats can serve as potential alternatives, but Darts’ holistic offerings make it a standout choice for me.

4. FLAML

Description: A swift and efficient automated machine learning library.
Advantage: Achieve optimal ML results with minimal coding and time investment.
When to use: Perfect when you desire swift outcomes without the intricacies of model fine-tuning.

from flaml import AutoML

automl = AutoML()

automl_config = {
    "time_budget": 120,  # time in seconds
    "metric": 'accuracy',
    "task": 'classification',
    "estimator_list": ['lgbm', 'xgboost', 'catboost', 'extra_tree'],
    "seed": 42,
    "log_file_name": "churn.log",
    "log_training_metric": True,
}


# train
automl.fit(X_train, y_train, **automl_config)

# evaluate and predict

Complementary: AutoGluon and mljar-supervised

FLAML specializes in automating machine learning tasks for tabular data. AutoGluon covers a wider spectrum, including text, images, and multi-modal data, making it a more versatile toolkit. Meanwhile, mljar-supervised adds model explanations, ensembling, and visualization on top of automated training, presenting itself as a viable alternative with comparable capabilities.

5. CVXPY

Description: The go-to library for convex optimization.
Advantage: It provides an intuitive method to define and solve convex optimisation problems.
When to use: Essential for solving optimisation challenges across domains like finance, control, signal processing, and more.

'''
Task: Operations Research
PYIKEA wants to maximize its profit from selling armchairs, wingchairs, and Lovet-tables. The profit is 150 DKK per armchair, 100 DKK per wingchair, and 250 DKK per Lovet-table.
It takes:
    15 planks of wood and 5 hours of labour to make one armchair
    12 planks of wood and 2 hours of labour to make one wingchair
    18 planks of wood and 8 hours of labour to make one Lovet-table
The store needs at least 4 of each chair and at least 1 table. There are 450 planks of wood in storage and the labour budget is 120 hours.
What combination of chairs and table(s) yields the maximum profit?
'''

import cvxpy as cp


# Variables
A = cp.Variable(integer=True, name="Armchair")
W = cp.Variable(integer=True, name="Wingchair") 
L = cp.Variable(integer=True, name="Lovet-table")

# Objective
profit = 150*A + 100*W + 250*L
objective = cp.Maximize(profit)

# Constraints
constraints = [15*A + 12*W + 18*L <= 450, 
                5*A +  2*W +  8*L <= 120,
                  A >= 4,
                  W >= 4,
                  L >= 1
              ]

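# Problem to solve (integer variables need a mixed-integer-capable solver installed)
problem = cp.Problem(objective, constraints=constraints)
result = problem.solve()  # optimal profit; quantities are in A.value, W.value, L.value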
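
This problem is small enough to cross-check the solver by brute force. The sketch below uses plain Python rather than CVXPY and enumerates every integer combination that satisfies the constraints. It only works because the search space here is tiny, so treat it as a sanity check and not as a replacement for the solver.

```python
from itertools import product

WOOD, LABOUR = 450, 120

best = None
# Upper bounds follow from each product's scarcest resource alone.
for a, w, l in product(range(4, WOOD // 15 + 1),
                       range(4, WOOD // 12 + 1),
                       range(1, LABOUR // 8 + 1)):
    wood_ok = 15 * a + 12 * w + 18 * l <= WOOD
    labour_ok = 5 * a + 2 * w + 8 * l <= LABOUR
    if wood_ok and labour_ok:
        profit = 150 * a + 100 * w + 250 * l
        if best is None or profit > best[0]:
            best = (profit, a, w, l)

print(best)  # → (4550, 4, 22, 7)
```

The optimum is 4 armchairs, 22 wingchairs, and 7 Lovet-tables for a profit of 4550 DKK, which uses exactly 450 planks and 120 labour hours. The value returned by problem.solve() should match.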

Alternative: pyomo

Pyomo and CVXPY are both stellar choices for optimisation in Python, each with its own set of strengths. While CVXPY excels with its intuitive approach to convex problems, Pyomo offers versatility across linear, nonlinear, and mixed-integer problems. Essentially, picking between the two boils down to my mood on the day!

We navigated through my favourite Python packages that redefine machine learning workflows. I picked scikit-learn for general ML tasks, complemented by tools like river and skorch. PyMC guided us through the intricacies of Bayesian modeling with alternatives like Stan and Edward. darts emerged as a comprehensive choice for time series analysis, though Merlion and kats offer their unique capabilities.

For rapid results, FLAML streamlines automated ML, with AutoGluon and mljar-supervised expanding on similar terrains. Lastly, CVXPY showcased its prowess in optimization, with pyomo as another contender. These packages, collectively, illuminate the expansive and evolving landscape of Python-based machine learning that can elevate your skills, as they did mine.

Stay tuned for the next segment on Data Engineering Pipelines, featuring Dagster, Apache Airflow, Prefect and Argo.

Until then, stay curious and keep on coding.
