🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆🦆
This is a Rumale machine learning cheat sheet based on DataCamp's Scikit-learn cheat sheet.

yoshoku / rumale

Rumale is a machine learning library in Ruby

Rumale

Rumale (Ruby machine learning) is a machine learning library in Ruby. Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python. Rumale supports Support Vector Machine, Logistic Regression, Ridge, Lasso, Multi-layer Perceptron, Naive Bayes, Decision Tree, Gradient Tree Boosting, Random Forest, K-Means, Gaussian Mixture Model, DBSCAN, Spectral Clustering, Multidimensional Scaling, t-SNE, Fisher Discriminant Analysis, Neighbourhood Component Analysis, Principal Component Analysis, Non-negative Matrix Factorization, and many other algorithms.

Installation

Add this line to your application's Gemfile:

gem 'rumale'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rumale

Documentation

Rumale API Documentation

Usage

Example 1. Pendigits dataset classification

Rumale provides a function for loading dataset files in libsvm format. We start by downloading the pendigits dataset from the LIBSVM Data web site.

$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits
$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits.t

The following code trains a classifier with Linear SVM and an RBF kernel feature map.

…

View on GitHub

Comparison of classifiers

t-SNE + MNIST

Let's get started!

gem install rumale

A Basic Example

require 'rumale'

ruby_labels = label_array
#  [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
ruby_samples = sample_array
#  [[samples_1], [samples_2], [samples_3], .. [samples_n]]

# Convert to NArray.
labels = Numo::Int32.cast(ruby_labels)
samples = Numo::DFloat.cast(ruby_samples)

# Preprocessing The Data
# Encoding Categorical Features, Normalization, etc.

# Create Your Model
model = Rumale::NearestNeighbors::KNeighborsClassifier.new
model.fit(samples, labels)

# Prediction
model.predict(new_samples)

# Evaluation
puts model.score(test_samples, test_labels)

Loading The Data

Convert Ruby Array to NArray.

labels = Numo::Int32[*ruby_array]
# labels = Numo::Int32.cast(ruby_array)
# labels = Numo::Int32.asarray(ruby_array)

samples = Numo::DFloat[*ruby_array]
# samples = Numo::DFloat.cast(ruby_array)
# samples = Numo::DFloat.asarray(ruby_array)

Libsvm file.

# Load the training dataset.
samples, labels = Rumale::Dataset.load_libsvm_file('pendigits')

Preprocessing The Data

Standardization

normalizer = Rumale::Preprocessing::StandardScaler.new
new_training_samples = normalizer.fit_transform(training_samples)
new_testing_samples = normalizer.transform(testing_samples)
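
The important detail in the snippet above is that the scaler is fitted on the training samples only and then reused for the testing samples. As a rough illustration of the arithmetic, here is a plain-Ruby sketch on nested arrays. The helper names `column_stats` and `standardize` are hypothetical, the sketch uses the population standard deviation, and it is not Rumale's implementation, whose variance convention may differ.

```ruby
# Plain-Ruby sketch of per-column standardization (hypothetical helpers).
def column_stats(rows)
  n = rows.size.to_f
  rows.transpose.map do |col|
    mean = col.sum / n
    std = Math.sqrt(col.sum { |v| (v - mean)**2 } / n)
    [mean, std.zero? ? 1.0 : std] # constant columns are left unscaled
  end
end

def standardize(rows, stats)
  rows.map do |row|
    row.each_with_index.map { |v, j| (v - stats[j][0]) / stats[j][1] }
  end
end

training_samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
testing_samples = [[2.0, 40.0]]

stats = column_stats(training_samples)                      # fit on training data only
new_training_samples = standardize(training_samples, stats)
new_testing_samples = standardize(testing_samples, stats)   # reuse training statistics
```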

Normalization

normalizer = Rumale::Preprocessing::L2Normalizer.new
new_samples = normalizer.fit_transform(samples)
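
L2 normalization rescales each sample (row) so that its Euclidean length is 1. A minimal plain-Ruby sketch of that idea follows. The helper `l2_normalize` is hypothetical and leaves all-zero rows unchanged, which may not match what Rumale does in that edge case.

```ruby
# Plain-Ruby sketch: scale each row to unit L2 norm (hypothetical helper).
def l2_normalize(rows)
  rows.map do |row|
    norm = Math.sqrt(row.sum { |v| v * v })
    norm.zero? ? row.dup : row.map { |v| v / norm }
  end
end

samples = [[3.0, 4.0], [0.0, 0.0]]
new_samples = l2_normalize(samples)
```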

Binarization

# Compute the mask first; otherwise values just set to 1 are re-tested against thresh.
mask = na >= thresh
na[mask] = 1
na[~mask] = 0
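
Binarization maps every value to 1 or 0 depending on whether it reaches a threshold, and each result should depend only on the original value. A plain-Ruby sketch with a hypothetical `binarize` helper, using a threshold above 1 to show that case:

```ruby
# Plain-Ruby sketch: values at or above thresh become 1, the rest become 0.
def binarize(values, thresh)
  values.map { |v| v >= thresh ? 1 : 0 }
end

binarized = binarize([0.5, 2.0, 7.0, 5.0], 5.0)
```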

Encoding Categorical Features

encoder = Rumale::Preprocessing::LabelEncoder.new
labels = Numo::Int32[1, 8, 8, 15, 0]
encoded_labels = encoder.fit_transform(labels)
# => Numo::Int32#shape=[5]
# [1, 2, 2, 3, 0]
decoded_labels = encoder.inverse_transform(encoded_labels)
# => [1, 8, 8, 15, 0]

encoder = Rumale::Preprocessing::LabelEncoder.new
labels = ["A", "B", "B", "A", "C", "C"]
encoded_labels = encoder.fit_transform(labels)
# => Numo::Int32#shape=[6]
# [0, 1, 1, 0, 2, 2]
decoded_labels = encoder.inverse_transform(encoded_labels)
# => ["A", "B", "B", "A", "C", "C"]
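
The outputs above are consistent with mapping each label to its position in the sorted list of distinct labels. A plain-Ruby sketch of that mapping is shown below. The helper names are hypothetical and this is not Rumale's code.

```ruby
# Plain-Ruby sketch of label encoding (hypothetical helpers).
def fit_classes(labels)
  labels.uniq.sort
end

def encode(labels, classes)
  labels.map { |label| classes.index(label) }
end

def decode(codes, classes)
  codes.map { |code| classes[code] }
end

labels = [1, 8, 8, 15, 0]
classes = fit_classes(labels)
encoded_labels = encode(labels, classes)
decoded_labels = decode(encoded_labels, classes)
```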

One-hot-encoding

encoder = Rumale::Preprocessing::OneHotEncoder.new
labels = Numo::Int32[0, 0, 2, 3, 2, 1]
one_hot_vectors = encoder.fit_transform(labels)
# => Numo::DFloat#shape=[6,4]
# [[1, 0, 0, 0], 
#  [1, 0, 0, 0], 
#  [0, 0, 1, 0], 
#  [0, 0, 0, 1], 
#  [0, 0, 1, 0], 
#  [0, 1, 0, 0]]
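
Each row of the result has a single 1 in the column given by the label. A plain-Ruby sketch follows. The helper `one_hot` is hypothetical and assumes non-negative integer labels, with the number of columns defaulting to the largest label plus one.

```ruby
# Plain-Ruby sketch of one-hot encoding (hypothetical helper).
def one_hot(labels, n_classes = labels.max + 1)
  labels.map do |label|
    Array.new(n_classes) { |j| j == label ? 1.0 : 0.0 }
  end
end

one_hot_vectors = one_hot([0, 0, 2, 3, 2, 1])
```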

Imputing Missing Values

idx = narray.eq(0).where
narray[idx] = Float::NAN
mean = narray.mean(axis:0, nan:true)
axis = narray.new_narray.seq % narray.shape[1]
narray[idx] = mean[axis[idx]]
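
The snippet above treats 0 as the missing-value marker and replaces each missing entry with the mean of the remaining values in its column. The same idea in plain Ruby, as a sketch with a hypothetical helper `impute_with_column_means`:

```ruby
# Plain-Ruby sketch: replace entries equal to `missing` with their column mean.
def impute_with_column_means(rows, missing = 0.0)
  means = rows.transpose.map do |col|
    present = col.reject { |v| v == missing }
    # If a whole column is missing there is nothing to average; keep the marker.
    present.empty? ? missing : present.sum / present.size.to_f
  end
  rows.map do |row|
    row.each_with_index.map { |v, j| v == missing ? means[j] : v }
  end
end

rows = [[1.0, 0.0], [3.0, 4.0], [0.0, 8.0]]
imputed = impute_with_column_means(rows)
```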

Create Your Model

Supervised Learning Estimators

k-NN (k-Nearest Neighbors)

Rumale::NearestNeighbors::KNeighborsClassifier.new(n_neighbors: 5) 
Rumale::NearestNeighbors::KNeighborsRegressor.new(n_neighbors: 5)

n_neighbors : The number of neighbors.
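
To make the role of `n_neighbors` concrete, here is a plain-Ruby sketch of k-NN classification: find the `n_neighbors` closest training samples by squared Euclidean distance and take a majority vote. The function `knn_predict` is hypothetical, and Rumale's classifier may differ in details such as tie-breaking.

```ruby
# Plain-Ruby sketch of k-NN classification for a single query point.
def knn_predict(train_x, train_y, query, n_neighbors: 5)
  nearest = train_x.each_with_index
                   .sort_by { |x, _i| x.zip(query).sum { |a, b| (a - b)**2 } }
                   .first(n_neighbors)
  votes = nearest.map { |_x, i| train_y[i] }.group_by(&:itself)
  votes.max_by { |_label, group| group.size }.first
end

train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
label_near_origin = knn_predict(train_x, train_y, [0.2, 0.1], n_neighbors: 3)
label_far = knn_predict(train_x, train_y, [5.5, 5.5], n_neighbors: 3)
```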

Linear Regression

Rumale::LinearModel::LinearRegression.new(
  fit_bias:       # (Boolean) — The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) — The scale of the bias term.
  max_iter:       # (Integer) — The maximum number of iterations.
  batch_size:     # (Integer) — The size of the mini batches.
  optimizer:      # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) — The seed value used to initialize the random generator.
)

optimizer: AdaGrad, Adam, Nadam, RMSProp, SGD, YellowFin

Ridge Regression

L2 regularization

Rumale::LinearModel::Ridge.new(
  reg_param:      # (Float) — The regularization parameter.
  fit_bias:       # (Boolean) — The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) — The scale of the bias term.
  max_iter:       # (Integer) — The maximum number of iterations.
  batch_size:     # (Integer) — The size of the mini batches.
  optimizer:      # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) — The seed value used to initialize the random generator.
)

Lasso Regression

L1 regularization

Rumale::LinearModel::Lasso.new(
  reg_param:      # (Float) — The regularization parameter.
  fit_bias:       # (Boolean) — The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) — The scale of the bias term.
  max_iter:       # (Integer) — The maximum number of iterations.
  batch_size:     # (Integer) — The size of the mini batches.
  optimizer:      # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) — The seed value used to initialize the random generator.
)

Logistic Regression

Rumale::LinearModel::LogisticRegression.new(
  reg_param:      # (Float) — The regularization parameter.
  fit_bias:       # (Boolean) — The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) — The scale of the bias term. If fit_bias is true, the feature vector v becomes [v; bias_scale].
  max_iter:       # (Integer) — The maximum number of iterations.
  batch_size:     # (Integer) — The size of the mini batches.
  optimizer:      # (Optimizer) — The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) — The seed value used to initialize the random generator.
)

Support Vector Machine

svc = Rumale::LinearModel::SVC.new(
  reg_param:      # (Float) —  The regularization parameter.
  fit_bias:       # (Boolean) —  The flag indicating whether to fit the bias term.
  bias_scale:     # (Float) —  The scale of the bias term.
  max_iter:       # (Integer) —  The maximum number of iterations.
  batch_size:     # (Integer) —  The size of the mini batches.
  probability:    # (Boolean) —  The flag indicating whether to perform probability estimation.
  optimizer:      # (Optimizer) —  The optimizer to calculate adaptive learning rate. If nil is given, Nadam is used.
  random_seed:    # (Integer) —  The seed value used to initialize the random generator.
)

Naive Bayes

GaussianNB
BernoulliNB
MultinomialNB

Rumale::NaiveBayes::GaussianNB.new
Rumale::NaiveBayes::BernoulliNB.new(smoothing_param: 1.0, bin_threshold: 0.0)
Rumale::NaiveBayes::MultinomialNB.new(smoothing_param: 1.0)

Decision Tree

Rumale::Tree::DecisionTreeClassifier.new(
  criterion:         # (String) —  The function to evaluate splitting point. Supported criteria are ‘gini’ and ‘entropy’.
  max_depth:         # (Integer) —  The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) —  The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) —  The minimum number of samples at a leaf node.
  max_features:      # (Integer) —  The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) —  The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Rumale::Tree::DecisionTreeRegressor.new(
  criterion:         # (String) —The function to evaluate splitting point. Supported criteria are ‘mae’ and ‘mse’.
  max_depth:         # (Integer) —The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) —The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) —The minimum number of samples at a leaf node.
  max_features:      # (Integer) —The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) —The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

ExtraTree

Random Forest

Rumale::Ensemble::RandomForestClassifier.new(
  n_estimators:      # (Integer) —The number of decision trees for constructing the random forest.
  criterion:         # (String) —The function to evaluate splitting point. Supported criteria are ‘gini’ and ‘entropy’.
  max_depth:         # (Integer) —The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) —The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) —The minimum number of samples at a leaf node.
  max_features:      # (Integer) —The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) —The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Rumale::Ensemble::RandomForestRegressor.new(
  n_estimators:      # (Integer) —The number of decision trees for constructing the random forest.
  criterion:         # (String) —The function to evaluate splitting point. Supported criteria are ‘mae’ and ‘mse’.
  max_depth:         # (Integer) —The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) —The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) —The minimum number of samples at a leaf node.
  max_features:      # (Integer) —The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) —The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

AdaBoost (Adaptive Boosting)

Rumale::Ensemble::AdaBoostClassifier.new(
  n_estimators:      # (Integer) —The number of decision trees for constructing the AdaBoost classifier.
  criterion:         # (String) —The function to evaluate splitting point. Supported criteria are ‘gini’ and ‘entropy’.
  max_depth:         # (Integer) —The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) —The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) —The minimum number of samples at a leaf node.
  max_features:      # (Integer) —The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) —The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Rumale::Ensemble::AdaBoostRegressor.new(
  n_estimators:      # (Integer) —The number of decision trees for constructing the AdaBoost regressor.
  threshold:         # (Float) —The threshold for delimiting correct and incorrect predictions. It is constrained to [0, 1].
  exponent:          # (Float) —The exponent for the weight of each weak learner.
  criterion:         # (String) —The function to evaluate splitting point. Supported criteria are ‘mae’ and ‘mse’.
  max_depth:         # (Integer) —The maximum depth of the tree. If nil is given, decision tree grows without concern for depth.
  max_leaf_nodes:    # (Integer) —The maximum number of leaves on decision tree. If nil is given, number of leaves is not limited.
  min_samples_leaf:  # (Integer) —The minimum number of samples at a leaf node.
  max_features:      # (Integer) —The number of features to consider when searching optimal split point. If nil is given, split process considers all features.
  random_seed:       # (Integer) —The seed value used to initialize the random generator. It is used to randomly determine the order of features when deciding splitting point.
)

Unsupervised Learning Estimators

PCA (Principal component analysis)

Rumale::Decomposition::PCA.new(
  n_components:    # (Integer) —The number of principal components.
  max_iter:        # (Integer) —The maximum number of iterations.
  tol:             # (Float) —The tolerance of termination criterion.
  random_seed:     # (Integer) —The seed value used to initialize the random generator.
)

NMF (Non-negative matrix factorization)

Rumale::Decomposition::NMF.new(
  n_components:    # (Integer) —The number of components.  
  max_iter:        # (Integer) —The maximum number of iterations.  
  tol:             # (Float) —The tolerance of termination criterion.  
  eps:             # (Float) —A small value close to zero to avoid zero division error.  
  random_seed:     # (Integer) —The seed value used to initialize the random generator.
)

t-SNE (T-distributed Stochastic Neighbor Embedding)

Rumale::Manifold::TSNE.new(
  n_components: # (Integer) —The number of dimensions on representation space.
  perplexity:   # (Float) —The effective number of neighbors for each point. Perplexity is typically set from 5 to 50.
  metric:       # (String) —The metric to calculate the distances in original space. If metric is 'euclidean', Euclidean distance is calculated for distance in original space. If metric is 'precomputed', the fit and fit_transform methods expect to be given a distance matrix.
  init:         # (String) —The init is a method to initialize the representation space. If init is 'random', the representation space is initialized with normal random variables. If init is 'pca', the result of principal component analysis is used as the initial value of the representation space.
  max_iter:     # (Integer) —The maximum number of iterations.
  tol:          # (Float) —The tolerance of KL-divergence for terminating optimization. If tol is nil, it does not use KL divergence as a criterion for terminating the optimization.
  verbose:      # (Boolean) —The flag indicating whether to output KL divergence during iteration.
  random_seed:  # (Integer) —The seed value used to initialize the random generator.
)

KMeans clustering

Rumale::Clustering::KMeans.new(
  n_clusters:      # (Integer) —The number of clusters.
  init:            # (String) —The initialization method for centroids (‘random’ or ‘k-means++’).
  max_iter:        # (Integer) —The maximum number of iterations.
  tol:             # (Float) —The tolerance of termination criterion.
  random_seed:     # (Integer) —The seed value used to initialize the random generator.
)

DBSCAN (Density-based spatial clustering of applications with noise)

Rumale::Clustering::DBSCAN.new(
  eps:             # (Float) —The radius of neighborhood.
  min_samples:     # (Integer) —The number of neighbor samples to be used for the criterion whether a point is a core point.
)

Model Fitting

model.fit(samples, labels)
model.fit(samples)
model.fit_transform(x)

Prediction

y_pred = model.predict(samples)

Evaluate Model’s Performance

Classification Metrics

Accuracy Score

evaluator = Rumale::EvaluationMeasure::Accuracy.new
puts evaluator.score(ground_truth, predicted)
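
Accuracy is the fraction of predictions that match the ground truth. A plain-Ruby sketch of that definition, using a hypothetical helper `accuracy` on ordinary arrays:

```ruby
# Plain-Ruby sketch: fraction of positions where prediction equals ground truth.
def accuracy(ground_truth, predicted)
  correct = ground_truth.zip(predicted).count { |truth, pred| truth == pred }
  correct.to_f / ground_truth.size
end

acc = accuracy([0, 1, 1, 2], [0, 1, 2, 2])
```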

Regression Metrics

Mean Absolute Error, MAE

evaluator = Rumale::EvaluationMeasure::MeanAbsoluteError.new
puts evaluator.score(ground_truth, predicted)

Mean Squared Error

evaluator = Rumale::EvaluationMeasure::MeanSquaredError.new
puts evaluator.score(ground_truth, predicted)
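
Mean absolute error averages the absolute differences between ground truth and prediction, while mean squared error averages the squared differences and so penalizes large errors more heavily. Both definitions in plain Ruby, as a sketch with hypothetical helper names:

```ruby
# Plain-Ruby sketches of the two regression error measures.
def mean_absolute_error(ground_truth, predicted)
  ground_truth.zip(predicted).sum { |y, y_hat| (y - y_hat).abs } / ground_truth.size.to_f
end

def mean_squared_error(ground_truth, predicted)
  ground_truth.zip(predicted).sum { |y, y_hat| (y - y_hat)**2 } / ground_truth.size.to_f
end

ground_truth = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(ground_truth, predicted)
mse = mean_squared_error(ground_truth, predicted)
```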

R2 Score

(coefficient of determination)

evaluator = Rumale::EvaluationMeasure::R2Score.new
puts evaluator.score(ground_truth, predicted)
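
The coefficient of determination compares the residual sum of squares with the total sum of squares around the mean: R2 = 1 - SS_res / SS_tot. A plain-Ruby sketch follows. The helper `r2_score` is hypothetical and assumes the ground truth is not constant, since SS_tot would then be zero.

```ruby
# Plain-Ruby sketch of the R2 score (assumes ground_truth is not constant).
def r2_score(ground_truth, predicted)
  mean = ground_truth.sum / ground_truth.size.to_f
  ss_res = ground_truth.zip(predicted).sum { |y, y_hat| (y - y_hat)**2 }
  ss_tot = ground_truth.sum { |y| (y - mean)**2 }
  1.0 - ss_res / ss_tot.to_f
end

r2 = r2_score([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

A perfect prediction gives 1.0, and always predicting the mean of the ground truth gives 0.0.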

Clustering Metrics

Adjusted Rand Index

evaluator = Rumale::EvaluationMeasure::AdjustedRandScore.new
puts evaluator.score(ground_truth, predicted)

Cross-Validation

svc = Rumale::LinearModel::SVC.new
kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)
cv = Rumale::ModelSelection::CrossValidation.new(estimator: svc, splitter: kf)
report = cv.perform(samples, labels)
mean_test_score = report[:test_score].inject(:+) / kf.n_splits
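
Stratified k-fold splitting keeps the class proportions roughly equal in every fold. The sketch below shows one simple way to do that in plain Ruby by dealing the indices of each class round-robin into the folds. The helper `stratified_folds` is hypothetical, does no shuffling, and is not Rumale's implementation.

```ruby
# Plain-Ruby sketch of stratified fold assignment (hypothetical helper).
def stratified_folds(labels, n_splits)
  folds = Array.new(n_splits) { [] }
  labels.each_index.group_by { |i| labels[i] }.each_value do |indices|
    indices.each_with_index { |idx, k| folds[k % n_splits] << idx }
  end
  folds
end

labels = [0, 0, 0, 0, 1, 1, 1, 1]
folds = stratified_folds(labels, 2)

# Each fold serves once as the test set; the remaining indices form the training set.
splits = folds.map { |test_ids| [(0...labels.size).to_a - test_ids, test_ids] }
```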

Tune Your Model

Grid Search

rfc = Rumale::Ensemble::RandomForestClassifier.new(random_seed: 1)

pg = { n_estimators: [5, 10], max_depth: [3, 5], max_leaf_nodes: [15, 31] }

kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)

gs = Rumale::ModelSelection::GridSearchCV.new(estimator: rfc, param_grid: pg, splitter: kf)
gs.fit(samples, labels)

p gs.cv_results
p gs.best_params
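
A grid search evaluates every combination of the values listed in the parameter grid, so the grid above yields 2 x 2 x 2 = 8 candidates. The plain-Ruby sketch below enumerates those combinations. The helper `expand_grid` is hypothetical, assumes a non-empty grid, and only illustrates the enumeration, not how Rumale performs the search.

```ruby
# Plain-Ruby sketch: expand a parameter grid into all combinations.
def expand_grid(param_grid)
  keys = param_grid.keys
  first, *rest = param_grid.values
  first.product(*rest).map { |combo| keys.zip(combo).to_h }
end

pg = { n_estimators: [5, 10], max_depth: [3, 5], max_leaf_nodes: [15, 31] }
candidates = expand_grid(pg)
```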

Grid search with pipeline

rbf = Rumale::KernelApproximation::RBF.new(random_seed: 1)
svc = Rumale::LinearModel::SVC.new(random_seed: 1)
pipe = Rumale::Pipeline::Pipeline.new(steps: { rbf: rbf, svc: svc })

pg = { rbf__gamma: [32.0, 1.0], rbf__n_components: [4, 128], svc__reg_param: [16.0, 0.1] }

kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5)

gs = Rumale::ModelSelection::GridSearchCV.new(estimator: pipe, param_grid: pg, splitter: kf)
gs.fit(samples, labels)

p gs.cv_results
p gs.best_params

Pipeline

Sequentially apply a list of transforms and a final estimator.

rbf = Rumale::KernelApproximation::RBF.new(gamma: 1.0, n_components: 128, random_seed: 1)
svc = Rumale::LinearModel::SVC.new(reg_param: 1.0, fit_bias: true, max_iter: 5000, random_seed: 1)

pipeline = Rumale::Pipeline::Pipeline.new(steps: { trs: rbf, est: svc })
pipeline.fit(training_samples, training_labels)

results = pipeline.predict(testing_samples)

References

The duck logo

Ugly duckling theorem¹ (Wikipedia)

https://github.com/yoshoku/rumale/issues/4#issuecomment-483495559 ↩

Rumale Cheat Sheet

kojix2