Intro to the Knobs: A Quick Journey through Popular Hyperparameters

What are Hyperparameters?

Hyperparameters are settings chosen before training begins. They shape how training proceeds, but training itself does not change them. Adjusting them affects the model’s accuracy, its capacity to generalize to new samples, and the speed of convergence. In essence, they are how a developer steers a specific model toward better outputs.

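Below are some of the hyperparameters you may see as an ML developer. The first few (Temperature, Top P, Max Tokens, Stop Token) are set at inference time when calling an LLM; the rest are set when training a model.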

Temperature

Controls the randomness of outputs by reshaping the whole probability distribution over candidate tokens. Lower temperatures sharpen the distribution, so the model sticks to its most confident choices; higher temperatures flatten it, giving more creative and diverse outputs. This differs from ‘Top P’, which only sets a cutoff on the distribution rather than reshaping all of it. Good for NLP and recommendation systems.
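
To make the effect concrete, here is a minimal sketch of temperature applied to a softmax. The function name and the three example scores are made up for illustration; real LLM APIs apply this step internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lowest temperature most of the probability lands on the top-scoring token; at the highest, the three probabilities move closer together.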

Top P

Controls diversity and creativity through a cumulative probability threshold. When you want to maintain coherence and relevance but still have some diversity, choose ‘Top P’ over Temperature. Top P doesn’t reshape the probability distribution; it keeps only the most likely tokens whose combined probability reaches the chosen threshold and samples from those. The result is a diversity that is more controlled than what you get by raising the temperature.
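
A rough sketch of the cutoff, assuming the common variant that renormalizes the kept probabilities so they sum to 1 again. The function name and example probabilities are illustrative, not taken from any particular library.

```python
def top_p_filter(probs, p=0.9):
    """Keep the most likely tokens until their total probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # Rank token indices from most to least likely.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the survivors; every other token gets probability 0.
    total = sum(prob for _, prob in kept)
    filtered = [0.0] * len(probs)
    for idx, prob in kept:
        filtered[idx] = prob / total
    return filtered

# Hypothetical distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, p=0.75))
```

With this threshold the two least likely tokens are removed entirely, while the two survivors keep their relative order.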

Token Length

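This is the length of the input fed to the LLM, measured in tokens (pieces of words) rather than in words or characters. An input that is too short may not give the model enough context, and one that is too long may exceed what the model can handle.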

Max Tokens

This is the maximum number of tokens the LLM can generate in a response. A low limit uses less memory and gives a faster response, but the response may be cut off early, which can mean an incomplete or inaccurate answer.

Stop Token

A Stop Token, or “end-of-sequence” token, signals that generation should end. It can mark the end of each generated sentence or the end of the LLM’s whole output. A developer can control where the output stops through these signals.
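
The sketch below shows how a token limit and a stop token might interact in a generation loop. Everything here is a stand-in: `toy_next_token` simply replays a canned reply, and `"<eos>"` is an assumed stop token, not the one any specific model uses.

```python
def generate(next_token, prompt, max_tokens=20, stop_token="<eos>"):
    """Ask for tokens until the stop token appears or max_tokens is reached."""
    output = []
    for _ in range(max_tokens):
        token = next_token(prompt, output)
        if token == stop_token:
            break  # the model signalled that it is done
        output.append(token)
    return output

# Hypothetical stand-in for a model: replays a canned reply token by token.
CANNED = ["Hyperparameters", "are", "set", "before", "training", "<eos>"]

def toy_next_token(prompt, output):
    return CANNED[min(len(output), len(CANNED) - 1)]

print(generate(toy_next_token, ["Explain:"], max_tokens=3))   # cut off by the limit
print(generate(toy_next_token, ["Explain:"], max_tokens=20))  # ends at the stop token
```

The first call is truncated by the token limit; the second ends naturally when the stop token arrives.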

Learning Rate

The learning rate controls the step size of each weight update during training. It affects both the speed and the quality of convergence: too small and training is slow, too large and it can overshoot or diverge. There are different techniques for finding a good learning rate, including a grid search. You can also use a dynamic learning rate that adjusts over time based on a schedule or other criteria.
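
As an illustration, here is plain gradient descent on a toy one-variable function, tried with a few learning rates and with a simple exponential decay. The function, the values, and the decay scheme are all assumptions chosen to keep the sketch small.

```python
def gradient_descent(learning_rate, steps=50, decay=1.0, start=0.0):
    """Minimize f(w) = (w - 3)^2; decay < 1 shrinks the learning rate each step."""
    w = start
    lr = learning_rate
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative of (w - 3)^2
        w -= lr * grad      # the learning rate scales the step
        lr *= decay         # optional dynamic learning rate
    return w

# A tiny grid of candidate learning rates; the minimum is at w = 3.
for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))

# A large initial rate that decays over time.
print("decayed", gradient_descent(0.9, decay=0.9))
```

The smallest rate has not reached the minimum after 50 steps, the middle one lands close to it, and the largest one overshoots further on every step.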

Weight Decay

Weight decay penalizes large weights by adding a penalty term to the loss function. It helps prevent overfitting by pushing weights toward smaller values, making the network less prone to memorizing insignificant details.
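
One common form of the penalty is the sum of squared weights (L2). The sketch below assumes that form; the function names and default values are illustrative only.

```python
def l2_penalized_loss(data_loss, weights, weight_decay=0.01):
    """Return the data loss plus weight_decay times the sum of squared weights."""
    penalty = weight_decay * sum(w * w for w in weights)
    return data_loss + penalty

def sgd_step(weights, grads, learning_rate=0.1, weight_decay=0.01):
    """One gradient step on the penalized loss.

    The L2 penalty adds 2 * weight_decay * w to each weight's gradient,
    which nudges every weight toward zero.
    """
    return [w - learning_rate * (g + 2 * weight_decay * w)
            for w, g in zip(weights, grads)]

print(l2_penalized_loss(1.0, [3.0, -4.0], weight_decay=0.01))
print(sgd_step([1.0, -2.0], [0.0, 0.0], learning_rate=0.1, weight_decay=0.5))
```

In the second call the data gradients are zero, so any change in the weights comes from the penalty alone.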

Hidden Layer Size

This decides the number of neurons in each hidden layer of the model. One common rule of thumb is to start from the mean of the number of neurons in the input and output layers, then tune from there.
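
As a quick arithmetic sketch of that rule of thumb, with made-up layer sizes (784 inputs and 10 outputs are assumptions for illustration):

```python
def mean_rule_hidden_size(n_inputs, n_outputs):
    """Starting guess for hidden layer size: mean of input and output sizes."""
    return round((n_inputs + n_outputs) / 2)

print(mean_rule_hidden_size(784, 10))
```

Treat the result as a first value to try, not a final answer.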

Dropout Rate

The dropout rate sets the fraction of neurons randomly dropped at each training step in order to deter overfitting. A common starting point is 0.5 for hidden layers, with a lower or even zero dropout rate for input and output layers.
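
Here is a minimal sketch of dropout on a list of activations. It assumes the "inverted dropout" variant, which rescales the surviving values during training so nothing needs to change at inference time; the function name and parameters are illustrative.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=random):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0:
        return list(activations)  # dropout is switched off outside training
    keep = 1.0 - rate
    # Survivors are divided by `keep` so the expected value stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so repeated runs give the same mask
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng))
print(dropout([1.0, 1.0, 1.0], rate=0.5, training=False))
```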

Batch Size

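The batch size decides the number of samples used in each iteration of training. Larger batches give smoother gradient estimates but use more memory, while smaller batches are noisier and need more iterations per epoch. It may be worth studying the effects of large and small sizes on your model.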
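
To show how batch size changes the number of iterations, here is a small sketch that splits a toy dataset into batches. The dataset of ten integers and the helper name are placeholders.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # stand-in for ten training samples
for batch_size in (2, 4):
    batches = list(iterate_batches(samples, batch_size))
    print(batch_size, len(batches), batches)
```

When the dataset size is not a multiple of the batch size, the last batch is smaller than the rest.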

Number of Epochs

This is the number of times the entire dataset is passed through the model during training.
Look for common practices to determine a good starting number. If performance on a validation set stops improving or starts to degrade, implement early stopping. You can do so programmatically by monitoring a metric and defining a criterion for when to stop.
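
One possible stopping criterion is "patience": stop once the monitored metric has failed to improve for a set number of epochs in a row. The sketch below assumes validation loss as the metric; the names and the sample losses are made up.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 0-based epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss       # new best: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1   # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran every epoch

# Hypothetical validation losses, one per epoch.
losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7]
print(early_stopping_epoch(losses, patience=2))
```

In practice you would also keep a copy of the model from the best epoch, since the last epochs before stopping are by definition not improvements.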

Summary

This has been a quick journey through some ML hyperparameters you may see as an ML developer, perhaps working in AWS SageMaker or Bedrock. I hope I have shed some light and helped begin to unravel the rich tapestry of hyperparameters.

As you play with the knobs and dials, you may notice that small tweaks and fixes can have transformative effects on the performance of your models. As you dive deeper into how these parameters interact, you gain both experience and a developed intuition for creating models that work well.
