Activation functions in PyTorch (4)

hyperkai

Super Kai (Kazuya Ito)

Posted on October 5, 2024



(1) GELU(Gaussian Error Linear Unit):

  • can convert an input value(x) to an output value by weighting x with its cumulative probability under the standard Gaussian distribution, optionally using a Tanh approximation. *The output is 0 only when x = 0.
  • 's formula is y = 0.5x(1 + erf(x / √2)), or with the Tanh approximation, y = 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))). *Both of them give almost the same results.
  • is GELU() in PyTorch.
  • is used in:
    • Transformer. *Transformer() in PyTorch.
    • NLP(Natural Language Processing) models based on Transformer such as ChatGPT and BERT(Bidirectional Encoder Representations from Transformers). *Strictly speaking, ChatGPT and BERT are based on Large Language Models(LLMs), which are based on Transformer.
  • 's pros:
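    • It mitigates the Vanishing Gradient Problem.
    • It mitigates the Dying ReLU Problem. *Negative inputs produce small negative outputs instead of a flat 0, but the output and the gradient approach 0 as x becomes strongly negative, so the Dying ReLU Problem is not completely avoided.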
  • 's cons:
    • It's computationally expensive because of complex operations such as Erf(Error function) or Tanh.
  • 's graph in Desmos: [Image: graph of GELU]
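
As a rough illustration of the two GELU formulas above, the following plain-Python sketch evaluates the Erf form and the Tanh approximation side by side. The names gelu_exact and gelu_tanh are hypothetical helpers written for this memo, not PyTorch APIs. In PyTorch the corresponding modules should be GELU() and GELU(approximate='tanh').

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), where Phi is the standard Gaussian CDF written with erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation of the same curve
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Print both variants for a few sample inputs
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  erf={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two columns should agree closely, which matches the note that both formulas give almost the same results.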

(2) Mish:

  • can convert an input value(x) to an output value by x * Tanh(Softplus(x)). *The output is 0 only when x = 0.
  • 's formula is y = x * tanh(log(1 + e^x)).
  • is Mish() in PyTorch.
  • 's pros:
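    • It mitigates the Vanishing Gradient Problem.
    • It mitigates the Dying ReLU Problem. *Negative inputs produce small negative outputs instead of a flat 0, but the output and the gradient approach 0 as x becomes strongly negative, so the Dying ReLU Problem is not completely avoided.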
  • 's cons:
    • It's computationally expensive because of the Tanh and Softplus operations.
  • 's graph in Desmos: [Image: graph of Mish]
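
The composition x * Tanh(Softplus(x)) can be sketched in plain Python as follows. The helper names softplus and mish are hypothetical and written only for this memo. In PyTorch the module itself is Mish(). The Softplus helper is rearranged for numerical stability, which is an implementation choice of this sketch and not something the formula requires.

```python
import math

def softplus(x: float) -> float:
    # log(1 + e^x), rewritten so exp() never receives a large positive argument
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    # x * Tanh(Softplus(x))
    return x * math.tanh(softplus(x))

# Negative inputs give small negative outputs, large positive inputs stay close to x
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  mish={mish(x):+.6f}")
```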

(3) SiLU(Sigmoid-Weighted Linear Units):

  • can convert an input value(x) to an output value by x * Sigmoid(x). *The output is 0 only when x = 0.
  • 's formula is y = x / (1 + e^(-x)).
  • is also called Swish.
  • is SiLU() in PyTorch.
  • 's pros:
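    • It mitigates the Vanishing Gradient Problem.
    • It mitigates the Dying ReLU Problem. *Negative inputs produce small negative outputs instead of a flat 0, but the output and the gradient approach 0 as x becomes strongly negative, so the Dying ReLU Problem is not completely avoided.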
  • 's cons:
    • It's computationally expensive because of the Sigmoid operation.
  • 's graph in Desmos: [Image: graph of SiLU]
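
To make the Dying ReLU comparison a little more concrete, here is a minimal plain-Python sketch of SiLU and its derivative next to ReLU's derivative. All function names are hypothetical helpers for this memo. In PyTorch the module itself is SiLU(). The derivative is worked out by hand from y = x * Sigmoid(x) and is not taken from PyTorch.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    # x * Sigmoid(x), the same as x / (1 + e^(-x))
    return x * sigmoid(x)

def silu_grad(x: float) -> float:
    # d/dx [x * s(x)] = s(x) + x * s(x) * (1 - s(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def relu_grad(x: float) -> float:
    # ReLU passes no gradient for non-positive inputs
    return 1.0 if x > 0 else 0.0

for x in (-6.0, -1.0, 0.0, 1.0):
    print(f"x={x:+.1f}  silu={silu(x):+.6f}  "
          f"silu_grad={silu_grad(x):+.6f}  relu_grad={relu_grad(x):.1f}")
```

For moderately negative inputs the SiLU gradient is small but not 0, while the ReLU gradient is exactly 0. For strongly negative inputs the SiLU gradient also shrinks toward 0.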

(4) Softplus:

  • can convert an input value(x) to an output value between 0 and ∞. *0 is exclusive.
  • 's formula is y = log(1 + e^x).
  • is Softplus() in PyTorch.
  • 's pros:
    • It maps every input value to a positive output value.
    • The convergence is stable.
    • It mitigates the Vanishing Gradient Problem.
    • It mitigates the Exploding Gradient Problem.
    • It avoids the Dying ReLU Problem.
  • 's cons:
    • It's computationally expensive because of the log and exponential operations.
  • 's graph in Desmos: [Image: graph of Softplus]
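
The formula y = log(1 + e^x) and the claims about its output range and gradient can be checked with a small plain-Python sketch. The helper names are hypothetical. The sketch assumes the default form with beta = 1 and ignores the threshold parameter that PyTorch's Softplus() also accepts. The derivative used here is Sigmoid(x), which follows from differentiating the formula by hand.

```python
import math

def softplus(x: float) -> float:
    # log(1 + e^x), rewritten so exp() never receives a large positive argument
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_grad(x: float) -> float:
    # The derivative of log(1 + e^x) is Sigmoid(x), computed here in a stable way
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# The output stays above 0 and the gradient stays between 0 and 1
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:+.1f}  softplus={softplus(x):.6f}  grad={softplus_grad(x):.6f}")
```

Because the gradient is never exactly 0 and never larger than 1 for these inputs, the sketch is consistent with the pros listed above, although it does not prove them for a full training run.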
