Activation functions in PyTorch (4)

*Memos:

My post explains GELU() and Mish().
My post explains SiLU() and Softplus().
My post explains Step function, Identity and ReLU.
My post explains Leaky ReLU, PReLU and FReLU.
My post explains ELU, SELU and CELU.
My post explains Tanh, Softsign, Sigmoid and Softmax.
My post explains Vanishing Gradient Problem, Exploding Gradient Problem and Dying ReLU Problem.
My post explains layers in PyTorch.
My post explains loss functions in PyTorch.
My post explains optimizers in PyTorch.

(1) GELU(Gaussian Error Linear Unit):

can convert an input value(x) to an output value by weighting x with its cumulative probability under the standard Gaussian distribution, optionally approximated with Tanh. *The output is 0 only when x = 0.
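's formula is y = 0.5 * x * (1 + erf(x / √2)). Or, with the Tanh approximation, y = 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x^3))). *Both of them give almost the same results.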
is GELU() in PyTorch.
is used in:
- Transformer. *Transformer() in PyTorch uses ReLU by default and GELU can be selected with activation="gelu".
- NLP(Natural Language Processing) models based on Transformer such as ChatGPT, BERT(Bidirectional Encoder Representations from Transformers), etc. *Strictly speaking, ChatGPT and BERT are based on Large Language Models(LLMs) which are based on Transformer.
's pros:
- It mitigates Vanishing Gradient Problem.
- It mitigates Dying ReLU Problem. *The output and the gradient still approach 0 for large negative input values so Dying ReLU Problem is not completely avoided.
's cons:
- It's computationally expensive because of complex operations including Erf(Error function) or Tanh.
's graph in Desmos:
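
As a rough check of how close GELU's Erf form and its Tanh approximation are, here is a minimal sketch in plain Python using only the math module. The names gelu_exact and gelu_tanh are made up for this illustration and are not PyTorch APIs. In PyTorch, GELU() has the approximate argument ('none' or 'tanh') to choose between the two forms.

```python
import math


def gelu_exact(x: float) -> float:
    # x weighted by the standard normal CDF, written with Erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    # Tanh approximation of the same curve.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Compare both forms on a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  erf={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")

# Largest gap between the two forms on a grid from -6.0 to 6.0.
max_gap = max(abs(gelu_exact(v / 10) - gelu_tanh(v / 10)) for v in range(-60, 61))
print("max gap:", max_gap)
```

Negative inputs give small negative outputs instead of a hard 0, which is why GELU is said to mitigate Dying ReLU Problem.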

(2) Mish:

can convert an input value(x) to an output value by x * Tanh(Softplus(x)). *The output is 0 only when x = 0.
's formula is y = x * tanh(log(1 + e^x)).
is Mish() in PyTorch.
's pros:
- It mitigates Vanishing Gradient Problem.
- It mitigates Dying ReLU Problem. *The output and the gradient still approach 0 for large negative input values so Dying ReLU Problem is not completely avoided.
's cons:
- It's computationally expensive because of the Tanh and Softplus operations.
's graph in Desmos:
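
To make x * Tanh(Softplus(x)) concrete, here is a minimal sketch in plain Python. The helpers softplus and mish are hypothetical names for this illustration, not PyTorch APIs, and the overflow-safe form of Softplus is an assumption of this sketch rather than a statement about how Mish() in PyTorch is implemented.

```python
import math


def softplus(x: float) -> float:
    # log(1 + e^x), rewritten so math.exp never receives a large positive value.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    # x * Tanh(Softplus(x))
    return x * math.tanh(softplus(x))


# Negative inputs give small negative outputs, large inputs pass almost unchanged.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  mish={mish(x):+.6f}")
```

For large positive x, Tanh(Softplus(x)) is close to 1 so Mish behaves almost like Identity there.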

(3) SiLU(Sigmoid-Weighted Linear Unit):

can convert an input value(x) to an output value by x * Sigmoid(x). *The output is 0 only when x = 0.
's formula is y = x / (1 + e^-x).
is also called Swish.
is SiLU() in PyTorch.
's pros:
- It mitigates Vanishing Gradient Problem.
- It mitigates Dying ReLU Problem. *The output and the gradient still approach 0 for large negative input values so Dying ReLU Problem is not completely avoided.
's cons:
- It's computationally expensive because of Sigmoid.
's graph in Desmos:
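
The formula y = x / (1 + e^-x) can be tried out with a minimal sketch in plain Python. The helpers sigmoid and silu are hypothetical names for this illustration, not PyTorch APIs, and splitting Sigmoid by the sign of x is an assumption made here only to keep math.exp from overflowing.

```python
import math


def sigmoid(x: float) -> float:
    # Branch on the sign so math.exp only ever sees a non-positive argument.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    # x * Sigmoid(x), which equals x / (1 + e^-x)
    return x * sigmoid(x)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  silu={silu(x):+.6f}")
```

As with GELU and Mish, negative inputs give small negative outputs instead of a hard 0.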

(4) Softplus:

can convert an input value(x) to an output value between 0 and ∞. *0 is exclusive.
's formula is y = log(1+e^x).
is Softplus() in PyTorch.
's pros:
- It smoothly converts input values to positive output values.
- The convergence is stable.
- It mitigates Vanishing Gradient Problem.
- It mitigates Exploding Gradient Problem.
- It avoids Dying ReLU Problem.
's cons:
- It's computationally expensive because of the log and exponential operations.
's graph in Desmos:
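
The pros about gradients can be illustrated with a minimal sketch in plain Python: the derivative of y = log(1+e^x) is Sigmoid(x), which stays between 0 and 1 and is never exactly 0 in theory. The helpers softplus and softplus_grad are hypothetical names for this illustration, not PyTorch APIs, and the overflow-safe rewriting is an assumption of this sketch.

```python
import math


def softplus(x: float) -> float:
    # log(1 + e^x), rewritten so math.exp never receives a large positive value.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x: float) -> float:
    # d/dx log(1 + e^x) = Sigmoid(x), computed without overflow.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# The output stays positive and the gradient stays between 0 and 1.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:+.1f}  softplus={softplus(x):.6f}  grad={softplus_grad(x):.6f}")
```

For very large negative inputs, floating-point rounding can still produce 0 for the output and the gradient even though the formula itself never reaches 0.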
