Activation Functions: The Small Nonlinearity That Shapes a Network

sky_io@outlook.com (K4i) — Thu, 18 Jun 2026 10:00:00 +0800

Activation functions look like small details. In a neural-network layer, the heavy computation is usually the matrix multiplication:

$$z = Wx + b$$

Then we apply a simple function $\phi$ elementwise to the pre-activation $z$:

$$a = \phi(z)$$
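The two equations above can be sketched in a few lines of NumPy. This is only an illustration: the names `dense_layer` and `relu`, and the small hand-picked values for `W`, `b`, and `x`, are assumptions made for the example rather than anything prescribed by the post.

```python
import numpy as np

def relu(z):
    # Elementwise nonlinearity: negative entries become zero.
    return np.maximum(z, 0.0)

def dense_layer(x, W, b, phi):
    # Affine step z = Wx + b, followed by the elementwise activation a = phi(z).
    z = W @ x + b
    return phi(z)

# Hypothetical toy values, chosen so the arithmetic is easy to follow by hand.
W = np.array([[1.0, -2.0],
              [0.5,  1.0]])
b = np.array([0.0, -1.0])
x = np.array([1.0, 2.0])

a = dense_layer(x, W, b, relu)
print(a)  # z = [-3.0, 1.5], so a = [0.0, 1.5]
```

Passing `phi` as an argument keeps the affine step and the nonlinearity separate, which mirrors how the two equations are written.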

It is tempting to treat $\phi$ as a plug-in choice: sigmoid, tanh, ReLU, GELU, SiLU, Mish, or one of hundreds of proposed variants. But the activation is not decoration. It decides whether stacked layers can represent nonlinear functions, whether gradients keep flowing, whether hidden values stay centered, and whether the model pays a large runtime cost for a tiny accuracy gain.
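The first of those claims, that stacked layers need a nonlinearity to represent nonlinear functions, can be checked numerically. The sketch below uses hypothetical toy weights (`W1`, `b1`, `W2`, `b2` are made up for illustration). With an identity activation, two layers collapse into a single affine map; with ReLU, the same weights produce a function that is no longer affine.

```python
import numpy as np

# Hypothetical toy weights for a 2 -> 2 -> 1 network.
W1 = np.array([[1.0, -1.0],
               [2.0,  1.0]])
b1 = np.array([0.5, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.25])

def identity(z):
    return z

def relu(z):
    return np.maximum(z, 0.0)

def two_layers(x, phi):
    # Hidden layer with activation phi, then a plain affine output layer.
    return W2 @ phi(W1 @ x + b1) + b2

x = np.array([1.0, -2.0])

# Without a nonlinearity, the stack equals one affine map W_eff x + b_eff.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
collapsed = W_eff @ x + b_eff
linear_collapses = np.allclose(two_layers(x, identity), collapsed)

# Any affine map f satisfies f(x) + f(-x) = 2 f(0); the ReLU network does not.
zero = np.zeros(2)
lhs = two_layers(x, relu) + two_layers(-x, relu)
rhs = 2 * two_layers(zero, relu)
relu_is_affine = np.allclose(lhs, rhs)

print(linear_collapses)  # True
print(relu_is_affine)    # False
```

The symmetry check `f(x) + f(-x) = 2 f(0)` is just one convenient test for affineness; a single failing input is enough to show the ReLU network is not an affine map.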

Activation-Function on k4i's blog
