Neural networks without activation functions would be like orchestras without conductors: a lot of linear instruments playing in unison, but no way to create complex, layered music. Activation functions inject nonlinearity into networks, and this is what allows them to approximate complicated mappings between inputs and outputs. Without them, stacking layers of a neural network would collapse back into a single linear transformation, no more expressive than logistic regression.
For a long time, the sigmoid function was the star of the show. Its smooth S-shaped curve seemed like a natural fit: it squashes values into the range (0,1), making them interpretable as probabilities. In fact, the original backpropagation paper by Rumelhart, Hinton and Williams (1986) used sigmoids throughout, and so did the early convolutional networks of LeCun and colleagues in the 1990s. But sigmoids came with a serious flaw: the vanishing gradient problem. For large positive or negative inputs, the derivative of the sigmoid becomes tiny, effectively killing gradient flow. Training deep networks with sigmoids turned out to be frustratingly slow and unstable.
That’s why the rectified linear unit, or ReLU, caused such a stir when it appeared in deep learning practice in the late 2000s (popularised by Nair & Hinton, 2010, and later brought fully into the mainstream by Krizhevsky et al. in ImageNet classification with deep convolutional neural networks, 2012). ReLU is as simple as it gets:

$$\mathrm{ReLU}(x) = \max(0, x)$$
Its derivative is either 0 (for negative inputs) or 1 (for positive inputs). This simplicity turned out to be exactly what deep networks needed. There is, however, one subtlety: ReLU has a kink at zero. Mathematically, the function is not differentiable at $x = 0$. In practice this causes no trouble: implementations simply pick a convention for the derivative at that single point (usually 0), and gradient-based training proceeds unaffected.
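To make this concrete, here is a minimal NumPy sketch of ReLU and its derivative (the function names are ours, and we adopt the usual convention of a zero derivative at the kink):

```python
import numpy as np

def relu(x):
    # max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative: 1 for x > 0, 0 for x < 0; we pick 0 at x == 0
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.0, 0.0, 0.0, 0.5, 2.0]
print(relu_grad(x))  # [0.0, 0.0, 0.0, 1.0, 1.0]
```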
A common misconception, though, is to think of ReLU as “linear.” It looks like a straight line, after all. But appearances are deceiving. A function $f$ is linear if it satisfies the property

$$f(a + b) = f(a) + f(b)$$

for all inputs $a$ and $b$ (together with $f(ca) = c\,f(a)$ for any scalar $c$). ReLU fails this test. Take $a = -1$ and $b = 1$:

$$\mathrm{ReLU}(-1 + 1) = \mathrm{ReLU}(0) = 0, \quad \text{but} \quad \mathrm{ReLU}(-1) + \mathrm{ReLU}(1) = 0 + 1 = 1.$$
The two results don’t match. That small counterexample is enough to prove that ReLU is nonlinear. And this nonlinearity is precisely what allows neural networks to stack ReLUs into deep hierarchies and model arbitrarily complex functions.
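The counterexample is easy to verify numerically (a self-contained check using nothing but NumPy):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

a, b = -1.0, 1.0
print(relu(a + b))        # 0.0
print(relu(a) + relu(b))  # 1.0, so additivity fails and ReLU is not linear
```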
But at the same time, ReLU is piecewise linear: split the input domain into two regions, negative and non-negative, and on each region the function is linear.

For $x < 0$: $\mathrm{ReLU}(x) = 0$ (the constant zero function).

For $x \ge 0$: $\mathrm{ReLU}(x) = x$ (the identity function).

The “nonlinearity” comes from stitching these two linear pieces together at 0. That kink at the origin is what makes the function globally nonlinear.
This property scales up in fascinating ways when we stack ReLUs in a neural network. Each neuron divides its input space into two half-spaces with a linear function on each side. A network with many ReLU layers partitions the input space into a huge number of polyhedral regions, and inside each region the entire network behaves like a linear function. What makes it powerful is the sheer number of regions: with depth, the network can carve the input space into exponentially many such regions, and each one has its own linear mapping.

Figure 1. Left: Surface of the function learned by a 4-neuron ReLU layer, showing its piecewise-linear shape. Right: Same surface with red fold lines marking the kink boundaries where neurons switch on/off.
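To get a feel for how regions multiply, here is a small, self-contained sketch (a toy example of ours, not taken from the figure) that counts the distinct on/off activation patterns, and hence linear regions, that a single random ReLU layer induces on a patch of 2-D input space:

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 8
W = rng.normal(size=(n_neurons, 2))   # each row defines one hyperplane (fold line) in the plane
b = rng.normal(size=n_neurons)

# Sample a dense grid of 2-D inputs and record which neurons are "on" at each point.
xs = np.linspace(-3, 3, 400)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
patterns = (grid @ W.T + b > 0)       # boolean on/off pattern per grid point

# Each distinct pattern corresponds to one polyhedral region on which the layer is linear.
n_regions = len(np.unique(patterns, axis=0))
print(f"{n_neurons} ReLU neurons carve this patch of the plane into {n_regions} linear regions")
```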
Arora et al. (ICLR 2018, Understanding Deep Neural Networks with Rectified Linear Units) have shown that ReLU networks are astonishingly powerful: with no more than about $\lceil \log_2(n+1) \rceil + 1$ layers, a ReLU network can represent exactly any piecewise-linear function on $\mathbb{R}^n$. Since piecewise-linear functions can approximate any continuous function arbitrarily well, this makes ReLU networks universal approximators.
Beyond this theoretical universality, ReLU has clear practical advantages over older activations such as sigmoid or tanh. The difference is easiest to see in their curves. Sigmoid and tanh both flatten out as inputs grow in magnitude, and their derivatives collapse toward zero in those regions. That’s the vanishing gradient problem made visible: once activations saturate, gradients can no longer flow. ReLU, in contrast, has a flat zero region for negative inputs, but as soon as values turn positive, its slope is exactly 1 — a straight line that carries gradients back unchanged. This is why deep networks with ReLU can keep learning, while sigmoid or tanh networks often stall.

Figure 2. Top row shows sigmoid, tanh, and ReLU activations. Bottom row shows their derivatives. Sigmoid and tanh derivatives shrink to nearly zero outside the center, while ReLU’s derivative stays at 1 for all positive inputs, preserving gradient flow.
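A quick numerical illustration (the depth and the probe input are arbitrary choices of ours) shows how saturation crushes gradients once per-layer derivatives are multiplied together:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Derivative of each activation at a moderately large pre-activation.
z = 3.0
d_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))   # ~0.045
d_tanh = 1.0 - np.tanh(z) ** 2                # ~0.0099
d_relu = 1.0                                  # slope is exactly 1 for z > 0

# During backprop the per-layer derivatives multiply together.
depth = 20
for name, d in [("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu)]:
    print(f"{name:8s} gradient factor after {depth} layers: {d ** depth:.3e}")
```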
ReLU is also computationally efficient. Forward and backward passes reduce to a simple comparison, while sigmoid and tanh require exponentials and divisions, which are more costly. A further benefit is sparsity: negative inputs are clamped to zero, so many neurons switch off entirely at any given time. This sparsity encourages more efficient representations, reduces interference between features, and often improves generalization.
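The sparsity is easy to measure: for roughly zero-centered pre-activations, about half of the units are inactive at any given moment (the batch and layer sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
pre_activations = rng.normal(size=(128, 1024))   # a batch of 128 examples, 1024 units
activations = np.maximum(0.0, pre_activations)   # ReLU

sparsity = np.mean(activations == 0.0)
print(f"fraction of units outputting exactly zero: {sparsity:.2f}")  # about 0.50
```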
The very property that makes ReLU efficient — setting all negative values to zero — also brings a cost. This sparsity can lead to the dying ReLU problem: if a neuron’s inputs consistently fall in the negative region, its output and gradient both remain zero, so its weights stop updating. In effect, the neuron is dead and may never activate again. In practice this usually isn’t catastrophic — modern networks are so overparameterized that a few silent units don’t break performance — but it’s still a waste of capacity. To mitigate the issue, researchers have introduced variants such as Leaky ReLU (which keeps a small, nonzero slope for negative inputs), discussed below.
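A toy construction (the exaggeratedly negative bias is our own choice) shows the mechanism: once every pre-activation in a batch is negative, the gradient with respect to the weights is identically zero, so the neuron can never recover.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(256, 4))       # a batch of inputs
w = rng.normal(size=4)
b = -10.0                           # bias pushed far into the negative region

z = x @ w + b                       # pre-activations: all well below zero
a = np.maximum(0.0, z)              # ReLU output: all zeros

# Backprop through the ReLU: upstream gradient * relu'(z) * input
upstream = rng.normal(size=256)     # whatever gradient arrives from the loss
relu_grad = (z > 0).astype(float)   # all zeros here
grad_w = ((upstream * relu_grad)[:, None] * x).mean(axis=0)

print(a.max())   # 0.0 -- the unit never fires
print(grad_w)    # all zeros -- the weights never update, so the neuron stays dead
```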
Another drawback is that ReLU outputs are strictly non-negative, so they’re not centered around zero. During backpropagation this can cause gradients in each layer to have the same sign, leading to inefficient “zig-zag” updates in weight space. Batch normalization is often used to address this by re-centering activations.
ReLU is also not smooth: the function itself is continuous, but its slope jumps abruptly from 0 to 1 at the origin, so gradients change discontinuously as an input crosses zero, which is particularly unwelcome in the final output layer where stability matters most. Finally, because the mean activation of ReLU units is positive rather than zero, they can introduce a bias shift in the following layer. Unless these shifts cancel out, training may slow down.
ReLU’s hard cutoff at zero is both its strength and its weakness. By zeroing out all negative inputs, it creates sparse and efficient activations, but at the cost of potentially killing neurons and blocking gradient flow. Over time, researchers have proposed several variants that soften this behavior and improve stability.

Leaky ReLU introduces a small slope for negative values:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases}$$

where $\alpha$ is a small constant, typically 0.01.
Instead of cutting off negative activations entirely, it lets a trickle of gradient through. This prevents neurons from going completely silent, making training more robust. Even though the modification looks minor, it has a big practical effect: dead neurons become much rarer.
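A quick NumPy sketch of Leaky ReLU and its derivative, with $\alpha = 0.01$ as the usual default:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # identity for x >= 0, a small linear slope alpha for x < 0
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # the gradient never vanishes completely: it is alpha instead of 0 on the negative side
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(leaky_relu(x))       # [-0.03, -0.001, 0.0, 0.1, 3.0]
print(leaky_relu_grad(x))  # [0.01, 0.01, 1.0, 1.0, 1.0]
```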
Parametric ReLU (PReLU) takes this one step further by learning the slope $\alpha$ of the negative region as a trainable parameter instead of fixing it in advance, letting the network itself decide how much negative signal to pass through (He et al., 2015).
Exponential Linear Unit (ELU) smooths out the negative region with an exponential curve:

$$\mathrm{ELU}(x) = \begin{cases} x & x \ge 0 \\ \alpha \left(e^{x} - 1\right) & x < 0 \end{cases}$$
Unlike ReLU or Leaky ReLU, ELU doesn’t just preserve gradient flow — it also outputs values centered closer to zero. This helps reduce the bias shift problem, where all activations being positive slows down learning. For small negative inputs, ELU behaves like a leaky ReLU, but for large negative values it saturates to a constant. That saturation keeps activations bounded, improving stability and speeding up convergence in practice.
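A small sketch of ELU with the common default $\alpha = 1$ makes the saturation visible: large negative inputs approach $-\alpha$ instead of being clipped to zero.

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for x >= 0, a smooth exponential curve saturating at -alpha for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, -0.1, 0.0, 2.0])
print(elu(x))  # approximately [-1.0, -0.632, -0.095, 0.0, 2.0]
```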
GELU (Gaussian Error Linear Unit) is a more recent favorite. Instead of a sharp cutoff, it uses a smooth probabilistic gate based on the Gaussian distribution. You can think of GELU as weighting the input by the probability that it should “pass through.” This smoothness often improves training dynamics, especially in very deep architectures like Transformers, where GELU has become the default.
Hendrycks & Gimpel (2016), who introduced GELU, showed it outperformed both ReLU and ELU in deep models. Later, BERT (2018) adopted GELU as its activation, and since then it has become the standard across most Transformer-based LLMs. The smoother gradient flow and more nuanced treatment of negative inputs lead to faster convergence and slightly better accuracy.
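GELU’s exact form is $\mathrm{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard Gaussian CDF; in practice the tanh-based approximation from Hendrycks & Gimpel’s paper is often used. A minimal sketch of both:

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # the widely used tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # the two agree to well under 1e-2
```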
SiLU (Sigmoid Linear Unit, also known as Swish) is another smooth alternative to ReLU. Instead of a hard cutoff, it multiplies the input by its sigmoid, so large positive values pass through, large negatives are suppressed, and small negatives are only dampened rather than discarded. You can think of SiLU as applying a soft gate that never fully closes (in the more general Swish formulation, the sharpness of that gate is even a learnable parameter). This gentle shaping often leads to more stable optimization and slightly better accuracy compared to ReLU or even GELU in some settings.
Its formula is

$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}.$$
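A one-function NumPy sketch of SiLU:

```python
import numpy as np

def silu(x):
    # x * sigmoid(x): a soft gate that never closes completely
    return x / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(silu(x))  # approximately [-0.033, -0.269, 0.0, 0.731, 4.967]
```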
Ramachandran et al. (2017), who introduced Swish/SiLU, showed consistent gains across vision and language tasks. Later, architectures such as EfficientNet (2019) in computer vision and more recent large language models (e.g. PaLM, LLaMA-2) adopted SiLU in their feed-forward blocks. Its smoothness and non-monotonic shape help networks train deeper and converge faster, while preserving more information in the negative range than ReLU.
With ReLU and its variants, initialization is not a cosmetic detail: it directly determines whether neurons ever learn. Because ReLU outputs zero for negative inputs, a poor initialization can push most pre-activations into the negative region, immediately creating dead neurons with zero gradients. At the same time, overly large initial weights can cause activations to explode as depth increases, destabilizing training. Proper initialization aims to keep pre-activations roughly centered and with controlled variance as they propagate forward and backward through the network.
For ReLU-style activations, this led to He initialization (He et al., 2015), which scales weights by $\sqrt{2 / n_{\text{in}}}$: each weight is drawn from a zero-mean distribution with variance $2 / n_{\text{in}}$, where $n_{\text{in}}$ is the number of inputs to the layer. The factor of 2 compensates for ReLU zeroing out roughly half of its inputs, so the variance of activations stays roughly constant as signals propagate through the network.
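As a sketch (the layer width, depth, and batch size are arbitrary choices of ours), here is He initialization in NumPy, together with a quick check that the activation scale stays roughly constant through a stack of ReLU layers:

```python
import numpy as np

rng = np.random.default_rng(3)

def he_init(fan_in, fan_out):
    # zero-mean Gaussian with standard deviation sqrt(2 / fan_in), as in He et al. (2015)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

h = rng.normal(size=(1024, 512))   # a batch of inputs
for layer in range(10):            # 10 hidden layers of width 512
    W = he_init(512, 512)
    h = np.maximum(0.0, h @ W)     # linear layer followed by ReLU
    print(f"layer {layer + 1}: activation std = {h.std():.3f}")  # roughly constant, neither exploding nor vanishing
```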