
ReLU: The Nonlinear Workhorse of Deep Learning

History

Neural networks without activation functions would be like orchestras without conductors: a lot of linear instruments playing in unison, but no way to create complex, layered music. Activation functions inject nonlinearity into networks, and this is what allows them to approximate complicated mappings between inputs and outputs. Without them, stacking layers of a neural network would collapse back into a single linear transformation, no more expressive than logistic regression.

For a long time, the sigmoid function was the star of the show. Its smooth S-shaped curve seemed like a natural fit: it squashes values into the range (0,1), making them interpretable as probabilities. In fact, the original backpropagation paper by Rumelhart, Hinton and Williams (1986) used sigmoids throughout, and so did the early convolutional networks of LeCun and colleagues in the 1990s. But sigmoids came with a serious flaw: the vanishing gradient problem. For large positive or negative inputs, the derivative of the sigmoid becomes tiny, effectively killing gradient flow. Training deep networks with sigmoids turned out to be frustratingly slow and unstable.

That’s why the rectified linear unit, or ReLU, caused such a stir when it appeared in deep learning practice in the late 2000s (popularised by Nair & Hinton, 2010, and later brought fully into the mainstream by Krizhevsky et al. in ImageNet classification with deep convolutional neural networks, 2012). ReLU is as simple as it gets:

$$\mathrm{ReLU}(x) = \max(0, x). \tag{19}$$

Its derivative is either 0 (for negative inputs) or 1 (for positive inputs). This simplicity turned out to be exactly what deep networks needed. There is, however, one subtlety: ReLU has a kink at zero. Mathematically, the function is not differentiable at x=0, because the left-sided slope is 0 while the right-sided slope is 1. In practice, though, this is not a problem. Most deep learning libraries simply fix a convention for the derivative at zero (usually 0, occasionally 0.5; any value between 0 and 1 is a valid subgradient) and move on. The probability that a parameter update lands exactly at zero is vanishingly small, so this corner case almost never matters.
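For concreteness, here is a minimal NumPy sketch of ReLU and its derivative (the names relu and relu_grad are just for illustration), using the convention of a zero slope at x = 0:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied elementwise.
    return np.maximum(0.0, x)

def relu_grad(x):
    # 0 for x < 0, 1 for x > 0; at x == 0 we follow the common
    # library convention and return 0.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```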

Is ReLU linear?

A common misconception, though, is to think of ReLU as “linear.” It looks like a straight line, after all. But appearances are deceiving. A linear function must, among other things, be additive: f(x) + f(y) = f(x + y) for all inputs. Let’s test ReLU:

$$\mathrm{ReLU}(-1) + \mathrm{ReLU}(1) = 0 + 1 = 1, \tag{20}$$

but

$$\mathrm{ReLU}(-1 + 1) = \mathrm{ReLU}(0) = 0. \tag{21}$$

The two results don’t match. That small counterexample is enough to prove that ReLU is nonlinear. And this nonlinearity is precisely what allows neural networks to stack ReLUs into deep hierarchies and model arbitrarily complex functions.
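The same check works numerically, if you prefer code over algebra (a throwaway sketch):

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

x, y = -1.0, 1.0
print(relu(x) + relu(y))  # 1.0
print(relu(x + y))        # 0.0 -> additivity fails, so ReLU is not linear
```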

But at the same time, ReLU is piecewise linear. Split the input domain into two regions, negative and non-negative, and on each region the function is linear: ReLU(x) = 0 for x < 0, and ReLU(x) = x for x ≥ 0.

The “nonlinearity” comes from stitching these two linear pieces together at 0. That kink at the origin is what makes the function globally nonlinear.

This property scales up in fascinating ways when we stack ReLUs in a neural network. Each neuron divides its input space into two half-spaces with a linear function on each side. A network with many ReLU layers partitions the input space into a huge number of polyhedral regions, and inside each region the entire network behaves like a linear function. What makes it powerful is the sheer number of regions: with depth, the network can carve the input space into exponentially many such regions, and each one has its own linear mapping.
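A small sketch makes this concrete. The two-layer network below is entirely made up (random weights, arbitrary widths); the point is only that the pattern of which neurons are active identifies the linear region an input falls into, and that even a tiny network already carves the plane into many regions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer ReLU network on 2-D inputs (sizes and weights are arbitrary).
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)

def activation_pattern(x):
    # Which neurons are "on" (pre-activation > 0) identifies the polyhedral
    # region of the input space that x falls into; inside one region the
    # whole network is an affine map.
    z1 = W1 @ x + b1
    z2 = W2 @ np.maximum(0.0, z1) + b2
    return tuple((z1 > 0).astype(int)) + tuple((z2 > 0).astype(int))

# Sample a grid of inputs and count how many distinct regions it hits.
grid = [np.array([a, b]) for a in np.linspace(-3, 3, 80)
                         for b in np.linspace(-3, 3, 80)]
regions = {activation_pattern(x) for x in grid}
print(f"{len(regions)} distinct linear regions among {len(grid)} grid points")
```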


Figure 1. Left: Surface of the function learned by a 4-neuron ReLU layer, showing its piecewise-linear shape. Right: Same surface with red fold lines marking the kink boundaries where neurons switch on/off.

Why is ReLU so effective?

Arora et al. (ICLR 2018, Understanding Deep Neural Networks with Rectified Linear Units) have shown that ReLU networks are astonishingly powerful: with no more than about log2(n+1) hidden layers (where n is the input dimension), they can approximate any function in an Lp space, which is the standard way mathematicians measure the “size” of functions. You can think of Lp as different rulers for the same curve: each ruler gives a slightly different perspective, but ReLUs can match them all. What matters here is the number of layers, not just the number of neurons: each neuron applies ReLU individually, but it’s the stacking of layers that composes these nonlinearities and gives networks their expressive power. In other words, even relatively shallow ReLU networks are universal approximators.

Beyond this theoretical universality, ReLU has clear practical advantages over older activations such as sigmoid or tanh. The difference is easiest to see in their curves. Sigmoid and tanh both flatten out as inputs grow in magnitude, and their derivatives collapse toward zero in those regions. That’s the vanishing gradient problem made visible: once activations saturate, gradients can no longer flow. ReLU, in contrast, has a flat zero region for negative inputs, but as soon as values turn positive, its slope is exactly 1 — a straight line that carries gradients back unchanged. This is why deep networks with ReLU can keep learning, while sigmoid or tanh networks often stall.
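A quick numerical look at this saturation, using the standard formulas for the three derivatives:

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # peaks at 0.25, decays toward 0 for large |x|

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # peaks at 1, also decays toward 0

def d_relu(x):
    return 1.0 if x > 0 else 0.0  # exactly 1 for every positive input

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.1e}  tanh'={d_tanh(x):.1e}  ReLU'={d_relu(x):.0f}")
```

At x = 10 the sigmoid and tanh derivatives have shrunk to roughly 1e-5 and 1e-8 respectively, while ReLU's stays at 1.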


Figure 2. Top row shows sigmoid, tanh, and ReLU activations. Bottom row shows their derivatives. Sigmoid and tanh derivatives shrink to nearly zero outside the center, while ReLU’s derivative stays at 1 for all positive inputs, preserving gradient flow.

ReLU is also computationally efficient. Forward and backward passes reduce to a simple comparison, while sigmoid and tanh require exponentials and divisions, which are more costly. A further benefit is sparsity: negative inputs are clamped to zero, so many neurons switch off entirely at any given time. This sparsity encourages more efficient representations, reduces interference between features, and often improves generalization.
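As a rough illustration of that sparsity (random weights and inputs, so only the ballpark matters), about half of the units in a layer are exactly zero when pre-activations are centred around zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                # a batch of zero-centred inputs
W = rng.normal(size=(64, 256)) / np.sqrt(64)   # one dense layer (arbitrary sizes)
H = np.maximum(0.0, X @ W)                     # ReLU activations

print(f"fraction of exactly-zero activations: {(H == 0).mean():.2f}")  # ~0.50 here
```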

ReLU Disadvantages

The very property that makes ReLU efficient — setting all negative values to zero — also brings a cost. This sparsity can lead to the dying ReLU problem: if a neuron’s inputs consistently fall in the negative region, its output and gradient both remain zero, so its weights stop updating. In effect, the neuron is dead and may never activate again. In practice this usually isn’t catastrophic — modern networks are so overparameterized that a few silent units don’t break performance — but it’s still a waste of capacity. To mitigate the issue, researchers have introduced variants such as Leaky ReLU (max(αx,x) with a small slope α when x<0), Parametric ReLU (where α is learned), and smooth alternatives like ELU or GELU. These modifications keep gradients alive in the negative region, softening the bluntness of vanilla ReLU and reducing the risk that entire swaths of neurons die during training.
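A minimal PyTorch sketch of the effect; the large negative bias is contrived, standing in for a neuron whose pre-activations have drifted permanently negative:

```python
import torch

torch.manual_seed(0)
x = torch.randn(256, 10)

# A single linear unit whose bias has been pushed far into the negative
# region (a stand-in for a "dead" neuron).
w = torch.randn(10, requires_grad=True)
b = torch.tensor(-50.0)

out = torch.relu(x @ w + b).sum()
out.backward()
print(w.grad.abs().sum())   # tensor(0.) -> no gradient, the neuron cannot recover

# With a leaky slope, gradient still flows and the weights can move back.
w2 = torch.randn(10, requires_grad=True)
out2 = torch.nn.functional.leaky_relu(x @ w2 + b, negative_slope=0.01).sum()
out2.backward()
print(w2.grad.abs().sum())  # non-zero
```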

Another drawback is that ReLU outputs are strictly non-negative, so they’re not centered around zero. During backpropagation this can cause gradients in each layer to have the same sign, leading to inefficient “zig-zag” updates in weight space. Batch normalization is often used to address this by re-centering activations.

ReLU is also not smooth: its derivative jumps abruptly from 0 to 1 at the origin, so small changes around zero produce abrupt changes in gradient, which is undesirable in the final output layer where stability matters most. Finally, because the mean activation of ReLU units is positive rather than zero, they can introduce a bias shift in the following layer. Unless these shifts cancel out, training may slow down.

ReLU Variants

ReLU’s hard cutoff at zero is both its strength and its weakness. By zeroing out all negative inputs, it creates sparse and efficient activations, but at the cost of potentially killing neurons and blocking gradient flow. Over time, researchers have proposed several variants that soften this behavior and improve stability.

[Figure: ReLU variants]

Leaky ReLU introduces a small slope for negative values:

$$h(x) = \max(\alpha x, x), \qquad \alpha \ll 1. \tag{22}$$

Instead of cutting off negative activations entirely, it lets a trickle of gradient through. This prevents neurons from going completely silent, making training more robust. Even though the modification looks minor, it has a big practical effect: dead neurons become much rarer.

Parametric ReLU (PReLU) takes this one step further by learning the slope α directly from data. Different neurons — or entire layers — can then adapt how much “leak” they allow in the negative region. This adds only a small number of extra parameters but gives the network flexibility to discover the most effective nonlinearity for the task at hand.
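Both variants are available off the shelf in PyTorch, for example; the sketch below contrasts the fixed slope of Leaky ReLU with PReLU's learnable α (the particular values are only illustrative):

```python
import torch
import torch.nn as nn

leaky = nn.LeakyReLU(negative_slope=0.01)       # fixed slope for x < 0
prelu = nn.PReLU(num_parameters=1, init=0.25)   # one shared, learnable alpha

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(leaky(x))                  # tensor([-0.0200, -0.0050,  0.0000,  1.5000])
print(prelu(x))                  # tensor([-0.5000, -0.1250,  0.0000,  1.5000], grad_fn=...)
print(list(prelu.parameters()))  # alpha shows up as a trainable parameter
```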

Exponential Linear Unit (ELU) smooths out the negative region with an exponential curve:

$$h(x) = \begin{cases} x & x \ge 0 \\ \alpha\,(\exp(x) - 1) & x < 0 \end{cases} \tag{23}$$

Unlike ReLU or Leaky ReLU, ELU doesn’t just preserve gradient flow — it also outputs values centered closer to zero. This helps reduce the bias shift problem, where all activations being positive slows down learning. For small negative inputs, ELU behaves like a leaky ReLU, but for large negative values it saturates to a constant. That saturation keeps activations bounded, improving stability and speeding up convergence in practice.
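A small NumPy sketch of equation (23), showing how large negative inputs saturate near −α while positive inputs pass through unchanged:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x >= 0, alpha * (exp(x) - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, -0.1, 0.0, 2.0])
print(elu(x))  # approximately [-0.993 -0.632 -0.095  0.     2.   ]
```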

GELU (Gaussian Error Linear Unit) is a more recent favorite. Instead of a sharp cutoff, it uses a smooth probabilistic gate based on the Gaussian distribution. You can think of GELU as weighting the input by the probability that it should “pass through.” This smoothness often improves training dynamics, especially in very deep architectures like Transformers, where GELU has become the default.

Hendrycks & Gimpel (2016), who introduced GELU, showed it outperformed both ReLU and ELU in deep models. Later, BERT (2018) adopted GELU as its activation, and since then it has become the standard across most Transformer-based LLMs. The smoother gradient flow and more nuanced treatment of negative inputs lead to faster convergence and slightly better accuracy.
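The exact GELU is x·Φ(x), with Φ the standard normal CDF; the short sketch below computes it by hand and compares against PyTorch's built-in, which uses this exact form by default:

```python
import torch

x = torch.linspace(-3, 3, 7)

# Phi(x): standard normal CDF, written via the error function.
phi = 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))
gelu_manual = x * phi

print(gelu_manual)
print(torch.nn.functional.gelu(x))  # matches up to floating-point precision
```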

SiLU (Sigmoid Linear Unit, also known as Swish) is another smooth alternative to ReLU. Instead of a hard cutoff, it multiplies the input by its sigmoid, so large positive values pass through, large negatives are suppressed, and small negatives are only dampened rather than discarded. You can think of SiLU as applying a soft, learnable gate that never fully closes. This gentle shaping often leads to more stable optimization and slightly better accuracy compared to ReLU or even GELU in some settings.

Its formula is

$$\mathrm{SiLU}(x) = x\,\sigma(x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}. \tag{24}$$

Ramachandran et al. (2017), who introduced Swish/SiLU, showed consistent gains across vision and language tasks. Later, architectures such as EfficientNet (2019) in computer vision and more recent large language models (e.g. PaLM, LLaMA-2) adopted SiLU in their feed-forward blocks. Its smoothness and non-monotonic shape help networks train deeper and converge faster, while preserving more information in the negative range than ReLU.
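A minimal check of equation (24) against PyTorch's built-in SiLU:

```python
import torch

x = torch.linspace(-6, 6, 7)
silu_manual = x * torch.sigmoid(x)   # SiLU(x) = x * sigma(x)

print(silu_manual)
print(torch.nn.functional.silu(x))                    # identical to the manual version
print(torch.nn.functional.silu(torch.tensor(-1.0)))   # ~ -0.27: dampened, not zeroed
```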

Initialization matters for ReLU (and friends)

With ReLU and its variants, initialization is not a cosmetic detail: it directly determines whether neurons ever learn. Because ReLU outputs zero for negative inputs, a poor initialization can push most pre-activations into the negative region, immediately creating dead neurons with zero gradients. At the same time, overly large initial weights can cause activations to explode as depth increases, destabilizing training. Proper initialization aims to keep pre-activations roughly centered and with controlled variance as they propagate forward and backward through the network.

For ReLU-style activations, this led to He initialization (He et al., 2015), which draws weights with variance 2/fan_in (i.e. a standard deviation of sqrt(2/fan_in)). The extra factor of 2 compensates for the fact that roughly half of ReLU outputs are zero, preserving signal variance across layers. Variants like Leaky ReLU, PReLU, ELU, and SiLU follow the same principle, sometimes with a small adjustment to account for their non-zero negative slope. In practice, using He (or He-like) initialization is enough to avoid dying neurons at startup and to ensure gradients flow reliably from the very first update.
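In PyTorch this is available as kaiming_normal_; a short sketch with arbitrary layer sizes:

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)

# He (Kaiming) initialization: weights ~ N(0, 2 / fan_in), i.e. std = sqrt(2 / fan_in).
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

print(layer.weight.std())  # close to sqrt(2 / 512) ~= 0.0625

# For Leaky ReLU / PReLU, pass the negative slope so the gain is adjusted:
# nn.init.kaiming_normal_(layer.weight, a=0.01, nonlinearity='leaky_relu')
```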