Most contemporary machine learning models for classification do not minimize mean squared error. They minimize cross-entropy instead. Why is that the case? At first glance, squared error seems like the most natural choice: it is simple, symmetric, and widely used in regression. Yet whenever the task involves categorical decisions - spam vs. not spam, dog vs. cat vs. horse, or predicting the next token in a large language model - cross-entropy dominates.

How does this work, and why is this approach so popular?

Definition and Connection to Entropy and KL Divergence

Imagine a dataset of animal photos with classes $\{\text{dog}, \text{cat}, \text{horse}\}$. Each image has a label - say, “dog.” Mathematically, this label corresponds to a distribution: if the true class is “dog,” then $p = (1, 0, 0)$. If it’s “horse,” then $p = (0, 0, 1)$. These are degenerate distributions (or “one-hot” distributions), because all probability mass is placed on a single outcome.

Note

We don’t have access to the full underlying distribution of nature that produces animals and labels. What we do have is a dataset - a collection of samples drawn from that unknown distribution. In information theory terms, we assume there exists some true distribution p(x,y), and our training data are samples from it.

Now imagine that in our dataset 70% of the images are dogs, 20% are cats, and 10% are horses. The overall class distribution is then:

$$p_{\text{dataset}} = (0.7,\ 0.2,\ 0.1). \tag{1}$$

What does this distribution tell us? It tells us how uncertain we are about the label of a random image drawn from this dataset. That uncertainty is quantified by entropy:

$$H(p) = -\sum_{x} p(x)\,\log p(x). \tag{2}$$

High entropy means the dataset is balanced across classes, so the label of a random sample is uncertain (e.g. a 1/3 probability for each class). Low entropy means the dataset is skewed, so the label is much more predictable (e.g. 99% of the images are dogs).
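To make this concrete, here is a minimal NumPy sketch (the `entropy` helper is written here just for illustration) that evaluates (2) for a few class distributions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # skip zero-probability classes (0 * log 0 := 0)
    return -np.sum(p * np.log(p))

print(entropy([0.7, 0.2, 0.1]))           # skewed dataset   -> ~0.80 nats
print(entropy([1/3, 1/3, 1/3]))           # balanced dataset -> ~1.10 nats (max for 3 classes)
print(entropy([0.99, 0.005, 0.005]))      # almost all dogs  -> ~0.06 nats
```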

But of course, we don’t know the true distribution perfectly - we’re training a model that outputs its own guess $q$. For the dog image, maybe the model predicts $q = (0.6, 0.3, 0.1)$. To measure how well $q$ aligns with the true distribution $p$, we use cross-entropy:

$$H(p, q) = -\sum_{x} p(x)\,\log q(x). \tag{3}$$

Cross-entropy tells us the average number of “surprise points” we rack up when we trust the model’s probabilities. A correct prediction with high probability gives few points (low surprise). A correct prediction with low probability gives many points (high surprise). Wrong, overconfident predictions give the biggest surprise of all.

Notice how this reduces to $-\log q(\text{dog})$ when $p$ is one-hot.
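A small sketch of the same computation, checking the one-hot reduction numerically (the helper and the prediction vector are illustrative):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])             # one-hot label: "dog"
q = np.array([0.6, 0.3, 0.1])             # model's predicted probabilities

print(cross_entropy(p, q))                # ~0.51
print(-np.log(q[0]))                      # identical: only the true-class term survives
```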

The final piece of the puzzle is KL divergence (Kullback–Leibler divergence). It measures how different two probability distributions are:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log \frac{p(x)}{q(x)}. \tag{4}$$

Notice this can be rewritten in terms of entropy and cross-entropy:

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p). \tag{5}$$

This tells us something important: cross-entropy is just KL divergence plus the entropy of the true distribution. And since H(p) is fixed by the dataset, minimizing cross-entropy is exactly the same as minimizing KL divergence between the true distribution p and the model’s predictions q.
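The identity is easy to check numerically. A minimal sketch, taking $p$ to be the dataset distribution from (1) and $q$ an arbitrary model guess:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])             # "true" class distribution
q = np.array([0.6, 0.3, 0.1])             # model's guess

H_p  = -np.sum(p * np.log(p))             # entropy H(p)
H_pq = -np.sum(p * np.log(q))             # cross-entropy H(p, q)
D_kl =  np.sum(p * np.log(p / q))         # KL divergence D_KL(p || q)

print(np.isclose(D_kl, H_pq - H_p))       # True: H(p, q) = H(p) + D_KL(p || q)
```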

Why Cross-Entropy Fits Classification

Wait - why are we allowed to use cross-entropy in the first place? What assumptions about probabilities are we making?

In classification, the model’s outputs are often passed through a sigmoid (for binary) or softmax (for multiclass). These functions constrain the outputs to be non-negative and (in the multiclass case) sum to one. That makes them look like probabilities — though strictly speaking, they are not the true probabilities of nature. They are the model’s unnormalized scores transformed into a probability-like vector, which can still be miscalibrated.

The reason this setup works is that classification problems are inherently categorical: a binary label is naturally modeled as a Bernoulli random variable, and a multiclass label as a categorical one, with the model’s outputs serving as the parameters of that distribution.

Cross-entropy loss is then just the negative log-likelihood of these distributions under the model’s parameters.

Sigmoid and softmax don’t introduce this probabilistic assumption — they simply provide a convenient parameterization that ensures the outputs lie in the right domain (between 0 and 1, summing to one). This not only matches the Bernoulli/categorical family but also makes training with gradient descent stable, because the loss surface is smooth and the gradients behave well.

That’s why this has become the standard practice: it aligns naturally with the categorical nature of classification and makes optimization feasible.
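As a rough sketch of this pipeline (the logits, class order, and helper names are invented for illustration), raw scores pass through softmax to form a categorical distribution, and the loss is the negative log of the probability assigned to the observed label:

```python
import numpy as np

def softmax(z):
    """Map raw scores to a probability vector: non-negative and summing to one."""
    z = z - np.max(z)                     # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])       # unnormalized scores for (dog, cat, horse)
q = softmax(logits)                       # ~[0.79, 0.18, 0.04]
label = 0                                 # observed class: "dog"

nll = -np.log(q[label])                   # cross-entropy against a one-hot target
print(q, nll)                             # ~0.24
```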

Why Cross-Entropy Is Better Than Mean Squared Error

At this point you might wonder: why can’t we just use mean squared error (MSE)? After all, it’s the workhorse loss for regression, and it seems natural to compare predicted probabilities to one-hot targets using squared distance.

Note

Up to now, we’ve been talking about cross-entropy in general, with multiple classes like dogs, cats, and horses. To make the comparison with MSE as simple and clear as possible, let’s narrow down to the binary case and use binary cross-entropy (BCE).

Here the label $y$ can only be 0 or 1, and the model outputs a single number $\hat{y} \in (0, 1)$ after the sigmoid.

To formalize, let's compare two losses:

$$\begin{aligned} \text{MSE:}\quad & \mathrm{MSE}(y, \hat{y}) = (y - \hat{y})^2 \\ \text{BCE:}\quad & \mathrm{BCE}(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right] \end{aligned} \tag{6}$$

The problem is that MSE and binary cross-entropy (BCE) penalize mistakes in very different ways. Suppose the true label is $y = 0$, but the model predicts $\hat{y} = 0.9$. With MSE, the loss is $(0 - 0.9)^2 = 0.81$. With BCE, the loss is

$$-\left[\, 0 \cdot \log 0.9 + (1 - 0)\log(1 - 0.9) \,\right] = -\log 0.1 \approx 2.30. \tag{7}$$
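These two numbers can be reproduced with a couple of throwaway helpers (a sketch only - in practice frameworks provide numerically stable implementations):

```python
import numpy as np

def mse(y, y_hat):
    """Squared error between the label and the predicted probability."""
    return (y - y_hat) ** 2

def bce(y, y_hat):
    """Binary cross-entropy: -[y log y_hat + (1 - y) log(1 - y_hat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(mse(0, 0.9))                        # 0.81
print(bce(0, 0.9))                        # ~2.30
```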

The raw numbers aren’t what really matter. What matters is how the two losses grow as the model becomes more confidently wrong.

Here are the two losses side by side for the binary case ($y = 0$ and $y = 1$):

[mse_vs_bce.png]

The absolute loss values for MSE and BCE aren’t worlds apart (in fact, for many points they look fairly similar). What really matters for training is the gradient - the rate of change of the loss with respect to the model’s output.

In the binary setting, with $y \in \{0, 1\}$ and $\hat{y} \in (0, 1)$, the mean squared error is

$$\mathrm{MSE}(y, \hat{y}) = (y - \hat{y})^2, \tag{8}$$

and its gradient with respect to $\hat{y}$ is

$$\frac{\partial\, \mathrm{MSE}}{\partial \hat{y}} = 2\,(\hat{y} - y). \tag{9}$$

The binary cross-entropy is

$$\mathrm{BCE}(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right], \tag{10}$$

and its gradient with respect to $\hat{y}$ is

$$\frac{\partial\, \mathrm{BCE}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}. \tag{11}$$

[mse_vs_bce_grad.png]

At the example point $y = 0$ with $\hat{y} = 0.9$, the gradient of MSE is $2(\hat{y} - y) = 1.8$, while the gradient of BCE is $\frac{1 - y}{1 - \hat{y}} = \frac{1}{0.1} = 10$. The difference is striking: MSE gives only a gentle push back when the model is confidently wrong, whereas BCE produces a strong corrective signal. This is precisely why cross-entropy is better suited for classification: it penalizes overconfident mistakes much more aggressively, ensuring the model learns not just to get the labels right but also to calibrate its probabilities carefully.
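A quick sketch that reproduces this comparison from the gradient formulas (9) and (11); the extra values of $\hat{y}$ at the end simply show how the gap grows:

```python
def mse_grad(y, y_hat):
    """Derivative of (y - y_hat)^2 with respect to y_hat."""
    return 2 * (y_hat - y)

def bce_grad(y, y_hat):
    """Derivative of -[y log y_hat + (1 - y) log(1 - y_hat)] with respect to y_hat."""
    return -y / y_hat + (1 - y) / (1 - y_hat)

print(mse_grad(0.0, 0.9))                 # 1.8  -> gentle push back
print(bce_grad(0.0, 0.9))                 # 10.0 -> strong corrective signal

# The gap widens as the model becomes more confidently wrong:
for y_hat in (0.99, 0.999):
    print(y_hat, mse_grad(0.0, y_hat), bce_grad(0.0, y_hat))   # ~2 vs 100, ~2 vs 1000
```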

So far we’ve looked at the size of the gradients, but there’s another important difference between MSE and cross-entropy. MSE treats classification as a regression problem against a one-hot vector: every coordinate is compared. In the multiclass case, this means the model is penalized not only for assigning too little probability to the correct class, but also for assigning any nonzero probability to the wrong ones. Even if the true class already has the highest score, MSE will still push the wrong classes closer to zero, as if “not being exactly zero” were itself an error worth correcting.

Cross-entropy works differently. Because the label is one-hot, all terms vanish except the one corresponding to the true class. The loss depends only on the log probability the model assigns to the correct class. The wrong classes don’t appear individually; they matter only indirectly, because assigning them probability reduces what’s left for the true label.

This distinction is crucial. By spreading the penalty across all coordinates, MSE can encourage the model to hedge its bets and distribute probability more evenly, leading to poorly calibrated outputs. Cross-entropy, in contrast, matches the real goal of classification: maximize the probability of the correct class, and penalize the model severely if it doesn’t.
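A small illustration of this difference on a made-up three-class prediction where the true class already ranks first:

```python
import numpy as np

p = np.array([1.0, 0.0, 0.0])             # one-hot target: "dog"
q = np.array([0.6, 0.3, 0.1])             # prediction: the correct class already ranks first

mse_per_class = (p - q) ** 2              # every coordinate contributes to the penalty
ce = -np.sum(p * np.log(q))               # only the true-class term survives

print(mse_per_class)                      # [0.16 0.09 0.01] -> wrong classes are penalized too
print(mse_per_class.sum())                # 0.26
print(ce)                                 # ~0.51, depends only on q[0]
```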

Imbalanced Datasets or Noisy Labels

In imbalanced datasets, regression-style losses like L1 (MAE) and L2 (MSE) reveal their weakness because they do not treat classification as a probabilistic problem, but as a distance-minimization task between one-hot labels and predictions. In the binary case, the optimal constant predictor under MSE is the mean of the labels, while under L1 it is the median. If a dataset has 99% zeros and 1% ones, MSE pushes the model toward predicting $\hat{y} \approx 0.01$, and L1 is even more extreme, collapsing to always predicting $\hat{y} = 0$. Both losses effectively encourage the model to approximate the class distribution statistics rather than maximizing the probability of the correct class. This means that rare classes contribute very little to the loss: they are “washed out” by the majority, leading to poor recall on minority classes.
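The constant-predictor claim for MSE and L1 is easy to verify numerically; the 99/1 dataset and the grid of candidate constants below are arbitrary choices for illustration:

```python
import numpy as np

y = np.array([0.0] * 99 + [1.0])          # 99% zeros, 1% ones
candidates = np.linspace(0.0, 1.0, 1001)  # constant predictions to try

mse = [np.mean((y - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y - c)) for c in candidates]

print(candidates[np.argmin(mse)])         # 0.01 -> the label mean
print(candidates[np.argmin(mae)])         # 0.0  -> the label median
```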

Cross-entropy behaves differently. It treats each sample as equally important, penalizing the log probability of the correct class. A misclassified minority example incurs the same per-sample penalty as a majority one. This prevents the trivial collapse to majority predictions seen with L1 and L2. Although imbalance still affects CE - since the model sees more majority examples and may skew toward them - it remains fundamentally a per-sample likelihood objective rather than a regression to dataset averages. With additional techniques like class weighting or resampling, CE can handle imbalance robustly, while L1/L2 are fundamentally mismatched for categorical prediction.

A related failure mode appears in datasets with noisy labels. Because CE punishes confident mistakes harshly, mislabeled examples can dominate the gradients and mislead the model. In this case, the same property that makes CE effective on clean data - its intolerance of overconfident errors - turns into a liability. CE still generally outperforms L1 or L2 for classification, but its brittleness to noise explains why researchers have proposed many variants and regularization tricks to make it more forgiving in practice.

Why Cross-Entropy Is a Poor Evaluation Metric (and Why That’s Fine)

Despite its central role in training, cross-entropy is not a good evaluation metric. The reason is subtle but important: cross-entropy measures the quality of probabilistic predictions, not task success. A model can achieve lower cross-entropy simply by being better calibrated - assigning slightly higher probabilities to the correct class - without improving its actual decision-making behavior.

This leads to unintuitive outcomes. Two models with identical accuracy can have very different cross-entropy values, and a model with worse accuracy can even have better cross-entropy if it is less overconfident. From an application perspective, this is often undesirable: we usually care about what the model predicts, not how elegantly it distributes probability mass over alternatives.

To see this mathematically, consider two classifiers evaluated on the same dataset, both achieving 100% accuracy. Model A predicts the correct class with probability 0.51 for every example, while Model B predicts the correct class with probability 0.99. Accuracy assigns them the same score, since both predictions clear the 0.5 decision threshold. Cross-entropy does not:

$$\text{Model A: } -\log 0.51 \approx 0.673, \qquad \text{Model B: } -\log 0.99 \approx 0.010. \tag{12}$$

From a decision-making standpoint, the models are equivalent; from a probabilistic standpoint, they are not. Cross-entropy strongly prefers Model B, even though both models behave identically when converted into hard predictions.
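A short sketch of the comparison on a hypothetical evaluation set (the set size is arbitrary, and both models are assumed to place the stated probability on the correct class for every example):

```python
import numpy as np

n = 1000                                  # hypothetical evaluation set size
p_correct_A = np.full(n, 0.51)            # Model A: barely above the 0.5 threshold
p_correct_B = np.full(n, 0.99)            # Model B: confidently correct

print(np.mean(p_correct_A > 0.5),         # accuracy of A: 1.0
      np.mean(p_correct_B > 0.5))         # accuracy of B: 1.0
print(np.mean(-np.log(p_correct_A)),      # cross-entropy of A: ~0.673
      np.mean(-np.log(p_correct_B)))      # cross-entropy of B: ~0.010
```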

The mismatch goes the other way as well. A model can achieve lower cross-entropy but worse accuracy by spreading probability mass more cautiously. Because cross-entropy is minimized when predicted probabilities match the true data-generating distribution, it rewards calibration even when that calibration does not translate into better classification outcomes. Accuracy, precision, recall, F1, and AUC ignore probability magnitudes entirely and instead evaluate the decisions induced by those probabilities - exactly what most downstream tasks care about.

Cross-entropy tells us how well the model believes; accuracy tells us whether those beliefs were useful.