Loss Functions for Binary Classification

Before diving into the math, let’s understand why we need to talk about loss functions:

- Measuring Success: The loss function tells our model how well (or poorly) it’s performing.
- Guiding Improvement: It provides the direction for optimization, like a compass pointing toward better predictions.
- Different Problems, Different Tools: Choosing the right loss function can dramatically improve how well and how quickly our model learns.

Binary Cross-Entropy (also known as log loss) is a loss function commonly used for binary classification tasks. It measures the difference between the true labels and the predicted probabilities (usually produced by a sigmoid activation). The binary cross-entropy loss for a single example is given by:

\[ \ell(y, \hat{y}) = -\left[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\right] \]

where:

- $y$ is the true label (0 or 1),
- $\hat{y}$ is the predicted probability that the output is 1,
- $\log$ is the natural logarithm.

For a dataset of $N$ examples, the average loss is:

\[ L_{\text{binary}} = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)\right] \]

import numpy as np

# Define binary cross-entropy loss function
def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-12  # to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss

# Example predictions and actual labels
y_true = np.array([0, 1, 1, 0])
y_pred_good = np.array([0.1, 0.9, 0.8, 0.2])  # good predictions
y_pred_bad = np.array([0.9, 0.1, 0.3, 0.8])   # bad predictions

# Calculate losses
loss_good = binary_cross_entropy(y_true, y_pred_good)
loss_bad = binary_cross_entropy(y_true, y_pred_bad)

print("Good Predictions Loss:", loss_good)
print("Average:", np.mean(loss_good))

print("\nBad Predictions Loss:", loss_bad)
print("Average:", np.mean(loss_bad))

Good Predictions Loss: [0.10536052 0.10536052 0.22314355 0.22314355]
Average: 0.164252033486018

Bad Predictions Loss: [2.30258509 2.30258509 1.2039728  1.60943791]
Average: 1.854645225687032

Why Use Binary Cross-Entropy?

There are several reasons why binary cross-entropy is a popular choice for binary classification tasks:

It Speaks the Language of Probability
- Perfect match for sigmoid outputs (0-1 range)
- Directly measures the “surprise” of seeing the true label given a prediction
It Punishes Overconfident Mistakes
- Being wrong with 99% confidence hurts much more than being wrong with 51% confidence
- Creates stronger learning signals when the model makes confident errors
- Example: Predicting 0.01 when the true label is 1 creates a massive gradient
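It Plays Well with Gradient Descent
- Smooth and differentiable for every prediction strictly between 0 and 1
- Paired with a sigmoid output, the gradient with respect to the pre-activation (logit) is simply $\hat{y} - y$, so there are no flat regions where a wrong prediction stops producing a learning signal
- Minimizing it is equivalent to maximum likelihood estimation under a Bernoulli model of the labels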
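
To make the point about overconfident mistakes concrete, the following is a minimal sketch of the gradient of the single-example loss. Differentiating $\ell(y, \hat{y})$ with respect to $\hat{y}$ gives $-y/\hat{y} + (1-y)/(1-\hat{y})$, and if $\hat{y} = \sigma(z)$ is a sigmoid of a logit $z$, the chain rule reduces the gradient with respect to $z$ to $\hat{y} - y$. The helper names `bce_grad_wrt_pred` and `bce_grad_wrt_logit` are made up for this illustration and are not part of any library.

```python
import numpy as np

def bce_grad_wrt_pred(y_true, y_pred):
    """Derivative of the single-example BCE loss with respect to y_pred."""
    epsilon = 1e-12  # clip to avoid division by zero
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -y_true / y_pred + (1 - y_true) / (1 - y_pred)

def bce_grad_wrt_logit(y_true, y_pred):
    """Derivative with respect to the logit z, assuming y_pred = sigmoid(z)."""
    return y_pred - y_true

# True label is 1; predictions range from confidently wrong to confidently right
y_pred = np.array([0.01, 0.49, 0.99])
grad_pred = bce_grad_wrt_pred(1, y_pred)
grad_logit = bce_grad_wrt_logit(1, y_pred)

for p, gp, gz in zip(y_pred, grad_pred, grad_logit):
    print(f"y_pred={p:.2f}  dL/dy_pred={gp:9.4f}  dL/dz={gz:7.4f}")
```

The gradient with respect to the prediction grows without bound as a wrong prediction approaches 0, while the gradient with respect to the logit stays between -1 and 1 and is largest in magnitude for the most confident mistakes.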

Why Other Loss Functions Fall Short

Mean Squared Error (MSE) is given by:

\[ \ell_{\text{MSE}}(y, \hat{y}) = (y - \hat{y})^2 \]

MSE is more commonly used for regression tasks, where the model predicts a continuous value. While it’s possible to use MSE for binary classification, it has several drawbacks:

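- Designed for continuous targets, not binary outcomes
- With a sigmoid output, the gradient with respect to the logit carries an extra factor $\hat{y}(1-\hat{y})$, so it shrinks toward zero exactly when the prediction is confidently wrong
- The loss is bounded by 1 per example, so in practical terms MSE rewards being “in the ballpark” and penalizes a confident mistake only mildly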
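
The weakening gradient can be checked numerically. The sketch below assumes the prediction comes from a sigmoid, $\hat{y} = \sigma(z)$, and compares the gradient of each loss with respect to the logit $z$ for a true label of 1: $\hat{y} - y$ for binary cross-entropy and $2(\hat{y} - y)\,\hat{y}(1 - \hat{y})$ for MSE. The `sigmoid` helper and the chosen logit values are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, mapping a logit to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Logits from very wrong (z = -6) to fairly right (z = 2) for a true label of 1
z = np.array([-6.0, -2.0, 0.0, 2.0])
y_true = 1
y_hat = sigmoid(z)

# Gradients of each loss with respect to the logit z
grad_bce = y_hat - y_true
grad_mse = 2 * (y_hat - y_true) * y_hat * (1 - y_hat)

for zi, p, gb, gm in zip(z, y_hat, grad_bce, grad_mse):
    print(f"z={zi:5.1f}  y_hat={p:.4f}  BCE grad={gb:8.4f}  MSE grad={gm:8.4f}")
```

At the most wrong prediction the cross-entropy gradient is close to -1, while the MSE gradient is close to 0, so an MSE-trained sigmoid unit gets the least correction where it needs the most.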

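Hinge Loss is another loss function used for binary classification, especially in the context of Support Vector Machines (SVMs). It is defined as:

\[ \ell_{\text{hinge}}(y, s) = \max(0, 1 - y \cdot s) \]

Note the change of convention: here the label is encoded as $y \in \{-1, +1\}$ rather than 0 or 1, and $s$ is the raw, unbounded classifier score rather than a probability.

Hinge loss is designed to work well with classifiers that aim to maximize the margin between classes. However, it has some limitations:

- Not as intuitive when the model outputs probabilities: the loss is defined on raw scores, and every example classified correctly with a margin of at least 1 gets zero loss, so it does not separate a barely adequate prediction from a very confident one
- The kink at $y \cdot s = 1$ makes the loss non-differentiable there, which can make optimization more challenging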
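
As a small illustration of hinge loss, the sketch below evaluates it on a few hand-picked scores. It assumes the usual SVM convention of labels in $\{-1, +1\}$ and raw (unbounded) scores rather than probabilities; the `hinge_loss` helper and the example scores are hypothetical and chosen only to show the behaviour.

```python
import numpy as np

def hinge_loss(y_signed, score):
    """Per-example hinge loss for labels in {-1, +1} and raw scores."""
    return np.maximum(0.0, 1.0 - y_signed * score)

# Convert 0/1 labels to the -1/+1 encoding that hinge loss expects
y01 = np.array([0, 1, 1, 0])
y_signed = 2 * y01 - 1

# Hypothetical raw classifier scores (not probabilities)
scores = np.array([-2.0, 0.5, 3.0, 0.2])
losses = hinge_loss(y_signed, scores)

for yi, s, l in zip(y_signed, scores, losses):
    print(f"y={yi:+d}  score={s:5.2f}  hinge loss={l:.2f}")
```

Examples on the correct side with a margin of at least 1 contribute zero loss no matter how far past the margin they are, which is why hinge loss carries no notion of graded confidence.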

import numpy as np
import matplotlib.pyplot as plt

# Define binary cross-entropy loss
def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-12
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Define mean squared error loss
def mse_loss(y_true, y_pred):
    return (y_true - y_pred)**2


# Generate predicted probabilities from 0 to 1
y_pred = np.linspace(0.001, 0.999, 500)

# True label (1)
y_true_1 = 1
bce_loss_1 = binary_cross_entropy(y_true_1, y_pred)
mse_loss_1 = mse_loss(y_true_1, y_pred)

# True label (0)
y_true_0 = 0
bce_loss_0 = binary_cross_entropy(y_true_0, y_pred)
mse_loss_0 = mse_loss(y_true_0, y_pred)

# Plot losses for y_true = 1
plt.figure(figsize=(10,5))
plt.plot(y_pred, bce_loss_1, label='Binary Cross-Entropy (y=1)')
plt.plot(y_pred, mse_loss_1, label='Mean Squared Error (y=1)', linestyle='--')
plt.xlabel("Predicted Probability")
plt.ylabel("Loss")
plt.title("Loss Comparison for True Label = 1")
plt.legend()
plt.grid(True)
plt.show()

# Plot losses for y_true = 0
plt.figure(figsize=(10, 5))
plt.plot(y_pred, bce_loss_0, label='Binary Cross-Entropy (y=0)')
plt.plot(y_pred, mse_loss_0, label='Mean Squared Error (y=0)', linestyle='--')
plt.xlabel("Predicted Probability")
plt.ylabel("Loss")
plt.title("Loss Comparison for True Label = 0")
plt.legend()
plt.grid(True)
plt.show()

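[Figure: Loss Comparison for True Label = 1. Binary cross-entropy and mean squared error plotted against the predicted probability.]

[Figure: Loss Comparison for True Label = 0. Binary cross-entropy and mean squared error plotted against the predicted probability.]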

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import SGD

# Define the XOR dataset
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
# XOR outputs: 0 if inputs are the same, 1 if they are different
y = np.array([0, 1, 1, 0])

# Create a simple neural network with one hidden layer
model = Sequential([
    Input(shape=(2,)),          
    Dense(3, activation='tanh'),  # Hidden layer with 3 neurons using tanh activation
    Dense(1, activation='sigmoid')  # Output layer with sigmoid activation
])

# Compile the model with learning rate of 0.1
optimizer = SGD(learning_rate=0.1)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Train the model and save the history to track loss over epochs
history = model.fit(X, y, epochs=5000, verbose=0)

# Evaluate the model on the XOR dataset
loss, accuracy = model.evaluate(X, y, verbose=0)
predictions = model.predict(X)
rounded_predictions = np.round(predictions)

print(f"Final loss: {loss:.4f}")
print(f"Final accuracy: {accuracy:.4f}")
print("\nRaw predictions on the XOR dataset:")
for i, pred in enumerate(predictions):
    print(f"Input: {X[i]} → Prediction: {pred[0]:.4f} → Rounded: {rounded_predictions[i][0]}")

# Visualize the loss over epochs
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'])
plt.title('Model Loss During Training')
plt.ylabel('Binary Crossentropy Loss')
plt.xlabel('Epoch')
plt.grid(True)
sns.despine()
plt.show()

# Optional: Create a decision boundary visualization
plt.figure(figsize=(10, 6))

# Create a grid of points
h = 0.01
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Get predictions for all grid points
Z = model.predict(grid_points)
Z = Z.reshape(xx.shape)

# Plot the decision boundary
plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdBu)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.title('XOR Decision Boundary')
plt.colorbar()
plt.grid(True)

plt.show()

Brief Discussion of Optimizers

Keras provides several built-in optimizers that adjust the model parameters during training. In addition to classic Stochastic Gradient Descent (SGD), it includes adaptive optimizers such as Adam and RMSprop, which scale each parameter’s step size using running statistics of past gradients.

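# Compare SGD, Adam, and RMSprop on the XOR data (X, y) defined above
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import SGD, Adam, RMSprop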

# Train a fresh model with the named optimizer and return its loss history
def train_with_optimizer(optimizer_name):
    # Build a new, untrained model so each optimizer starts from scratch
    model = Sequential([
        Input(shape=(2,)),
        Dense(3, activation='tanh'),
        Dense(1, activation='sigmoid')
    ])

    # Select optimizer (same learning rate for all three)
    if optimizer_name == 'SGD':
        opt = SGD(learning_rate=0.01)
    elif optimizer_name == 'Adam':
        opt = Adam(learning_rate=0.01)
    elif optimizer_name == 'RMSprop':
        opt = RMSprop(learning_rate=0.01)
    else:
        raise ValueError(f"Unknown optimizer: {optimizer_name}")
    
    model.compile(loss='binary_crossentropy', optimizer=opt)
    
    # Train and record history
    history = model.fit(X, y, epochs=1000, verbose=0)
    return history.history['loss']

# Train with each of the three optimizers
sgd_loss = train_with_optimizer('SGD')
adam_loss = train_with_optimizer('Adam')
rmsprop_loss = train_with_optimizer('RMSprop')

# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(sgd_loss, label='SGD')
plt.plot(adam_loss, label='Adam')
plt.plot(rmsprop_loss, label='RMSprop')
plt.title('SGD vs. Adam vs. RMSprop: Loss During Training')
plt.ylabel('Binary Crossentropy Loss')
plt.xlabel('Epoch')
plt.legend()
sns.despine()
plt.grid(True)
plt.show()

[Figure: training loss (binary cross-entropy) per epoch for SGD, Adam, and RMSprop.]
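
The optimizers compared above differ in how they turn a gradient into a parameter update. As a rough illustration, the sketch below applies plain gradient descent and a textbook form of the Adam update to a toy one-dimensional objective $f(\theta) = (\theta - 3)^2$. The function names, the toy objective, and the hyperparameter values are assumptions made for this example; the Keras implementations differ in details such as default values and mini-batching.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def run_sgd(theta, lr=0.1, steps=300):
    """Plain gradient descent: step against the gradient, scaled by lr."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def run_adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Textbook Adam: running means of the gradient and its square."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_sgd = run_sgd(0.0)
theta_adam = run_adam(0.0)
print(f"SGD  final theta: {theta_sgd:.4f}")
print(f"Adam final theta: {theta_adam:.4f}")
```

Both runs should end near the minimizer $\theta = 3$. The difference is that Adam divides each step by a running estimate of the gradient magnitude, so its early steps have roughly the size of the learning rate regardless of how large the raw gradient is.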
