Open In Colab

Loss Functions for Binary Classification#

Before diving into the math, let’s understand why we need to talk about loss functions:

  • Measuring Success: The loss function tells our model how well (or poorly) it’s performing.

  • Guiding Improvement: It provides the direction for optimization—like a compass pointing toward better predictions.

  • Different Problems, Different Tools: Choosing the right loss function can dramatically improve how well and how quickly our model learns.

Binary Cross-Entropy (also known as log loss) is a loss function commonly used for binary classification tasks. It measures the difference between the true labels and the predicted probabilities (usually produced by a sigmoid activation). The binary cross-entropy loss for a single example is given by:

\[ \ell(y, \hat{y}) = -\left[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\right] \]

where:

  • \(y\) is the true label (0 or 1),

  • \(\hat{y}\) is the predicted probability that the output is 1,

  • \(\log\) is the natural logarithm.

For a dataset of \(N\) examples, the average loss is:

\[ L_{\text{binary}} = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)\right] \]
import numpy as np

# Define binary cross-entropy loss function
def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-12  # to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss

# Example predictions and actual labels
y_true = np.array([0, 1, 1, 0])
y_pred_good = np.array([0.1, 0.9, 0.8, 0.2])  # good predictions
y_pred_bad = np.array([0.9, 0.1, 0.3, 0.8])   # bad predictions

# Calculate losses
loss_good = binary_cross_entropy(y_true, y_pred_good)
loss_bad = binary_cross_entropy(y_true, y_pred_bad)

print("Good Predictions Loss:", loss_good)
print("Average:", np.mean(loss_good))

print("\nBad Predictions Loss:", loss_bad)
print("Average:", np.mean(loss_bad))
Good Predictions Loss: [0.10536052 0.10536052 0.22314355 0.22314355]
Average: 0.164252033486018

Bad Predictions Loss: [2.30258509 2.30258509 1.2039728  1.60943791]
Average: 1.854645225687032

Why Use Binary Cross-Entropy?#

There are several reasons why binary cross-entropy is a popular choice for binary classification tasks:

  • It Speaks the Language of Probability

    • Perfect match for sigmoid outputs (0-1 range)

    • Directly measures the “surprise” of seeing the true label given a prediction

  • It Punishes Overconfident Mistakes

    • Being wrong with 99% confidence hurts much more than being wrong with 51% confidence

    • Creates stronger learning signals when the model makes confident errors

    • Example: Predicting 0.01 when the true label is 1 creates a massive gradient

  • It Plays Well with Gradient Descent

    • Smooth surface with clear gradients throughout the prediction range

    • No flat spots or sudden drops that could trap or confuse optimization

    • Mathematically elegant connection to maximum likelihood estimation

Why Other Loss Functions Fall Short#

Mean Squared Error (MSE) is given by: $\( \ell_{\text{MSE}}(y, \hat{y}) = (y - \hat{y})^2 \)$

MSE is more commonly used for regression tasks, where the model predicts a continuous value. While it’s possible to use MSE for binary classification, it has several drawbacks:

  • Designed for continuous values, not binary outcomes

  • Gradients become weaker for very wrong predictions

  • In practical terms: MSE cares more about being “in the ballpark” than being exactly right

Hinge Loss is another loss function used for binary classification, especially in the context of Support Vector Machines (SVMs). It is defined as: $\( \ell_{\text{hinge}}(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y}) \)$

Hinge loss is designed to work well with classifiers that aim to maximize the margin between classes. However, it has some limitations:

  • Not as intuitive when the model outputs probabilities: Doesn’t distinguish between probabilities of 0.51 and 0.99

  • Non-smooth points in the loss function can make optimization more challenging

import numpy as np
import matplotlib.pyplot as plt

# Define binary cross-entropy loss
def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-12
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Define mean squared error loss
def mse_loss(y_true, y_pred):
    return (y_true - y_pred)**2


# Generate predicted probabilities from 0 to 1
y_pred = np.linspace(0.001, 0.999, 500)

# True label (1)
y_true_1 = 1
bce_loss_1 = binary_cross_entropy(y_true_1, y_pred)
mse_loss_1 = mse_loss(y_true_1, y_pred)

# True label (0)
y_true_0 = 0
bce_loss_0 = binary_cross_entropy(y_true_0, y_pred)
mse_loss_0 = mse_loss(y_true_0, y_pred)

# Plot losses for y_true = 1
plt.figure(figsize=(10,5))
plt.plot(y_pred, bce_loss_1, label='Binary Cross-Entropy (y=1)')
plt.plot(y_pred, mse_loss_1, label='Mean Squared Error (y=1)', linestyle='--')
plt.xlabel("Predicted Probability")
plt.ylabel("Loss")
plt.title("Loss Comparison for True Label = 1")
plt.legend()
plt.grid(True)
plt.show()

# Plot losses for y_true = 0
plt.figure(figsize=(10, 5))
plt.plot(y_pred, bce_loss_0, label='Binary Cross-Entropy (y=0)')
plt.plot(y_pred, mse_loss_0, label='Mean Squared Error (y=0)', linestyle='--')
plt.xlabel("Predicted Probability")
plt.ylabel("Loss")
plt.title("Loss Comparison for True Label = 0")
plt.legend()
plt.grid(True)
plt.show()
../_images/2f28208282184b267eb78297219b8345df9cdc8847e58c644f1a2ceb5a8cfec3.png ../_images/ee1892d0b3260d9ee5c43c8b91c4ad004f82332c71fe39c44c07230f674f491d.png
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import SGD

# Define the XOR dataset
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
# XOR outputs: 0 if inputs are the same, 1 if they are different
y = np.array([0, 1, 1, 0])

# Create a simple neural network with one hidden layer
model = Sequential([
    Input(shape=(2,)),          
    Dense(3, activation='tanh'),  # Hidden layer with 3 neurons using tanh activation
    Dense(1, activation='sigmoid')  # Output layer with sigmoid activation
])

# Compile the model with learning rate of 0.1
optimizer = SGD(learning_rate=0.1)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Train the model and save the history to track loss over epochs
history = model.fit(X, y, epochs=5000, verbose=0)

# Evaluate the model on the XOR dataset
loss, accuracy = model.evaluate(X, y, verbose=0)
predictions = model.predict(X)
rounded_predictions = np.round(predictions)

print(f"Final loss: {loss:.4f}")
print(f"Final accuracy: {accuracy:.4f}")
print("\nRaw predictions on the XOR dataset:")
for i, pred in enumerate(predictions):
    print(f"Input: {X[i]} → Prediction: {pred[0]:.4f} → Rounded: {rounded_predictions[i][0]}")

# Visualize the loss over epochs
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'])
plt.title('Model Loss During Training')
plt.ylabel('Binary Crossentropy Loss')
plt.xlabel('Epoch')
plt.grid(True)
sns.despine()
plt.show()

# Optional: Create a decision boundary visualization
plt.figure(figsize=(10, 6))

# Create a grid of points
h = 0.01
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Get predictions for all grid points
Z = model.predict(grid_points)
Z = Z.reshape(xx.shape)

# Plot the decision boundary
plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdBu)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.title('XOR Decision Boundary')
plt.colorbar()
plt.grid(True)

plt.show()
2025-05-08 15:01:47.636799: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-08 15:01:47.640023: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-08 15:01:47.653384: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746716507.667452   29243 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746716507.671542   29243 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746716507.683046   29243 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746716507.683057   29243 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746716507.683058   29243 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746716507.683060   29243 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-08 15:01:47.687150: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-08 15:01:49.294496: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[3], line 30
     27 model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
     29 # Train the model and save the history to track loss over epochs
---> 30 history = model.fit(X, y, epochs=5000, verbose=0)
     32 # Evaluate the model on the XOR dataset
     33 loss, accuracy = model.evaluate(X, y, verbose=0)

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py:117, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    115 filtered_tb = None
    116 try:
--> 117     return fn(*args, **kwargs)
    118 except Exception as e:
    119     filtered_tb = _process_traceback_frames(e.__traceback__)

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/keras/src/backend/tensorflow/trainer.py:371, in TensorFlowTrainer.fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq)
    369 for step, iterator in epoch_iterator:
    370     callbacks.on_train_batch_begin(step)
--> 371     logs = self.train_function(iterator)
    372     callbacks.on_train_batch_end(step, logs)
    373     if self.stop_training:

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/keras/src/backend/tensorflow/trainer.py:219, in TensorFlowTrainer._make_function.<locals>.function(iterator)
    215 def function(iterator):
    216     if isinstance(
    217         iterator, (tf.data.Iterator, tf.distribute.DistributedIterator)
    218     ):
--> 219         opt_outputs = multi_step_on_iterator(iterator)
    220         if not opt_outputs.has_value():
    221             raise StopIteration

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py:150, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    148 filtered_tb = None
    149 try:
--> 150   return fn(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py:833, in Function.__call__(self, *args, **kwds)
    830 compiler = "xla" if self._jit_compile else "nonXla"
    832 with OptionalXlaContext(self._jit_compile):
--> 833   result = self._call(*args, **kwds)
    835 new_tracing_count = self.experimental_get_tracing_count()
    836 without_tracing = (tracing_count == new_tracing_count)

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py:878, in Function._call(self, *args, **kwds)
    875 self._lock.release()
    876 # In this case we have not created variables on the first call. So we can
    877 # run the first trace but we should fail if variables are created.
--> 878 results = tracing_compilation.call_function(
    879     args, kwds, self._variable_creation_config
    880 )
    881 if self._created_variables:
    882   raise ValueError("Creating variables on a non-first call to a function"
    883                    " decorated with tf.function.")

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py:139, in call_function(args, kwargs, tracing_options)
    137 bound_args = function.function_type.bind(*args, **kwargs)
    138 flat_inputs = function.function_type.unpack_inputs(bound_args)
--> 139 return function._call_flat(  # pylint: disable=protected-access
    140     flat_inputs, captured_inputs=function.captured_inputs
    141 )

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/concrete_function.py:1322, in ConcreteFunction._call_flat(self, tensor_inputs, captured_inputs)
   1318 possible_gradient_type = gradients_util.PossibleTapeGradientTypes(args)
   1319 if (possible_gradient_type == gradients_util.POSSIBLE_GRADIENT_TYPES_NONE
   1320     and executing_eagerly):
   1321   # No tape is watching; skip to running the function.
-> 1322   return self._inference_function.call_preflattened(args)
   1323 forward_backward = self._select_forward_and_backward_functions(
   1324     args,
   1325     possible_gradient_type,
   1326     executing_eagerly)
   1327 forward_function, args_with_tangents = forward_backward.forward()

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/atomic_function.py:216, in AtomicFunction.call_preflattened(self, args)
    214 def call_preflattened(self, args: Sequence[core.Tensor]) -> Any:
    215   """Calls with flattened tensor inputs and returns the structured output."""
--> 216   flat_outputs = self.call_flat(*args)
    217   return self.function_type.pack_output(flat_outputs)

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/atomic_function.py:251, in AtomicFunction.call_flat(self, *args)
    249 with record.stop_recording():
    250   if self._bound_context.executing_eagerly():
--> 251     outputs = self._bound_context.call_function(
    252         self.name,
    253         list(args),
    254         len(self.function_type.flat_outputs),
    255     )
    256   else:
    257     outputs = make_call_op_in_graph(
    258         self,
    259         list(args),
    260         self._bound_context.function_call_options.as_attrs(),
    261     )

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/context.py:1688, in Context.call_function(self, name, tensor_inputs, num_outputs)
   1686 cancellation_context = cancellation.context()
   1687 if cancellation_context is None:
-> 1688   outputs = execute.execute(
   1689       name.decode("utf-8"),
   1690       num_outputs=num_outputs,
   1691       inputs=tensor_inputs,
   1692       attrs=attrs,
   1693       ctx=self,
   1694   )
   1695 else:
   1696   outputs = execute.execute_with_cancellation(
   1697       name.decode("utf-8"),
   1698       num_outputs=num_outputs,
   (...)   1702       cancellation_manager=cancellation_context,
   1703   )

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51 try:
     52   ctx.ensure_initialized()
---> 53   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                       inputs, attrs, num_outputs)
     55 except core._NotOkStatusException as e:
     56   if name is not None:

KeyboardInterrupt: 

Brief Discussion of Optimizers#

Keras provides several built-in optimizers that adjust the model parameters during training. In addition to the classic Stochastic Gradient Descent (SGD), there are more advanced optimizers like adam or rmsprop.

# Compare SGD with Adam optimizer
from tensorflow.keras.optimizers import SGD, Adam, RMSprop
import matplotlib.pyplot as plt
import seaborn as sns

# Function to train model with different optimizers
def train_with_optimizer(optimizer_name):
    # Reset the model
    model = Sequential([
        Input(shape=(2,)),          
        Dense(3, activation='tanh'),
        Dense(1, activation='sigmoid')
    ])
    
    # Select optimizer
    if optimizer_name == 'SGD':
        opt = SGD(learning_rate=0.01)
    elif optimizer_name == "Adam":   
        opt = Adam(learning_rate=0.01)
    elif optimizer_name == "RMSprop":
        opt = RMSprop(learning_rate=0.01)
    
    model.compile(loss='binary_crossentropy', optimizer=opt)
    
    # Train and record history
    history = model.fit(X, y, epochs=1000, verbose=0)
    return history.history['loss']

# Train with both optimizers
sgd_loss = train_with_optimizer('SGD')
adam_loss = train_with_optimizer('Adam')
rmsprop_loss = train_with_optimizer('RMSprop')

# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(sgd_loss, label='SGD')
plt.plot(adam_loss, label='Adam')
plt.plot(rmsprop_loss, label='rmsprop')
plt.title('SGD vs Adam vs. rmsprop: Loss During Training')
plt.ylabel('Binary Crossentropy Loss')
plt.xlabel('Epoch')
plt.legend()
sns.despine()
plt.grid(True)
plt.show()
../_images/9fdb39597b1a873268bd9fd35a3751fbf61b3e7f7610d77e559a1df61faa50bc.png