Example: Multiclass Classification Problem#
This example is taken from Section 4.2 of the book “Deep Learning with Python” by François Chollet.
The Reuters dataset is a set of short newswires and their topics, published by Reuters in 1986. It’s a simple, widely used toy dataset for text classification. There are 46 different topics; some topics are more represented than others, but each topic has at least 10 examples in the training set.
Like IMDB and MNIST, the Reuters dataset comes packaged as part of Keras.
Step 1: Load the data#
from tensorflow.keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data( num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz
2110848/2110848 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
train_data and test_data: lists of news stories, each story being a list of word indices (encoding a sequence of words).
train_labels and test_labels: The label associated with an example is an integer between 0 and 45—a topic index.
story_idx = 100
print(train_data[story_idx])
print(train_labels[story_idx])
[1, 367, 1394, 169, 65, 87, 209, 30, 306, 228, 10, 803, 305, 96, 5, 196, 15, 10, 523, 2, 3006, 293, 484, 2, 1440, 5825, 8, 145, 7, 10, 1670, 6, 10, 294, 517, 237, 2, 367, 8042, 7, 2477, 1177, 483, 1440, 5825, 8, 367, 1394, 4, 169, 387, 66, 209, 30, 2344, 652, 1496, 9, 209, 30, 2564, 228, 10, 803, 305, 96, 5, 196, 15, 51, 36, 1457, 24, 1345, 5, 4, 196, 150, 10, 523, 320, 64, 992, 6373, 13, 367, 190, 297, 64, 85, 1692, 6, 8656, 122, 9, 36, 1457, 24, 269, 4753, 27, 367, 212, 114, 45, 30, 3292, 7, 126, 2203, 13, 367, 6, 1818, 4, 169, 65, 96, 28, 432, 23, 189, 1254, 4, 9725, 320, 5, 196, 15, 10, 523, 25, 730, 190, 57, 64, 6, 9953, 2016, 6373, 7, 2, 122, 1440, 5825, 8, 269, 4753, 1217, 7, 608, 2203, 30, 3292, 1440, 5825, 8, 43, 339, 43, 231, 9, 667, 1820, 126, 212, 4197, 21, 1709, 249, 311, 13, 260, 489, 9, 65, 4753, 64, 1209, 4397, 249, 954, 36, 152, 1440, 5825, 506, 24, 135, 367, 311, 34, 420, 4, 8407, 200, 1519, 13, 137, 730, 190, 7, 104, 570, 52, 64, 2492, 7725, 4, 642, 5, 405, 7725, 2492, 24, 76, 847, 1435, 4446, 6, 10, 548, 320, 34, 325, 136, 694, 1440, 5825, 8, 10, 5184, 847, 7, 4, 169, 76, 2378, 10, 4933, 3447, 5, 141, 1082, 36, 152, 36, 8, 126, 358, 367, 65, 814, 190, 64, 2575, 10, 969, 3161, 92, 48, 6, 2245, 31, 367, 51, 570, 4753, 292, 27, 405, 212, 62, 3740, 922, 9, 2464, 27, 367, 77, 62, 4397, 7, 316, 5, 874, 36, 152, 4, 936, 1243, 5, 358, 367, 398, 57, 45, 3680, 7367, 6, 2394, 1343, 13, 373, 4504, 36, 8, 1440, 5825, 8, 42, 196, 150, 10, 523, 96, 34, 9725, 43, 16, 1261, 205, 7, 4, 65, 182, 1351, 367, 6, 351, 184, 45, 6081, 2286, 197, 1245, 13, 3187, 2, 274, 419, 714, 1351, 367, 269, 10, 96, 41, 129, 1104, 1673, 1419, 578, 36, 152, 2, 1440, 7615, 367, 1683, 484, 293, 75, 6557, 4, 8042, 152, 24, 5222, 34, 325, 834, 6, 1356, 2, 2406, 7, 4, 65, 76, 1082, 164, 1574, 212, 9, 861, 34, 8340, 13, 286, 1930, 1440, 7615, 8, 787, 36, 1830, 1082, 41, 3751, 616, 6, 382, 2, 2, 1574, 6928, 17, 12]
20
The reuters module has a function, get_word_index, that returns a dictionary mapping words to integer indices.
news_idx = 0
word_index = reuters.get_word_index()
reverse_word_index = {value: key for (key, value) in word_index.items()}
# Indices are offset by 3 because 0, 1, and 2 are reserved
# for "padding", "start of sequence", and "unknown"
decoded_newswire = " ".join([reverse_word_index.get(i - 3, "?") for i in train_data[news_idx]])
print(decoded_newswire)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters_word_index.json
550378/550378 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3
# the word indices range from 1 to 9999
print(min(min(seq) for seq in train_data))
print(max(max(seq) for seq in train_data))
print(reverse_word_index[1])
print(word_index["the"])
1
9999
the
1
Step 2: Preprocess the data#
In this step, we convert the lists of integers into tensors that our neural network can process.
We will use multi-hot encoding, a binary representation commonly used in NLP, to transform each list of word indices into a vector of 0s and 1s. Each resulting tensor is a 10,000-element vector where:
1 indicates the word appears in the newswire
0 indicates the word is absent
This representation gives the model a fixed-size input while preserving the essential information about which words appear in each newswire.
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1.
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
# check that the vectorized data is correct
for i in train_data[0]:
    if x_train[0, i] != 1.0:
        print(f"i={i} x_train[0, {i}]={x_train[0, i]}")
# test your understanding: why are these values not the same?
print(sum(x_train[0]))
print(len(train_data[0]))
57.0
87
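The gap between the two numbers above comes from repeated word indices: multi-hot encoding only records whether a word is present, so duplicates collapse into a single 1. A minimal sketch with a toy sequence (the indices here are made up) makes this concrete:

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10):
    # Multi-hot encoding: one row per sequence, a 1 at every index present
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1.
    return results

toy = [[2, 5, 2, 7, 5]]      # 5 tokens, but only 3 distinct indices
vec = vectorize_sequences(toy)
print(len(toy[0]))           # 5 tokens in the sequence
print(int(vec[0].sum()))     # 3 ones in the multi-hot vector
```

The same thing happens with the first training newswire: 87 tokens, but only 57 distinct word indices.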
print(train_labels)
[ 3 4 3 ... 25 3 25]
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results
y_train = to_one_hot(train_labels)
y_test = to_one_hot(test_labels)
print(y_train[100])
print(train_labels[100])
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
20
# Note that there is a built-in way to do this in Keras:
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)
print(y_train[100])
print(train_labels[100])
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
20
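As an aside, one-hot encoding integer labels is equivalent to indexing rows of an identity matrix, which is essentially what to_categorical does. A numpy-only sketch with toy labels (4 classes, made up for illustration):

```python
import numpy as np

labels = np.array([3, 0, 2])      # toy integer labels over 4 classes
one_hot = np.eye(4)[labels]       # row i of eye(4) is the one-hot vector for class i
print(one_hot)
print(one_hot.argmax(axis=1))     # argmax recovers the original integer labels
```

Each row contains a single 1 at the label's position, so argmax along each row inverts the encoding.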
Step 3: Build the model#
Softmax#
Given a vector of raw predictions (logits) \(z = (z_1, z_2, \dots, z_n)\), the softmax function \(\sigma(z)\) is defined as:

\[
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
\]

where:

\(\sigma(z)_i\) is the predicted probability for the \(i^{\text{th}}\) class.
The probabilities satisfy \(\sum_{i=1}^{n} \sigma(z)_i = 1\).
import numpy as np
def softmax(z):
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)
# Example usage:
logits = np.array([-20.0, -1.0, -1.1])
probabilities = softmax(logits)
print(probabilities)
print(np.sum(probabilities)) # the sum of the probabilities should be 1
[2.94135151e-09 5.24979186e-01 4.75020811e-01]
1.0
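One caveat with the naive implementation above: np.exp overflows for large logits. Softmax is invariant to adding a constant to every logit (the constant cancels in the ratio), so subtracting the maximum logit first gives the same probabilities while keeping the exponentials bounded. A sketch of the stabilized version:

```python
import numpy as np

def stable_softmax(z):
    # Shifting by max(z) leaves the probability ratios unchanged
    # but guarantees every exponent is <= 0, so exp() never overflows
    shifted = np.exp(z - np.max(z))
    return shifted / np.sum(shifted)

big_logits = np.array([1000.0, 1001.0, 1002.0])
p = stable_softmax(big_logits)
print(p)           # finite, well-defined probabilities
print(np.sum(p))   # sums to 1; np.exp(big_logits) alone would overflow to inf
```

Library implementations typically apply this shift internally, which is why you rarely see the problem when using Keras directly.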
import matplotlib.pyplot as plt
import seaborn as sns
# Varying the logit of class 1, fix logits for class 2 and class 3 at distinct values
logits_class1 = np.linspace(-5, 5, 200)
fixed_logit_class2 = 1.0
fixed_logit_class3 = -1.0
probabilities = np.array([
    softmax([z, fixed_logit_class2, fixed_logit_class3])
    for z in logits_class1
])
plt.figure(figsize=(8, 6))
plt.plot(logits_class1, probabilities[:, 0], label='Class 1 Probability (varies)')
plt.plot(logits_class1, probabilities[:, 1], label=f'Class 2 Probability (fixed logit = {fixed_logit_class2})')
plt.plot(logits_class1, probabilities[:, 2], label=f'Class 3 Probability (fixed logit = {fixed_logit_class3})')
plt.xlabel("Logit of Class 1")
plt.ylabel("Probability")
plt.title("Softmax Visualization")
plt.grid(True)
sns.despine()
plt.legend()
plt.show()
Categorical Crossentropy#
Categorical cross-entropy is a loss function commonly used to train neural networks for multi-class classification problems, where each instance belongs to exactly one category.
Intuition: It measures how far away your model’s predicted probabilities are from the true labels. A smaller loss means your model’s predictions are close to the true labels, while a larger loss means predictions are far away.
Given:
True labels: \(y = (y_1, y_2, \dots, y_n)\), represented as a one-hot vector, where exactly one element is \(1\) (the true class), and all others are \(0\).
Predicted probabilities: \(\hat{y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n)\), output by the softmax function.
The categorical cross-entropy loss is:

\[
L(y, \hat{y}) = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)
\]

Since only the true class \(c\), where \(y_c = 1\), contributes to the sum, this simplifies to:

\[
L(y, \hat{y}) = -\log(\hat{y}_c)
\]
import numpy as np
def categorical_crossentropy(y_true, y_pred):
    # Clip predictions away from 0 and 1 to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.sum(y_true * np.log(y_pred))
# Example usage: soft labels with a confident wrong prediction yield a large loss
y_true = np.array([0, 0.5, 0.5])
y_pred = np.array([0.0, 1.0, 0.0])
loss = categorical_crossentropy(y_true, y_pred)
print(loss)
17.269388197455342
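When y_true is genuinely one-hot, only one term of the sum survives, so the loss is just the negative log of the probability the model assigns to the true class. A quick standalone check of that simplification (the loss function is restated here so the snippet runs on its own):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    # Clip predictions away from 0 and 1 to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 1, 0])            # true class is index 1
y_pred = np.array([0.2, 0.7, 0.1])
loss = categorical_crossentropy(y_true, y_pred)
print(loss)           # equals -log(0.7), about 0.357
print(-np.log(0.7))   # same value, computed directly
```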
prob_correct = np.linspace(0.001, 1.0, 200)
loss_values = []
for p in prob_correct:
    y_true = np.array([1, 0, 0])  # True class is class 1
    y_pred = np.array([p, (1 - p)/2, (1 - p)/2])  # Remaining probability split evenly
    loss = categorical_crossentropy(y_true, y_pred)
    loss_values.append(loss)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(prob_correct, loss_values, label='Cross-Entropy Loss', linewidth=2)
plt.xlabel('Predicted Probability of Correct Class')
plt.ylabel('Categorical Cross-Entropy Loss')
plt.title('Categorical Cross-Entropy Loss vs. Predicted Probability')
plt.grid(False)
sns.despine()
plt.show()
Defining the Model#
from tensorflow import keras
from tensorflow.keras.layers import Dense
model = keras.Sequential([
Dense(64, activation="relu"),
Dense(64, activation="relu"),
Dense(46, activation="softmax")
])
model.compile(optimizer="rmsprop",
loss="categorical_crossentropy",
metrics=["accuracy"])
Step 4: Train the model#
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = y_train[:1000]
partial_y_train = y_train[1000:]
We will train the model for 20 epochs in mini-batches of 512 samples.
history = model.fit(partial_x_train,
partial_y_train,
epochs=20,
batch_size=512,
validation_data=(x_val, y_val),
verbose=0)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
history_dict = history.history
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, "bo", label="Training loss")
plt.plot(epochs, val_loss_values, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xticks(epochs)
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
sns.despine()
plt.grid(False)
plt.show()
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, acc, "bo", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc")
plt.title("Training and validation accuracy")
plt.xticks(epochs)
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
sns.despine()
plt.grid(False)
plt.show()
The validation curves above show the network starting to overfit after roughly nine epochs, so we retrain a fresh model from scratch for nine epochs and then evaluate it on the test set.
model = keras.Sequential([
Dense(64, activation="relu"),
Dense(64, activation="relu"),
Dense(46, activation="softmax")
])
model.compile(optimizer="rmsprop",
loss="categorical_crossentropy",
metrics=["accuracy"])
model.fit(x_train, y_train, epochs=9, batch_size=512, verbose=0)
results = model.evaluate(x_test, y_test)
print(f"The test loss is {results[0]}")
print(f"The test accuracy is {results[1]}")
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.7935 - loss: 0.8838
The test loss is 0.9304726123809814
The test accuracy is 0.7818343639373779
Prediction#
predictions = model.predict(x_test[0:5])
story_idx = 4
# Each entry in “predictions” is a vector of length 46:
print(predictions[story_idx].shape)
print(predictions[story_idx])
# The coefficients in this vector sum to 1, as they form a probability distribution:
print(np.sum(predictions[story_idx]))
# The largest entry is the predicted class—the class with the highest probability:
print(np.argmax(predictions[story_idx]))
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
(46,)
[2.6305049e-04 8.6284539e-04 1.5906059e-03 3.7015579e-04 2.3012476e-02
1.3353017e-03 3.5057899e-03 5.4795237e-04 9.0290519e-04 1.7509058e-04
9.4398682e-04 6.0153683e-04 1.2298219e-04 9.1666794e-01 2.7989005e-04
5.5420952e-04 3.3536114e-02 3.6310052e-04 7.2289717e-05 4.4279016e-04
2.2308012e-04 1.1465015e-04 2.3455465e-04 5.1173356e-05 1.6377677e-04
1.0269517e-03 1.9486474e-04 1.3158246e-03 6.8521878e-04 7.3416450e-04
7.2517130e-04 1.9059402e-03 3.3442717e-04 8.3261628e-05 1.9985573e-05
5.0538260e-05 2.2555320e-03 3.8356637e-04 6.3214893e-04 5.6452327e-04
1.7595130e-04 5.4877857e-04 1.7739371e-04 7.1902585e-04 3.0444539e-04
2.2411240e-04]
1.0000001
13
Some things to try:
Check the accuracy of a purely random classifier.
Try using different optimizers: adam or sgd.
Try using larger or smaller layers: 32 units, 128 units, and so on.
Try using a single intermediate layer, or three intermediate layers.
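For the first suggestion, a reference point helps: with 46 classes, uniform random guessing scores about 1/46 (roughly 2.2%), while a baseline that shuffles the true labels scores higher on an imbalanced dataset like Reuters, since frequent classes match themselves more often. A numpy-only sketch with synthetic labels (the class proportions below are made up; substitute the real test_labels to get the actual baseline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for test_labels: one dominant class plus 45 rare ones
labels = rng.choice(46, size=10000, p=np.r_[[0.3], np.full(45, 0.7 / 45)])

# Uniform random guessing: expected accuracy is 1/46 (about 2.2%)
acc_random = np.mean(rng.integers(0, 46, size=len(labels)) == labels)
print(acc_random)

# Shuffled-labels baseline: expected accuracy is the sum of squared class
# frequencies, which exceeds 1/46 whenever the classes are imbalanced
acc_shuffled = np.mean(rng.permutation(labels) == labels)
print(acc_shuffled)
```

Both baselines are far below the roughly 78% test accuracy obtained above, which is the point of the comparison.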
For example, intermediate layers with only 4 units create an information bottleneck: the network cannot squeeze enough information about 46 classes through such small layers, and test accuracy collapses.
model = keras.Sequential([
Dense(4, activation="relu"),
Dense(4, activation="relu"),
Dense(4, activation="relu"),
Dense(4, activation="relu"),
Dense(46, activation="softmax")
])
model.compile(optimizer="rmsprop",
loss="categorical_crossentropy",
metrics=["accuracy"])
model.fit(x_train, y_train, epochs=9, batch_size=512, verbose=0)
results = model.evaluate(x_test, y_test)
print(f"The test loss is {results[0]}")
print(f"The test accuracy is {results[1]}")
print("The predictions are:")
print(model.predict(x_test))
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2414 - loss: 2.3398
The test loss is 2.3458354473114014
The test accuracy is 0.24220837652683258
The predictions are:
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
[[2.8380266e-04 3.0579374e-03 1.0505877e-03 ... 1.4799959e-03
6.4222916e-04 5.0911894e-03]
[1.6558405e-02 3.5782984e-01 3.2753178e-03 ... 6.0861120e-03
3.5022804e-05 6.2859361e-04]
[2.8552387e-02 1.5224963e-01 1.3292176e-02 ... 1.7541435e-02
1.5101137e-03 6.1343727e-03]
...
[5.1457337e-03 1.7616093e-02 9.7963298e-03 ... 1.1484849e-02
7.6855668e-03 2.0211542e-02]
[1.1645334e-03 7.0772101e-03 3.3421242e-03 ... 4.4272994e-03
2.2780201e-03 1.1047248e-02]
[2.1775760e-02 1.3471539e-01 1.2595622e-02 ... 1.6765829e-02
1.6589700e-03 7.6065976e-03]]