
Example: Classifying Movie Reviews#

This example is taken from Section 4.1 of the book “Deep Learning with Python” by François Chollet.

The IMDB dataset contains a set of 50,000 highly polarized reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

We will train a model to classify movie reviews as positive or negative, based on the text content of the reviews.

Step 1: Load the data#

from tensorflow.keras.datasets import imdb 

(train_data, train_labels), (test_data, test_labels)  = imdb.load_data(num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz

train_data and test_data: lists of reviews, each review being a list of word indices (encoding a sequence of words).

train_labels and test_labels: lists of 0s and 1s, where 0 stands for “negative” and 1 stands for “positive”.

review_idx = 100
print(train_data[review_idx])
print(train_labels[review_idx])
[1, 13, 244, 6, 87, 337, 7, 628, 2219, 5, 28, 285, 15, 240, 93, 23, 288, 549, 18, 1455, 673, 4, 241, 534, 3635, 8448, 20, 38, 54, 13, 258, 46, 44, 14, 13, 1241, 7258, 12, 5, 5, 51, 9, 14, 45, 6, 762, 7, 2, 1309, 328, 5, 428, 2473, 15, 26, 1292, 5, 3939, 6728, 5, 1960, 279, 13, 92, 124, 803, 52, 21, 279, 14, 9, 43, 6, 762, 7, 595, 15, 16, 2, 23, 4, 1071, 467, 4, 403, 7, 628, 2219, 8, 97, 6, 171, 3596, 99, 387, 72, 97, 12, 788, 15, 13, 161, 459, 44, 4, 3939, 1101, 173, 21, 69, 8, 401, 2, 4, 481, 88, 61, 4731, 238, 28, 32, 11, 32, 14, 9, 6, 545, 1332, 766, 5, 203, 73, 28, 43, 77, 317, 11, 4, 2, 953, 270, 17, 6, 3616, 13, 545, 386, 25, 92, 1142, 129, 278, 23, 14, 241, 46, 7, 158]
0
word_index = imdb.get_word_index() 
print(list(word_index.items())[0:10])
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
[('fawn', 34701), ('tsukino', 52006), ('nunnery', 52007), ('sonja', 16816), ('vani', 63951), ('woods', 1408), ('spiders', 16115), ('hanging', 2345), ('woody', 2289), ('trawling', 52008)]

The imdb object has a method get_word_index that returns a dictionary mapping words to integer indices.

review_idx = 200
word_index = imdb.get_word_index() 
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]) 

# Note that the indices are offset by 3 because 0, 1, and 2 are 
# reserved indices for “padding,” “start of sequence,” and “unknown.”
decoded_review = " ".join([reverse_word_index.get(i-3, "?") 
                           for i in train_data[review_idx]])

print(f"Review index: {review_idx}")
print(f"Review data:\n{train_data[review_idx]}")
print(f"Review:\n{decoded_review}")
print(f"Label: {train_labels[review_idx]}")
Review index: 200
Review data:
[1, 14, 9, 6, 227, 196, 241, 634, 891, 234, 21, 12, 69, 6, 6, 176, 7, 4, 804, 4658, 2999, 667, 11, 12, 11, 85, 715, 6, 176, 7, 1565, 8, 1108, 10, 10, 12, 16, 1844, 2, 33, 211, 21, 69, 49, 2009, 905, 388, 99, 2, 125, 34, 6, 2, 1274, 33, 4, 130, 7, 4, 22, 15, 16, 6424, 8, 650, 1069, 14, 22, 9, 44, 4609, 153, 154, 4, 318, 302, 1051, 23, 14, 22, 122, 6, 2093, 292, 10, 10, 723, 8721, 5, 2, 9728, 71, 1344, 1576, 156, 11, 68, 251, 5, 36, 92, 4363, 133, 199, 743, 976, 354, 4, 64, 439, 9, 3059, 17, 32, 4, 2, 26, 256, 34, 2, 5, 49, 7, 98, 40, 2345, 9844, 43, 92, 168, 147, 474, 40, 8, 67, 6, 796, 97, 7, 14, 20, 19, 32, 2188, 156, 24, 18, 6090, 1007, 21, 8, 331, 97, 4, 65, 168, 5, 481, 53, 3084]
Review:
? this is a bit long 2 hours 20 minutes but it had a a lot of the famous pearl buck novel in it in other words a lot of ground to cover br br it was soap ? at times but had some visually dramatic moments too ? off by a ? attack at the end of the film that was astounding to view considering this film is about 70 years old the special effects crew on this film did a spectacular job br br paul muni and ? rainer were award winning actors in their day and they don't disappoint here both giving powerful performances the only problem is credibility as all the ? are played by ? and some of them like walter connolly just don't look real i'd like to see a re make of this movie with all asian actors not for pc reasons but to simply make the story look and sound more credible
Label: 1
# the word indices range from 1 to 9999
print(min(min(seq) for seq in train_data))
print(max(max(seq) for seq in train_data))

print(reverse_word_index[1])
print(word_index["the"])
print(max([len(seq) for seq in train_data]))
1
9999
the
1
2494

Step 2: Preprocess the data#

In this step, we will convert the lists of integers into tensors that our neural network can process.

We will implement multi-hot encoding - a binary representation commonly used in NLP - to transform our lists into vectors of 0s and 1s. Each resulting tensor will be a 10,000-element vector where:

  • 1 indicates the word appears in the review

  • 0 indicates the word is absent

This representation creates a standardized format that our model can efficiently process while preserving the essential information about word presence in each review.

x = ['a', 'b', 'c']

for idx, val in enumerate(x):
    print(idx, val)
0 a
1 b
2 c
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension)) 
    for i, sequence in enumerate(sequences):
        for j in sequence:  
            # Set the entry for each word index in this review to 1
            results[i, j] = 1.
    return results 

x_train = vectorize_sequences(train_data) 
x_test = vectorize_sequences(test_data)

y_train = np.asarray(train_labels).astype("float32") 
y_test = np.asarray(test_labels).astype("float32")
# check that the vectorized data is correct
for i in train_data[0]: 
    if x_train[0, i] != 1.0: 
        print(f"i={i} x_train[0, {i}]={x_train[0, i]}")

# test your understanding: why are these values not the same? 
print(sum(x_train[0]))
print(len(train_data[0]))
120.0
218
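The two numbers differ because the multi-hot vector only records whether a word is present: a word that appears several times in a review still contributes a single 1. A toy sketch with a made-up sequence (not real IMDB data) makes this concrete:

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1.
    return results

seq = [3, 5, 3, 7, 5]                      # 5 tokens, but only 3 distinct indices
vec = vectorize_sequences([seq], dimension=10)
print(int(vec[0].sum()), len(seq))         # 3 5: repeated words collapse to a single 1
```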

Step 3: Build the model#

from tensorflow import keras 
from tensorflow.keras.layers import Dense

model = keras.Sequential([
    Dense(16, activation="relu"),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid")
])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

Step 4: Train the model#

To train the model, we will use the rmsprop optimizer and the binary_crossentropy loss function. We will also monitor accuracy during training. In addition, we will create a validation set by setting apart 10,000 samples from the original training data.

Here is an outline of the training loop:

  1. Given the model, learning_rate, batch_size, epochs, train_data, train_labels, test_data, and test_labels

  2. Initialize the optimizer with learning_rate

  3. For each epoch from 1 to epochs

    1. Initialize an empty list of training losses

    2. For each batch of training data of size batch_size

      1. Extract the inputs and targets

      2. Do a forward pass through the model

      3. Calculate the loss

      4. Backward pass by updating the weights

      5. Append the loss to the list of training losses

    3. Initialize an empty list of validation losses

    4. For each batch of validation data

      1. Extract the inputs and targets

      2. Do a forward pass through the model

      3. Calculate the loss

      4. Append the loss to the list of validation losses

    5. Calculate and print the average training loss

    6. Calculate and print the average validation loss
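The outline above can be sketched end to end with plain NumPy. The "model" here is a deliberately tiny logistic regression on synthetic data - a hypothetical stand-in, not the Keras network used in this notebook - but the loop structure (batched forward pass, loss, weight update, then a no-update validation pass) is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary-classification data (a stand-in for IMDB)
x = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (x @ true_w > 0).astype(np.float64)
x_train, y_train = x[:160], y[:160]
x_val, y_val = x[160:], y[160:]

def forward(w, b, inputs):
    return 1.0 / (1.0 + np.exp(-(inputs @ w + b)))   # sigmoid output

def bce(preds, targets, eps=1e-7):
    preds = np.clip(preds, eps, 1 - eps)
    return -np.mean(targets * np.log(preds) + (1 - targets) * np.log(1 - preds))

w, b = np.zeros(5), 0.0
learning_rate, batch_size, epochs = 0.5, 32, 5

for epoch in range(1, epochs + 1):
    train_losses = []
    for start in range(0, len(x_train), batch_size):
        inputs = x_train[start:start + batch_size]       # extract inputs/targets
        targets = y_train[start:start + batch_size]
        preds = forward(w, b, inputs)                    # forward pass
        loss = bce(preds, targets)                       # calculate the loss
        grad = preds - targets                           # d(loss)/d(logits) for sigmoid + BCE
        w -= learning_rate * (inputs.T @ grad) / len(inputs)  # backward pass: update weights
        b -= learning_rate * grad.mean()
        train_losses.append(loss)
    val_losses = []
    for start in range(0, len(x_val), batch_size):
        inputs = x_val[start:start + batch_size]
        targets = y_val[start:start + batch_size]
        val_losses.append(bce(forward(w, b, inputs), targets))  # no weight update here
    print(f"epoch {epoch}: train={np.mean(train_losses):.4f} val={np.mean(val_losses):.4f}")
```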

Batch size#

The batch_size parameter is a crucial hyperparameter that determines how many examples are processed together in a single forward/backward pass:

  • Instead of updating model weights after each individual example (inefficient) or after the entire dataset (memory-intensive), we update after each batch

  • Each batch contains exactly batch_size examples (except possibly the last batch, which might be smaller)

  • The model weights are updated once per batch, not once per example

  • Smaller batch sizes mean more frequent weight updates but with noisier gradients

  • Larger batch sizes mean fewer weight updates per epoch but with more stable gradients
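As a concrete check, with the 15,000-sample training split and batch size of 512 used later in this notebook, the batch count per epoch works out as follows (matching the 30/30 shown in the training logs):

```python
import math

num_samples = 15000                      # 25,000 reviews minus 10,000 held out for validation
batch_size = 512
num_batches = math.ceil(num_samples / batch_size)
last_batch = num_samples - (num_batches - 1) * batch_size
print(num_batches, last_batch)           # 30 batches per epoch; the last one holds only 152 samples
```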

Validation Process#

For validation, the entire validation set is evaluated, but still processed in batches for memory efficiency:

  • The validation data is processed in batches of size batch_size, just like training data

  • All validation samples are evaluated and their metrics averaged together

  • Unlike training, no weight updates occur during validation

  • The final validation metric represents performance across the entire validation set

  • Validation typically happens once per epoch, not after every training batch

This approach enables evaluation on large validation sets that might not fit into memory all at once, while still getting a complete measure of model performance on the entire validation dataset.

x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

We will train the model for 20 epochs in mini-batches of 512 samples.

history = model.fit(partial_x_train,
                    partial_y_train, 
                    epochs=20, 
                    batch_size=512, 
                    validation_data=(x_val, y_val))
Epoch 1/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.6869 - loss: 0.5969 - val_accuracy: 0.8643 - val_loss: 0.3931
Epoch 2/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.8926 - loss: 0.3361 - val_accuracy: 0.8864 - val_loss: 0.3090
Epoch 3/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9213 - loss: 0.2432 - val_accuracy: 0.8867 - val_loss: 0.2839
Epoch 4/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9356 - loss: 0.1948 - val_accuracy: 0.8899 - val_loss: 0.2743
Epoch 5/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9522 - loss: 0.1552 - val_accuracy: 0.8867 - val_loss: 0.2814
Epoch 6/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9607 - loss: 0.1301 - val_accuracy: 0.8863 - val_loss: 0.2887
Epoch 7/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9694 - loss: 0.1076 - val_accuracy: 0.8806 - val_loss: 0.3029
Epoch 8/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9770 - loss: 0.0896 - val_accuracy: 0.8820 - val_loss: 0.3161
Epoch 9/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9828 - loss: 0.0755 - val_accuracy: 0.8822 - val_loss: 0.3304
Epoch 10/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9838 - loss: 0.0638 - val_accuracy: 0.8717 - val_loss: 0.3741
Epoch 11/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9862 - loss: 0.0593 - val_accuracy: 0.8727 - val_loss: 0.3833
Epoch 12/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9922 - loss: 0.0435 - val_accuracy: 0.8773 - val_loss: 0.3902
Epoch 13/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9940 - loss: 0.0359 - val_accuracy: 0.8760 - val_loss: 0.4147
Epoch 14/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9951 - loss: 0.0307 - val_accuracy: 0.8780 - val_loss: 0.4320
Epoch 15/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9960 - loss: 0.0246 - val_accuracy: 0.8751 - val_loss: 0.4561
Epoch 16/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9969 - loss: 0.0228 - val_accuracy: 0.8724 - val_loss: 0.4916
Epoch 17/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9976 - loss: 0.0181 - val_accuracy: 0.8739 - val_loss: 0.4990
Epoch 18/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9993 - loss: 0.0122 - val_accuracy: 0.8725 - val_loss: 0.5202
Epoch 19/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9995 - loss: 0.0105 - val_accuracy: 0.8736 - val_loss: 0.5407
Epoch 20/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9997 - loss: 0.0082 - val_accuracy: 0.8539 - val_loss: 0.6420
import matplotlib.pyplot as plt 
import seaborn as sns

sns.set_style("whitegrid")
history_dict = history.history

loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]

epochs = range(1, len(loss_values) + 1) 
plt.plot(epochs, loss_values, "bo", label="Training loss") 
plt.plot(epochs, val_loss_values, "b", label="Validation loss") 
plt.title("Training and validation loss") 
plt.xticks(epochs)
plt.xlabel("Epochs") 
plt.ylabel("Loss") 
plt.legend() 
sns.despine()
plt.grid(False)
plt.show()
acc = history_dict["accuracy"]

val_acc = history_dict["val_accuracy"]

epochs = range(1, len(loss_values) + 1) 
plt.plot(epochs, acc, "bo", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc") 
plt.title("Training and validation accuracy") 
plt.xticks(epochs)
plt.xlabel("Epochs") 
plt.ylabel("Accuracy") 
plt.legend() 
sns.despine()
plt.grid(False)
plt.show()

This shows that the model quickly starts overfitting the training data. Overfitting occurs when a model learns the training data too well, capturing not just the underlying patterns but also the random noise and peculiarities specific to the training set. We can identify overfitting by observing a characteristic divergence between training and validation metrics: while training loss continues to decrease, validation loss begins to increase or plateau.

Several indicators of overfitting in this case include:

  • Decreasing training loss alongside increasing validation loss

  • Growing gap between training and validation accuracy

This behavior suggests the model is becoming too specialized to the training examples rather than learning generalizable patterns. There are several strategies to mitigate overfitting, which we will explore later in the course. In this case, we will stop training after 4 epochs to prevent overfitting.

model = keras.Sequential([
    Dense(16, activation="relu"),
    Dense(16, activation="relu"), 
    Dense(1, activation="sigmoid")
])

model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"])

model.fit(x_train, y_train, epochs=4, batch_size=512) 
results = model.evaluate(x_test, y_test)

print(f"The test loss is {results[0]}")
print(f"The test accuracy is {results[1]}")
print("The predictions are:")   
print(model.predict(x_test))
Epoch 1/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7283 - loss: 0.5583
Epoch 2/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9013 - loss: 0.2886
Epoch 3/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9252 - loss: 0.2123
Epoch 4/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9375 - loss: 0.1774
782/782 ━━━━━━━━━━━━━━━━━━━━ 0s 467us/step - accuracy: 0.8821 - loss: 0.2909
The test loss is 0.29100528359413147
The test accuracy is 0.8838000297546387
The predictions are:
782/782 ━━━━━━━━━━━━━━━━━━━━ 0s 381us/step
[[0.17098129]
 [0.99931735]
 [0.5988198 ]
 ...
 [0.0967738 ]
 [0.05774209]
 [0.46419108]]
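The predictions are probabilities from the sigmoid output, so turning them into hard class labels just means thresholding at 0.5. A small sketch with made-up probabilities (not the actual model outputs above):

```python
import numpy as np

probs = np.array([0.17, 0.99, 0.59, 0.09, 0.46])  # hypothetical sigmoid outputs
labels = (probs > 0.5).astype(int)                # 1 = positive, 0 = negative
print(labels)                                     # [0 1 1 0 0]
```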

Some things to try:

  • Try using different optimizers: adam or sgd

  • Try using one or three representation layers, and see how doing so affects validation and test accuracy.

  • Try using layers with more units or fewer units: 32 units, 64 units, and so on.

  • Try using the mse loss function instead of binary_crossentropy.

  • Try using the tanh activation instead of relu.
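To build intuition for the last suggestion: relu and tanh shape activations quite differently - relu zeroes out negative inputs, while tanh squashes everything into (-1, 1) and preserves sign. A standalone NumPy illustration (not part of the notebook's model):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = np.maximum(0.0, x)    # negatives become exactly 0
tanh = np.tanh(x)            # symmetric, saturates toward -1 and 1
print(relu.tolist())         # [0.0, 0.0, 0.0, 0.5, 2.0]
print(tanh.round(3).tolist())
```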

As an example, here is the same model trained with a much smaller batch size of 5, which means 5,000 weight updates per epoch instead of 49:

model = keras.Sequential([
    Dense(16, activation="relu"),
    Dense(16, activation="relu"), 
    Dense(1, activation="sigmoid")
])

model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"])

model.fit(x_train, y_train, epochs=4, batch_size=5, verbose=0) 
results = model.evaluate(x_test, y_test)

print(f"The test loss is {results[0]}")
print(f"The test accuracy is {results[1]}")
print("The predictions are:")   
print(model.predict(x_test))
Epoch 1/4
5000/5000 ━━━━━━━━━━━━━━━━━━━━ 5s 860us/step - accuracy: 0.8367 - loss: 0.3737
Epoch 2/4
5000/5000 ━━━━━━━━━━━━━━━━━━━━ 4s 861us/step - accuracy: 0.9137 - loss: 0.2294
Epoch 3/4
5000/5000 ━━━━━━━━━━━━━━━━━━━━ 4s 878us/step - accuracy: 0.9236 - loss: 0.2105
Epoch 4/4
5000/5000 ━━━━━━━━━━━━━━━━━━━━ 4s 863us/step - accuracy: 0.9327 - loss: 0.1933
782/782 ━━━━━━━━━━━━━━━━━━━━ 0s 453us/step - accuracy: 0.8801 - loss: 0.3087
The test loss is 0.30427753925323486
The test accuracy is 0.8822399973869324
The predictions are:
782/782 ━━━━━━━━━━━━━━━━━━━━ 0s 333us/step
[[0.15679161]
 [0.99998003]
 [0.9494275 ]
 ...
 [0.15605684]
 [0.09611651]
 [0.56805176]]