Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel \(\rightarrow\) Restart) and then run all cells (in the menubar, select Cell \(\rightarrow\) Run All).
Make sure you fill in any place that says YOUR CODE HERE or “YOUR ANSWER HERE”, as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
Tutorial 7: Part-of-speech tagging with RNNs#
Part-of-speech tagging is a long-standing task in computational linguistics and natural language processing: we assign a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. Words are often ambiguous between categories, and the correct category often depends not just on which other words appear in the sentence, but on the exact order of those words. RNNs are well suited to capturing word-order information when learning to classify words by their grammatical category.
Imports#
In addition to the libraries you’re already familiar with, we’ll use NLTK to get the labelled data we’ll train our models on. We’ll also import ‘time’ to compare the training speed of the different models.
import numpy as np
import nltk
from nltk.corpus import brown
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, Dense, Embedding, SimpleRNN, LSTM, Conv1D, MaxPooling1D, Flatten, Bidirectional
from sklearn.model_selection import train_test_split
import time
Part 1: Data Preparation#
# Download the data we'll be using
nltk.download('universal_tagset')
nltk.download('brown')
# We'll grab just 5000 sentences from this data, so our models don't take too long to train
tagged_sentences = brown.tagged_sents(tagset='universal')[:5000]
print(f"Example sentences:\n")
print(f"{tagged_sentences[np.random.randint(len(tagged_sentences))]}\n")
print(f"{tagged_sentences[np.random.randint(len(tagged_sentences))]}")
1.1 Exploring the Dataset#
When starting a new machine learning project, it’s a good idea to take some time to explore the data you’ll be working with.
all_words = [word.lower() for sentence in tagged_sentences for word, tag in sentence]
all_tags = [tag for sentence in tagged_sentences for word, tag in sentence]
print("all_words:")
print(all_words[:10])
print("\nall_tags:")
print(all_tags[:10], "\n")
unique_words = set(all_words)
unique_tags = set(all_tags)
print(f"Total words: {len(all_words)}")
print(f"Number of unique words: {len(unique_words)}")
print(f"Number of unique tags: {len(unique_tags)}")
print(f"Unique tags: {unique_tags}\n")
tag_counts = sorted(list(Counter(all_tags).items()), key=lambda x : x[1])
plt.figure(figsize=(12,6))
sns.barplot(x=[x for x,y in tag_counts], y=[y for x,y in tag_counts])
plt.title('Distribution of POS Tags')
plt.xticks(rotation=45)
plt.ylabel('Frequency')
plt.show()
Text encoding#
Let’s move from words and categories as strings to words and categories as integers. Write this code yourself, rather than relying on any libraries you’ve seen.
# Create mappings from words and tags to indices
# Use the variable names: word_to_idx and tag_to_idx
# I'll initialize them with special words/tags for padding and unknown tokens.
word_to_idx = {'<PAD>' : 0, '<UNK>' : 1}
tag_to_idx = {'<PAD>' : 0}
# YOUR CODE HERE
raise NotImplementedError()
#Create reverse mappings from indices to words/tags
#Use the variable names: idx_to_word and idx_to_tag
# YOUR CODE HERE
raise NotImplementedError()
assert set(filter(lambda x : x not in ['<PAD>', '<UNK>'], word_to_idx.keys())) == unique_words
assert set(filter(lambda x : x not in ['<PAD>', '<UNK>'], tag_to_idx.keys())) == unique_tags
assert set(filter(lambda x : x not in ['<PAD>', '<UNK>'], idx_to_word.values())) == unique_words
assert set(filter(lambda x : x not in ['<PAD>', '<UNK>'], idx_to_tag.values())) == unique_tags
assert all([idx_to_word[word_to_idx[word]] == word for word in unique_words])
assert all([idx_to_tag[tag_to_idx[tag]] == tag for tag in unique_tags])
The data will need to be prepared differently for the multi-layer perceptron (MLP) and for the RNN. An RNN model can take in whole sentences and output a tag for each word. That pattern makes less sense for an MLP; instead, we’ll input a fixed-length window of context and output the tag of the word at the center of that window.
When we build our models, we’ll use a Keras Embedding layer, which means we don’t have to do any text vectorization / one-hot encoding.
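If the Embedding layer is unfamiliar, a useful mental model (not the Keras implementation itself) is a trainable lookup table: row i of a weight matrix is the vector for word index i. A minimal NumPy sketch with made-up toy sizes, not the tutorial's real vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4               # toy sizes for illustration only
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

word_indices = np.array([3, 1, 7])          # a tiny "sentence" of word indices
vectors = embedding_matrix[word_indices]    # lookup is just fancy indexing
print(vectors.shape)                        # one embed_dim vector per word
```

This is why no one-hot encoding is needed: the layer consumes integer indices directly, and the rows of the table are updated during training.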
def create_sequence_data(sentences, max_length=30):
X_sequences = []
y_sequences = []
for sentence in sentences:
# For each sentence, if it's longer than max_length, truncate it to just the first max_length tokens
# Then convert the sentence, which is a list of (token, tag) pairs, into
# (1) a list of word_indices, using word_to_idx
# (2) a list of tag_indices, using tag_to_idx
# Then append these to X_sequences and y_sequences
# YOUR CODE HERE
raise NotImplementedError()
X_padded = pad_sequences(X_sequences, maxlen=max_length, padding='post', value=word_to_idx['<PAD>'])
y_padded = pad_sequences(y_sequences, maxlen=max_length, padding='post', value=tag_to_idx['<PAD>'])
return X_padded, y_padded
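If pad_sequences is unfamiliar: with padding='post' it appends the pad value to the end of each sequence until it reaches maxlen. A rough pure-Python equivalent, for intuition only (note that Keras's default truncating='pre' keeps the *end* of over-long sequences, which doesn't matter here because sentences are truncated before padding):

```python
def pad_post(sequences, maxlen, value=0):
    # Truncate each sequence to maxlen, then right-pad it with `value`
    return [seq[:maxlen] + [value] * (maxlen - len(seq[:maxlen]))
            for seq in sequences]

print(pad_post([[5, 6], [1, 2, 3, 4]], maxlen=3))  # [[5, 6, 0], [1, 2, 3]]
```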
print("Let's look at a sample from the sequence data we're creating:")
X_padded, y_padded = create_sequence_data(tagged_sentences[:1])
print(X_padded[0])
print(y_padded[0])
print("\nAnd look at it decoded, as a sanity check:")
print(" ".join([idx_to_word[idx] for idx in X_padded[0]]))
print(" ".join([idx_to_tag[idx] for idx in y_padded[0]]))
def create_window_data(sentences, window_size=5):
X_windows = []
y_tags = []
half_window = window_size // 2
for sentence in sentences:
padded_sentence = [('<PAD>', None)]*half_window + sentence + [('<PAD>', None)]*half_window
for i in range(half_window, len(padded_sentence) - half_window):
window = [
w if t is None else w.lower()  # '<PAD>' entries carry tag None; keep them unlowered so they map to word_to_idx['<PAD>'], not '<UNK>'
for w, t in padded_sentence[i - half_window : i + half_window + 1]
]
center_tag = padded_sentence[i][1]
if center_tag is not None:
X_windows.append(window)
y_tags.append(center_tag)
X_indices = []
for window in X_windows:
window_indices = [word_to_idx.get(word, word_to_idx['<UNK>']) for word in window]
X_indices.append(window_indices)
y_indices = [tag_to_idx[tag] for tag in y_tags]
return np.array(X_indices), np.array(y_indices)
print("\nLet's look at a sample from the window data we're creating:")
X_indices, y_indices = create_window_data(tagged_sentences[:1])
print(X_indices[0])
print(y_indices[0])
print("\nAnd look at it decoded, as a sanity check:")
print(" ".join([idx_to_word[idx] for idx in X_indices[0]]))
print(idx_to_tag[y_indices[0]])
test_sentence = [("dogs", "NOUN"), ("like", "VERB"), ("rice", "NOUN")]
X_padded, y_padded = create_sequence_data([test_sentence])
assert [(idx_to_word[X_padded[0][i]], idx_to_tag[y_padded[0][i]]) for i in range(len(test_sentence))] == test_sentence
# Split sentences into training, validation, and test sets
train_sentences, test_sentences = train_test_split(tagged_sentences, test_size=0.2)
train_sentences, val_sentences = train_test_split(train_sentences, test_size=0.25)
print(f"Train sentences: {len(train_sentences)}")
print(f"Validation sentences: {len(val_sentences)}")
print(f"Test sentences: {len(test_sentences)}")
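The two-step split yields a 60/20/20 division: test_size=0.2 holds out 20% for test, then test_size=0.25 takes a quarter of the remaining 80% (i.e. 20% of the original) for validation. A quick check of that arithmetic on dummy data (random_state added here only to make the sketch reproducible):

```python
from sklearn.model_selection import train_test_split

items = list(range(100))
train, test = train_test_split(items, test_size=0.2, random_state=0)
train, val = train_test_split(train, test_size=0.25, random_state=0)
print(len(train), len(val), len(test))  # 60 20 20
```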
# Create sequence data
X_train_seqs, y_train_seqs = create_sequence_data(train_sentences)
X_val_seqs, y_val_seqs = create_sequence_data(val_sentences)
X_test_seqs, y_test_seqs = create_sequence_data(test_sentences)
print("\nSequence data shapes:")
print(f"X_train_seqs: {X_train_seqs.shape}, y_train_seqs: {y_train_seqs.shape}")
print(f"X_val_seqs: {X_val_seqs.shape}, y_val_seqs: {y_val_seqs.shape}")
print(f"X_test_seqs: {X_test_seqs.shape}, y_test_seqs: {y_test_seqs.shape}")
# Prepare window-based data
X_train_window, y_train_window = create_window_data(train_sentences)
X_val_window, y_val_window = create_window_data(val_sentences)
X_test_window, y_test_window = create_window_data(test_sentences)
print("\nWindow-based data shapes:")
print(f"X_train_window: {X_train_window.shape}, y_train_window: {y_train_window.shape}")
print(f"X_val_window: {X_val_window.shape}, y_val_window: {y_val_window.shape}")
print(f"X_test_window: {X_test_window.shape}, y_test_window: {y_test_window.shape}")
Train and test an MLP#
Next, create a Sequential Keras model with Dense hidden layers. The first layer should be an embedding layer that looks like this:
Embedding( input_dim=len(word_to_idx), output_dim=50, input_shape=(5,) )
Then use a Flatten layer, then two Dense hidden layers, and a final Dense output layer.
Call the variable ‘dense_model’ to store the model
# YOUR CODE HERE
raise NotImplementedError()
dense_model.summary()
Next, call model fit on dense_model and pass in X_train_window, y_train_window, and also make use of the validation data, passing in X_val_window and y_val_window.
Use ‘time.perf_counter()’ immediately before and after dense_model.fit is called.
Print out: How long it took your model to train, and how accurate it was for the X_test_window and y_test_window data.
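The timing pattern is just two perf_counter calls around the call you want to measure. A sketch with a hypothetical stand-in function in place of dense_model.fit:

```python
import time

def slow_step():
    # stand-in for dense_model.fit(...); any slow call works the same way
    return sum(i * i for i in range(100_000))

start = time.perf_counter()
result = slow_step()
elapsed = time.perf_counter() - start
print(f"Training took {elapsed:.2f} seconds")
```

perf_counter is preferred over time.time for measuring durations because it is monotonic and has higher resolution.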
# YOUR CODE HERE
raise NotImplementedError()
Train and test an RNN model#
Next, create a Sequential Keras model with a Bidirectional LSTM hidden layer. The first layer should be an embedding layer that looks like this:
Embedding( input_dim=len(word_to_idx), output_dim=50, input_shape=(30,) )
Then use a Bidirectional LSTM layer that looks like this:
Bidirectional(LSTM(100, return_sequences=True))
And finally, have a Dense output layer.
Call the variable ‘rnn_model’ to store the model.
# YOUR CODE HERE
raise NotImplementedError()
rnn_model.summary()
Next, call model fit on rnn_model and pass in X_train_seqs and y_train_seqs, and also make use of the validation data, passing in X_val_seqs and y_val_seqs.
Use ‘time.perf_counter()’ immediately before and after rnn_model.fit is called.
Print out: How long it took your model to train, and how accurate it was for the X_test_seqs and y_test_seqs data.
# Hint: Pass in this mask to the sample_weight argument when calling rnn_model.evaluate. This will
# help ensure that the accuracy isn't inflated by the '<PAD>' tokens.
mask = (y_test_seqs > 0).astype(float)
# YOUR CODE HERE
raise NotImplementedError()
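To see why the mask matters: padded positions are trivially easy to "predict", so including them inflates accuracy. A NumPy sketch of masked vs. unmasked accuracy on made-up labels (not the tutorial's real predictions):

```python
import numpy as np

y_true = np.array([2, 3, 1, 0, 0, 0])  # last three positions are '<PAD>' (index 0)
y_pred = np.array([2, 1, 1, 0, 0, 0])  # model gets 2 of the 3 real tokens right

unmasked_acc = (y_true == y_pred).mean()  # 5/6: inflated by the easy pad positions
mask = (y_true > 0).astype(float)         # 1.0 for real tokens, 0.0 for pads
masked_acc = ((y_true == y_pred) * mask).sum() / mask.sum()  # 2/3: real tokens only
print(unmasked_acc, masked_acc)
```

Passing this mask as sample_weight makes Keras compute the weighted version, which is what you want to report.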
Finally, write one or two sentences comparing how these models performed on this task. (4 points)