Midterm Project

In this notebook, we’ll classify text messages from the SMS Spam Collection, hosted by the UC Irvine Machine Learning Repository, as spam or ham (not spam). The code below loads and tokenizes the data.

import requests
import zipfile
import io
import re

import numpy as np
from tensorflow import keras 
from tensorflow.keras.layers import Dense

def load_data():
    url = "https://smsdataset1.s3.us-east-2.amazonaws.com/smsspamcollection.zip"
    filename = "SMSSpamCollection"
    
    response = requests.get(url)

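    if response.status_code != 200:
        # Fail loudly: callers unpack three return values, so a bare return
        # would only surface later as a confusing TypeError.
        raise RuntimeError(f"Download failed. Status code: {response.status_code}")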
    
    X, y = [], []
    
    zip_data = io.BytesIO(response.content)
    with zipfile.ZipFile(zip_data) as zip_ref:
        print(f"Files in the archive: {zip_ref.namelist()}")
        assert filename in zip_ref.namelist()
        with zip_ref.open(filename) as text_file:
            content = text_file.read().decode('utf-8')
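            for line in content.splitlines():
                # Lowercase, strip punctuation, then split on whitespace.
                clean_line = re.sub(r'[^\w\s]', '', line.lower())
                tokens = clean_line.split()
                if not tokens:
                    continue  # skip blank lines
                # The first token is the label ("ham" or "spam"); the rest is the message.
                y.append(tokens[0])
                X.append(tokens[1:])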
    
    label_to_id = {"ham" : 0, "spam" : 1}
    
    # Sort the vocabulary so word ids are the same on every run (set order is not stable).
    words = sorted({word for x in X for word in x})
    word_to_id = {word: i for i, word in enumerate(words)}
    
    y = [label_to_id[label] for label in y]
    X = [[word_to_id[word] for word in x] for x in X]

    return X, y, len(word_to_id)
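
To make the tokenization step concrete, here is a small sketch that applies the same cleaning as load_data to a single line. The sample line is invented for illustration and only assumes the "label, tab, message" layout that the parsing code implies.

```python
import re

# A made-up line in the "label<TAB>message" layout (not taken from the real file).
line = "spam\tWINNER!! Claim your FREE prize now."

# Same cleaning steps as load_data: lowercase, drop punctuation, split on whitespace.
clean_line = re.sub(r'[^\w\s]', '', line.lower())
tokens = clean_line.split()

# The first token is the label; the remaining tokens are the message words.
label, words = tokens[0], tokens[1:]
print(label)  # spam
print(words)  # ['winner', 'claim', 'your', 'free', 'prize', 'now']
```

load_data then replaces each word with an integer id, so every message ends up as a list of integers.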


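Let’s start by multi-hot encoding the data, so that each message becomes a vector of length dimension with a 1 at the index of every word it contains and 0 everywhere else. Complete the function vectorize_sequences here: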

data, labels, dimension = load_data()

def vectorize_sequences(sequences, dimension):
    # YOUR CODE HERE
    raise NotImplementedError()

X = vectorize_sequences(data, dimension)

Files in the archive: ['SMSSpamCollection', 'readme']
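
One possible way to approach multi-hot encoding is sketched below. It is not the only valid answer: it assumes each sequence is a list of integer word ids smaller than dimension, and the toy input is made up so the example runs without downloading the dataset.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    # One row per message, one column per word id in the vocabulary.
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        # Convert to an integer index array so empty messages are handled too.
        idx = np.asarray(sequence, dtype=np.intp)
        # Mark every word id that occurs; a repeated word still yields a single 1.
        results[i, idx] = 1.0
    return results

# Toy input (an assumption for illustration): three messages, five-word vocabulary.
toy = [[0, 2], [1, 1, 4], []]
encoded = vectorize_sequences(toy, 5)
print(encoded)
```

Note that multi-hot encoding discards word order and word counts; it only records which words appear in each message.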

Next, convert the labels into a NumPy array of type float32.

# YOUR CODE HERE
raise NotImplementedError()
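
A minimal sketch of the conversion, using a toy list in place of the labels returned by load_data. The variable name y is an assumption; the notebook does not prescribe one.

```python
import numpy as np

# Toy stand-in for the integer labels from load_data (0 = ham, 1 = spam).
labels = [0, 1, 0, 0, 1]

# np.asarray with an explicit dtype converts the list in a single step.
y = np.asarray(labels, dtype="float32")
print(y.dtype, y.shape)  # float32 (5,)
```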

Next, split the dataset into training and test data and labels. Assign these to the variables train_X, train_y, test_X, and test_y.

# YOUR CODE HERE
raise NotImplementedError()
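
One way to split the data is to shuffle the row indices and cut them at a fixed fraction. The sketch below uses toy arrays; the 80/20 ratio, the random seed, and the choice to shuffle are all assumptions, not requirements of the assignment.

```python
import numpy as np

# Toy stand-ins for the encoded messages and the float32 labels.
X = np.arange(20, dtype="float32").reshape(10, 2)
y = np.array([0, 1] * 5, dtype="float32")

# Shuffle indices once so that data and labels stay aligned.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(len(X))

# Assumed split: the first 80% of the shuffled rows train, the rest test.
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

train_X, train_y = X[train_idx], y[train_idx]
test_X, test_y = X[test_idx], y[test_idx]
print(train_X.shape, test_X.shape)  # (8, 2) (2, 2)
```

Shuffling before splitting guards against any ordering in the source file leaking into the train/test division.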

Now create a model using the Sequential class from Keras. There is no single right answer for what your model should look like. Play around with the parameters of the model. What’s the simplest the model can be and still achieve good accuracy? When does adding more neurons no longer improve the model’s accuracy?

# YOUR CODE HERE
raise NotImplementedError()

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_X,
                train_y, 
                epochs=50, 
                batch_size=512)
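
As a starting point, here is a sketch of a small Sequential model trained with the same compile settings as above. The layer sizes (one hidden layer of 16 units) are an arbitrary assumption to experiment from, and the random toy data only exists to make the example self-contained; it carries no signal to learn.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense

# Random toy data standing in for the multi-hot vectors and labels.
rng = np.random.default_rng(0)
train_X = rng.integers(0, 2, size=(64, 20)).astype("float32")
train_y = rng.integers(0, 2, size=(64,)).astype("float32")

# Assumed architecture: one small hidden layer, then a sigmoid output
# that produces a spam probability for binary cross-entropy.
model = keras.Sequential([
    keras.Input(shape=(train_X.shape[1],)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Few epochs here only to keep the toy run short.
history = model.fit(train_X, train_y, epochs=2, batch_size=32, verbose=0)
print(model.count_params())
```

To explore the questions above, try changing the number of units in the hidden layer, or removing it entirely so that the model reduces to logistic regression, and compare the resulting accuracy.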

Now print out the test loss and test accuracy of your model.

# YOUR CODE HERE
raise NotImplementedError()
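
A minimal sketch of evaluating on held-out data. With metrics=["accuracy"], model.evaluate returns the loss followed by the accuracy. The tiny untrained model and random test data below are assumptions that keep the example runnable on its own; in the notebook you would call evaluate on your trained model with test_X and test_y.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense

# Random toy test set standing in for test_X and test_y.
rng = np.random.default_rng(0)
test_X = rng.integers(0, 2, size=(32, 20)).astype("float32")
test_y = rng.integers(0, 2, size=(32,)).astype("float32")

# Placeholder model (untrained) so that evaluate has something to run on.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# evaluate returns [loss, accuracy] in the order given to compile.
test_loss, test_acc = model.evaluate(test_X, test_y, verbose=0)
print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
```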