Tutorial 2: Reading CSV files

Part 1

1. Reading `csv` files

A CSV file is a comma-separated values file. It is a simple plain-text format for storing tabular data, such as a spreadsheet or a database table. The first row of the file typically contains the column names, and each following row contains one record of data.
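To make the layout concrete, here is a minimal sketch that parses a small CSV string held in memory. The column names and values (`name`, `score`, and so on) are made up for illustration and are not taken from the tutorial's data.

```python
import csv
import io

# Hypothetical CSV text: a header row followed by two data rows.
sample_text = "name,score\nalice,90\nbob,85\n"

# csv.reader yields each row as a list of strings.
rows = list(csv.reader(io.StringIO(sample_text)))

header = rows[0]      # the first row holds the column names
data_rows = rows[1:]  # the remaining rows hold the data

print(header)
print(data_rows)
```

Note that every value comes back as a string, including the numbers, so numeric columns usually need an explicit conversion.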

The file comedy_comparisons_metadata.csv contains metadata about videos on YouTube. The file is available at the following URL: https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/comedy_comparisons_metadata.csv

Use the csv Python package (https://docs.python.org/3/library/csv.html) to read the file. Create a list metadata that contains one dictionary for each row in the file. The keys of each dictionary should be the column names and the values should be the corresponding values in the row. For example, the first dictionary in the list should have the keys “video_id”, “title”, “view_count”, “like_count”, “comment_count”, and “duration”, corresponding to the columns in the file.
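As a rough illustration of the row-to-dictionary idea, the sketch below runs `csv.DictReader` over a tiny in-memory sample. The helper name `read_rows`, the sample ids, and the sample values are all hypothetical; only the column names come from the description above. It is not a solution to the exercise, which also needs to fetch the real file.

```python
import csv
import io

# Hypothetical two-row sample using the column names described above.
sample_text = (
    "video_id,title,view_count,like_count,comment_count,duration\n"
    "abc123,First clip,100,10,1,60\n"
    "def456,Second clip,300,20,2,90\n"
)

def read_rows(text_stream):
    """Return one dictionary per data row, keyed by the header names."""
    reader = csv.DictReader(text_stream)
    return [dict(row) for row in reader]

sample_metadata = read_rows(io.StringIO(sample_text))
print(sample_metadata[0])
```

`csv.DictReader` leaves every value as a string, so columns such as `view_count` would need `int(...)` before any arithmetic. For a file on disk, the same helper could be given `open(path, newline="")`; for a remote file, the text would have to be downloaded first, for example with `urllib.request.urlopen`.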

import csv

# YOUR CODE HERE
raise NotImplementedError()


assert len(metadata) == 11541
assert all([type(x) == dict for x in metadata])
assert all([sorted(list(x.keys())) == sorted(['video_id', 'duration', 'title', 'view_count', 'like_count', 'comment_count', ]) for x in metadata])

def avg_view_count(metadata):
    """
    Calculate the average view count of the videos in the metadata.
    Return the average rounded to two decimal places.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

assert avg_view_count(metadata) == 891988.54
assert avg_view_count(metadata[100:200]) == 1152895.47
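The averaging step can be sketched on a few made-up rows. The helper name `average_of_column` and the sample values are assumptions for illustration; the sketch also assumes the counts may still be strings, as they are straight out of the csv module, and converts them with `int`.

```python
def average_of_column(rows, col_name):
    """Average an integer-valued column, rounded to two decimal places."""
    # int() accepts both ints and numeric strings such as "100".
    values = [int(row[col_name]) for row in rows]
    return round(sum(values) / len(values), 2)

# Hypothetical rows; the real metadata has more columns.
sample_rows = [
    {"view_count": "100"},
    {"view_count": "300"},
    {"view_count": "201"},
]

print(average_of_column(sample_rows, "view_count"))  # 601 / 3, rounded: 200.33
```

An empty list would raise `ZeroDivisionError` here; whether that case needs handling depends on how the function is meant to be used.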

2. Write a function that accepts the `metadata` list, a `video_id`, and a column name, and returns the value of that column for the video with that id.

For instance, get_value(metadata, 'DE1-cD3pTkA', 'like_count') should return the number of likes for the video with id “DE1-cD3pTkA”.

def get_value(metadata, video_id, col_name):

    # YOUR CODE HERE
    raise NotImplementedError()

assert get_value(metadata, 'XZqSz_X-j8Y', 'view_count') == 1919
assert get_value(metadata, 'XZqSz_X-j8Y', 'like_count') == 7
assert get_value(metadata, 'XZqSz_X-j8Y', 'comment_count') == 3
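A lookup of this kind can be sketched as a linear scan over the rows. The helper name `find_value`, the sample ids, and the choice to return `None` for an unknown id are assumptions made for this illustration, not requirements stated in the exercise.

```python
def find_value(rows, video_id, col_name):
    """Return the value in col_name for the first row whose video_id matches."""
    for row in rows:
        if row["video_id"] == video_id:
            return row[col_name]
    # Assumption: an unknown id yields None rather than raising an error.
    return None

# Hypothetical rows with integer counts.
sample_rows = [
    {"video_id": "abc123", "like_count": 10},
    {"video_id": "def456", "like_count": 20},
]

print(find_value(sample_rows, "def456", "like_count"))
```

The assertions above compare against integers, so the counts have to be converted from strings at some point, either when the file is read or inside the lookup. If many lookups are needed, building a dictionary keyed by `video_id` once avoids rescanning the list each time.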
 

Part 2: Predicting Video Comparisons from Metadata

In this part, we will attempt to predict which of two YouTube videos is considered funnier based on their metadata.

The dataset comedy_comparisons.csv is a subset of the YouTube Comedy Slam Preference dataset, available from the UC Irvine Machine Learning Repository. It contains pairwise comparisons of videos, where each row records the video IDs of two videos and indicates which one was rated as funnier by a user.

You can access the file at the following URL:
https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/test_comedy_comparisons_restricted.csv

Tasks

1. Read the Dataset: Read the file test_comedy_comparisons_restricted.csv and create a list of dictionaries. Each dictionary should have the keys "video_id_1", "video_id_2", and "winner".
   - "video_id_1" and "video_id_2" should store the video IDs being compared.
   - "winner" should be 1 if video_id_1 is considered funnier and 0 if video_id_2 is considered funnier.
   - The result should be a list of such dictionaries (the code at the end of this part refers to it as comparisons).
2. Implement Comparison Functions: Write three different comparison functions of the following form:
   ```
   def is_funnier(metadata, video_id1, video_id2):
       """
       Returns True if video_id1 is predicted to be funnier than video_id2 based on metadata.
       """
   ```
   Each function should predict which video is funnier based on some metadata attribute, such as:
   - Number of views
   - Number of likes
   - Number of comments
3. Evaluate Accuracy: Write a function evaluate that accepts the metadata, the list of comparisons created in step 1, and a comparison function, and evaluates the accuracy of that comparison function. The accuracy is the proportion of comparisons where the function correctly predicts the funnier video.
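The comparison and accuracy steps can be sketched on toy data as follows. Everything here is hypothetical: the ids, the counts, the helper names `more_views` and `accuracy`, and the assumption that every compared id appears in the metadata. It illustrates the idea of accuracy as a proportion and is not the intended solution.

```python
# Hypothetical toy metadata and comparisons, in the shapes described above.
toy_metadata = [
    {"video_id": "a", "view_count": 100},
    {"video_id": "b", "view_count": 300},
    {"video_id": "c", "view_count": 200},
]
toy_comparisons = [
    {"video_id_1": "a", "video_id_2": "b", "winner": 0},
    {"video_id_1": "b", "video_id_2": "c", "winner": 1},
    {"video_id_1": "a", "video_id_2": "c", "winner": 1},
]

def more_views(metadata, video_id1, video_id2):
    """Predict that the video with strictly more views is the funnier one."""
    views = {row["video_id"]: row["view_count"] for row in metadata}
    return views[video_id1] > views[video_id2]

def accuracy(metadata, comparisons, predictor):
    """Proportion of comparisons where the predictor matches the recorded winner."""
    correct = 0
    for comp in comparisons:
        said_first = predictor(metadata, comp["video_id_1"], comp["video_id_2"])
        predicted = 1 if said_first else 0
        if predicted == comp["winner"]:
            correct += 1
    return correct / len(comparisons)

print(accuracy(toy_metadata, toy_comparisons, more_views))
```

On this toy data the predictor agrees with the first two comparisons and disagrees with the third, so the accuracy is 2/3. With real data, a comparison may mention an id that is missing from the metadata, and the dictionary lookup above would then raise `KeyError`; how to treat such rows is a design choice.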

import csv

# YOUR CODE HERE
raise NotImplementedError()

def is_funnier_1(metadata, video_id1, video_id2):
    # YOUR CODE HERE
    raise NotImplementedError()

def is_funnier_2(metadata, video_id1, video_id2):
    # YOUR CODE HERE
    raise NotImplementedError()

def is_funnier_3(metadata, video_id1, video_id2):
    # YOUR CODE HERE
    raise NotImplementedError()

def evaluate(metadata, comparisons, is_funnier):
    # YOUR CODE HERE
    raise NotImplementedError()

print("The accuracy of is_funnier_1 is", evaluate(metadata, comparisons, is_funnier_1))

print("The accuracy of is_funnier_2 is", evaluate(metadata, comparisons, is_funnier_2))

print("The accuracy of is_funnier_3 is", evaluate(metadata, comparisons, is_funnier_3))