Open In Colab

Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel \(\rightarrow\) Restart) and then run all cells (in the menubar, select Cell \(\rightarrow\) Run All).

Make sure you fill in any place that says YOUR CODE HERE or “YOUR ANSWER HERE”, as well as your name and collaborators below:

NAME = ""
COLLABORATORS = ""

Tutorial 3: Brief Introduction to Pandas#

This tutorial will provide a very brief introduction to the Pandas library. Pandas is a powerful data manipulation library for Python.

For a more in-depth introduction to Pandas, read the Pandas Documentation.

Import the Pandas library and read a dataset#

The first step is to read a dataset into a Pandas DataFrame. A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).

import pandas as pd 
url = 'https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/comedy_comparisons_metadata.csv'

df = pd.read_csv(url)
type(df) # the type of the object is a DataFrame
pandas.core.frame.DataFrame
df.head() # shows the first 5 rows of the DataFrame
video_id title view_count like_count comment_count duration
0 DE1-cD3pTkA Walker Texas Ranger clip LIVE 224.0 1.0 0.0 PT32S
1 XZqSz_X-j8Y Egg mcmuffin of a crappy commercial 1919.0 7.0 3.0 PT44S
2 vzpD6OogahQ Potter Puppet Pals: School Is For Losers 14958231.0 87806.0 11321.0 PT33S
3 _OpzEHBDwQE Very Funny Commercial Think Beyond Your Ear (... 61696.0 179.0 2.0 PT38S
4 yzGWOpop6i8 Happy Valentines Day 15852.0 315.0 77.0 PT1M14S
df.columns # lists the columns of the DataFrame
Index(['video_id', 'title', 'view_count', 'like_count', 'comment_count',
       'duration'],
      dtype='object')
df.shape # shows the shape of the DataFrame (rows, columns)
(11541, 6)
len(df) # shows the number of rows in the DataFrame
11541

Creating a DataFrame#

df_1 = pd.DataFrame({
    'A': [1, 2, 3], 
    'B': [4, 5, 6],
    'C': ['x', 'y', 'z']})

df_1
A B C
0 1 4 x
1 2 5 y
2 3 6 z
df_2 = pd.DataFrame(
    [[1, 4, 'x'], [2, 5, 'y'], [3, 6, 'z']], 
    columns=['A', 'B', 'C'])
df_2
A B C
0 1 4 x
1 2 5 y
2 3 6 z

Selecting Columns#

The next step is to filter the DataFrame to select specific rows or columns.

df['title'] # A column can be accessed by using the column name as a key
0                            Walker Texas Ranger clip LIVE
1                      Egg mcmuffin of a crappy commercial
2                 Potter Puppet Pals: School Is For Losers
3        Very Funny Commercial  Think Beyond Your Ear (...
4                                     Happy Valentines Day
                               ...                        
11536              Dylan hit by a ball [Slow motion remix]
11537            Trick or Treat! Funny Halloween Dog Video
11538                                   Framed Thief Prank
11539           Harry Redknapp gets hit by Fernando Torres
11540                    ~600 Subs Special~ MMV- Enchanted
Name: title, Length: 11541, dtype: object
type(df['title']) # the type of a column is a Series
pandas.core.series.Series
titles = df['title'] # the column can be stored in a variable

titles.head() # shows the first 5 rows of the Series
0                        Walker Texas Ranger clip LIVE
1                  Egg mcmuffin of a crappy commercial
2             Potter Puppet Pals: School Is For Losers
3    Very Funny Commercial  Think Beyond Your Ear (...
4                                 Happy Valentines Day
Name: title, dtype: object
print(f"The 11th title is {titles[10]}") # a value in a Series can be accessed by using the index as a key

print(f'The 11th title is {df["title"][10]}') # you can also access a value in a Series by using the DataFrame and column name

print("\nThe first 5 values of the title: ", titles[0:5]) # A Series object can be sliced like a list

print("\nThe last 5 values of the title: ", titles[-5:]) # A Series object can be sliced like a list 
The 11th title is PCD Flying Fun Cam
The 11th title is PCD Flying Fun Cam

The first 5 values of the title:  0                        Walker Texas Ranger clip LIVE
1                  Egg mcmuffin of a crappy commercial
2             Potter Puppet Pals: School Is For Losers
3    Very Funny Commercial  Think Beyond Your Ear (...
4                                 Happy Valentines Day
Name: title, dtype: object

The last 5 values of the title:  11536       Dylan hit by a ball [Slow motion remix]
11537     Trick or Treat! Funny Halloween Dog Video
11538                            Framed Thief Prank
11539    Harry Redknapp gets hit by Fernando Torres
11540             ~600 Subs Special~ MMV- Enchanted
Name: title, dtype: object
print("The unique values of the view_counts column are: ", df['view_count'].unique()) # shows the unique values of a column

print("The number of values of the view_counts column are: ", df['view_count'].count()) # shows the number of values of a column

print("The number of unique values of the view_counts column are: ", df['view_count'].nunique()) # shows the number of unique values of a column

print(" ", len(list(set(df['view_count'])))) # Another way to find the number of unique elements in a column
The unique values of the view_counts column are:  [2.2400000e+02 1.9190000e+03 1.4958231e+07 ... 1.3728710e+06 1.7410000e+03
 5.9710000e+03]
The number of values of the view_counts column are:  11532
The number of unique values of the view_counts column are:  9310
  9319

Typically, it is faster to use the Pandas’s built-in functions.

%%timeit

df['view_count'].nunique()
315 μs ± 1.16 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit 

len(list(set(df['view_count'])))
1.27 ms ± 5.98 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Filtering the DataFrame#

df[df['view_count'] > 1000000] # shows the rows that have a view_count greater than 1,000,000
video_id title view_count like_count comment_count duration
2 vzpD6OogahQ Potter Puppet Pals: School Is For Losers 14958231.0 87806.0 11321.0 PT33S
18 rQjr5xjugUM Ah meu Deus do céu, Quebraram(quebraru) a mesa... 4364246.0 37822.0 NaN PT1M15S
23 U2_CSYI8O0Y Kud - Koninginnemarkt 1835025.0 15321.0 1829.0 PT50S
25 TugslL45aXk Don't argue with an ibex 4677355.0 34588.0 3597.0 PT1M30S
39 TXtP8UfIfEA JibJab Valentine's Day eCards.flv_(360p).mp4 20272233.0 19727.0 695.0 PT47S
... ... ... ... ... ... ...
11506 a91eiu_eer4 Optical Illusion Girlfriend 6239948.0 61008.0 3554.0 PT1M24S
11516 59xYAKjCnIo SUPER TIMOR 1131909.0 5171.0 261.0 PT50S
11530 q-RLqLx1iYI Epic and Honest Mobile Home Commercial 5594751.0 29208.0 2779.0 PT1M20S
11531 Ej9wCko6dR4 Bud Light Jackie Moon Commercial 2154189.0 3547.0 349.0 PT1M21S
11538 dHuPgeClQC0 Framed Thief Prank 1372871.0 10009.0 220.0 PT1M28S

1475 rows × 6 columns

df['view_count'] > 1000000 # returns a boolean Series
0        False
1        False
2         True
3        False
4        False
         ...  
11536    False
11537    False
11538     True
11539    False
11540    False
Name: view_count, Length: 11541, dtype: bool
df[df['comment_count'].isin([5, 10, 1000])] # shows the rows that have a comment count that is either 5, 10, or 1000 greater than 1,000,000
video_id title view_count like_count comment_count duration
28 rBVoffKg6QA Talking Cats (High Version) 42287.0 91.0 10.0 PT56S
54 jDD8yq18Gaw 19:57 From Euston - WHAT REALLY HAPPENED 19296.0 22.0 10.0 PT34S
109 XqkjPnrqbSc dahk fokaha tarjama ( chehal fi Sa3a ) 85511.0 101.0 5.0 PT1M13S
149 v-kHJjqLya4 Funny Dog Commercial 65501.0 104.0 10.0 PT1M1S
201 -3g_ENHcvGY Grave yard Smash 1369.0 4.0 5.0 PT33S
... ... ... ... ... ... ...
11174 AxQK9pp8nVc louie c.k. balls in your pus-hole 2765.0 8.0 5.0 PT39S
11313 -UzGgGJkIFU JB SKETCH BAD BOY 14321.0 22.0 10.0 PT1M4S
11392 DiA8VItyPIk Tryin first gear wheelies 1181.0 8.0 5.0 PT1M13S
11396 NhHRfS5bWug Natanek 2013 AD 614.0 3.0 5.0 PT1M30S
11429 4DQxWeTuG7I Family Guy Peter and the Osama Clawed 3195.0 19.0 10.0 PT54S

238 rows × 6 columns

df[(df['view_count'] > 1000000) & (df['comment_count'] > 1000)] # shows the rows that have a view_count greater than 1,000,000 and a comment_count greater than 1000
video_id title view_count like_count comment_count duration
2 vzpD6OogahQ Potter Puppet Pals: School Is For Losers 14958231.0 87806.0 11321.0 PT33S
23 U2_CSYI8O0Y Kud - Koninginnemarkt 1835025.0 15321.0 1829.0 PT50S
25 TugslL45aXk Don't argue with an ibex 4677355.0 34588.0 3597.0 PT1M30S
46 D7_0SOTQLIQ Notorious B.I.G. calms down crying baby - orig... 5977394.0 40051.0 3408.0 PT1M24S
75 ao-9B8IV9_E Funny Animals 16529238.0 18952.0 3217.0 PT10M34S
... ... ... ... ... ... ...
11475 fKvXiBVK-K4 Šta reći? 3050087.0 13179.0 2288.0 PT36S
11476 T7EzZBlQsoA Gerald Anderson and Jose Manalo budoy 3573758.0 4975.0 1210.0 PT1M24S
11490 -iYGAcn66o4 Attempt To Overtake Big Rig Ends In Devastatin... 3914975.0 12173.0 4078.0 PT30S
11506 a91eiu_eer4 Optical Illusion Girlfriend 6239948.0 61008.0 3554.0 PT1M24S
11530 q-RLqLx1iYI Epic and Honest Mobile Home Commercial 5594751.0 29208.0 2779.0 PT1M20S

721 rows × 6 columns

One issue to be aware of is that when you filter a DataFrame, the index of the original DataFrame is preserved.

df[df["view_count"] % 2 == 0] # get all rows where the view_count is even
video_id title view_count like_count comment_count duration
0 DE1-cD3pTkA Walker Texas Ranger clip LIVE 224.0 1.0 0.0 PT32S
3 _OpzEHBDwQE Very Funny Commercial Think Beyond Your Ear (... 61696.0 179.0 2.0 PT38S
4 yzGWOpop6i8 Happy Valentines Day 15852.0 315.0 77.0 PT1M14S
6 DOD38Y72zyk Minecraft: TNT Rocket 2174.0 16.0 3.0 PT1M9S
8 c1NaGio2Geo Caller X episode 3 13716.0 525.0 130.0 PT1M18S
... ... ... ... ... ... ...
11524 gJyAigxDWBI Knowwheretorun1984 set ablaze by Jordan Maxwel... 158.0 3.0 2.0 PT43S
11529 Sn00l_ttSi4 "Strangers on the Street" - QCC 92.0 0.0 0.0 PT1M12S
11532 XcVw-qHdufE Trailer: Being Human - Series 4 Episode 2 - BB... 24146.0 79.0 21.0 PT41S
11535 K-w_Ie-iBgY Gears of War 3 Rage 226.0 1.0 1.0 PT50S
11536 HKMgwmqPOxs Dylan hit by a ball [Slow motion remix] 546.0 4.0 1.0 PT1M21S

5795 rows × 6 columns

# get the 1 element of the view_count column
print("The 2nd element of the view_count column is ", df.loc[1, 'view_count'])
df[df["view_count"] % 2 == 0][1] # produces an error since the item with index 1 does not have an even view_count
The 2nd element of the view_count column is  1919.0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[23], line 3
      1 # get the 1 element of the view_count column
      2 print("The 2nd element of the view_count column is ", df.loc[1, 'view_count'])
----> 3 df[df["view_count"] % 2 == 0][1] # produces an error since the item with index 1 does not have an even view_count

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 1
df[df["view_count"] % 2 == 0].values[1] # use the values attribute to get the values in the filtered dataframe.

Statistics about the DataFrame#

df["view_count"].sum() # shows the mean of the view_count column
sum(df["view_count"]) # outputs nan because the column has missing values (listed at NaN)
sum(df["view_count"].dropna()) # outputs the sum of the view_count column without the missing values

Again, it is faster to use the Pandas’s built-in functions.

%%timeit

df["view_count"].sum()  
%%timeit 

sum(df["view_count"].dropna())
df["like_count"].max() # shows the max of the view_count column
df.describe() # shows the summary statistics of the DataFrame

Combining DataFrames#

df_3 = pd.DataFrame({
    'A': [10, 11, 12], 
    'B': [13, 14, 15],
    'C': ['xx', 'yy', 'zz']})

df_3
combined_df = pd.concat([df_1, df_2, df_3], axis=0) # concatenates the two DataFrames along the rows

combined_df
combined_df = pd.concat([df_1, df_2, df_3], axis=1) # concatenates the two DataFrames along the columns

combined_df

Task#

def compare_views(df, video_id1, video_id2): 
    """Return True if the view count of video_id1 is greater than the view count of video_id2; otherwise, return False. If video_id1 or video_id2 is not in the DataFrame, raise a ValueError that outputs the string 'video_id is not in the DataFrame', where video_id is not present in the frame."""
    
    # YOUR CODE HERE
    raise NotImplementedError()
assert compare_views(df, 'vzpD6OogahQ', 'yzGWOpop6i8') == True
assert compare_views(df, 'yzGWOpop6i8', 'vzpD6OogahQ') == False
assert compare_views(df, 'DE1-cD3pTkA', 'XZqSz_X-j8Y') == False
assert compare_views(df, 'XZqSz_X-j8Y', 'DE1-cD3pTkA') == True
assert compare_views(df, 'yzGWOpop6i8', 'yzGWOpop6i8') == False
try:
    compare_views(df, 'vzpD6OogahQ', 'not_in_df1')
except ValueError as e:
    assert str(e) == 'not_in_df1 is not in the DataFrame'
try:
    compare_views(df, 'not_in_df2', 'vzpD6OogahQ')
except ValueError as e:
    assert str(e) == 'not_in_df2 is not in the DataFrame'