Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel \(\rightarrow\) Restart) and then run all cells (in the menubar, select Cell \(\rightarrow\) Run All).
Make sure you fill in any place that says YOUR CODE HERE or “YOUR ANSWER HERE”, as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
Tutorial 3: Brief Introduction to Pandas#
This tutorial will provide a very brief introduction to the Pandas library. Pandas is a powerful data manipulation library for Python.
For a more in-depth introduction to Pandas, read the Pandas Documentation.
Import the Pandas library and read a dataset#
The first step is to read a dataset into a Pandas DataFrame. A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).
import pandas as pd
url = 'https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/comedy_comparisons_metadata.csv'
df = pd.read_csv(url)
type(df) # the type of the object is a DataFrame
pandas.core.frame.DataFrame
df.head() # shows the first 5 rows of the DataFrame
| video_id | title | view_count | like_count | comment_count | duration | |
|---|---|---|---|---|---|---|
| 0 | DE1-cD3pTkA | Walker Texas Ranger clip LIVE | 224.0 | 1.0 | 0.0 | PT32S |
| 1 | XZqSz_X-j8Y | Egg mcmuffin of a crappy commercial | 1919.0 | 7.0 | 3.0 | PT44S |
| 2 | vzpD6OogahQ | Potter Puppet Pals: School Is For Losers | 14958231.0 | 87806.0 | 11321.0 | PT33S |
| 3 | _OpzEHBDwQE | Very Funny Commercial Think Beyond Your Ear (... | 61696.0 | 179.0 | 2.0 | PT38S |
| 4 | yzGWOpop6i8 | Happy Valentines Day | 15852.0 | 315.0 | 77.0 | PT1M14S |
df.columns # lists the columns of the DataFrame
Index(['video_id', 'title', 'view_count', 'like_count', 'comment_count',
'duration'],
dtype='object')
df.shape # shows the shape of the DataFrame (rows, columns)
(11541, 6)
len(df) # shows the number of rows in the DataFrame
11541
Creating a DataFrame#
df_1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': ['x', 'y', 'z']})
df_1
| A | B | C | |
|---|---|---|---|
| 0 | 1 | 4 | x |
| 1 | 2 | 5 | y |
| 2 | 3 | 6 | z |
df_2 = pd.DataFrame(
[[1, 4, 'x'], [2, 5, 'y'], [3, 6, 'z']],
columns=['A', 'B', 'C'])
df_2
| A | B | C | |
|---|---|---|---|
| 0 | 1 | 4 | x |
| 1 | 2 | 5 | y |
| 2 | 3 | 6 | z |
Selecting Columns#
The next step is to filter the DataFrame to select specific rows or columns.
df['title'] # A column can be accessed by using the column name as a key
0 Walker Texas Ranger clip LIVE
1 Egg mcmuffin of a crappy commercial
2 Potter Puppet Pals: School Is For Losers
3 Very Funny Commercial Think Beyond Your Ear (...
4 Happy Valentines Day
...
11536 Dylan hit by a ball [Slow motion remix]
11537 Trick or Treat! Funny Halloween Dog Video
11538 Framed Thief Prank
11539 Harry Redknapp gets hit by Fernando Torres
11540 ~600 Subs Special~ MMV- Enchanted
Name: title, Length: 11541, dtype: object
type(df['title']) # the type of a column is a Series
pandas.core.series.Series
titles = df['title'] # the column can be stored in a variable
titles.head() # shows the first 5 rows of the Series
0 Walker Texas Ranger clip LIVE
1 Egg mcmuffin of a crappy commercial
2 Potter Puppet Pals: School Is For Losers
3 Very Funny Commercial Think Beyond Your Ear (...
4 Happy Valentines Day
Name: title, dtype: object
print(f"The 11th title is {titles[10]}") # a value in a Series can be accessed by using the index as a key
print(f'The 11th title is {df["title"][10]}') # you can also access a value in a Series by using the DataFrame and column name
print("\nThe first 5 values of the title: ", titles[0:5]) # A Series object can be sliced like a list
print("\nThe last 5 values of the title: ", titles[-5:]) # A Series object can be sliced like a list
The 11th title is PCD Flying Fun Cam
The 11th title is PCD Flying Fun Cam
The first 5 values of the title: 0 Walker Texas Ranger clip LIVE
1 Egg mcmuffin of a crappy commercial
2 Potter Puppet Pals: School Is For Losers
3 Very Funny Commercial Think Beyond Your Ear (...
4 Happy Valentines Day
Name: title, dtype: object
The last 5 values of the title: 11536 Dylan hit by a ball [Slow motion remix]
11537 Trick or Treat! Funny Halloween Dog Video
11538 Framed Thief Prank
11539 Harry Redknapp gets hit by Fernando Torres
11540 ~600 Subs Special~ MMV- Enchanted
Name: title, dtype: object
print("The unique values of the view_counts column are: ", df['view_count'].unique()) # shows the unique values of a column
print("The number of values of the view_counts column are: ", df['view_count'].count()) # shows the number of values of a column
print("The number of unique values of the view_counts column are: ", df['view_count'].nunique()) # shows the number of unique values of a column
print(" ", len(list(set(df['view_count'])))) # Another way to find the number of unique elements in a column
The unique values of the view_counts column are: [2.2400000e+02 1.9190000e+03 1.4958231e+07 ... 1.3728710e+06 1.7410000e+03
5.9710000e+03]
The number of values of the view_counts column are: 11532
The number of unique values of the view_counts column are: 9310
9319
Typically, it is faster to use the Pandas’s built-in functions.
%%timeit
df['view_count'].nunique()
315 μs ± 1.16 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
len(list(set(df['view_count'])))
1.27 ms ± 5.98 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Filtering the DataFrame#
df[df['view_count'] > 1000000] # shows the rows that have a view_count greater than 1,000,000
| video_id | title | view_count | like_count | comment_count | duration | |
|---|---|---|---|---|---|---|
| 2 | vzpD6OogahQ | Potter Puppet Pals: School Is For Losers | 14958231.0 | 87806.0 | 11321.0 | PT33S |
| 18 | rQjr5xjugUM | Ah meu Deus do céu, Quebraram(quebraru) a mesa... | 4364246.0 | 37822.0 | NaN | PT1M15S |
| 23 | U2_CSYI8O0Y | Kud - Koninginnemarkt | 1835025.0 | 15321.0 | 1829.0 | PT50S |
| 25 | TugslL45aXk | Don't argue with an ibex | 4677355.0 | 34588.0 | 3597.0 | PT1M30S |
| 39 | TXtP8UfIfEA | JibJab Valentine's Day eCards.flv_(360p).mp4 | 20272233.0 | 19727.0 | 695.0 | PT47S |
| ... | ... | ... | ... | ... | ... | ... |
| 11506 | a91eiu_eer4 | Optical Illusion Girlfriend | 6239948.0 | 61008.0 | 3554.0 | PT1M24S |
| 11516 | 59xYAKjCnIo | SUPER TIMOR | 1131909.0 | 5171.0 | 261.0 | PT50S |
| 11530 | q-RLqLx1iYI | Epic and Honest Mobile Home Commercial | 5594751.0 | 29208.0 | 2779.0 | PT1M20S |
| 11531 | Ej9wCko6dR4 | Bud Light Jackie Moon Commercial | 2154189.0 | 3547.0 | 349.0 | PT1M21S |
| 11538 | dHuPgeClQC0 | Framed Thief Prank | 1372871.0 | 10009.0 | 220.0 | PT1M28S |
1475 rows × 6 columns
df['view_count'] > 1000000 # returns a boolean Series
0 False
1 False
2 True
3 False
4 False
...
11536 False
11537 False
11538 True
11539 False
11540 False
Name: view_count, Length: 11541, dtype: bool
df[df['comment_count'].isin([5, 10, 1000])] # shows the rows that have a comment count that is either 5, 10, or 1000 greater than 1,000,000
| video_id | title | view_count | like_count | comment_count | duration | |
|---|---|---|---|---|---|---|
| 28 | rBVoffKg6QA | Talking Cats (High Version) | 42287.0 | 91.0 | 10.0 | PT56S |
| 54 | jDD8yq18Gaw | 19:57 From Euston - WHAT REALLY HAPPENED | 19296.0 | 22.0 | 10.0 | PT34S |
| 109 | XqkjPnrqbSc | dahk fokaha tarjama ( chehal fi Sa3a ) | 85511.0 | 101.0 | 5.0 | PT1M13S |
| 149 | v-kHJjqLya4 | Funny Dog Commercial | 65501.0 | 104.0 | 10.0 | PT1M1S |
| 201 | -3g_ENHcvGY | Grave yard Smash | 1369.0 | 4.0 | 5.0 | PT33S |
| ... | ... | ... | ... | ... | ... | ... |
| 11174 | AxQK9pp8nVc | louie c.k. balls in your pus-hole | 2765.0 | 8.0 | 5.0 | PT39S |
| 11313 | -UzGgGJkIFU | JB SKETCH BAD BOY | 14321.0 | 22.0 | 10.0 | PT1M4S |
| 11392 | DiA8VItyPIk | Tryin first gear wheelies | 1181.0 | 8.0 | 5.0 | PT1M13S |
| 11396 | NhHRfS5bWug | Natanek 2013 AD | 614.0 | 3.0 | 5.0 | PT1M30S |
| 11429 | 4DQxWeTuG7I | Family Guy Peter and the Osama Clawed | 3195.0 | 19.0 | 10.0 | PT54S |
238 rows × 6 columns
df[(df['view_count'] > 1000000) & (df['comment_count'] > 1000)] # shows the rows that have a view_count greater than 1,000,000 and a comment_count greater than 1000
| video_id | title | view_count | like_count | comment_count | duration | |
|---|---|---|---|---|---|---|
| 2 | vzpD6OogahQ | Potter Puppet Pals: School Is For Losers | 14958231.0 | 87806.0 | 11321.0 | PT33S |
| 23 | U2_CSYI8O0Y | Kud - Koninginnemarkt | 1835025.0 | 15321.0 | 1829.0 | PT50S |
| 25 | TugslL45aXk | Don't argue with an ibex | 4677355.0 | 34588.0 | 3597.0 | PT1M30S |
| 46 | D7_0SOTQLIQ | Notorious B.I.G. calms down crying baby - orig... | 5977394.0 | 40051.0 | 3408.0 | PT1M24S |
| 75 | ao-9B8IV9_E | Funny Animals | 16529238.0 | 18952.0 | 3217.0 | PT10M34S |
| ... | ... | ... | ... | ... | ... | ... |
| 11475 | fKvXiBVK-K4 | Šta reći? | 3050087.0 | 13179.0 | 2288.0 | PT36S |
| 11476 | T7EzZBlQsoA | Gerald Anderson and Jose Manalo budoy | 3573758.0 | 4975.0 | 1210.0 | PT1M24S |
| 11490 | -iYGAcn66o4 | Attempt To Overtake Big Rig Ends In Devastatin... | 3914975.0 | 12173.0 | 4078.0 | PT30S |
| 11506 | a91eiu_eer4 | Optical Illusion Girlfriend | 6239948.0 | 61008.0 | 3554.0 | PT1M24S |
| 11530 | q-RLqLx1iYI | Epic and Honest Mobile Home Commercial | 5594751.0 | 29208.0 | 2779.0 | PT1M20S |
721 rows × 6 columns
One issue to be aware of is that when you filter a DataFrame, the index of the original DataFrame is preserved.
df[df["view_count"] % 2 == 0] # get all rows where the view_count is even
| video_id | title | view_count | like_count | comment_count | duration | |
|---|---|---|---|---|---|---|
| 0 | DE1-cD3pTkA | Walker Texas Ranger clip LIVE | 224.0 | 1.0 | 0.0 | PT32S |
| 3 | _OpzEHBDwQE | Very Funny Commercial Think Beyond Your Ear (... | 61696.0 | 179.0 | 2.0 | PT38S |
| 4 | yzGWOpop6i8 | Happy Valentines Day | 15852.0 | 315.0 | 77.0 | PT1M14S |
| 6 | DOD38Y72zyk | Minecraft: TNT Rocket | 2174.0 | 16.0 | 3.0 | PT1M9S |
| 8 | c1NaGio2Geo | Caller X episode 3 | 13716.0 | 525.0 | 130.0 | PT1M18S |
| ... | ... | ... | ... | ... | ... | ... |
| 11524 | gJyAigxDWBI | Knowwheretorun1984 set ablaze by Jordan Maxwel... | 158.0 | 3.0 | 2.0 | PT43S |
| 11529 | Sn00l_ttSi4 | "Strangers on the Street" - QCC | 92.0 | 0.0 | 0.0 | PT1M12S |
| 11532 | XcVw-qHdufE | Trailer: Being Human - Series 4 Episode 2 - BB... | 24146.0 | 79.0 | 21.0 | PT41S |
| 11535 | K-w_Ie-iBgY | Gears of War 3 Rage | 226.0 | 1.0 | 1.0 | PT50S |
| 11536 | HKMgwmqPOxs | Dylan hit by a ball [Slow motion remix] | 546.0 | 4.0 | 1.0 | PT1M21S |
5795 rows × 6 columns
# get the 1 element of the view_count column
print("The 2nd element of the view_count column is ", df.loc[1, 'view_count'])
df[df["view_count"] % 2 == 0][1] # produces an error since the item with index 1 does not have an even view_count
The 2nd element of the view_count column is 1919.0
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[23], line 3
1 # get the 1 element of the view_count column
2 print("The 2nd element of the view_count column is ", df.loc[1, 'view_count'])
----> 3 df[df["view_count"] % 2 == 0][1] # produces an error since the item with index 1 does not have an even view_count
File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: 1
df[df["view_count"] % 2 == 0].values[1] # use the values attribute to get the values in the filtered dataframe.
Statistics about the DataFrame#
df["view_count"].sum() # shows the mean of the view_count column
sum(df["view_count"]) # outputs nan because the column has missing values (listed at NaN)
sum(df["view_count"].dropna()) # outputs the sum of the view_count column without the missing values
Again, it is faster to use the Pandas’s built-in functions.
%%timeit
df["view_count"].sum()
%%timeit
sum(df["view_count"].dropna())
df["like_count"].max() # shows the max of the view_count column
df.describe() # shows the summary statistics of the DataFrame
Combining DataFrames#
df_3 = pd.DataFrame({
'A': [10, 11, 12],
'B': [13, 14, 15],
'C': ['xx', 'yy', 'zz']})
df_3
combined_df = pd.concat([df_1, df_2, df_3], axis=0) # concatenates the two DataFrames along the rows
combined_df
combined_df = pd.concat([df_1, df_2, df_3], axis=1) # concatenates the two DataFrames along the columns
combined_df
Task#
def compare_views(df, video_id1, video_id2):
"""Return True if the view count of video_id1 is greater than the view count of video_id2; otherwise, return False. If video_id1 or video_id2 is not in the DataFrame, raise a ValueError that outputs the string 'video_id is not in the DataFrame', where video_id is not present in the frame."""
# YOUR CODE HERE
raise NotImplementedError()
assert compare_views(df, 'vzpD6OogahQ', 'yzGWOpop6i8') == True
assert compare_views(df, 'yzGWOpop6i8', 'vzpD6OogahQ') == False
assert compare_views(df, 'DE1-cD3pTkA', 'XZqSz_X-j8Y') == False
assert compare_views(df, 'XZqSz_X-j8Y', 'DE1-cD3pTkA') == True
assert compare_views(df, 'yzGWOpop6i8', 'yzGWOpop6i8') == False
try:
compare_views(df, 'vzpD6OogahQ', 'not_in_df1')
except ValueError as e:
assert str(e) == 'not_in_df1 is not in the DataFrame'
try:
compare_views(df, 'not_in_df2', 'vzpD6OogahQ')
except ValueError as e:
assert str(e) == 'not_in_df2 is not in the DataFrame'