Tutorial 3: Brief Introduction to Pandas

Tutorial 3: Brief Introduction to Pandas#

This tutorial will provide a very brief introduction to the Pandas library. Pandas is a powerful data manipulation library for Python.

For a more in-depth introduction to Pandas, read the Pandas Documentation.

Import the Pandas library and read a dataset#

The first step is to read a dataset into a Pandas DataFrame. A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).

import pandas as pd 

url = 'https://raw.githubusercontent.com/epacuit/introduction-machine-learning/refs/heads/main/tutorials/comedy_comparisons_metadata.csv'

df = pd.read_csv(url)

type(df) # the type of the object is a DataFrame

pandas.core.frame.DataFrame

df.head() # shows the first 5 rows of the DataFrame

	video_id	title	view_count	like_count	comment_count	duration
0	DE1-cD3pTkA	Walker Texas Ranger clip LIVE	224.0	1.0	0.0	PT32S
1	XZqSz_X-j8Y	Egg mcmuffin of a crappy commercial	1919.0	7.0	3.0	PT44S
2	vzpD6OogahQ	Potter Puppet Pals: School Is For Losers	14958231.0	87806.0	11321.0	PT33S
3	_OpzEHBDwQE	Very Funny Commercial Think Beyond Your Ear (...	61696.0	179.0	2.0	PT38S
4	yzGWOpop6i8	Happy Valentines Day	15852.0	315.0	77.0	PT1M14S

df.columns # lists the columns of the DataFrame

Index(['video_id', 'title', 'view_count', 'like_count', 'comment_count',
       'duration'],
      dtype='object')

df.shape # shows the shape of the DataFrame (rows, columns)

(11541, 6)

len(df) # shows the number of rows in the DataFrame

Creating a DataFrame#

df_1 = pd.DataFrame({
    'A': [1, 2, 3], 
    'B': [4, 5, 6],
    'C': ['x', 'y', 'z']})

df_1

	A	B	C
0	1	4	x
1	2	5	y
2	3	6	z

df_2 = pd.DataFrame(
    [[1, 4, 'x'], [2, 5, 'y'], [3, 6, 'z']], 
    columns=['A', 'B', 'C'])
df_2

	A	B	C
0	1	4	x
1	2	5	y
2	3	6	z

Selecting Columns#

The next step is to filter the DataFrame to select specific rows or columns.

df['title'] # A column can be accessed by using the column name as a key

                          Walker Texas Ranger clip LIVE
                    Egg mcmuffin of a crappy commercial
               Potter Puppet Pals: School Is For Losers
      Very Funny Commercial  Think Beyond Your Ear (...
                                   Happy Valentines Day
                               ...                        
            Dylan hit by a ball [Slow motion remix]
          Trick or Treat! Funny Halloween Dog Video
                                 Framed Thief Prank
         Harry Redknapp gets hit by Fernando Torres
                  ~600 Subs Special~ MMV- Enchanted
Name: title, Length: 11541, dtype: object

type(df['title']) # the type of a column is a Series

pandas.core.series.Series

titles = df['title'] # the column can be stored in a variable

titles.head() # shows the first 5 rows of the Series

                      Walker Texas Ranger clip LIVE
                Egg mcmuffin of a crappy commercial
           Potter Puppet Pals: School Is For Losers
  Very Funny Commercial  Think Beyond Your Ear (...
                               Happy Valentines Day
Name: title, dtype: object

print(f"The 11th title is {titles[10]}") # a value in a Series can be accessed by using the index as a key

print(f'The 11th title is {df["title"][10]}') # you can also access a value in a Series by using the DataFrame and column name

print("\nThe first 5 values of the title: ", titles[0:5]) # A Series object can be sliced like a list

print("\nThe last 5 values of the title: ", titles[-5:]) # A Series object can be sliced like a list

The 11th title is PCD Flying Fun Cam
The 11th title is PCD Flying Fun Cam

The first 5 values of the title:  0                        Walker Texas Ranger clip LIVE
1                  Egg mcmuffin of a crappy commercial
2             Potter Puppet Pals: School Is For Losers
3    Very Funny Commercial  Think Beyond Your Ear (...
4                                 Happy Valentines Day
Name: title, dtype: object

The last 5 values of the title:  11536       Dylan hit by a ball [Slow motion remix]
11537     Trick or Treat! Funny Halloween Dog Video
11538                            Framed Thief Prank
11539    Harry Redknapp gets hit by Fernando Torres
11540             ~600 Subs Special~ MMV- Enchanted
Name: title, dtype: object

print("The unique values of the view_counts column are: ", df['view_count'].unique()) # shows the unique values of a column

print("The number of values of the view_counts column are: ", df['view_count'].count()) # shows the number of values of a column

print("The number of unique values of the view_counts column are: ", df['view_count'].nunique()) # shows the number of unique values of a column

print(" ", len(list(set(df['view_count'])))) # Another way to find the number of unique elements in a column

The unique values of the view_counts column are:  [2.2400000e+02 1.9190000e+03 1.4958231e+07 ... 1.3728710e+06 1.7410000e+03
 5.9710000e+03]
The number of values of the view_counts column are:  11532
The number of unique values of the view_counts column are:  9310
  9319

Typically, it is faster to use the Pandas’s built-in functions.

%%timeit

df['view_count'].nunique()

315 μs ± 1.16 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit 

len(list(set(df['view_count'])))

1.27 ms ± 5.98 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Filtering the DataFrame#

df[df['view_count'] > 1000000] # shows the rows that have a view_count greater than 1,000,000

	video_id	title	view_count	like_count	comment_count	duration
2	vzpD6OogahQ	Potter Puppet Pals: School Is For Losers	14958231.0	87806.0	11321.0	PT33S
18	rQjr5xjugUM	Ah meu Deus do céu, Quebraram(quebraru) a mesa...	4364246.0	37822.0	NaN	PT1M15S
23	U2_CSYI8O0Y	Kud - Koninginnemarkt	1835025.0	15321.0	1829.0	PT50S
25	TugslL45aXk	Don't argue with an ibex	4677355.0	34588.0	3597.0	PT1M30S
39	TXtP8UfIfEA	JibJab Valentine's Day eCards.flv_(360p).mp4	20272233.0	19727.0	695.0	PT47S
...	...	...	...	...	...	...
11506	a91eiu_eer4	Optical Illusion Girlfriend	6239948.0	61008.0	3554.0	PT1M24S
11516	59xYAKjCnIo	SUPER TIMOR	1131909.0	5171.0	261.0	PT50S
11530	q-RLqLx1iYI	Epic and Honest Mobile Home Commercial	5594751.0	29208.0	2779.0	PT1M20S
11531	Ej9wCko6dR4	Bud Light Jackie Moon Commercial	2154189.0	3547.0	349.0	PT1M21S
11538	dHuPgeClQC0	Framed Thief Prank	1372871.0	10009.0	220.0	PT1M28S

1475 rows × 6 columns

df['view_count'] > 1000000 # returns a boolean Series

      False
      False
       True
      False
      False
         ...  
  False
  False
   True
  False
  False
Name: view_count, Length: 11541, dtype: bool

df[df['comment_count'].isin([5, 10, 1000])] # shows the rows that have a comment count that is either 5, 10, or 1000 greater than 1,000,000

	video_id	title	view_count	like_count	comment_count	duration
28	rBVoffKg6QA	Talking Cats (High Version)	42287.0	91.0	10.0	PT56S
54	jDD8yq18Gaw	19:57 From Euston - WHAT REALLY HAPPENED	19296.0	22.0	10.0	PT34S
109	XqkjPnrqbSc	dahk fokaha tarjama ( chehal fi Sa3a )	85511.0	101.0	5.0	PT1M13S
149	v-kHJjqLya4	Funny Dog Commercial	65501.0	104.0	10.0	PT1M1S
201	-3g_ENHcvGY	Grave yard Smash	1369.0	4.0	5.0	PT33S
...	...	...	...	...	...	...
11174	AxQK9pp8nVc	louie c.k. balls in your pus-hole	2765.0	8.0	5.0	PT39S
11313	-UzGgGJkIFU	JB SKETCH BAD BOY	14321.0	22.0	10.0	PT1M4S
11392	DiA8VItyPIk	Tryin first gear wheelies	1181.0	8.0	5.0	PT1M13S
11396	NhHRfS5bWug	Natanek 2013 AD	614.0	3.0	5.0	PT1M30S
11429	4DQxWeTuG7I	Family Guy Peter and the Osama Clawed	3195.0	19.0	10.0	PT54S

238 rows × 6 columns

df[(df['view_count'] > 1000000) & (df['comment_count'] > 1000)] # shows the rows that have a view_count greater than 1,000,000 and a comment_count greater than 1000

	video_id	title	view_count	like_count	comment_count	duration
2	vzpD6OogahQ	Potter Puppet Pals: School Is For Losers	14958231.0	87806.0	11321.0	PT33S
23	U2_CSYI8O0Y	Kud - Koninginnemarkt	1835025.0	15321.0	1829.0	PT50S
25	TugslL45aXk	Don't argue with an ibex	4677355.0	34588.0	3597.0	PT1M30S
46	D7_0SOTQLIQ	Notorious B.I.G. calms down crying baby - orig...	5977394.0	40051.0	3408.0	PT1M24S
75	ao-9B8IV9_E	Funny Animals	16529238.0	18952.0	3217.0	PT10M34S
...	...	...	...	...	...	...
11475	fKvXiBVK-K4	Šta reći?	3050087.0	13179.0	2288.0	PT36S
11476	T7EzZBlQsoA	Gerald Anderson and Jose Manalo budoy	3573758.0	4975.0	1210.0	PT1M24S
11490	-iYGAcn66o4	Attempt To Overtake Big Rig Ends In Devastatin...	3914975.0	12173.0	4078.0	PT30S
11506	a91eiu_eer4	Optical Illusion Girlfriend	6239948.0	61008.0	3554.0	PT1M24S
11530	q-RLqLx1iYI	Epic and Honest Mobile Home Commercial	5594751.0	29208.0	2779.0	PT1M20S

721 rows × 6 columns

One issue to be aware of is that when you filter a DataFrame, the index of the original DataFrame is preserved.

df[df["view_count"] % 2 == 0] # get all rows where the view_count is even

	video_id	title	view_count	like_count	comment_count	duration
0	DE1-cD3pTkA	Walker Texas Ranger clip LIVE	224.0	1.0	0.0	PT32S
3	_OpzEHBDwQE	Very Funny Commercial Think Beyond Your Ear (...	61696.0	179.0	2.0	PT38S
4	yzGWOpop6i8	Happy Valentines Day	15852.0	315.0	77.0	PT1M14S
6	DOD38Y72zyk	Minecraft: TNT Rocket	2174.0	16.0	3.0	PT1M9S
8	c1NaGio2Geo	Caller X episode 3	13716.0	525.0	130.0	PT1M18S
...	...	...	...	...	...	...
11524	gJyAigxDWBI	Knowwheretorun1984 set ablaze by Jordan Maxwel...	158.0	3.0	2.0	PT43S
11529	Sn00l_ttSi4	"Strangers on the Street" - QCC	92.0	0.0	0.0	PT1M12S
11532	XcVw-qHdufE	Trailer: Being Human - Series 4 Episode 2 - BB...	24146.0	79.0	21.0	PT41S
11535	K-w_Ie-iBgY	Gears of War 3 Rage	226.0	1.0	1.0	PT50S
11536	HKMgwmqPOxs	Dylan hit by a ball [Slow motion remix]	546.0	4.0	1.0	PT1M21S

5795 rows × 6 columns

# get the 1 element of the view_count column
print("The 2nd element of the view_count column is ", df.loc[1, 'view_count'])
df[df["view_count"] % 2 == 0][1] # produces an error since the item with index 1 does not have an even view_count

The 2nd element of the view_count column is  1919.0

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[23], line 3
      1 # get the 1 element of the view_count column
      2 print("The 2nd element of the view_count column is ", df.loc[1, 'view_count'])
----> 3 df[df["view_count"] % 2 == 0][1] # produces an error since the item with index 1 does not have an even view_count

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 1

df[df["view_count"] % 2 == 0].values[1] # use the values attribute to get the values in the filtered dataframe.

Statistics about the DataFrame#

df["view_count"].sum() # shows the mean of the view_count column

sum(df["view_count"]) # outputs nan because the column has missing values (listed at NaN)

sum(df["view_count"].dropna()) # outputs the sum of the view_count column without the missing values

Again, it is faster to use the Pandas’s built-in functions.

%%timeit

df["view_count"].sum()  

%%timeit 

sum(df["view_count"].dropna())

df["like_count"].max() # shows the max of the view_count column

df.describe() # shows the summary statistics of the DataFrame

Combining DataFrames#

df_3 = pd.DataFrame({
    'A': [10, 11, 12], 
    'B': [13, 14, 15],
    'C': ['xx', 'yy', 'zz']})

df_3

combined_df = pd.concat([df_1, df_2, df_3], axis=0) # concatenates the two DataFrames along the rows

combined_df

combined_df = pd.concat([df_1, df_2, df_3], axis=1) # concatenates the two DataFrames along the columns

combined_df

Task#

def compare_views(df, video_id1, video_id2): 
    """Return True if the view count of video_id1 is greater than the view count of video_id2; otherwise, return False. If video_id1 or video_id2 is not in the DataFrame, raise a ValueError that outputs the string 'video_id is not in the DataFrame', where video_id is not present in the frame."""
    
    # YOUR CODE HERE
    raise NotImplementedError()

assert compare_views(df, 'vzpD6OogahQ', 'yzGWOpop6i8') == True
assert compare_views(df, 'yzGWOpop6i8', 'vzpD6OogahQ') == False
assert compare_views(df, 'DE1-cD3pTkA', 'XZqSz_X-j8Y') == False
assert compare_views(df, 'XZqSz_X-j8Y', 'DE1-cD3pTkA') == True
assert compare_views(df, 'yzGWOpop6i8', 'yzGWOpop6i8') == False
try:
    compare_views(df, 'vzpD6OogahQ', 'not_in_df1')
except ValueError as e:
    assert str(e) == 'not_in_df1 is not in the DataFrame'
try:
    compare_views(df, 'not_in_df2', 'vzpD6OogahQ')
except ValueError as e:
    assert str(e) == 'not_in_df2 is not in the DataFrame'