Web Scraping


Abstract

This article presents a simple Python program that performs basic web scraping. As an exercise, we retrieve the titles and view counts of YouTube videos, then store this information in a pandas DataFrame.


Introduction

Web scraping can be a powerful tool: many data science projects start by obtaining an appropriate data set, so why not draw on the sea of data that is the World Wide Web?
This article scrapes data from YouTube (the title and number of views of videos) using Python 3 and BeautifulSoup. The data are then stored in a DataFrame (pandas package) and written to an Excel file.

Robots exclusion standard

The robots exclusion standard, also known as robots.txt, is a standard used by websites to communicate with web robots. It specifies how to tell a web robot which areas of the website should not be scanned.

YouTube's robots.txt can be found here: https://www.youtube.com/robots.txt

After having a look at it, we can see that we are allowed to retrieve the desired data.
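
This kind of check can also be done in code. The following is a minimal sketch using urllib.robotparser from the Python standard library. The rules in SAMPLE_RULES are a made-up example (not a copy of YouTube's file), and is_allowed() is a helper name chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only (NOT YouTube's actual rules).
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, url, user_agent="*"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed here
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "https://example.com/watch?v=abc"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/page"))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.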

The method

Web scraping programs are simple. They consist of two steps:

  • Getting the HTML page
  • Parsing the HTML to get the targeted data

We will use the requests package to get the HTML and the popular BeautifulSoup package to pull data out of it. BeautifulSoup provides idiomatic ways of navigating, searching, and modifying the parse tree of HTML (or XML) data.

Once we have the HTML page, we need to locate the targeted data in the HTML code before using BeautifulSoup to extract it.

/!\ The HTML obtained programmatically can differ from what you see when viewing the source of the web page in a browser. The web server may recognise a web crawler and return a different response to your request.
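
To make the parsing step concrete, here is a small sketch that runs BeautifulSoup on a hand-written HTML fragment instead of a downloaded page. The fragment is hypothetical: it only mimics the tags this article looks for (a "div" with id="watch7-content" holding "meta" tags), and a real page is much larger and may be structured differently. The helper name extract_video_info() is also just an illustration, and the sketch uses Python's built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment written by hand; it is not real YouTube output.
SAMPLE_HTML = """
<html><body>
  <div id="watch7-content">
    <meta itemprop="name" content="Sample video title">
    <meta itemprop="interactionCount" content="12345">
  </div>
</body></html>
"""


def extract_video_info(html_text):
    """Return the title and view count stored in the meta tags of html_text."""
    soup = BeautifulSoup(html_text, "html.parser")
    # First narrow the search to the div that holds the metadata
    content_div = soup.find("div", {"id": "watch7-content"})
    # Then read the "content" attribute of each meta tag of interest
    title = content_div.find("meta", {"itemprop": "name"})["content"]
    views = content_div.find("meta", {"itemprop": "interactionCount"})["content"]
    return {"title": title, "number of views": views}


print(extract_video_info(SAMPLE_HTML))
```

Note that the attribute values come back as strings, so the view count would need an int() conversion before doing arithmetic on it.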

Web scraping

In [38]:
#### Importing the packages
from bs4 import BeautifulSoup
from requests import get
from pandas import DataFrame
In [39]:
#### Initialize a list of YouTube video URLs
URLS_LIST = ["https://www.youtube.com/watch?v=mbyG85GZ0PI&t=8s",
        "https://www.youtube.com/watch?v=bQI5uDxrFfA",
        "https://www.youtube.com/watch?v=Quh6x4kG6VY",
        "https://www.youtube.com/watch?v=b-yhKUINb7o",
        "https://www.youtube.com/watch?v=XUj5JbQihlU&t=146s",
        "https://www.youtube.com/watch?v=aircAruvnKk&t=524s",
        "https://www.youtube.com/watch?v=UzxYlbK2c7E&t=2291s",
        "https://www.youtube.com/watch?v=uXt8qF2Zzfo",
        "https://www.youtube.com/watch?v=_PwhiWxHK8o&t=1s"]
In [40]:
def webscrap_video_info(video_url):
    """
    Get the title and the number of views of a YouTube video
    :param video_url: link to the YouTube video (string)
    :return: a dictionary containing the title of the video and its number of views
    """

    # Get the HTML from the URL of the video. The 10-second timeout is an arbitrary
    # choice that avoids waiting forever; raise_for_status() fails early on HTTP errors.
    requests_response = get(video_url, timeout=10)
    requests_response.raise_for_status()
    html_data = requests_response.text

    # Organize HTML data in a BeautifulSoup object to take advantage of BeautifulSoup's search features.
    # We ask BeautifulSoup to parse the HTML with the package "lxml" (which must be installed) because
    # it is faster than the parser included in the standard Python library.
    html_soup = BeautifulSoup(html_data, "lxml")

    # Get the "div" tag with id="watch7-content" using BeautifulSoup's find() method
    watch7_content_div = html_soup.find("div", {"id": "watch7-content"})
    if watch7_content_div is None:
        # find() returns None when nothing matches, e.g. if the server returned a different page
        raise ValueError("No 'watch7-content' div found for " + video_url)

    # Get the "meta" tag contained in the selected "div" tag that holds the title of the video
    video_title_meta_tag = watch7_content_div.find("meta", {"itemprop": "name"})
    video_title_meta_tag_attributes_dictionary = video_title_meta_tag.attrs
    video_title = video_title_meta_tag_attributes_dictionary['content']

    # Get the "meta" tag contained in the selected "div" tag that holds the number of views of the video
    video_views_meta_tag = watch7_content_div.find("meta", {"itemprop": "interactionCount"})
    video_views_meta_tag_attributes_dictionary = video_views_meta_tag.attrs
    number_of_views = video_views_meta_tag_attributes_dictionary['content']

    return {'title': video_title, 'number of views': number_of_views}
In [41]:
def write_results_in_excel_file(file_path, results_list):
    """
    Creates an Excel file that contains the results (titles and number of views of the videos).
    The function transforms the list of results into a pandas DataFrame, then uses to_excel()
    to write the DataFrame to an Excel file.
    :param file_path: Excel file path
    :param results_list: the list of results (list of dictionaries containing the title and the number of views of each video)
    """

    # Initialization of 2 lists that will be used to create the Dataframe
    titles_list = []
    number_of_views_list = []

    # We loop on the list of results to fill the list of titles and the list of number of views
    for result_dictionary in results_list:
        video_title = result_dictionary['title']
        number_of_views = result_dictionary['number of views']

        titles_list.append(video_title)
        number_of_views_list.append(number_of_views)

    # Creating the DataFrame from a dictionary that contains the previous lists
    result_dataframe = DataFrame({"title": titles_list, "Views": number_of_views_list})
    # Exporting the DataFrame into an Excel file (file path given by the file_path argument)
    result_dataframe.to_excel(file_path, sheet_name="Youtube")
In [45]:
if __name__ == '__main__':
    # Let's test our functions

    # Initializing the result list
    results_list = []

    # We loop over the list of URLs and call the function webscrap_video_info()
    # with the URL of a video as input to get the title and the number of views of the video.
    for url in URLS_LIST:

        # We get the title and the number of views of the video in a dictionary
        video_info_dictionary = webscrap_video_info(url)

        # We store this dictionary in the list of results
        results_list.append(video_info_dictionary)

    # We write the results in an Excel file
    write_results_in_excel_file("youtube_videos.xlsx", results_list)
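
As a side note, pandas can build a DataFrame directly from a list of dictionaries, which would make the two intermediate lists unnecessary. The sketch below shows this under the assumption that every dictionary has the same two keys as the ones returned by webscrap_video_info(). The sample values are invented, and results_to_dataframe() is a hypothetical helper name, not part of the program above.

```python
from pandas import DataFrame

# Invented sample data, shaped like the dictionaries built by webscrap_video_info()
sample_results = [
    {"title": "First video", "number of views": "100"},
    {"title": "Second video", "number of views": "2500"},
]


def results_to_dataframe(results_list):
    """Build a DataFrame with one row per video and one column per dictionary key."""
    frame = DataFrame(results_list)
    # The scraped view counts are strings; convert them so they can be summed or sorted
    frame["number of views"] = frame["number of views"].astype(int)
    return frame


frame = results_to_dataframe(sample_results)
print(frame)
```

The resulting DataFrame can then be passed to to_excel() exactly as before.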
