Intro to Web Scraping - How to make your own dataset!

SAFE Research Data Center

Errikos Melissinos, Research Assistant - October 2020

Documentation:
YouTube video (94 mins)

Import Libraries

Requests

Requests (https://requests.readthedocs.io) is a library that allows your Python code to interact with websites. We are only going to use the functionality that downloads a web page from a given URL.

The only function that we will use from this library is .get(). As an example, I will use a template from https://www.w3schools.com.
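A minimal sketch of that download step, assuming the requests package is installed. The helper name fetch_page and the 10-second timeout are my own choices for illustration; the URL is the w3schools template page used in this webinar.

```python
import requests

# The w3schools template page used as the running example.
URL = "https://www.w3schools.com/howto/tryhow_make_a_website.htm"


def fetch_page(url, timeout=10):
    """Download a page and return its HTML source as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of silently continuing.
    response.raise_for_status()
    return response.text
```

Calling fetch_page(URL) should return the page source as a single string, which can then be handed to a parser.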

Detour to HTML

The languages that are usually used for the design of websites are HTML, JavaScript and CSS. Each of these has its own syntax and all three can be used together when designing a website.

HTML is used to define the structure of all the elements that the website has. JavaScript usually regulates the dynamic elements of websites. Finally, CSS is used for the styling of websites.

In our case, we will be taking advantage of the structure of HTML in order to navigate the pages that we are interested in. It is also possible to use the structure of CSS for the same purpose, but we will not focus on that here. So, let's look at the webpage that we downloaded above: https://www.w3schools.com/howto/tryhow_make_a_website.htm

We can take a look at the source code of the page within our browser by viewing the source of the page.

I cannot, of course, go into details about HTML in this webinar, but it is necessary to have a very basic idea of HTML's structure. The basic aspect that we will use is the structure of tagged elements and tagged elements within tagged elements.
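To make the idea of "tags within tags" concrete, here is a small sketch that prints the nesting of a made-up HTML snippet. It uses Python's built-in html.parser module rather than the libraries covered in this webinar, and both the snippet and the class name OutlinePrinter are invented for illustration.

```python
from html.parser import HTMLParser

# A made-up page: <title> sits inside <head>, <a> sits inside <p>, and so on.
SAMPLE_HTML = """
<html>
  <head><title>My page</title></head>
  <body>
    <h1>A heading</h1>
    <p>Some text with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""


class OutlinePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the closing tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlinePrinter()
parser.feed(SAMPLE_HTML)
for depth, tag in parser.outline:
    print("  " * depth + f"<{tag}>")
```

The indentation of the printed outline mirrors how deeply each element is nested, which is exactly the structure we navigate when scraping.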

Beautiful Soup

Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is the main library we will use that enables web scraping. Technically, it is a library for pulling data out of HTML and XML files.

So, now we can use BeautifulSoup to get into the website that we downloaded with Requests.

This allows us to navigate the webpage. We can access its tags as we would access the attributes of an object.
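A short sketch of this attribute-style navigation, assuming the beautifulsoup4 package is installed. To keep it runnable offline, it parses a small made-up HTML string instead of the downloaded page.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html>
  <head><title>My page</title></head>
  <body>
    <h1>A heading</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title         # the first <title> tag in the document
heading_tag = soup.body.h1     # chain tag names to go deeper
first_paragraph = soup.body.p  # only the first matching <p> is returned

print(title_tag)
print(heading_tag.text)
print(first_paragraph.text)
```

Note that this style only ever gives the first tag with a given name; to reach the second paragraph we need a search method such as find_all().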

Above we have mostly used the names of the tags in order to navigate the page. However, tags can also have certain attributes, and we can use these as well to find what we need.
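As a sketch of searching by attributes, the example below uses find() and find_all() on a made-up snippet. The ids, class names and contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with id and class attributes.
html = """
<div id="team">
  <p class="member">Alice</p>
  <p class="member">Bob</p>
  <p class="note">Updated weekly</p>
  <a href="/contact" class="link">Contact</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, find_all() returns a list of all matches.
team = soup.find("div", id="team")
# "class" is a reserved word in Python, so Beautiful Soup uses class_ instead.
members = team.find_all("p", class_="member")
member_names = [p.get_text() for p in members]

# Attributes can also be passed as a dictionary.
link = soup.find("a", attrs={"class": "link"})

print(member_names)
print(link["href"])
```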

Now, let's talk a little bit about the different kinds of information that we can access.
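One way to illustrate this is to look at a single tag and pull out its name, its attributes and its text. The snippet is made up, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# A made-up link that itself contains another tag.
html = '<p>Visit <a href="https://example.com" id="home">our <b>home</b> page</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

tag_name = link.name         # the name of the tag
attributes = link.attrs      # all attributes as a dictionary
target = link["href"]        # one attribute, accessed like a dictionary key
missing = link.get("title")  # .get() returns None if the attribute is absent
text = link.get_text()       # all text inside the tag, including nested tags

print(tag_name, attributes, target, missing, text, sep="\n")
```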

CSV

CSV (https://docs.python.org/3/library/csv.html) will help us save our files at the end.
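A minimal sketch of how the csv module writes and reads rows. It writes to an in-memory buffer instead of a file so that nothing is created on disk, and the rows are invented.

```python
import csv
import io

# Invented rows; the last one contains a comma inside a value.
rows = [
    ["name", "field"],
    ["Alice", "Finance"],
    ["Bob", "Labor, Health"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the original rows again.
read_back = list(csv.reader(io.StringIO(csv_text)))
```

The value containing a comma is quoted automatically, which is the main reason to use the csv module rather than joining strings by hand.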

Web Scraping in practice

These libraries (and an idea) are all we need to create our own simple dataset. This particular example is inspired by an initiative in which Prof. Guido Friebel of Goethe University also participates (https://women-economics.com).

Let's start:

After inspecting the webpage that we are interested in, we can proceed by finding information on one individual. Once we have figured this out we can go and make a loop around our code so that we get information on all the individuals.

In the following cell we will just navigate the page, in order to find the info that we are interested in:

Check that everything is fine

Now we can copy what we did above into a loop that will download the data for all individuals.
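The pattern could look roughly like the sketch below. The markup here is hypothetical and is not the structure of the actual website; the class names and the helper parse_person are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <div class="person"> per individual.
html = """
<div class="person"><h3>Alice Example</h3><span class="affiliation">University A</span></div>
<div class="person"><h3>Bob Example</h3><span class="affiliation">University B</span></div>
<div class="person"><h3>Carol Example</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")


def parse_person(block):
    """Extract the fields for one individual (the code we first wrote for a single case)."""
    name = block.h3.get_text(strip=True)
    affiliation_tag = block.find("span", class_="affiliation")
    # Not every entry has every field, so guard against a missing tag.
    affiliation = affiliation_tag.get_text(strip=True) if affiliation_tag else ""
    return {"name": name, "affiliation": affiliation}


# The loop: apply the same extraction to every individual on the page.
people = [parse_person(block) for block in soup.find_all("div", class_="person")]
print(people)
```

Wrapping the single-individual code in a function keeps the loop short and makes it easier to handle entries where a field is missing.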

We can check that everything worked as we expected:

Now we will move to something that is a little bit more involved as we will be picking up data from different links.

Again we start by exploring:
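When the data sits behind separate links, a first step is usually to collect those links from a listing page. The sketch below assumes a hypothetical listing page and base URL; it only builds the list of addresses and does not download anything.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and listing page.
BASE_URL = "https://example.com/people/"
listing_html = """
<ul>
  <li><a href="alice.html">Alice</a></li>
  <li><a href="/people/bob.html">Bob</a></li>
  <li><a>No link here</a></li>
</ul>
"""
soup = BeautifulSoup(listing_html, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute.
# urljoin turns relative links into full addresses.
profile_urls = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]
print(profile_urls)
```

Each address in profile_urls could then be downloaded and parsed in the same way as a single page.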

Reminder: Be careful how you use these requests. Certain websites may want to block you from web scraping. For others you may want to avoid these methods and use an API that is provided specifically for this purpose.
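One possible way to be careful is sketched below: check the site's robots.txt rules and pause between requests. The robots.txt content, the user agent name and the helper fetch_politely are assumptions for illustration. On a real site the rules would be read from the site's own robots.txt file, and following them does not replace checking the site's terms of use.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-research-bot"  # hypothetical name for our scraper

# Made-up robots.txt rules, parsed from a string so that no network is needed.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""
robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for every allowed URL, pausing between requests."""
    pages = {}
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # the site asks crawlers not to visit this address
        pages[url] = fetch(url)
        time.sleep(delay_seconds)  # do not flood the server with requests
    return pages
```

Here fetch can be any function that downloads one page, for example a small wrapper around requests.get().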

Again, we can check that everything worked as we expected:

Saving the data

Finally, we want to write this data to a file that we can use for our research.
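A sketch of this last step, assuming the scraped data has been collected as a list of dictionaries. The records, the file name people.csv and the helper save_records are invented for illustration; the file is written to the system's temporary directory.

```python
import csv
import os
import tempfile

# Invented records standing in for the scraped data.
people = [
    {"name": "Alice Example", "affiliation": "University A"},
    {"name": "Bob Example", "affiliation": "University B"},
]


def save_records(records, path, fieldnames):
    """Write a list of dictionaries to a CSV file with a header row."""
    # newline="" is recommended by the csv documentation when opening files.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


output_path = os.path.join(tempfile.gettempdir(), "people.csv")
save_records(people, output_path, ["name", "affiliation"])
print("Saved to", output_path)
```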

Thank you for participating!

Something Extra

Scrape Table to Pandas
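One possible way to do this is sketched below, assuming the pandas and beautifulsoup4 packages are installed. The table is made up, and building the DataFrame by hand is only one option.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A made-up HTML table.
html = """
<table>
  <tr><th>name</th><th>papers</th></tr>
  <tr><td>Alice</td><td>12</td></tr>
  <tr><td>Bob</td><td>7</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.table.find_all("tr")
# The first row holds the column names, the remaining rows hold the data.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

df = pd.DataFrame(data, columns=header)
# Everything scraped from HTML is text, so convert numeric columns explicitly.
df["papers"] = df["papers"].astype(int)
print(df)
```

pandas also provides read_html(), which can parse tables directly but needs an additional parser package such as lxml.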