
P. Web scraping I

Bogdan Tudorache edited this page Feb 24, 2021 · 2 revisions

[Disclaimer]: in order to properly web scrape pages you need at least basic front-end HTML/CSS skills. You have to know what you are looking for and also where to find it.

As presented in a previous chapter, we used Selenium and a Chromium web driver to scrape pages, but I realised only after writing that post that I should have covered the fundamentals in more depth.

A. What is web scraping?

As per Wikipedia, web scraping is the process of extracting data from a website.

Although in any browser you can go to any webpage and simply press Control+S (or Command+S) to save everything to your desktop, we will strive for an automatic process that involves manual labour only while we write the code that does the work for us.

I am a big fan of automation and will never consider doing a manual task repetitively.

So by the end of this article you will know how to build a basic yet efficient web crawler (the bot/script used to automatically scrape pages).

You can find the example here: webcrawler.py , but before you can use it you need some prerequisites.

B. Requests

Even though we will use requests only to fetch data, this Python library can do a lot more, so definitely check out its documentation page.

Installing:

$ pip install requests
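As a quick illustration of what requests does under the hood, the sketch below (the URL and header are made up for the example) builds a GET request and inspects it before anything is sent over the network:

```python
import requests

# hypothetical header and URL, just for illustration
headers = {'user-agent': 'Mozilla/5.0'}

# prepare a GET request without actually sending it,
# so we can inspect exactly what requests would transmit
req = requests.Request('GET', 'http://example.com/page.html', headers=headers)
prepared = req.prepare()

print(prepared.method)                   # GET
print(prepared.url)                      # http://example.com/page.html
print(prepared.headers['user-agent'])    # Mozilla/5.0
```

When you call requests.get(), the library prepares and sends a request like this one in a single step.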

C. Beautifulsoup4

After we download our page we will want to parse its HTML, and for this we will install beautifulsoup4. For more information on this Python library, please check its documentation page.

Installing:

$ pip install beautifulsoup4
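Before pointing beautifulsoup4 at a live page, you can try it on a small hand-written HTML snippet (the snippet below is made up for the example); this sketch shows how tags are looked up once a page is parsed:

```python
from bs4 import BeautifulSoup

# a tiny hand-written HTML snippet, so no network request is needed
html = "<html><body><h1>Hello</h1><p class='intro'>First paragraph</p></body></html>"
page = BeautifulSoup(html, 'html.parser')

# .h1 returns the first <h1> tag; .text gives its inner text
print(page.h1.text)                          # Hello
# find() locates the first tag matching a name and attributes
print(page.find('p', class_='intro').text)   # First paragraph
```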

Now that we have all prerequisites we can get down to business.

D. Web crawler code breakdown

# importing both requests and bs4
import requests
from bs4 import BeautifulSoup as soup

# we need the header so that we don't get banned and the website treats our interaction as a regular user request
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}

# page address that we wish to scrape
berrynews = "http://berrynews.org/netherlands-en.html?"
# requesting the page from the web server
page = requests.get(berrynews, headers=headers)
# parsing the downloaded page into a BeautifulSoup object
htmlpage = soup(page.content, 'html.parser')

# I did not do this here, but you can also view what your page looks like with: print(htmlpage)
# and yes, you will see the same thing as F12 in the Chrome sidebar

# verifying that our web crawler retrieved the page
print("Print page header1:", htmlpage.h1)
print("Print text from page header1:", htmlpage.h1.text)
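Grabbing only the first h1 is rarely enough; with find_all you can collect every matching tag on the page. A minimal sketch, using a made-up HTML snippet in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# made-up page content standing in for page.content from the crawler above
html = """
<html><body>
<h1>Netherlands news</h1>
<a href="/story-1.html">Story one</a>
<a href="/story-2.html">Story two</a>
</body></html>
"""
page = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag, not just the first one
for link in page.find_all('a'):
    print(link['href'], '->', link.text)
```

The same pattern works on the real htmlpage object: swap the hardcoded snippet for the content you downloaded with requests.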

If you want to improve your skills and learn more about this in-demand and interesting subject, I recommend that you practice on Test Sites; this way you can practice safely and not get banned.


                                            **Congrats, you're done!**

Conclusion

We have learned what web scraping is and how to install the Python libraries needed for basic web scraping, and we went over the example web crawler.

If you hit a problem or have feedback (which is highly welcomed) please feel free to get in touch, more details in the footer.