Discover web scraping!

Hi gentlemen! Uh... are there some ladies here too? Hellooo!! It's a pleasure, isn't it?

In the previous episode, we built a quiz about superheroes! Today, we'll write a script that scrapes a website which, once again, is about superheroes. As you'll see, the script is short.

What do I need to know before writing this script?

Required knowledge

As usual, there are a few things you need to be familiar with to code this script:

  • list
  • dictionary
  • scraping (using BeautifulSoup and/or MechanicalSoup)
  • HTML and CSS

Before scraping a website, make sure the site's terms of use permit it. For this script, I downloaded and customized a template, then hosted it on Heroku: https://heroes-team.herokuapp.com/
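
One common way to check whether a site accepts scraping is to look at its robots.txt file. Here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt rules below are invented for the example and are not the real rules of the heroes website. Keep in mind that robots.txt is only one signal and that the site's terms of use still apply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# Against a live site you would call parser.set_url(...) and parser.read() instead.
sample_robots_txt = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(user_agent, url) tells us whether the rules allow this URL
home_allowed = parser.can_fetch('*', 'https://heroes-team.herokuapp.com/')
admin_allowed = parser.can_fetch('*', 'https://heroes-team.herokuapp.com/admin/')

print(f'Home page allowed: {home_allowed}')
print(f'Admin page allowed: {admin_allowed}')
```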

When we visit the website, we can see different sections of content such as:

  • credo section (presentation of team credo)
  • team section (heroes and stories)
  • last missions (accomplished by heroes)

We'll scrape the title and two of these sections:

  1. credo section
  2. team section

You're free to scrape the others! So let's go!!!

Web scraping in action!

Our script will be written in a file called heroes_scrape.py.

First, let's import the necessary modules.

Don't forget to install mechanicalsoup first.

heroes_scrape.py

import mechanicalsoup
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request

Then, save the website URL in a variable:

website_url = 'https://heroes-team.herokuapp.com/'

Let's code now...

Scrape with BeautifulSoup

Scrape website title

We want to get the title of the website using BeautifulSoup. Let's see how to do it.

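# Send a browser-like User-Agent, since some servers reject urllib's default one
heroes_url = Request(website_url, headers={'User-Agent': 'Mozilla/5.0'})

# Download the page, decode it to text, then parse it (the 'lxml' parser must be installed)
html_page = urlopen(heroes_url)
html_text = html_page.read().decode('utf-8')
html_beautifulsoup = BeautifulSoup(html_text, 'lxml')
print(html_beautifulsoup.title.string)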

Yeah! It works. Now, let's look at mechanicalsoup.

Scrape with mechanicalsoup

Now let's use mechanicalsoup for web scraping. Start by creating a new browser object.

my_browser = mechanicalsoup.Browser()
page = my_browser.get(website_url)
html_mechanicalsoup = page.soup

Now, let's scrape the credo section.

team_credo = html_mechanicalsoup.select('.service-heading')
credo_words = [] # it will contain credo values

for credo in team_credo:
    credo_words.append(credo.string)

credo_s = ','.join(credo_words)
print(f'Team heroes credo is: {credo_s.replace(",", "-")}')

NB: I use replace to format the credo.
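
As a side note, the same output can probably be produced in one step by joining directly with a dash. A small sketch, using made-up credo words in place of the scraped values:

```python
# Hypothetical credo words standing in for the scraped values
credo_words = ['Courage', 'Honor', 'Teamwork']

# Join directly with the separator we want in the final output
credo_s = '-'.join(credo_words)
print(f'Team heroes credo is: {credo_s}')
```

The two approaches only differ if a credo word itself contains a comma, because replace would turn that comma into a dash too.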

Scrape team section

Here, we'll only collect data on the first three heroes:

  • Samsnison
  • Thor
  • Green Lantern

I hope you'll continue by scraping the remaining heroes' data.

At this point, let's take a minute to think about our data structure. First, we want to get the "warrior name". Next, the profile name and the story. Finally, we get the rest of the data:

  • origin
  • power
  • characteristics
  • team

Now we know what we want. Time to organize it. We'll use a dict to give each hero a clean structure, and a list to hold all of them. Here is an example of our structure:

heroe = {
    'name': ...,
    'profile': ...,
    'story': ...,
    # the script below stores these four values as a list, in this order
    'rest_of_datas': {
        'origin': ...,
        'power': ...,
        'characteristics': ...,
        'team': ...
    }
}

Let's start by defining the variables that hold the data we want:

heroes_list = []
heroe_data = {}

heroes_portfolio = html_mechanicalsoup.select('.portfolio-caption')
heroe_name = html_mechanicalsoup.select('.portfolio-caption h4')
profile_name = html_mechanicalsoup.select('.modal-body h2')
heroe_story = html_mechanicalsoup.select('.modal-body .story')
heroe_origin = html_mechanicalsoup.select('.rest_of_datas > li:nth-of-type(1)')
heroe_power = html_mechanicalsoup.select('.rest_of_datas > li:nth-of-type(2)')
heroe_caracteritics = html_mechanicalsoup.select('.rest_of_datas > li:nth-of-type(3)')
heroe_team = html_mechanicalsoup.select('.rest_of_datas > li:nth-of-type(4)')

This code isn't difficult to understand; we just need to know CSS selectors. Indeed, the soup object that mechanicalsoup gives us supports CSS selectors through select().

Now let's fill our list with the heroes' data.
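
To see what these selectors match, here is a small offline sketch. The HTML fragment and its values are invented for the example and only loosely modeled on the class names used above. It uses BeautifulSoup directly with the built-in 'html.parser', so it needs no network access.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example
sample_html = """
<div class="modal-body">
  <h2>God of Thunder</h2>
  <ul class="rest_of_datas">
    <li>Origin: Asgard</li>
    <li>Power: Lightning</li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# '.modal-body h2' matches every <h2> located anywhere inside .modal-body
profile = soup.select('.modal-body h2')

# '.rest_of_datas > li:nth-of-type(n)' matches an <li> that is a direct child
# of .rest_of_datas and is the n-th <li> among its siblings (counting from 1)
origin = soup.select('.rest_of_datas > li:nth-of-type(1)')
power = soup.select('.rest_of_datas > li:nth-of-type(2)')

print(profile[0].string)
print(origin[0].string)
print(power[0].string)
```

Since select() always returns a list, we index with [0] to get the first match.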

for x in range(0, len(heroes_portfolio)):
    heroe_data = {
        'name': heroe_name[x].string,
        'profile': profile_name[x].string,
        'story': heroe_story[x].string,
        'rest_of_datas': [
            heroe_origin[x].string,
            heroe_power[x].string,
            heroe_caracteritics[x].string,
            heroe_team[x].string
        ]
    }

    heroes_list.append(heroe_data)
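
An alternative to indexing with x is zip(), which walks several lists in parallel. Here is a minimal sketch with plain strings standing in for the scraped elements. The names and values are placeholders, not data from the website.

```python
# Hypothetical stand-ins for the lists returned by select()
names = ['Hero A', 'Hero B']
profiles = ['Profile A', 'Profile B']
stories = ['Story A', 'Story B']

heroes = []
# zip() yields one (name, profile, story) tuple per position
# and stops as soon as the shortest list runs out
for name, profile, story in zip(names, profiles, stories):
    heroes.append({'name': name, 'profile': profile, 'story': story})

print(heroes[1]['name'])
```

The trade-off: indexing raises an IndexError if one list is shorter than the others, while zip() silently stops early, so pick whichever failure mode you prefer.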

On the website, Thor is at position 2, so let's display his data:

print(heroes_list[2])

😱 Ugh!!! The result is ugly! However, at least we have the right result. Now, comment out the previous print line and let's add an aesthetic touch to the display.

def display_heroe_data(heroe_number):
    the_heroe = heroes_list[heroe_number]

    print(f'\nPresentation of "{the_heroe["name"]}"')
    print(f'-> {the_heroe["profile"]}\n\
-> Story: {the_heroe["story"].rstrip()}\n\
-> Others informations: \n\
    ** {the_heroe["rest_of_datas"][0]}\n\
    ** {the_heroe["rest_of_datas"][1]}\n\
    ** {the_heroe["rest_of_datas"][2]}\n\
    ** {the_heroe["rest_of_datas"][3]}\n\
    ')

display_heroe_data(2) # Display Thor data
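
The same text could also be built with a loop over rest_of_datas, so the display keeps working if that list grows. This is only a sketch: format_heroe is a hypothetical helper that is not part of the original script, and the record is invented for the example.

```python
def format_heroe(heroe):
    """Return the presentation text for one hero dictionary."""
    lines = [
        f'Presentation of "{heroe["name"]}"',
        f'-> {heroe["profile"]}',
        f'-> Story: {heroe["story"].rstrip()}',
        '-> Other information:',
    ]
    # One line per extra detail, however many there are
    for detail in heroe['rest_of_datas']:
        lines.append(f'    ** {detail}')
    return '\n'.join(lines)


# Hypothetical record, invented for this example
sample_heroe = {
    'name': 'Hero A',
    'profile': 'Profile A',
    'story': 'Story A   ',
    'rest_of_datas': ['Origin: X', 'Power: Y'],
}
print(format_heroe(sample_heroe))
```

Returning the string instead of printing it inside the function also makes the formatting easier to test.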

😎 Yeah! Now the display looks better than before. Next, we'll let the user choose which hero to get data about. The choice will be limited to the first three heroes. Feel free to change that and extend it to all the heroes presented.

Before continuing, comment out the code we used to display "Thor".

limit_of_heroes = 3

print('WELCOME TO SCRIPTING DAY!')
print('Today, we look at web scraping. The website URL is:\n\
https://heroes-team.herokuapp.com/')

i = 1
print('**Heroes list**')
for heroe in heroes_list:
    if i <= limit_of_heroes:
        print(f'{i}-{heroe["name"]}')
    i += 1

user_choice = input('Choose the number of the hero you want information about: ')
user_choice = int(user_choice)

# Valid choices go from 1 to limit_of_heroes; 0 would silently select the last hero
while user_choice < 1 or user_choice > limit_of_heroes:
    user_choice = int(input('Choose the number of the hero you want information about: '))

display_heroe_data(user_choice-1) # subtract 1 because list indexes start at 0
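
Note that the loop above still crashes with a ValueError if the user types something that is not a number. Here is one possible way to handle that. ask_choice and its read parameter are hypothetical names introduced for this sketch, not part of the original script.

```python
def ask_choice(limit, read=input):
    """Keep asking until the user types a whole number between 1 and limit."""
    while True:
        answer = read(f'Choose a hero number (1-{limit}): ')
        try:
            choice = int(answer)
        except ValueError:
            # int() raises ValueError for input such as 'abc' or ''
            print('Please type a number.')
            continue
        if 1 <= choice <= limit:
            return choice
        print(f'Please choose a number between 1 and {limit}.')
```

In the script, you could then write user_choice = ask_choice(limit_of_heroes). The read parameter only exists so the function can be tried out without a keyboard, by passing a function that returns prepared answers.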

TADA!!! That's all. Congratulations! We scraped a website! Now, with practice, you can scrape small pieces of information from a website. Note that this is only the basics of web scraping. Other, more powerful tools are available. If you want to go further with web scraping, you'll need to learn its advanced concepts. I recommend learning more about Scrapy.

Of course, you can see the complete script on my GitHub!

Here is the result I get when I run the script!

Okay, so have a nice day, or afternoon, or night 😂, and see you next time for another scripting day episode!
