In the Big Data world, web scraping and data extraction services are primary requisites for Big Data analytics. Pulling data from the web has become almost unavoidable for companies that want to stay competitive. The next question is how to go about web scraping as a beginner.
Data can be extracted, or scraped, from a web source using a number of methods. Popular websites like Google, Facebook, or Twitter offer APIs to view and extract the available data in a structured manner, which removes the need for other methods that the API provider may not prefer. However, the need to scrape a website arises when the information is not readily offered through such an interface. Python, an open source programming language, is often used for web scraping because of its simple syntax and rich ecosystem. A third-party library called “BeautifulSoup” is available for this task. Let’s take a deeper look into web scraping using Python.
Setting up a Python Environment:
To carry out web scraping using Python, you first have to install the Python environment, which lets you run code written in the Python language. The following libraries perform the data scraping:
Beautiful Soup is a convenient-to-use Python library and one of the finest tools for extracting information from a webpage. Professionals can scrape information from web pages in the form of tables, lists, or paragraphs, and filters can be added to extract specific information. Urllib2 is a Python module that can fetch URLs; it is used in combination with the BeautifulSoup library to fetch the web pages. Note that urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request.
For Mac OS X:
To install the Python libraries on Mac OS X, open a terminal window and type in the following commands, one command at a time:
pip install BeautifulSoup4
pip install lxml
For Windows 7 & 8 users:
Windows 7 & 8 users need to ensure that the Python environment is installed first. Once the environment is installed, open the command prompt, navigate to the root C:\ directory, and type in the installation commands one at a time. The pip commands shown for Mac OS X work in the Windows command prompt as well.
Once the libraries are installed, it is time to write data scraping code.
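Before writing any scraping code, it can help to confirm that the libraries are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration, and "bs4" is the import name under which the BeautifulSoup4 package is installed.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "bs4" is the import name of BeautifulSoup4; "lxml" is the parser installed earlier.
required = ["bs4", "lxml"]
missing = missing_packages(required)

if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All libraries are available.")
```

If a name is reported as missing, re-running the matching pip command is usually the first thing to try.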
Data scraping should be done with a distinct objective, such as scraping the current stock of a retail store. First, use a web browser to navigate to the website that contains this data, which typically appears in a table. After identifying the table, right-click anywhere on it and select “Inspect element” from the dropdown menu. This opens a window at the bottom or side of your screen displaying the website’s HTML code. You might need to scan through the HTML until you find the line of code that highlights the table on the webpage.
Python offers some other alternatives for HTML scraping apart from BeautifulSoup, for example the lxml library installed above, which can also parse HTML on its own, and the Scrapy framework.
Web scraping converts unstructured data from HTML code into a structured form, such as tabular data in an Excel worksheet. It can be done in many ways, ranging from the use of Google Docs to programming languages. People who do not have any programming knowledge or technical expertise can still acquire web data by using web scraping services that provide ready-to-use data from the websites of their choice.
To perform web scraping, users must have a sound knowledge of HTML tags. HTML links are defined using the anchor tag, i.e. the <a> tag: <a href="http://…">The link text goes here</a>. An HTML list is either an unordered list (<ul>) or an ordered list (<ol>), and each list item starts with <li>.
HTML tables are defined with <table>, rows with <tr>, and each row is divided into data cells with <td>.
<!DOCTYPE html>: An HTML document starts with a document type declaration
The visible part of the HTML document is enclosed between the <body> and </body> tags
The headings in HTML are defined using the heading tags from <h1> to <h6>
Paragraphs are defined with the <p> tag in HTML
An entire HTML document is contained between <html> and </html>
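To see how these tags map onto what a scraper reads, here is a small sketch that parses a made-up page. The HTML content, the example.com link, and the variable names are all invented for this illustration, and Python's built-in "html.parser" is used so that only BeautifulSoup itself is required.

```python
from bs4 import BeautifulSoup

# Made-up sample page (not from a real website) containing the tags listed above.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body>
<h1>Fruit stock</h1>
<p>Current stock levels.</p>
<a href="http://example.com/details">Details</a>
<ul><li>Apples</li><li>Pears</li></ul>
<table>
<tr><th>Item</th><th>Count</th></tr>
<tr><td>Apples</td><td>12</td></tr>
</table>
</body>
</html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

heading = soup.h1.get_text()                                 # text of the first <h1>
link_target = soup.a["href"]                                 # href attribute of the first <a>
list_items = [li.get_text() for li in soup.find_all("li")]   # every <li> item
cells = [td.get_text() for td in soup.find_all("td")]        # every <td> data cell

print(heading, link_target, list_items, cells)
```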
Using BeautifulSoup in Scraping:
While scraping a webpage using BeautifulSoup, the main concern is to identify the final objective. For instance, if you would like to extract a list from a webpage, a step-wise approach is required:
The first step is to import the required libraries:
#import the library used to query a website (urllib2 is Python 2 only; Python 3 uses urllib.request)
import urllib2
#specify the url
wiki = "https://"
#Query the website and return the html to the variable ‘page’
page = urllib2.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the ‘page’ variable with the lxml parser installed earlier, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
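The steps above target Python 2. On Python 3, the same fetch-and-parse step could look roughly like the sketch below, which assumes urllib.request in place of urllib2. The function name fetch_soup, the default parser, and the timeout value are choices made for this example only.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download `url` and return its HTML parsed into a BeautifulSoup object."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read()  # raw bytes; BeautifulSoup works out the encoding
    return BeautifulSoup(html, parser)


# Usage, with the `wiki` variable from the walkthrough holding the full page URL:
# soup = fetch_soup(wiki)
```

Passing parser="lxml" instead would use the lxml parser installed earlier.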
Use the “prettify” function to visualize the nested structure of the HTML page.
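As a small illustration, the sketch below calls prettify() on a made-up one-line HTML string (invented for this example). The result places each tag on its own line, indented according to its nesting depth.

```python
from bs4 import BeautifulSoup

# One-line HTML, hard to read as it stands (made-up example content).
html = "<html><body><ul><li>One</li><li>Two</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag per line, indented by nesting level.
pretty = soup.prettify()
print(pretty)
```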
Working with Soup tags:
soup.<tag>: Returns the content between the opening and closing tag, including the tag itself.
In: soup.title
Out: <title>List of Presidents in India and Brazil till 2010 – Wikipedia, the free encyclopedia</title>
soup.<tag>.string: Returns the string within the given tag.
In: soup.title.string
Out: u'List of Presidents in India and Brazil till 2010 – Wikipedia, the free encyclopedia'
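The difference between the tag and its string can be tried out on a tiny made-up document. The title text below is invented for this sketch and is not the page from the walkthrough.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Sample page title</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole element, including <title> and </title>
title_text = soup.title.string  # only the text inside the element

print(title_tag)    # <title>Sample page title</title>
print(title_text)   # Sample page title
```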
Find all the links within the page’s <a> tags: Links are marked up with the <a> tag. The option soup.a returns only the first link on the page; to get every link, use soup.find_all("a") and read each tag’s href attribute with get("href").
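A short sketch of link extraction on made-up HTML follows. The paths and variable names are invented for the example. It shows that soup.a yields only the first <a> tag, while find_all("a") yields all of them, including anchors that have no href.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two real links and one anchor without an href attribute.
html = (
    '<p><a href="/wiki/First">First</a> and <a href="/wiki/Second">Second</a>, '
    'plus <a name="top">an anchor without href</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a               # only the first <a> tag
all_links = soup.find_all("a")    # every <a> tag in the document

# get("href") returns None when the attribute is absent, so such anchors are skipped.
hrefs = [a.get("href") for a in all_links if a.get("href")]
print(hrefs)
```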
Find the right table:
Since we are looking for a table with information about Presidents in India and Brazil till 2010, identifying the right table first is important. Calling soup.find_all("table") returns the content enclosed in all table tags on the page.
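As a rough sketch of this step, the example below collects every table in a made-up fragment and lists each table's class attribute, which is what helps in picking the right one. The "infobox" table and the cell contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two tables carrying different class attributes.
html = """
<table class="infobox"><tr><td>Summary</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>Data</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

all_tables = soup.find_all("table")

# BeautifulSoup exposes the class attribute as a list of class names.
table_classes = [table.get("class") for table in all_tables]
print(table_classes)
```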
The “class” attribute of the table can be used to filter for the right one. Inspect the class name by right-clicking on the required table on the web page and choosing “Inspect element”.
Copy the class name, or find the class name of the right table in the last command’s output.
right_table = soup.find('table', class_='wikitable sortable plainrowheaders')
That’s how we can identify the right table.
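The lookup by class can be tried on a made-up fragment, as sketched below. A string containing several class names is matched against the class attribute exactly as written, so a CSS selector via select_one is shown as an alternative that does not depend on the order of the class names. The HTML and the variable name same_table are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Made-up fragment: only the second table carries the class copied from the page.
html = """
<table class="infobox"><tr><td>Summary</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>Data</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match against the full class string, as in the walkthrough.
right_table = soup.find("table", class_="wikitable sortable plainrowheaders")

# CSS selector alternative: matches a table that has all three classes, in any order.
same_table = soup.select_one("table.wikitable.sortable.plainrowheaders")

print(right_table.td.get_text(), same_table.td.get_text())
```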
Extract the information to a DataFrame: Iterate through each row (tr), assign each element of the row (td) to a variable, and append it to a list. It helps to first analyse the HTML structure of the table; the table headings sit in <th> elements and need to be extracted separately.
To access the value of each element, use the “find(text=True)” option on each element. Finally, the lists can be combined so that the data is in a DataFrame.
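The row-by-row extraction can be sketched as follows. The table contents are placeholder values invented for the example, not data from the page in the walkthrough, and get_text(strip=True) is used here as a stand-in for the “find(text=True)” option mentioned above.

```python
from bs4 import BeautifulSoup

# Placeholder table; names and years are made up for illustration only.
html = """
<table class="wikitable sortable plainrowheaders">
<tr><th>Number</th><th>Name</th><th>Took office</th></tr>
<tr><td>1</td><td>Example Person A</td><td>1950</td></tr>
<tr><td>2</td><td>Example Person B</td><td>1962</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
right_table = soup.find("table", class_="wikitable sortable plainrowheaders")

# Column names come from the <th> cells of the heading row.
headers = [th.get_text(strip=True) for th in right_table.find_all("th")]
columns = {header: [] for header in headers}

for row in right_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) != len(headers):
        continue  # skips the heading row and any irregular rows
    for header, cell in zip(headers, cells):
        columns[header].append(cell.get_text(strip=True))

print(columns)
```

With pandas installed, passing this dictionary to pandas.DataFrame(columns) gives the DataFrame. Real tables that mix <th> and <td> cells inside data rows would need the row handling adjusted.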
There are various other ways to scrape data using “BeautifulSoup” that reduce the manual effort of collecting data from web pages. Code written with BeautifulSoup is generally considered more robust than code based on regular expressions. The web scraping method discussed here uses the “BeautifulSoup” and “urllib2” libraries in Python. That was a brief beginner’s guide to start using Python for web scraping.