Introduction to Scrapy

Scrapy is a powerful Python-based web crawling framework that lets a developer define how one or more websites should be scraped. Scrapy uses crawlers called Spiders, which can extract, process, and save the data. Because Scrapy is built on Twisted, an asynchronous networking framework, it sends requests in a non-blocking fashion and is therefore extremely fast.

 

Advantages of Scrapy

  • It has an in-built mechanism called Selectors to locate and extract data from a web page using XPath and CSS.
  • Scrapy does not need extensive coding like other frameworks. All you need to do is define the website and the data to be extracted; Scrapy handles most of the heavy work.
  • Scrapy is free, open source, and cross-platform.
  • It is fast, powerful, and easily extensible due to its asynchronous handling of requests.
  • It can conveniently build and scale crawlers for large projects.
  • Using Scrapy, we can crawl any web page, whether or not it exposes its raw data.
  • It consumes less memory and CPU than comparable libraries.
  • It supports exporting data to different formats such as CSV, XML, JSON, and JSON Lines.
  • It has strong community support for developers.

 

Installing Scrapy 

Scrapy supports both Python 2 and Python 3, but in this tutorial we will use Python 3. There are two main ways to install Scrapy. If you're using Anaconda, it can be installed from the conda-forge channel using the following command. Anaconda for Python can be downloaded from the Anaconda website.

 

conda install -c conda-forge scrapy 

 

The other way is to use pip, the package manager for Python. With pip, you can install Scrapy directly onto your computer using the command below.

pip install scrapy

Both of the above methods will install the latest versions of Scrapy. 

Creating a Scrapy Project

To create the project, first move to the folder where the project should live. This can be done with the command below.

 

cd path-to-your-project

 

Next, we will create a Scrapy project using the command below. We will name our project "scrapy_tutorial".

 

scrapy startproject scrapy_tutorial

 

Once the project is created, Scrapy generates a project folder and a configuration file.

The folder collates the different components of the crawler that will be created later.

When you enter the project folder, you can see the project structure and its supporting files. Let's have a look at them in detail.

 

  • spiders/ – This directory contains all the spiders, each written as a Python class. Whenever Scrapy is asked to run a spider, it looks for it in this folder.
  • items.py – Defines the containers that will be loaded with the scraped data.
  • middlewares.py – Contains the spider's processing mechanisms for handling requests and responses.
  • pipelines.py – Contains a set of Python classes for further processing of the scraped data.
  • settings.py – Holds any customized project settings.

 

Creating a Spider

A Spider is a class that contains the methodology to scrape and extract data from the defined site. In other words, it determines how the crawl is performed.

 

In order to create a Spider, we can use the command below.

 

scrapy genspider <spidername> <your-link-here>

 

For spidername, you can give your spider any name, and for the link, provide the URL of the site or domain you are going to scrape data from. In this tutorial, we will extract customer product reviews for the Apple iPhone X from Amazon.com.

 

We will name our spider reviewspider.

 

scrapy genspider reviewspider https://www.amazon.com/Apple-iPhone-256GB-Silver-T-Mobile/product-reviews/B07RV52TRF/ref=cm_cr_dp_d_show_all_btm?

 

Scrapy Shell

Scrapy shell is an interactive shell, similar to the Python shell, in which you can try out and debug your scraping code. Using this shell, you can test your XPath and CSS expressions and verify the data they extract without even having to run your spider. It is therefore a fast and valuable tool for development and debugging.

 

The Scrapy shell can be launched using the command below.

 

scrapy shell <url>

Identifying the HTML structure

Before we start coding our spider, we need to analyze how the web page is structured and identify its patterns.

Each review on the page contains a review text and a star rating. We will extract both pieces of information.

 

To view the HTML structure of the page, we can right-click and choose Inspect, or open the browser's Developer Tools.

 

According to the HTML structure above, all reviews are enclosed in a division with the id "cm_cr-review_list," which has a sub-division for each review.

When we further expand each review division, we can see there are separate blocks for each component of the review. Our focus will be on the star rating and the review text.

 

<i data-hook="review-star-rating" class="a-icon a-icon-star a-star-1 review-rating">
  <span class="a-icon-alt">1.0 out of 5 stars</span>
</i>

 

Here, star ratings are identified by the class "a-icon-alt".

 

<span data-hook="review-body" class="a-size-base review-text review-text-content">
  <span class="">The screen was cracked and the phone did not turn in after 24 hours of charging.</span>
</span>

 

The review text is identified by the class ‘a-size-base’. Next, we will use this information to define our Spider.
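You can sanity-check this kind of class-based lookup outside of Scrapy with Python's standard library alone. The sketch below is only an illustration (Scrapy's own Selectors are backed by full XPath support, which the stdlib does not have): xml.etree.ElementTree understands a small XPath subset, including the [@attr='value'] predicates used here, which is enough to pull the two values out of the review snippet shown above.

```python
import xml.etree.ElementTree as ET

# The review snippet shown above, wrapped in a root element so it parses as XML
snippet = """
<div>
  <i data-hook="review-star-rating" class="a-icon a-icon-star a-star-1 review-rating">
    <span class="a-icon-alt">1.0 out of 5 stars</span>
  </i>
  <span data-hook="review-body" class="a-size-base review-text review-text-content">
    <span class="">The screen was cracked and the phone did not turn in after 24 hours of charging.</span>
  </span>
</div>
"""

root = ET.fromstring(snippet)

# ElementTree supports a limited XPath subset, including [@attr='value'] predicates
star_rating = root.find(".//span[@class='a-icon-alt']").text
review_body = root.find(
    ".//span[@class='a-size-base review-text review-text-content']/span"
).text

print(star_rating)  # 1.0 out of 5 stars
```

Note that the class predicate matches the attribute value exactly, which is why the second lookup spells out the full class string, just as the spider's XPath will.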

 

Defining the Scrapy parser

 

When we open the Spider that we created in the earlier stage, we can see that there is a class created inside. 

 

# -*- coding: utf-8 -*-
import scrapy


class ReviewspiderSpider(scrapy.Spider):
    name = 'reviewspider'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/Apple-iPhone-256GB-Silver-T-Mobile/product-reviews/B07RV52TRF/ref=cm_cr_dp_d_show_all_btm?']

    def parse(self, response):
        pass

 

This is the basic template of a Spider. The "allowed_domains" and "start_urls" values are filled in from the link we provided when we created the spider.

 

The logic for extracting our data is written in the parse function, which fires when the spider lands on the page defined by "start_urls".

Scrapy can crawl multiple URLs simultaneously. To do this, identify the base URL, then identify the parts of the other URLs that need to be joined to it and append them using urljoin(). In this example, however, we will use only the base URL.

 

Below is the code written in the Scrapy parser to scrape the review data.

 

def parse(self, response):
    star_rating = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
    comments = response.xpath('//span[@class="a-size-base review-text review-text-content"]/span/text()').extract()

    for item in zip(star_rating, comments):
        # create a dictionary to store the scraped info
        scraped_data = {
            'Star Rating': item[0],
            'Rating Text': item[1],
        }

        # yield the scraped info to Scrapy
        yield scraped_data

 

Scrapy comes with its own mechanism, called Selectors, to extract data. Selectors use XPath and CSS expressions to select elements in HTML documents. The code above uses XPath as the Selector.

 

star_rating=response.xpath('//span[@class="a-icon-alt"]/text()').extract()

 

In the line above, Scrapy uses XPath to reach a node in the response and extract its data as text.

 

for item in zip(star_rating, comments):
    # create a dictionary to store the scraped info
    scraped_data = {
        'Star Rating': item[0],
        'Rating Text': item[1],
    }

 

In the code above, we pair each star rating with its review text and place each pair in a Python dictionary.
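This pairing logic can be seen in isolation with plain Python. The values below are hypothetical samples standing in for the lists that the two XPath expressions would return:

```python
# Hypothetical sample values, standing in for the lists the XPath expressions return
star_rating = ['1.0 out of 5 stars', '5.0 out of 5 stars']
comments = ['The screen was cracked.', 'Love this phone!']

items = []
for item in zip(star_rating, comments):
    # Same dictionary shape as in the parse() method above
    scraped_data = {
        'Star Rating': item[0],
        'Rating Text': item[1],
    }
    items.append(scraped_data)

print(items[0])  # {'Star Rating': '1.0 out of 5 stars', 'Rating Text': 'The screen was cracked.'}
```

zip() stops at the shorter list, so a missing rating or comment silently drops the pair rather than raising an error.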

 

yield scraped_data

 

The yield statement returns the scraped data for Scrapy to process and store.

Handling Multiple Pages

In the example above, we defined the parse function to scrape data from a single page. This is not sufficient when reviews span multiple pages, so we will extend our code to navigate to all available pages and extract the data in the same way.

 

Pages are navigated through the Next page button, whose HTML is shown below.

 

<li class="a-last">
  <a href="/Apple-iPhone-256GB-Silver-T-Mobile/product-reviews/B07RV52TRF/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&amp;pageNumber=2">
    Next page<span class="a-letter-space"></span>
    <span class="a-letter-space"></span>→
  </a>
</li>

 

Now we need to modify our spider so it can identify the Next button through a Selector and check if it exists. If it does, we can navigate to it and call our parser again. This can be done by adding the following code.

 

next_page = response.css('.a-last a ::attr(href)').extract_first()
if next_page:
    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse
    )

 

We have used a CSS expression as the Selector for the Next page link. extract_first() returns the first match, or None if there is none; if a next page exists, we build its absolute URL with response.urljoin() and schedule a new request with self.parse as the callback.
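response.urljoin() resolves the extracted href against the URL of the page the response came from, in the same way as urllib.parse.urljoin from the standard library. A quick sketch of that resolution with the stdlib function, using the page URL and the relative href from the Next page link above:

```python
from urllib.parse import urljoin

# The page URL the spider is currently on
page_url = ('https://www.amazon.com/Apple-iPhone-256GB-Silver-T-Mobile/'
            'product-reviews/B07RV52TRF/ref=cm_cr_dp_d_show_all_btm?')

# The relative href extracted from the "Next page" link shown above
next_page = ('/Apple-iPhone-256GB-Silver-T-Mobile/product-reviews/B07RV52TRF/'
             'ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&pageNumber=2')

absolute = urljoin(page_url, next_page)
print(absolute)
```

Because the href starts with "/", urljoin keeps the scheme and host from the page URL and replaces the rest, giving the absolute URL of page 2.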

 

Running the Spider

 

Once the spider is built, we can run it with the following command.

scrapy runspider scrapy_tutorial/spiders/reviewspider.py -o scraped_data.csv

 

The runspider command takes reviewspider.py as the input file and produces the CSV file scraped_data.csv, which contains the collected results.
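To see what that CSV output looks like, here is how the yielded dictionaries map onto rows, sketched with the standard csv module and hypothetical sample items (Scrapy's own exporter does this for you when you pass -o):

```python
import csv
import io

# Hypothetical items, shaped like the dictionaries the spider yields
items = [
    {'Star Rating': '1.0 out of 5 stars', 'Rating Text': 'The screen was cracked.'},
    {'Star Rating': '5.0 out of 5 stars', 'Rating Text': 'Love this phone!'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Star Rating', 'Rating Text'])
writer.writeheader()     # first row: the column names
writer.writerows(items)  # one row per scraped review

print(buffer.getvalue())
```

The dictionary keys become the header row, and each yielded item becomes one data row.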

 

Scrapy Feed Exports

Scrapy provides the Feed Export option to store the extracted data in different formats or serialization methods. It supports formats such as CSV, XML, and JSON.

For example, if you want your output in CSV format, go to the settings.py file and add the lines below.

FEED_FORMAT="csv"

FEED_URI="scraped_data.csv"

 

Save this file and rerun the spider. You will then see the CSV file created under your project directory.

If you want a timestamp or the name of the spider in your file name, you can use %(time)s or %(name)s in your FEED_URI.

 

For example:

 

    • FEED_URI = "scraped_data_%(time)s.json"

 

 

Conclusion

 

Scrapy is a powerful web scraping framework, and we have shown how easily it can be used to crawl and scrape Amazon reviews. Although it is mainly aimed at parsing HTML documents, it is easy to learn, extremely quick, and comes with a wide range of functionality.

 

So, Scrapy is a complete development tool for web scraping and probably the best choice for large-scale crawling projects. If you want to learn more about Scrapy, you can always follow its official documentation.


 

