Scrape a website using Scrapy in Python


Installing Scrapy:

Required system packages:

On CentOS/RHEL distributions

yum install epel-release
yum install python-pip python-devel gcc libxml2-devel libxslt-devel openssl-devel libffi-devel

On Debian-based distributions

apt-get install python-pip python-dev libxml2-dev libxslt-dev libssl-dev libffi-dev

Once the dependent packages are installed, Scrapy itself is easy to install with the Python package manager "pip"

pip install scrapy
scrapy version  # Prints the Scrapy version, which also confirms that Scrapy is installed properly
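
The same check can be made from inside Python, which helps when pip and the interpreter you actually run belong to different environments. This is a minimal sketch that assumes Python 3; the helper name is_installed is made up for illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the current interpreter can import the given top-level package."""
    # find_spec returns None when no importable module with that name is found.
    return importlib.util.find_spec(package_name) is not None


# Reports whether Scrapy is importable from this interpreter.
print("scrapy installed:", is_installed("scrapy"))
```

If this prints False while "scrapy version" works on the command line, pip most likely installed Scrapy for a different Python interpreter.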

Setting up Scrapy:

Scrapy is a framework, not just a package to import, so we first need to initialize a Scrapy project with

scrapy startproject <project-name> #[imdb in this case]

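This command creates a Scrapy project. Next, from inside the project directory, we generate a spider, which scrapes the site when it is run. The genspider command takes the spider name followed by the domain to scrape

scrapy genspider imdbScraper <domain>

Together, these commands produce a folder with the structure below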

|-- imdb
|   |-- __init__.py
|   |-- items.py               # project item files
|   |-- pipelines.py           # project pipeline files
|   |-- settings.py            # project settings files
|   `-- spiders                # contains all scrapers
|       |-- imdbScraper.py     # scraper for imdb
|       |-- __init__.py
`-- scrapy.cfg                 # configuration file

All of our scraping code now goes into the spider file inside the spiders folder

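import scrapy


class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    # The domain is missing in the source; set it to the domain of the site being scraped.
    allowed_domains = [""]

    def start_requests(self):
        urls = [
            # The start URL(s) are missing in the source; add the page(s) to scrape here.
        ]
        # Schedule one request per URL; each response is handed to parse().
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Locate the chart table, then walk over its body rows.
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            # Pull the movie title and its link out of the title cell of this row.
            args = (
                link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first(),
            )
            print('MovieName is %s and  imdb link is %s' % args)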

Now run the command below to start the spider crawling

scrapy crawl imdbScraper

Execution starts at start_requests, which iterates over every URL in the urls list and schedules a request for each one [in the next blog post we will look into scraping multiple pages, where this comes in handy]. For each downloaded page, parse looks for the table whose "data-caller-name" attribute has the value "chart-top250movie", iterates over every tr in its tbody, fetches the text and the link from the td with the class name "titleColumn", and prints them.
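
The row-by-row extraction described above can be tried without crawling anything. The sketch below is not Scrapy code: it swaps in the standard library's xml.etree.ElementTree, which supports a subset of XPath, and runs it on a small hand-written fragment. The sample markup, the movie names, the links, and the function name extract_movies are all made up for illustration; only the attribute and class names come from the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page (not actual site data).
SAMPLE_HTML = """
<html><body>
<table data-caller-name="chart-top250movie">
  <tbody>
    <tr><td class="titleColumn"><a href="/title/tt0000001/">Example Movie One</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/tt0000002/">Example Movie Two</a></td></tr>
  </tbody>
</table>
</body></html>
"""


def extract_movies(markup):
    """Return a list of (title, href) pairs from the chart table in the markup."""
    root = ET.fromstring(markup.strip())
    # Same lookup as the spider: the table identified by its data-caller-name attribute.
    table = root.find(".//table[@data-caller-name='chart-top250movie']")
    if table is None:
        return []
    movies = []
    # Walk every row of the table body and read the anchor in the title cell.
    for row in table.findall(".//tbody/tr"):
        anchor = row.find(".//td[@class='titleColumn']/a")
        if anchor is not None:
            movies.append((anchor.text, anchor.get("href")))
    return movies


for title, href in extract_movies(SAMPLE_HTML):
    print("MovieName is %s and imdb link is %s" % (title, href))
```

ElementTree only accepts well-formed XML, so this works for the tidy sample but not for arbitrary real-world HTML; inside a spider, response.xpath handles that case.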

In the next tutorials, we will look into how to export the data and how to use these links to scrape and save more information. Visit the Scrapy website for more detailed documentation.
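
If the extracted href values turn out to be site-relative paths, they have to be joined with the address of the page they came from before they can be requested. A minimal sketch of that step, assuming Python 3 and using the standard library's urljoin; the base URL, the sample path, and the helper name to_absolute are placeholders for illustration.

```python
from urllib.parse import urljoin


def to_absolute(page_url, href):
    """Resolve a link found on page_url into an absolute URL."""
    # urljoin leaves absolute links untouched and resolves relative ones against page_url.
    return urljoin(page_url, href)


# Hypothetical values for illustration only.
print(to_absolute("https://example.com/chart/top", "/title/tt0000001/"))
```

Inside a spider the same result is available through response.urljoin(href), which uses the URL of the current response as the base.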

Happy Coding....
