Scrape a website using Scrapy in Python


Installing Scrapy:

System packages required:

On CentOS/RHEL distributions

yum install epel-release
yum install python-pip python-devel gcc libxml2-devel libxslt-devel openssl-devel libffi-devel

On Debian-based distributions

apt-get install python-pip python-dev libxml2-dev libxslt-dev libssl-dev libffi-dev

Once the dependent packages are installed, installing Scrapy is easy with the Python package manager "pip":

pip install scrapy
scrapy version  # prints the Scrapy version and confirms that Scrapy is installed properly
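
The same check can also be done from Python itself. The snippet below is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is made up for illustration and is not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether Scrapy is visible to this Python interpreter.
version = installed_version("scrapy")
if version is None:
    print("Scrapy is not installed for this interpreter")
else:
    print("Scrapy %s is installed" % version)
```

This is handy when several Python installations exist on one machine and the scrapy command on the PATH may belong to a different one.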

Setting up Scrapy:

Scrapy is a framework rather than a plain library, so we first need to initialize a Scrapy project with:

scrapy startproject <project-name> #[imdb in this case]

This command creates a Scrapy project. Next, from inside the project folder, we create a spider that scrapes the site when it is run:

scrapy genspider imdbScraper <domain>  # <domain> is the domain of the site to scrape

This results in a folder with the structure below:

|-- imdb
|   |-- __init__.py
|   |-- items.py               # project item files
|   |-- pipelines.py           # project pipeline files
|   |-- settings.py            # project settings files
|   `-- spiders                # contains all scrapers
|       |-- imdbScraper.py     # scraper for imdb
|       `-- __init__.py
`-- scrapy.cfg                 # configuration file
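
To confirm that the project was generated as expected, the layout can be checked with a few lines of standard-library Python. This is only a sketch: it assumes the default file names that Scrapy generates (items.py, pipelines.py, settings.py and so on) and the project name "imdb" used in this post; the helper missing_project_files is a hypothetical name.

```python
from pathlib import Path

# Assumed default Scrapy layout for a project named "imdb".
EXPECTED_FILES = [
    "scrapy.cfg",
    "imdb/__init__.py",
    "imdb/items.py",
    "imdb/pipelines.py",
    "imdb/settings.py",
    "imdb/spiders/__init__.py",
]


def missing_project_files(project_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Run from the folder that contains scrapy.cfg; an empty list means nothing is missing.
print(missing_project_files("."))
```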

Now all our scraping code goes into the spider file inside the spiders folder:

import scrapy


class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    allowed_domains = [""]  # fill in the domain of the site being scraped

    def start_requests(self):
        urls = [
            # add the URL of the page to scrape here
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Locate the chart table, then walk over its rows
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            # Movie title and link from the cell with class "titleColumn"
            args = (
                link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first(),
            )
            print('MovieName is %s and imdb link is %s' % args)
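
If the extracted @href values turn out to be site-relative paths, they need to be joined with the page URL before they can be requested. The sketch below shows the idea with the standard library's urljoin; the URL and path are made-up placeholders, and absolute_link is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Placeholder values for illustration only.
print(absolute_link("https://example.com/chart/top", "/title/tt0000001/"))
# prints "https://example.com/title/tt0000001/"
```

Inside a spider callback, response.urljoin(href) does the same thing relative to the URL of the response.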

Now run the command below so that the spider starts crawling:

scrapy crawl imdbScraper

Execution of the above code starts at start_requests, which iterates over every URL in the urls list (in the next blog post we will look into scraping multiple pages, where this comes in handy). For each response, parse looks for the table whose "data-caller-name" attribute has the value "chart-top250movie", iterates over every tr in its tbody, fetches the text and the link from the td with class name "titleColumn", and prints them.
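
The selection steps described above can be tried offline on a small piece of markup. The sketch below swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which understands a limited subset of XPath, so it only works because the sample is well-formed; real pages generally need Scrapy's own selectors. The sample rows, titles and links are invented for illustration, and extract_movies is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample that mimics the structure the spider expects.
SAMPLE_HTML = """
<html><body>
<table data-caller-name="chart-top250movie">
  <tbody>
    <tr><td class="titleColumn"><a href="/title/tt0000001/">First Example Movie</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/tt0000002/">Second Example Movie</a></td></tr>
  </tbody>
</table>
</body></html>
"""


def extract_movies(markup):
    """Return (title, href) pairs from the chart table, or an empty list if it is absent."""
    root = ET.fromstring(markup.strip())
    table = root.find('.//table[@data-caller-name="chart-top250movie"]')
    if table is None:
        return []
    movies = []
    for row in table.findall(".//tbody/tr"):
        anchor = row.find('.//td[@class="titleColumn"]/a')
        if anchor is not None:
            movies.append((anchor.text, anchor.get("href")))
    return movies


for title, href in extract_movies(SAMPLE_HTML):
    print("MovieName is %s and imdb link is %s" % (title, href))
```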

In the next tutorials, we will look into how to export data and how to use these links to scrape and save more information. Visit the Scrapy website for more detailed documentation.

Happy Coding....
