Installing Scrapy:

System packages required:

On CentOS/RHEL distributions

yum install epel-release
yum install python-pip python-devel gcc libxml2-devel libxslt-devel openssl-devel libffi-devel

On Debian-based distributions

apt-get install python-pip python-dev libxml2-dev libxslt-dev libssl-dev libffi-dev

Once the dependent packages are installed, installing Scrapy is easy with pip, the Python package manager:

pip install scrapy
scrapy version  # prints the Scrapy version and confirms that Scrapy is installed properly
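
The same check can be done from inside Python, which helps when several Python environments are installed and it is unclear which one pip used. Below is a minimal sketch using only the standard library, assuming Python 3.8 or newer; the helper name installed_version is made up for this illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("scrapy is not installed in this Python environment")
    else:
        print("scrapy %s is installed" % version)
```

If this reports that scrapy is missing while the scrapy command works in the shell, pip most likely installed it for a different interpreter.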

Setting up Scrapy:

Scrapy is a framework, not just a package, so we start by initializing a Scrapy project:

scrapy startproject <project-name>  # imdb in this case

This command creates a Scrapy project. Next, from inside the project directory, we generate a spider, which scrapes the site when it is run:

scrapy genspider imdbScraper www.imdb.com

After both commands, the project folder has the following structure:

imdb
|-- imdb 
|   |-- __init__.py
|   |-- items.py               # project item files
|   |-- pipelines.py           # project pipeline files
|   |-- settings.py            # project settings files
|   `-- spiders                # contains all scrapers
|       |-- imdbScraper.py     # scraper for imdb
|       |-- __init__.py
`-- scrapy.cfg                 # configuration file
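
As an illustration of what goes into settings.py, here is a small fragment with a few commonly adjusted options. The setting names are standard Scrapy settings, but the values are only assumptions for this example and should be tuned to the site being crawled.

```python
# imdb/settings.py (fragment; the values are illustrative assumptions)

BOT_NAME = "imdb"

# Respect the target site's robots.txt rules
ROBOTSTXT_OBEY = True

# Wait at least this many seconds between requests to the same site
DOWNLOAD_DELAY = 1

# Limit parallel requests per domain so the crawl stays gentle
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```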

Now all our scraping code goes into imdbScraper.py:

import scrapy


class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    allowed_domains = ["www.imdb.com"]

    def start_requests(self):
        # Seed URLs; each downloaded response is handed to parse()
        urls = [
            'http://www.imdb.com/chart/top',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The Top 250 chart table is identified by its data-caller-name attribute
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        rows = table.xpath(".//tbody/tr")
        for row in rows:
            # The movie title and its link both live in the "titleColumn" cell
            title = row.xpath('.//td[@class="titleColumn"]/a/text()').extract_first()
            href = row.xpath('.//td[@class="titleColumn"]/a/@href').extract_first()
            print('MovieName is %s and imdb link is %s' % (title, href))

Now run the command below to start the spider crawling:

scrapy crawl imdbScraper

Execution starts at start_requests, which iterates over every URL in the urls list (in the next blog post we will look at scraping multiple pages, where this comes in handy). For each downloaded page, parse looks for the table whose "data-caller-name" attribute has the value "chart-top250movie", iterates over every tr in its tbody, fetches the text and the link from the a element inside the td with class "titleColumn", and prints them.
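
The row-by-row extraction can be tried offline, without running a crawl. The sketch below deliberately swaps Scrapy selectors for xml.etree.ElementTree from the Python standard library, which supports only a subset of XPath and needs well-formed markup, so it is meant for understanding the steps rather than as a replacement for the spider. The markup, titles, and links are hand-written placeholders, and the function name extract_movies is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the chart page (hypothetical markup and values)
SAMPLE_HTML = """
<html><body>
<table data-caller-name="chart-top250movie">
  <tbody>
    <tr><td class="titleColumn"><a href="/title/tt0000001/">First Movie</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/tt0000002/">Second Movie</a></td></tr>
  </tbody>
</table>
</body></html>
"""


def extract_movies(markup):
    """Return (title, href) pairs using the same steps as the spider's parse()."""
    root = ET.fromstring(markup.strip())
    # Step 1: locate the chart table by its data-caller-name attribute
    table = root.find('.//table[@data-caller-name="chart-top250movie"]')
    if table is None:
        return []
    movies = []
    # Step 2: walk every row of the table body
    for row in table.findall(".//tbody/tr"):
        # Step 3: read the text and href of the link in the titleColumn cell
        anchor = row.find('.//td[@class="titleColumn"]/a')
        if anchor is not None:
            movies.append((anchor.text, anchor.get("href")))
    return movies


if __name__ == "__main__":
    for title, href in extract_movies(SAMPLE_HTML):
        print("MovieName is %s and imdb link is %s" % (title, href))
```

A page without the expected table yields an empty list here, which mirrors how the spider prints nothing when the XPath expressions match nothing.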

In the next tutorials, we will look at how to export the data and how to use these links to scrape and save more information. Visit the Scrapy website for more detailed documentation.
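
Before the extracted links can be requested, they usually have to be turned into absolute URLs. The sketch below assumes the href values are site-relative paths such as /title/tt0000001/ (a made-up value) and resolves them with urljoin from the standard library; the helper name absolute_links is hypothetical. Inside a spider, response.urljoin(href) performs the same resolution against the page URL.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.imdb.com/chart/top"  # page the spider started from


def absolute_links(base_url, hrefs):
    """Resolve hrefs against the page URL, skipping missing (None or empty) values."""
    return [urljoin(base_url, href) for href in hrefs if href]


if __name__ == "__main__":
    # Illustrative hrefs, not real scraped output
    sample_hrefs = ["/title/tt0000001/", None, "/title/tt0000002/"]
    for link in absolute_links(BASE_URL, sample_hrefs):
        print(link)
```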

Happy Coding....
