Scraping a website using Scrapy (Python)


Installing Scrapy:

System packages required:

On CentOS/RHEL distributions

yum install epel-release
yum install python-pip python-devel gcc libxml2-devel libxslt-devel openssl-devel libffi-devel

On Debian distributions

apt-get install python-pip python-dev libxml2-dev libxslt-dev libssl-dev libffi-dev

Once the dependent packages are installed, installing Scrapy is easy using the Python package manager "pip":

pip install scrapy
scrapy version  # check the installed Scrapy version and confirm the install worked
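If you want to keep the project's dependencies isolated from your system packages, installing Scrapy inside a virtualenv also works; a minimal sketch (the environment name scrapy-env is just an example):

pip install virtualenv
virtualenv scrapy-env            # create an isolated Python environment
source scrapy-env/bin/activate   # activate it for the current shell
pip install scrapy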

Setting up Scrapy:

Scrapy is a framework, not just a package, so we first need to initialize a Scrapy project:

scrapy startproject <project-name>  # imdb in this case

This command creates a Scrapy project. Next, from inside the project directory, we create a spider which, when run, will scrape the site:

scrapy genspider imdbScraper www.imdb.com
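genspider drops a minimal spider skeleton into the spiders folder. The exact boilerplate varies by Scrapy version, but it looks roughly like this:

import scrapy

class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    allowed_domains = ["www.imdb.com"]
    start_urls = ['http://www.imdb.com/']

    def parse(self, response):
        pass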

Together, these two commands create a folder with the structure below:

imdb
|-- imdb 
|   |-- __init__.py
|   |-- items.py               # project item files
|   |-- pipelines.py           # project pipeline files
|   |-- settings.py            # project settings files
|   `-- spiders                # contains all scrapers
|       |-- imdbScraper.py     # scraper for imdb
|       |-- __init__.py
`-- scrapy.cfg                 # configuration file
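We won't need items.py in this post (we simply print the results), but for reference, it is where structured containers for scraped data are declared; a minimal, hypothetical sketch matching what our spider extracts:

import scrapy

class ImdbItem(scrapy.Item):
    title = scrapy.Field()   # movie title text
    link = scrapy.Field()    # relative imdb link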

Now all our scraping code goes into imdbScraper.py:

import scrapy


class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    allowed_domains = ["www.imdb.com"]

    def start_requests(self):
        urls = [
            'http://www.imdb.com/chart/top',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # select the Top 250 chart table by its data attribute
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        rows = table.xpath('.//tbody/tr')
        for row in rows:
            # each row carries the movie title and link in its titleColumn cell
            args = (row.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                    row.xpath('.//td[@class="titleColumn"]/a/@href').extract_first())
            print('MovieName is %s and imdb link is %s' % args)

Now run the command below to start the spider crawling:

scrapy crawl imdbScraper

Execution starts at start_requests, which iterates over every URL in the urls list [in the next blog post, we will look into scraping multiple pages, where this comes in handy]. Each response is handed to parse, which finds the table whose data attribute "data-caller-name" has the value "chart-top250movie", iterates over every tr in its tbody, extracts the movie title text and link from the td with class "titleColumn", and prints them.
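A convenient way to test XPath expressions like these before baking them into the spider is Scrapy's interactive shell, which fetches a page and gives you the response object to experiment with:

scrapy shell 'http://www.imdb.com/chart/top'
>>> table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
>>> table.xpath('.//tbody/tr//td[@class="titleColumn"]/a/text()').extract_first()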

In the next tutorials, we will look into how to export the data and also use these links to scrape and save more information. Visit the Scrapy website for more detailed documentation.

Happy Coding....
