Export Data with Scrapy (Python)


In the previous blog post, we collected the names and links of the top 250 movies from IMDb. In this blog post, let's look at exporting that data to files.

For more information on Scrapy, refer to the official Scrapy documentation.

Now let's review the parse function in imdbScraper.py from the previous blog post.

    def parse(self, response):
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            args = (link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                    link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first())
            print('Movie name is %s and IMDb link is %s' % args)

Declaring Items:

Items are like models for our scraped data. From the above code snippet, we are collecting the movie name and link, so our model will contain two fields:

import scrapy
from scrapy import Field

class ImdbItem(scrapy.Item):
    name = Field()
    url = Field()

Now that we have the item declared in items.py, we can instantiate it as ImdbItem(name="xyz", url="some-url-here").
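A quick check (reusing the same placeholder values) shows that an Item behaves much like a dict, which is also why dict(item) works in the pipeline shown later:

from imdb.items import ImdbItem

item = ImdbItem(name="xyz", url="some-url-here")
print(item["name"])  # field access works like a dict -> xyz
print(dict(item))    # {'name': 'xyz', 'url': 'some-url-here'}

With the item defined, we need to make the following changes to imdbScraper.py: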

    from imdb.items import ImdbItem

    def parse(self, response):
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            yield ImdbItem(name=link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                           url=link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first())

After these modifications, imdbScraper.py will look like this:

# -*- coding: utf-8 -*-
import scrapy
from imdb.items import ImdbItem

class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    allowed_domains = ["www.imdb.com"]

    def start_requests(self):
        urls = [
            'http://www.imdb.com/chart/top',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            yield ImdbItem(name=link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                           url=link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first())
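At this point you can run the spider from the project root with Scrapy's crawl command (imdbScraper is the name attribute defined above):

scrapy crawl imdbScraper

Without a feed export or pipeline configured, the yielded items only show up in the crawl log, which brings us to exporting.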

Exporting Data

Exporting data to CSV or JSON can be done in two ways:

1. Using Feed Exporters in settings.py

Feed exports have been a feature of Scrapy since version 0.10. For all available options, refer to the feed exports section of the Scrapy documentation.

FEED_FORMAT = "csv"                    # can also be "json", among other formats
FEED_EXPORT_FIELDS = ["name", "url"]   # decide which fields to export and their order
FEED_URI = "file:///tmp/export.csv"    # also supports S3 (with boto) and FTP, in case you're interested

Adding these three lines to settings.py and running the spider again will save the data to /tmp/export.csv.
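The same export can also be produced for a single run without touching settings.py, by passing the -o option to the crawl command; Scrapy infers the format from the file extension, so a .json extension would write JSON instead:

scrapy crawl imdbScraper -o /tmp/export.csv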

2. Using Pipelines

An item pipeline is a component that processes every item after it is scraped. A pipeline can also be used to decide whether an item should be processed further or dropped (a sketch of such a pipeline follows the snippet below). We can define a pipeline that writes to a JSON file every time we receive an item. Below is a code snippet that saves items to a JSON file:

import json

class ImdbPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # called for every scraped item: write one JSON object per line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
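As mentioned above, a pipeline can also drop items. Here is a minimal sketch of such a pipeline (the name ValidateUrlPipeline is hypothetical, not part of the project): it raises DropItem for any item missing its url, so that item never reaches ImdbPipeline.

from scrapy.exceptions import DropItem

class ValidateUrlPipeline(object):
    def process_item(self, item, spider):
        # drop the item if the url field is missing or empty
        if not item.get("url"):
            raise DropItem("Missing url in %s" % item)
        return item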

After writing this pipeline, don't forget to enable it by adding its class to the ITEM_PIPELINES setting in settings.py:

ITEM_PIPELINES = {
   'imdb.pipelines.ImdbPipeline': 300,
}

The integer values you assign to the classes in this setting determine the order in which they run: items pass through pipelines from lower-valued to higher-valued classes.
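For example, enabling the hypothetical ValidateUrlPipeline from above with a lower value makes it run before ImdbPipeline:

ITEM_PIPELINES = {
   'imdb.pipelines.ValidateUrlPipeline': 200,  # runs first, drops items without a url
   'imdb.pipelines.ImdbPipeline': 300,         # runs second, writes items.json
}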

Running the spider again will now generate an items.json file with the movie names and links we scraped.
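To verify the output, a small standalone snippet (not part of the project) can read the file back, since the pipeline wrote one JSON object per line:

import json

with open('items.json') as f:
    items = [json.loads(line) for line in f]

print(len(items))        # expected: 250
print(items[0]['name'])  # first scraped movie name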

You can use either of these methods to save the scraped data to files.

In the next blog post, we will look at how to scrape data from multiple pages and explore saving the data into MySQL.
