In the previous blog post, we collected the names and links of the top 250 movies from IMDb. In this blog post, let's look at exporting that data to files.

For more information on Scrapy, visit here.

Now let's review the parse function in imdbScraper.py from the previous blog post.

    def parse(self, response):
        # Select the top-250 chart table, then iterate over its rows.
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            args = (link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                    link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first())
            print('Movie name is %s and IMDb link is %s' % args)
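Running the spider at this stage just prints each movie to the console. Assuming the chart order at the time of scraping, the first lines of output would look something like this (hrefs on the live page carry extra tracking parameters, trimmed here):

    Movie name is The Shawshank Redemption and IMDb link is /title/tt0111161/
    Movie name is The Godfather and IMDb link is /title/tt0068646/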

Declaring Items:

Items are like models for the data we scrape. From the code snippet above, we are collecting the movie name and link, so our model will contain two fields:

import scrapy
from scrapy import Field

class ImdbItem(scrapy.Item):
    name = Field()
    url = Field()

Now that we have the item declared in items.py, we can instantiate it as ImdbItem(name="xyz", url="some-url-here"). Scrapy items also behave like dictionaries, which makes them easy to serialize later.
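Here is a minimal sketch of that dict-like behaviour (the values are just placeholders):

from imdb.items import ImdbItem  # the item we declared in items.py above

item = ImdbItem(name="xyz", url="some-url-here")
print(item['name'])  # field values are read like dictionary keys: xyz
print(dict(item))    # an item converts cleanly to a plain dict for serialization

With the item in place, we need to make changes to imdbScraper.py: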

    from imdb.items import ImdbItem

    def parse(self, response):
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            yield ImdbItem(name=link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                           url=link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first())

After these modifications, imdbScraper.py will look like this:

# -*- coding: utf-8 -*-
import scrapy
from imdb.items import ImdbItem

class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    allowed_domains = ["www.imdb.com"]

    def start_requests(self):
        urls = [
            'http://www.imdb.com/chart/top',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        links = table.xpath(".//tbody/tr")
        for link in links:
            yield ImdbItem(name=link.xpath('.//td[@class="titleColumn"]/a/text()').extract_first(),
                           url=link.xpath('.//td[@class="titleColumn"]/a/@href').extract_first())

Exporting Data

Exporting data to CSV or JSON can be done in two ways:

1. Using Feed Exporters in settings.py

Feed exports are a feature that has been available in Scrapy since version 0.10. For all available options, refer to this link.

FEED_FORMAT = "csv"                    # can also be "json", among other supported formats
FEED_EXPORT_FIELDS = ["name", "url"]   # decides which fields to export and the order of the columns
FEED_URI = "file:///tmp/export.csv"    # also supports S3 (with the help of boto) and FTP, in case you're interested

Adding these three lines to settings.py and running the spider again will save the data to /tmp/export.csv.
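After running scrapy crawl imdbScraper again, /tmp/export.csv starts with a header row followed by one row per movie. Assuming the chart order at the time of scraping, the first rows would look roughly like this (live hrefs include extra tracking parameters, trimmed here):

name,url
The Shawshank Redemption,/title/tt0111161/
The Godfather,/title/tt0068646/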

2. Using Pipelines

An item pipeline is a component that is run over every item the spider yields. A pipeline can also decide whether an item should be processed further or dropped (see the sketch after the snippet below). We can define a pipeline that writes to a JSON file every time we receive an item. Below is a code snippet that saves items to a JSON file:

import json

class ImdbPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts; open the output file.
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line.
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
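As mentioned above, a pipeline can also drop items instead of passing them along. Below is a minimal sketch, assuming we want to discard any item scraped without a movie name (ValidateItemPipeline is just an example name, not part of the project yet):

from scrapy.exceptions import DropItem

class ValidateItemPipeline(object):
    def process_item(self, item, spider):
        # Raising DropItem prevents the item from reaching later pipelines.
        if not item.get('name'):
            raise DropItem('Missing name in %s' % item)
        return item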

After writing the ImdbPipeline, don't forget to enable it by adding the pipeline class to ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = {
   'imdb.pipelines.ImdbPipeline': 300,
}

The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes.
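For example, if you also enabled the hypothetical ValidateItemPipeline sketched earlier, giving it a lower value would make it run before the JSON writer:

ITEM_PIPELINES = {
   'imdb.pipelines.ValidateItemPipeline': 100,  # runs first: drops incomplete items
   'imdb.pipelines.ImdbPipeline': 300,          # runs second: writes surviving items to items.json
}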

Running the spider again will now generate an items.json file with the movie names and links that we scraped.

You can use either of these methods to save the scraped data to files.

In the next blog post, we will look at how to scrape data from multiple pages and also explore saving the data into MySQL.
