System packages Required:
On CentOS/RHEL distributions:
yum install epel-release
yum install python-pip python-devel gcc libxml2-devel libxslt-devel openssl-devel libffi-devel
On Debian distributions:
apt-get install python-pip python-dev libxml2-dev libxslt-dev libssl-dev libffi-dev
Once the dependent packages are installed, installing Scrapy is easy using the Python package manager "pip":
pip install scrapy
scrapy version  # check the Scrapy version and confirm Scrapy is installed properly
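The same check can be done from Python itself. As a small sketch (not one of the tutorial's own commands), the standard library's importlib.metadata reports the version of any installed distribution:

```python
from importlib import metadata

def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

# Prints the Scrapy version string if the pip install above succeeded, None otherwise.
print(package_version("scrapy"))
```

If this prints None, the pip install did not complete successfully.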
Setting up Scrapy:
Scrapy is a framework, not just a package, so we need to initialize a Scrapy project first:
scrapy startproject <project-name>  # imdb in this case
This command creates a Scrapy project. Next we create a spider, which will scrape the site when run:
scrapy genspider imdbScraper www.imdb.com
This will create a folder with the structure below:
imdb
|-- imdb
|   |-- __init__.py
|   |-- items.py       # project item files
|   |-- pipelines.py   # project pipeline files
|   |-- settings.py    # project settings file
|   `-- spiders        # contains all scrapers
|       |-- imdbScraper.py  # scraper for imdb
|       |-- __init__.py
`-- scrapy.cfg         # configuration file
Now all our scraping code goes into imdbScraper.py:
import scrapy


class ImdbscraperSpider(scrapy.Spider):
    name = "imdbScraper"
    allowed_domains = ["www.imdb.com"]

    def start_requests(self):
        urls = [
            'http://www.imdb.com/chart/top',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The Top 250 chart is a table identified by its data-caller-name attribute
        table = response.xpath('//table[@data-caller-name="chart-top250movie"]')
        rows = table.xpath('.//tbody/tr')
        for row in rows:
            title = row.xpath('.//td[@class="titleColumn"]/a/text()').extract_first()
            link = row.xpath('.//td[@class="titleColumn"]/a/@href').extract_first()
            print('MovieName is %s and imdb link is %s' % (title, link))
Now run the command below for the spider to start crawling:
scrapy crawl imdbScraper
Execution of the above code starts at start_requests, which iterates over every URL in the urls list [in the next blog post, we will look into scraping multiple pages, where this comes in handy]. The parse callback then looks for the table whose data attribute "data-caller-name" has the value "chart-top250movie", iterates over every tr in its tbody, fetches the text and link from the td with class "titleColumn", and prints them.
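To see the selector logic in isolation, the same title/href extraction can be mimicked on a small HTML fragment with the standard library. This uses xml.etree.ElementTree instead of Scrapy's selectors, and the fragment below is made up for illustration, shaped like two rows of the chart described above:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the shape of the Top 250 chart rows
fragment = """
<table>
  <tbody>
    <tr>
      <td class="titleColumn"><a href="/title/tt0111161/">The Shawshank Redemption</a></td>
    </tr>
    <tr>
      <td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a></td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(fragment)
# Same traversal as the spider: every tr in tbody, then the titleColumn anchor
for row in root.findall('.//tbody/tr'):
    a = row.find(".//td[@class='titleColumn']/a")
    print('MovieName is %s and imdb link is %s' % (a.text, a.get('href')))
```

ElementTree only supports a limited XPath subset and real pages are rarely well-formed XML, which is why Scrapy's own selectors (built on lxml) are used in the spider; this sketch only isolates the traversal logic.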
In the next tutorials, we will look into how to export data and also use these links to scrape and save more information. Visit the Scrapy website for more detailed documentation.