Extract data from PDF and all Microsoft Office files in Python


We will see how to extract text from PDF and all Microsoft Office files.

Extracting text from PDF:

A quick way to extract text from PDFs in Python is the "slate" library. Slate is a Python package, built on top of PDFMiner, that simplifies the process of extracting text from PDF files.

Installation:

$ pip install slate
$ pip install pdfminer

Usage:

import slate

# Open the PDF in binary mode; slate.PDF reads from a file-like object.
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)
    print(pdf_text)

Output: ['Sample text...', '......', '......']

* The PDF class of slate takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second parameter.

Example: 

import slate

# The second argument is the password of the protected PDF.
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")
    print(pdf_text)

Output: ['Sample text...', '......', '......']
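
Since the output is a list with one string per page, a little post-processing is often useful. The following is a minimal sketch, not part of slate itself: the helper names pages_to_text and find_pages are hypothetical, and sample_pages is a hand-written stand-in for the list that slate.PDF(f) returns, so the snippet runs without a PDF file.

```python
def pages_to_text(pages, separator="\n\n"):
    """Join per-page strings into one string, skipping blank pages."""
    cleaned = []
    for page in pages:
        # Collapse runs of spaces and newlines left over from the PDF layout.
        text = " ".join(page.split())
        if text:
            cleaned.append(text)
    return separator.join(cleaned)


def find_pages(pages, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, page in enumerate(pages, start=1)
            if needle in page.lower()]


# Hypothetical stand-in for the per-page list described above.
sample_pages = ["Sample  text\non page one", "   ", "More text on page three"]

print(pages_to_text(sample_pages))
print(find_pages(sample_pages, "text"))
```

With a real file, you would pass pdf_text from the examples above in place of sample_pages.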
