Extract data from PDF and all Microsoft Office files in Python

We will see how to extract text from PDF and all Microsoft Office files.

Extracting text from PDF:

A quick way to extract text from PDFs in Python is the "slate" library. Slate is a Python package that wraps PDFMiner and simplifies the process of extracting text.

Installation:

$ pip install slate
$ pip install pdfminer

Usage:

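import slate
# Open the PDF in binary mode; slate.PDF expects a file-like object.
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)
    # Prints the extracted text, one string per page.
    print(pdf_text)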

Output: ['Sample text...', '......', '......']

* Slate's PDF class takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second argument.

Example: 

import slate
with open('test_doc.pdf', 'rb') as f:
    # The second argument is the password of the protected PDF.
    pdf_text = slate.PDF(f, "pass the PDF file password here")
    print(pdf_text)

Output: ['Sample text...', '......', '......']
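
Since the output is a list with one string per page, it can be processed with ordinary Python. The sketch below is only an illustration and does not call slate: the pdf_text list is made-up stand-in data, and clean_page and find_pages are hypothetical helper names, not part of slate. It assumes the extracted pages contain stray whitespace and form-feed characters, which may or may not be true for your files.

```python
import re


def clean_page(text):
    """Collapse runs of whitespace (newlines, form feeds, spaces) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def find_pages(pages, keyword):
    """Return the 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


# Stand-in data (an assumption): in real use this would be the result of slate.PDF(f).
pdf_text = [
    "Sample text on page one.\n\n\x0c",
    "Nothing here.\x0c",
    "More SAMPLE   text.\x0c",
]

# Tidy each page, then join all pages into one string if a single blob is needed.
cleaned = [clean_page(page) for page in pdf_text]
full_text = "\n".join(cleaned)

# Pages that mention the keyword.
print(find_pages(pdf_text, "sample"))
```

Keeping the pages separate until the last step makes it easy to report where in the document a match was found.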
Posted on 21 September 2014 by Micropyramid
