We will see how to extract text from PDF and all Microsoft Office files.

Extracting text from PDF:

A quick way to extract text from PDFs in Python is the "slate" library. Slate is a Python package that wraps PDFMiner and simplifies the process of extracting text.

Installation:

$ pip install slate
$ pip install pdfminer

Usage:

import slate

# Open the PDF in binary mode and let slate extract the text of every page
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)
    print(pdf_text)

Output: ['Sample text...', '......', '......']

* The PDF class in slate takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second parameter.

Example: 

import slate
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")
    print(pdf_text)

Output: ['Sample text...', '......', '......']
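
Since the output is a list with one string per page, a common next step is to merge the pages into a single string or to keep track of page numbers. The sketch below is only an illustration: the helper names join_pages and number_pages are hypothetical (they are not part of slate), and a plain Python list stands in for the result of slate.PDF(f) so the snippet runs without a PDF file.

```python
def join_pages(pages, separator="\n\n"):
    """Combine per-page text into one string, skipping empty pages."""
    cleaned = [page.strip() for page in pages]
    return separator.join(page for page in cleaned if page)


def number_pages(pages):
    """Pair each page's text with its 1-based page number."""
    return list(enumerate(pages, start=1))


# Stand-in for the list returned by slate.PDF(f); replace with real output.
sample_pages = ["First page text\n", "", "  Third page text "]

print(join_pages(sample_pages))
for page_number, text in number_pages(sample_pages):
    print(page_number, repr(text))
```

With real data, you would pass pdf_text from the examples above in place of sample_pages.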
