We’ve done our fair share of projects over the past few years involving natural language processing of unstructured text. This text has come from Word documents, PDFs, PowerPoint slides, emails and, of course, web pages. Given great Python tools like nltk, textblob, and scikit-learn that make the analysis part of the process simpler, it’s surprising how tedious it is to actually extract the text from each of these different types of data sources.
To avoid adding entries to the seemingly endless list of one-off scripts that we have written to accomplish this task, we wrote textract, a Python package that provides a simple user interface for extracting text from any document. Ok, ok, ok. You can’t extract text from any document at the moment, but textract integrates support for many common formats, and we designed it to make adding other document formats as easy as possible. The whole thing is up on GitHub to make it easier for the community to add their own integrations.
There are two primary ways you can use textract. From the command line, you simply call textract on any particular file like this:
textract little_bo_peep.doc > little_bo_peep.txt
Since the package is written in Python, you can also obtain the text within your Python scripts like this:
import textract
little_bo_peep = textract.process("little_bo_peep.doc")
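To show how that one call might fit into a larger script, here is a minimal sketch that runs it over every matching file in a folder. The helper names (extract_text, extract_directory), the suffixes default, and the extractor parameter are our own hypothetical additions for illustration, not part of textract. The sketch also assumes the result may come back as bytes and decodes it as UTF-8, which may not hold for every version or document.

```python
from pathlib import Path


def extract_text(path, extractor=None):
    """Return the text of one document as a str (hypothetical helper)."""
    if extractor is None:
        # Imported lazily so the helper can be tried out without textract installed.
        import textract
        extractor = textract.process
    raw = extractor(str(path))
    # Assumption: the result may be bytes; normalize it to str as UTF-8.
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return raw


def extract_directory(directory, suffixes=(".doc", ".pdf"), extractor=None):
    """Map each matching file name in `directory` to its extracted text."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        # Only files whose extension is in `suffixes` are processed.
        if path.is_file() and path.suffix.lower() in suffixes:
            results[path.name] = extract_text(path, extractor)
    return results
```

Calling extract_directory("nursery_rhymes") (a made-up folder name) would return a dictionary keyed by file name. Passing a different extractor is only there to make the helper easy to test; by default it defers to textract.process.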
We plan to actively maintain this project and hope that you find it useful. If you have any suggestions (new file formats, UI improvements, documentation clarifications, etc.) or are interested in contributing, all participation is welcome!