Extract Text from a Multi-page PDF with Only Images

Sometimes a PDF contains nothing but images. In such cases you cannot select the text, whether to copy and paste it or just to refer to it.

To extract text from an image, or from a PDF containing only images, I used the Tesseract OCR engine and Ghostscript. I am running Fedora 19 at the moment, but these steps should also apply to older versions of Fedora and to Ubuntu (I believe this can be done on Windows as well). Both Tesseract and Ghostscript are free software.

First, install both Tesseract and Ghostscript on Fedora:

$ sudo yum install -y ghostscript tesseract
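
Before going further, it may be worth confirming that both commands are actually on your PATH. The following is only a minimal sketch for a POSIX shell; `require_tools` is a hypothetical helper name of my own, not part of either package:

```shell
# Hypothetical helper: report any command that is not on the PATH.
# Returns 0 when every named command is found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, with the command names used in this post:
#   require_tools ghostscript tesseract && echo "ready"
```

If a name is reported as missing, the install step above probably did not complete, or the executable is installed under a different name on your system.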

Now go to the folder where your PDF is located (assuming that it is named story.pdf):

$ cd ~/Downloads/

Next, extract each page of the PDF as a PNG. For this I used Ghostscript. Note the resolution (-r300), which renders each page at 300 dpi:

$ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile="page%03d".png story.pdf
$ ls page*.png
page001.png
page002.png
...

Once we have a PNG for each page, we can use the OCR software to extract text:

$ for f in page*.png ; do tesseract "$f" "$f.out" ; done
$ ls page*.out.txt
page001.png.out.txt
page002.png.out.txt
...
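
If you would rather have one text file for the whole document, the per-page files can be joined afterwards. This is a rough sketch rather than part of the original steps; `merge_pages` and the output name story.txt are hypothetical, and it assumes the zero-padded file names produced above, which keep the pages in order for documents of up to 999 pages:

```shell
# Hypothetical helper: join per-page OCR output into a single text file.
# $1 is the output file; the page*.png.out.txt files are read from the
# current directory. Zero-padded names (page001, page002, ...) make the
# shell glob expand in page order.
merge_pages() {
    out=$1
    : > "$out"                       # start with an empty output file
    for f in page*.png.out.txt; do
        [ -e "$f" ] || continue      # skip if the glob matched nothing
        cat "$f" >> "$out"
        printf '\n' >> "$out"        # extra newline to separate pages
    done
}

# Usage:
#   merge_pages story.txt
```

The individual page files are left untouched, so you can still check a single page against its PNG.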

So now all the text from the images is in text files. Tesseract's OCR output is quite accurate overall, although it obviously cannot read drawings or misprinted characters well.

I hope it is helpful for you.
