Extract Text From PDF


Extract Text from Images in multi-page PDF

To extract text from a PDF whose pages are images, you need two programs installed on your machine: Ghostscript and Tesseract.

Installing these on Fedora is very easy:

$ sudo yum install -y ghostscript tesseract

Now, if your PDF file is named story.pdf, you can extract the text as follows. Ghostscript renders each page to a 300 dpi PNG image, and Tesseract then runs OCR on each image:

$ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile="page%03d".png story.pdf
$ for f in page*.png ; do tesseract "$f" "$f.out"; done
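
Tesseract appends .txt to the output base it is given, so the loop above should leave one file per page (page001.png.out.txt, page002.png.out.txt, and so on). If you would rather have a single text file, a small helper along these lines could join them in page order. This is only a sketch: the function name combine_pages and the output name story.txt are made up for illustration, and it assumes the per-page files are in the current directory.

```shell
# Hypothetical helper: join the per-page OCR results into one file.
# Assumes files named page001.png.out.txt, page002.png.out.txt, ...
# exist in the current directory (the names the loop above produces).
combine_pages() {
    out="$1"
    : > "$out"                      # create or truncate the output file
    for t in page*.png.out.txt; do
        [ -e "$t" ] || continue     # nothing matched: leave output empty
        cat "$t" >> "$out"
        printf '\n' >> "$out"       # blank line between pages
    done
}
```

Running `combine_pages story.txt` in the directory that holds the page files writes the combined text to story.txt. The zero-padded page numbers (%03d) keep the shell glob in the right order for up to 999 pages.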