I actually found a tool called OCRmyPDF that can be installed with pip. It uses Tesseract (as well as Ghostscript) as a dependency:

https://ocrmypdf.readthedocs.io/en/latest/introduction.html

Since I already had Anaconda installed, I only had to download and install Ghostscript and Tesseract before running

pip install ocrmypdf

I was successfully able to run

ocrmypdf input.pdf output.pdf

and I now have a local setup for running OCR on PDFs!
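
In case anyone wants to run this over a whole folder, here is a minimal Python sketch that builds and runs the same command for each PDF. The helper names build_command and ocr_folder and the folder names are hypothetical. The --sidecar option is an addition to the command above: as far as I know it writes the recognized text to a separate .txt file, but please check ocrmypdf --help on your version.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, sidecar=True):
    """Return the ocrmypdf command line for one PDF as a list of strings."""
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    cmd = ["ocrmypdf"]
    if sidecar:
        # Assumption: --sidecar writes the recognized text to a plain .txt file.
        cmd += ["--sidecar", str(out_dir / (pdf_path.stem + ".txt"))]
    # Same form as "ocrmypdf input.pdf output.pdf".
    cmd += [str(pdf_path), str(out_dir / pdf_path.name)]
    return cmd


def ocr_folder(in_dir, out_dir):
    """Run ocrmypdf on every PDF in in_dir, writing results to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        # check=False so one failing file does not stop the whole batch.
        result = subprocess.run(build_command(pdf, out_dir), check=False)
        if result.returncode != 0:
            print(f"ocrmypdf failed on {pdf.name} (exit code {result.returncode})")
```

Calling ocr_folder("scans", "ocr_output") would then process every PDF in a folder named scans (both names are placeholders). The plain-text sidecar files may be easier to read with a screen reader than the OCRed PDFs themselves.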

Best,
Nikhil

On Fri, Apr 19, 2024 at 11:58 PM Nikhil Vohra <nikhil.vohra@stonybrook.edu> wrote:
Thanks!

I just installed it on Windows, but I get the following error:

Error in pixReadStream: Pdf reading is not supported

Most of the full-length documents that I would need to convert to screen-reader-accessible text are in the form of PDFs. For simple pictures, I could just use NVDA. Do you know of any workaround?

I used the command

tesseract my_file.pdf my_output.txt

Is there some other command or flag to use for PDFs?

Best,
Nikhil

On Fri, Apr 19, 2024 at 5:39 PM Patrick Smyth <patrick@iotaschool.com> wrote:
I use OCR pretty extensively. On Linux, I've written a short script that lets me draw boxes on the screen; the contents get screenshotted and sent to Tesseract, then to espeak. I can also do the current window, etc., and I wrote a little script that constantly scans and reads the screen for playing some games, though I don't use it that often.

On Windows, I combined a utility called Capture2Text with a tool that scans the clipboard, TextAloud, to do something similar. I just set up TextAloud to read aloud any clipboard updates, and set Capture2Text to dump the results of OCR to the clipboard, and the tools work very well together. I got through a number of games such as Disco Elysium with this technique. I do have some limited low vision, though, that lets me draw boxes around blobs on the screen.

I've been working more with neural networks lately, and I think they might have some applications for OCR, since they might be able to handle the step where the block of text is identified. I've noticed that OCR tends to fail if the box isn't drawn tightly around the text before it is sent to Tesseract.

In case it's useful to anyone, here's a link to the short script to OCR text on Linux. You need tesseract, espeak, and gnome-screenshot on your path.
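
For illustration, a pipeline like the one described could be sketched in Python as follows. This is not the linked script, just a minimal sketch that assumes gnome-screenshot, tesseract, and espeak are on the path. The function names and the temporary file name are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def screenshot_command(image_path):
    # -a lets the user drag a box on the screen; -f sets the output file.
    return ["gnome-screenshot", "-a", "-f", str(image_path)]


def ocr_command(image_path):
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file.
    return ["tesseract", str(image_path), "stdout"]


def read_selection_aloud():
    """Screenshot a selected area, OCR it, and speak the recognized text."""
    with tempfile.TemporaryDirectory() as tmp:
        image = Path(tmp) / "selection.png"
        subprocess.run(screenshot_command(image), check=True)
        if not image.exists():
            # The selection was cancelled, so there is nothing to read.
            return ""
        ocr = subprocess.run(
            ocr_command(image), check=True, capture_output=True, text=True
        )
    text = ocr.stdout.strip()
    if text:
        # Pass the text on stdin so that text starting with "-" is not
        # mistaken for a command-line option.
        subprocess.run(["espeak", "--stdin"], input=text, text=True, check=True)
    return text
```

Binding read_selection_aloud() to a keyboard shortcut would give roughly the draw-a-box-and-listen workflow described above.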
