PDF OCR on the Linux command line

2/10/2024

One of the few tasks I have not been able to do on Linux since I switched over from Windows more than a decade ago is optical character recognition (OCR) of PDF documents. Most of my PDFs were digital documents to begin with, and their text is readily selectable. However, the occasional need arises when I either have to scan something myself or I receive a document that does not have selectable text and is just an image. Up until now, I have kept a software package on a Windows virtual machine (in VirtualBox) specifically to OCR PDFs on the rare occasion when I need to do that. But I think I can safely move past that thanks to recent advances in OCR on Linux.

I recently scanned a chapter I wrote in a book. The scan looked good (especially after I used Scan Tailor’s Dewarping feature to flatten the pages). I then converted the TIF files from Scan Tailor into PDF files, put them in the correct order, and was ready to OCR them in the software I used in Windows. However, my virtual machine was giving me some issues and required me to install some updates that were going to take a while (’cause, Windows!). I got the updates started, then realized that I hadn’t checked to see if any progress had been made on OCR on Linux for quite a while (probably a couple of years). A quick Google search landed me on Stack Exchange (where I seem to spend a lot of time these days). There, I found two new options for OCR on Linux. The first option was a command line program called “ocrmypdf.” That sounds like a dream! I quickly installed it on my Kubuntu machine:

$ sudo apt install ocrmypdf

A number of additional packages were installed as well. Once it was installed, I gave it a whirl. Navigate to the directory where you have the PDF you want to have recognized, then type in the following:

$ ocrmypdf input.pdf output.pdf

My initial PDF (on the left below) was 14 MB in size and looked fine, but I couldn’t select the text; it was just an image. With a quick command, I ran it through the “ocrmypdf” program and got out a nearly identical PDF that was smaller (just 9 MB) and allowed me to select the text (image on the right below).

Original PDF on the left, OCR PDF on the right.
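If you have more than one scanned PDF to process, the single `ocrmypdf input.pdf output.pdf` command from this post can be wrapped in a small loop. What follows is only a minimal sketch, not something from my original workflow: the function names `ocr_output_name` and `ocr_all_pdfs` and the `_ocr` suffix are my own illustrative choices, and the only ocrmypdf feature it relies on is the plain input/output invocation.

```shell
#!/usr/bin/env bash
# Sketch: run ocrmypdf on every PDF in a directory.
# The "_ocr" suffix and both function names are hypothetical, chosen for illustration.

# Build the output file name: "scan.pdf" -> "scan_ocr.pdf".
ocr_output_name() {
    local input="$1"
    printf '%s_ocr.pdf\n' "${input%.pdf}"
}

# OCR each PDF in the given directory (defaults to the current directory).
ocr_all_pdfs() {
    local dir="${1:-.}"
    local pdf out
    # Bail out early if ocrmypdf is not installed.
    if ! command -v ocrmypdf >/dev/null 2>&1; then
        echo "ocrmypdf not found; install it first (e.g. sudo apt install ocrmypdf)" >&2
        return 1
    fi
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                     # glob matched nothing
        case "$pdf" in *_ocr.pdf) continue ;; esac    # skip outputs of an earlier run
        out="$(ocr_output_name "$pdf")"
        # Same invocation as in the post: input file first, output file second.
        ocrmypdf "$pdf" "$out" || echo "failed: $pdf" >&2
    done
}
```

To try it, save the functions to a file, source it in your shell, and call `ocr_all_pdfs` with the directory that holds your scans. The originals are left untouched, so you can compare file sizes and text selection before deleting anything.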
