Reading PDF text using Elixir
Let’s say you want a user to be able to upload a PDF, and your system then needs to understand the text in the file they uploaded. The good news is that there’s an Elixir solution for that: OCR, or Optical Character Recognition, which finds printed or handwritten text characters inside an image.
To quickly test how this works in Elixir we will take three steps:
- Install the binaries for tesseract (an OCR engine).
- Include the tesseract-ocr-elixir lib in your dependencies.
- Test the package’s functionality using IEx.
1. Installing Tesseract
If you’re using a Mac you can install tesseract using Homebrew:
brew install tesseract
If not, the Tesseract website has more options for installation.
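Before moving on, it can help to confirm that the binary is actually reachable from Elixir. The following is a minimal sketch using only the standard library; the module name TesseractCheck is a hypothetical helper for illustration, not part of any package.

```elixir
defmodule TesseractCheck do
  # Hypothetical helper (not part of tesseract_ocr).
  # Returns {:ok, first_line_of_version_output} when the binary is on PATH,
  # or {:error, :not_found} otherwise.
  def version(binary \\ "tesseract") do
    case System.find_executable(binary) do
      nil ->
        {:error, :not_found}

      path ->
        # Some builds print version info to stderr, so merge both streams.
        {output, _exit_status} =
          System.cmd(path, ["--version"], stderr_to_stdout: true)

        first_line =
          output
          |> String.split("\n", trim: true)
          |> List.first()

        {:ok, first_line}
    end
  end
end
```

Calling TesseractCheck.version() in IEx should return an :ok tuple once the install succeeded; an {:error, :not_found} result usually means the binary is not on your PATH.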
2. Add the tesseract-ocr-elixir lib to dependencies
In your application, add the tesseract-ocr-elixir library to deps in mix.exs. This is an Elixir wrapper around Tesseract.
def deps do
  [
    {:tesseract_ocr, "~> 0.1.5"}
  ]
end
To install the new dependency, run mix deps.get.
3. Test the library functionality
For a quick test, start your application with iex -S mix or, if you’re using Phoenix, iex -S mix phx.server.
Now you can test the library by calling the read function, which returns any text OCR finds as a string:
iex> TesseractOcr.read("test/resources/testocr.pdf")
"test pdf content"
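For the upload scenario described at the start, you would probably want to wrap this call rather than use it directly. Here is a minimal sketch, assuming only the TesseractOcr.read/1 call shown above; the OcrText module, its return tuples, and the injectable reader argument are hypothetical choices made for illustration.

```elixir
defmodule OcrText do
  # Hypothetical wrapper (not part of tesseract_ocr).
  # `reader` defaults to TesseractOcr.read/1 and can be swapped out in tests.
  def extract(path, reader \\ &TesseractOcr.read/1) do
    if File.regular?(path) do
      # Trim the whitespace that OCR output often carries around the text.
      text = path |> reader.() |> String.trim()

      if text == "" do
        {:error, :no_text_found}
      else
        {:ok, text}
      end
    else
      {:error, :file_not_found}
    end
  end
end
```

Passing the reader as an argument keeps the wrapper testable without the Tesseract binary installed. Note that Tesseract itself reads image formats such as PNG and TIFF, so if read fails on a PDF in your setup, rasterizing the pages to images first (for example with Poppler’s pdftoppm) is a common workaround.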