How to run Tesseract on a GIF file in Linux

29/12/2020
Tesseract is an OCR (Optical Character Recognition) system, and among the best ones. OCR software can extract text from images and scanned documents (including handwriting, if you train it). An OCR system is useful for many tasks, such as counting words in scanned documents, automatic transcription and converting characters from image to text.

LinuxHint already published a tutorial explaining how to install and understand Tesseract’s training.

This tutorial shows Tesseract’s installation process on Debian/Ubuntu systems but won’t go into its training functionality. If you aren’t familiar with this software, the article mentioned above may be a good introduction. We will then show you how to process a GIF image with Tesseract to get the text out of it.

Tesseract installation:

Run as root, or prefix the command with sudo:

apt install tesseract-ocr

Now you need to install ImageMagick, which is an image converter:

apt install imagemagick
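
Before going on, it can help to confirm that both programs are actually on the PATH. The following is only a minimal sketch: check_tools is a name made up for this example, and it assumes that the ImageMagick package on your system provides the convert command used later in this tutorial.

```shell
#!/bin/sh
# check_tools is a hypothetical helper, not part of Tesseract or ImageMagick.
# It prints one line per command and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
            missing=1
        fi
    done
    return "$missing"
}

# tesseract comes from tesseract-ocr, convert from imagemagick.
check_tools tesseract convert || echo "Install the missing package(s) before continuing."
```

If either line says “missing”, repeat the corresponding apt install step.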

Once both are installed we can test Tesseract. For the test I found a GIF licensed for reuse.

Now let’s see what happens when we run tesseract on the GIF image:

tesseract 2002NY40.gif 1result

Tesseract appends .txt to the output name, so view the result with less:

less 1result.txt
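
For a quick look without creating a file, Tesseract also accepts stdout as the output name and prints the recognized text to the terminal. The sketch below wraps that in a small function; ocr_preview is a hypothetical name chosen for this example, and the file check is an addition of the sketch, not something Tesseract requires.

```shell
#!/bin/sh
# ocr_preview is a hypothetical helper: it prints the recognized text of one
# image to the terminal instead of writing a .txt file.
ocr_preview() {
    image=$1
    if [ ! -f "$image" ]; then
        echo "ocr_preview: no such file: $image" >&2
        return 1
    fi
    # "stdout" as the output name tells tesseract to print the text.
    tesseract "$image" stdout
}

# Example with the image used above; skipped if the file is not present.
if [ -f 2002NY40.gif ]; then
    ocr_preview 2002NY40.gif
fi
```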

Here is the image with its text:

In this case Tesseract’s default settings are pretty accurate; such accuracy usually requires training. Let’s try another free image I found on Wikimedia Commons. After downloading it, run:

tesseract Actualizar_GNULinux_Terminal_apt-get.gif 2result

Now check the file’s content.

less 2result.txt


That was the result, while the original image’s content was:

To improve the character recognition we have many options and steps to follow, which were detailed in our previous tutorial: border removal, noise removal, size optimization and page rotation, among other functions like cropping.

For this tutorial we’ll use textcleaner, a script from Fred’s ImageMagick Scripts.

Download the script and run:

./textcleaner -g -e stretch -f 25 -o 10 -s 1 Actualizar_GNULinux_Terminal_apt-get.gif test.gif

Note: before running the script, make it executable by running “chmod +x textcleaner”. Root or sudo is only needed if the file does not belong to your user.

Where:

textcleaner: calls the program

-g: Convert the image to grayscale

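-e: enhance, the brightness enhancement applied before cleaning (stretch in this example)

-f: filtersize, the size of the filter used to clean the background

-o: offset, the filter offset in percent used to reduce noise

-s: sharpamt, the amount of pixel sharpening to be applied to the result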

For information and examples of use with textcleaner visit http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

As you can see, textcleaner changed the background color, increasing the contrast between the font and the background.

If we run tesseract on the cleaned image, the result will probably be different:

tesseract test.gif testoutput

less testoutput.txt

As you can see, the result improved considerably, even though it isn’t fully accurate.
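
To see exactly what changed between the run on the original image and the run on the cleaned one, one option is to compare the two result files. This is only a sketch: compare_ocr is a hypothetical helper name, and it assumes the earlier results are still in 2result.txt and testoutput.txt.

```shell
#!/bin/sh
# compare_ocr is a hypothetical helper: it reports the word count of two OCR
# result files and then prints a unified diff of their contents.
compare_ocr() {
    before=$1
    after=$2
    # tr strips the padding some wc implementations add around the number.
    echo "words before: $(wc -w < "$before" | tr -d ' ')"
    echo "words after: $(wc -w < "$after" | tr -d ' ')"
    # diff exits with 1 when the files differ, which is expected here.
    diff -u "$before" "$after" || true
}

# Skipped unless both result files from the steps above exist.
if [ -f 2result.txt ] && [ -f testoutput.txt ]; then
    compare_ocr 2result.txt testoutput.txt
fi
```

A higher word count does not prove better accuracy by itself; the diff is what shows which lines were recognized differently.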

The convert command provided by ImageMagick lets us extract the frames of a GIF image so they can be processed later by Tesseract. This is useful if there is extractable content in different frames of the GIF image.

The syntax is simple:

convert <image.gif> <output.jpg>

The command generates as many files as there are frames in the GIF; in the example above the results would be output-0.jpg, output-1.jpg, output-2.jpg, etc.
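
Two details may matter for OCR. Animated GIFs are often stored so that later frames contain only the region that changed, and ImageMagick’s -coalesce option rebuilds every frame as a full image. PNG is also lossless, so it avoids JPEG artifacts around the text. The sketch below assumes you want both; extract_frames and the default “frame” prefix are names chosen for this example only.

```shell
#!/bin/sh
# extract_frames is a hypothetical helper: it writes every frame of a GIF
# to its own PNG file (prefix-0.png, prefix-1.png, ...).
extract_frames() {
    gif=$1
    prefix=${2:-frame}
    if [ ! -f "$gif" ]; then
        echo "extract_frames: no such file: $gif" >&2
        return 1
    fi
    # -coalesce expands every frame to the full canvas.
    # %d in the output name is replaced by the frame number.
    convert "$gif" -coalesce "${prefix}-%d.png"
}

# Example with the image used earlier; skipped if the file is not present.
if [ -f Actualizar_GNULinux_Terminal_apt-get.gif ]; then
    extract_frames Actualizar_GNULinux_Terminal_apt-get.gif frame
fi
```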

Then you can process them with tesseract in a loop. Tesseract overwrites its output file on every run, so to collect the text of all frames in a single file, print each result to stdout and append it:

for i in output-*.jpg; do tesseract "$i" stdout >> outputresult.txt; done
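
Two things are worth keeping in mind with such a loop: a wildcard like output-* expands in alphabetical order, so output-10.jpg comes before output-2.jpg, and tesseract overwrites an existing output file instead of appending to it. The sketch below walks the frames by number and appends each frame’s text to one file under a separator line. ocr_frames and the separator format are assumptions of this example, not part of Tesseract.

```shell
#!/bin/sh
# ocr_frames is a hypothetical helper: it runs tesseract on prefix-0.jpg,
# prefix-1.jpg, ... in numeric order and collects the text in one file.
ocr_frames() {
    prefix=$1      # e.g. "output" for output-0.jpg, output-1.jpg, ...
    result=$2      # file that collects the text of all frames
    : > "$result"  # start with an empty result file
    n=0
    while [ -f "${prefix}-${n}.jpg" ]; do
        echo "=== frame $n ===" >> "$result"
        # "stdout" makes tesseract print the text so it can be appended.
        tesseract "${prefix}-${n}.jpg" stdout >> "$result"
        n=$((n + 1))
    done
    echo "processed $n frame(s)"
}

# Example with the frames extracted above; skipped if there are none.
if [ -f output-0.jpg ]; then
    ocr_frames output outputresult.txt
fi
```

The loop stops at the first missing number, so it assumes the frames are numbered without gaps, as convert produces them.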

ImageMagick has a huge variety of options for optimizing images and there is no generic recipe; for each kind of scenario you should read the convert command’s man page.

I hope you found this tutorial on Tesseract useful.
