Install Tesseract OCR on Linux

28/12/2020

Tesseract: A free OCR solution

Introduction

Tesseract is considered one of the best OCR solutions available. It was originally developed by Hewlett-Packard in C and C++ between 1985 and 1998, and Google began sponsoring it in 2006. The system can identify even handwriting, it can be trained to increase its accuracy, and it is among the most developed and complete engines on the market.

It easily beats commercial competitors like ABBYY. If you are looking for a serious OCR solution, Tesseract is the most accurate one, but don’t expect massive throughput out of the box: it uses one core per process, which means an 8-core processor will be able to process 8 images simultaneously, or 16 if hyperthreading is counted.
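
The one-process-per-core model can be sketched as a small batch helper. This is only an illustration under assumptions: the function name ocr_batch, the PNG-only filter, and the directory arguments are hypothetical and not part of the setup described in this article.

```shell
# ocr_batch IN_DIR OUT_DIR: OCR every PNG in IN_DIR, one tesseract process per core.
# The function name and the PNG-only filter are assumptions for this sketch.
ocr_batch() {
    in_dir="$1"
    out_dir="$2"
    # Number of logical cores (hyperthreads included); fall back to 1 if nproc is missing.
    n_jobs="$(nproc 2>/dev/null || echo 1)"
    mkdir -p "$out_dir"
    # -print0 and -0 keep file names with spaces intact; -P runs up to $n_jobs processes at once.
    find "$in_dir" -maxdepth 1 -type f -name '*.png' -print0 |
        xargs -0 -P "$n_jobs" -I{} sh -c '
            base="$(basename "$1" .png)"
            tesseract "$1" "$2/$base" >/dev/null 2>&1
        ' _ {} "$out_dir"
}
```

Calling ocr_batch ./scans ./text would leave one .txt file per image in ./text, with at most as many tesseract processes running at once as there are logical cores.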

When I used Tesseract, we handled thousands of potential customers uploading handwritten content, images with text, etc. We used 48-core servers, first with DatabaseByDesign and then with AWS, and we never had a resource problem.

We had an uploader which discriminated between text files, such as Microsoft Office or OpenOffice files, and images or scanned documents. The uploader determined whether the OCR engine or the PHP scripts would process an order for text recognition.
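
The routing decision made by such an uploader could look roughly like the sketch below. It is not the original uploader (which is not shown here): the function name classify_upload, the extension lists, and the labels it prints are all assumptions, and a real system would more likely inspect the file contents than trust the extension.

```shell
# classify_upload FILE_NAME: print "text" for office/text documents,
# "ocr" for images, and "unknown" otherwise.
# Hypothetical helper; the extension lists are illustrative, not exhaustive.
classify_upload() {
    # Lowercase the name so that .PNG and .png are treated alike.
    name="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    case "$name" in
        *.doc|*.docx|*.odt|*.rtf|*.txt) echo text ;;
        *.png|*.jpg|*.jpeg|*.tif|*.tiff|*.bmp) echo ocr ;;
        *) echo unknown ;;
    esac
}
```

Files classified as "ocr" would be handed to Tesseract, while "text" files can be word counted directly.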

Tesseract is a great solution, but before committing to it you should know that its latest versions brought big improvements, and some of them mean hard work. While training used to last hours or days, training recent Tesseract versions may take days, weeks, or even months if you are looking for a multilingual OCR solution.


Installing Tesseract 4 on Debian / Ubuntu:

apt-get install tesseract-ocr

If you are using a different Linux distribution, you’ll need to get the latest version from the GitHub repository and copy the .traineddata file for your language into ‘tessdata’ (/usr/share/tesseract-ocr/tessdata or /usr/share/tessdata).

By default Tesseract installs the English language pack. To install additional languages, run:

apt-get install tesseract-ocr-LANG

For example, to add Hebrew:

apt-get install tesseract-ocr-heb

You can include all languages by running:

apt-get install tesseract-ocr-all
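
To avoid reinstalling packs that are already present, the installed languages can be checked first. The sketch below relies on the tesseract --list-langs option, which prints a header line followed by one language code per line; the helper name has_lang is hypothetical.

```shell
# has_lang CODE: succeed if the language pack CODE (e.g. heb) is installed.
# Hypothetical helper built on "tesseract --list-langs".
has_lang() {
    # Some versions print the list on stderr, so merge both streams.
    # grep -x matches whole lines only, so "he" does not match "heb".
    tesseract --list-langs 2>&1 | grep -qx "$1"
}
```

A typical use would be: has_lang heb || apt-get install tesseract-ocr-heb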

For Tesseract to work well, we will also use the “convert” command provided by ImageMagick, which converts between image formats and can also resize, blur, crop, despeckle, dither, draw on, flip, join, and re-sample an image, among much more.

Let’s install ImageMagick with apt-get:

apt-get install imagemagick

Now let’s test Tesseract. Find an image containing text and run:

tesseract [image_name] [output_file_name]

If installed properly, Tesseract will extract the text from the image and write it to [output_file_name].txt.
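
For quick checks it can be handy to print the recognized text directly instead of writing a file. A minimal sketch, assuming the special output name stdout and the -l language option of the tesseract command line; the wrapper name ocr_text is hypothetical:

```shell
# ocr_text IMAGE [LANG]: print the recognized text on standard output.
# LANG defaults to eng; any other code must have its language pack installed.
ocr_text() {
    # "stdout" as the output base makes tesseract print instead of writing a .txt file.
    tesseract "$1" stdout -l "${2:-eng}" 2>/dev/null
}
```

For example, ocr_text page.png heb would print the Hebrew text found in page.png.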

When I worked with Tesseract, all we needed was to count the words in documents. As with any other program, you can, and must, tune it: in Word we can define which symbols are counted, whether numbers are counted, and so on, and the same goes for Tesseract.
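
As an illustration of that kind of counting rule applied to OCR output, here is a small sketch. It is not the counting logic we used at the time; the helper name count_words and the --skip-numbers switch are hypothetical.

```shell
# count_words [--skip-numbers]: count the words read from standard input.
# With --skip-numbers, tokens made only of digits are not counted.
count_words() {
    if [ "$1" = "--skip-numbers" ]; then
        # One token per line, drop purely numeric tokens, count what is left.
        tr -s '[:space:]' '\n' | grep -v -E '^[0-9]+$' | grep -c .
    else
        # tr removes the padding some wc implementations add.
        wc -w | tr -d ' '
    fi
}
```

It reads text on standard input, so Tesseract's output can be piped into it, for example: tesseract page.png stdout | count_words --skip-numbers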

We can also train its sensitivity to specific images.


Tesseract Optimization:

Size Optimization: According to official sources, the optimal resolution for an image to be processed successfully by Tesseract is 300 DPI. Lower-resolution images should be brought up to this DPI before processing, for example by resampling them with ImageMagick or by telling Tesseract 4 the resolution with its --dpi option. Increasing the DPI will also increase the processing time.
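
One way to bring a low-resolution scan up to 300 DPI is to enlarge it by the ratio between the target and the current resolution. The sketch below assumes you already know the DPI the image was scanned at; the helper names scale_percent and upscale_to_300dpi are hypothetical, while -resize, -units and -density are standard ImageMagick options.

```shell
# scale_percent CURRENT_DPI: percentage by which to resize an image to reach 300 DPI.
scale_percent() {
    # Integer arithmetic: 150 DPI -> 200 (%), 300 DPI -> 100 (%).
    echo $(( 300 * 100 / $1 ))
}

# upscale_to_300dpi IN OUT CURRENT_DPI: resize IN and tag the result as 300 DPI.
upscale_to_300dpi() {
    pct="$(scale_percent "$3")"
    convert "$1" -resize "${pct}%" -units PixelsPerInch -density 300 "$2"
}
```

For example, upscale_to_300dpi scan.png scan300.png 150 doubles the pixel dimensions of a 150 DPI scan. Upscaling cannot restore detail that was never captured, so rescanning at 300 DPI is preferable when possible.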

Page rotation: If the page wasn’t properly rotated when scanned and is off by 180° or 45°, Tesseract’s accuracy will decrease. You can use this Python script to automatically detect and fix rotation issues.
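
As an alternative sketch that stays on the command line, Tesseract's own orientation and script detection mode (--psm 0) can report the rotation needed. This assumes the osd traineddata file is installed, and it only detects rotations in steps of 90°, so a 45° skew needs a different tool. The helper names are hypothetical.

```shell
# detect_rotation IMAGE: print the clockwise rotation (0, 90, 180 or 270)
# that tesseract's orientation detection suggests for IMAGE.
detect_rotation() {
    # --psm 0 prints a short report that includes a line such as "Rotate: 180".
    tesseract "$1" stdout --psm 0 2>/dev/null | awk -F': ' '/^Rotate:/ {print $2}'
}

# fix_rotation IN OUT: write a copy of IN rotated by the detected angle.
fix_rotation() {
    angle="$(detect_rotation "$1")"
    # Fall back to 0 degrees if detection printed nothing.
    convert "$1" -rotate "${angle:-0}" "$2"
}
```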

Border Removal: According to Tesseract’s official documentation, borders can erroneously be picked up as characters, especially dark borders and borders with varying gradation. Removing borders may be a good step towards achieving maximum accuracy with Tesseract.

Removing Noise: According to Tesseract’s documentation, noise “is random variation of brightness or colour in an image”. We can remove it in the binarization step, which converts the image to pure black and white.
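
The border and noise steps above can be sketched with ImageMagick's convert. The values used here (a 20-pixel border and a 50% threshold) are assumptions that need tuning for each kind of scan, and the helper name clean_scan is hypothetical.

```shell
# clean_scan IN OUT: grayscale, cut the border, reduce speckle noise and binarize.
# The 20x20 border and the 50% threshold are illustrative values, not recommendations.
clean_scan() {
    convert "$1" \
        -colorspace Gray \
        -shave 20x20 \
        -despeckle \
        -threshold 50% \
        "$2"
}
```

The cleaned file can then be passed to tesseract in place of the raw scan.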


Training Tesseract:

While most tutorials cover only Tesseract’s installation, I will summarize how to train your OCR system; here we can find a tutorial for all versions. In this article I’ll cover Tesseract 4, which includes a new “neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.”

Before continuing we will need to install additional libraries:

sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev

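And we will install the training tools by running the following within the Tesseract source directory (cloned in the step below, and already configured with ./autogen.sh and ./configure):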

make
make training
sudo make training-install

According to Tesseract’s official wiki, we currently have 3 options to train our OCR system:

  • “Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.
  • Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn’t work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.
  • Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.

While the above options may sound different, the training steps are actually almost identical, apart from the command line, so it is relatively easy to try it all ways, given the time or hardware to run them in parallel.”

In this tutorial, we will only run the tesstrain.sh script, which calls the necessary programs to train a specific language.

First of all, let’s clone the repository into /usr/share/tesseract-ocr:

git clone https://github.com/tesseract-ocr/tesseract

Go to /usr/share/tesseract-ocr/tesseract/training (inside the repository we just cloned; the script’s location may differ between releases) and run:

./tesstrain.sh --lang heb --langdata_dir /usr/share/tesseract-ocr/langdata --tessdata_dir /usr/share/tesseract-ocr/tessdata

Replace “heb” with the language you want to train, and edit the paths to point to your data.
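
For a multilingual setup the same command has to be repeated per language. A minimal sketch of such a loop follows; the function name train_langs and the LANGDATA_DIR and TESSDATA_DIR variables are hypothetical, and their defaults are simply the paths used in the command above.

```shell
# train_langs TRAIN_DIR LANG...: run tesstrain.sh once per language code.
# LANGDATA_DIR and TESSDATA_DIR are optional overrides (hypothetical names).
train_langs() {
    train_dir="$1"
    shift
    for lang in "$@"; do
        echo "Training $lang ..."
        # The subshell keeps the caller's working directory unchanged.
        ( cd "$train_dir" && ./tesstrain.sh \
            --lang "$lang" \
            --langdata_dir "${LANGDATA_DIR:-/usr/share/tesseract-ocr/langdata}" \
            --tessdata_dir "${TESSDATA_DIR:-/usr/share/tesseract-ocr/tessdata}" ) || return 1
    done
}
```

For example: train_langs /usr/share/tesseract-ocr/tesseract/training heb ara. The loop stops at the first language whose training fails; since each run can take a long time, running it inside a terminal multiplexer is advisable.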

Within the directory /usr/share/tesseract-ocr/tesseract/training you will find the file language-specific.sh, which is useful for adding rules for specific languages.


Troubleshooting

Tesseract is, to me, the best OCR solution, but it recently made huge changes from past versions, and many users are complaining about changes or things which no longer work. I wouldn’t worry, since the changes seem to give great results. Tesseract’s community is very active; in case you find problems running Tesseract, become part of Tesseract’s community here.
