Nhập mã khuyến mãi ONETCOMVN được giảm 10%

Tài Khoản

Install Tesseract OCR on Linux

28/12/2020

đang xem

Chưa phân loại

Tesseract: A free OCR solution

Introduction

Tessereact is considered one of the best OCR solutions available. Since 2006 it is sponsored by Google, previously it was developed by Hewlett Packard in C and C++ between 1985 and 1998. The system is capable to identify even handwriting, it can learn increasing it’s accuracy, and is among the most developed and complete in the market.

It easily beats commercial competitors like ABBY, if you are looking for a serious solution for OCR, Tesseract is the most accurate one, but don’t expect for massive solutions: it uses a core per process, which means a 8 core processor (hyperthreading accepted) will be able to process 8 or 16 images simultaneously.

When I used Tesseract we managed thousands of potential customers uploading handwritten content, images with text, etc. We used 48 core servers, with DatabaseByDesign and then with AWS, we never had a resources problem.

We had an uploader which discriminated between text files like Microsoft Office or Open Office files and images or scanned documents. The uploader determined whatever the OCR or PHP scripts would process an order, in the field of text recognition.

Tesseact is a great solution, but before thinking about it you must know, last Tesseract’s versions brought big improvements, some of them mean hard work. While training could last for hours or days, recent Tesserct’s versions training may be of days, weeks, or even months if you are looking for a multilingual OCR solution.

Installing Tesseract 4 on Debian / Ubuntu:

apt-get install tesseract-ocr

If you are using a different Linux distribution, you’ll need to copy the last github repository version and copy the .traineddata file into ‘tessdata’ (/usr/share/tesseract-ocr/tessdata or /usr/share/tessdata).

By default Tesseract will install the English language pack, to install additional languages run

apt-get install tesseract-ocr-LANG

for example, to add Hebrew:

apt-get install tesseract-ocr-heb

You can include all languages by running:

apt-get install tesseract-ocr-all

In order for Tesseract to work properly, we will need to use the command “convert” (convert between image formats as well as resize an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more) provided by Imagemagick:

Lets install imagemagick with apt-get:

apt-get install imagemagick

Now let’s test Tesseract, find an image containing text and run:

tesseract [image_name] [output file_name]

If installed properly, Tesseract will extract the text from the image.

When I worked with Tesseract, all we needed was to word count documents. Like with any other program you can, and must, train it, in Word we can define some symbols which can be counted or not, if to count or not numbers, etc. the same with Tesseract.

We can also train it’s sensibility to specific images.

Tesseract Optimization:

Size Optimization: According to official sources, the optimal pixels size for an image to be processed successfully by Tesseract is 300DPI. We’ll need to process any image using the -r parameter to enforce this DPI. Increasing the DPI will also increase the processing time.

Page rotation: If when scanned the page wasn’t properly rotated and stays 180° or 45°, Tesseract’s accuracy will decrease, you can use this Python script to automatically detect and fix rotation issues.

Border Removal: According to Tesseract’s official man, borders can erroneously be picked as characters, especially dark borders and where there is gradation variety. Removing borders may be a good step to achieve the maximal accuracy with Tesseract.

Removing Noise: According to Tesseracts, noise “is random variation of brightness or colour in an image”. We can remove it in the binarization step, which means polarizing it’s colors.

Training Tesseract:

While most of tutorials cover only Tesseract’s installation, I will summarize how to train your OCR system, here we can find a tutorial for all versions. In this article I’ll summarize how to train Tesseract 4 which includes a new “neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages however, it may actually be faster than base Tesseract.”

Before continuing we will need to install additional libraries:

sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev

And we will install the training tools by running, within the Tesseract directory:

make
make training
sudo make training-install

According to Tesseract’s official wiki, we have 3 current options to train our OCR system:

“Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.
Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn’t work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.
Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.

While the above options may sound different, the training steps are actually almost identical, apart from the command line, so it is relatively easy to try it all ways, given the time or hardware to run them in parallel.”

In this tutorial, we will only run the tesstrain.sh script which will call necessary programs to train a specific language.

First of all lets clone all the files within our /usr/share/tesseract-ocr:

git clone https://github.com/tesseract-ocr/tesseract

Go to /usr/share/tesseract-ocr/tesseract/training (Tesseract’s default installation directory) and run:

  $ ./tesstrain.sh --lang heb --langdata_dir /usr/share/tesseract-ocr/langdata --tessdata_dir /usr/share/tesseract-ocr/tessdata

Change “heb” for the language you want to train, and also edit the path to your data.

Within the directory /usr/share/tesseract-ocr/tesseract/training you will find the file language-specific.sh useful to add rules for specific languages.

Troubleshooting

Tesseract is to me the best OCR solution, but recently it made huge changes from the past versions and many users are complaining about changes or things which are no longer working, I wouldn’t worry since the changes seem to give great results. Tesseract’s community is very active, in case you find problems running tesseract, become part of Tesseract’s community here.

ONET IDC

ONET IDC thành lập vào năm 2012, là công ty chuyên nghiệp tại Việt Nam trong lĩnh vực cung cấp dịch vụ Hosting, VPS, máy chủ vật lý, dịch vụ Firewall Anti DDoS, SSL… Với 10 năm xây dựng và phát triển, ứng dụng nhiều công nghệ hiện đại, ONET IDC đã giúp hàng ngàn khách hàng tin tưởng lựa chọn, mang lại sự ổn định tuyệt đối cho website của khách hàng để thúc đẩy việc kinh doanh đạt được hiệu quả và thành công.

Chia sẻ

Bài Viết Mới

Điều khoản dịch vụ” (Terms of Service)

Hướng dẫn fake ip bằng phần mềm SStap

VPS treo game là gì? Thuê VPS treo game giá rẻ, không lo giật lag

BitBrowser – Best Anti-Detect Browser!

Dịch Vụ Xây Dựng Hệ Thống Peering Với Internet Exchange (IXP)

Dịch Vụ Triển Khai VPN Site-to-Site & Remote Access

Dịch Vụ Thiết Lập Hệ Thống Tường Lửa (Firewall)

Dịch Vụ Triển Khai Hệ Thống Ảo Hóa & Cloud

Dịch Vụ Triển Khai Hệ Thống Ceph

Dịch Vụ Triển Khai Hệ Thống BGP Multi-Peer Cho ISP

Bài Viết

Bài Viết Mới Cập Nhật

Điều khoản dịch vụ” (Terms of Service)

ĐIỀU KHOẢN DỊCH VỤ (Áp dụng cho dịch vụ VPS, Hosting, Proxy) Cập nhật lần cuối: [ngày/tháng/năm] 1. Giới thiệu Khi sử...

05/07/2025

Hướng dẫn fake ip bằng phần mềm SStap

Hướng dẫn Tải và cài đặt Các bạn vào Google gõ từ khóa “Download SStap” hoặc vào sẵn link https://sourceforge.net/projects/sstap/files/latest/download Sau...

10/06/2025

VPS treo game là gì? Thuê VPS treo game giá rẻ, không lo giật lag

Bạn đam mê những tựa game online và muốn cày cuốc không ngừng nghỉ, nhưng chiếc máy tính cá nhân lại không đủ “trâu”...

02/06/2025

BitBrowser – Best Anti-Detect Browser!

Good anti association effect, complete browser fingerprint modification, affordable price! Please recommend it to friends around you! BitBrowser – anti detect browser, Dorang Account Defense Association ⚙️ Function: – RPA automation – API script – Extended plug -in – Window synchronization – Support Global Proxy IP Used for: Capital monetization, crypto，E-commerce, Social Media Marketing, Shopping Price Comparison, Price Comparison, Advertising, Alliance Marketing, Agency Operation, Self-testing etc. ♾️10 Profiles for Free ♾️ Free registration link：https://www.bitbrowser.net/vi/?code=5df4f4ec WhatsApp service group : https://chat.whatsapp.com/FCQaHfHbR351GIje98OIA9 Technical service group : https://t.me/bitbrowser000

26/05/2025

Dịch Vụ Xây Dựng Hệ Thống Peering Với Internet Exchange (IXP)

Peering với Internet Exchange (IXP) là giải pháp quan trọng giúp tăng tốc độ kết nối, giảm độ trễ, tối ưu chi phí băng thông...

04/04/2025

Bài Viết Mới

Điều khoản dịch vụ” (Terms of Service)

Hướng dẫn fake ip bằng phần mềm SStap

VPS treo game là gì? Thuê VPS treo game giá rẻ, không lo giật lag

BitBrowser – Best Anti-Detect Browser!

Dịch Vụ Xây Dựng Hệ Thống Peering Với Internet Exchange (IXP)

Dịch Vụ Triển Khai VPN Site-to-Site & Remote Access

Dịch Vụ Thiết Lập Hệ Thống Tường Lửa (Firewall)

Dịch Vụ Triển Khai Hệ Thống Ảo Hóa & Cloud

Dịch Vụ Triển Khai Hệ Thống Ceph

Dịch Vụ Triển Khai Hệ Thống BGP Multi-Peer Cho ISP

Hotline/Zalo

09.016.19.525

Nhận chương trình khuyến mãi từ ONET IDC

72 Lê Thánh Tôn, P.Bến Nghé, Quận 1, TP HCM

1001 S MAIN ST STE 600 KALISPELL, MT 59901

Điện thoại: 09.016.19.525

Email liên hệ:

[email protected]

Install Tesseract OCR on Linux

Tesseract: A free OCR solution

Introduction