Training Tesseract OCR for Tamil

Tesseract is a widely used open source OCR tool. It works very well for English and also supports many other languages, but it needs trained data for each new language. Training was recently done for Tamil and the trained data was added to the repository, but it does not work properly with all types of fonts.
So we are exploring Tesseract's training methods and trying to ease the training task for Tamil as well as for other languages.

Here I try to recognize the content of the following image using Tesseract 3.02 with the default Tamil trained data available at the following URL.
https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.tam.tar.gz&can=2&q=

[Image: OCR_Input]

The file name is tam.Monospace.exp0.tiff

$ tesseract tam.Monospace.exp0.tiff output -l tam

Tesseract appends .txt to the output name, so the result is in output.txt:

அறம்கிசயவிரும்பு
ஆறுவது சினம்
இயல்வது கர்சிவல்
ஈவது விலக்கேல்
உலடயது விளம்மபல்

As we can see, the characters are recognized with some errors. The default trained data was created for a specific font, so it may not work perfectly for all characters. Accuracy depends on several criteria:
1.) the shapes of the font's characters
2.) the clarity of the image
3.) the spacing between characters and lines

So here we will train Tesseract on the input image we selected for recognition:

Step 1.) Keep a sample training image – tam.Monospace.exp0.tiff (the name follows the pattern language.fontname.expN)

Step 2.) Create the box file

$ tesseract tam.Monospace.exp0.tiff tam.Monospace.exp0 -l tam batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0

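The above command produces the file tam.Monospace.exp0.box. Each line holds a character followed by the left, bottom, right and top coordinates of its box and the page number. Here are the first few lines of this file: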

அ 21 185 41 196 0
ற 49 182 60 196 0
ம் 68 187 82 201 0
6 88 183 103 203 0
ச 103 187 113 196 0
ய 120 187 134 196 0
வி 142 187 161 202 0
ரு 169 181 185 196 0
ம் 192 187 206 201 0
பு 213 182 225 196 0

This box file is created automatically, so it will contain errors that we should correct manually.

It may have two types of error:
i.) Characters might be selected with the wrong coordinates. This happens mostly on images with low-quality text.
Also, if characters touch each other, two characters may be treated as one.

ii.) Characters may be selected with the right coordinates but mapped to the wrong letters, like the 6 in the listing above.
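Some of these errors can be spotted mechanically before editing by hand. The following is only a sketch, not part of Tesseract: check_box is a hypothetical helper name, and it assumes the six-field line layout seen in the listing above. It flags lines with the wrong number of fields, boxes whose coordinates are empty or inverted, and labels made only of ASCII characters, which are suspicious in a Tamil box file.

```shell
# check_box FILE: print a warning for each box-file line that looks wrong.
# Assumed line layout: <glyph> <left> <bottom> <right> <top> <page>
# LC_ALL=C makes awk compare bytes, so Tamil (multi-byte UTF-8) labels
# never match the printable-ASCII range used below.
check_box() {
    LC_ALL=C awk '
        NF != 6 { printf "line %d: expected 6 fields, got %d\n", NR, NF; next }
        $2 >= $4 || $3 >= $5 { printf "line %d: empty or inverted box\n", NR }
        $1 ~ /^[!-~]+$/ { printf "line %d: ASCII glyph \"%s\" in a Tamil box file\n", NR, $1 }
    ' "$1"
}
```

It would be called as check_box tam.Monospace.exp0.box. It cannot tell whether a Tamil label is the right one for its box, so the manual review is still needed.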

To check the coordinates of a character, use the following online tool:

http://pp19dd.com/tesseract-ocr-chopper/

Here we can upload the image and select each character to get its coordinates.
Note: This is an online tool and it has some limitations for our usage. By default it runs Tesseract on the server side with English trained data to produce the boxes, so its recognition of our Tamil input image will be nonsensical. We don't need to look at that; all we need to do is select the right coordinates for each character. (To avoid all these hassles we are working on an editor that eases all the training steps.)
Then map the coordinates to the right character.

Correct the box file – tam.Monospace.exp0.box
After correction it looks as follows:

அ 21 185 41 196 0
ற 49 182 60 196 0
ம் 68 187 82 201 0

 

Step 3.) Extract the unicharset – this collects the set of characters (as UTF-8) that appear in the box file,
$ unicharset_extractor tam.Monospace.exp0.box
Extracting unicharset from tam.Monospace.exp0.box
Wrote unicharset file ./unicharset.

The above command produces the unicharset file:

30
NULL 0 NULL 0
அ 1 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 #    # அ [b85 ]x
ற 1 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0 #    # ற [bb1 ]x
ம் 1 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 #    # ம் [bae bcd ]x
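To preview which labels the extractor will pick up, the distinct first fields of the box file can be listed. This is a rough sketch only: box_glyphs is a hypothetical helper name, and the real unicharset also carries the NULL entry and the property fields shown above, so this shows the labels and nothing more.

```shell
# box_glyphs FILE: print each distinct glyph label of a box file, one per line.
# The label is the first space-separated field; sorting bytewise (LC_ALL=C)
# keeps the result independent of the system locale.
box_glyphs() {
    cut -d ' ' -f 1 "$1" | LC_ALL=C sort -u
}
```

Running box_glyphs tam.Monospace.exp0.box after correcting the box file is a quick way to notice a stray label before training.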

Step 4.) Trigger the training process – this produces tam.Monospace.exp0.tr
$ tesseract tam.Monospace.exp0.tiff tam.Monospace.exp0 box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0
APPLY_BOXES:
   Boxes read from boxfile:       6
   Found 6 good blobs.
   Leaving 4 unlabelled blobs in 0 words.
TRAINING … Font name = Monospace
Generated training data for 2 words
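The shapeclustering and mftraining commands in the next steps take -F font_properties, a plain-text file that has to exist before they run. As far as I understand the Tesseract 3.x training format, it holds one line per font: the font name followed by five 0/1 flags. The flag values below are an assumption for a plain fixed-pitch font and should be adjusted to the font actually used.

```shell
# Fields: <fontname> <italic> <bold> <fixed> <serif> <fraktur>, each flag 0 or 1.
# "Monospace" has to match the font part of tam.Monospace.exp0.tr.
# Flag values are an assumption: not italic, not bold, fixed pitch,
# no serifs, not fraktur.
printf '%s\n' 'Monospace 0 0 1 0 0' > font_properties
```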

Step 5.) Cluster the character shapes – this writes the shapetable file
$ shapeclustering -F font_properties -U unicharset tam.Monospace.exp0.tr
Reading tam.Monospace.exp0.tr …
Building master shape table
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances…
Stopped with 0 merged, min dist 999.000000
Computing shape distances… 0 1 2 3 4 5
Stopped with 0 merged, min dist 0.308725
Master shape_table:Number of shapes = 6 max unichars = 1 number with multiple unichars = 0

Step 6.) Train the shape features – this writes inttemp, pffmtable and tam.unicharset
$ mftraining -F font_properties -U unicharset -O tam.unicharset tam.Monospace.exp0.tr
Read shape table shapetable of 6 shapes
Reading tam.Monospace.exp0.tr …
Done!

Step 7.) Train the character normalization data – this writes normproto
$ cntraining tam.Monospace.exp0.tr
Reading tam.Monospace.exp0.tr …
Clustering …
Writing normproto …

Step 8.) Rename the output files with the language prefix as follows,
$ mv shapetable tam.shapetable
$ mv normproto tam.normproto
$ mv inttemp tam.inttemp
$ mv pffmtable tam.pffmtable
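The four renames can also be done in a loop that reports any file the earlier steps failed to produce. This is a small sketch, with prefix_outputs as a hypothetical helper name; the file list is taken from the mv commands above.

```shell
# prefix_outputs LANG: rename each training output to LANG.<name>.
# Prints a message for every missing file and returns 1 if any was missing.
prefix_outputs() {
    lang=$1
    rc=0
    for f in shapetable normproto inttemp pffmtable; do
        if [ -f "$f" ]; then
            mv -- "$f" "$lang.$f"
        else
            echo "missing: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Calling prefix_outputs tam in the training directory has the same effect as the four mv commands when all files are present.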

Step 9.) The following command combines all the files and produces the final trained data, tam.traineddata.

$ combine_tessdata tam.

(Note: the language name must end with a . (dot))

Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 540
Offset for type 4 is 125521
Offset for type 5 is 125576
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 126478
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1

Step 10.) Copy tam.traineddata to the following directory
/usr/share/tesseract-ocr/tessdata

Note: Keep a backup of the existing tam.traineddata.
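The backup and the copy can be combined so that the old file is never overwritten without a saved copy. The following is a minimal sketch: install_traineddata is a hypothetical helper name and the .bak suffix is an arbitrary choice.

```shell
# install_traineddata SRC DEST_DIR: copy SRC into DEST_DIR, first saving any
# existing file of the same name as <name>.bak.
# Stops without copying if the backup cannot be written.
install_traineddata() {
    src=$1
    dest="$2/$(basename "$src")"
    if [ -f "$dest" ]; then
        cp -p -- "$dest" "$dest.bak" || return 1
    fi
    cp -- "$src" "$dest"
}
```

For this tutorial the call would be install_traineddata tam.traineddata /usr/share/tesseract-ocr/tessdata. Writing to that directory usually needs root privileges, so the function would have to be run from a root shell. Copying tam.traineddata.bak back over tam.traineddata restores the original data.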

Step 11.) Now try the recognition again with the new trained data,
$ tesseract tam.Monospace.exp0.tiff output2 -l tam
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0

Step 12.) View the result
$ less output2.txt

Now we can see that the text is recognized properly.

 
