HTR Models


To get the massive amount of text in the 19th-century renovated court records into a computer-readable form, we use Transkribus to train HTR models that automatically transcribe the handwritten text in the records. Talking about HTR models can turn into a sea of acronyms pretty quickly, so it is probably best to start with some key terminology, although some of these terms have already come up in previous blog posts.

HTR, or Handwritten Text Recognition, is the technology behind the automated text recognition models in Transkribus. Unlike Optical Character Recognition (OCR), HTR processes text line by line rather than character by character, and it is based on artificial neural networks.

Ground Truth, or GT, refers to the set of human-transcribed pages used to train the HTR models.

CER and WER are used to measure the accuracy of an HTR model: CER stands for character error rate and WER for word error rate. They show the average percentage of characters or words the model has recognized incorrectly.
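For the curious, here is a minimal sketch of how such a rate can be computed: the edit (Levenshtein) distance between the recognized text and the ground truth, divided by the length of the ground truth. Transkribus calculates this internally; the Python below is only an illustration of the metric, not its actual implementation.

```python
# A minimal sketch of the CER metric: edit distance between the
# model's output and the ground truth, as a percentage of the
# ground-truth length. WER is the same idea applied to word tokens.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(recognized: str, ground_truth: str) -> float:
    """Character error rate as a percentage of the ground truth."""
    return 100 * levenshtein(recognized, ground_truth) / len(ground_truth)

# One wrong character in a 16-character line gives a CER of 6.25%.
print(cer("Tingsrätlen höll", "Tingsrätten höll"))  # 6.25
```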

Now, handwriting is not uniform. Every person has their own style of handwriting: some people write neatly, while others have super messy handwriting. On top of that, writing styles change over time, which is why reading old texts usually takes at least some amount of training. So, you need to train a specific HTR model for the material you’re working with, although there are efforts to create more generalized models.

The renovated court records present multiple challenges to HTR models: a time span of almost a century, regional variation, and many different writers. To generate a reliable model for material with this much variation in writers and hands, the ground truth set must be considerably larger than for, say, a model trained to recognize only one person’s hand.

Transkribus’ user manual recommends starting with a GT of about 100 pages (20 000 words) to create an HTR model for one hand. The current models used for recognizing the renovated court records are made with 500 images of GT (about 175 000 words, give or take) from the years 1850 to 1871. We are currently proofreading transcriptions for 500 more images to get an even better generalized model of 19th-century Swedish hands for the court records, especially for the first half of the century.

An HTR model's learning curve
Training an HTR model with Transkribus is a fairly straightforward process, or at least the part that needs human work is. You fill in the basic information about the model (name, language, and description), split your GT into a training set and a test set, and set the number of training epochs, i.e. the number of times the neural network passes over the training data (plus learning rate, noise, and train size per epoch, if you’re working with the older CITlab HTR engine). The test portion of the GT is not used in the training itself; instead it serves as a reference point for calculating the model’s CER during training. A sketch of the split is shown below. After that, it’s all up to the computer. Training a model can take anywhere from hours to days, depending on the amount of training material and the number of training epochs.
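Transkribus handles the split through its interface, but conceptually it amounts to something like the following sketch. The 10% test fraction and the page names are illustrative assumptions here, not Transkribus defaults.

```python
import random

# A conceptual sketch of the GT train/test split described above:
# the test pages never enter training and are used only to compute
# the model's CER. The 90/10 ratio is an illustrative choice.

def split_ground_truth(pages, test_fraction=0.1, seed=42):
    """Shuffle GT pages and reserve a fraction as the test set."""
    shuffled = pages[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

pages = [f"page_{i:03d}" for i in range(500)]  # 500 GT images
train_set, test_set = split_ground_truth(pages)
print(len(train_set), len(test_set))  # 450 50
```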

We have models made with the whole set of 500 pages, as well as test models trained on smaller sets with GT from a single decade only. With the new HTR+ engine, the models’ CERs range from 3% to 5%. When we run the models on pages from the renovated court records that were not part of the models’ GT, the CER of the more general models varies from 3% to 8%, depending on the text they are run on. The single-decade models have a CER of 6% to 9% on texts from their corresponding decades.

It is a bit hard to convey how well a model recognizes text using error rates alone. I don’t think many people can say off the top of their head how legible a text is if I tell them that it has, say, approximately seven incorrect characters for every hundred. So, as a more tangible demonstration, the images below show, from left to right, human-transcribed text, the same text recognized with one of the latest models for the court records with a 3.5% CER, and the text recognized with an older model with a 9.1% CER.
Sample of HTR models run on a page from the court records from the rural district of Åland

This example is, of course, a very, very well performing model and a fairly well performing one run on a page with extremely legible handwriting (the CERs calculated for this page are 3.3% and 12.9%, respectively). As a rule of thumb, if a page is hard for a human to read properly (be it because of the writing style or problems with the image itself), it is also going to be hard for the neural network to decipher, and vice versa.
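For readers without the images at hand, here is a toy sketch of what different error rates can look like on the page. It randomly substitutes letters at a chosen rate; real HTR errors also include insertions and deletions, and the Swedish sample line is invented, so this is only a rough visual aid.

```python
import random

# A toy illustration (not part of Transkribus) of what a given
# character error rate can look like: randomly substitute roughly
# `rate` of the alphabetic characters in a line of text.

def corrupt(text: str, rate: float = 0.07, seed: int = 1) -> str:
    """Replace about `rate` of the letters with random ones."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyzåäö"
    return "".join(
        rng.choice(alphabet) if ch.isalpha() and rng.random() < rate else ch
        for ch in text
    )

line = "häradsrätten sammanträdde till ordinarie vinterting"  # invented sample
print(corrupt(line, rate=0.07))  # ~7% CER: still readable, but noisy
print(corrupt(line, rate=0.20))  # ~20% CER: noticeably harder to read
```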

It remains to be seen (hopefully not for too long, though) how the new set of GT affects the performance of our recognition models.


Kaisa Luhta
