HTR Models - Update

Since our last post about our HTR models, we have continued our efforts to improve the accuracy of the HTR models for the district court records. We have transcribed over 2,000 more images of ground truth (GT), broadened the GT to also include images from the main records, and trained nine experimental models and one final model.

Our final model is trained on approximately 2,700 images, or 1,226,202 words, of training data. To put that word count into perspective, it equals the word count of the entire Harry Potter book series, with enough words left over to write The Philosopher’s Stone again, plus a quarter of The Chamber of Secrets.

Using Transkribus’ compare samples tool, which estimates the likely interval of a model’s CER over the collection the sample is drawn from, we calculated that the final model’s CER on the district court records falls between 5.1% and 8.4% (compared to the 9.3%–13.8% range of the model discussed in the previous blog post). The final model improved results the most on documents from the first half of the century, some of which are written in a hand more typical of the 18th century. These documents had been a big problem for our previous models, which had little or no training data for the older hand.
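For readers unfamiliar with the metric, CER (character error rate) is the edit distance between the ground-truth transcription and the model’s output, divided by the number of ground-truth characters. The sketch below is only an illustration of the metric itself, not Transkribus’ implementation, and the sample strings are hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth: str, prediction: str) -> float:
    """CER = edit distance / number of ground-truth characters."""
    return levenshtein(ground_truth, prediction) / len(ground_truth)

# Hypothetical example: two substituted characters in a 19-character line.
gt = "district court 1850"
pred = "districh court 1350"
print(f"CER: {cer(gt, pred):.1%}")  # 2 errors / 19 characters
```

Transkribus’ compare samples tool goes a step further than this per-line number: because the sample pages are drawn randomly from the collection, it can report a confidence interval for the CER over the whole collection rather than a single point estimate.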

Sample of an 18th century style hand recognized with the model from the previous blog post (left) and with the final model (right).


The final model will be used to process the remaining images of digitized district court records from the 19th century, and the model will hopefully result in more consistent recognition on documents across the century.

Kaisa Luhta
