Keyword spotting – an effective search tool

The subject of making full-text searches possible in district court records was touched on in the blog post about the Turku Book Fair some time ago, and now it’s time to take a deeper dive into exactly how full-text searching works.

The key to the search tool is keyword spotting (KWS). Keyword spotting technology uses confidence matrices to find images that match your search term. The confidence matrices come from the HTR model that is run on the text you want to search: when you run the model on a page, it assigns each letter of the alphabet a value expressing how certain it is that the letter corresponds to a given part of the image (i.e. a section of a word on a row). The transcription the model outputs is essentially the sequence of letters and words that the model is most confident matches the image.
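To make the idea concrete, here is a minimal Python sketch of a confidence matrix for a single word image. It is a simplified illustration, not how any particular HTR or KWS engine is implemented: it assumes exactly one matrix column per letter, and the confidence values, the Finnish example word "talo" and the function names (`best_transcription`, `keyword_confidence`, `spot`) are all made up for this example.

```python
from math import prod

# Toy confidence matrix for one word image (hypothetical values).
# Each entry is one character position; it maps candidate letters to
# the model's confidence that the letter matches that part of the image.
CONFIDENCES = [
    {"t": 0.90, "l": 0.10},
    {"a": 0.55, "o": 0.45},
    {"l": 0.80, "t": 0.20},
    {"o": 0.60, "a": 0.40},
]


def best_transcription(matrix):
    """Return the transcription: the most confident letter at each position."""
    return "".join(max(column, key=column.get) for column in matrix)


def keyword_confidence(matrix, keyword):
    """Score a keyword as the product of its letters' confidences.

    Simplifying assumption: the keyword must have exactly one letter
    per matrix position; otherwise the score is 0.
    """
    if len(keyword) != len(matrix):
        return 0.0
    return prod(column.get(ch, 0.0) for column, ch in zip(matrix, keyword))


def spot(matrix, keyword, threshold=0.1):
    """Report a hit when the keyword's confidence reaches the threshold."""
    return keyword_confidence(matrix, keyword) >= threshold


print(best_transcription(CONFIDENCES))  # the single most confident reading
for word in ("talo", "tola", "lato"):
    print(word, round(keyword_confidence(CONFIDENCES, word), 4), spot(CONFIDENCES, word))
```

In this toy example the transcription keeps only one reading of the image, while the keyword search can also return the image for a second-best reading such as "tola", because its confidence still clears the (arbitrarily chosen) threshold.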

Unfortunately, the search result the model is most confident of is not always the correct one. This is why machine-made transcriptions are not 100% correct, even with models with low character …

