I'm looking for an OCR solution for about 200 pages of text.
It's handwritten German script, about 100 years old, and I can barely read the handwriting myself.
Google Translate sometimes manages to OCR certain parts, but produces nothing useful (I don't need the translation part of GT).
Which solutions out there would be able to recognize old handwritten script?
You can train Tesseract to recognize handwriting[1], but the first and most important step is preprocessing your documents. I would recommend starting with a local adaptive thresholding algorithm[2] such as Sauvola for binarization. The typical preprocessing steps are described in [3].
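For the binarization step specifically, here is a minimal sketch using scikit-image's threshold_sauvola (the filename and the window_size/k values are placeholders you would tune per scan; the Doxa library from [2] is another way to do the same thing):

    # Sauvola binarization sketch (scikit-image); tune window_size and k per scan.
    from skimage import io
    from skimage.filters import threshold_sauvola
    from skimage.util import img_as_ubyte

    page = io.imread("page_001.png", as_gray=True)     # hypothetical scan filename
    thresh = threshold_sauvola(page, window_size=25, k=0.2)
    binary = page > thresh                             # True = background, False = ink
    io.imsave("page_001_bin.png", img_as_ubyte(binary))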
You are probably facing "Sütterlin"[4], which differs quite a bit from modern German handwriting.
In your case (only 200 pages), it might be easier to use template matching[5] to identify similar characters and just "transliterate" the matches into modern printed letters (like an overlay over the original text). That way you would have a quick solution that is still accurate enough to just read along.
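To make the template-matching idea concrete, here is a rough sketch with OpenCV's matchTemplate. The filenames, the 0.7 score threshold, and the idea of stamping the modern letter next to each match are my assumptions, not a finished tool; you would also want some non-maximum suppression for overlapping hits:

    # Find occurrences of one hand-cropped letter on a binarized page and overlay
    # its modern equivalent next to each match.
    import cv2
    import numpy as np

    page = cv2.imread("page_001_bin.png", cv2.IMREAD_GRAYSCALE)
    template = cv2.imread("letter_s.png", cv2.IMREAD_GRAYSCALE)  # cropped Sütterlin "s"
    h, w = template.shape

    scores = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
    for y, x in zip(*np.where(scores >= 0.7)):   # keep reasonably confident matches
        x, y = int(x), int(y)
        cv2.rectangle(page, (x, y), (x + w, y + h), 0, 1)
        cv2.putText(page, "s", (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, 0, 1)

    cv2.imwrite("page_001_overlay.png", page)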
> You are probably facing "Sütterlin"[4], which differs quite a bit from modern German handwriting.
My mother-in-law (and her mother) wrote everything in Sütterlin, and I found it rapidly became pretty easy to read, though that could also be because I only encountered it from a small number of hands.
OTOH I find older handwriting and Fraktur print (we have a bunch of old books in that) basically illegible.
The letters from [4] remind me of modern Russian cursive, which has some similarly interesting changes to some letters to make them faster to write. I wonder if there's any research on Russian cursive OCR that could help.
You probably want to put it in front of an actual person and get them to transcribe it for you. I don't think there's any off-the-shelf OCR that will work particularly well for it.
I have a close family member who is a historian and has frequently read and transcribed mid-19th- to early-20th-century German handwriting for his work.
Many historians and archivists in Germany would have the ability to transcribe this for you if you reached out to them and paid for their time.
Look to the US, not Germany. Many immigrant communities stayed with the old German language and script long after Germany had moved on. My dad was in high school when his church switched to English, after the last German-only member died.
Don't assume this is German just because everyone calls it German. There are/were several "Low German" dialects. My grandpa could never understand native speakers when he toured Germany, because all he knew was 100-year-old German. (He has since died, but based on his stories I suspect he would have had little problem understanding Dutch.)
Archivists in Germany will know how to read various forms of Kurrent.
And yeah, of course it might not be Hochdeutsch. But anyone who can read the script will know pretty quickly if it's Platt, Swiss German, or something else entirely.
This is probably the best option. I can't find it now, but in 2020 on either Slashdot.org or here, there was a project trying to transcribe hundreds or thousands of old British rainfall records.
The researchers made digital scans and posted the images online and had random users around the world transcribe them. They didn't care if a user did one or hundreds. I did about 20-50 before they were finished. What would have taken a paid team years was completed in only a week.
Does anyone know a link to the article that announced it?
I've had surprisingly good results with https://readcoop.eu/transkribus/. I was going back in time with some family research until I couldn't identify a single word anymore. The "AI" could.
I threw some German medical handwriting images into ChatGPT a while back, asked it to transcribe them, and it worked pretty well. ChatGPT knows a lot about language, so that helped in filling in the gaps.
As others have mentioned, Transkribus works pretty well for handwritten text recognition. You can also train your own model if you have enough source material.
If the documents you have are able to be made public, you could upload them to Wikimedia Commons and use https://ocr.wmcloud.org/ — you can use Transkribus via that. (Disclosure: I'm an engineer working on the Wikimedia OCR project.)
I don't know if it will be significantly different from what Google Translate does, but I would give the major cloud vendors' OCR services (Google, Amazon, Microsoft, and I guess OpenAI/ChatGPT) a shot. It's pretty simple and cheap to do (about a dollar for the whole thing). Last time I compared them, Google's OCR came out ahead, but it's task-dependent, so in your case it might be different.
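If you want to try one programmatically, here is a hedged example against Google Cloud Vision's document text detection. It assumes you already have a GCP project and credentials configured; the filename and the German language hint are placeholders:

    # Send one scanned page to Google Cloud Vision and print the recognized text.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open("page_001.png", "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(
        image=image,
        image_context={"language_hints": ["de"]},  # hint that the text is German
    )
    print(response.full_text_annotation.text)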
General purpose open-source OCR solutions like Tesseract, TrOCR, etc will probably not be as good as the cloud ones, based on my experience.
There's some specialized research work out there for antique manuscripts, but that will require some digging on your part with an uncertain outcome. I think at that point, I would also look into manual transcription - for 200 pages, it might be reasonably affordable.
A social option is to look around in ancestry study circles on Facebook for your country. I know we have one or two pretty good groups in Swedish, with mostly older folks helping younger people decipher old handwriting.
I've been scripting GPT-4 Vision to extract structured recipe data from handwritten recipe cards, with very good success. Can't speak to the German language aspect.
I would try this, but the downside is that GPT-4 Vision currently doesn't like to extract large text blocks. You could try extracting line bounding boxes with PyMuPDF and feeding it individual lines.
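Here is a rough sketch of what that per-line request could look like with the current OpenAI Python SDK. The model name, prompt wording, and filename are assumptions, and you need your own API key in the environment:

    # Send one cropped line image to a GPT-4-class vision model and print the transcription.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("line_0042.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this handwritten German (Kurrent/Sütterlin) line exactly. Do not translate."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)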
There are many, many Germans alive who learned Sütterlin and similar scripts in school and used them for decades. It's not exactly Linear B.
Even I (below 45) read some short Sütterlin texts in school (mostly in German or history books). Not fluently, but you can get used to it quite quickly and decipher things slowly.
100-year-old handwriting is probably not that obscure - it's not cuneiform or hieroglyphics. There are probably lots of people, especially older people, who could transcribe it. Finding them would be the issue.
Low-hanging fruit when reading these old German scripts is getting used to distinguishing the different forms of the letter s. That alone will get you far. The same goes for OCR: it needs to be able to handle that, otherwise the result will read as if someone without front teeth had written down how they speak.
[1]: https://tesseract-ocr.github.io/tessdoc/#training-for-tesser...
[2]: https://brandonmpetty.github.io/Doxa/WebAssembly/
[3]: https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...
[4]: https://de.wikipedia.org/wiki/S%C3%BCtterlinschrift
[5]: https://docs.opencv.org/3.4/d4/dc6/tutorial_py_template_matc...