I'm looking for an OCR solution for about 200 pages of text.
It's handwritten German script, about 100 years old, and I can barely read the handwriting myself.
Google Translate sometimes manages to OCR certain parts, but produces nothing useful (I don't need the translation part of GT).
Which solutions out there would be able to recognize old handwritten script?
You can train Tesseract to recognize handwriting[1], but the first and most important step is preprocessing your documents. I would recommend starting with a local adaptive thresholding algorithm[2] such as Sauvola for binarization. The typical preprocessing steps are described in [3].
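For the binarization step specifically, here is a minimal sketch using scikit-image's threshold_sauvola (the filename and the window_size/k values are placeholders you would tune per scan; the Doxa library from [2] is another way to do the same thing):

    # Sauvola binarization sketch (scikit-image); tune window_size and k per scan.
    from skimage import io
    from skimage.filters import threshold_sauvola
    from skimage.util import img_as_ubyte

    page = io.imread("page_001.png", as_gray=True)     # hypothetical scan filename
    thresh = threshold_sauvola(page, window_size=25, k=0.2)
    binary = page > thresh                             # True = background, False = ink
    io.imsave("page_001_bin.png", img_as_ubyte(binary))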
You are probably facing "Sütterlin"[4], which differs quite a bit from modern German handwriting.
In your case (only 200 pages), it might be easier to use template matching[5] to identify similar characters and just "transliterate" the matches into modern printed letters (like an overlay over the original text). That way you would have a quick solution that is still accurate enough to just read along.
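To make the template-matching idea concrete, here is a rough sketch with OpenCV's matchTemplate. The filenames, the 0.7 score threshold, and the idea of stamping the modern letter next to each match are my assumptions, not a finished tool; you would also want some non-maximum suppression for overlapping hits:

    # Find occurrences of one hand-cropped letter on a binarized page and overlay
    # its modern equivalent next to each match.
    import cv2
    import numpy as np

    page = cv2.imread("page_001_bin.png", cv2.IMREAD_GRAYSCALE)
    template = cv2.imread("letter_s.png", cv2.IMREAD_GRAYSCALE)  # cropped Sütterlin "s"
    h, w = template.shape

    scores = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
    for y, x in zip(*np.where(scores >= 0.7)):   # keep reasonably confident matches
        x, y = int(x), int(y)
        cv2.rectangle(page, (x, y), (x + w, y + h), 0, 1)
        cv2.putText(page, "s", (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, 0, 1)

    cv2.imwrite("page_001_overlay.png", page)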
> You are probably facing "Sütterlin"[4], which differs quite a bit from modern German handwriting.
My mother-in-law (and her mother) wrote everything in Sütterlin, and I found it rapidly became pretty easy to read, though that could also be because I only encountered it from a small number of hands.
OTOH I find older handwriting and Fraktur print (we have a bunch of old books in that) basically illegible.
The letters from [4] remind me of modern Russian cursive, which has some similarly interesting changes to some letters to make them faster to write. I wonder if there's any research on Russian cursive OCR that could help.
You probably want to put it in front of an actual person and get them to transcribe it for you. I don't think there's any off-the-shelf OCR that will work particularly well for it.
I have a close family member who is a historian and has frequently read and transcribed mid-19th- to early-20th-century German handwriting for his work.
Many historians and archivists in Germany would have the ability to transcribe this for you if you reached out to them and paid for their time.
Look to the US, not Germany. Many immigrant communities stayed with the old German language and script long after Germany had moved on. My dad was in high school when his church switched to English, after the last German-only member died.
Don't assume this is German just because everyone calls it German. There are/were several "Low German" dialects. My grandpa could never understand native speakers when he toured Germany, because all he knew was 100-year-old German. (He has since died, but based on his stories I suspect he would have had little problem understanding Dutch.)
Archivists in Germany will know how to read various forms of Kurrent.
And yeah, of course it might not be Hochdeutsch. But anyone who can read the script will know pretty quickly if it's Platt, Swiss German, or something else entirely.
This is probably the best option. I can't find it now, but in 2020 on either Slashdot.org or here, there was a project trying to transcribe hundreds or thousands of old British rainfall records.
The researchers made digital scans and posted the images online and had random users around the world transcribe them. They didn't care if a user did one or hundreds. I did about 20-50 before they were finished. What would have taken a paid team years was completed in only a week.
Does anyone know a link to the article that announced it?
I've had surprisingly good results with https://readcoop.eu/transkribus/. I was going back in time with some family research until I couldn't identify a single word anymore. The "AI" could.
I threw some German medical handwriting images into ChatGPT a while back, asked it to transcribe them, and it worked pretty well. ChatGPT knows a lot about language, so that helped in filling in the gaps.
As others have mentioned, Transkribus works pretty well for handwritten text recognition. You can also train your own model if you have enough source material.
If the documents you have are able to be made public, you could upload them to Wikimedia Commons and use https://ocr.wmcloud.org/ — you can use Transkribus via that. (Disclosure: I'm an engineer working on the Wikimedia OCR project.)
I don't know if it will be significantly different from what Google Translate does, but I would give the major cloud vendors' OCR services (Google, Amazon, Microsoft, and I guess OpenAI/ChatGPT) a shot. It's pretty simple and cheap to do (about a dollar for the whole thing). Last time I compared them, Google's OCR came out ahead, but it's task-dependent, so in your case it might be different.
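If you want to try one programmatically, here is a hedged example against Google Cloud Vision's document text detection. It assumes you already have a GCP project and credentials configured; the filename and the German language hint are placeholders:

    # Send one scanned page to Google Cloud Vision and print the recognized text.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open("page_001.png", "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(
        image=image,
        image_context={"language_hints": ["de"]},  # hint that the text is German
    )
    print(response.full_text_annotation.text)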
General purpose open-source OCR solutions like Tesseract, TrOCR, etc will probably not be as good as the cloud ones, based on my experience.
There's some specialized research work out there for antique manuscripts, but that will require some digging on your part with an uncertain outcome. I think at that point, I would also look into manual transcription - for 200 pages, it might be reasonably affordable.
A social option is to look around in ancestry study circles on Facebook for your country. I know we have one or two pretty good groups in Swedish, with mostly older folks helping younger people decipher old handwriting.
I've been scripting GPT-4 Vision to extract structured recipe data from handwritten recipe cards, with very good success. Can't speak to the German language aspect.
I would try this, but the downside is that GPT-4 Vision currently doesn't like to extract large text blocks. You could try extracting line bounding boxes with PyMuPDF and feeding it individual lines.
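Here is a rough sketch of what that per-line request could look like with the current OpenAI Python SDK. The model name, prompt wording, and filename are assumptions, and you need your own API key in the environment:

    # Send one cropped line image to a GPT-4-class vision model and print the transcription.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("line_0042.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this handwritten German (Kurrent/Sütterlin) line exactly. Do not translate."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)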
There are many, many Germans alive who learned Sütterlin and similar scripts in school and used them for decades. It's not exactly Linear B.
Even I (below 45) read some short Sütterlin texts in school (mostly in German or history books). Not fluently, but you can get used to it quite quickly and decipher things slowly.
100-year-old handwriting is probably not that obscure - it's not cuneiform or hieroglyphics. There are probably lots of people, especially older people, who could transcribe it. Finding them would be the issue.
Low-hanging fruit when reading these old German scripts is getting used to distinguishing the different forms of the letter s. That alone will get you far. The same goes for OCR: it needs to be able to handle that, otherwise the result will read as if someone without front teeth had written down how they speak.
[1]: https://tesseract-ocr.github.io/tessdoc/#training-for-tesser...
[2]: https://brandonmpetty.github.io/Doxa/WebAssembly/
[3]: https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...
[4]: https://de.wikipedia.org/wiki/S%C3%BCtterlinschrift
[5]: https://docs.opencv.org/3.4/d4/dc6/tutorial_py_template_matc...