Ask HN: OCR for 100 year old (German) handwritten cursive script?
39 points by jbverschoor on Jan 15, 2024 | 42 comments
I'm looking for an OCR solution for about 200 pages of text. It's handwritten German script from about 100 years ago, and I can barely read the handwriting myself. Google Translate sometimes manages to OCR certain parts, but nothing useful (I don't need the translation part of GT). Which solutions out there would be able to recognize old handwritten script?



You can train Tesseract to recognize handwriting[1], but the first and most important step would be preprocessing your documents. I would recommend starting with a local adaptive thresholding algorithm[2] like Sauvola for binarization. The preprocessing steps would be[3]:

  1) Binarization
  2) Skew Correction
  3) Noise Removal
  4) Thinning and Skeletonization
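
For step 1, here is a minimal sketch of Sauvola binarization in plain Python, assuming the scan is already loaded as a list of rows of 0-255 grayscale values. The function name and the defaults (`window=15`, `k=0.2`, `r=128`) are illustrative choices, not taken from the linked tools. In practice you would use a library implementation, which will be far faster.

```python
import math


def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Returns a same-sized list of rows with 1 for ink (dark) and 0 for paper.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2

    # Integral images of the values and the squared values, padded with a
    # zero row and column, so any window sum costs four lookups.
    s1 = [[0] * (width + 1) for _ in range(height + 1)]
    s2 = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            v = gray[y][x]
            s1[y + 1][x + 1] = v + s1[y][x + 1] + s1[y + 1][x] - s1[y][x]
            s2[y + 1][x + 1] = v * v + s2[y][x + 1] + s2[y + 1][x] - s2[y][x]

    out = [[0] * width for _ in range(height)]
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            n = (y1 - y0) * (x1 - x0)  # window is clipped at the borders
            total = s1[y1][x1] - s1[y0][x1] - s1[y1][x0] + s1[y0][x0]
            total_sq = s2[y1][x1] - s2[y0][x1] - s2[y1][x0] + s2[y0][x0]
            mean = total / n
            std = math.sqrt(max(total_sq / n - mean * mean, 0.0))
            # Sauvola threshold: T = m * (1 + k * (s / R - 1))
            threshold = mean * (1 + k * (std / r - 1))
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out
```

Because the threshold follows the local mean and contrast, faint strokes on a darker part of the page can survive where a single global cutoff would lose them or turn the paper black.
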
Probably you are facing "Sütterlin"[4], which differs quite a bit from modern German handwriting.

In your case (only 200 pages) it might be easier to use template matching[5] to identify similar characters and just "transliterate" matches into modern printed letters (like an overlay over the original text). This way you would have a quick solution while still being accurate enough to just read it.
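
A rough sketch of the template matching idea, assuming binarized images stored as lists of rows (1 for ink, 0 for paper) and a hand-made dictionary of glyph templates. The names `mean_sq_diff` and `label_glyphs` and the tiny 3x3 glyphs are made up for illustration. OpenCV's `cv2.matchTemplate` with `TM_SQDIFF` computes the same kind of squared-difference score, much faster.

```python
def mean_sq_diff(image, template, top, left):
    """Mean squared difference between `template` and the image patch
    whose top-left corner is at (top, left). 0 means identical."""
    total = 0
    for ty, row in enumerate(template):
        for tx, value in enumerate(row):
            diff = image[top + ty][left + tx] - value
            total += diff * diff
    return total / (len(template) * len(template[0]))


def label_glyphs(image, templates, max_score=0.0):
    """Slide every template over the image and return (top, left, label)
    for each position scoring at or below `max_score`, left to right."""
    hits = []
    for label, template in templates.items():
        th, tw = len(template), len(template[0])
        for top in range(len(image) - th + 1):
            for left in range(len(image[0]) - tw + 1):
                if mean_sq_diff(image, template, top, left) <= max_score:
                    hits.append((top, left, label))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical 3x3 glyphs standing in for cropped letters from the scans.
TEMPLATES = {
    "n": [[1, 1, 1], [1, 0, 1], [1, 0, 1]],
    "l": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

LINE = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 1, 0],
]

print(label_glyphs(LINE, TEMPLATES))
```

Real handwriting never repeats pixel for pixel, so `max_score` would need to be loosened and tuned per writer, and connected cursive letters would have to be cut apart first.
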

[1]: https://tesseract-ocr.github.io/tessdoc/#training-for-tesser...

[2]: https://brandonmpetty.github.io/Doxa/WebAssembly/

[3]: https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...

[4]: https://de.wikipedia.org/wiki/S%C3%BCtterlinschrift

[5]: https://docs.opencv.org/3.4/d4/dc6/tutorial_py_template_matc...


This is great. I also found https://github.com/IgorMeloS/OCR/blob/main/7%20-%20template-... (part of https://github.com/IgorMeloS/OCR), which could be useful for this as well.


> Probably you are facing "Sütterlin"[4], which differs quite a bit from modern German handwriting.

My mother in law (and her mother) wrote everything in Sütterlin and I found it rapidly became pretty easy to read, though that also could be because I only encountered it from a small number of hands.

OTOH I find older Handschrift and Fraktur print (we have a bunch of old books in that) basically illegible.


The letters from [4] remind me of modern Russian cursive, which has some similarly interesting changes to some letters to make them faster to write. I wonder if there's any research on Russian cursive OCR that could help.


You probably want to put it in front of an actual person and get them to transcribe it for you. I don't think there's any off the shelf OCR that will work particularly well for it.

I have a close family member who is a historian and frequently read and transcribed mid 19th to early 20th century German handwriting for his work.

Many historians and archivists in Germany would have the ability to transcribe this for you if you reached out to them and paid for their time.


Look to the US, not Germany. Many immigrant communities stayed with the old German and script long after Germany had moved on. My dad was in high school when his church switched to English as the last German-only member died.

Don't assume this is German just because everyone calls it German. There are/were several 'low German' dialects. My grandpa never could understand natives when he toured Germany because all he knew was 100-year-old German. (Based on stories since he died, I suspect he would have had little problem understanding Dutch.)


Archivists in Germany will know how to read various forms of Kurrent.

And yeah, of course it might not be Hochdeutsch. But anyone who can read the script will know pretty quickly if it's Platt, Swiss German, or something else entirely.


This is probably the best option. I can't find it now, but in 2020 on either Slashdot.org or here, there was a project trying to transcribe hundreds or thousands of old British rainfall records.

The researchers made digital scans and posted the images online and had random users around the world transcribe them. They didn't care if a user did one or hundreds. I did about 20-50 before they were finished. What would have taken a paid team years was completed in only a week.

Does anyone know a link to the article that announced it?


OP might also find graduate students in history or literature departments who are willing to work for less than a professional.


I've had surprisingly good results with https://readcoop.eu/transkribus/ I was going back in time in some family research until I couldn't identify a single word anymore. The 'AI' could.


My colleagues are mostly using transkribus for handwriting. I work at a library.


That's exactly what https://transkribus.ai/ was built for - works quite well in my experience, mainly transcribing Deutsche Kurrentschrift, c. 1980.


I tried it on an incomprehensible German postcard from 1900, and it worked great! Not perfect, but darn good.


I threw some German medical handwriting images into ChatGPT a while back and asked it to transcribe it and it worked pretty well. ChatGPT knows a lot about language so that helped in filling in the gaps.



As others have mentioned, Transkribus works pretty well for handwritten text recognition. You can also train your own model if you have enough source material.

If the documents you have are able to be made public, you could upload them to Wikimedia Commons and use https://ocr.wmcloud.org/ — you can use Transkribus via that. (Disclosure: I'm an engineer working on the Wikimedia OCR project.)


You could try something like https://aws.amazon.com/textract/ or https://cloud.google.com/vision/docs/handwriting. Both have support for modern handwriting. I don't know if it will work with a script written a century ago though.


If it's https://en.wikipedia.org/wiki/Sütterlin I doubt anything trained on current script would make any more sense of it than we do.



I don't know if it will be significantly different from what Google Translate does, but I would give the OCR services of the major cloud vendors (Google, Amazon, Microsoft and, I guess, OpenAI/ChatGPT) a shot. It's pretty simple and cheap to do (like, about a dollar for the whole thing). Last time I compared them, Google's OCR came out ahead, but it's task-dependent so in your case it might be different.

General purpose open-source OCR solutions like Tesseract, TrOCR, etc will probably not be as good as the cloud ones, based on my experience.

There's some specialized research work out there for antique manuscripts, but that will require some digging on your part with an uncertain outcome. I think at that point, I would also look into manual transcription - for 200 pages, it might be reasonably affordable.


Fwiw, we've found the Azure AI OCR service to be pretty good, much better than anything we could get from Tesseract out of the box (no tuning).


OCR and Translation are two entirely different endeavors.


I worked in the same space as a company that does this with ML (and charges for it), using some form of Recurrent Neural Network IIRC. Maybe LSTMs?

They had a contract to index historical French archives composed of handwritten Latin documents in Elasticsearch.

Depending on the historical relevance of your documents (read: some academic funds), they may be able to help. Doesn't hurt to contact them:

https://teklia.com/


I've paid for manual transcription before. It's not that expensive. Technical solutions are cool, but that option is available today.


A social option is to look around in ancestry study circles on facebook for your country. I know we have one or two pretty good groups in Swedish with mostly old folks helping younger decipher old handwriting.


GPT-4 Vision. I have seen some examples where medieval-looking pages were tried.


I've been scripting GPT-4 Vision to extract structured recipe data from handwritten recipe cards, with very good success. Can't speak to the German language aspect.

(Edited to clarify more than just transcribing)


I would try this but the downside is GPT-4 vision currently doesn’t like to extract large text blocks. You could try extracting line bounding boxes with PyMuPDF and feeding it individual lines.
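
A sketch of the line-splitting step, with one swap worth naming: as far as I know, PyMuPDF's text extraction returns line boxes only when the PDF has a text layer, which scanned handwriting usually lacks. So this uses a plain horizontal projection profile on a binarized page instead. It assumes a list of rows with 1 for ink and 0 for paper, and the function names and the `min_ink` / `min_height` parameters are invented for the example.

```python
def find_text_lines(binary, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges, bottom exclusive, of text lines.

    A row counts as text when it has at least `min_ink` ink pixels.
    Runs shorter than `min_height` rows are dropped as specks.
    """
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a new run of inked rows begins
        elif not has_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))
            start = None
    # Close a run that reaches the bottom edge of the page.
    if start is not None and len(binary) - start >= min_height:
        lines.append((start, len(binary)))
    return lines


def crop_lines(binary, lines):
    """Cut the page into one sub-image per detected line."""
    return [binary[top:bottom] for top, bottom in lines]
```

This only works when the lines are roughly horizontal and separated by clear gaps. Slanted writing or long descenders that touch the next line would need deskewing first or a smarter segmenter.
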


I’ll look into that


Does it look like Sütterlin? Are you familiar with it?

https://en.wikipedia.org/wiki/S%C3%BCtterlin


Not familiar with it, but it doesn’t look like that. I wish haha



100 years ago Sütterlin would be pretty likely. If your sample is not Sütterlin I would consider the possibility that it is older.


Sütterlin was introduced in schools at the beginning of the 20th century. Kurrent was still widely used by adults well into the century.


For only 200 pages, I’d farm it out to humans.


"Humans" in this case means specialist historians, but yes.


There are many many Germans alive who learned Sütterlin and similar scripts in school and used them for decades. It's not exactly Linear B.

Even I (below 45) read some small Sütterlin texts in school (mostly German or history books). Not fluent, but you can quite quickly get used to it and decipher things slowly.


For 100 year old German handwriting the historian I'd consult would be my grandmother. :)


100-year-old handwriting is probably not that obscure - it's not cuneiform or hieroglyphics. There are probably lots of people, especially older people, who could transcribe it. Finding them would be the issue.


It’s not obscure, but it’s like a doctor’s handwriting. It takes me a lot of effort, if I can read it at all.


The low-hanging fruit when reading these old German scripts is getting used to distinguishing the different forms of the letter s. That alone will get you far. The same goes for OCR: it needs to be capable of that. Otherwise the result will read as if someone without front teeth had written down how they speak.
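
If the OCR does mix up the long s with f, a dictionary check after recognition can catch some of it. This is only a sketch under assumptions: `fix_long_s` is a made-up helper, the lexicon is a toy word set you would replace with a real German word list, and it only handles lowercase "f".

```python
from itertools import product


def fix_long_s(word, lexicon):
    """Return `word` with f -> s swaps chosen so the result is in `lexicon`.

    If the word is already known, or no combination of swaps yields a
    known word, the word is returned unchanged.
    """
    if word.lower() in lexicon:
        return word
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    # Try every combination of keeping or swapping each "f".
    for choice in product((False, True), repeat=len(positions)):
        chars = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            return candidate
    return word


# Toy lexicon for illustration only.
LEXICON = {"wasser", "ist", "auf", "haus"}

print([fix_long_s(w, LEXICON) for w in ["Waffer", "ift", "auf", "Schiff"]])
```

Words that are valid with either letter stay ambiguous, so this complements a recognizer that tells the letter forms apart and does not replace one.
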


I ran a sample through my Apple Newton Messagepad: Iss Martha auf.



