Hacker News

Rubbish. I built a pipeline for document classification that successfully processed ~70TB of mostly unstructured, unorganized data, by myself, in a couple of weeks, with no data engineering background whatsoever. That was quite literally impossible a couple of years ago. The amount of work it saved is massive, and it's going to save us a shit ton of money in storage costs. Decades' worth of invoices and random PDFs are now siloed properly so we can organize and sort them.


Could you describe your stack and how it's much more effective than two years ago? I'd heard of printed-table OCR and document classification years back.


But an LLM is obviously able to organize documents and data far more intelligently than any past ML algorithm.


Tagging with in-house metadata like division, job code, who the project manager was, etc.


Very interesting. If I may ask: how are you handling the correctness issue? What's the workflow there, if you're even able to spot a mishap?


We came up with different categories of tags. I should clarify: the AI didn't actually do the sorting, it did the tagging, which made sorting tractable. After the tagging it's just a matter of grouping, either by algorithm or by a human.
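The tag-then-group workflow described above can be sketched in a few lines. This is a minimal illustration, not the poster's actual pipeline: the tag schema (division, job_code, project_manager) and the sample documents are hypothetical, standing in for whatever an LLM tagging pass would emit per file.

```python
from collections import defaultdict

# Hypothetical per-document tags, as an LLM tagging pass might emit them.
tagged_docs = [
    {"file": "inv_001.pdf", "division": "civil", "job_code": "J-104", "project_manager": "Ortiz"},
    {"file": "inv_002.pdf", "division": "civil", "job_code": "J-104", "project_manager": "Ortiz"},
    {"file": "memo_9.pdf", "division": "mech", "job_code": "J-220", "project_manager": "Lee"},
]

def group_by_tags(docs, keys):
    """Group tagged documents by the given tag keys (the algorithmic half)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[tuple(doc[k] for k in keys)].append(doc["file"])
    return dict(groups)

groups = group_by_tags(tagged_docs, ["division", "job_code"])
# → {('civil', 'J-104'): ['inv_001.pdf', 'inv_002.pdf'],
#    ('mech', 'J-220'): ['memo_9.pdf']}
```

Once every document carries tags, the grouping itself is cheap; a human only needs to review the groups, not the individual files.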


Organising data, even if it's not 100% perfect, is much better than leaving it completely unorganized.


That's fantastic. Congratulations!


I mean, considering I did document classification back in 2010 using Tesseract, I wouldn't say it was impossible.


But obviously it would be far from the accuracy an LLM can achieve, e.g. generating search keywords, tags, and other kinds of metadata for a given document.
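A tagging pass like the one described might look like the sketch below. The prompt text, the JSON schema, and the model reply are all assumptions for illustration; the actual LLM call is deliberately left out (use whatever chat-completion client you have), and only the reply-parsing half is shown.

```python
import json

# Hypothetical prompt for an LLM tagging pass over one document's text.
PROMPT = """Extract metadata from the document below.
Return only JSON with keys: division, job_code, project_manager, keywords.

Document:
{text}
"""

def parse_tags(llm_output: str) -> dict:
    """Parse the model's JSON reply, tolerating markdown code fences."""
    cleaned = llm_output.strip()
    cleaned = cleaned.removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)

# Example reply a model might return for an invoice:
reply = '{"division": "civil", "job_code": "J-104", "project_manager": "Ortiz", "keywords": ["invoice", "concrete"]}'
tags = parse_tags(reply)
```

The out-of-the-box part is the point: no labeled training set is needed, just a prompt and a parser, which is what makes this tractable for a single person.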


Yup, that's exactly it. By tagging things with all sorts of in-house metadata, we were then able to search and group things extremely accurately. There was still a lot of human in the mix, but this took the whole task from "idk if we can even consider doing this" to "great, we can break this down and chip away at it over the next few months / throw some interns at it".


Yeah, I don't know - hearing arguments that ML algorithms already did this sounds to me like "moving from place A to B existed before cars". But it seems to be a common sentiment. So much of what earlier ML attempted required massive amounts of training, with training data specific to your domain, before you could use it; an LLM can do it out of the box, and actually consider nuance.

I think organizing and structuring previously unorganized data is a massive use case that seems heavily underrated right now. People spend a lot of time figuring out where to find data, internally in companies, etc.



