Hacker News

Rubbish. I built a pipeline for document classification that successfully processed ~70TB of mostly unstructured, unorganized data, by myself, in a couple of weeks, with no data engineering background whatsoever. That was quite literally impossible a couple of years ago. The amount of work it saved is massive, and it's going to save us a shit ton of money in storage costs. Decades' worth of invoices and random PDFs are now siloed properly so we can organize and sort them.


Could you describe your stack and how it's much more effective than two years ago? I'd heard of printed-table OCR and document classification years back.


But an LLM is obviously able to organize documents and data far more intelligently than any past ML algorithm.


Tagging with in-house metadata like division, job code, who the project manager was, etc.


Very interesting. If I may ask: how are you handling the correctness issue? What's the workflow there, if you're even able to spot a mishap?


We came up with different categories of tags. I should clarify: the AI didn't actually do the sorting, it did the tagging, which made sorting tractable. After the tagging it's just a matter of grouping, either by algorithm or by a human.
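The tag-then-group workflow described above can be sketched in a few lines. This is a minimal illustration, not the poster's actual pipeline: the tag schema (division, job_code, project_manager) and the sample documents are hypothetical, standing in for whatever an LLM tagging pass would emit per file.

```python
from collections import defaultdict

# Hypothetical per-document tags, as an LLM tagging pass might emit them.
tagged_docs = [
    {"file": "inv_001.pdf", "division": "civil", "job_code": "J-104", "project_manager": "Ortiz"},
    {"file": "inv_002.pdf", "division": "civil", "job_code": "J-104", "project_manager": "Ortiz"},
    {"file": "memo_9.pdf", "division": "mech", "job_code": "J-220", "project_manager": "Lee"},
]

def group_by_tags(docs, keys):
    """Group tagged documents by the given tag keys (the algorithmic half)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[tuple(doc[k] for k in keys)].append(doc["file"])
    return dict(groups)

groups = group_by_tags(tagged_docs, ["division", "job_code"])
# → {('civil', 'J-104'): ['inv_001.pdf', 'inv_002.pdf'],
#    ('mech', 'J-220'): ['memo_9.pdf']}
```

Once every document carries tags, the grouping itself is cheap; a human only needs to review the groups, not the individual files.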


Organising data, even if it's not 100% perfect, is much better than leaving it completely unorganized.


That's fantastic. Congratulations!


I mean, considering I did document classification back in 2010 using Tesseract, I wouldn't say it was impossible.


But obviously it would be far from the accuracy an LLM can achieve, e.g. generating search keywords, tags, and other kinds of metadata for a given document.
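A tagging pass like the one described might look like the sketch below. The prompt text, the JSON schema, and the model reply are all assumptions for illustration; the actual LLM call is deliberately left out (use whatever chat-completion client you have), and only the reply-parsing half is shown.

```python
import json

# Hypothetical prompt for an LLM tagging pass over one document's text.
PROMPT = """Extract metadata from the document below.
Return only JSON with keys: division, job_code, project_manager, keywords.

Document:
{text}
"""

def parse_tags(llm_output: str) -> dict:
    """Parse the model's JSON reply, tolerating markdown code fences."""
    cleaned = llm_output.strip()
    cleaned = cleaned.removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)

# Example reply a model might return for an invoice:
reply = '{"division": "civil", "job_code": "J-104", "project_manager": "Ortiz", "keywords": ["invoice", "concrete"]}'
tags = parse_tags(reply)
```

The out-of-the-box part is the point: no labeled training set is needed, just a prompt and a parser, which is what makes this tractable for a single person.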


Yup, that's exactly it. By tagging things with all sorts of in-house metadata, we were then able to search and group things extremely accurately. There was still a lot of human in the mix, but this took the whole task from "idk if we can even consider doing this" to "great, we can break this down and chip away at it over the next few months / throw some interns at it".


Yeah, I don't know - hearing arguments that ML algorithms already did this sounds to me like "moving from place A to B existed before cars". But it seems to be a common sentiment. So much of what earlier ML attempted required massive amounts of training, with training data specific to your domain, before you could use it; an LLM can do it out of the box, and actually consider nuance.

I think organizing and structuring previously unorganized data is a massive use case that seems heavily underrated right now. People spend a lot of time figuring out where to find data, internally in companies, etc.



