c_moscardi's comments

We chatted a few months back -- congrats on the launch! Looks like a great UX.

Ah yeah I remember! Great to hear from you and thanks :)

Amazing! Using GitHub Actions to compute a giant skim matrix is an incredible hack.

I pretty regularly work with social science researchers who need something like this... I'll keep it in mind. For a bit we actually thought about setting something like this up within the Census Bureau. I have some stories about routing engines from my time there...


Totally! I've been using this pattern a lot and recently wrote about it:

https://davidgasquez.com/community-level-open-data-infrastru...


Thank you for this excellent post! I've been developing [my own platform](https://github.com/MattTriano/analytics_data_where_house) that curates a data warehouse, mostly of Census and Socrata datasets, but I haven't had a good way to share the products with anyone since it's a bit too heavyweight. I've been looking for alternate solutions to that problem (I'm currently building out a much smaller [platform](https://github.com/MattTriano/fbi_cde_data) to process the FBI's NIBRS datasets), and your post has given me a few great implementations to study and experiment with.

Thanks!


Worth noting that an ecosystem of these sorts of contracting firms already exists (Nava, Skylight, Truss I believe, and I'm forgetting more) that pitch themselves as the antidote to beltway banditry.

People from the federal civic tech nexus started them over the past decade as they termed out of 18F, USDS, and PIF.


Came to post this — it’s the same underlying technology, just a lot more compute now.


Hi HN! I've spent a couple of months fiddling with OCR and wanted to share some of my findings.

The approach I share here (fine-tuning recent deep learning models) is the first one that's gotten me anything resembling high-quality OCR on these particular noisy historical documents. OCRing them has been something of a white whale for me for several years (though a white whale I've spent comparatively little time chasing).
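To give a flavor of what "fine-tuning recent deep learning models" looks like in practice, here's the usual starting point: load a pretrained recognition model and get baseline predictions before any training. A minimal sketch using Hugging Face's TrOCR as a stand-in; the checkpoint and filename are illustrative assumptions, not necessarily what the post uses:

    # Minimal sketch: baseline predictions from a pretrained TrOCR checkpoint.
    # Assumes `pip install transformers torch pillow`; "line.png" is a
    # hypothetical crop of a single text line from a noisy scan.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

    image = Image.open("line.png").convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Fine-tuning then amounts to training this encoder-decoder on (line image, transcription) pairs drawn from your own documents.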

At this point I think I'm reasonably competent at OCR, but no expert... Curious to hear your thoughts.


Related reading: these explain the same concept quite well IMO, using NYC subway data. It's where I first learned about it. I've put a quick simulation of the punchline below the links.

[1] https://erikbern.com/2016/04/04/nyc-subway-math

[2] https://erikbern.com/2016/07/09/waiting-time-math.html
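The punchline of the waiting-time math is the inspection paradox: if headways are exponentially distributed, a rider arriving at a random time waits the full mean headway on average, not half of it. A quick simulation (my own illustration, not taken from the posts):

    # Trains arrive with exponential headways (mean 10 min); riders show up
    # at uniformly random times. The average wait comes out ~10 min, not ~5.
    import numpy as np

    rng = np.random.default_rng(0)
    mean_headway = 10.0
    arrivals = np.cumsum(rng.exponential(mean_headway, size=1_000_000))

    riders = rng.uniform(0, arrivals[-2], size=100_000)
    waits = arrivals[np.searchsorted(arrivals, riders)] - riders

    print(f"mean headway:    {mean_headway:.1f}")
    print(f"mean rider wait: {waits.mean():.1f}")  # ~10.0

The intuition: a rider arriving at a random moment is more likely to land inside a long gap than a short one.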


Yeah, I think MS's is the best out there, but I agree the usability leaves something to be desired. Two thoughts:

1. I believe the IR jargon for getting a JSON of this form is Key Information Extraction (KIE). MS has an out-of-the-box model for this. I just tried the screenshot and it did a pretty good (but not perfect) job: it didn't get every form field, but it got most. MS sort of has a flow for fine-tuning, but it leaves a lot to be desired IMO. Curious whether this would be good enough to satisfy the use case. (I've sketched the API call below the footnote.)

2. In terms of just OCR (i.e. getting the text/numeric strings correct), MS is known to be the best on typed text at the moment [1]. Handwriting is a different beast... but it looks like MS is doing a very good job there (and the SOTA on handwriting is very good). In particular, it got all the numbers in that screenshot correct.

If you want to see the results from MS on the screenshot in this blog post, here's the entire JSON blob. A bit of a behemoth, but the key/value stuff is in there: https://gist.github.com/cmoscardi/8c376094181451a49f0c62406e...

[1] https://mindee.github.io/doctr/latest/using_doctr/using_mode...
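In case it's useful, the call that produces a blob like that is roughly this shape. A sketch against the azure-ai-formrecognizer Python SDK, where the endpoint, key, and filename are placeholders and the prebuilt model ID may vary by API version:

    # Sketch: key-value (KIE-style) extraction with an Azure prebuilt model.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    with open("form.png", "rb") as f:  # placeholder filename
        poller = client.begin_analyze_document("prebuilt-document", document=f)
    result = poller.result()

    # The key/value pairs buried in that JSON blob, as Python objects:
    for kv in result.key_value_pairs:
        key = kv.key.content if kv.key else ""
        value = kv.value.content if kv.value else ""
        print(f"{key}: {value}")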


That does look pretty great, thanks for the tip.

Sending images through that API and then using an LLM to extract data from the OCR'd text could be worth exploring.
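Something like this, as a sketch; the model name and prompt are assumptions, and ocr_text stands in for the text field pulled out of the OCR response:

    # Sketch: hand OCR output to an LLM and ask for structured fields.
    # Assumes the OpenAI Python SDK; the model name is illustrative.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    ocr_text = "..."   # placeholder: raw text from the OCR service's response

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the form fields from this OCR text. "
                                          "Reply with a JSON object of field: value."},
            {"role": "user", "content": ocr_text},
        ],
    )
    print(response.choices[0].message.content)

The nice part of this split is that the OCR engine does the character-level work it's good at, and the LLM only has to map messy text onto a clean schema.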


Figures, too! Yeah, you could write some logic on top of a library like this and tune it to trade off recall (grab more surrounding context) against precision (only the direct context around the word, e.g. the containing paragraph or the 5 surrounding table rows) for your specific application's needs.

Using the models underlying a library like this, there's maybe room for fine-tuning as well, if you have a set of documents with specific semantic boundaries that current approaches don't capture (and you're willing to spend an hour drawing bounding boxes to make that happen).
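To make that recall/precision knob concrete, a toy sketch; the block structure and the window parameter are hypothetical, not from any particular library:

    # Toy sketch: given OCR'd blocks (paragraphs / table rows) and a query,
    # widen or narrow the surrounding context with a single `window` knob.
    def surrounding_context(blocks: list[str], query: str, window: int = 2) -> list[str]:
        # window=0 is the high-precision end (just the matching block);
        # larger windows trade precision for recall.
        hits = [i for i, b in enumerate(blocks) if query.lower() in b.lower()]
        keep: set[int] = set()
        for i in hits:
            keep.update(range(max(0, i - window), min(len(blocks), i + window + 1)))
        return [blocks[i] for i in sorted(keep)]

    # e.g. surrounding_context(rows, "total revenue", window=5) grabs ~5 table
    # rows on either side of each match.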


Funnily enough, this is another great tactic for getting emails returned (looping in someone with more leverage than you or asking them to follow up for you)!


We should talk! I work on automatically coding products for a shipping survey at the Census Bureau. It's one of the earliest production uses of ML here at Census :)

5-minute deck: https://github.com/codingitforward/cdfdemoday2018/blob/maste...

Feel free to shoot me a message.

