c_moscardi's comments

We chatted a few months back -- congrats on the launch! Looks like a great UX.

Ah yeah I remember! Great to hear from you and thanks :)

Amazing! Using GitHub Actions to compute a giant skim matrix is an incredible hack.

I pretty regularly work with social science researchers who need something like this... I'll keep it in mind. For a bit we actually thought about setting something like this up within the Census Bureau. I have some stories about routing engines from my time there...


Totally! I've been using this pattern a lot and recently wrote about it:

https://davidgasquez.com/community-level-open-data-infrastru...


Thank you for this excellent post! I've been developing [my own platform](https://github.com/MattTriano/analytics_data_where_house) that curates a data warehouse, mostly of Census and Socrata datasets, but I haven't had a good way to share the products with anyone since it's a bit too heavyweight. I've been looking for alternate solutions to that problem (I'm currently building out a much smaller [platform](https://github.com/MattTriano/fbi_cde_data) to process the FBI's NIBRS datasets), and your post has given me a few great implementations to study and experiment with.

Thanks!


Worth noting that an ecosystem of these sorts of contracting firms already exists (Nava, Skylight, Truss I believe, and I'm forgetting more) that pitch themselves as the antidote to beltway banditry.

People from the federal civic tech nexus started them over the past decade as they termed out of 18F, USDS, and PIF.


Came to post this — it’s the same underlying technology, just a lot more compute now.


Hi HN! I've spent a couple of months fiddling with OCR and wanted to share some of my findings.

The approach I share here (fine-tuning recent deep learning models) is the first one that's gotten me anything resembling high-quality OCR on these particular noisy historical documents. OCRing them has been something of a white whale for me for several years (though a white whale I've spent comparatively little time chasing).
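To give a flavor of what "fine-tuning recent deep learning models" looks like in practice, here's the usual starting point: load a pretrained recognition model and get baseline predictions before any training. A minimal sketch using Hugging Face's TrOCR as a stand-in; the checkpoint and filename are illustrative assumptions, not necessarily what the post uses:

    # Minimal sketch: baseline predictions from a pretrained TrOCR checkpoint.
    # Assumes `pip install transformers torch pillow`; "line.png" is a
    # hypothetical crop of a single text line from a noisy scan.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

    image = Image.open("line.png").convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Fine-tuning then amounts to training this encoder-decoder on (line image, transcription) pairs drawn from your own documents.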

At this point I think I'm reasonably competent at OCR, but no expert... Curious to hear your thoughts.


Related reading: these explain the same concept quite well IMO, using NYC subway data. It's where I first learned about it. I've put a quick simulation of the punchline below the links.

[1] https://erikbern.com/2016/04/04/nyc-subway-math

[2] https://erikbern.com/2016/07/09/waiting-time-math.html
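The punchline of the waiting-time math is the inspection paradox: if headways are exponentially distributed, a rider arriving at a random time waits the full mean headway on average, not half of it. A quick simulation (my own illustration, not taken from the posts):

    # Trains arrive with exponential headways (mean 10 min); riders show up
    # at uniformly random times. The average wait comes out ~10 min, not ~5.
    import numpy as np

    rng = np.random.default_rng(0)
    mean_headway = 10.0
    arrivals = np.cumsum(rng.exponential(mean_headway, size=1_000_000))

    riders = rng.uniform(0, arrivals[-2], size=100_000)
    waits = arrivals[np.searchsorted(arrivals, riders)] - riders

    print(f"mean headway:    {mean_headway:.1f}")
    print(f"mean rider wait: {waits.mean():.1f}")  # ~10.0

The intuition: a rider arriving at a random moment is more likely to land inside a long gap than a short one.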


Yeah, I think MS's is the best out there, but I agree the usability leaves something to be desired. Two thoughts:

1. I believe the IR jargon for getting a JSON of this form is Key Information Extraction (KIE). MS has an out-of-the-box model for this. I just tried the screenshot and it did a pretty good (but not perfect) job: it didn't get every form field, but it got most. MS sort of has a flow for fine-tuning, but it leaves a lot to be desired IMO. Curious whether this would be good enough to satisfy the use case. (I've sketched the API call below the footnote.)

2. In terms of just OCR (i.e. getting the text/numeric strings correct), MS is known to be the best on typed text at the moment [1]. Handwriting is a different beast... but it looks like MS is doing a very good job there (and the SOTA on handwriting is very good). In particular, it got all the numbers in that screenshot correct.

If you want to see the results from MS on the screenshot in this blog post, here's the entire JSON blob. A bit of a behemoth, but the key/value stuff is in there: https://gist.github.com/cmoscardi/8c376094181451a49f0c62406e...

[1] https://mindee.github.io/doctr/latest/using_doctr/using_mode...
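In case it's useful, the call that produces a blob like that is roughly this shape. A sketch against the azure-ai-formrecognizer Python SDK, where the endpoint, key, and filename are placeholders and the prebuilt model ID may vary by API version:

    # Sketch: key-value (KIE-style) extraction with an Azure prebuilt model.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    with open("form.png", "rb") as f:  # placeholder filename
        poller = client.begin_analyze_document("prebuilt-document", document=f)
    result = poller.result()

    # The key/value pairs buried in that JSON blob, as Python objects:
    for kv in result.key_value_pairs:
        key = kv.key.content if kv.key else ""
        value = kv.value.content if kv.value else ""
        print(f"{key}: {value}")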


That does look pretty great, thanks for the tip.

Sending images through that API and then using an LLM to extract data from the OCR'd text could be worth exploring.
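Something like this, as a sketch; the model name and prompt are assumptions, and ocr_text stands in for the text field pulled out of the OCR response:

    # Sketch: hand OCR output to an LLM and ask for structured fields.
    # Assumes the OpenAI Python SDK; the model name is illustrative.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    ocr_text = "..."   # placeholder: raw text from the OCR service's response

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the form fields from this OCR text. "
                                          "Reply with a JSON object of field: value."},
            {"role": "user", "content": ocr_text},
        ],
    )
    print(response.choices[0].message.content)

The nice part of this split is that the OCR engine does the character-level work it's good at, and the LLM only has to map messy text onto a clean schema.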


Figures, too! Yeah, you could write some logic on top of a library like this and tune it to trade off recall (grab more surrounding context) against precision (only the direct context around the word, e.g. the containing paragraph or the 5 surrounding table rows) for your specific application's needs.

Using the models underlying a library like this, there's maybe room for fine-tuning as well, if you have a set of documents with specific semantic boundaries that current approaches don't capture (and you're willing to spend an hour drawing bounding boxes to make that happen).
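To make that recall/precision knob concrete, a toy sketch; the block structure and the window parameter are hypothetical, not from any particular library:

    # Toy sketch: given OCR'd blocks (paragraphs / table rows) and a query,
    # widen or narrow the surrounding context with a single `window` knob.
    def surrounding_context(blocks: list[str], query: str, window: int = 2) -> list[str]:
        # window=0 is the high-precision end (just the matching block);
        # larger windows trade precision for recall.
        hits = [i for i, b in enumerate(blocks) if query.lower() in b.lower()]
        keep: set[int] = set()
        for i in hits:
            keep.update(range(max(0, i - window), min(len(blocks), i + window + 1)))
        return [blocks[i] for i in sorted(keep)]

    # e.g. surrounding_context(rows, "total revenue", window=5) grabs ~5 table
    # rows on either side of each match.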


Funnily enough, this is another great tactic for getting emails returned (looping in someone with more leverage than you or asking them to follow up for you)!


We should talk! I work on automatically coding products for a shipping survey at the Census Bureau. It's one of the earliest production uses of ML here at Census :)

5-minute deck: https://github.com/codingitforward/cdfdemoday2018/blob/maste...

Feel free to shoot me a message.

