Storing Scraped Data in an SQLite Database on GitHub (jerrynsh.com)
42 points by ngshiheng 3 months ago | 8 comments



It's fun to test the boundaries of GitHub's services, but if you're doing something useful I'd just rent a VPS; they can be had for $5 a month. You could still upload the SQLite file to GitHub via a check-in.
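A check-in from a scheduled workflow can be a single step along these lines (a rough sketch; the file name data.db and the commit message are placeholders):

  - name: Commit updated database
    run: |
      git config user.name "github-actions[bot]"
      git config user.email "github-actions[bot]@users.noreply.github.com"
      git add data.db
      # only commit when the file actually changed
      git diff --cached --quiet || git commit -m "Update scraped data"
      git push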


Presumably you can bypass the artifact retention limit by uploading them as release artifacts (which are retained forever) rather than job artifacts.

(Not that I’d advocate for this in general, since ultimately you’re duplicating a bunch of data and will eventually catch the eye of some GitHub compliance script.)
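For illustration, attaching the file to a release can be done with the gh CLI in a workflow step, assuming a release tagged "nightly" already exists (the tag and file name are placeholders):

  - name: Upload DB as a release asset
    env:
      GH_TOKEN: ${{ github.token }}
    run: gh release upload nightly data.db --clobber

--clobber overwrites an existing asset with the same name, so the release keeps only the latest copy.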


That's exactly what I did for scraping the USCIS processing time daily: https://github.com/jzebedee/uscis


Out of curiosity: is there a specific reason to use robinraju/release-downloader@v1 over actions/download-artifact@v4 at the 'Download previous DB' step in build_db.yml?


It's used to download the previous day's database from the release for computing the diff. actions/download-artifact would only work for artifacts created during the current run, i.e., today's database.
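For reference, a 'Download previous DB' step of that kind, pulling the asset from the latest release, could look roughly like this (input names follow robinraju/release-downloader's documented options; the file name and output path are assumptions):

  - name: Download previous DB
    uses: robinraju/release-downloader@v1
    with:
      latest: true
      fileName: "data.db"
      out-file-path: "previous"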


Release != artifact


Interesting! Perhaps cleaning up the older data might help a bit here.

> since ultimately you’re duplicating a bunch of data and will eventually catch the eye of some GitHub compliance script

I suppose this could also be a concern with git scraping, as we are basically duplicating data through git commits (not trying to imply that one is better or worse). That said, I'm not sure GitHub would be fine with any of these if more people were to do the same at a larger scale.


What would be interesting is if you could find a way to scrape only the deltas and then somehow reconcile them into the full scrape.
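This wouldn't solve the "scrape only the deltas" half, but for reconciling stored deltas back into a full copy, SQLite's sqldiff utility is one option, assuming it's available on the runner (paths and file names here are hypothetical):

  - name: Store delta and rebuild full DB
    run: |
      # sqldiff emits the SQL needed to turn yesterday's DB into today's
      sqldiff previous/data.db data.db > deltas/$(date +%F).sql
      # reconcile: replay every delta on top of the first full snapshot
      cp base/data.db reconstructed.db
      cat deltas/*.sql | sqlite3 reconstructed.db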




