Not sure if it's still the case, but I remember watching a talk (around 2021) where they mentioned they were using different providers, among them AWS for some regions.
About 5 years ago I made an XBRL viewer. In a ton of reports I tested, the data did not match the paper reports. After ruling out bugs in my own software I went as far as emailing a few companies' investor relations departments and never got a response.
I hope things have improved in the years since, but I doubt it. Nobody looks at the XBRL; it's just a regulatory box to tick.
It isn't a choice for any registrant filing with the SEC. Even the smallest filers will have to file most reports using XBRL after June.
The application of XBRL doesn't really affect the design of the reports. It ebbs and flows; it feels like more heavily designed filings are coming back as better tools allow more designed reports.
There is an entire industry that does financial printing: 10-Ks, quarterly reports, prospectuses, etc. They handle the filing with SEC/EDGAR. If there is a discrepancy, it arises there. Anyone remember when RRDonnelley submitted Google's data a couple of hours early during trading and they had to halt trading in the stock?
I maintain ~30 different crawlers. Most of them use Scrapy. Some use PhantomJS/CasperJS, but those are called from Scrapy via a simple web service.
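On the Scrapy side that just means treating the rendering service as another HTTP endpoint. A minimal sketch, assuming a hypothetical local service that takes a url parameter and returns rendered HTML (the host, port, and parameter name are made up, not our actual setup):

    # Sketch: route a JS-heavy page through a small rendering web service
    # instead of fetching it directly. Service URL and "url" param are assumptions.
    from urllib.parse import urlencode

    import scrapy


    class JsPageSpider(scrapy.Spider):
        name = "js_pages"

        def start_requests(self):
            target = "https://example.com/js-only-page"
            # The rendering service fetches `target` with a headless browser
            # and returns the rendered HTML for Scrapy to parse as usual.
            yield scrapy.Request(
                "http://localhost:8050/render?" + urlencode({"url": target}),
                callback=self.parse,
            )

        def parse(self, response):
            yield {"title": response.css("title::text").get()}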
All the data we collect (zip files, PDF, HTML, XML, JSON) is stored as-is (/path/to/<dataset name>/<unique key>/<timestamp>) and processed later using a Spark pipeline. lxml.html is WAY faster than BeautifulSoup and less prone to exceptions.
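A rough sketch of the store-raw-then-parse-later idea; the path layout mirrors the convention above, but the root directory, XPath expressions, and function names are illustrative, not our actual code:

    # Sketch: save responses untouched, parse cached files later with lxml.html.
    import time
    from pathlib import Path

    import lxml.html


    def store_raw(dataset: str, key: str, payload: bytes, root: str = "/data") -> Path:
        """Save the response bytes untouched so the ETL can be re-run later."""
        path = Path(root) / dataset / key / str(int(time.time()))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
        return path


    def parse_company_page(path: Path) -> dict:
        """Parse a cached HTML page with lxml.html (no network involved)."""
        tree = lxml.html.fromstring(path.read_bytes())
        # The XPath/element names are placeholders for whatever the real page exposes.
        return {
            "name": tree.findtext(".//h1"),
            "links": tree.xpath(".//a/@href"),
        }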
We have cronjobs (cron + Jenkins) that trigger dataset updates and discovery. For example, we scrape a corporate registry, so every day we update the versions of the 20k oldest companies. We also implement "discovery" logic in all of our crawlers so they can find new data (e.g. newly registered companies). We use Redis to send tasks (update / discovery) to our crawlers.
It's a simple Redis list containing JSON tasks. We have a custom Scrapy spider hooked to next_request and item_scraped [1]. It checks (lpop) for update/discovery tasks in the list and builds a Request [2]. We only crawl at max ~1 request per second, so performance is not an issue.
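A rough sketch of what such a spider can look like. Since [1] and [2] aren't reproduced here, the signal wiring, Redis key, and registry URL below are illustrative assumptions, not our actual code:

    # Sketch: a Scrapy spider that pulls JSON tasks from a Redis list.
    import json

    import redis
    import scrapy
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider


    class RegistrySpider(scrapy.Spider):
        name = "registry"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            spider.redis = redis.Redis()
            # Ask Redis for more work when the spider runs dry or finishes an item.
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
            return spider

        def next_request(self):
            """LPOP one JSON task, e.g. {"type": "update", "company_number": 1234},
            and turn it into a Request; returns None when the queue is empty."""
            raw = self.redis.lpop("registry:tasks")
            if raw is None:
                return None
            task = json.loads(raw)
            url = f"https://example-registry.test/companies/{task['company_number']}"
            return scrapy.Request(url, callback=self.parse, meta={"task": task})

        def spider_idle(self):
            request = self.next_request()
            if request is not None:
                self.crawler.engine.crawl(request)  # signature varies by Scrapy version
            raise DontCloseSpider  # keep the spider alive, waiting for new tasks

        def item_scraped(self, item, response, spider):
            request = self.next_request()
            if request is not None:
                self.crawler.engine.crawl(request)

        def parse(self, response):
            yield {
                "company_number": response.meta["task"]["company_number"],
                "html_length": len(response.body),
            }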
For every website we crawl, we implement custom discovery/update logic.
Discovery can be, for example, crawling a specific date range, sequence number, postal code... We usually seed discovery based on the actual data we have, like highest_company_number + 1000, so we pick up newly registered companies.
Update refreshes a single document, e.g. crawl the document for company number 1234. We generate a Request [2] to crawl only that document.
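A hedged sketch of how those discovery/update tasks might get seeded into the same Redis list the spider consumes; the key name, the +1000 window, and the function names are just the examples from above, not our actual code:

    # Sketch: push discovery/update tasks into the Redis list.
    import json

    import redis

    r = redis.Redis()


    def seed_discovery(highest_company_number: int, window: int = 1000) -> None:
        """Push discovery tasks for company numbers just past the highest one we hold."""
        for number in range(highest_company_number + 1, highest_company_number + window + 1):
            r.rpush("registry:tasks", json.dumps({"type": "discovery", "company_number": number}))


    def seed_update(company_numbers) -> None:
        """Push one update task per stale document, e.g. the 20k oldest versions."""
        for number in company_numbers:
            r.rpush("registry:tasks", json.dumps({"type": "update", "company_number": number}))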
We monitor exceptions with Sentry. Because we store raw data, we don't have to hurry to fix the ETL; we only have to fix the navigation logic and keep crawling.
Sorry if it's a stupid question/example/comparison, just trying to understand better:
You're storing the full HTML instead of reaching into the specific divs for the data you might need? This way, separating the fetching from the parsing?
I'm a scraping rookie, and I usually fetch + parse in the same call, this might resolve some issues for me :) thanks!
When I've done scraping, I've always taken this approach also: I decouple my process into paired fetch-to-local-cache-folder and process-cached-files stages.
I find this useful for several reasons, but particularly if you want to recrawl the same site for new/updated content, or if you decide to grab extra data from the pages (or, indeed, if your original parsing goes wrong or meets pages it wasn't designed for).
Related: As well as any pages I cache, I generally also have each stage output a CSV (requested url, local file name, status, any other relevant data or metadata), which can be used to drive later stages, or may contain the final output data.
Requesting all of the pages is the biggest time sink when scraping — it's good to avoid having to do any portion of that again, if possible.
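For a concrete picture, here's a minimal sketch of that two-stage setup under my usual assumptions; the cache folder, CSV columns, and selectors are illustrative:

    # Sketch: stage 1 fetches pages into a local cache and writes a CSV manifest,
    # stage 2 parses only the cached files and can be re-run freely.
    import csv
    import hashlib
    from pathlib import Path

    import lxml.html
    import requests

    CACHE = Path("cache")
    MANIFEST = Path("manifest.csv")


    def fetch_stage(urls):
        """Stage 1: download each URL once and record url, file name, status."""
        CACHE.mkdir(exist_ok=True)
        with MANIFEST.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["url", "file", "status"])
            for url in urls:
                name = hashlib.sha1(url.encode()).hexdigest() + ".html"
                resp = requests.get(url, timeout=30)
                (CACHE / name).write_bytes(resp.content)
                writer.writerow([url, name, resp.status_code])


    def parse_stage():
        """Stage 2: parsing pass driven by the manifest, no network needed."""
        with MANIFEST.open() as f:
            for row in csv.DictReader(f):
                if row["status"] != "200":
                    continue
                tree = lxml.html.fromstring((CACHE / row["file"]).read_bytes())
                yield {"url": row["url"], "title": tree.findtext(".//title")}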
What's wrong with billing hourly? Sometimes a client asks me to check something that's not working as expected; it could take less than 10 minutes. I won't do it for free, nor bill a week, so I'll bill an hour. I'm happy and so is he. If it takes longer, I go with half-day or full-day increments.
I do document parsing/data grooming, so there's a lot of tweaking/fixing as the client does QA on the data.
Actually, I overbook myself and offload some work to reliable part-time employees (the client is happy to know that it's not only me but other people working on the project as well).
It may be counterintuitive, but sometimes it's better to bill nothing at all for something like this than to get involved in the minutiae of billing in small increments for specific tasks. This enables you to remain psychologically anchored with that client as a daily or weekly high-value consultant, and doesn't undermine your ability to maintain an optimal billing rate and substantial minimum increment.
I think that's a good scenario for a retainer. You have a long-term maintenance relationship set up that typically doesn't require big chunks of work at a time.
The overheads to doing a 10 minute fix are massive: they email/call you to make a request, you change work contexts, fix the issue, test and release it, notify them that you're done, keep track of the time you spent working, send an invoice at the end of the month, keep an eye out for payment, thank them for paying etc.
Rounding up to the hour mitigates this, but unless you're doing several maintenance requests per client per week, or are charging very high ($250+) hourly rates, your business is probably losing money by keeping this client on the books.
I think it's better to negotiate a monthly retainer that ensures making tiny updates is worth your while, and then just have a set-and-forget invoice that gets sent automatically every month for that amount. Even better if you can get paid by direct debit.