Not sure if it's still the case, but I remember watching a talk (around 2021) where they mentioned they were using different providers, among them AWS for some regions.
About 5 years ago I made an XBRL viewer. In a ton of reports I tested, the data did not match the paper reports. After ruling out bugs in my own software I went as far as emailing a few companies' investor relations departments and never got a response.
I hope things have improved in the years since, but I doubt it. Nobody looks at the XBRL; it's just a regulatory box to tick.
It isn't a choice for any registrant filing with the SEC. Even the smallest filers will have to file most reports using XBRL after June.
The application of XBRL doesn't really affect the design of the reports. It ebbs and flows; it feels like more heavily designed filings are coming back as better tools allow more designed reports.
There is an entire industry that does financial printing: 10-Ks, quarterly reports, prospectuses, etc. They handle the filing with SEC/EDGAR. If there is a discrepancy, it arises there. Anyone remember when RRDonnelley submitted Google's data a couple of hours early during trading and they had to halt trading in the stock?
I maintain ~30 different crawlers. Most of them use Scrapy. Some use PhantomJS/CasperJS, but those are called from Scrapy via a simple web service.
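On the Scrapy side that just means treating the rendering service as another HTTP endpoint. A minimal sketch, assuming a hypothetical local service that takes a url parameter and returns rendered HTML (the host, port, and parameter name are made up, not our actual setup):

    # Sketch: route a JS-heavy page through a small rendering web service
    # instead of fetching it directly. Service URL and "url" param are assumptions.
    from urllib.parse import urlencode

    import scrapy


    class JsPageSpider(scrapy.Spider):
        name = "js_pages"

        def start_requests(self):
            target = "https://example.com/js-only-page"
            # The rendering service fetches `target` with a headless browser
            # and returns the rendered HTML for Scrapy to parse as usual.
            yield scrapy.Request(
                "http://localhost:8050/render?" + urlencode({"url": target}),
                callback=self.parse,
            )

        def parse(self, response):
            yield {"title": response.css("title::text").get()}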
All the data we collect (zip files, PDF, HTML, XML, JSON) is stored as-is (/path/to/<dataset name>/<unique key>/<timestamp>) and processed later using a Spark pipeline. lxml.html is WAY faster than BeautifulSoup and less prone to exceptions.
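A rough sketch of the store-raw-then-parse-later idea; the path layout mirrors the convention above, but the root directory, XPath expressions, and function names are illustrative, not our actual code:

    # Sketch: save responses untouched, parse cached files later with lxml.html.
    import time
    from pathlib import Path

    import lxml.html


    def store_raw(dataset: str, key: str, payload: bytes, root: str = "/data") -> Path:
        """Save the response bytes untouched so the ETL can be re-run later."""
        path = Path(root) / dataset / key / str(int(time.time()))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
        return path


    def parse_company_page(path: Path) -> dict:
        """Parse a cached HTML page with lxml.html (no network involved)."""
        tree = lxml.html.fromstring(path.read_bytes())
        # The XPath/element names are placeholders for whatever the real page exposes.
        return {
            "name": tree.findtext(".//h1"),
            "links": tree.xpath(".//a/@href"),
        }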
We have cronjobs (cron + Jenkins) that trigger dataset updates and discovery. For example, we scrape a corporate registry, so every day we update the versions of the 20k oldest companies. We also implement "discovery" logic in all of our crawlers so they can find new data (e.g. newly registered companies). We use Redis to send tasks (update / discovery) to our crawlers.
It's a simple Redis list containing JSON tasks. We have a custom Scrapy spider hooked to next_request and item_scraped [1]. It checks (lpop) for update/discovery tasks in the list and builds a Request [2]. We only crawl at max ~1 request per second, so performance is not an issue.
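A rough sketch of what such a spider can look like. Since [1] and [2] aren't reproduced here, the signal wiring, Redis key, and registry URL below are illustrative assumptions, not our actual code:

    # Sketch: a Scrapy spider that pulls JSON tasks from a Redis list.
    import json

    import redis
    import scrapy
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider


    class RegistrySpider(scrapy.Spider):
        name = "registry"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            spider.redis = redis.Redis()
            # Ask Redis for more work when the spider runs dry or finishes an item.
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
            return spider

        def next_request(self):
            """LPOP one JSON task, e.g. {"type": "update", "company_number": 1234},
            and turn it into a Request; returns None when the queue is empty."""
            raw = self.redis.lpop("registry:tasks")
            if raw is None:
                return None
            task = json.loads(raw)
            url = f"https://example-registry.test/companies/{task['company_number']}"
            return scrapy.Request(url, callback=self.parse, meta={"task": task})

        def spider_idle(self):
            request = self.next_request()
            if request is not None:
                self.crawler.engine.crawl(request)  # signature varies by Scrapy version
            raise DontCloseSpider  # keep the spider alive, waiting for new tasks

        def item_scraped(self, item, response, spider):
            request = self.next_request()
            if request is not None:
                self.crawler.engine.crawl(request)

        def parse(self, response):
            yield {
                "company_number": response.meta["task"]["company_number"],
                "html_length": len(response.body),
            }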
For every website we crawl, we implement custom discovery/update logic.
Discovery can be, for example, crawling a specific date range, sequence number, postal code... We usually seed discovery based on the actual data we have, like highest_company_number + 1000, so we pick up newly registered companies.
Update refreshes a single document, e.g. crawl the document for company number 1234. We generate a Request [2] to crawl only that document.
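A hedged sketch of how those discovery/update tasks might get seeded into the same Redis list the spider consumes; the key name, the +1000 window, and the function names are just the examples from above, not our actual code:

    # Sketch: push discovery/update tasks into the Redis list.
    import json

    import redis

    r = redis.Redis()


    def seed_discovery(highest_company_number: int, window: int = 1000) -> None:
        """Push discovery tasks for company numbers just past the highest one we hold."""
        for number in range(highest_company_number + 1, highest_company_number + window + 1):
            r.rpush("registry:tasks", json.dumps({"type": "discovery", "company_number": number}))


    def seed_update(company_numbers) -> None:
        """Push one update task per stale document, e.g. the 20k oldest versions."""
        for number in company_numbers:
            r.rpush("registry:tasks", json.dumps({"type": "update", "company_number": number}))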
We monitor exceptions with Sentry. Because we store raw data, we don't have to hurry to fix the ETL; we only have to fix the navigation logic and keep crawling.
Sorry if it's a stupid question/example/comparison, just trying to understand better:
You're storing the full HTML instead of reaching into the specific divs for the data you might need? This way, separating the fetching from the parsing?
I'm a scraping rookie, and I usually fetch + parse in the same call, this might resolve some issues for me :) thanks!
When I've done scraping, I've always taken this approach also: I decouple my process into paired fetch-to-local-cache-folder and process-cached-files stages.
I find this useful for several reasons, but particularly if you want to recrawl the same site for new/updated content, or if you decide to grab extra data from the pages (or, indeed, if your original parsing goes wrong or meets pages it wasn't designed for).
Related: As well as any pages I cache, I generally also have each stage output a CSV (requested url, local file name, status, any other relevant data or metadata), which can be used to drive later stages, or may contain the final output data.
Requesting all of the pages is the biggest time sink when scraping — it's good to avoid having to do any portion of that again, if possible.
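For a concrete picture, here's a minimal sketch of that two-stage setup under my usual assumptions; the cache folder, CSV columns, and selectors are illustrative:

    # Sketch: stage 1 fetches pages into a local cache and writes a CSV manifest,
    # stage 2 parses only the cached files and can be re-run freely.
    import csv
    import hashlib
    from pathlib import Path

    import lxml.html
    import requests

    CACHE = Path("cache")
    MANIFEST = Path("manifest.csv")


    def fetch_stage(urls):
        """Stage 1: download each URL once and record url, file name, status."""
        CACHE.mkdir(exist_ok=True)
        with MANIFEST.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["url", "file", "status"])
            for url in urls:
                name = hashlib.sha1(url.encode()).hexdigest() + ".html"
                resp = requests.get(url, timeout=30)
                (CACHE / name).write_bytes(resp.content)
                writer.writerow([url, name, resp.status_code])


    def parse_stage():
        """Stage 2: parsing pass driven by the manifest, no network needed."""
        with MANIFEST.open() as f:
            for row in csv.DictReader(f):
                if row["status"] != "200":
                    continue
                tree = lxml.html.fromstring((CACHE / row["file"]).read_bytes())
                yield {"url": row["url"], "title": tree.findtext(".//title")}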
What's wrong with billing hourly? Sometimes a client asks me to check something that's not working as expected; it could take less than 10 minutes. I won't do it for free, nor bill a week, so I'll bill an hour. I'm happy and so is he. If it takes longer, I go with half-day or full-day increments.
I do document parsing/data grooming, so there's a lot of tweaking/fixing as the client does QA on the data.
Actually, I overbook myself and offload some work to reliable part-time employees (the client is happy to know that it's not only me but other people working on the project as well).
It may be counterintuitive, but sometimes it's better to bill nothing at all for something like this than to get involved in the minutiae of billing in small increments for specific tasks. This enables you to remain psychologically anchored with that client as a daily or weekly high-value consultant, and doesn't undermine your ability to maintain an optimal billing rate and substantial minimum increment.
I think that's a good scenario for a retainer. You have a long-term maintenance relationship set up that typically doesn't require big chunks of work at a time.
The overheads to doing a 10 minute fix are massive: they email/call you to make a request, you change work contexts, fix the issue, test and release it, notify them that you're done, keep track of the time you spent working, send an invoice at the end of the month, keep an eye out for payment, thank them for paying etc.
Rounding up to the hour mitigates this, but unless you're doing several maintenance requests per client per week, or are charging very high ($250+) hourly rates, your business is probably losing money by keeping this client on the books.
I think it's better to negotiate a monthly retainer that ensures making tiny updates is worth your while, and then just have a set-and-forget invoice that gets sent automatically every month for that amount. Even better if you can get paid by direct debit.