Some relatively serious questions on the methodology:
- how did you define third party assets vs domain-managed assets? Is anything not hosted under example.com automatically third party? What about Twitter.com and t.co? I know this one is picky but would like a feel for the figures.
- how deep did you scrape the (million!) sites? If it's front page or similar I would not be surprised to see figures revised upwards significantly - once off the beaten track of even major sites the number of "let this one slide" decisions spikes a lot.
- how long did polling a million sites take?! What was the setup you used - very interested even if it has nothing to do with methodology :-)
Thank you - you have at least made me rethink my lack of blockers.
I will release code + data for bootstrapping, but until then here are my answers to your questions:
> how did you define third party assets vs domain-managed assets? Is anything not hosted under example.com automatically third party? What about Twitter.com and t.co? I know this one is picky but would like a feel for the figures.
That's based on the hosting domain being the same as, or a superset of, the domain that the page originally came from; anything else counts as third party.
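For concreteness, the check is in the spirit of the sketch below (a PHP sketch of the rule as described, not the released code; the function name and examples are illustrative):

```php
<?php
// Illustrative sketch only (not the released code): an asset counts as
// first party when its host equals the page's host, or when one host is
// a parent domain of the other. Everything else is third party.
function isFirstParty(string $pageUrl, string $assetUrl): bool
{
    $pageHost  = parse_url($pageUrl, PHP_URL_HOST);
    $assetHost = parse_url($assetUrl, PHP_URL_HOST);
    if (!is_string($pageHost) || !is_string($assetHost)) {
        return false; // unparseable URL: treat as third party
    }
    $pageHost  = strtolower($pageHost);
    $assetHost = strtolower($assetHost);
    if ($pageHost === $assetHost) {
        return true; // same host
    }
    // Also first party when one host is a dot-separated suffix of the
    // other, e.g. example.com vs. www.example.com.
    $suffix = '.' . $pageHost;
    if (substr($assetHost, -strlen($suffix)) === $suffix) {
        return true;
    }
    $suffix = '.' . $assetHost;
    return substr($pageHost, -strlen($suffix)) === $suffix;
}

var_dump(isFirstParty('http://www.example.com/', 'http://example.com/a.js')); // bool(true)
var_dump(isFirstParty('http://twitter.com/', 'http://t.co/abc'));             // bool(false)
```

So under the rule as stated, t.co would count as third party on a twitter.com page, since neither host is a suffix of the other.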
> how deep did you scrape the (million!) sites?
Just the homepage.
> If it's front page or similar I would not be surprised to see figures revised upwards significantly - once off the beaten track of even major sites the number of "let this one slide" decisions spikes a lot.
That's true.
> how long did polling a million sites take?!
20 days, at about 50K sites per day, which significantly cramped my ability to do other work here.
> What was the setup you used - very interested even if it has nothing to do with methodology :-)
A simple laptop with 16 GB of RAM and a regular (spinning) drive on a 200/20 cable connection: 40 concurrent worker threads, a simple PHP script to supervise the crawler, and another script to do the analysis.
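The supervisor was not much more than a loop keeping the workers busy; something in this spirit (a simplified sketch, not the actual script; sites.txt, crawl.js and the output paths are placeholders):

```php
<?php
// Simplified supervisor sketch (not the actual script): keep up to 40
// phantomjs workers running, one site each, starting a new worker as
// soon as one finishes. File names here are placeholders.
$sites   = file('sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$maxJobs = 40;
$workers = [];

while ($sites || $workers) {
    // Top up the worker pool.
    while ($sites && count($workers) < $maxJobs) {
        $site = array_shift($sites);
        $cmd  = 'phantomjs crawl.js ' . escapeshellarg($site)
              . ' > ' . escapeshellarg('out/' . md5($site) . '.log') . ' 2>&1';
        $workers[] = proc_open($cmd, [], $pipes);
    }
    // Reap finished workers.
    foreach ($workers as $i => $proc) {
        if (!proc_get_status($proc)['running']) {
            proc_close($proc);
            unset($workers[$i]);
        }
    }
    usleep(100000); // poll every 100 ms
}
```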
Most of the data was discarded right after crawling a page; only the URLs that were loaded as a result of loading the homepage were kept, along with the MIME type of each result.
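So the record that survives each page is tiny; per page it amounts to something like this (illustrative shape only, the field names and file layout are mine, not the actual format):

```php
<?php
// Illustrative shape of what is kept per crawled page (field names are
// mine, not the actual format): the page itself plus, for every request
// its homepage triggered, the URL and the MIME type of the response.
$record = [
    'page'      => 'http://example.com/',
    'resources' => [
        ['url' => 'http://example.com/css/site.css',            'mime' => 'text/css'],
        ['url' => 'http://cdn.example.com/js/app.js',           'mime' => 'application/javascript'],
        ['url' => 'http://www.google-analytics.com/ga.js',      'mime' => 'text/javascript'],
    ],
];
// One JSON line per page keeps the data set small and easy to stream.
file_put_contents('data/pages.jsonl', json_encode($record) . "\n", FILE_APPEND);
```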
Two things leap out. Firstly, I love the way you chose to do 1 million sites. I would have gone with, hmm, maybe the top thousand, and called it a representative sample :-) The scale of the modern world is still something I am grappling with.
Secondly, is that 200 Mbps down / 20 Mbps up? I think the UK has some broadband access lessons to learn if that's true. My wet piece of string is getting threadbare.
It's maybe overkill to do it on the whole set instead of just a sample; the numbers probably would not change all that much.
The 200/20 is indeed 200 Mbps down and 20 Mbps up; this setup saturated the line pretty well, though. I probably could have saved some time and bandwidth by letting PhantomJS abort on image content, but I was lazy.
I'm slap bang in the commuter belt round London - and broadband availability is having an actual effect on house prices and decisions to move out of the area.
It's surprisingly low on the political agenda nationwide.
I'm about to get all English middle class over this, so I will stop now :-)