
Yeah, this might be handy for small stuff, but it's way too naive for anything bigger than a couple of pages. I recently had to scrape some pictures and metadata from a website, and while scripts like these seemed cool, they really didn't scale up at all. Consider navigation, following URLs, and downloading pictures, all while remaining within the limits of what's considered non-intrusive.
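The "non-intrusive" part mostly comes down to rate limiting: never fire requests faster than some minimum delay. A minimal stdlib-only sketch (the class name and delay value are my own, arbitrary choices):

```python
import time


class Throttle:
    """Politeness throttle: enforce a minimum delay between requests."""

    def __init__(self, min_delay: float):
        self.min_delay = min_delay
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> None:
        """Sleep just long enough that min_delay has passed since the last call."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last = time.monotonic()
```

You'd call `throttle.wait()` before every request; crawling frameworks bake the same idea in as a download-delay setting.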

My first attempt, similar to this, failed miserably: the site employed some kind of cookie check that immediately blocked my requests with a 403.

As mentioned in the article, I then moved on to Scrapy (https://scrapy.org/). While it seems a bit overkill at first, once you create your scraper it's easy to expand, and you can reuse the same scaffold on other sites too. It also gives you a lot more control over how gently you scrape, and it outputs the data you want as clean JSON/JL/CSV.

Most of the problems I had were with Scrapy's pipelines and getting it to properly output two JSON files plus the images. I could write a very short tutorial on my setup if I weren't at work and otherwise busy right now.

And yes, it's a bit of a grey area, but for my project (training a simple CNN on the images) I think it was acceptable, considering I could have done the same thing manually (and spent less time, too).




Python Requests has a notion of a "session", which takes care of cookies etc. I use it all the time when I need to automate tasks that require signing in.
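A sketch of the pattern, assuming the `requests` library is installed; the login URL and form field names are hypothetical and depend entirely on the target site:

```python
import requests

# A Session persists cookies and default headers across requests,
# so a sign-in on one call carries over to every later call.
session = requests.Session()
session.headers.update(
    {"User-Agent": "my-scraper/0.1 (contact: me@example.com)"}  # assumed UA
)


def sign_in(base_url: str, username: str, password: str) -> requests.Session:
    """Hypothetical login flow: POST credentials; the server's Set-Cookie
    lands in session.cookies and rides along on subsequent requests."""
    session.post(f"{base_url}/login", data={"user": username, "pass": password})
    return session
```

After `sign_in(...)`, a plain `session.get(...)` on a protected page sends the auth cookies automatically, which is often enough to get past checks that 403 a bare request.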


I've been through the rigmarole of writing my own crawlers and find Scrapy very powerful. I've run into roadblocks with dynamic/JavaScript-heavy sites; for those, Selenium + chromedriver works really well.

As parent and others have said: this is a grey area so make sure to read the terms of use and/or gain permission before scraping.


Notice how it's not a grey area when Google does it. The usual double standard applies, I guess.


I don't understand what this comment is referring to. Google's spider respects robots.txt: just block all paths and Google will not crawl your site. The same goes for Bing, Yahoo, Baidu (with some complications, I think), Yandex, and so on. Most of the major spiders respect robots.txt.
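That blanket block is two lines of robots.txt, and you can check how a compliant crawler interprets it with Python's stdlib parser:

```python
from urllib import robotparser

# A robots.txt that disallows every path for every user agent:
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler (Googlebot included) must skip every URL.
allowed = rp.can_fetch("Googlebot", "https://example.com/any/page")
```

Here `allowed` comes back `False`; the same parser is what you'd use in your own crawler to honor other sites' robots.txt.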

Is there some major Google web scraping effort I'm not aware of?


I'm just getting into web scraping and have also been using Selenium with either Firefox or PhantomJS. Is there a better way to handle JavaScript-heavy sites? I found one library called dryscrape but haven't had time to look too deeply into it.


Take a look at the Google Chrome team's Puppeteer:

https://github.com/GoogleChrome/puppeteer


Splash runs in Docker and does a decent job. It's from the Scrapinghub team.




