
Yeah, this might be handy for small stuff, but it's way too naive for anything bigger than a couple of pages. I recently had to scrape some pictures and metadata from a website, and while scripts like these seemed cool, they really didn't scale up at all. Consider navigation, following URLs, and downloading pictures, all while remaining within the limits of what's considered non-intrusive.
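The "non-intrusive" part mostly comes down to rate limiting: never fire requests faster than some minimum delay. A minimal stdlib-only sketch (the class name and delay value are my own, arbitrary choices):

```python
import time


class Throttle:
    """Politeness throttle: enforce a minimum delay between requests."""

    def __init__(self, min_delay: float):
        self.min_delay = min_delay
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> None:
        """Sleep just long enough that min_delay has passed since the last call."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last = time.monotonic()
```

You'd call `throttle.wait()` before every request; crawling frameworks bake the same idea in as a download-delay setting.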

My first attempt, similar to this, failed miserably: the site employed some kind of cookie check that immediately blocked my requests with a 403.

As mentioned in the article, I then moved on to Scrapy (https://scrapy.org/). While it seems a bit overkill at first, once you create your scraper it's easy to expand, and you can reuse the same scaffold on other sites too. It also gives you a lot more control over how gently you scrape, and it outputs the data you want as clean JSON/JL/CSV.

Most of the problems I had were with Scrapy's pipelines and getting it to properly output two JSON files plus the images. I could write a very short tutorial on my setup if I weren't at work and otherwise busy right now.

And yes, it's a bit of a grey area, but for my project (training a simple CNN on the images) I think it was acceptable, considering I could have done the same thing manually (and spent less time, too).




Python Requests has a notion of a "session", which takes care of cookies etc. I use it all the time when I need to automate tasks that require signing in.
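A sketch of the pattern, assuming the `requests` library is installed; the login URL and form field names are hypothetical and depend entirely on the target site:

```python
import requests

# A Session persists cookies and default headers across requests,
# so a sign-in on one call carries over to every later call.
session = requests.Session()
session.headers.update(
    {"User-Agent": "my-scraper/0.1 (contact: me@example.com)"}  # assumed UA
)


def sign_in(base_url: str, username: str, password: str) -> requests.Session:
    """Hypothetical login flow: POST credentials; the server's Set-Cookie
    lands in session.cookies and rides along on subsequent requests."""
    session.post(f"{base_url}/login", data={"user": username, "pass": password})
    return session
```

After `sign_in(...)`, a plain `session.get(...)` on a protected page sends the auth cookies automatically, which is often enough to get past checks that 403 a bare request.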


I've been through the rigmarole of writing my own crawlers and find Scrapy very powerful. I've run into roadblocks with dynamic/JavaScript-heavy sites; for those, Selenium + chromedriver works really well.

As parent and others have said: this is a grey area so make sure to read the terms of use and/or gain permission before scraping.


Notice how it's not a grey area when Google does it. The usual double standard applies, I guess.


I don't understand what this comment is referring to. Google's spider respects robots.txt: just block all paths and Google will not crawl your site. The same goes for Bing, Yahoo, Baidu (with some complications, I think), Yandex, and so on. Most of the major spiders respect robots.txt.
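That blanket block is two lines of robots.txt, and you can check how a compliant crawler interprets it with Python's stdlib parser:

```python
from urllib import robotparser

# A robots.txt that disallows every path for every user agent:
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler (Googlebot included) must skip every URL.
allowed = rp.can_fetch("Googlebot", "https://example.com/any/page")
```

Here `allowed` comes back `False`; the same parser is what you'd use in your own crawler to honor other sites' robots.txt.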

Is there some major Google web scraping effort I'm not aware of?


I'm just getting into web scraping and have also been using Selenium with either Firefox or PhantomJS. Is there a better way to handle JavaScript-heavy sites? I found one library called dryscrape but haven't had time to look too deeply into it.


Take a look at the Google Chrome team's Puppeteer:

https://github.com/GoogleChrome/puppeteer


Splash runs in Docker and does a decent job. It's from the Scrapinghub team.




