
Related question - what is a very fast and easy to use library for scraping static sites such as Google search results?



Google search isn't a static site; the results are dynamically generated based on what Google knows about you (location, browser language, recent searches from your IP, recent searches from your account, and everything else it has gathered while selling ad slots to that device).

That being said, there isn't anything wrong with using Scrapy for this. If you're more familiar with web browsers than with Python, something like https://github.com/puppeteer/puppeteer can also be turned into a quick way to scrape a site: it gives you a headless browser controlled by whatever you script in Node.js.
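For what it's worth, here's a minimal sketch of the Scrapy route. The start URL and CSS selectors are placeholders I haven't tested against Google (its markup changes constantly, and you'll also hit consent pages and blocking), so treat this as shape-of-the-code only:

    # Run with: scrapy runspider search_spider.py -o results.json
    import scrapy

    class SearchSpider(scrapy.Spider):
        name = "search"
        start_urls = ["https://www.google.com/search?q=web+scraping"]

        def parse(self, response):
            # One item per result block; "div.g" is a commonly cited but unstable selector.
            for result in response.css("div.g"):
                yield {
                    "title": result.css("h3::text").get(),
                    "link": result.css("a::attr(href)").get(),
                }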


I see. I am familiar with Python, but I don't need something as heavy as Scrapy. Ideally I am looking for something very lightweight and fast that can just parse the DOM using CSS selectors.


I've had excellent luck with SerpAPI. It's $50 a month for 5,000 searches, which has been plenty for my needs at a small SEO/marketing agency.

http://serpapi.com
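The basic call is just an HTTP GET that returns JSON. Something like the sketch below; the parameter names are from memory, so double-check them against their docs, and the API key is obviously a placeholder:

    import requests

    params = {
        "engine": "google",           # which search engine to query
        "q": "coffee shops seattle",  # the search query
        "api_key": "YOUR_API_KEY",    # placeholder
    }
    resp = requests.get("https://serpapi.com/search.json", params=params, timeout=30)
    resp.raise_for_status()

    # Organic results come back as structured JSON, no HTML parsing needed.
    for result in resp.json().get("organic_results", []):
        print(result.get("position"), result.get("title"), result.get("link"))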


As others have said, Google isn't a static site, and on top of that, its markup is a nightmare of tags and whatnot that makes it utterly horrific to scrape.

After scraping tens of millions of pages, possibly hundreds of millions, I've fallen back to lxml with Python. It's not for all use cases, but it works for me.
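A rough sketch of what that looks like in practice; the URL and XPath expressions are illustrative only, real pages need their own selectors:

    import requests
    from lxml import html

    resp = requests.get("https://example.com/some-page", timeout=30)
    doc = html.fromstring(resp.content)

    # XPath tends to hold up better than CSS on messy markup, and lxml is fast.
    titles = doc.xpath("//h2/a/text()")
    links = doc.xpath("//h2/a/@href")
    for title, link in zip(titles, links):
        print(title.strip(), link)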

One thing I'll do before scraping a page is check whether it's rendered server-side or client-side. If it's client-side, I'll see if I can just get the raw data directly, which makes things much, much easier.
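A quick-and-dirty way to check: fetch the page without executing any JavaScript and see whether the content you care about is in the raw HTML. The URL and marker text here are placeholders:

    import requests

    URL = "https://example.com/listing"
    MARKER = "Expected product name"  # text you know appears in the rendered page

    raw_html = requests.get(URL, timeout=30).text
    if MARKER in raw_html:
        print("Looks server-rendered: parse the HTML directly.")
    else:
        # Likely client-rendered; check the browser's network tab for the
        # JSON/XHR endpoint feeding the page and hit that instead.
        print("Content not in raw HTML: look for the underlying data endpoint.")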



