Doesn't that require you to have a quota of affiliate sales to keep using it? I can't find where they state this requirement, but I remember they were very sneaky about disclosing it. If you don't have any affiliate sales after X months, your API key will stop working.
Currently you have to be a member of their affiliate program to get API access. To become a full "member" you have to be a prospect who generates three referral sales (iirc) within a 30-day period. So once you're in you have the API, but getting in isn't as easy as filling out a form. From there you can get your API rate limits increased from the default of one call per second up to ten, based on your prior 30-day affiliate sales.
Scraping Amazon is fun and all, but when you start overdoing it they rate-limit your IP and show you my worst nightmare: the Dogs of Amazon (a 500 error page with pictures of dogs).
Why do I know this? Because I'm the CTO at Nazdeeq.com where we let users buy Amazon products from countries where they don't ship easily, like Pakistan.
Edit: totally open to partnerships in more countries
I'm from Brazil and what you said made me curious. Not sure why, but Amazon never caught on here. How did you solve problems like logistics and generating interest from the public?
I'm sorry, I have trouble understanding your question, but if you mean how we ship from Amazon to Pakistan and how we got people to use our service: we worked out a pipeline to get products from Amazon in the US to Pakistan, plus advertising + word of mouth. Also:
+ There's no direct way to buy 90% of products from Amazon since they don't ship to Pakistan
+ Our service is the only one in the country that gives a fixed price at checkout in PKR
+ Our customer service is excellent
+ We're one of the cheapest options available, as long as the competition imports products legally.
I don't know Nazdeeq, but the parent claims they deal with importing, and last I checked neither MyUS nor any other random package forwarder deals with customs, which can be a real PITA. (I have some experience with importing things into ... legally interesting ... countries: from 1990 until I left in 2006, I, as an individual, imported a lot of computers and parts into Hungary. It was ... fun.)
Hey Irfan! I'm sorry you feel that way, but as someone else here pointed out, we handle everything from clearing customs to last-mile delivery (to your doorstep). That's why our pricing may seem expensive: we declare all costs upfront.
A lot of our customers have similar horror stories where their goods get stuck in customs because they didn't realize clearing is a thing.
Hi Amin, your platform seems nice. Just wanted to give you a heads-up that your website is being classified as ["phishing" by Avast](https://i.imgur.com/SmuuRfD.png). I think if you replace "Amazon" in the url with something else it should work fine. Best of luck!
Reminds me of how nobody could see one of my users' avatars because the URL (a hash) started with an "ad" segment (for bucketing), as in "/avatars/ad/ad3adb33f". So adblockers blocked it.
My protest against such a ridiculous heuristic was to not fix it.
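For what it's worth, a minimal sketch of the kind of path-based heuristic that bites here (the pattern is illustrative, not taken from any specific filter list):

```python
import re

# Many filter lists block any URL whose path contains an "/ad/" segment.
AD_SEGMENT = re.compile(r"/ad/", re.IGNORECASE)

urls = [
    "/avatars/ad/ad3adb33f",   # hash bucket that happens to start with "ad"
    "/avatars/7f/7f00c0ffee",  # unaffected bucket
]
for url in urls:
    print(url, "-> blocked" if AD_SEGMENT.search(url) else "-> loads fine")
```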
It makes sense why you would choose to do that and I can certainly empathize, but in the interest of user experience I try to fix these problems; my customers deserve a good experience.
In the Philippines there's something quite similar called Galleon. They've recently been acquired, but I think they might be open to partnering. They've expanded to Thailand, if I'm not mistaken.
The issue with those tools is that Amazon changes the product page layout very often and runs heavy A/B testing. I've even heard that computer vision is the most stable way to scrape Amazon. I guess this library will stop working rather soon.
Are you able to share some details? How often did you have to get new IP addresses? What about user agents? Were the scrapers "straight to the point" like amazon2csv (i.e. make a request directly to the search page), or did they have randomized behavior (e.g. re-use sessions from time to time, click a random link on the page, start from the homepage...)? Did the scrapers ever go against amz's robots.txt directives (e.g. interacting with the cart page)? Ever heard from amz itself about your employer's activities on their site?
Same here. Scraping their search results page is easy if you have a bunch of IPs. No manipulation or workarounds needed (i.e. headless browsers, or making sure your HTTP headers look like a real user's).
I have not scraped a ton of actual individual product pages though, so I can't speak to that. I do remember it might have been harder.
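A minimal sketch of that "straight to the point" style, assuming plain requests and a pool of your own proxy endpoints (the proxy addresses are placeholders):

```python
import random
import requests

# Placeholder proxy pool; in practice these would be proxies you control.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]

def fetch_search_page(keyword: str) -> str:
    """Fetch one search results page through a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    resp = requests.get(
        "https://www.amazon.com/s",
        params={"k": keyword},
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text
```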
> I guess this library will stop working rather soon.
Don’t really see that as a dealbreaker. So the library will need maintenance. Normal for libraries to need updates in order to keep up with changes. It works today, and it will work whenever it’s updated. Better than nothing and for many use cases that’s good enough.
Search results scraping on Amazon is fairly stable.
What's more difficult is product page scraping, because there you have hundreds of different variations. Some from A/B testing and a lot just being specific things that show up for certain product categories (e.g. video games).
I remember trying to build a scraper for Amazon. I quickly discovered that there are many types of item pages, and they change over time too, probably from A/B testing. Just getting the price of a product out of their HTML markup reliably was a nightmare; I had to build a huge tree of if-this-then-maybe-that logic.
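In practice that tree of maybes tends to look something like the sketch below; the selectors are examples of locations Amazon has used for prices at various times, not an exhaustive or current list:

```python
from bs4 import BeautifulSoup

# Price locations seen on different page variants / product categories.
PRICE_SELECTORS = [
    "#priceblock_ourprice",
    "#priceblock_dealprice",
    "span.a-price span.a-offscreen",
]

def extract_price(html: str):
    """Try each known price location in turn; return None if all fail."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None
```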
We brand it as an ordering API, but we also offer product data retrieval (item details/pricing). We put a LOT of engineering resources into data quality and maintenance, as the API is core to our flagship product, PriceYak. If you have questions or want a token, email adam@zinc.io and mention this post.
If you're using this for anything serious, it's probably better to sign up for the Keepa API at about $50/month; they scrape Amazon for you. Worth it to not have to deal with the complexities yourself.
2. Isn't one of the points of using Session() that it'll persist stuff like cookies and headers? So why is it re-defining the headers multiple times? (e.g. both GET and POST in the same session have their own respective but identical headers).
4. Using raw list indices without some kind of helper function to catch index and other errors when parsing is not really a good idea in scraping (e.g. `selection[0].text.strip()`).
1. It just does `return Session()`; that's easy enough to find out[1].
2. It doesn't really matter, and maybe it's so they are kept close to where they are modified. Session merges the headers you provide per call with its own (yours take precedence), and it still handles cookies if you provide your own headers. Session also has the benefit of connection pooling, so it's quicker to do more than one request with it[2] (the plain get, post, etc. functions in the requests module all go through the request function in the end, which constructs a new Session for that single request). A small sketch of this merging behavior follows at the end of this comment.
3. What's wrong there? It's just a default argument. Strings aren't mutable, so it avoids the mutable-default pitfall[4]. Is the " quote a problem here? That's a matter of taste/style. PEP 8 is silent on it[3] and just says to pick a convention, and there seems to be one here. Some people (me included) also use single quotes for non-human-readable strings and double quotes for human-readable strings.
4. If you mean here[5], then there is a len check just above it to catch the 'expected' error (a missing element); only the .text part is unchecked. As for the general lack of checks: I don't put them into my GreaseMonkey or one-off Python scraping code either. The site layout is an invariant of a given version of a scraper script, so if some field is missing, the website layout must have changed and the entire script probably needs to be reworked (or the field is simply not always present, in which case the script is useless for that particular case anyway), and it might as well crash (or, if someone else is using it, they can catch the exception). Either way (crash or catch), when something you expected to be there is missing, the results are not coming or might be wrong. The code as it stands anticipates that the element might be missing, but assumes that if it is present, it has the expected field. If the site has been observed to always behave like that (a certain element might be missing, but when it's there it surely has that field), then the script just works that way and guards against the first expected possibility (missing element) but not the second (missing field), since if the way the site is laid out or how data is stored in elements changed significantly, the script needs changes anyway or it risks producing bad or incomplete output (arguably a worse default than a loud failure; it also depends on what you're doing and what the scrape is for).
I'd assume most users and programmers would rather get an error than have the script return an empty list (despite there being content on the page) just because the layout changed. The only other solution (other than returning a wrong result by design and hiding the errors, or logging them somewhere no one reads anyway) would be to catch such exceptions somewhere high up and either wrap them in a new exception thrown with more information (what URL, what the element content was exactly, the entire response text, etc.), which is probably too much work for such a one-off script, or throw your own exceptions from some low place. But that's vanity, because Python exceptions already point very clearly to where they were thrown and with what call stack, so it's just as clear what broke without adding lots of checks yourself and throwing an "element X is missing field Y that should always be there" message.
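A minimal sketch of the Session behavior described in point 2, assuming plain requests against a placeholder URL (none of this is taken from the library under discussion):

```python
import requests

# A Session persists cookies and pools connections across requests.
session = requests.Session()
session.headers.update({"User-Agent": "example-agent/1.0"})

# Per-call headers are merged with the session's own headers; on a key
# conflict the per-call value wins, and cookie handling still applies.
resp = session.get(
    "https://httpbin.org/headers",
    headers={"Accept-Language": "en-US"},
)
print(resp.json())

# By contrast, the top-level requests.get() constructs a throwaway Session
# internally, so repeated calls don't reuse the pooled connection.
resp2 = requests.get("https://httpbin.org/headers")
print(resp2.status_code)
```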
1. I never realized there was a function that just returned an instance of the class. Should've just looked it up myself.
2. I was wrong and misread the header stuff.
3. There's nothing wrong with it. It's just a convention I'm not accustomed to seeing or using. Admittedly, there are lots of ways to skin a cat with optional and default args.
4. Yeah I understand what you're saying. I guess it's a fine "greasemonkey" approach. Just more susceptible to DOM changes and code errors than I'm comfy with even for a rudimentary implementation.
I agree you'd usually rather get an error than an empty list, but I think it's a little more complicated than just choosing between the two. Sometimes you want the error but don't get one, which is why I tend to write more code around checking things and catching exceptions. The best example: you might not see an index error, but the list item returned for your index isn't actually what you wanted, because the DOM changed or because you wrote code against one page and it broke on another page you thought would be identical.
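For what it's worth, a minimal sketch of the kind of defensive helper meant here (the names are illustrative, not from the library being reviewed):

```python
from typing import Optional

def first_text(selection) -> Optional[str]:
    """Return the stripped text of the first element, or None.

    Guards against both the missing-element case and the case where the
    element exists but isn't what was expected (e.g. has no text), so the
    caller can decide whether to raise, log, or skip.
    """
    if not selection:
        return None  # element missing: the layout probably changed
    text = getattr(selection[0], "text", None)
    if not text or not text.strip():
        return None  # element present but not what we expected
    return text.strip()
```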
I know this is HN and all but I am not even entirely confident about my own remarks. I asked "Am I the only one..." earnestly, not as a way of softening criticism. I'm a self-taught amateur and have never submitted a PR before.
It is also illegal to scrape Amazon, since if you scrape it, you don't own that content and you are just stealing product data added to the site by the products' proper owners.
> Scraping publicly available information is far from illegal.
The scraping itself may not be (although I'm pretty sure here in Belgium there is a law against collecting other people's data), but what you do with it may not be legal.
You could make a case for making any kind of profit generated from scraping data illegal. Don't get me wrong, I love scraping things myself.
I also find it amazing that there are companies out there like Crawlera that can do serious scraping work and openly flaunt deploying tech to get around whatever scraping blockers are out there.
LinkedIn had multiple layers of scraping detection systems deployed, and went to significant efforts to block their data from getting scraped[1].
Last year, they were ordered by a Federal court explicitly to allow scraping of content and remove systems that were designed to impede and block scraping efforts[2].
There's no clear law (in the US) directly aimed at scraping, and repurposing anti-hacking laws runs into the murky definition of what counts as unauthorized access. If a judge explicitly clarifies that scraping is not unauthorized access (which happened in this case, although it still needs to stand up on appeal[3]), then entities interested in preventing scraping have lost one of their core legal underpinnings. It demonstrates why companies like Crawlera have been able to flaunt the kind of serious scraping work they do: it's hard to bully people with a legal argument that has been debunked and affirmed as debunked on appeal. So it's better to avoid the risk of setting that precedent entirely until you can't avoid it.
Check the Amazon API T&C; also, try to do the same with Craigslist and see how long they will let you do it. Scraping data is always a shady business if you do it without the permission of the content owner.
It is anything but shady. They can send you a C&D or file a suit and seek an injunction, but there is no way they can get you in trouble with the law for scraping publicly available data.
E.g. people take product data, copy it to eBay, and then try to dropship by sourcing the products from your FBA inventory. People pay big money for nice product photos and then somebody just comes and takes them as their own.
> People pay big money for nice product photos and then somebody just comes and takes them as their own
This raises an interesting question: if someone had a product on Amazon and had product photos they took, does Amazon still allow other sellers to use the same listing? In other words, does the seller agreement allow Amazon to reuse your (potentially expensive to produce) product photos on your competitor's product listing?
Yes, they allow it. It often leads to "product listing hijacking": another seller sends a counterfeit/similar product to Amazon using the UPC of your (private label) product. Product & review pages are then shared between all these different products. AFAIK the only way to combat this is to purchase the other product, document how it differs from yours, report it all to Amazon, and hope they act before too much damage is done to your listing.
[1] https://github.com/yoavaviram/python-amazon-simple-product-a...