Doesn't that require you to have a quota of affiliate sales to keep using it? I can't find where they state this requirement, but I remember they were very sneaky about disclosing it. If you don't have any affiliate sales after X months, your API key will stop working.
Currently you have to be a member of their affiliate program to get API access. To become a full "member" you have to be a prospect who generates three referral sales (iirc) within a 30-day period. So once you're in you have the API, but getting in isn't as easy as filling out a form. From there you can get your API rate limits increased from the default of one call per second up to ten, based on your prior 30-day affiliate sales.
Scraping Amazon is fun and all, but when you start overdoing it they rate-limit your IP and show you my worst nightmare: the Dogs of Amazon (a 500 error page with pictures of dogs).
Why do I know this? Because I'm the CTO at Nazdeeq.com where we let users buy Amazon products from countries where they don't ship easily, like Pakistan.
Edit: totally open to partnerships in more countries
I'm from Brazil and what you said made me curious. Not sure why, but Amazon never caught on here. How did you solve problems like logistics and generating interest from the public?
I'm sorry, I have trouble understanding your question, but if you mean how we ship from Amazon to Pakistan and how we got people to use our service: we worked out a pipeline to get products from Amazon in the US to Pakistan, plus advertising + word of mouth. Also:
+ There's no direct way to buy 90% of products from Amazon since they don't ship to Pakistan
+ Our service is the only one in the country that gives a fixed price at checkout in PKR
+ Our customer service is excellent
+ We're one of the cheapest options available, as long as the competition imports products legally.
I don't know Nazdeeq, but the parent claims they deal with importing, and last I checked neither MyUS nor any other random package forwarder deals with customs, which can be a real PITA. (I have some experience with importing things into ... legally interesting ... countries: from 1990 until I left in 2006, I, as an individual, imported a lot of computers and parts into Hungary. It was ... fun.)
Hey Irfan! I'm sorry you feel that way, but as someone else here pointed out, we handle everything from clearing customs to last-mile delivery (to your doorstep). That's why our pricing may seem expensive: we declare all costs upfront.
A lot of our customers have similar horror stories where their goods get stuck in customs because they didn't realize clearing is a thing.
Hi Amin, your platform seems nice. Just wanted to give you a heads-up that your website is being classified as ["phishing" by Avast](https://i.imgur.com/SmuuRfD.png). I think if you replace "Amazon" in the url with something else it should work fine. Best of luck!
Reminds me of how nobody could see one of my users' avatars because the URL (a hash) started with an "ad" segment (for bucketing), as in "/avatars/ad/ad3adb33f". So adblockers blocked it.
My protest against such a ridiculous heuristic was to not fix it.
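For what it's worth, a minimal sketch of the kind of path-based heuristic that bites here (the pattern is illustrative, not taken from any specific filter list):

```python
import re

# Many filter lists block any URL whose path contains an "/ad/" segment.
AD_SEGMENT = re.compile(r"/ad/", re.IGNORECASE)

urls = [
    "/avatars/ad/ad3adb33f",   # hash bucket that happens to start with "ad"
    "/avatars/7f/7f00c0ffee",  # unaffected bucket
]
for url in urls:
    print(url, "-> blocked" if AD_SEGMENT.search(url) else "-> loads fine")
```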
It makes sense why you would choose to do that and I can certainly empathize, but in the interest of user experience I try to fix these problems; my customers deserve a good experience.
In the Philippines there's something quite similar called Galleon. They've recently been acquired, but I think they might be open to partnering. They've expanded to Thailand, if I'm not mistaken.
The issue with those tools is that Amazon changes the product page layout very often and runs heavy A/B testing. I've even heard that computer vision is the most stable way to scrape Amazon. I guess this library will stop working rather soon.
Are you able to share some details? How often did you have to get new IP addresses? What about user agents? Were the scrapers "straight to the point" like amazon2csv (i.e. make a request directly to the search page), or did they have randomized behavior (e.g. re-use sessions from time to time, click a random link on the page, start from the homepage...)? Did the scrapers ever go against amz's robots.txt directives (e.g. interacting with the cart page)? Ever heard from amz itself about your employer's activities on their site?
Same here. Scraping their search results page is easy if you have a bunch of IPs. No manipulation or workarounds needed (i.e. headless browsers, or making sure your HTTP headers look like a real user's).
I have not scraped a ton of actual individual product pages though, so I can't speak to that. I do remember it might have been harder.
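A minimal sketch of that "straight to the point" style, assuming plain requests and a pool of your own proxy endpoints (the proxy addresses are placeholders):

```python
import random
import requests

# Placeholder proxy pool; in practice these would be proxies you control.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]

def fetch_search_page(keyword: str) -> str:
    """Fetch one search results page through a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    resp = requests.get(
        "https://www.amazon.com/s",
        params={"k": keyword},
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text
```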
> I guess this library will stop working rather soon.
Don’t really see that as a dealbreaker. So the library will need maintenance. Normal for libraries to need updates in order to keep up with changes. It works today, and it will work whenever it’s updated. Better than nothing and for many use cases that’s good enough.
Search results scraping on Amazon is fairly stable.
What's more difficult is product page scraping, because there you have hundreds of different variations. Some from A/B testing and a lot just being specific things that show up for certain product categories (e.g. video games).
I remember trying to build a scraper for Amazon. I quickly discovered that there are many types of item pages, and they change over time too, probably from A/B testing. Just getting the price of a product out of their HTML markup reliably was a nightmare; I had to build a huge tree of if-this-then-maybe-that logic.
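In practice that tree of maybes tends to look something like the sketch below; the selectors are examples of locations Amazon has used for prices at various times, not an exhaustive or current list:

```python
from bs4 import BeautifulSoup

# Price locations seen on different page variants / product categories.
PRICE_SELECTORS = [
    "#priceblock_ourprice",
    "#priceblock_dealprice",
    "span.a-price span.a-offscreen",
]

def extract_price(html: str):
    """Try each known price location in turn; return None if all fail."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None
```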
We brand it as an ordering API, but we also offer product data retrieval (item details/pricing). We put a LOT of engineering resources into data quality and maintenance, as the API is core to our flagship product, PriceYak. If you have questions or want a token, email adam@zinc.io and mention this post.
If you're using this for anything serious, it's probably better to sign up for the Keepa API at about $50/month; they scrape Amazon for you. Worth it to not have to deal with the complexities yourself.
2. Isn't one of the points of using Session() that it'll persist stuff like cookies and headers? So why is it re-defining the headers multiple times? (e.g. both GET and POST in the same session have their own respective but identical headers).
4. Using raw list indices without some kind of helper function to catch index and other errors when parsing is not really a good idea in scraping (e.g. `selection[0].text.strip()`).
1. It just does `return Session()`; that's easy enough to find out[1].
2. It doesn't really matter, and maybe it's so they are kept close to where they are modified. Session merges the headers you provide per call with its own (yours take precedence), and it still handles cookies if you provide your own headers. Session also has the benefit of connection pooling, so it's quicker to do more than one request with it[2] (the plain get, post, etc. functions in the requests module all go through the request function in the end, which constructs a new Session for that single request). A small sketch of this merging behavior follows at the end of this comment.
3. What's wrong there? It's just a default argument. Strings aren't mutable, so it avoids the mutable-default pitfall[4]. Is the " quote a problem here? That's a matter of taste/style. PEP 8 is silent on it[3] and just says to pick a convention, and there seems to be one here. Some people (me included) also use single quotes for non-human-readable strings and double quotes for human-readable strings.
4. If you mean here[5], then there is a len check just above it to catch the 'expected' error (a missing element); only the .text part is unchecked. As for the general lack of checks: I don't put them into my GreaseMonkey or one-off Python scraping code either. The site layout is an invariant of a given version of a scraper script, so if some field is missing, the website layout must have changed and the entire script probably needs to be reworked (or the field is simply not always present, in which case the script is useless for that particular case anyway), and it might as well crash (or, if someone else is using it, they can catch the exception). Either way (crash or catch), when something you expected to be there is missing, the results are not coming or might be wrong. The code as it stands anticipates that the element might be missing, but assumes that if it is present, it has the expected field. If the site has been observed to always behave like that (a certain element might be missing, but when it's there it surely has that field), then the script just works that way and guards against the first expected possibility (missing element) but not the second (missing field), since if the way the site is laid out or how data is stored in elements changed significantly, the script needs changes anyway or it risks producing bad or incomplete output (arguably a worse default than a loud failure; it also depends on what you're doing and what the scrape is for).
I'd assume most users and programmers would rather get an error than have the script return an empty list (despite there being content on the page) just because the layout changed. The only other solution (other than returning a wrong result by design and hiding the errors, or logging them somewhere no one reads anyway) would be to catch such exceptions somewhere high up and either wrap them in a new exception thrown with more information (what URL, what the element content was exactly, the entire response text, etc.), which is probably too much work for such a one-off script, or throw your own exceptions from some low place. But that's vanity, because Python exceptions already point very clearly to where they were thrown and with what call stack, so it's just as clear what broke without adding lots of checks yourself and throwing an "element X is missing field Y that should always be there" message.
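A minimal sketch of the Session behavior described in point 2, assuming plain requests against a placeholder URL (none of this is taken from the library under discussion):

```python
import requests

# A Session persists cookies and pools connections across requests.
session = requests.Session()
session.headers.update({"User-Agent": "example-agent/1.0"})

# Per-call headers are merged with the session's own headers; on a key
# conflict the per-call value wins, and cookie handling still applies.
resp = session.get(
    "https://httpbin.org/headers",
    headers={"Accept-Language": "en-US"},
)
print(resp.json())

# By contrast, the top-level requests.get() constructs a throwaway Session
# internally, so repeated calls don't reuse the pooled connection.
resp2 = requests.get("https://httpbin.org/headers")
print(resp2.status_code)
```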
1. I never realized there was a function that just returned an instance of the class. Should've just looked it up myself.
2. I was wrong and misread the header stuff.
3. There's nothing wrong with it. It's just a convention I'm not accustomed to seeing or using. Admittedly, there are lots of ways to skin a cat with optional and default args.
4. Yeah I understand what you're saying. I guess it's a fine "greasemonkey" approach. Just more susceptible to DOM changes and code errors than I'm comfy with even for a rudimentary implementation.
I agree you'd usually rather get an error than an empty list, but I think it's a little more complicated than just choosing between the two. Sometimes you want the error but don't get one, which is why I tend to write more code around checking things and catching exceptions. The best example: you might not see an index error, but the list item returned for your index isn't actually what you wanted, because the DOM changed or because you wrote code against one page and it broke on another page you thought would be identical.
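For what it's worth, a minimal sketch of the kind of defensive helper meant here (the names are illustrative, not from the library being reviewed):

```python
from typing import Optional

def first_text(selection) -> Optional[str]:
    """Return the stripped text of the first element, or None.

    Guards against both the missing-element case and the case where the
    element exists but isn't what was expected (e.g. has no text), so the
    caller can decide whether to raise, log, or skip.
    """
    if not selection:
        return None  # element missing: the layout probably changed
    text = getattr(selection[0], "text", None)
    if not text or not text.strip():
        return None  # element present but not what we expected
    return text.strip()
```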
I know this is HN and all but I am not even entirely confident about my own remarks. I asked "Am I the only one..." earnestly, not as a way of softening criticism. I'm a self-taught amateur and have never submitted a PR before.
It is also illegal to scrape Amazon, since if you scrape it, you don't own that content and you are just stealing product data added to the site by the products' proper owners.
> Scraping publicly available information is far from illegal.
The scraping itself may not be (although I'm pretty sure here in Belgium there is a law against collecting other people's data), but what you do with it may not be legal.
You could make a case for making any kind of profit generated from scraping data illegal. Don't get me wrong, I love scraping things myself.
I also find it amazing that there are companies out there like Crawlera that can do serious scraping work and openly flaunt deploying tech to get around whatever scraping blockers are out there.
LinkedIn had multiple layers of scraping detection systems deployed, and went to significant efforts to block their data from getting scraped[1].
Last year, they were ordered by a Federal court explicitly to allow scraping of content and remove systems that were designed to impede and block scraping efforts[2].
There's no clear law (in the US) directly aimed at scraping, and repurposing anti-hacking laws runs into the murky definition of what counts as unauthorized access. If a judge explicitly clarifies that scraping is not unauthorized access (which happened in this case, although it still needs to stand up on appeal[3]), then entities interested in preventing scraping have lost one of their core legal underpinnings. It demonstrates why companies like Crawlera have been able to flaunt the kind of serious scraping work they do: it's hard to bully people with a legal argument that has been debunked and affirmed as debunked on appeal. So it's better to avoid the risk of setting that precedent entirely until you can't avoid it.
Check the Amazon API T&C; also, try to do the same with Craigslist and see how long they will let you do it. Scraping data is always a shady business if you do it without the permission of the content owner.
It is anything but shady. They can send you a C&D or file a suit and seek an injunction, but there is no way they can get you in trouble with the law for scraping publicly available data.
E.g. people take product data, copy it to eBay, and then try to dropship by sourcing the products from your FBA inventory. People pay big money for nice product photos and then somebody just comes and takes them as their own.
> People pay big money for nice product photos and then somebody just comes and takes them as their own
This raises an interesting question: if someone had a product on Amazon and had product photos they took, does Amazon still allow other sellers to use the same listing? In other words, does the seller agreement allow Amazon to reuse your (potentially expensive to produce) product photos on your competitor's product listing?
Yes, they allow it. It often leads to "product listing hijacking": another seller sends a counterfeit/similar product to Amazon using the UPC of your (private label) product. Product & review pages are then shared between all these different products. AFAIK the only way to combat this is to purchase the other product, document how it differs from yours, report it all to Amazon, and hope they act before too much damage is done to your listing.
[1] https://github.com/yoavaviram/python-amazon-simple-product-a...