Hacker News new | past | comments | ask | show | jobs | submit login

Am I the only one who thinks this is rather weird, or at least unconventional code for a scraper in Python?



I just took a glance, but nothing seemed too off. Do you care to elaborate?


Sure. I'm not really trying to criticize the code, it's just that a lot of this looks foreign and unconventional to me.

1. requests.Session() is a class. IDK what request.session() invokes (see https://github.com/tducret/amazon-scraper-python/blob/master...).

2. Isn't one of the points of using Session() that it'll persist stuff like cookies and headers? So why is it re-defining the headers multiple times? (e.g. both GET and POST in the same session have their own respective but identical headers).

3. Is the use of `arg=""` idiomatic? For example in https://github.com/tducret/amazon-scraper-python/blob/master...

4. Using raw list indices without some kind of helper function to catch index and other errors when parsing is not really a good idea in scraping (e.g. `selection[0].text.strip()`.


1. It just does return Session(), that's easy enough to find out[1].

2. It doesn't really matter and maybe it's so they are kept closer since they are modified, Session merges your call provided and its own headers (yours take precedence) and it still handles the cookies if you provide own headers. Session also has the benefit of connection pooling so it's quicker to do more than one request with it[2] (normal get, post, etc. in requests module go through request function in the end which actually constructs a Session for that single request).

3. What's wrong there? It's just a default argument. Strings aren't mutable so it avoids this pitfall[4]. Is the " quote a problem here? It's a matter of taste/style. PEP8 is silent on it[3] and just say to pick a convention and there seems to be one here. Some people (me too) also use single quotes for non-human readable strings and double quotes for human readable strings.

4. If you mean here[5] then there is a len just above it to catch the 'expected' error/missing element, just the .text part is unchecked. As for the general lack of checks - I don't put them into my GreaseMonkey or random Python one off scraping code either. Site layout is invariant of a certain version of a scraper script so if some field is missing or something like that then the website layout must have changed and the entire script probably needs to be reworked (or the field is not always present there in the first place so the script is also useless in that particular scraping case) and might as well crash (or if its used by someone they can catch the exception). Either way (crash or catch) when something you expected to surely be there is missing the results are not coming or might be wrong. That code as it is now anticipates that there might not be such an element but if there is it must have the expected field. If the site has been observed to always work like that (certain element might be missing but surely has that field when its there) then script just works like that and guards against the first expected possibility (missing element) but not the second (missing field) since if how site is laid out, how data is stored in elements, etc. changed significantly, then the script also needs changes or risks producing bad or incomplete output (arguably worse as a default than a loud failure would be, it also depends on what you're doing and what the scrape is for).

I'd assume most users and programmers would rather get an error than have script return an empty list (despite there being content up there) just because the layout changed. The only other solution (other than return a wrong result by design and hide the errors or log them somewhere where no one cares to read anyway) would be to catch such exceptions somewhere high and either pack them into a new exception that is thrown with more information (what URL, what element content was exactly, entire response text, etc.) but that's probably too much care/work for such a one off script OR throw your entirely own ones from some low place, but it's vanity then because Python exceptions point really strongly to where they were thrown and in what call stack so it's just as clear what was broken without the need to add lots of checks yourself and throw a "element X is missing field Y that should always be there" message.

[1] - https://github.com/requests/requests/blob/master/requests/se...

[2] - http://docs.python-requests.org/en/master/user/advanced/#ses...

[3] - https://www.python.org/dev/peps/pep-0008/#string-quotes

[4] - http://www.effbot.org/zone/default-values.htm

[5] - https://github.com/tducret/amazon-scraper-python/blob/master...


Thanks for humoring my post with a real answer.

1. I never realized there was a function that just returned an instance of the class. Should've just looked it up myself.

2. I was wrong and misread the header stuff.

3. There's nothing wrong with it. It's just a convention I'm not accustomed to seeing or using. Admittedly, there are lots of ways to skin a cat with optional and default args.

4. Yeah I understand what you're saying. I guess it's a fine "greasemonkey" approach. Just more susceptible to DOM changes and code errors than I'm comfy with even for a rudimentary implementation.

I agree you'd usually want to get an error than an empty list but I think it's a little more complicated than just whether you want an error or an empty list. Sometimes you want the error but don't get it, which is why I tend to write more code around checking stuff and catching exceptions. I think the best example is you might not see an index error but the list item that's returned for your specified index isn't actually what you wanted because the DOM changed or you wrote code against one page that broke on another one you thought would be identical.


Its a good thing its open source, submit a PR!


I know this is HN and all but I am not even entirely confident about my own remarks. I asked "Am I the only one..." earnestly, not as a way of softening criticism. I'm a self-taught amateur and have never submitted a PR before.


If it works...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: