A look at some of Python's useful itertools

wting · on May 25, 2013

My intention is not to be snarky, but people post all the time about discovering the itertools or collections library. I notice it's a common gap in newer Python programmers.

Save yourself time and effort down the road and read through both libraries' documentation; they're well worth the effort:

http://docs.python.org/3.3/library/itertools.html

http://docs.python.org/3.3/library/collections.html

I tend to use defaultdict, deque (thread safe), namedtuple, imap, izip, drop/takewhile. In Python 3, map and zip have been replaced with their itertools equivalents.
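For anyone who hasn't met these yet, a minimal sketch of the tools named above, assuming Python 3 (where `imap` and `izip` are simply the builtin `map` and `zip`). The sample data is made up for illustration.

```python
from collections import defaultdict, deque, namedtuple
from itertools import dropwhile, takewhile

# defaultdict: group words by first letter without checking for missing keys
by_letter = defaultdict(list)
for word in ["apple", "avocado", "banana"]:
    by_letter[word[0]].append(word)

# deque: a bounded buffer that keeps only the most recent items
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)

# namedtuple: a lightweight record with named fields
Point = namedtuple("Point", ["x", "y"])
origin = Point(x=0, y=0)

# takewhile/dropwhile: split a sequence at the first item failing the test
numbers = [1, 2, 3, 10, 1, 2]
head = list(takewhile(lambda n: n < 5, numbers))
tail = list(dropwhile(lambda n: n < 5, numbers))

# In Python 3 the builtin zip is lazy and returns an iterator
pairs = zip(numbers, numbers[1:])
```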

I blame Haskell for all the lazy evaluation influence. :P

naiquevin · on May 25, 2013

Author here. I completely agree with you. Could have known all this before had I just read the itertools docs beyond permutations and combinations. It was only after using Scala's groupby while solving an assignment problem that I thought about finding out an equivalent in python. And thanks to having worked with the Stream class from Scala collections, for the first time the rest of the itertools also made sense. Better late than never I guess :-)

Aqueous · on May 25, 2013

Is there any reason that you don't switch over to Scala for your main programming language?

I'm in the process of converting a middle-sized PHP codebase completely over to Scala. It is not uncommon for me to see functions that run between 15 and 30 lines shrink to 3- or 4-line functions thanks to the Collections API. On top of that, its performance is something you'll never see in PHP or Python. I love Python for its readability and wonderful syntax, but Scala is starting to have an even greater pull on me for its built-in concision and the ability to use the entire Java ecosystem without submitting to Java's imperative style.

naiquevin · on May 25, 2013

It's very tempting. Frankly, I had enrolled for the course mainly for functional programming and not Scala but towards the end I started liking it. But, at the moment my main languages at work are Python, Erlang and Javascript and I still have a long way to go with these so planning to stick to them for a while.

pjmlp · on May 25, 2013

>My intention is not to be snarky, but people post all the time about discovering the itertools or collections library. I notice it's a common gap in newer Python programmers.

Not only in Python, but programming languages in general.

I still find people writing Java or .NET code who aren't aware of all the nice classes that are part of the runtime and who end up creating their own half-baked solutions to their problems.

Nowadays developers seem to code without reading.

Silhouette · on May 25, 2013

> Nowadays developers seem to code without reading.

When your standard library documentation is so vast that it would take weeks to read and understand it all, and you'd never remember most of it anyway without context and experience using it, I don't think "coding without reading" is really a fair complaint.

We as an industry need to get better at documentation, and in particular about separating tutorial/overview documentation that presents a map and summary of what's available from reference documentation, or we're going to keep reinventing wheels like this.

Python is a particularly unfortunate example, because while its documentation is vast, it has very little tutorial/overview material beyond the very basics. For example, given that a substantial proportion of Python's standard library actually doesn't work very well in practice, it would be helpful to have a deeper tutorial/map document somewhere that introduced the various areas of the standard library and that also promoted the good ones and suggested popular alternatives for the not so good ones where they exist.

Too · on May 25, 2013

Documentation discoverability is one problem. Willingness to learn and trust is another one. People simply want to use things that they themselves have proven to work before.

As an example, an old colleague wanted to dump some data from Python to a csv file and did this by for-looping through each row and each item and concatenating each cell and a semicolon to a string. Even after I pointed out to him that Python already has a built-in csv writer that handles all the issues of escaping etc., he didn't want to use it because he didn't know what it did and didn't want to learn anything new. His version didn't even do escaping inside the for-loop and he didn't see the issue of not doing it. To him the for-loop gave exactly the same result and didn't require any learning and was thus better, and why change something that works... My last suggestion was to at least use ";".join(...) but it was also a bit too magic so he stuck to his well known for-loop.
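To make the escaping issue concrete, here is a small sketch assuming Python 3 and made-up sample rows. It compares the hand-rolled join with `csv.writer` using a semicolon delimiter.

```python
import csv
import io

# Made-up rows; one cell contains the delimiter, another contains quotes
rows = [["id", "comment"], [1, "contains; a semicolon"], [2, 'has "quotes"']]

# Hand-rolled approach: the embedded ';' silently creates an extra column
naive = "\n".join(";".join(str(cell) for cell in row) for row in rows)

# csv.writer quotes cells that contain the delimiter or the quote character
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerows(rows)
escaped = buffer.getvalue()

# Reading the escaped text back recovers the original cells (as strings)
parsed = list(csv.reader(io.StringIO(escaped), delimiter=";"))
```

Splitting the naive output on ";" gives three cells for the second row instead of two, while the csv round trip keeps two.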

Usually standard libraries are quite reliable but in some cases, and especially if adding third party libraries, bugs and performance issues inside the library can really give you hell. If the library is supposed to just perform a simple task maybe you would rather implement it yourself as you then also have influence to fix those issues yourself later. Experiences like this can scare you away from even the most reliable libraries in the future.

Silhouette · on May 25, 2013

> Usually standard libraries are quite reliable but in some cases, and especially if adding third party libraries, bugs and performance issues inside the library can really give you hell.

I think part of the problem is that the statement above is maybe not as true as it used to be.

Let's stick with Python as an example, though it's far from the only culprit so I hesitate to single it out here. I have a growing list of areas of the standard library that today I just assume won't work acceptably. I have tried to use them before, and I have found them to be either bug-ridden or not robustly portable or so slow as not to be worthwhile or missing enough basic functionality that you need to add something else anyway or just write everything from scratch. The everyday stuff in Python is pretty good, the basic data structures and common supporting functions like itertools, but when you start getting into the less common areas I have a very low opinion of the design and quality of the Python standard library, and that opinion is born of direct personal experience.

On top of the quality and robustness, there's also usability to consider. Even if some of Python's built-in libraries do work, there might be much neater, easier ways to achieve the same result that are only a `pip install` away. Libraries like Kenneth Reitz's Requests come immediately to mind; if I were teaching a newbie to program Python tomorrow, somehow I doubt urllib[N] would feature much.

I'm not sure how that hypothetical newbie is supposed to discover these things today without someone experienced to guide them, though. Whether it's Python and PyPI or Perl and CPAN or C++ and Boost or whatever other language and library repository you like, there's a lot of collective wisdom about the easiest/safest/fastest ways to get things done, but it lives in the combined experience of veterans rather than in comprehensive tutorials to follow once you've got the basics down. And that's only when there is already a recognisable place to look for general use third party libraries, not even considering all the third party libraries that might be out there but for whatever reason aren't incorporated into any de facto standard repository to make discovery (relatively) easy if you at least know what you're looking for.

Is it any wonder that newbies reinvent wheels under these conditions? It seems almost inevitable to me.

acjohnson55 · on May 26, 2013

I haven't come away with the same impression of the Python standard library. Besides urllib, what are the biggest offenders in your mind?

Silhouette · on May 26, 2013

From a few recent projects:

The subprocess system is fairly awful in both usability and portability.

The shutil filesystem tools had bugs and documentation issues the only time I ever tried to use them.

The various compression libraries had horrible performance problems last time I tried them; shelling out to various command-line equivalents was around 4-5x faster.

The command-line parsing tools are OK if you want to write a *nix-style command line tool, but not quite flexible enough for more advanced/customised uses.

I have yet to discover any decent GUI library for Python, standard or otherwise, so I'm not sure whether this one counts.

Logging is flexible but can be awkward to configure, particularly across an application that wants various logging itself but also uses libraries that offer to log.
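On the logging point, one common arrangement (a sketch only, with "mylib" and "other" as made-up logger names) is for the library to attach nothing but a `NullHandler` and for the application to configure output once and then opt in per logger.

```python
import logging

# Library side ("mylib" is a made-up name): log, but install no real handlers
lib_log = logging.getLogger("mylib")
lib_log.addHandler(logging.NullHandler())

# Application side: configure output once, then tune individual loggers
logging.basicConfig(level=logging.WARNING,
                    format="%(name)s %(levelname)s %(message)s")
logging.getLogger("mylib").setLevel(logging.DEBUG)

lib_log.debug("emitted because the application opted in for mylib")
logging.getLogger("other").debug("dropped: 'other' inherits the WARNING level")
```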

acjohnson55 · on May 26, 2013

Yeah, I definitely see your point. I haven't used all of those things extensively, but I've got to really agree with you on subprocess and GUI stuff. I'd love to be able to write Python instead of Bash for pretty much all scripting, but working with and connecting subprocesses is a massive pain. I think logging and command line parsing work well, but each seem probably less intuitive than they have to be. But I think the library is pretty awesome for the most part.

SEJeff · on May 26, 2013

You can build a CLI parser exactly like the one git uses (positional and short/long) using argparse. What is difficult about that? optparse perhaps, but if you're talking about argparse, it seems like you're just whining. The rest of your comments I (overall) agree with.
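A minimal sketch of what such a git-style parser might look like, assuming Python 3. The program name "tool" and the "commit" subcommand with its options are made up for illustration.

```python
import argparse

# "tool" and its "commit" subcommand are hypothetical names
parser = argparse.ArgumentParser(prog="tool")
parser.add_argument("-v", "--verbose", action="store_true")
subparsers = parser.add_subparsers(dest="command")

# Each subcommand gets its own options and positionals
commit = subparsers.add_parser("commit")
commit.add_argument("-m", "--message", required=True)
commit.add_argument("paths", nargs="*")

# Parse an explicit argument list instead of sys.argv for the demo
args = parser.parse_args(["-v", "commit", "-m", "fix typo", "a.py", "b.py"])
```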

Silhouette · on May 27, 2013

> You can build a cli parser exactly like git uses (positional and short/long) using argparse.

But what if I want something that isn't like Git? I'm slightly amused that anyone would suggest Git as some sort of example of a good CLI, but in any case, not all platforms share the command line conventions of *nix shells.

Suppose I'm running on Windows (where options conventionally start with '/') and I don't want all the magic that argparse does with initial '-' characters. If I set prefix_chars to '/', does that also disable the '--' pseudo-argument? We were originally talking about documentation, and as far as I'm aware, the documentation for argparse doesn't actually specify this either way.

Suppose I want to have a set of basic choices, each setting a flag to say it's there. What if I also want some shortcut choices that represent combinations of the basic ones and set all of the corresponding flags? As far as I'm aware, you can't quite do this with any of the standard actions, so you have to start writing an entire new class to define a custom action instead. At least you can do that, but what was wrong with accepting a simple function, and where does anything say how argparse.Action is actually defined and why it's necessary instead?
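For reference, the custom-action route for the shortcut case might look roughly like this in Python 3. `SetFlags` and its `flags` keyword are hypothetical names invented for this sketch, not part of argparse.

```python
import argparse

class SetFlags(argparse.Action):
    """Hypothetical action that switches on several boolean flags at once."""

    def __init__(self, option_strings, dest, flags=(), **kwargs):
        # nargs=0: the option takes no value, like store_true
        super().__init__(option_strings, dest, nargs=0, **kwargs)
        self.flags = tuple(flags)

    def __call__(self, parser, namespace, values, option_string=None):
        for flag in self.flags:
            setattr(namespace, flag, True)

parser = argparse.ArgumentParser()
parser.add_argument("--warnings", action="store_true")
parser.add_argument("--errors", action="store_true")
# Shortcut for both basic flags; SUPPRESS avoids a stray "all" attribute
parser.add_argument("--all", action=SetFlags, flags=("warnings", "errors"),
                    default=argparse.SUPPRESS)

args = parser.parse_args(["--all"])
```

Extra keyword arguments to `add_argument` are passed through to the action's constructor, which is how `flags` arrives here. It works, but it is indeed a whole class where a plain function might have done.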

Suppose I want to present the same data as the automatic help option, but reformat it in some completely different way that makes more sense for my program before it gets printed? There are assorted functions to display or return formatted help strings, but nothing seems to just give back a neat bundle of the relevant information for further processing. Collecting the data and rendering it for output are conflated.

Argparse, like much of the Python standard library, has a lot of power as long as you want to do things exactly its way, but it's not designed in a way that is particularly easy to extend. IMHO, a better strategy for designing standard libraries for languages is to create templates/frameworks/whatever you want to call them, and then to provide some specific implementations for basic cases. This way, when inevitably someone needs to go beyond the out-of-the-box functionality, they can still fit in with established conventions instead of starting over from scratch, which is generally better both for compatibility and for minimising the amount of extra logic that must be built on top of the tried and tested standard library. Of course you do have to be careful not to go too far and make simple cases look artificially complicated, but no-one ever said designing good APIs was easy. :-)

MostAwesomeDude · on May 26, 2013

asyncore is almost certainly the worst module in the standard library.

acjohnson55 · on May 26, 2013

That just sounds like laziness as a coder to me. Would he also output to JSON from scratch?

I taught high school for a while, and I had kids that refused to stop using their fingers for addition, regardless of the fact that it was preventing them from learning how to do more abstract math. What you're describing is pretty much the same attitude.

brandnewthrow · on May 26, 2013

I've taken math classes up to Linear Algebra, (so basically what was required for a CS degree) and I still count on my fingers sometimes. In fact I think that math got more intuitive and "mentally pliable" the more abstract it got, but for some reason I'm still pretty hopeless with arithmetic. I also have trouble with telling right from left. Is it really the case that using fingers for arithmetic can hold a person back from learning higher math? Not trying to be snarky, genuinely curious.

acjohnson55 · on May 26, 2013

I'd say the real problem is not the tool itself (finger counting, in this case) but a stubborn reliance on that tool even when it obviously can't scale. Some of these students were more concerned with just executing the algorithm than understanding the process of addition, and consequently, some struggled with multi-digit addition, negative numbers, fractions, and so on. They coped with their inability to solve more complicated problems simply by not ever attempting to do the work. The attitude was "I don't get it immediately so I'm not even going to attempt to understand it".

I'm really not trying to bash the students. As a teacher, my job was to invest the students in wanting to learn, and I admittedly wasn't always effective.

pjmlp · on May 25, 2013

I consider it a fair complaint, because young developers seem not willing to learn.

I am old enough to remember the days when the only way to learn how to program was to go through, sometimes very dry, books and manuals. There was no Internet in those days.

Young developers seem like spoiled kids that want to do something right away, without setting aside the time to learn how to do it properly.

Silhouette · on May 25, 2013

> I am old enough to remember the days the only way to learn how to program was to go through, sometimes very dry, books and manuals. There was no Internet on those days.

Join the club. We're getting T-shirts made. :-)

The thing is, in those days we really could learn all the commands of an operating system shell by reading the manual cover to cover in an afternoon, or play with graphics demos or write low-level system utilities after reading the Pink Shirt Book.

Today's systems are so vast and complicated that anything offering similar coverage in book form would be the size of an encyclopaedia, so the way we were able to learn doesn't scale to modern needs.

The trend over the years has definitely been towards writing glue code and joining up ready-made components for a lot of professional work rather than reinventing things from scratch, and in some ways that's no bad thing. However, I think it only works if you know what you've got available in your toolbox, and so does being the person who understands and creates new components. Either way, it comes back to needing a way to navigate the vast amounts of information now available and pick out the bits you need to achieve whatever it is that you're trying to do.

pjmlp · on May 26, 2013

Size M please. :)

d0mine · on May 26, 2013

Have you read the tutorial http://docs.python.org/tut ?

Silhouette · on May 26, 2013

Sure. It's a textbook example of the problem I'm talking about, in fact.

naiquevin · on May 26, 2013

> Nowadays developers seem to code without reading.

As someone who started programming 4 years ago (which from the conversation seems to be pretty much "nowadays" :-)), I think it is a bit of a generalization. I am not saying it's not true. There are certainly places such as StackOverflow, mailing lists etc. that attract newbies early on because they provide quick answers or even code, but at some point every developer who is serious about computer programming as a long-term profession does need to start reading the docs. There is no other alternative and one eventually comes to realize that it's much faster than arbitrarily hunting for code and asking questions on mailing lists and IRC.

It also depends upon the style of programming language (imperative/functional) and the previous experience of the developer IMO. For example, I find myself reading the docs significantly more in Erlang/Scala than in Python, and more in Python than in PHP/JS. That is also the reverse of the order in which I learnt these languages. Of course that is my personal experience.

shavenwarthog2 · on May 25, 2013

Seconded. I use itertools constantly for data-wrangling on network server and web applications. Specifically:

- processing files with imap and ifilter to rapidly grab data, find a subset of it, then process it with a function

- defaultdict(list) is incredibly useful for collecting data, arranging it by a certain key (like date or object id), then collecting into a list

- namedtuple is occasionally useful for efficiently stuffing data into an object with a few named attributes.
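A rough sketch of the pattern in the list above, assuming Python 3 (where the builtin `map` and `filter` are lazy like `imap` and `ifilter`). The log lines and the `Request` record are made up for illustration.

```python
from collections import defaultdict, namedtuple

# Made-up log lines standing in for an open file object
lines = [
    "2013-05-25 GET /index",
    "2013-05-25 POST /login",
    "# comment line",
    "2013-05-26 GET /about",
]

# Hypothetical record type with a few named attributes
Request = namedtuple("Request", ["date", "method", "path"])

# Lazily drop comment lines, then lazily parse what is left
data_lines = filter(lambda line: not line.startswith("#"), lines)
requests = map(lambda line: Request(*line.split()), data_lines)

# Collect the parsed records into lists keyed by date
by_date = defaultdict(list)
for request in requests:
    by_date[request.date].append(request)
```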

masklinn · on May 25, 2013

    def flatmap(f, items):
        return itertools.chain(*map(f, items))

1. in Python 2 `map` is eager which — as with the previous `even` filter — may lead to unnecessary work if you only need part of the list (or a dead process if the input is infinite...). itertools.imap (or a generator comprehension) would be better. This is "fixed" in Python 3 (where the `map` builtin has become lazy and `itertools.imap` has been removed) but

2. it's being eagerly unpacked through *, itertools.chain also provides a from_iterable method which doesn't have that issue (and can be used to flatten infinite streams), introduced in 2.6

So `flatmap` would probably be better as:

    import itertools

    def flatmap(f, items):
        # Python 2: imap is lazy, and from_iterable avoids eager * unpacking
        return itertools.chain.from_iterable(
            itertools.imap(f, items))
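For Python 3 readers, where `itertools.imap` no longer exists, the same idea can be sketched with the builtin `map`. The `islice` call is only there to demonstrate that an infinite input is handled lazily.

```python
from itertools import chain, count, islice

def flatmap(f, items):
    # map is lazy in Python 3, and from_iterable consumes its argument lazily
    return chain.from_iterable(map(f, items))

# count(1) is infinite; only the first six results are ever computed
first_six = list(islice(flatmap(lambda n: (n, -n), count(1)), 6))
```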

naiquevin · on May 25, 2013

Thanks for the corrections. I have made an edit (although not sure how long it will take to clear the github-pages cache)

serjeem · on May 25, 2013

I wrote my favorite function ever last semester with itertools! It (roughly) lazily generates a list of dictionaries that map players to their moves for all possible moves. It turns out you can do that with a chain of combinations, two cartesian products, and an imap: https://github.com/shargoj/acquire/blob/master/gametree.py#L...
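The linked code isn't reproduced here, so the following is not that function. It is only a simplified Python 3 sketch of the general idea: lazily yielding one player-to-move dictionary per combination using a cartesian product. The name `all_move_assignments` and the sample data are made up.

```python
from itertools import product

def all_move_assignments(moves_by_player):
    """Lazily yield one {player: move} dict per possible combination."""
    players = list(moves_by_player)
    # product walks every combination of one move per player
    for combo in product(*(moves_by_player[p] for p in players)):
        yield dict(zip(players, combo))

# Hypothetical players and moves
options = {"alice": ["buy", "pass"], "bob": ["sell"]}
assignments = list(all_move_assignments(options))
```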

davvolun · on June 3, 2013

On the other hand, I suspect some early programmers might get ahold of this and perform a lot of premature optimizations. A piece of code that runs 20 loops instead of 8 once every couple of hours probably doesn't need to be optimized. A piece of code that does two checks when one would suffice that runs 1000 times every second might need optimization. Profile first, then optimize.
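As a sketch of measuring before optimizing, `timeit` can compare two candidates directly. The two functions below are made-up stand-ins for real code, and the timings will vary by machine, so none are shown.

```python
import timeit

def with_loop(n=1000):
    # Explicit loop version
    total = 0
    for i in range(n):
        total += i * i
    return total

def with_sum(n=1000):
    # Generator-expression version of the same computation
    return sum(i * i for i in range(n))

# Time each candidate over the same number of runs
loop_seconds = timeit.timeit(with_loop, number=200)
sum_seconds = timeit.timeit(with_sum, number=200)
print(f"loop: {loop_seconds:.4f}s  sum: {sum_seconds:.4f}s")
```

For whole programs rather than small snippets, the standard library's `cProfile` module is the usual next step.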