Parallelism in one line (medium.com/building-things-on-the-intern...)
142 points by adamnemecek on Jan 26, 2014 | 31 comments



The trouble with Pool.map is that the function you pass in has to be picklable, i.e. defined at the top level in the module you are currently in.

So you can't use it in interactive sessions:

  In [1]: import multiprocessing
  In [2]: pool = multiprocessing.Pool(8)
  In [3]: def hello(x):
  ....:     return x+1
  ....:
  In [4]: stuff = range(10)
  In [5]: pool.map(hello, stuff)
  # cryptic errors
You also can't use it with lambdas or functions defined within functions; suppose I put this in test.py:

  import multiprocessing
  def run_stuff():
      pool = multiprocessing.Pool(8)
      def hello(x):
          return x + 1
      stuff = range(10)
      return pool.map(hello, stuff)

  print(run_stuff())
... I get the same errors.

This is sad because pool.map seems to be faster than anything I have written myself which might replace it without the above limitations. Unless someone out there has done better than me? :)


It works on the REPL as long as you create the pool after the function has been defined:

    >>> import multiprocessing
    >>> def hello(x):
    ...   return x + 1
    ... 
    >>> pool = multiprocessing.Pool(8)
    >>> pool.map(hello, range(10))
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
It looks like the pool takes a peek at the module during instantiation. Maybe it could be worked around by having a lazy pool that only inspects things when you call map.


(Firstly, holy moly, something I wrote is on the front page of Hacker News!? I couldn't figure out why I was suddenly getting twitter messages on my otherwise dormant account.)

Yeah, the requirement that things be picklable in order to send to another process is one of the worst parts of Python in my opinion. I can write the bulk of my code without running into the limitation, but when I do hit it? Holy crap. Rage.

The only workaround I've found, which isn't really a workaround at all, is throwing out the idea of small, side-effect-free functions. By which I mean: give the process everything it needs to fetch the resources on "its" side (thus avoiding the pickling), and then let it process stuff until it's gotten back down to some serializable base object that it can finally return. Not ideal, but that's how I generally work around it.
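A rough sketch of what that pattern can look like (process_path and the file names are made up for illustration; the point is that the worker opens its own resource and returns only a plain, picklable value):

    from multiprocessing import Pool

    def process_path(path):
        # Open the (unpicklable) resource on the worker's side...
        with open(path) as f:
            data = f.read()
        # ...and hand back only a plain, picklable result.
        return len(data)

    if __name__ == "__main__":
        pool = Pool(4)
        try:
            print(pool.map(process_path, ["a.txt", "b.txt", "c.txt"]))
        finally:
            pool.close()
            pool.join()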

There are also `multiprocessing.Array` and `Value`, which avoid the pickle requirement completely since they're shared memory, but they limit you to very basic datatypes. Really only good if you're operating on some homogeneous array (which for me isn't too often).
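A minimal sketch of the shared-memory route, assuming a simple numeric workload (square three doubles in place and accumulate their sum):

    from multiprocessing import Process, Array, Value

    def square_all(arr, total):
        for i in range(len(arr)):
            arr[i] = arr[i] * arr[i]
        total.value = sum(arr)

    if __name__ == "__main__":
        shared = Array('d', [1.0, 2.0, 3.0])   # 'd' is the C double typecode
        acc = Value('d', 0.0)
        p = Process(target=square_all, args=(shared, acc))
        p.start()
        p.join()
        print(list(shared), acc.value)          # [1.0, 4.0, 9.0] 14.0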


Define the function you're passing to the pool before you create the pool, and make sure you use an 'if __name__ == "__main__": main()' line at the end if you're writing a script, since the module needs to be importable.

If functions within functions aren't supported, I don't see that as a problem: the thing with multiprocessing is that you need to be conscious of what you're passing between processes, so explicitly passing everything as arguments or through an initializer is the way to go.
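A sketch of that structure, with per-worker configuration passed through an initializer (init_worker and hello are placeholder names):

    from multiprocessing import Pool

    _config = None

    def init_worker(config):
        # Runs once in each worker process; stash whatever the workers need.
        global _config
        _config = config

    def hello(x):
        return x + _config["offset"]

    def main():
        pool = Pool(4, initializer=init_worker, initargs=({"offset": 1},))
        try:
            print(pool.map(hello, range(10)))
        finally:
            pool.close()
            pool.join()

    if __name__ == "__main__":
        main()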


A solution to the irritating pickling behaviour of multiprocessing.Pool is to use pathos.multiprocessing: https://github.com/uqfoundation/pathos
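For reference, usage looks roughly like this (based on the pathos README; treat the exact import path as an assumption rather than something verified here). Because it serializes with dill instead of pickle, lambdas and interactively defined functions work:

    from pathos.multiprocessing import ProcessingPool as Pool

    print(Pool(4).map(lambda x: x + 1, range(10)))
    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]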


I realize this is useful for CPU-bound computations; but for network-bound work, here is a REAL two-liner that will instantly make your program concurrent (not parallel):

    from gevent import monkey
    monkey.patch_all()
This is a lot lighter than processes. You can combine the two: use gevent to handle the network, and then multiprocessing to handle the CPU stuff!
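A minimal sketch of the network half (the URLs are placeholders; once monkey.patch_all() has run, the blocking urlopen calls yield to each other and overlap in time):

    from gevent import monkey
    monkey.patch_all()

    import gevent
    from urllib.request import urlopen   # Python 3; urllib2 on Python 2

    def fetch(url):
        return urlopen(url).read()

    urls = ['http://example.com', 'http://example.org']
    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs)
    print([len(job.value) for job in jobs])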


There's even a gevented version of requests called grequests.

The only thing about gevent is that the docs kind of suck, but its APIs are pretty intuitive and the code itself is well documented.


I've heard that using gevent with multiprocessing can cause problems, is that still the case?


I've used gevent in multiprocess situations, and it worked just fine. I did rig up my own gevent-based IPC, though. Just create a socketpair before forking, and attach some JSON-RPC (or protobuf/msgpack/etc) endpoints after.
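A rough sketch of that setup, with the JSON-RPC layer reduced to newline-delimited JSON for brevity (Unix-only, and the protocol is purely illustrative):

    import json
    import os
    import socket

    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()

    if pid == 0:
        # Child: answer requests until the parent closes its end.
        parent_sock.close()
        f = child_sock.makefile('rw')
        for line in f:
            req = json.loads(line)
            f.write(json.dumps({'result': req['x'] + 1}) + '\n')
            f.flush()
        os._exit(0)
    else:
        # Parent: one request/response round trip, then clean up.
        child_sock.close()
        f = parent_sock.makefile('rw')
        f.write(json.dumps({'x': 41}) + '\n')
        f.flush()
        print(json.loads(f.readline()))   # {'result': 42}
        f.close()
        parent_sock.close()
        os.waitpid(pid, 0)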


You don't want multiprocessing to fork your process after gevent is loaded. If you can avoid that (like hold off on monkey patching) then it works great.


The reddit comments[1] on this article are very interesting as well. In particular, /u/driftingdev suggested it would be better to use concurrent.futures. To quote him/her:

In the last example:

    pool = Pool()    
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()
would be 2 fewer lines with concurrent.futures:

    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(create_thumbnail, images)
[1] https://pay.reddit.com/r/Python/comments/1tyzw3/parallelism_...


I see a gigantic difference in execution time between the two. Also, you can get the same behavior by importing contextlib.closing:

   from multiprocessing import Pool
   from contextlib import closing
   with closing(Pool(15)) as p:
       p.map(f, myiterable)
takes <30 sec vs

   from concurrent.futures import ThreadPoolExecutor as Pool
   with Pool(15) as p:
       p.map(f, myiterable)
that takes > 1min.

whereas ProcessPoolExecutor hangs, and I've no clue why.


What I don't understand is that neither exceptions nor segfaults are handled gracefully when using multiprocessing. The former do not have the full traceback, while the latter aren't even detected and cause a deadlock. This requires running the code again in single processing mode to figure out where the bug is. The new concurrent.futures module has the same problem.


If you follow the monadic route and return a Maybe...

Stop. Let me reformulate. If you follow the standard AJAX route and return either a {'success': True, 'payload': ...} or {'success': False, 'error': ...} from your workers, your error handling will be fine.
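A minimal sketch of that convention, wrapping the real work in a try/except so the pool only ever ships plain dicts back:

    import traceback
    from multiprocessing import Pool

    def safe_worker(x):
        try:
            return {'success': True, 'payload': 10 / x}
        except Exception as e:
            return {'success': False, 'error': repr(e),
                    'traceback': traceback.format_exc()}

    if __name__ == "__main__":
        pool = Pool(4)
        try:
            for result in pool.map(safe_worker, [1, 2, 0, 4]):
                print(result)
        finally:
            pool.close()
            pool.join()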


That would not give me a traceback, and it would not catch segfaults or ctrl-C. It's a different style of programming that would work around part of the issue (only the exceptions, and only by checking for errors manually), whereas what perplexes me is that multiprocessing and concurrent.futures are not designed to handle these error conditions themselves.


Why, it would give you a traceback if your worker process collects it, as a reasonable worker would. Since workers are probably non-interactive, they can't exactly receive a Ctrl+C. The dispatcher process, OTOH, will. I suppose a pool should kill all processes on SystemExit.

As you think about more of the details, you grow a more and more sophisticated solution that best suits your needs. But the point of the post is different: it's a dead simple way to start. Starting is often the hardest part.


Collecting tracebacks is the sort of plumbing you would expect a 'batteries included' library to do. I didn't expect the workers to receive the Ctrl-C; I'd expect the main process to terminate along with its children. None of that happens.

I think it's clear that these sorts of details belong in the library. Everybody could hack together their own traceback collector, segfault handlers etc., but it's difficult to get right and in a scripting language you should be able to forget about such things.


Check out the combo of sys.exc_info and traceback.format_tb; I've used these two functions to generate stack traces for failed worker jobs in the past with pretty good success.

edit: I guess I didn't really answer the segfault issue. Beyond checking the return code of the child process to see if it segfaulted (so you can log it or retry it), I'm not sure how you could get more informative failure info.
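A hedged sketch of that approach: format the stack inside the worker and re-raise it as a plain string, which pickles cleanly back to the parent:

    import sys
    import traceback
    from multiprocessing import Pool

    def worker(x):
        try:
            return 1.0 / x
        except Exception:
            exc_type, exc_value, tb = sys.exc_info()
            # The formatted stack is just a string, so it pickles fine.
            raise RuntimeError("%s: %s\n%s" % (
                exc_type.__name__, exc_value,
                "".join(traceback.format_tb(tb))))

    if __name__ == "__main__":
        pool = Pool(2)
        try:
            pool.map(worker, [1, 0])
        except RuntimeError as err:
            print(err)   # includes the worker-side stack as text
        finally:
            pool.close()
            pool.join()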


Indeed, you can pass the traceback manually as a string. With the segfaults you can use the 'faulthandler' module to dump information about them, and I see that starting with Python 3.3 they correctly detect child processes that have died: http://hg.python.org/cpython/rev/6d6099f7fe89
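For the segfault case, a minimal sketch: enable faulthandler (in the stdlib since Python 3.3) in each worker via the pool's initializer, so a crashing worker at least dumps its Python stack to stderr (the touchy function is a stand-in for a call into a C extension):

    import faulthandler
    from multiprocessing import Pool

    def init_worker():
        # Dump the Python traceback to stderr on SIGSEGV, SIGFPE, etc.
        faulthandler.enable()

    def touchy(x):
        return x * 2   # imagine a C extension call that might crash

    if __name__ == "__main__":
        pool = Pool(4, initializer=init_worker)
        try:
            print(pool.map(touchy, range(5)))
        finally:
            pool.close()
            pool.join()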


See concurrent.futures for a thread-pooled map:

http://www.reddit.com/r/Python/comments/1w6h6g/parallelism_i...


One liner in Java 8:

    List<String> contents = Arrays.stream(urls).parallel().map(UrlLoader::getContent).collect(Collectors.toList());
but as far as I know the API doesn't provide any way to get a String from a URL, so you need a helper class that also works around checked exceptions ...

    public class UrlLoader {
      public static String getContent(String url) {
        try {
          return new BufferedReader(new InputStreamReader(new URL(url).openConnection().getInputStream())).lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
          throw new IOError(e);
        }
      }
    }
so not a real one liner :(


Does anyone seriously think that the difficulty with parallelizing code is the amount of boilerplate? The real problem is synchronizing multiple threads; being too lenient means deadlocks; being too strict means underutilization. The map() function doesn't really help with either of those.

In addition, it looks like you are unfairly picking on Java. The producer consumer model presented isn't anywhere near what a competent programmer would do. Building your consumer from within a method called Producer? Calling isinstance (or instanceof) to check if the "poison pill" is put in the queue? These are the signs of a crappy programmer, not a crappy language.


If you do a lot of data munging, then boilerplate is definitely an annoying obstacle. Plenty of problems are "embarrassingly parallel", and the map function may be fine for those cases.


I wasn't trying to solve the problem of parallel processing, nor was I suggesting that its core problem is boilerplate. The point of the article was simply trying to show a part of the language that I think is neat and elegant to people who may not be familiar with it. As I mention in the opening paragraph, the top tutorial hits on Google make no mention of the map function.

Secondly, I wasn't really "picking" on Java. A one-line comment about Java liking classes is pretty far from being "unfair," methinks. Again, in the context of the article, I was simply attempting (though I may have failed) to use it as a (hopefully mild-chuckle-worthy) example of the different ways things can be done in a language.

Finally, I'm not entirely sure why you're complaining about stripped-down example code... It's example code, man.


Just as an example from julia...

http://docs.julialang.org/en/latest/manual/parallel-computin...

1) Put the addresses of your remote workers (via ssh) in a file, and point Julia at the file.

2) Julia connects to remote workers on each host.

3) Use one of the convenient macros or functions to run the code. This evenly distributes a map-reduce that throws 200 million coins and counts heads:

    nheads = @parallel (+) for i = 1:200000000
        int(randbool())
    end

4) You can also do all of the pool stuff, manual message passing, etc, if you need to.


Sadly, it's very easy to leave zombie processes. Usually I put the Pool in a with statement, but it's not bulletproof.

But yeah, multiprocessing.Pool's map, imap, and imap_unordered are neat.


I've never used map for parallelization in Python, but in .NET I used PLINQ. It's fantastic for exploiting data parallelism and very easy to use (or refactor to from sequential code).


People that don't understand the problem pretending their incantations work.

In the words of a great man, it's a jolly hard problem.


Can you clarify what you mean, please?


Looks like a relatively decent implementation of .NET's Parallel extensions.


tl;dr: if you run operations in parallel, they go faster, and if you put this code in a library, you can parallelize stuff in one line of code.



