@ the OP - not to sound hostile, but you write code (like in the example here [1]) that is bound to be slow, just from a glance at it. vstacking, munging with pandas indices (and pandas in general), etc; in order for it to be fast, you want pure numpy, with as little allocations happening as possible. I help my coworkers “make things faster” with snippets like this all the time.
If you provide me with a self-contained code example (with data required to run it) that is “too slow”, I’d be willing to try and optimise it to support my point above.
Also, have you tried Numba? It may be a matter of just applying a “@jit” decorator and restructuring your code a bit, in which case it may get magically boosted a few hundred times in speed.
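For the curious, a toy illustration of the kind of numeric loop @jit handles well (not the OP's code, just a sketch):

    import numpy as np
    from numba import jit

    @jit(nopython=True)   # compiled to machine code on the first call
    def sum_of_squares(x):
        total = 0.0
        for i in range(x.shape[0]):
            total += x[i] * x[i]
        return total

    sum_of_squares(np.random.rand(10_000_000))   # first call compiles, later calls run near C speed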
It's not so easy to post the data to reproduce a real use-case as it's a few Terabytes :)
Here's some simple code that is incredibly slow in Python:
    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for line in open('data.txt'):
        id, val = line.split('\t')
        if id in interesting:
            total += int(val)
This is not unlike a lot of code I write, actually.
I've also found that loops with dictionary (or set) lookups are a pain point in python performance. However, this example strikes me as a pretty-obvious pandas use-case:
    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for c in chunks:  # I'm too lazy to actually write the chunking
        df = pd.read_csv('data.txt', sep='\t', skiprows=c.start, nrows=c.length, names=['id', 'val'])
        total += df['val'][df['id'].isin(interesting)].sum()
I'm not exactly sure, but I'm pretty sure that isin() doesn't use Python set lookups but some kind of internal implementation, and is thus really fast. I'd be quite surprised if disk IO wasn't the bottleneck in the above example.
`isin` is worse in terms of performance as it does linear iteration of the array.
Reading in chunks is not bad (and you can just use `chunksize=...` as a parameter to `read_csv`), but pandas `read_csv` is not so efficient either. Furthermore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` is still pretty slow.
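For reference, the chunked variant being discussed would look roughly like this (a sketch, assuming the same two-column tab-separated file as above, and still not fast per the point above):

    import pandas as pd

    interesting = set(line.strip() for line in open('interesting.txt'))

    total = 0
    # chunksize=... makes read_csv return an iterator of DataFrames
    for chunk in pd.read_csv('data.txt', sep='\t', names=['id', 'val'], chunksize=1_000_000):
        total += chunk['val'][chunk['id'].map(interesting.__contains__)].sum()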
Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fide performance bug.
In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.
Ok, I said I wasn't sure about the implementation, so I looked it up. In fact `isin` uses either hash tables or np.in1d (for larger sets, since according to pandas authors it is faster after a certain threshold). See https://github.com/pandas-dev/pandas/blob/master/pandas/core...
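A rough micro-benchmark one could run to compare the two paths (synthetic data; numbers will vary by pandas version and data shape):

    import timeit
    import numpy as np
    import pandas as pd

    ids = pd.Series(np.random.randint(0, 10_000_000, size=1_000_000).astype(str))
    interesting = set(ids.sample(1_000))

    print(timeit.timeit(lambda: ids.isin(interesting), number=10))               # pandas' hash-table / in1d path
    print(timeit.timeit(lambda: ids.map(interesting.__contains__), number=10))   # pure-Python set lookups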
Could you give a hint of what the data ("sample1", "sample2") looks like, or how to randomly generate it in order to benchmark it sensibly? I guess these are similarly-indexed float64 series where the index may contain duplicates? Maybe you could share a chunk of data (as input to genetic_distance() function) as an example if it's not too proprietary and if it's sufficient to run a micro benchmark.
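E.g. would something like this be representative (pure guesswork on my side, just so there's something to time)?

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    index = rng.integers(0, 50_000, size=100_000)          # positions; duplicates allowed
    sample1 = pd.Series(rng.random(100_000), index=index)  # two float64 series sharing the same index
    sample2 = pd.Series(rng.random(100_000), index=index)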
There's also code in genetic_distance() function that IIUC is meant to handle the case when sample1 and sample2 are not similarly-indexed, however (a) you essentially never use it, since you only pass sample1 and sample2 that are columns of the same dataframe (what's the point then?), and (b) your code would actually throw an exception if you tried doing that.
P.S. I like the part where you've removed the comment "note that this is a slow computation" :)
The speed could possibly be improved by using map. Also, not related to speed if this is all of the code, but it might matter in larger programs: you should make sure your file pointers are closed. Something like:
    with open('interesting.txt') as interesting_file:
        interesting = {line.strip() for line in interesting_file}
    with open('data.txt') as data_file:
        total = sum(int(val) for id, val in map(lambda line: line.split('\t'), data_file) if id in interesting)
Have you tried using Cython to compile code like the above? Python's sets / maps / data reading etc. should be fairly optimised, so Cython might let you bypass boxing the counter variables, instead using native C ints or whatever.
Also, if the data you're reading is numeric only - or at least non-unicode / character data - you might be able to get a speed boost reading the data as binary not as python text strings.
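E.g. something along these lines (a quick sketch, assuming the ids are plain ASCII):

    # Read everything as bytes so no unicode decoding happens; the set must then hold bytes too.
    interesting = set(line.strip() for line in open('interesting.txt', 'rb'))
    total = 0
    with open('data.txt', 'rb') as data_file:
        for line in data_file:
            id_, val = line.split(b'\t')
            if id_ in interesting:
                total += int(val)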
Numba does not support dictionaries and has limited support for pandas dataframes (only the underlying arrays, when convertible to NumPy buffers, if I understand correctly). This limits its usefulness for many non-array situations, as well as for some existing code bases (the dictionary is fundamental in Python and typically used everywhere -- often for performance).
[1] https://git.embl.de/costea/metaSNV/blob/master/metaSNV_post....