The article doesn't mention a hidden gem in Python's standard library - typed arrays!
They live in the `array` module and are a very barebones version of numpy's ndarray - just one (explicit) dimension and no overloaded operators. But if you just want to keep a bunch of numbers in a contiguous array, they can save you tons of memory.
(I know the purpose of the article is to describe a more complex data structure, but arrays can still get you very far.)
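A minimal sketch of the difference (64-bit CPython; exact byte counts vary by build):

```python
import array
import sys

# A list stores one pointer per element, and each int is a separate boxed object.
nums_list = list(range(1_000_000))

# array.array stores raw machine values in a single contiguous buffer.
nums_array = array.array("i", range(1_000_000))  # "i" = signed C int

print(sys.getsizeof(nums_list))   # ~8 MB of pointers (the int objects are extra)
print(sys.getsizeof(nums_array))  # ~4 MB: 4 bytes per element, no boxing
```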
That's how you deal with that problem in Java as well: arrays of primitives. If you need to operate on millions of points, you can either use Point[], which will incur something like 16 MB of additional memory, or use two int[] arrays, which won't incur any extra overhead beyond a few bytes. Your code won't be pretty, but it'll be fast, and you can always hide the weirdness behind a pretty API.
"Struct of Arrays", to be contrasted with "Array of Structs", because in Java it's actually "Array of pointers to Struct". In languages where "Array of Structs" is actually possible, the decision of which to use is less clear-cut and depends on how big the Struct is, what the access patterns are like, and whether you're trying to perform SIMD operations on multiple Structs at once.
I was bored and had time on my hands, so I played around a bit with some of the suggestions in this thread, but it seems as though a naive ctypes approach is the worst possible option in the two cases I tested (arrays of integers, arrays of x,y structs).
Is there an approach that you use that produces better results than this? Because my naive approach is dramatically worse both in performance (not recorded) and in its impact on Python's memory usage.
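For context, here is roughly what I mean by the naive approach; this is my own sketch, not code from the article:

```python
import ctypes

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_int)]

# Naive: a Python list of individually wrapped structs. Each element is a
# full Python object around an 8-byte payload, so wrapper overhead dominates.
points = [Point(i, i * 2) for i in range(1_000_000)]

# A single contiguous C array avoids the per-element wrapper objects:
packed = (Point * 1_000_000)()
```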
Depends on the algorithms that use that code. Two separate arrays is not a bad way to represent a vector-of-tuples. And so if you think of everything as a vectorised operation (and if that makes sense for your use case) then the code can be clean enough.
Efficiency-wise it depends on how the locality of reference falls out. There are cases where it's more cache-efficient to store the pairs next to each other, but again for big vectorised operations you might lose nothing by using two cache-ways instead of one.
Or you can just use a more memory-efficient language like Java. Python and JS use tons of memory compared to statically typed VM languages like C#, Go, and Java. People lament Java's memory usage, but Python/Node use several times more.
>People lament Java memory usage but Python/Node use several times more
The people who lament Java's memory usage never complain about Python or Ruby because... if you're worried about the JVM you would never consider Python!
I'm mostly referring to those that say the JVM is too "resource heavy" then go on to use JS/Python/Ruby instead, which have much higher runtime overhead.
In my experience many devs think the scripting-language heritage means these languages have smaller/lighter runtimes.
Not that I want to go down the rabbit hole, but I probably fall into the category of “JVM is heavy, but I use Python”
Sure, memory consumption and CPU perf are both pretty bad in Python, but latency and memory footprint of the runtime itself are pretty good, so it’s ideal for tooling, crons, lambdas etc. JVM is comparatively efficient once you have the JVM ready, but that sure takes tons of resources.
I’m hoping Graal changes this. I don’t like JVM-based languages, but I do like technological progress.
Python startup time can be significant for non-trivial programs as well, and it's been a big problem for Mercurial and other projects. For example, see these threads:
If your script is expected to be run interactively on a frequent basis, then Go, Rust, C++, or even Bash (for simple stuff) will give you much lower user-perceived latency than Python.
It's fairly easy to use Class Data Sharing (CDS) and include an archive in Java 12 where startup time is important. This reduces start time significantly.
The JVM starts really quickly (~100 ms on my machine) and doesn't use many resources as long as your app is small. That's... uncommon in Java land, though. Even simple apps pull in Guava/Apache Commons and a few client libraries. This can easily be thousands of classes. Nobody thinks about it because the runtime cost of loading shitloads of code is so low. But you can improve this a ton by using ProGuard and stripping out stuff you don't need.
If I can do it in numpy, I can probably get a lighter, faster implementation with Python than in Java. More generally, if I can do it with a Python package that's actually a fairly thin wrapper around a C, C++ or Fortran library, then Python also has a decent chance of being the easy winner.
If none of those situations apply, then yeah, typically Java ends up being more efficient.
I have seen completely the opposite in the wild. In fact, I've only ever seen it go in the opposite direction (Java consuming far more memory than Python), so much so that I have to ask what on Earth you are doing with Python to have found yourself in such a position.
(I've never successfully used Node so I have no idea what that's like in memory use.)
Until you end up rediscovering the information storage skills used by databases and lower level programs: buffers with fixed sizes and implicit type associations.
I would say that Python is the right tool for some things, and not for others. Why not use Go, Nim, or C to store/access/return these data in a structure that converts the values from/to Python?
Consider the approach from the article with `dataobject` from the `recordclass` library as the base object in the class definition. This seems to produce a smaller memory footprint than even PyPy.
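For anyone curious, a sketch of what that looks like; recordclass is a third-party library (`pip install recordclass`) and the exact API may vary by version:

```python
from recordclass import dataobject  # third-party, not in the standard library

class Point(dataobject):
    x: int
    y: int

p = Point(1, 2)  # no __dict__ and no __weakref__, so smaller than a regular instance
```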
Any type that you have "a lot" of is a bad candidate for a class/object. Your OO program should typically never instantiate thousands of heap-allocated objects at once.
A triangle mesh is a good object candidate. One triangle or vertex is not. A device, a render system API handle and an image is a good candidate, one pixel is not, and so on.
So don't use a lot of objects. In C# you use a struct and arrays of them to avoid creating a heap object per array entry. In Java you have to resort to SoA; in Python it's the same. Just because you have objects doesn't mean everything should be an object. Even languages that follow an "everything is an object" design usually have an escape hatch such as plain value arrays.
It's not about using objects or not; it's how you compose them. A C# array of integers is an object. Each integer in that array is also an object (though granted, it's a special case). NumPy arrays and JavaScript typed arrays as well.
What’s relevant for performance is whether an array of 1000 “things” requires 1 or 1001 allocated objects, whether accessing thing N in the array requires dereferencing one or two pointers, and whether each item has a storage of 32 bits without overhead for headers. Which ones to call “objects” is semantics.
For efficient access, the array must be a consecutive array of primitives. This is the case in both C# and Java for integers. In C# it’s also the case for an array of Vector2 with 2 primitives each, which isn’t the case in Java.
My point is this: avoid heap-allocating many things in collections. They must be raw (primitive, consecutive) data without per-instance overhead and of course without heap-alloc/GC cost.
While everyone haggles about the internals, I have an interesting anecdote about the costs of tools and libraries.
A small service that landed in my lap needed to read (only) a data source that was roughly 10k lines of YAML. No way that was going to be in any way efficient, so I asked for suggestions. All the workaday devs (I am not one) instantly said the same thing without a single real thought about it: make it a database, duh!
Long story slightly less long: loading up the libraries to interface with a database ate between 2 and 3 times the memory (depending on the DB and lib) that simply loading the entire 10k-line YAML file ate, offered slower performance, and required more code.
SQLite was pretty darn close, but in the interest of saving developers from themselves vis-à-vis parameterized queries, or the need for queries altogether for that matter, it increased the required code for zero benefit.
The service still hums along with a 10k line yaml in memory. "Worse is better" indeed.
Well, SQLite might have been more scalable if you needed to deal with much larger YAML files. I've been in a similar situation with XML. It's fine to hold in memory until you end up running out of memory, at which point you need to look at other approaches. There's definitely an overhead to changing, but it might be worth paying if the tradeoffs make sense.
However, "better" is dependent on the situation. If the data is unlikely to change, stay the course. If not, then while a database might be more maintainable over the long term, even if not as efficient.
In the pure Python cases, the 8 bytes per attribute are just pointers. The x, y and z are themselves full-blown objects with all the extra memory overhead that comes with that, and this is not counted in the article. For example, an int object uses 28 bytes, so three of them already use up more than each of the described container objects.
The Cython and NumPy cases store the actual data directly, and this has the larger effect in reducing memory.
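You can check this yourself (numbers from 64-bit CPython; other builds differ):

```python
import sys

print(sys.getsizeof(12345))  # 28 bytes for a small int object
print(sys.getsizeof(1.5))    # 24 bytes for a float
# Three boxed ints cost 3 * 28 = 84 bytes on top of the
# 3 * 8 = 24 bytes of pointers counted in the article.
```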
Slightly pedantic correction: This is a performance optimization in CPython. I wouldn't be surprised if other implementations have something similar, but to my best knowledge, this behaviour isn't part of the standard.
The standard is documented in PEPs[1]. CPython is the flagship implementation, and many PEPs discuss it directly, but if you're looking to write another Python, that's where you'd go.
As many Python workloads are object-oriented business logic, and objects are mostly pointers, setting up an x32 user space "halved" the memory usage. It also made execution faster because of better CPU cache utilisation.
Sadly, x32 was "very custom" and very hard to support. Last I heard x32 is being phased out from Linux kernel.
Another option is to use the standard 64-bit ABI but store object pointers compressed into 32 bits when on the heap. This lets you address perhaps 32 GB with a 32-bit value.
Java for example, can use "compressed OOPs" -- ordinary object pointers. This is a 32-bit stored pointer but shifted, as pointers to objects always point to something on multiple of 8 bytes. So the shifted address is stored, but then shifted left before being dereferenced, allowing addressing of 32GB memory with 32-bit references.
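The arithmetic, sketched in Python for clarity (a zero-based heap is assumed for simplicity):

```python
HEAP_BASE = 0  # assumed base address of the managed heap

def compress(addr):
    assert (addr - HEAP_BASE) % 8 == 0  # objects are 8-byte aligned
    return (addr - HEAP_BASE) >> 3      # fits in 32 bits for heaps up to 32 GB

def decompress(ref):
    return HEAP_BASE + (ref << 3)

print(compress(32 * 2**30 - 8))  # 4294967295, the last 32-bit-reachable object
```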
There are variations on this method that can be applied to other memory-management pools. For example, let's say we are allocating many objects for processing, need them only within a certain span of time, and have to free all of them when it's over. Rather than individually tracking allocations, you can allocate a single, perhaps growing, memory area, store the start offset somewhere, and reference objects with a shorter reference relative to that area.
Do you need to address every byte in your memory? If all your Python objects are at least two words in size, then no, you only need to address every 16 bytes or so. So shift your whole pointer right a bit and you can address more than 4 GB of objects in a 32-bit pointer.
> If there has to be base + offset translation on every pointer access it is way too slow.
It does do this, but it's not too slow - the overhead of the translation is lower than the benefit of reduced memory transfer, increased cache space, etc. Obviously - otherwise people wouldn't be doing it.
Wow, this is really amazing. I've always felt like 64-bit pointers were a huge waste of memory, since many small tools need well below 4 GB. Cache is always going to be relatively small (unless semiconductor/transistor fabrication costs drop dramatically), and smaller pointers result in better cache usage and consequently fewer cache misses.
> Sadly, x32 was "very custom" and very hard to support. Last I heard x32 is being phased out from Linux kernel.
This is really sad to hear. Do you happen to know of any kernel discussion threads on this besides this one[1]? (I'd love to chime in, in support of the X32 ABI.) I'm a little surprised honestly, as Linus has mentioned multiple times how important it is to him that the user-facing ABI remain stable.
I am mostly a C/C++ programmer. Is there good documentation out there for Python development when I care about performance and memory footprint, as a book or something that anyone can recommend? It was fun reading this, but I'd like to know more.
I'm a Python "snob", going on 24 years and If I ever "I care about performance and memory footprint" I'd not use Python.
Python is good enough 90%. You get faster 2x 10x, etc code by picking better algorythms or solutions. In Python, you should not be caring about 10% or 30% speed improvement. It's not worth it, It's not Python's strength.
Or go for [1]PyPy or [2]GraalPython. I know that equivalent code on Graal's Java EE produces assembly/performance equivalent to C compiled with current GCC, which really shows that switching to a JIT can pay off. The Python in GraalVM is in its early stages, but it shows that the old lesson of "just write the hot spot in C" is no longer true. Chris Seaton shows that for Ruby/C the switch between the Ruby interpreter and C is expensive. I think the same is true for Python-interpreter/C switches, something that GraalVM + Python|Ruby can optimize out.
LuaJIT has great performance through the C FFI. It’s more that traditional JITs can’t optimize across FFI boundaries. Hopefully Chris will correct me if I’m full of shit.
There are other approaches which can work too, like compiling C to LLVM IR using Clang so a function can be inlined by an LLVM based JIT at runtime.
Like with JS, you might not have a choice. I mostly use Python for extending programs that use it as an embedded scripting language. Performance is still important.
High Performance Python by Ian Ozsvald and Micha Gorelick - fairly outdated at this point, but still quite relevant (I believe the authors are working on a new edition).
I don't think so. This is quoted from Ian's latest mailing list [0] posting:
Second Edition of High Performance Python in the works
I'm very pleased to say that Micha and I have started work on a Second Edition of High Performance Python with O'Reilly, planned for early 2020. This book will use Python 3.7 and will add tools that barely existed 4 years ago including Dask and probably some Tensorflow, amongst lots of other goodies. I'll let you know how the book progresses via this list. So far I've updated the Profiling chapter, added some advice on 'being a highly performant developer' and have rebuilt the Cython/Numba/PyPy code for the Compiling chapter.
Not always necessary. I do a lot of ETL type work in python, and generally, the idea is to never pull everything into memory at once if you don't need to. This means leveraging generators quite a bit. Read a row, process a row, generate a row, pass it along the pipeline. I've written scripts that can process a 1k line csv file with the same memory footprint as a 10M line csv.
If you have to read everything into memory, because you're doing some sort of transforms like a list to a dict or such, explicitly deleting variables can help, as well, rather than waiting for them to go out of scope, but this should be pretty rare.
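A minimal sketch of that pipeline style; the file paths and column names here are made up:

```python
import csv

def read_rows(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)  # yields one row at a time, never the whole file

def add_total(rows):
    for row in rows:
        row["total"] = float(row["price"]) * int(row["qty"])  # hypothetical columns
        yield row

def write_rows(rows, path):
    with open(path, "w", newline="") as f:
        writer = None
        for row in rows:
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=row.keys())
                writer.writeheader()
            writer.writerow(row)

# Memory stays flat whether in.csv has 1k or 10M lines.
write_rows(add_total(read_rows("in.csv")), "out.csv")
```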
See Mike Mueller's tutorials from PyCon at https://www.youtube.com/watch?v=DGrS0uwMuHY. That's the 2018 one, but I'm guessing there is a similar one. There always seems to be a high perf tutorial of some kind.
>A significant reduction in the size of a class instance in RAM is achieved by eliminating __dict__ and __weakref__. This is possible with the help of a "trick" with __slots__:
What is the downside to the __slots__ trick? In what cases do you need the dict and weakref?
One big upside of using __slots__, which the docs tell you not to use it for, is that it will throw an error if you try to access or assign to a member that is not one of the slots. Very handy for catching bugs.
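For example:

```python
class Point:
    __slots__ = ("x", "y")  # no per-instance __dict__ or __weakref__

    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
p.z = 3  # AttributeError: 'Point' object has no attribute 'z'
```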
It’s not all roses! In my experience at three separate organizations, Flask apps tend to become horrific to maintain beyond a certain size, unless the original developers really know what they’re doing and are disciplined about it.
A lot of it is due to the standard patterns for small Flask apps (e.g. instantiating your app directly at the module level rather than in a factory function, putting all of your business logic into ORM classes) causing increased complexity and unintended side effects down the line. Flask apps also struggle quite a bit with performance, and the only way to do things asynchronously is with something like celery.
For Python, aiohttp is also a really interesting web framework, and the asynchronicity should be easy to grok for JS devs.
Personally, as someone who has worked with Python/Flask professionally for years, I’d pick TypeScript and any JS web framework any day!
Also, slots somewhat contribute to obfuscation, since slotted objects don't work in exactly the same way as regular objects, so developers may run into weird stack traces they might not immediately recognize.
>> developers may run into weird stack traces they might not immediately recognize
Truth.
To me, anyway, debugging code is massively harder than writing it in the first place. Nothing worse than spending two hours writing some code and then four hours fishing out some odd issue that plagues you.
I have written code like that in the past, and would consider doing so again in the future. When it's appropriate, it can be hugely beneficial.
When I last did it, I was wrapping a C++ API that I needed to compare against another dataset for a merge. The C++ API (which, admittedly, I also wrote), didn't have equality or hash operators defined on the Python objects. So, I monkey-patched them in for the keys I needed. It was actually the most elegant solution I could come up with as I could then naturally use the objects in sets and easily get differences to update the appropriate dataset.
As an aside, when I wrote the Python wrapper around my C++ API, I purposely didn't define equals and hash operators for the objects, despite having full information to do so on the natural keys, because I wanted the flexibility to do the monkey-patching and override how the objects were compared depending upon circumstance.
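The pattern looked something like this; a simplified sketch, since the real class was a generated C++ wrapper:

```python
class WrappedRecord:  # stands in for the generated C++ wrapper class
    def __init__(self, key, payload):
        self.key = key
        self.payload = payload

def patch_identity(cls, *fields):
    # Monkey-patch equality and hashing onto cls for the given key fields.
    cls.__eq__ = lambda a, b: all(getattr(a, f) == getattr(b, f) for f in fields)
    cls.__hash__ = lambda a: hash(tuple(getattr(a, f) for f in fields))

patch_identity(WrappedRecord, "key")
old = {WrappedRecord(1, "a"), WrappedRecord(2, "b")}
new = {WrappedRecord(2, "b2"), WrappedRecord(3, "c")}
stale = old - new  # set difference now compares on the natural key: {key=1}
```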
A downside of __slots__ is that, the way they are set up, you can only have slots come from one branch of an inheritance hierarchy, so they don't work out very well for fancy OO interface/mixin-type stuff.
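Concretely, this fails at class-creation time in CPython:

```python
class A:
    __slots__ = ("a",)

class B:
    __slots__ = ("b",)

# TypeError: multiple bases have instance lay-out conflict
class C(A, B):
    __slots__ = ()
```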
Inheritance. Even Guido has said that people should avoid using __slots__. Some people think the space saving trumps the inheritance issues (type checking not being inherited, or some such). It's almost un-Pythonic. People want to use it so they don't have to go down to something like C. I personally would not use __slots__, but then again, I'm just a nobody who writes semi-decent code for semi-decent money. YMMV.
The structure instance has a pointer to its type, followed by a numeric ID (which is also in the type, but is forwarded to the instance for faster access). The ID is combined with a slot symbol to perform a cache lookup to get the offset of a slot. The numeric ID is a fixnum, which leaves a few spare bits for a couple of flags.
If someone wanted to shrink this, they could patch the code so that the dirty flag support, and lazy instantiation of structs is made optional (as in compiled out), and so is the forwarding of the inst->type->id to inst->id. This struct type is not known outside of struct.c, which is only some 1700 lines long. If you take out the id member from struct_inst, the C compiler will find all the places that have to be fixed up; literally a 15 minute job.
I can also think of a more substantial refactoring that would eliminate the type pointer also. There is no room in the heap object handle to store it directly: heap handles have four words in them; the COBJ ones used for structs have the type field, a class symbol, a pointer to an operations structure, and a pointer to some associated object (in this case struct_inst). All structures share the same operations structure. However, if we dynamically allocate that operations structure for each struct type, we could stick the type pointer in there, taking it out of the instance. Thus an instance size could literally just be sizeof(pointer) * no-of-instance-slots.
I am paying US$5 a month for python's memory hogging.
Running Mailman: it will not run in 1 GB of RAM (the US$5 VPS), so I had to give it 2 GB. 2 GB? For a mailing list server?
I had the same problem running motioneye. It will not run on a Raspberry Pi Zero. It is advertised to run on a Zero (apparently I could squeeze it on by doing some Python magic...).
What total crap is Python! What waste, what hubris, what technical failure!
I suspect the problem is actually Django. What hubris! NIH! lighttpd is running sweetly, with some Rust templating, in an acceptable fraction of the memory.
But moving up your stack of desire, what alternative to Mailman would you live with? Five a month is $60 a year. Assuming you value your own time at professional wages, this job is worth an hour of your time at most before complaining costs more than replacing it.
You have completely missed the point. The US$5/month is not a hardship. What is a hardship is taking what used to be done in 40k of Perl and requiring 2 GB. It is an affront!
What is a waste is writing a poor-quality, greedy web server when Apache/Lighttpd/Nginx all exist and are much, much better.
What really cost me was the two weeks I spent trying to make motioneye (Python) and motioneyeos (not Python) work as advertised. As advertised! It is not only the Python bloat that stops it working (though it does); there are other reasons. Very clearly nobody has really tested it. It is a symptom of Python culture: shoddy and arrogant.
I feel terribly burnt by it. Twice. So my stack of desire in this domain is quite small and simple: never to hear from Python again! A very personal POV; be happy in your Python hacking. I saw the opportunity to have a public rant and I took it!
Is this trading space for speed, or is there actually a speedup as well in some cases? With interpreters it's possible to be both small and fast, if perhaps at the cost of becoming more obscure or more rigid in your structure definitions.
I always hate how these things are phrased. The idea of “using a lot of memory” doesn’t exist in a vacuum. “A lot” relative to what? Do you require run-time dynamic typing features? Do you want to leverage the Python data model? If yes, then this is the memory cost of your desires.
Advice in many of the comments about other languages sounds so tone deaf. The question is not about using less memory. It’s about getting _exactly_ the feature set of Python while using less memory.
You will see this in the case of Java as well. This problem arises when a large number of objects are active in RAM during the execution of a program, especially if there are restrictions on the total amount of available memory.