Oh yeah! When I showed up at my current employer, I had a laptop with 32GB and two PCIe SSDs in RAID 0. Almost immediately, I had to upgrade to 64GB.
I'm a data scientist and regularly work with multiple datasets simultaneously that require that much RAM. Both Python and R rely on in-memory processing; loading on and off disk is substantially slower and does not fit with what I am trying to do. For really large datasets I also have a 28-core Xeon with 196GB that I can remote into, but it is nice not to have constraints on my laptop.
Of course, you could go with Hadoop or Spark to process some of these datasets, but that requires quite a bit of overhead, and it's easier (and cheaper) to just buy more RAM.
Same story for me, give or take a few percent. I have to recommend dask for Python though; it made out-of-memory errors largely disappear for me. It lets you process disk-sized datasets in parallel with (almost) the convenience of in-memory datasets.
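If anyone hasn't seen it, a minimal sketch of what that looks like (the file pattern and column names here are made up, not from any real project): dask.dataframe mirrors a subset of the pandas API but builds a lazy task graph and works in partitions, so the data never has to fit in RAM all at once.

    # minimal dask sketch -- file pattern and column names are hypothetical
    import dask.dataframe as dd

    # reads lazily, one partition at a time, instead of pulling everything into RAM
    df = dd.read_csv("events-2019-*.csv")

    # pandas-style operations only build a task graph...
    summary = df.groupby("user_id")["duration"].mean()

    # ...which runs (in parallel, out of core) when you call compute()
    print(summary.compute())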
Really? It's been a while since I've used it, but I remember a good portion of the documentation talking about how they replicate some, but not all, of pandas' APIs (because of the sheer number of them).
I'm a C++ programmer working in games, and I run out of the 64GB of RAM in my workstation daily. I can't wait until we finally get upgraded to 128 or 256GB of RAM as standard.
Well, that's why we consumers have to buy better hardware and more RAM? As an old former Amiga programmer I have always, to this day, been a less-is-more kind of guy: make the code run faster and make the program use less RAM.
Good in theory unless you need all of the data at once. There are things we do now that wouldn't have been possible (in the same sense) 25 years ago without a lot of work. We might use languages that are 200x slower, but they might be 10x more productive. That's a winning tradeoff for many people.
Nope, it has nothing to do with what you as a customer get as the final product. Loading the game's main map uses about 30GB of RAM in the editor, and starting the main servers in a custom configuration will use that amount again. Systems like FASTBuild can use several gigabytes when compiling. None of this has anything to do with the client, which will run with as little as 4GB of RAM.
Once your datasets go beyond the bounds of a single reasonable machine, it's time to switch to an Apache Spark cluster (or similar).
You can still write your data analysis code in Python, but you get to leverage multiple machines and an intelligent compute engine that knows how to distribute your computation across nodes automatically, keeping data lineage information so the computation is moved as close as possible to where the data lives.
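Roughly what that looks like from the Python side (the path and column names below are placeholders, not anything real): the script reads like ordinary dataframe code, and Spark decides where each partition gets processed. The master (local, standalone, YARN, etc.) is normally chosen when you create the session or submit the job.

    # sketch of a PySpark job; path and column names are placeholders
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # the master is typically supplied at submit time rather than hard-coded
    spark = SparkSession.builder.appName("example").getOrCreate()

    # Spark splits the data into partitions spread across the cluster
    df = spark.read.parquet("hdfs:///data/events")

    # the aggregation is pushed out to the nodes holding each partition
    result = (df.groupBy("country")
                .agg(F.avg("duration").alias("avg_duration")))

    result.show()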
You know, sometimes you are in that uncomfortable spot where you have too much data for a single laptop but too little to justify running a whole computing cluster.
That is the kind of spot where you max out everything you can max out and just go take a break when something intensive is running.
This. Honestly, depending on the task, hundreds of GB can still be "single computer" territory, because setting up a cluster just isn't worth it in terms of time, money, and administration overhead. That said, parallel + out-of-core computation doesn't necessarily imply a cluster: single-node Spark or something like dask works fine if you're in the Python world.
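For the single-machine case, Spark's local mode is literally just a different master URL, so you can prototype on a workstation and only move to a cluster if you ever have to (the app name here is arbitrary):

    # same analysis code, but on one machine: local[*] uses every core on the box
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("single-node-sketch")   # name is arbitrary
             .master("local[*]")              # all local cores, no cluster involved
             .getOrCreate())
    # from here on, the dataframe code is identical to the clustered version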
Setting up an ad hoc (aka standalone) Spark cluster with a bunch of machines you have control over is a ridiculously trivial task, though. You designate one machine as the master, start a master process on it, and point worker processes on the other machines at its spark:// URL; they register themselves as workers of that master. Then you just submit jobs to the master and that's all.
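(In recent Spark versions the master and workers are started with the scripts shipped in sbin/, start-master.sh and start-worker.sh; older releases call the worker script start-slave.sh.) On the job side the only thing that changes is the master URL; the hostname below is made up and 7077 is just the default standalone master port:

    # pointing the same session at a standalone cluster
    # (hostname is hypothetical; 7077 is the default master port)
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("standalone-sketch")
             .master("spark://master-host:7077")
             .getOrCreate())
    # the master hands the work to whichever workers have registered with it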
Running distributed like that always has a cost, both in inefficiency of the compute and in person-time.
If you can still run on one machine, it's almost always a win. 32GB is a perfectly reasonable amount of memory to expect, and 64GB isn't outlandish at all for a workstation.
Cloud is an option for really large memory requirements. You can provision machines with nearly 2TB of RAM in AWS, and it's pretty cost-effective if you only spin them up when you actually need them.
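Something like the sketch below with boto3 (the AMI ID, key pair, and region are placeholders; x1.32xlarge is the roughly 2TB instance type): spin it up, run the job, terminate it so you stop paying for it.

    # hedged sketch: rent a ~2TB-RAM box only while you need it
    # (AMI ID, key name, and region are placeholders)
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxx",          # placeholder AMI
        InstanceType="x1.32xlarge",      # ~1,952 GiB of RAM
        KeyName="my-key",                # placeholder key pair
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # ...do the work, then shut the instance down again
    ec2.terminate_instances(InstanceIds=[instance_id])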