
Oh yeah! When I showed up at my current employer, I had a laptop with 32GB and 2 PCI SSDs in RAID0. Almost immediately, I had to upgrade to 64GB.

I'm a data scientist and regularly work with multiple datasets simultaneously, which is what drives the RAM usage. Both Python and R rely on in-memory processing. Loading on/off disk is substantially slower and does not fit with what I am trying to do. For really large datasets I also have a 28-core Xeon with 196GB that I can remote into, but it is nice to not have constraints on my laptop.

Of course, you could go with Hadoop or Spark to process some of these datasets, but that requires quite a bit of overhead, and it's easier (and cheaper) to just buy more RAM.




Same story for me, give or take a few percent. I have to recommend dask for Python though; it made out-of-memory errors largely disappear for me. It allows parallel processing of disk-sized datasets with (almost) the convenience of in-memory datasets.
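In case it helps, here's a minimal sketch of that out-of-core workflow; the file pattern and column names are invented for illustration:

    # dask.dataframe reads lazily in chunks, so the full dataset never has to fit in RAM.
    import dask.dataframe as dd

    df = dd.read_csv("transactions-*.csv")          # hypothetical files, read lazily

    # Looks like pandas, but the work is split into tasks and run in parallel.
    totals = df.groupby("customer_id")["amount"].sum()
    print(totals.compute())                         # .compute() triggers the actual execution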


Holy cow, dask looks useful


The most impressive part is how easy it is to use. It's basically just a reuse of the numpy API.


They replicate the Pandas API, not the numpy one. Just for clarity's sake.


What a peculiar comment. Their homepage has examples of both - with Numpy first.
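To illustrate both interfaces (the sizes and data here are made up, just to show the two styles):

    import dask.array as da
    import dask.dataframe as dd
    import pandas as pd

    # NumPy-style: a large random array processed in 5k x 5k chunks.
    x = da.random.random((20_000, 20_000), chunks=(5_000, 5_000))
    print(x.mean(axis=0)[:5].compute())

    # Pandas-style: wrap an existing frame and use the familiar groupby calls.
    pdf = pd.DataFrame({"name": ["a", "b", "a", "b"], "x": [1.0, 2.0, 3.0, 4.0]})
    ddf = dd.from_pandas(pdf, npartitions=2)
    print(ddf.groupby("name")["x"].mean().compute())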


Really? It's been a while since I've used it, and I remember a good portion of the documentation talking about how they replicate some, but not all, of Pandas' APIs (because of the sheer number of them).

They've probably updated it since then...


yup it really is


I'm a C++ programmer working in games and I run out of 64GB of RAM in my workstation daily. I can't wait until we finally get all upgraded to 128 or 256GB of RAM as standard.


Well, that's why we consumers have to buy better hardware and more RAM? As an old former Amiga programmer, I have always, to this day, been a less-is-more kind of guy: make code run faster and make the program use less RAM.


Good in theory unless you need all of the data at once. There are things we do now that wouldn't have been possible (in the same sense) 25 years ago without a lot of work. We might use languages that are 200x slower, but they might be 10x more productive. That's a winning tradeoff for many people.


Nope, it has nothing to do with what you as a customer get as a final product. Loading the main map of the game uses about 30GB of RAM in the editor + starting the main servers in a custom configuration will use that amount again. Systems like fastbuild can use several gigabytes when compiling. None of this has anything to do with the client, which will run with as little as 4GB of RAM.


Buy a 2 x Xeon workstation with 768GB of RAM?


Once your datasets go out of the bounds of a single reasonable machine, it's time to switch to an Apache Spark cluster (or similar).

You can still write your data analysis code in Python, but you get to leverage multiple machines and an intelligent compute engine that knows how to distribute your computation across nodes automatically, keeping data lineage information so that computation is moved closest to where the data is located.
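Roughly, in PySpark (the dataset path and column names here are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Spark splits the input into partitions and distributes them across executors.
    df = spark.read.parquet("s3://bucket/clicks/")           # hypothetical dataset

    daily = (
        df.groupBy(F.to_date("timestamp").alias("day"))
          .agg(F.countDistinct("user_id").alias("users"))    # aggregation runs near the data
    )
    daily.show()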


You know, sometimes you are in that uncomfortable spot where you have too much data for a single laptop but too little to justify running a whole computing cluster.

That is the kind of spot where you max out everything you can max out and just go take a break when something intensive is running.


This - honestly, depending on the task, hundreds of GB can still be "single computer" territory, because setting up a cluster just isn't worth it in terms of time, money, and administration overhead. However, parallel + out-of-core computation doesn't necessarily imply a cluster: single-node Spark or something like dask works fine if you're in the Python world.
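For the single-node case, something like this is enough (the path is illustrative):

    from pyspark.sql import SparkSession

    # local[*] uses every core on the one machine; no cluster to administer.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("single-node")
        .getOrCreate()
    )

    df = spark.read.parquet("/data/large-dataset/")   # hypothetical local path
    print(df.count())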


Setting up an ad hoc (aka standalone) Spark cluster with a bunch of machines you have control over is a ridiculously trivial task though. It's as easy as designating one machine x as the master (start-master.sh) and starting all the others pointed at it (start-slave.sh spark://x:7077); they become slaves of x. Then you just submit jobs to x and that's all.
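Submitting to it from Python is then just a matter of pointing at the master (the host name below is a placeholder; 7077 is the standalone master's default port):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://master-host:7077")   # the machine running start-master.sh
        .appName("adhoc-cluster-demo")
        .getOrCreate()
    )

    # Trivial job, just to confirm work is spread over the workers.
    print(spark.range(1_000_000_000).selectExpr("sum(id)").collect())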


Spark is slow though. On the other hand, Pandas is also extraordinarily slow :D


Then you remote into a workstation, as someone else in this thread said they did.


Running distributed like that always has a cost, both in inefficiency of the compute and in person-time.

If you can still run on one machine, it's almost always a win. 32GB is a perfectly reasonable amount of memory to expect. 64GB isn't outlandish at all for a workstation.


Really depends on the computation... what you say only makes sense for certain niches of computation.


What laptop are you using? I think my ThinkPad W540 is capped at 32GB.


Thinkpad P50


Cloud is an option for really large memory requirements. You can provision machines with nearly 2TB of RAM in AWS, and it's pretty cost-effective if you only spin them up when you actually need them.
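A rough sketch of that spin-up / tear-down pattern with boto3 (the AMI, key pair, and region are placeholders; x1.32xlarge is one of the ~2TB-RAM instance types):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxx",          # placeholder AMI
        InstanceType="x1.32xlarge",      # roughly 2TB of RAM
        KeyName="my-key",                # placeholder key pair
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # ...run the job, then terminate so you only pay for the hours you used.
    ec2.terminate_instances(InstanceIds=[instance_id])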



