
Heap fragmentation hasn't been a big problem for me. Using multiple JVMs would mean reimplementing all our data structures in shared memory and writing my own memory allocator or garbage collector for that memory. It's a huge effort.

Many applications can distribute work among multiple processes because they don't need access to shared data or can use a database for that purpose. But for what I'm doing (in-memory analytics) that's not an option.



You've probably since moved on from this conversation, but I wonder if Tuple Space might help [1]. It provides a distributed-memory feel to applications. Apache River provides one such implementation [2].
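For readers who haven't seen the model: a tuple space is a shared bag of tuples that processes coordinate through with write, read (non-destructive), and take (destructive) operations, matching tuples against partial patterns. A toy single-process sketch (real systems like JavaSpaces/Apache River add distribution, transactions, leases, and blocking semantics) might look like:

```python
# Toy tuple space: write/read/take with None as a wildcard field.
# Illustrative only; names and semantics are simplified assumptions,
# not the JavaSpaces/Apache River API.

class TupleSpace:
    def __init__(self):
        self._tuples = []

    def write(self, tup):
        # Add a tuple to the space.
        self._tuples.append(tup)

    def _match(self, pattern, tup):
        # A pattern matches if lengths agree and every non-None
        # field is equal to the corresponding tuple field.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def read(self, pattern):
        # Non-destructive lookup: return the first match, or None.
        for tup in self._tuples:
            if self._match(pattern, tup):
                return tup
        return None

    def take(self, pattern):
        # Destructive lookup: remove and return the first match.
        for i, tup in enumerate(self._tuples):
            if self._match(pattern, tup):
                return self._tuples.pop(i)
        return None
```

Usage: a producer does `space.write(("task", 1, "pending"))` and a worker does `space.take(("task", None, "pending"))` to claim work, which is what gives the coordination-through-shared-data feel.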

Another question about in-memory analytics: do you have to be in memory? I'm currently working on an analytics project using Hadoop. With the help of Cascading [3] we're able to abstract away much of the MapReduce paradigm. As a result we're doing analytics across 50 TB of data every day once you count workspace data duplication.
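For anyone unfamiliar with the paradigm Cascading abstracts, the core shape of MapReduce is map, then group-by-key ("shuffle"), then reduce. A toy single-process sketch (deliberately not Cascading's API, which works in terms of pipes, taps, and flows):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map phase: emit (key, value) pairs for each input record.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group values by key
    # Reduce phase: fold each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count over two input lines.
lines = ["a b a", "b c"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {"a": 2, "b": 2, "c": 1}
```

Frameworks like Cascading let you compose many such map/group/reduce steps as a dataflow without writing the raw Hadoop job plumbing by hand.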

[1] https://en.wikipedia.org/wiki/Tuple_space
[2] http://river.apache.org/index.html
[3] http://cascading.org


Thanks for the links. The reason we decided to go with an in-memory architecture for this project is that we have (soft) real-time requirements and complex custom data structures. Users are interactively manipulating a medium-sized (hundreds of gigs) dataset that needs to be up to date at all times.

The obvious alternative would be to go with a traditional relational database, but my thinking is that the dataset is small enough to do everything in memory and avoid all serialization/copying to and from a database, cache, or message queue. Tuple Spaces, as I understand them, are basically a hybrid of all those things.
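The trade-off described above, keeping a hot, mutable dataset resident in the process and updating it in place rather than serializing through an external store, can be sketched roughly like this (the `LiveIndex` name and shape are illustrative assumptions, not the poster's actual data structures):

```python
# Sketch: a live in-process aggregate updated on every incoming event.
# Reads always see the latest applied event, with no serialization,
# network hop, or cache-invalidation step in between.

from collections import defaultdict

class LiveIndex:
    """Hypothetical in-memory index kept up to date in place."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, key, amount):
        # O(1) in-place update; contrast with INSERT + commit
        # round trips against an external database.
        self.totals[key] += amount

    def query(self, key):
        # Immediately consistent: reflects every applied event.
        return self.totals[key]

idx = LiveIndex()
idx.apply("EU", 10.0)
idx.apply("EU", 5.0)
idx.apply("US", 7.0)
# idx.query("EU") == 15.0
```

The cost, as the earlier comments note, is that you inherit the database's problems yourself: durability, concurrency, and (if you split across processes) shared-memory allocation.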



