This was inevitable, and I’m sure the engineering from Amazon’s side is impressive because Lustre is an absolute beast to run well at scale, but I’m not sure how great an idea it is for most people.
Coming from an academic HPC background and then moving into the private sector, I’ve mostly come to believe that parallel filesystems (especially POSIX-compliant ones) are rarely the right solution outside of MPI simulations. Like NFS, they make it extremely easy and attractive to implement anti-patterns like using the filesystem for IPC or generating a bazillion files and then needing to reduce them to move to the next stage of the pipeline. In my experience, it’s rare that people don’t regret doing that sort of stuff in the long run.
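To make the IPC anti-pattern concrete, here’s a rough sketch of the kind of thing I keep seeing (Python; the paths and names are made up for illustration): one rank writes its result plus a sentinel file on the shared filesystem, and consumers busy-poll for the sentinel instead of using a proper queue or object store.

    # Rough sketch of the "filesystem as IPC" anti-pattern (hypothetical paths/names).
    import json
    import time
    from pathlib import Path

    SHARED = Path("/lustre/scratch/pipeline")  # hypothetical shared mount

    def producer(task_id: int, result: dict) -> None:
        """Write the result, then touch a sentinel file that consumers poll for."""
        (SHARED / f"task_{task_id}.json").write_text(json.dumps(result))
        (SHARED / f"task_{task_id}.done").touch()  # one more metadata op per task

    def consumer(task_id: int, timeout_s: float = 300.0) -> dict:
        """Busy-poll the shared filesystem until the sentinel appears."""
        sentinel = SHARED / f"task_{task_id}.done"
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if sentinel.exists():  # every exists() call is a metadata server hit
                return json.loads((SHARED / f"task_{task_id}.json").read_text())
            time.sleep(1.0)
        raise TimeoutError(f"task {task_id} never signalled completion")

Multiply that by a few thousand ranks polling every second, or by a pipeline stage that emits millions of tiny files the next stage has to stat and reduce, and the metadata servers take a beating that a message queue or object store would absorb without blinking.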
That said, I’m sure the AWS team knows their customers and what they’re doing better than I do!
That would have to be the worst job in the world - keeping Lustre going as an Amazon service, with management that utterly lacks understanding and sympathy.
They’d have to pay me a lot of money to do it, that’s for sure. I’d love to see the disaster recovery plans. Every major Lustre site I’m aware of has had a data loss “incident” at some point in their history. It’s possible AWS has it all figured out with background backups and block device replication and whatnot, but I’m skeptical.
I have doubts based on my experience with Lustre and a certain understanding of AWS operations. I'm guessing they're going for the Iranian minefield clearing technique - get a mob of kids, hand them plastic keys to heaven (or RSUs), and march them through the field.
'Ephemeral' was/is the original Lustre design model. It was intended as high-performance swap/scratch at Livermore with a short data lifespan - your higher-priority bomb sim forces mine to roll out to disk and back in later, and that's it. Lustre, even today, isn't stable for long-term storage. The longer you leave data on it, the greater the probability of corruption - sometimes silent.
Same thought here. I spent two-plus years debugging Lustre issues for a very small set of customers. It was an absolute beast. The build process was a compatibility-killing, license-violating nightmare. It fell over at the slightest provocation, with little info to help figure out why. It provided no metrics to speak of, and the thicket of interrelated settings (especially timeouts) made effective tuning almost impossible. I'd guess that Amazon spent many engineer-years removing or rewriting significant pieces, and even more establishing the safe configuration envelope for what remained. Even then, it's probably a nightmare for the SREs (or whatever Amazon calls them) who have to keep it running.
AWS re:Invent, AWS's annual conference, is currently happening in Las Vegas. There are a number of headline announcements each day to go along with each of the keynotes (four in total).