Hot code reloading is really amazing in Erlang. We regularly use it to patch a function without restarting. Getting reltool properly configured to deploy the new version of your application, though, is a MAJOR pain in the ass.
For the non-Erlangers:
The 'simple' way (i.e. I have a tiny change I want to make, so I'll just ssh in and do it) is to edit your code, attach to an existing node, and run make:all([load]) to pick up the changes. We added a Makefile target, "make attach", that does this for us.
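A rough sketch of what that session looks like (the node name, cookie, and host are placeholders, and it assumes the node was started in a project directory that has an Emakefile):

    $ erl -sname patcher -setcookie mycookie -remsh myapp@myhost
    (myapp@myhost)1> make:all([load]).   % recompile per the Emakefile and load into the running node
    up_to_date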
The 'reltool' way is to package your application into releases. We use Jenkins build numbers as the third digit of our release versions, so each one is packaged into a zip, encrypted, and ready to deploy. At deploy time the zip is copied to all machines, then we use an escript to attach to the running nodes and use release_handler to bring up the N+1 version of our app. It's a pain, but once it's configured it's amazing...
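The release_handler dance itself is only a few calls. This is a sketch, not their actual escript; the release name and version are made up, and it assumes the package has already been decrypted and placed in the node's releases directory as a .tar.gz:

    %% run on the target node, or via rpc:call/4 from an escript
    {ok, Vsn} = release_handler:unpack_release("myapp-1.0.342"),
    {ok, _FromVsn, _Descr} = release_handler:install_release(Vsn),
    ok = release_handler:make_permanent(Vsn).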
I would love to see more information about how you guys configured reltool. I really loved using hot code reloads in Erlang, but always did them manually during development. I could get releases to work, but never felt confident in how to use them properly.
Do you have any guidelines for when and when not to do a hot release upgrade?
I have yet to find anybody shipping whole updatable releases with Erlang - CouchDB, RabbitMQ, any big open source project - nobody is doing it. I think it's a bit overrated, because realistically nobody will want to do this in production unless it has been audited and tested in every possible way. If you have to update external resources in the code_change callback, it can fail miserably. I have no problem using it in production for small changes that I fully understand, but a big release - I wouldn't do it. Too many things can go wrong at the same time to react. You have to audit and test the upgrade and the rollback, which means extra code in code_change to handle the rollback too. Quite heavy.
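For reference, both directions have to be spelled out in the .appup file, and code_change is called for the downgrade too (with {down, Vsn} as its first argument). A minimal sketch with made-up versions and module name:

    %% myapp.appup
    {"1.1.0",
     [{"1.0.0", [{update, counter_srv, {advanced, []}}]}],   %% upgrade instructions from 1.0.0
     [{"1.0.0", [{update, counter_srv, {advanced, []}}]}]}.  %% downgrade instructions back to 1.0.0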
Could you talk a little bit about what happens to processes which are running different versions of the code? Do you make your code backward compatible so that old processes continue to work alongside the new ones? Or do you just let the old processes crash?
If you use OTP (gen_server, etc.), your code is a set of callbacks that return promptly, so a process is never stuck executing "old" code until the next upgrade. Your callbacks produce and consume a "state" object (opaque to OTP) that OTP passes around[1], and OTP calls a code_change callback whenever your module changes, so you can change the format of this object between versions.
The gen_server itself passes its module name to sys:handle_system_msg[2] when it receives a system message like a code-change notification, and that ends up as a fully qualified Module:Function call, which always refers to the newest version of the module, so it runs the latest code even when gen_server itself is upgraded.
If you set things up right, the old process is upgraded to the new code when it processes its next message (with an opportunity to migrate its state from the format the old code used to the format the new code expects, if required); the idea is that you are upgrading the running code in place. In fact, leaving old processes lying around is actively problematic, as you can only have two versions of any module loaded, the old one and the current one; if you upgrade again, all processes still running the old version crash.
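A minimal sketch of what that callback looks like (the module, record, and field names are made up):

    -module(counter_srv).
    -behaviour(gen_server).
    -export([start_link/0, init/1, handle_call/3, handle_cast/2, code_change/3]).

    -record(state, {count = 0, last_reset}).   %% last_reset is new in this version

    start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    init([]) -> {ok, #state{last_reset = erlang:system_time(second)}}.

    handle_call(get, _From, S)  -> {reply, S#state.count, S};
    handle_call(bump, _From, S) -> {reply, ok, S#state{count = S#state.count + 1}}.

    handle_cast(_Msg, S) -> {noreply, S}.

    %% The previous version's state was the two-element tuple {state, Count};
    %% rebuild it as the new record, filling in the added field.
    code_change(_OldVsn, {state, Count}, _Extra) ->
        {ok, #state{count = Count, last_reset = erlang:system_time(second)}};
    code_change(_OldVsn, State, _Extra) ->
        {ok, State}.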
You can upgrade state in many cases if the changes are trivial, though depending on how ephemeral a process is you might let it finish, or crash and restart, instead of doing this.
So, in the worst case, to get basic code reloading in other Erlang-like systems (e.g. Cloud Haskell), we could let every process in the cluster crash and restart with the new version?
It's rarely "only crash" as that eventually bubbles up to being equivalent to a reboot. It's simply that you can have subsystems that do that in effective isolation if the effects of a reboot are minimal.
On the state upgrade side, there are usually a few different issues that can arise, and I'd be very curious to hear how Haskell would handle some of these.
The first one is type changes. I might have a record that gets a new field added. It's not necessarily pretty to upgrade it on the fly with a pattern match or a code upgrade protocol, but it's easily expressed dynamically (there's a sketch at the end of this comment).
Another is the interface, like adding new arguments or changing a synchronous call to an asynchronous one. These are a bit easier to handle via indirection, though they show that you'll need to plan your entry/exit points for upgrades carefully (again, OTP has things like gen_server which make this much easier).
If Haskell can manage to get past the type boundary issue, then it's really a matter of supporting at least two simultaneous versions of code so each process can be scheduled and upgraded in the natural course of things. Handling more than two could be useful depending on how aggressively you want to purge; for example, a local rather than fully qualified call can be captured in a closure and passed around as a value to be called later. These long-lived references need to be handled carefully or you might get some delayed surprises.
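To make the Erlang side of that concrete, here's a rough sketch (all names made up) of a bare receive loop that jumps into the newest code via a fully qualified ?MODULE call and migrates an old state tuple on the way through; gen_server packages this same pattern up behind code_change:

    -module(ticker).
    -export([start/0, loop/1]).

    %% Current state shape: {state, Count, LastTick}; older versions used {state, Count}.
    start() -> spawn(fun() -> loop({state, 0, undefined}) end).

    loop(State0) ->
        State = migrate(State0),       %% accept whatever shape an older version left behind
        receive
            tick ->
                {state, N, _Last} = State,
                ?MODULE:loop({state, N + 1, erlang:system_time(second)});  %% fully qualified: picks up newly loaded code
            stop ->
                ok
        end.

    migrate({state, N}) -> {state, N, undefined};  %% old two-element shape
    migrate(State)      -> State.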
In my very early years of discovering coding, I remember fantasizing about a system where I could start my creation as a very simple endless loop and then add whatever would make up an application to it, without ever having to stop it (... it was more of a coding experiment in Logo or BASIC at that time ;)
Fast forward almost 30 years: in Erlang, lightweight processes are (most of the time) tail-recursive functions handling messages... endless loops. Those lightweight processes just keep running those endless loops.
The module upgrade functionality described here allows for uninterrupted upgrades of back-end servers/services, which, by the way, is often serious business in production and the cause of the reltool complexities.
BUT: the same hot code reloading system also makes for a very sleek development experience. You start the first version of your service, and from then on you update the source and the service changes its behavior, most of the time with no restart, no lost state, etc.
(with the help of some file-monitoring tools, source changes can even be picked up automagically...)
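Even without a watcher, recompiling from the shell is enough for running processes to pick up the new code on their next fully qualified call (the module name here is made up):

    1> c(ticker).   % compile the edited source and load it into the node
    {ok,ticker}
    2> l(ticker).   % or just reload an already-compiled .beam
    {module,ticker}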
This kind of dev-environment is simply flow-inducing.
This is pretty far from what other frameworks and systems can even dream about. Not everyone needs this, but when they do, Erlang is the only thing I know of that can handle it.
Well, I did Erlang-inspired runtime patching and even hot module updates (reloads) with Python and gevent.backdoor. However, it was a hack (loading the new module and [non-atomically] replacing all references to the old one), especially compared to the Erlang approach.
I imagine you would do what Erlang does for you, just manually: isolate individual server instances, do the update, then make them public again.
Say, for example, you are updating code in your service's web tier. Have your LBs stop sending new traffic to a server instance, wait a reasonable length of time for existing connections to complete, deploy, restart, then give the LBs the A-OK to send traffic to it again. Repeat until all web-tier instances are updated.
OK, but what if the server you isolated was holding a long-running process?
Yes, for a web service that serves quick HTTP requests and responses you could do that. But not everything is short-lived request/response messaging. Some sessions and processes are long-lived. If you have a socket open with data streaming into it, it is not easy to isolate. You could, say, send it a message that means 'start isolating'.
How do you hot-swap with data held on the stack or the heap? Even more, if two parts of your code need updated instance data, how do you ensure that the update is synchronous or happens in the right order? In Java they have a $transformer() method; OK, how do you regulate the order in which $transformer() gets called? Otherwise a new version of one instance will call an old version of another.
I am not saying it is impossible to do, there might be a way, but it usually means working against the framework and against the default setup of the system.
Yes, it can get very complicated trying to handle all the edge cases you described. I would guess doing it the way Erlang does is actually the simplest: let it crash.
This gets into coding mindset: my impression is that most Erlang programmers expect their code to crash, whereas most other languages seem to lend themselves to expecting the program _not_ to crash.
I am not very well versed in Erlang, but my reading so far implies that graceful handling of crashes is really where Erlang/OTP shines. Regardless of the framework, I would say it comes down to proper queuing, being able to safely retry work, and to some extent having some smarts on the clients.
As for C or C++, it is platform-specific, but shared objects and DLLs have been loadable/unloadable at runtime for as long as I can remember. Building a hot-swap feature on top of that would be challenging but doable.
The tricky part is transferring state between hot reloads, and that's part of what Erlang was designed around. State has to be passed in and owned by the callee. And if the format that state is stored in changes, you need to handle that translation in the new version.
It's definitely doable in C and C++. The question is whether it is easier to re-implement a bunch of what Erlang does for you, or to just use something that supports it directly.
The message queue delivers messages to the script engine, which runs the script fresh every time. There is no hot-running code for new code to be swapped into; at the next message, the new code is simply loaded. Similar to what Erlang does, but much more brittle, less tested, and more fail-prone.
I was a little confused by the title, as it implied a relationship between a process's mailbox and some module, which is not the case.
The article does at least demonstrate that processes can transition between two versions of the same code w/o resetting state, which is, at its core, the very thing that makes code upgrades remotely practical.
Other comments mention some more sophisticated machinery like release upgrades; Erlang has many code upgrade options baked into OTP. I make use of many of these features both during development and in production, with some careful review. After getting used to hot upgrades and distributed Erlang, I'm always disappointed when going back to a system that has to "reboot" itself (version discrepancies in a cluster can present a similar problem if you don't want to pause your system).
You can have multiple sets of instructions for the same function loaded while indicating that only one should be active. This makes it easy to roll back a deploy, for example. Also, when you do a live upgrade, there are necessarily at least two versions exposed to the VM at some point before the switch occurs.
Easy hot code reloading is one of the great benefits of CSP.
Related to other comments: Erlang has mechanisms that ensure two versions of a module's code can remain active (if you choose), with no extra effort required.
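Concretely, the VM keeps one "current" and one "old" copy of each module. A rough shell session (the module name is made up, and it assumes the module wasn't already loaded):

    1> code:load_file(ticker).
    {module,ticker}
    2> erlang:check_old_code(ticker).
    false
    3> code:load_file(ticker).          % loading again turns the previous copy into "old" code
    {module,ticker}
    4> erlang:check_old_code(ticker).
    true
    5> code:soft_purge(ticker).         % true only if no process is still executing the old copy
    true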
In terms of correctness, I'm not aware of any work that does type checking or serious analysis across version upgrades; even Erlang's Dialyzer only checks each version in isolation. So in practice you either test the upgrades, with things like continuous integration or even manually, OR do them in very careful, small steps that are easy to reason about locally (very useful for critical applications that can't spare the downtime).