
I don't know if I missed it, but I would have liked to see a performance/accuracy comparison between the wide+deep model and a simple ensemble of separate wide and deep models. The advantage of having 2 separate models is that you could use just one or the other if something went wrong, or if you needed a faster prediction (i.e. when the escalator breaks, you get stairs).
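
To be concrete about what I mean, here's a toy sketch with scikit-learn stand-ins for the two halves (the model sizes, the dummy data, and the 0.5 averaging weight are all made up, not from the paper):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    # Placeholder data just so the sketch runs end to end.
    X_train = np.random.rand(500, 20)
    y_train = np.random.randint(0, 2, 500)

    # Two independently trained models; either one can serve alone.
    wide = LogisticRegression(max_iter=1000)           # "wide": linear, memorization
    deep = MLPClassifier(hidden_layer_sizes=(64, 32))  # "deep": MLP, generalization
    wide.fit(X_train, y_train)
    deep.fit(X_train, y_train)

    def ensemble_proba(X, use_deep=True):
        # Fallback path: if the deep model is slow or broken, the wide
        # model alone still answers (the stairs when the escalator breaks).
        if not use_deep:
            return wide.predict_proba(X)[:, 1]
        # Equal-weight average of the two probabilities; 0.5 is a placeholder.
        return 0.5 * (wide.predict_proba(X)[:, 1] + deep.predict_proba(X)[:, 1])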



Yeah, this is a pretty good point. IMO it's a major flaw in the paper that they didn't empirically compare their new model to alternatives (like just an ensemble, as you mention), though they do discuss the idea.

I wouldn't be surprised if a simple ensemble performs better!


In many cases the simple ensemble is more fragile in production because you have to maintain multiple models at once. The winning solution for the Netflix Prize was an unwieldy ensemble that never got used in production. Also, when you update your models you have to retune each of them individually and then manually retune the ensemble weights.

Another advantage of joint learning (which the authors mention) is that the individual components don't need to be as big as they would if trained independently, since they complement each other. The joint model will still be bigger than either individual model, though.
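
For anyone who hasn't read the paper, the joint setup looks roughly like this in Keras (feature widths here are placeholders; also, the paper actually trains the wide part with FTRL and the deep part with AdaGrad, not a single optimizer like below):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    wide_in = tf.keras.Input(shape=(1000,), name="wide")  # e.g. sparse cross features
    deep_in = tf.keras.Input(shape=(50,), name="deep")    # e.g. dense/embedded features

    h = layers.Dense(64, activation="relu")(deep_in)
    h = layers.Dense(32, activation="relu")(h)

    # Joint training: one logistic output over the concatenation, so both
    # halves are optimized together and each can stay smaller than a
    # standalone model would need to be.
    out = layers.Dense(1, activation="sigmoid")(layers.concatenate([wide_in, h]))

    model = Model(inputs=[wide_in, deep_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy")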


They (reasonably) claim the joint model doesn't have to be as big, but it would be interesting to see, for example, an ensemble of 2 models: a wide model the same size as the wide half of the joint model, and a deep model the same size as the deep half of the joint model.
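
Something like this, reusing the sizes from the joint sketch above (all the sizes are placeholders):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # Same halves as the joint model, but built fresh and trained independently.
    def wide_only(n_wide=1000):
        x = tf.keras.Input(shape=(n_wide,))
        return Model(x, layers.Dense(1, activation="sigmoid")(x))

    def deep_only(n_deep=50):
        x = tf.keras.Input(shape=(n_deep,))
        h = layers.Dense(64, activation="relu")(x)
        h = layers.Dense(32, activation="relu")(h)
        return Model(x, layers.Dense(1, activation="sigmoid")(h))

    # Experiment: train each alone, average their predictions, and compare
    # held-out accuracy/AUC against the jointly trained model of equal size.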



