> The basic idea is as follows: there is evidence that today's proteins emerged out of an ancient peptidic soup, one that may have left its mark on the evolutionary record. I.e., the proteins we see today may in some sense be formed out of primordial peptides. As proteins grew in size and complexity, it would have been advantageous to reuse existing components, to build bigger proteins from existing protein parts. We already know this is true on the level of protein domains, in that larger proteins are often comprised from chaining together smaller globular domains. But the phenomenon of reuse may go further, where even smaller protein fragments (handful of residues to dozens) may reflect an underlying evolutionary pressure to reuse working parts, fragments that fold in tried-and-tested ways (from the perspective of evolution.) If this is the case, then the space of naturally occurring proteins may occupy a very special "manifold", one that exhibits a hierarchical organization spanning small fragments to entire domains. Other evolutionary pressures could further drive the reuse phenomenon. For example, once a protein-protein or protein-DNA interface is established, presumably through some sort of structural motif, reusing that motif would present an efficient way for the cell to rewire its cellular circuitry. The end result of all this would be the emergence of something resembling a linguistic structure, a grammar that defines the reusable parts and how these parts can be combined to form larger assemblies. Given that this is biology, it’s unlikely to be rigid or minimal. It would be messy and hacky, with many exceptions and ad hoc evolutionary optimizations. But the manifold would be there, potentially discoverable and learnable.
Instead of characters -> 'byte-pair-encoding'-like sequences -> words -> sentences, think primordial peptides -> simple protein parts -> more complicated protein components -> proteins. If this "protein linguistic hypothesis" is correct, I see no reason why the manifold wouldn't be discoverable and learnable with modern SGD techniques.
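To make the analogy concrete, here is a rough sketch of what "byte-pair-encoding-like" fragment discovery could look like when run over amino-acid strings instead of text. Everything in it is my own illustration: the function names (`most_frequent_pair`, `merge_pair`, `learn_fragments`) are hypothetical, the input strings are made-up toy sequences rather than real proteins, and plain greedy pair merging is only a stand-in for whatever an actual learned model would discover. It is not how RGNs work.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most common adjacent token pair across all sequences, or None."""
    counts = Counter()
    for toks in seqs:
        counts.update(zip(toks, toks[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(toks, pair):
    """Replace each non-overlapping occurrence of `pair` with one fused token."""
    merged, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            merged.append(toks[i] + toks[i + 1])
            i += 2
        else:
            merged.append(toks[i])
            i += 1
    return merged


def learn_fragments(sequences, n_merges):
    """Greedy BPE-style merging: residues -> short fragments -> longer fragments."""
    seqs = [list(s) for s in sequences]  # start from single residues
    vocab = []  # learned fragments, in merge order
    for _ in range(n_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:  # every sequence has collapsed to a single token
            break
        vocab.append(pair[0] + pair[1])
        seqs = [merge_pair(toks, pair) for toks in seqs]
    return vocab, seqs


if __name__ == "__main__":
    # Toy strings over the amino-acid alphabet, invented for illustration only.
    toy = ["GAVLGAVL", "GAVLKK", "KKGAVL"]
    vocab, tokenized = learn_fragments(toy, n_merges=4)
    print(vocab)
    print(tokenized)
```

Each merge fuses the most frequent adjacent pair into a single token, so repeated fragments get promoted into reusable "words" while the tokens of each sequence still concatenate back to the original string. Real proteins would of course need something far less rigid than exact-match merging, which is where the "messy and hacky" caveat in the quote comes in.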
> So are RGNs a panacea? Not at all. This is very much a 1.0 release. They are raw and unpolished. Training them can be quite challenging, like I already mentioned. They do comparatively well on novel protein topologies, but that’s because everyone else does so poorly. They do silly things like predict pretty awful secondary structure, and their predictions can have steric clashes and the like.
If we accept his comparison to other results, it seems RGNs are unreasonably effective on novel topologies...
I'm sure there's a predictable set of interactions, with a minimum, finite set of required loops to support cellular life as we know it. Above the minimum set of operations and repeatable cycles, there are almost certainly specialty routines, and perhaps no fixed limits on diversity of optional interactions, at the cellular/chemical level.
But for sure, there is also a boundary layer, for interactions between cells. This would have to represent an almost entirely different set of chemical interaction rules for signaling, with its own constraints, minimum requirements, and optional expressions.
So, it's useful to conceptualize in terms like this, but problems solved within the context of intracellular operations will only offer clues about tissue organization. Indeed, more often than not, tissue requirements may drive the optional intracellular interactions rather than the reverse. In cases where intracellular interactions do drive extracellular organization, it's essentially leaky abstractions dictating the details of the higher-level implementation.