Could this approach be used for media compression? I've wondered how compressible a popular-music track could be if you had a sufficiently rich language to describe it. This seems like a method to answer that question.
Or sheet music. It always amazed me that humans came up with any solution at all to "here's a piece of paper, tell me what your song sounds like" to say nothing of one that actually works to some degree.
I've always wondered how much classical music sounds the way it does because sheet music is the way it is.
An example of this is Chinese guqin tablature. It can be centuries old and includes a lot of detail on where to place fingers and how to strike the strings, which can give you hints about pitch and timbre when combined with knowing the tuning, strings, etc. But the tablature has almost nothing to say about the LENGTH of each note, so rhythm has to be inferred by the performer from what they know about the culture.
Program Change + the General MIDI instrument set is implementation-dependent but was pretty common in the 90s, and encodes timbre in an extremely limited way.
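To make "extremely limited" concrete: a MIDI Program Change message is just two bytes, a status byte and one program number, and under General MIDI that single number is the entire timbre description. A minimal sketch (the helper name is mine, not from any library):

```python
def program_change(channel: int, program: int) -> bytes:
    """Build a raw MIDI Program Change message.

    Program Change is status byte 0xC0 ORed with the channel (0-15),
    followed by one data byte: the program number (0-127). Under
    General MIDI, program 0 is Acoustic Grand Piano and program 40
    is Violin -- one byte is all the timbre information you get.
    """
    assert 0 <= channel <= 15 and 0 <= program <= 127
    return bytes([0xC0 | channel, program])

# Select Violin on channel 0: the whole "timbre" is two bytes.
msg = program_change(0, 40)
```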
Now of course nobody outside of fringe artists really uses it.
It calls to mind the old joke about how someone wrote a compressor that turns Microsoft Word from a 20MB file into a 1 byte file, except the compressor is 20MB. (Adjust the file name and size until it's funny. When I first heard it, 20MB was an extraordinarily large size.)
But in this case you could imagine the right balance where it does end up with a significant savings.
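The balance is just amortization: a huge shared decoder pays off once its fixed cost is spread over enough tracks. A back-of-the-envelope sketch (every number below is made up for illustration):

```python
import math

def break_even_tracks(decoder_mb: float, codec_kb_per_track: float,
                      baseline_kb_per_track: float) -> int:
    """Smallest library size at which shipping one big shared decoder
    plus tiny per-track payloads beats a conventional codec.

    Total cost of the new scheme: decoder_mb + n * codec_kb_per_track.
    Total cost of the baseline:   n * baseline_kb_per_track.
    Break-even: n >= decoder size / per-track saving.
    """
    saving_kb = baseline_kb_per_track - codec_kb_per_track
    if saving_kb <= 0:
        raise ValueError("new codec must save space per track")
    return math.ceil(decoder_mb * 1024 / saving_kb)

# e.g. a 500 MB learned decoder, 50 KB/track vs ~4 MB/track MP3s:
n = break_even_tracks(500, 50, 4096)  # pays off after ~127 tracks
```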
Would anything approaching the typical bitrates of audio codecs imply an enormous dictionary? I also wonder whether any statement can be made about the learnability of codecs, e.g., are Fourier transforms something deep networks can arrive at?
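On the learnability question, one small observation: the discrete Fourier transform is a fixed linear map, so a single dense layer can represent it exactly; whether gradient descent actually finds it from data is a separate empirical question. A sketch, assuming numpy is available:

```python
import numpy as np

def dft_matrix(n: int) -> np.ndarray:
    """The n x n DFT matrix: W[j, k] = exp(-2*pi*i*j*k / n).

    A linear layer with these (complex) weights computes the DFT,
    so the transform is at least *representable* by a network.
    """
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

n = 64
x = np.random.default_rng(0).standard_normal(n)
# Applying the "layer weights" matches the FFT exactly (up to float error).
assert np.allclose(dft_matrix(n) @ x, np.fft.fft(x))
```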