Probably quite a bit higher: POTS was band-limited to something like 300-3000 Hz, which covers speech pretty well but not much else (hence the tinny hold music).
If this is true, then a brief 30-second audio signal would carry more than an A4 photo of a high-density visual encoding, like Microsoft HCCB barcodes. Interesting.
A (mono) .wav file is 16 bits/sample x 44,100 samples/sec ≈ 700 kbps. Thirty seconds of that is about 2.6 MB, which is a reasonable size for a photo. However, you wouldn't be able to send nearly that much data, since you'll be limited by various kinds of noise.
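To put rough numbers on that, here is a small sketch of the arithmetic. The first part is just the raw PCM size from the figures above. The second part uses the Shannon capacity formula, C = B * log2(1 + SNR), as one way to estimate what noise leaves you with. The 2,700 Hz bandwidth comes from the 300-3000 Hz figure mentioned earlier; the 30 dB signal-to-noise ratio is purely an assumed value for illustration, not a measurement of any real channel.

```python
import math

def pcm_bitrate(bits_per_sample: int, sample_rate_hz: int, channels: int = 1) -> int:
    """Raw (uncompressed) PCM bit rate in bits per second."""
    return bits_per_sample * sample_rate_hz * channels

def shannon_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Upper bound on error-free bit rate for a noisy channel, in bits per second."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

DURATION_S = 30

# Raw size of 30 s of 16-bit, 44.1 kHz mono audio.
raw_bps = pcm_bitrate(16, 44_100)
raw_bytes = raw_bps * DURATION_S / 8

# Hypothetical phone-like channel: 300-3000 Hz, ASSUMED 30 dB SNR.
channel_bps = shannon_capacity(3000 - 300, snr_db=30.0)
channel_bytes = channel_bps * DURATION_S / 8

print(f"raw PCM:  {raw_bps} bps, {raw_bytes / 1e6:.2f} MB in {DURATION_S} s")
print(f"channel:  {channel_bps:.0f} bps, {channel_bytes / 1e3:.0f} kB in {DURATION_S} s")
```

Under those assumptions the usable payload comes out around 100 kB rather than 2.6 MB, and a different SNR or bandwidth would change the result accordingly.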
You're now Nyquist-limited to sending fewer symbols per second (baud), but you might be able to use a larger "vocabulary" of symbols.
This was one of the big changes in modem design. The first 300 baud modems used 1 bit/symbol, but the V.34 modems used a bigger symbol set that could send 6-10 bits/symbol (and also sent the symbols faster).
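The relationship here is simply bit rate = symbols per second x bits per symbol, and the "vocabulary" size doubles with every extra bit. A minimal sketch, where the function names are made up for this example and the 3200 baud / 9 bits-per-symbol pair is an illustrative combination rather than a claim about a specific modem mode:

```python
def bit_rate(baud: int, bits_per_symbol: int) -> int:
    """Bits per second = symbols per second * bits carried by each symbol."""
    return baud * bits_per_symbol

def vocabulary_size(bits_per_symbol: int) -> int:
    """Number of distinct symbols needed to carry this many bits per symbol."""
    return 2 ** bits_per_symbol

# Early modem: 300 symbols/sec, 1 bit each, so only 2 distinct symbols.
slow = bit_rate(300, 1)
slow_vocab = vocabulary_size(1)

# Illustrative later modem: faster symbols AND a much larger symbol set.
fast = bit_rate(3200, 9)
fast_vocab = vocabulary_size(9)

print(f"300 baud x 1 bit/symbol   = {slow} bps, {slow_vocab} symbols")
print(f"3200 baud x 9 bits/symbol = {fast} bps, {fast_vocab} symbols")
```

The catch is that a bigger vocabulary means the symbols sit closer together, so the receiver needs a cleaner signal to tell them apart.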