Shouldn't a voice count as personally identifiable information, just as a picture does?
They could of course store a machine-generated transcript instead, assuming no personally identifiable information is uttered (though there almost always is, since the customer is expected to give a customer ID or something similar).
Also, considering how bad computers are at understanding spoken words, the usefulness of such a transcript would be debatable depending on context.
It depends on the content and the information stored alongside it; the medium is irrelevant.