There’s no SSML. The model that generated the text knows, at least in principle, what it’s saying: it knows whether it’s a question, whether the mood should be sombre or excited, and it could pass that information along as SSML tags to the text-to-speech synthesizer. The problem I’ve been seeing is that pretty much all of these setups just output plain text, and that text gets shoved straight into the TTS. It’s on my list to look into projects that embed these tags, so that on one side you have something like Open WebUI showing the user plain text, while an embedded set of tags is handled by the TTS to make the speech sound more natural. This project looks hackable for that purpose.
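To make the idea concrete, here’s a minimal sketch of the split I mean, assuming you’ve prompted the model to emit SSML-style tags inline (the function name and the sample output are mine, not from any existing project): the tagged string goes to the TTS engine, and a stripped copy goes to the chat UI.

```python
import re

def split_for_display_and_tts(model_output: str) -> tuple[str, str]:
    """Split model output containing embedded SSML into plain text
    for the UI and an SSML document for the speech synthesizer."""
    # The tagged payload is wrapped in <speak> and sent to the TTS unchanged.
    ssml = f"<speak>{model_output}</speak>"
    # Strip the tags and collapse leftover whitespace so the UI shows clean text.
    display = re.sub(r"<[^>]+>", "", model_output)
    display = re.sub(r"\s+", " ", display).strip()
    return display, ssml

out = ('<prosody rate="slow" pitch="low">I am sorry for your loss.</prosody> '
       '<break time="500ms"/> Is there anything I can do?')
display, ssml = split_for_display_and_tts(out)
print(display)  # plain text for the chat window
print(ssml)     # prosody-aware markup for the synthesizer
```

The nice part is that nothing downstream has to change: the UI never sees the tags, and any SSML-capable TTS engine can consume the other half as-is.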