You should incorporate the metre of the music itself to help with identifying rhymes. For example, the middle chunk of your example misses part of the rhyming from 'reality' onwards: it doesn't catch 'he won't have it, he', where 'have it, he' falls on the same beat as the 'reality' before it.
I think the timing will help you a lot rather than just trying to notice punctuation. After all, music is about the rhythm.
The reason 'have it' wasn't matched is that the vowel phoneme in 'have' is "AE", whereas the example in the screenshot was matching combos with the successive vowels "AH - IH". So 'have' didn't match that scheme. I agree that these should match, and I have been throwing around the idea of matching similar-sounding phonemes together.
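To make the "similar-sounding phonemes" idea concrete, here is a minimal sketch. It is only one possible approach: the `SIMILAR_VOWELS` grouping and the function names are made up for illustration, and the groups themselves are a guess rather than anything phonetically rigorous.

```python
# Hypothetical similarity classes for ARPAbet vowels. The grouping below is
# an assumption made for illustration, not a phonetic standard.
SIMILAR_VOWELS = [
    {"AE", "AH", "AA"},
    {"IH", "IY"},
    {"EH", "EY"},
    {"UH", "UW"},
]


def vowel_class(phoneme):
    """Map a vowel phoneme to its similarity class, ignoring stress digits."""
    base = phoneme.rstrip("012")
    for index, group in enumerate(SIMILAR_VOWELS):
        if base in group:
            return index
    return base  # vowels outside every group only match themselves


def vowels_match(seq_a, seq_b):
    """True when two vowel sequences agree class by class."""
    if len(seq_a) != len(seq_b):
        return False
    return all(vowel_class(a) == vowel_class(b) for a, b in zip(seq_a, seq_b))


print(vowels_match(["AE", "IH"], ["AH", "IH"]))  # 'have it' vs. the AH - IH scheme
print(vowels_match(["OW", "IH"], ["AH", "IH"]))  # an unrelated vowel pair
```

With this grouping the first call prints `True` and the second `False`, so 'have it' would fall into the "AH - IH" scheme. How loose the groups should be is a judgment call: too loose and everything rhymes with everything.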
I completely agree though, the metre/rhythm plays a huge part in a poet's or rapper's flow and you don't get the full story without incorporating it. The issue is that I haven't found a way to programmatically pull the metre from a song, and rappers don't generally keep track of their metre, let alone put it online in a machine-readable format (though I'm sure you could find it for hugely popular songs like 'Lose Yourself'!).
I am interested in figuring this out though, and have been throwing around ideas for letting people generate the metre for songs themselves, with a tool that simply lets users match words to times in a song. I'm not sure how scalable that is, though, or how to make such a tool dead simple and fast to use, because otherwise it defeats the point. My hope is that there'd be a way to algorithmically parse the audio and look for inflection points where words might lie, but I've done no research towards that end. (Digression: this kind of tech would probably be useful for generating 'sing-a-longs'.)
Could you hack some sort of karaoke system? I mean, it ties word display to time, so you could use that alongside tying time to beats (DJ tools do this already) to link the words with the beats of the song. I would expect the timing data to be manually generated, but it'd give you a data set to work with for the more popular stuff.
That said, I'm not sure how fine-grained karaoke systems get - whether they just display the lyrics for a whole bar of the song and linearly interpolate between the start and end, or whether they are a bit more intelligent than that.
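As a rough sketch of what that linking could look like in the coarse case: assume all you have is a start and end time for a lyric line, plus a tempo and the time of the first downbeat (the kind of thing a DJ tool could give you). Everything named below (`interpolate_word_times`, `beat_position`, the 120 BPM figure, the timings) is hypothetical and only meant to show the arithmetic.

```python
def interpolate_word_times(words, line_start, line_end):
    """Spread words evenly across a line's time span (linear interpolation)."""
    if not words:
        return []
    step = (line_end - line_start) / len(words)
    return [(word, line_start + i * step) for i, word in enumerate(words)]


def beat_position(time_s, bpm, first_beat_s=0.0, beats_per_bar=4):
    """Convert a time in seconds to (bar index, beat within the bar)."""
    beats = (time_s - first_beat_s) * bpm / 60.0
    bar, beat = divmod(beats, beats_per_bar)
    return int(bar), beat


# Assumed inputs: a four-word line lasting two seconds at 120 BPM.
timed = interpolate_word_times(["he", "won't", "have", "it"], 0.0, 2.0)
for word, t in timed:
    print(word, beat_position(t, bpm=120))
```

Words from different lines that land on the same beat within their bar would then be candidates for a rhyme check. Real karaoke data with per-word or per-syllable timestamps would let you skip the interpolation step entirely, which matters because rappers rarely space their words evenly.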