How well are you able to handle live speaker diarization? I've been tinkering with building similar solutions, but unless the speakers have been labeled beforehand, things tend to go haywire once you have multiple speakers and crosstalk.
You could hold a short introduction session before the live translation, where each speaker says a few words like “hi, I am John”.
These samples can then be used as reference voices to identify the current speaker.
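A rough sketch of how the intro-session idea could work, assuming you already have some model that turns a chunk of audio into a fixed-length speaker embedding (that part is not shown, and the vectors below are made up). The names `enroll`, `identify`, and the `threshold` value are hypothetical, and cosine similarity against an averaged profile is just one common way to do the matching, not necessarily what any given product does.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def enroll(intro_embeddings):
    """Build one averaged profile per speaker from the intro session.

    intro_embeddings maps a speaker name to a list of embedding vectors
    (assumed to come from an external speaker-embedding model).
    """
    profiles = {}
    for name, vectors in intro_embeddings.items():
        if not vectors:
            continue  # no intro audio for this speaker, skip
        dim = len(vectors[0])
        profiles[name] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return profiles


def identify(profiles, embedding, threshold=0.7):
    """Return (name, score) of the closest enrolled speaker.

    name is None when no profile reaches the threshold (an assumed,
    untuned value), e.g. for an unknown voice or heavy crosstalk.
    """
    best_name, best_score = None, -1.0
    for name, profile in profiles.items():
        score = cosine(profile, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy 2-D vectors standing in for real embeddings.
profiles = enroll({
    "John": [[1.0, 0.0], [0.9, 0.1]],
    "Mary": [[0.0, 1.0], [0.1, 0.9]],
})
name, score = identify(profiles, [0.8, 0.2])
```

Returning `None` below the threshold gives you a place to handle crosstalk or a new voice explicitly instead of forcing every chunk onto an enrolled speaker.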