It's in fact using Silero via RealtimeSTT. RealtimeSTT signals when silence starts. Then a binary sentence classification model runs on the realtime transcription text; it infers blazingly fast (~10 ms) and returns a probability between 0 and 1 indicating whether the current spoken sentence is "complete". The turn detection component uses that probability to calculate how long to wait in silence before the turn is considered over.
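Roughly, the logic looks something like the sketch below. This is not the actual implementation, just the idea: `sentence_complete_prob` stands in for the fast binary classifier, and `on_silence_started` / `is_still_silent` are hypothetical hooks standing in for whatever the VAD/RealtimeSTT side gives you.

```python
import time

MIN_WAIT = 0.3   # seconds of silence required when the sentence sounds complete
MAX_WAIT = 1.8   # seconds of silence required when the sentence sounds unfinished


def sentence_complete_prob(text: str) -> float:
    """Placeholder for the ~10 ms binary classifier: probability in [0, 1]
    that `text` is a finished sentence. Real version would be a small model."""
    return 1.0 if text.rstrip().endswith((".", "!", "?")) else 0.2


def silence_wait_time(transcript: str) -> float:
    """Map completeness probability to a silence timeout: the more complete
    the sentence looks, the less silence we demand before ending the turn."""
    p = sentence_complete_prob(transcript)
    return MAX_WAIT - p * (MAX_WAIT - MIN_WAIT)


def on_silence_started(transcript: str, is_still_silent) -> bool:
    """Called when the VAD reports silence. Returns True if the turn is over,
    False if the user resumed speaking before the timeout elapsed."""
    deadline = time.monotonic() + silence_wait_time(transcript)
    while time.monotonic() < deadline:
        if not is_still_silent():   # user started talking again, keep the turn open
            return False
        time.sleep(0.02)
    return True                     # silence outlasted the timeout -> turn is over
```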
This is the exact strategy I'm using for the real-time voice agent I'm building. LiveKit also published a custom turn detection model that, based on the video they released, works really well, which was cool to see.