When a user starts recording, the WebSocket connection continuously streams audio and receives candidate transcriptions. You can also watch this process live in the logs. Whenever the program detects a word match within the currently active window, it moves the current position to that match and then revisits previously processed words to check for any updates.
A drawback of this method is that if a user mispronounces every word, the position never advances, and the user might think the program is simply lagging. But in my experience, basing the current word position on the length of the transcription alone is too chaotic.
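To make the window idea concrete, here's a minimal sketch of that matching step. This is my reconstruction, not the actual code: the function name, the window size, and the exact-match comparison are all assumptions; a real implementation would presumably use fuzzier matching against the recognizer's output.

```python
def advance_position(reference, transcript_words, position, window=5):
    """Hypothetical sketch of window-based matching.

    reference        -- list of words in the text the user is reading
    transcript_words -- latest batch of words from the speech recognizer
    position         -- index of the current word in `reference`
    """
    matched = set()
    for spoken in transcript_words:
        # Only consider words inside the currently active window.
        lo, hi = position, min(position + window, len(reference))
        for i in range(lo, hi):
            if reference[i].lower() == spoken.lower():
                matched.add(i)
                # Jump the current position to the matched word; earlier
                # words get revisited on the next transcription update.
                position = i
                break
    return position, matched
```

If nothing in the transcription matches anything in the window, `position` stays put, which is exactly the "it won't move forward" failure mode described above.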
Or did you ask about the general idea of the site? If so, it's for automatic pronunciation checking. Many apps do this with predefined words and phrases; with this site a user can check and practice any text. I think it's a good fit for catching pronunciation mistakes while preparing a public speech or recording audio or video for the internet. It's aimed primarily at non-native speakers, but native speakers may find it useful for polishing their enunciation as well.