If you classify, say, 50% of people as AIs and 50% of AIs as people, then those AIs passed the Turing test.
So, you are limited to questions that most people will answer correctly. Further, if you find some unusual question that works today someone can just add it to the program for next time.
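To make the 50% criterion concrete, here is a minimal sketch of how a judge's verdicts could be scored. The function name `judge_accuracy` and the verdict data are made up for illustration; the point is only that misclassifying half of each group leaves the judge at chance level, which is no better than guessing.

```python
def judge_accuracy(verdicts):
    """Fraction of (truth, guess) pairs where the judge guessed correctly."""
    correct = sum(1 for truth, guess in verdicts if truth == guess)
    return correct / len(verdicts)


# Hypothetical verdicts: 4 humans and 4 AIs, half of each group misclassified.
verdicts = [
    ("human", "human"), ("human", "human"), ("human", "ai"), ("human", "ai"),
    ("ai", "ai"), ("ai", "ai"), ("ai", "human"), ("ai", "human"),
]

accuracy = judge_accuracy(verdicts)
print(accuracy)  # 0.5, i.e. the judge does no better than a coin flip
```

With a real panel you would want many more trials before reading anything into a number near 0.5.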
The original test specifically had exactly one human and one AI. So if the judge is forced to flip a coin, that really is success. If the judge flips a coin because they are lazy, then that's not a Turing test.
Did you check the link about the Winograd schema challenge? The questions test common sense reasoning, and are very easy for humans to answer. An example:
The trophy doesn't fit into the brown suitcase because it's too large. What is too large?
A: The trophy
B: The suitcase