They don't render the actions as text and then tokenize that text; the model generates the action tokens directly. So action token 128 doesn't necessarily correspond to the tokenization of the number 128 as it would appear in text input. (PaLI-X is the exception: there they make use of the fact that every integer up to 1000 has its own unique token, and they use those tokens for the actions. For PaLM-E, they instead hijack the 256 least frequently used tokens.)
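
To make the difference concrete, here is a rough sketch of the two mappings on a toy vocabulary. Everything in it is made up for illustration: the function names, the token ids, the frequency counts, and the assumption that there are exactly 256 action bins (one per hijacked token). It is not the actual implementation, just the idea as described above.

```python
NUM_BINS = 256  # assumed number of discrete action bins (one per hijacked token)


def integer_token_map(vocab, num_bins=NUM_BINS):
    """PaLI-X-style: action bin i reuses the existing token for the integer i."""
    return {i: vocab[str(i)] for i in range(num_bins)}


def least_frequent_token_map(token_counts, num_bins=NUM_BINS):
    """PaLM-E-style: action bin i takes over one of the num_bins rarest tokens."""
    # Sort by (count, token id) so ties are broken deterministically.
    rarest = sorted(token_counts, key=lambda t: (token_counts[t], t))[:num_bins]
    return {i: tok for i, tok in enumerate(rarest)}


# Hypothetical vocabulary: the strings "0".."999" each have their own token id.
toy_vocab = {str(i): 5000 + i for i in range(1000)}

# Hypothetical frequency table (token id -> corpus count); higher ids are rarer.
toy_counts = {tok_id: 10_000 - tok_id for tok_id in range(10_000)}

pali_map = integer_token_map(toy_vocab)
palm_map = least_frequent_token_map(toy_counts)

print(pali_map[128])  # 5128: the same id the text "128" would get
print(palm_map[128])  # 9871: an unrelated rare token, nothing to do with "128"
```

In the first mapping, action 128 and the text "128" share a token; in the second, action 128 lands on whatever rare token happened to be in that slot.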