An LLM is an autoregressive generator. It can only take the output of the execution as input and generate tokens based on it. You got the impressionistic, refracted-through-water version of the output.
I'd argue that if the system cannot be trained to output the results of system calls faithfully (this was not an adversarial test), then it has simply been badly trained. But even setting that aside, as you mention, the system call's execution output becomes prompt input. There is no excuse for it not to be rendered properly, except in the case where the LLM is allowed to fake grounded output (or to process false system input).

Again, ignoring poor tuning/training: why isn't that output filtered? Why isn't grounded output rendered distinctly? Remember, this is a non-adversarial scenario - no token smuggling or other hijinks. Defending against the adversarial case is an unsolved problem, but rendering grounded output faithfully in the benign case is not; it's a very basic design choice.
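To make that design choice concrete, here is a minimal sketch of the pipeline shape I mean. Every name in it (run_llm, run_tool, Segment) is a hypothetical stand-in, not Bard's actual architecture or any real API: the raw system-call output is kept out of the model's generation path and rendered verbatim with a "grounded" tag, while the model only contributes commentary around it.

```python
# Sketch: keep tool/system-call output out of the model's token stream for
# rendering; show it byte-for-byte, marked as grounded, and let the model
# only add commentary. All names here are hypothetical illustrations.

from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    grounded: bool  # True = came straight from the tool call, render verbatim


def run_llm(prompt: str) -> str:
    """Stand-in for a model call; a real system would hit an LLM endpoint."""
    return f"(model text for: {prompt[:40]}...)"


def run_tool(call: str) -> str:
    """Stand-in for executing a system/tool call and capturing its raw output."""
    return f"(raw output of {call})"


def answer(question: str) -> list[Segment]:
    call = run_llm(f"Pick a tool call for: {question}")
    raw = run_tool(call)  # ground truth; never re-generated by the model
    commentary = run_llm(f"Comment on this result without restating it:\n{raw}")
    return [
        Segment(commentary, grounded=False),  # model prose, may be unreliable
        Segment(raw, grounded=True),          # shown verbatim, styled distinctly in the UI
    ]


if __name__ == "__main__":
    for seg in answer("What files are in /tmp?"):
        tag = "[grounded]" if seg.grounded else "[model]  "
        print(tag, seg.text)
```

The point isn't the specific plumbing; it's that the UI can always distinguish "this text is the actual execution result" from "this text is the model talking about it", and nothing in the non-adversarial case prevents doing so.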
That the Bard team keeps spending its time adding "features" while never fixing the basic problem of presenting trustworthy output is pretty confounding to me. Since launch (I've been trying it out since March), it's consistently been worse than the competition, and it seems to be falling further behind as time goes on. ¯\_(ツ)_/¯