You cannot heavily quantise a model and assume it hasn't degraded significantly.
To run it properly you need far more hardware than a Mac Studio, and the comparison needs to be done at least somewhat rigorously, not with a handful of random prompts: anything served as a black box will "cheat", in the sense of being fine-tuned to score well on popular benchmarks.
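For what "somewhat rigorously" could look like in practice, here is a minimal sketch of a perplexity comparison on a shared held-out text, measuring a full-precision load against a 4-bit load of the same checkpoint. The model name, the WikiText slice, and the bitsandbytes 4-bit config are illustrative assumptions, not a claim about how anyone actually ran these models; perplexity is also only one axis of quality, but it at least catches degradation that cherry-picked prompts hide.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def perplexity(model, tok, texts, max_length=1024):
    """Average per-token perplexity of a causal LM over a fixed set of texts."""
    model.eval()
    device = next(model.parameters()).device
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tok(text, return_tensors="pt", truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # out.loss is the mean NLL per token
        n = enc["input_ids"].numel()
        total_nll += out.loss.item() * n
        total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

# Same held-out text for both runs, so the comparison is apples to apples.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
texts = [t for t in ds["text"] if t.strip()][:200]

# "some/base-model" is a placeholder checkpoint; the 4-bit load assumes a GPU and bitsandbytes.
name = "some/base-model"
tok = AutoTokenizer.from_pretrained(name)
fp16 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
q4 = AutoModelForCausalLM.from_pretrained(
    name, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto"
)

print("fp16 perplexity :", perplexity(fp16, tok, texts))
print("4-bit perplexity:", perplexity(q4, tok, texts))
```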
The output is unremarkable; it’s not significantly better than the 13B model for most uses.
GPT-3.5 is at least an order of magnitude better.