I'll add my own Bayesian analysis to the fray. Assuming a binomial likelihood with a uniform prior, in Julia:
using Distributions
# Posterior for each variant's conversion rate: Beta(successes + 1, failures + 1),
# i.e. a binomial likelihood under a uniform Beta(1, 1) prior.
b_old = Beta(66+1, 6392-66+1)
b_yel = Beta(83+1, 6362-83+1)
N = 1000000
# Sample from both posteriors; the fraction of draws where the old variant
# wins estimates P(old > yellow).
sum(rand(b_old, N) .> rand(b_yel, N)) / N
This yields about a 7.7% chance that the old one is better than "yellow". It's fascinating to see how we can get such different answers to a simple question.
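If you want more than a single number, the same posteriors give 95% credible intervals for each conversion rate directly (a quick sketch using Distributions.jl's quantile; the endpoints in the comments are approximate):

ci_old = (quantile(b_old, 0.025), quantile(b_old, 0.975))  # roughly (0.008, 0.013)
ci_yel = (quantile(b_yel, 0.025), quantile(b_yel, 0.975))  # roughly (0.010, 0.016)

The overlap of those intervals is another way of seeing why the samples disagree 7.7% of the time.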
I once ran an A/B test on an older framework that didn't automate statistical significance at all, but the website was getting 2,000-3,000 orders per day, so after a single week we had enough data to see that sales had increased by 36% (we had cut the page's load time almost in half, switched the checkout to Ajax, and made a few other small changes) without any need to quantify things. In fact, at the time I didn't even know what "statistical significance" was... not that I know much more about statistics now than I did then.
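For what it's worth, at that volume the math would have backed up the gut call. Here's the same Beta-posterior comparison as above, reusing N and Distributions from the snippet, with made-up week-long counts (the visitor and order figures are purely illustrative, not my actual data):

b_ctrl = Beta(7000+1, 100000-7000+1)   # hypothetical: 7,000 orders from 100k visitors
b_var = Beta(9520+1, 100000-9520+1)    # hypothetical: a 36% lift on the variant
sum(rand(b_var, N) .> rand(b_ctrl, N)) / N   # comes out as essentially 1.0

At those sample sizes a 36% lift is dozens of standard errors wide, which is why "look at the bottom line" happened to work.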
Anyway, in theory the exact p-values and so on matter, but in practice the bottom line is all that matters, because a p-value can change in a moment based on something I haven't factored in. That's where, at least for the time being, intuition still plays a great part in actually being correct, which is why we still have people with repeated successes.