How did you handle differences in personal scale? An A- from two people is not always on the same scale. If a person looking at these reviews has an internal scale different from that of the existing reviewers, they may not be making an optimal choice. Sometimes the issue is minimized simply by having a very large number of reviews.
A million years ago, during the "Netflix Challenge", I built a purely statistical model that briefly scored high enough to go onto their leaderboard. The (temporary) success of that model was largely due to handling this question reasonably.
I theorized that reviewers would have their own scale, both in median (maybe they only give 4s and 5s) and in range (a lot of people only give 1s and 5s, losing all the resolution of the middle). My solution was to continuously keep each reviewer's ratings normalized both to the centerpoint (each review would have a delta added to it to re-center) and to standard deviation (each review would be scaled so that the reviewer's overall stdev would be 1.0). The result is that everyone would look like a nice normal curve around 3.0.
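Roughly, the per-reviewer normalization described above could be sketched like this. It is a batch version for illustration, not the original model: the function name `normalize_ratings`, the use of the mean as the centerpoint, and the handling of a reviewer with zero spread are all assumptions.

```python
from statistics import mean, pstdev

def normalize_ratings(ratings, center=3.0, target_std=1.0):
    """Re-center and rescale one reviewer's ratings (hypothetical helper)."""
    mu = mean(ratings)        # assumption: mean used as the reviewer's centerpoint
    sigma = pstdev(ratings)   # population standard deviation of this reviewer
    if sigma == 0:
        # Assumption: a reviewer who always gives the same score carries no
        # spread information, so every rating maps to the common center.
        return [center for _ in ratings]
    # Add a delta to re-center, then scale so the reviewer's stdev is target_std.
    return [center + target_std * (r - mu) / sigma for r in ratings]

generous = [4, 5, 4, 5, 5, 4]   # only gives 4s and 5s
polar = [1, 5, 1, 5, 5, 1]      # only gives 1s and 5s

print(normalize_ratings(generous))
print(normalize_ratings(polar))
```

With these two made-up reviewers, both lists come out on the same scale, centered on 3.0 with a standard deviation of 1.0, even though the raw scores look nothing alike.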
This has a nice side effect of diluting ratings spam, too. If somebody spams 5s, they wind up being just average 3s. Even if they spam 1s and 5s, they're at least normalized to a lower spread.
Did I understand correctly? You are converting everyone's internal scale to a common scale. The conversion is parameterized by an additive delta such that the overall distribution remains normal with a standard deviation of 1.