
The reason for the apparent paradox: a) in this case the model is a mixed model; b) the variables are nominal, so you have to select one of the pseudo R^2 measures. For more information: (1) Pseudo R-squared: https://en.wikipedia.org/wiki/Pseudo-R-squared (2) R squared for mixed models – the easy way: https://ecologyforacrowdedplanet.wordpress.com/2013/08/27/r-...

c) The R^2 used with a linear model requires a constant term; in this case the constant term (bias) explains a lot about preferences (almost 50/50), so there is less information available for the slope term.
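
For concreteness, a minimal R sketch of one such pseudo R^2 (McFadden's); this is my own illustration, not the article's calculation, and it assumes hypothetical data with 20 voters per state and 45%/55% preferences like the example discussed below:

  data <- data.frame(state = rep(c(0, 1), each = 20),
                     pref  = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  full <- glm(pref ~ state, family = binomial, data = data)  # logistic model with state
  null <- glm(pref ~ 1,     family = binomial, data = data)  # intercept-only model
  1 - as.numeric(logLik(full)) / as.numeric(logLik(null))    # McFadden's pseudo R^2, ~0.007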

Hope this helps.




@justk What you're talking about might make sense if there were more independent variables to consider, but in this case there's only one, state. So in fact you could say that there are two conditional linear models in the example; one for the first state (state=0), and one for the second (state=1). The model does the best job with the information available (state).


Sorry, I edited my post several times and finally chose a short form with links to other sources. If you fix state=1 then there are no more random variables, so the R^2 doesn't have any meaning. Just for fun, what should the model predict for state = 0.5? That would correspond to a person who is 50% in the red state and 50% in the blue state. I think a mixed model is appropriate here when the state variable is discrete, so that each value of the state variable represents a different part of the population. The other model should be used when people move a lot and frequently change the state where they vote, but in that case you would have to consider the fluctuations in the total population of each state at the time of voting.


@justk The R^2 value of 0.01 calculated on that webpage uses both states, not just one: the variance of the predicted values across both states is 0.05^2 = 0.0025, and dividing by the total variance of 0.25 gives the 0.01. I don't think it makes sense to use a mixed model in this case, since the variance is the same for each state: 0.55 × 0.45 = 0.45 × 0.55 = 0.2475 (a standard deviation of about 0.497 ≃ 0.5). A mixed model is used when the observations have some structured heteroskedasticity, i.e. different variances for different values of the independent variables.


> The R^2 used with a linear model requires a constant term, in this case the constant term or bias explains a lot about preferences (almost 50/50) so there is less information available for the slope term.

This explains the paradox, basically. When you take the null model "preference = 50%" (i.e. intercept-only model), there simply isn't much residual variance left for the linear model to explain.

That's why you get an R^2 = 1 if you use the "R^2 = rho(state, preference)^2" formula (you are ignoring the role of the intercept in explaining most of the variance, and exploiting the translation-invariance of the Pearson correlation), vs. getting an R^2 = 0.01 when you use the (more correct) "R^2 = explained variance / total variance" formula.

TL;DR: It makes sense to get a very low R^2 when it is the intercept and not the predictor that is explaining most of the variance.
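
A quick R sketch of the contrast (my own, assuming 20 voters per state with 45%/55% preferences, as in the code further down; the "rho^2" here is computed on the two state averages, which is where the perfect fit comes from):

  data <- data.frame(state = rep(c(0, 1), each = 20),
                     pref  = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  fit <- fitted(lm(pref ~ state, data = data))    # predicts 0.45 or 0.55
  cor(c(0, 1), c(0.45, 0.55))^2                   # rho^2 on the state averages: 1
  mean((fit - mean(data$pref))^2) /
    mean((data$pref - mean(data$pref))^2)         # explained / total variance: 0.01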


> there simply isn't much residual variance left for the linear model to explain.

I'd say that there is still quite a lot of residual variance to explain. You need a baseline: the worst choice would be to always predict 0 (or 1), giving a mean squared error of 0.5. Using 0.5 as the baseline halves the mean squared error to 0.25.
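
A tiny R check of those two baselines (my own sketch, assuming an overall 50/50 split of votes):

  pref <- c(rep(0, 20), rep(1, 20))   # hypothetical votes, half each way
  mean((pref - 0)^2)                  # always predict 0: MSE = 0.5
  mean((pref - 0.5)^2)                # always predict 0.5: MSE = 0.25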


This, of course, will depend on how you code your variables, but if you try to fit a null, intercept-only and predictor-only model, you get this as residual variance:

  > data <- data.frame(state = c(0, 1), pref = c(0.45, 0.55))

  > sum(residuals(lm(pref ~ 0, data = data))^2) # null model
  [1] 0.505

  > sum(residuals(lm(pref ~ 1, data = data))^2) # intercept-only model
  [1] 0.005

  > sum(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model
  [1] 0.2025

So, it seems clear that you only get a "perfect" prediction with the full (intercept + predictor) model mostly because of the intercept (which explains (0.505-0.005)/0.505 = 0.99 = 99% of the variance).

Thus, it makes sense that the predictor is only explaining the rest (i.e. 1%) of the variance... hence, the R^2 = 0.01


This is a little strange: you are using a data.frame with only two points, so any linear model with two free parameters will be 100% accurate. It's just the line that connects the two points.


An affine model, yes (as I mentioned in my comment). A linear model, no ;)

But, anyway, seems like I interpreted things incorrectly.


Your calculation is not directly related to the model (and associated R²) discussed in the article, which is about the prediction of individual votes using the state as predictor - not state averages using the state as predictor.

Maybe I'm completely missing your point, but adapting your code (I think you meant mean where you wrote sum), the calculations in the blog post are:

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))

  > mean(residuals(lm(pref ~ 0, data = data))^2) # null model [NOT IN THE BLOG POST]
  [1] 0.5

  > mean(residuals(lm(pref ~ 1, data = data))^2) # BASELINE intercept-only model
  [1] 0.25

  > mean(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model [NOT IN THE BLOG POST]
  [1] 0.34875

  > mean(residuals(lm(pref ~ state, data = data))^2) # MODEL
  [1] 0.2475

  > summary(lm(pref ~ state, data = data))$r.squared # MODEL
  [1] 0.01

The blog post is about what you call the "intercept-only" model (MSE 0.25) and the full model (MSE 0.2475); the R² is (0.25-0.2475)/0.25 = 0.01. His calculation is slightly different: instead of 0.25-0.2475 he computes 0.05^2 directly, which is the variance of the predictions (in this case the total variance 0.25 can be decomposed as the variance of the errors, 0.2475, plus the variance of the predictions, 0.0025).
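
Continuing from the code above, the decomposition can be checked numerically (my addition, same data as before):

  m <- lm(pref ~ state, data = data)
  mean((data$pref - mean(data$pref))^2)   # total variance: 0.25
  mean(residuals(m)^2)                    # variance of the errors: 0.2475
  mean((fitted(m) - mean(data$pref))^2)   # variance of the predictions: 0.0025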


(After re-reading the blog post with more care...) you are right, and thanks for the correction.

Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50, as you demonstrate with your code.

To me, this doesn't seem paradoxical... the predictor is indeed providing little information over the "let's flip a coin to predict someone's voting preference" null/baseline predictor, since people's preferences (in aggregate) are almost equivalent to "flipping a coin".

note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares


> Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50

Yes, I think we don't disagree. I was just puzzled by the "little variance left to explain" remark.

> note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares

You're right, the sum of squares makes sense if it's just for the ratio.


Thanks for taking the time to clarify my confusion.

It's not that there is "little variance left to explain", but actually that (no matter what) there will always be too much variance left to be explained, when the response is Bernoulli-distributed and the parameter is not too far from 0.5 (i.e., the data generating process is like flipping a slightly loaded coin).

If you use the expected value to predict the Bernoulli variable, you will always be somewhat wrong (0.45 and 0.55 are both far from 0 and from 1, which are the only possible responses).

If you use a binary response to predict, you will quite often be very wrong, even if you are right on average, and even if your prediction is to generate Bernoulli-distributed samples from the exact same distribution (i.e., you know exactly how the coin is loaded/biased and you can exactly replicate its data generation process).
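
A small simulation sketch of that last point (mine, with an arbitrary sample size):

  set.seed(1)
  n <- 1e5
  y <- rbinom(n, 1, 0.55)            # true votes from a 55% "coin"
  mean((y - 0.55)^2)                 # predict the expected value: MSE ~ 0.2475
  mean((y - rbinom(n, 1, 0.55))^2)   # predict by flipping the same coin: MSE ~ 0.495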

So... yeah... no "paradox" ;)



