
(Who came up with R squared anyway? Seems like someone wanted to remove the possible minus sign from R but didn't know about the ABS function, so he used the square, which has the side effect of making the numbers too small.)



The reason for the apparent paradox: a) in this case the model is a mixed model; b) the variables are nominal, so you have to select one of the pseudo-R^2 measures. For more information: (1) Pseudo R-squared: https://en.wikipedia.org/wiki/Pseudo-R-squared (2) R squared for mixed models – the easy way: https://ecologyforacrowdedplanet.wordpress.com/2013/08/27/r-...
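
A minimal sketch of point (b), computing McFadden's pseudo-R^2 for a logistic regression on the blog post's data (my own illustration, not taken from the linked pages):

  data <- data.frame(state = rep(c(0, 1), each = 20),
                     pref  = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  full <- glm(pref ~ state, family = binomial, data = data)
  null <- glm(pref ~ 1,     family = binomial, data = data)
  1 - as.numeric(logLik(full)) / as.numeric(logLik(null))  # McFadden's pseudo-R^2, also close to 0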

c) The R^2 used with a linear model requires a constant term. In this case the constant term (bias) explains a lot about preferences (almost 50/50), so there is less information available for the slope term.

Hope this helps.


@justk What you're talking about might make sense if there were more independent variables to consider, but in this case there's only one: state. So in fact you could say that there are two conditional linear models in the example: one for the first state (state=0) and one for the second (state=1). The model does the best job with the information available (state).


Sorry, I edited my post several times and finally chose a short form with links to other sources. If you fix state=1 then there are no more random variables, so the R^2 doesn't have any meaning. Just for fun, what should the model predict for state = 0.5? That would correspond to a person who is 50% in the red state and 50% in the blue state. I think a mixed model is appropriate here when the state variable is discrete, so that each value of the state variable represents a different part of the population. The other model should be used when people move a lot and frequently change the state where they vote, but in that case you would have to consider the fluctuations in the total population of each state at the time of voting.


@justk The R^2 value of 0.01 calculated on that webpage uses both states, not just one. The variance of the individual votes within each state is 0.55·0.45 = 0.45·0.55 = 0.2475, i.e. a standard deviation of about 0.497 ≈ 0.5. I don't think it makes sense to use a mixed model in this case since the variance is the same for each state. A mixed model is used when the observations have some structured heteroskedasticity, i.e. different variances for different values of the independent variables.
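
A quick check of those numbers in R (my own sketch):

  p <- c(0.45, 0.55)   # within-state vote probabilities
  p * (1 - p)          # 0.2475 in both states: the variance is the same
  sqrt(p * (1 - p))    # ≈ 0.497 ≈ 0.5 in both states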


> The R^2 used with a linear model requires a constant term, in this case the constant term or bias explains a lot about preferences (almost 50/50) so there is less information available for the slope term.

This explains the paradox, basically. When you take the null model "preference = 50%" (i.e. intercept-only model), there simply isn't much residual variance left for the linear model to explain.

That's why you get an R^2 = 1 if you use the "R^2 = rho(state, preference)^2" formula (you are ignoring the role of the intercept in explaining most of the variance, and exploiting the translation-invariance of the Pearson correlation) vs. getting an R^2 = 0.01 when you use the (more correct) "R^2 = explained variance / total variance" formula.

TL;DR: It makes sense to get a very low R^2 when it is the intercept and not the predictor that is explaining most of the variance.


> there simply isn't much residual variance left for the linear model to explain.

I'd say that there is still quite a lot of residual variance to explain. You need a baseline: the worst choice would be to always predict 0 (or 1), for which the mean squared error would be 0.5. Using 0.5 as the baseline halves the mean squared error to 0.25.
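
A quick numeric check (a sketch; 40 hypothetical voters split evenly):

  pref <- c(rep(0, 20), rep(1, 20))  # half vote 0, half vote 1
  mean((pref - 0)^2)                 # always predict 0: MSE = 0.5
  mean((pref - 0.5)^2)               # predict the mean 0.5: MSE = 0.25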


This, of course, will depend on how you code your variables, but if you try to fit a null, intercept-only and predictor-only model, you get this as residual variance:

  > data <- data.frame(state = c(0, 1), pref = c(0.45, 0.55))

  > sum(residuals(lm(pref ~ 0, data = data))^2) # null model
  [1] 0.505

  > sum(residuals(lm(pref ~ 1, data = data))^2) # intercept-only model
  [1] 0.005

  > sum(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model
  [1] 0.2025

So, it seems clear that you only get a "perfect" prediction with the full (intercept + predictor) model mostly because of the intercept (which explains (0.505-0.005)/0.505 = 0.99 = 99% of the variance).

Thus, it makes sense that the predictor is only explaining the rest (i.e. 1%) of the variance... hence the R^2 = 0.01.


This is a little strange: you are using a data.frame with only two points, so any linear model with two free parameters will be 100% accurate. It's just the line that connects the two points.


An affine model, yes (as I mentioned in my comment). A linear model, no ;)

But, anyway, seems like I interpreted things incorrectly.


Your calculation is not directly related to the model (and associated R²) discussed in the article, which is about the prediction of individual votes using the state as predictor, not state averages using the state as predictor.

Maybe I'm completely missing your point, but the calculations in the blog post are, adapting your code (I think you meant mean where you wrote sum):

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))

  > mean(residuals(lm(pref ~ 0, data = data))^2) # null model [NOT IN THE BLOG POST]
  [1] 0.5

  > mean(residuals(lm(pref ~ 1, data = data))^2) # BASELINE intercept-only model
  [1] 0.25

  > mean(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model [NOT IN THE BLOG POST]
  [1] 0.34875

  > mean(residuals(lm(pref ~ state, data = data))^2) # MODEL
  [1] 0.2475

  > summary(lm(pref ~ state, data = data))$r.squared # MODEL
  [1] 0.01

The blog post is about what you call the "intercept-only" model (MSE 0.25) and the full model (MSE 0.2475); the R² is (0.25-0.2475)/0.25 = 0.01. His calculation is slightly different: instead of 0.25-0.2475 he calculates directly 0.05^2, which is the variance of the predictions (in this case the total variance 0.25 can be decomposed as the variance of the errors, 0.2475, plus the variance of the predictions, 0.0025).
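
The decomposition can be checked directly, continuing from the data frame above (my own sketch):

  m <- lm(pref ~ state, data = data)
  mean(residuals(m)^2)                   # variance of the errors: 0.2475
  mean((fitted(m) - mean(data$pref))^2)  # variance of the predictions: 0.05^2 = 0.0025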


(After re-reading the blog post with more care...) you are right, and thanks for the correction.

Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50, as you demonstrate with your code.

To me, this doesn't seem paradoxical... the predictor is indeed providing little information over the "let's flip a coin to predict someone's voting preference" null/baseline predictor, since people's preferences (in aggregate) are almost equivalent to "flipping a coin".

note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares


> Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50

Yes, I think we don't disagree. I was just puzzled by the "little variance left to explain" remark.

> note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares

You're right, the sum of squares makes sense if it's just for the ratio.


Thanks for taking the time to clarify my confusion.

It's not that there is "little variance left to explain", but rather that (no matter what) there will always be a lot of residual variance that cannot be explained when the response is Bernoulli-distributed and the parameter is not too far from 0.5 (i.e., the data-generating process is like flipping a slightly loaded coin).

If you use the expected value to predict the Bernoulli variable, you will always be somewhat wrong (0.45 and 0.55 are both far from 0 and from 1, which are the only possible responses).

If you use a binary response to predict, you will quite often be very wrong, even if you are right on average, and even if your prediction is to generate Bernoulli-distributed samples from the exact same distribution (i.e., you know exactly how the coin is loaded/biased and you can exactly replicate its data generation process).
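
A quick simulation of both cases (a sketch, assuming a Bernoulli parameter of 0.55):

  set.seed(1)
  y <- rbinom(1e5, 1, 0.55)
  mean((y - 0.55)^2)                  # predict the expected value: MSE ≈ 0.55*0.45 = 0.2475
  mean((y - rbinom(1e5, 1, 0.55))^2)  # predict with a draw from the same distribution: MSE ≈ 0.495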

So... yeah... no "paradox" ;)


Using just the ABS gives you LAD. Least absolute deviations is not used for a couple of reasons. LAD doesn't have a unique solution: you can code it up and your program will give you a set of solutions which all minimize the LAD. This looks like a feature, but for various pedagogical reasons statisticians insist on BLUE. LAD can give you an LUE, i.e. a linear unbiased estimator, but not the B, i.e. best. Also, LAD doesn't have a closed form, so you can't write an equation, say "this is the LAD", then differentiate it and derive interesting properties about the estimator. Since there's no closed form, you can use numerical methods to find something that's good enough, like error below 1e-6, and declare that to be the LAD. But that will happen at a bunch of points, so no uniqueness.
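
For what it's worth, LAD is easy to fit in practice; a minimal sketch assuming the quantreg package (median regression with tau = 0.5 minimizes the sum of absolute deviations):

  library(quantreg)
  set.seed(1)
  x <- runif(100)
  y <- 2 + 3 * x + rnorm(100)
  coef(lm(y ~ x))             # OLS: minimizes the sum of squared errors
  coef(rq(y ~ x, tau = 0.5))  # LAD: minimizes the sum of absolute errors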


I don't think it's about not knowing the abs function; it's more about the fact that the first derivative would be discontinuous and the second doesn't exist at 0. Variance has much nicer mathematical properties than absolute deviation.


You can solve L1 regression using linear programming at fantastically large scales. In fact, in many applications you do the opposite: go from squared to absolute error, because the latter fits into an LP.
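
A minimal sketch of the LP formulation (assuming the lpSolve package; any LP solver would do): minimize sum(u_i + v_i) subject to a + b*x_i + u_i - v_i = y_i with u, v >= 0, splitting the free coefficients a and b into nonnegative parts:

  library(lpSolve)
  x <- c(1, 2, 3, 4, 5)
  y <- c(1.1, 1.9, 3.2, 3.9, 10)  # one gross outlier
  n <- length(x)
  # variables: a+, a-, b+, b-, u_1..u_n, v_1..v_n
  obj <- c(0, 0, 0, 0, rep(1, 2 * n))
  con <- cbind(1, -1, x, -x, diag(n), -diag(n))
  fit <- lp("min", obj, con, rep("=", n), y)
  a <- fit$solution[1] - fit$solution[2]
  b <- fit$solution[3] - fit$solution[4]
  c(intercept = a, slope = b)  # the LAD line, largely unaffected by the outlier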


When you’re making a linear regression you’re minimizing the sum of squared errors, which is 1:1 with R2. That is, you’re getting the best possible R2 achievable with a line.

The reason we do that is because we are assuming the errors are normally distributed, and finding the slope that gets the best possible R2 is equivalent to figuring out how to fit the line with the maximum likelihood estimator of the error (aka the mean of the distribution).

So ultimately it’s about curves. If you wanted to get a sense for why this is strongly desirable you should try to fit a linear regression using absolute error instead of squared error.
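
A quick sketch of the equivalence (my own illustration): minimizing squared error gives the same line as maximizing the Gaussian likelihood.

  set.seed(1)
  x <- runif(50)
  y <- 1 + 2 * x + rnorm(50)
  coef(lm(y ~ x))  # least squares
  # maximize the Gaussian log-likelihood (sigma fixed at 1) numerically:
  optim(c(0, 0), function(b) -sum(dnorm(y, b[1] + b[2] * x, 1, log = TRUE)))$par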


Fun story: We once had a robot at a very big company you've all heard of that kept getting too close to walls when moving down hallway-like areas. The controls engineer swore he had tuned the controller to death and no further improvements could be had.

A perception engineer took one look at the controller and saw linear error terms for distance left + distance right (distance to walls), changed it to distance left^2 + distance right^2, and the whole thing magically worked beautifully. Exercise for the reader: what position in the hallway minimizes the sum of squared distances, vs. what position(s) in the hallway minimize the sum of distances without the square?

This is essentially the same problem you pose.
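
(Spoiler sketch for the exercise, with w for the hallway width and p for the distance from the left wall: p + (w - p) = w is the same everywhere, while p^2 + (w - p)^2 = 2*(p - w/2)^2 + w^2/2 has a unique minimum at the center.)

  w <- 2
  p <- seq(0, w, by = 0.25)
  cbind(p, linear = p + (w - p), squared = p^2 + (w - p)^2)  # linear cost is flat; squared cost dips at w/2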


Kind of funny but wouldn’t you want to just maximize(distance_left, distance_right) if you wanted the center?

Edit: no, derp, just walked into the same problem lol. Maximize the min should work though


That has a unique solution, so it is infinitely better than the linear error, but is not a nice differentiable signal suitable for controls, I'd bet. But that wasn't in the problem statement.


Seems like you'd want to minimize the distance to the midpoint, but I'm probably missing something.


Fun fact: that's what the sum of squares does


R squared has extremely nice properties. If you add a bunch of linearly uncorrelated variables together, then compute R² between each variable and the sum, the R² values will sum to 1.
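
A quick numeric check (a sketch; with finite samples the sum is only approximately 1):

  set.seed(42)
  X <- cbind(rnorm(1e5), 2 * rnorm(1e5), 3 * rnorm(1e5))  # uncorrelated variables, different scales
  y <- rowSums(X)
  sum(cor(X, y)^2)  # ≈ 1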


R (ρ) also has a nice property:

> if X,Y,Z are "generic" random variables such that ρ(X,Y)=a and ρ(Y,Z)=b, we should on average expect that ρ(X,Z)=ab.

https://www.lesswrong.com/posts/vfb5Seaaqzk5kzChb/when-is-co...
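
A quick simulation of the "generic" case (a sketch, assuming X and Z depend on each other only through Y):

  set.seed(1)
  n <- 1e6
  y <- rnorm(n)
  x <- 0.8 * y + sqrt(1 - 0.8^2) * rnorm(n)  # rho(x, y) ≈ 0.8
  z <- 0.5 * y + sqrt(1 - 0.5^2) * rnorm(n)  # rho(y, z) ≈ 0.5
  cor(x, z)                                  # ≈ 0.4 = 0.8 * 0.5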


The idea of using squares goes back to Carl Friedrich Gauss, who used least squares when calculating the orbit of Ceres. I suspect he was familiar with the idea of absolute value.


Not sure about the "R squared" name, but the use of this quantity as the "coefficient of determination" (the proportion of the variance in the dependent variable that is predictable from the independent variables) can be traced back to Sewall Wright's "Correlation and Causation" (1921).



