
The math is correct, but I think the model used is not right, since it doesn't reflect that the variable s is dichotomous; a mixed model would be more appropriate. If we insist on treating s as continuous, we could think of this example: s = state is encoded as a continuous variable between -1 and 1, where people change state frequently; s = -1 means the person will vote in the blue state with probability 1, s = 1 means the person will vote in the red state with probability 1, and s = 0 means the person is equally likely to vote in either state. When s is near zero the model cannot predict the voter's preference, and that is the reason for the low predictive power of this model with a continuous s. The extreme cases s = -1 and s = 1 could be rare for populations that move from one state to the other frequently, so the initial intuition is misled into this paradox.





This.

R2 is not the correct measure to use.

This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful.


R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

Does another measure give substantially different results?


I think you are using a different definition of R^2 here. For example, the way you are thinking of R^2 doesn't leave room for the constant term used in the linear model, which is needed for the usual R^2 formula to hold. What you seem to have in mind is R^2 = 1 - mean(variance within each state)/(total variance), but that is not the definition of R^2 for a linear model.

As the user fskfsk.... says in another comment, here the constant term explains a lot of the variance, so the slope term carries less information; that distinction is not visible with your definition or idea of R^2.


> I think that you are using here a different definition of R^2

Different from what?

According to wikipedia:

The most general definition of the coefficient of determination is R^2 = 1 - SS_res / SS_tot ( = 1 - 0.2475 / 0.25 = 0.01 in this case)

Edit to clarify the definition above:

SS_res is the sum of squares of residuals (also called the residual sum of squares) ∑( y_i - predicted_i )^2

SS_tot is the total sum of squares (proportional to the variance of the data) ∑( y_i - ∑y_i/N )^2


The most general definition of R^2 can produce a negative result, and we are talking about a paradox related to the values of R^2 one should expect, which is why linear models and linear regression are the common setting. I don't know whether the variance of the total population can be computed as the sum of the variances in each state, and state is not a continuous variable.

The population variance is the sum of the between-group variance and the within-group variance, with each group weighted by its number of elements.


I don't understand what you mean. I'll just note that the value of R² in this case is 1% as the blog post explains and the code below confirms.

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  > summary(lm(pref ~ state, data = data))$r.squared
  [1] 0.01

The math is correct; I am referring to your comment:

> R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

I may be reading too much into your comment, but it seems that you relate R^2 to the reduction in the prediction error within each state, i.e. that you are computing R^2 from (average variance within each state)/(total variance). I think that is not correct in general, since at the very least it would require the total variance to be the sum of the variances in each state. If you based your intuition on that formula, then the intuition is not correct; that is my point. When I apply R^2 I am thinking of a multivariable linear model with continuous variables, and this is not such a case. I would rather measure this problem by how much the entropy changes when we add the information about the state, something like the cross entropy between the total distribution and the distribution by states.
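A rough sketch of what I mean, assuming the 9/11 split within each state from the post:

  entropy <- function(p) -p*log2(p) - (1-p)*log2(1-p)
  entropy(0.5)                  # 1 bit: entropy of the preference ignoring the state
  entropy(0.45)                 # ~0.993 bits: conditional entropy within each state
  entropy(0.5) - entropy(0.45)  # ~0.007 bits gained by knowing the state - also tiny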


When I wrote "In this case it measures the relative reduction in MSE" I meant exactly that.

The mean squared error of the baseline model, which doesn't include the state as a regressor, is 0.25 (it always predicts 0.5, so it's off by 0.5 in every case).

The mean squared error of the model which includes the state as a regressor is 0.2475 (it predicts 0.45 or 0.55 depending on the state - in both cases it's off by 0.45 with 55% probability and it's off by 0.55 with 45% probability).

The mean squared error is directly related to variance when the predictor is unbiased. The ratio of the sum of squares is the same as the ratio of the mean square errors.
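For concreteness, a minimal check of those two numbers in R, reusing the example data frame from my earlier comment:

  mse_baseline <- mean((data$pref - mean(data$pref))^2)   # 0.25
  fit <- lm(pref ~ state, data = data)
  mse_model <- mean(residuals(fit)^2)                     # 0.2475
  1 - mse_model / mse_baseline                            # 0.01, the same value as R^2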

Edit: http://brenocon.com/rsquared_is_mse_rescaled.pdf

"R2 can be thought of as a rescaling of MSE, comparing it to the variance of the outcome response."

https://dabruro.medium.com/you-mention-the-average-squared-e...

"Also it is worth mentioning that R-squared (coeff. of determination) is a rescaled version of MSE such that 100% is perfection and 0% implies the same MSE that you would get by simply always predicting the overall mean of the dataset."


Let d1 = data[data$state==0, ] and d2 = data[data$state==1, ]; then var(d1$pref) = 0.26, var(d2$pref) = 0.26 and var(data$pref) = 0.256 (using R and one of your data frames). So the intuition is that knowing the state does not give much information about the preferences of the voters, which suggests that any model based on state should give poor results, and having R^2 = 1% is not a big paradox in this case.

There must be a formula to compute R^2 from the variances between states and within states, but anyway, when the variances inside each state are bigger than the total variance, that should imply that the feature dividing the population into groups is of little value for prediction, so it should have a small R^2 value.
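If it helps, a sketch of such a formula with the example data frame from the other comment (the usual ANOVA decomposition, R^2 = SS_between/SS_total):

  grand_mean  <- mean(data$pref)
  group_means <- tapply(data$pref, data$state, mean)       # 0.45 and 0.55
  ss_total    <- sum((data$pref - grand_mean)^2)           # 10
  ss_between  <- sum(20 * (group_means - grand_mean)^2)    # 0.1
  ss_between / ss_total                                    # 0.01 = R^2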


That seems more or less what I said in the comment you replied to: the prediction of individual votes remains quite bad even if the state is taken into account. That's why the relative reduction in MSE is low. That's why the R² is low. I don't think there is any paradox.

I was replying to someone who claimed that "R2 is not the correct measure to use. This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful." I've not seen any comment from anyone getting "different results" with a different measure.

Edit: You used var(...), which includes a factor N/(N-1) and so doesn't give exactly the total sum of squares divided by N.

The example data frame contains 40 observations (20 per state) and you get a higher variance estimate for the subsamples than for the aggregate sample, but if you put together a few copies of the data (for example doing "data <- rbind(data, data, data, data, data)") even the adjusted (unbiased) estimator of the variance is lower within the states.

You can calculate the "exact" values yourself by doing mean((x - mean(x))^2) or by undoing the adjustment:

  > var(data$pref)*39/40
  [1] 0.25
  > var(data[data$state==0, "pref"])*19/20
  [1] 0.2475
  > var(data[data$state==1, "pref"])*19/20
  [1] 0.2475

> when the variances inside each state are bigger than the total variance

They are not. But you're right in that a small difference shows that dividing the population in groups is of little value for prediction and that's why the R^2 value is small.


The correction factor n/(n-1) in R is what explains my paradox about the law of total variance Var(Y) = E(Var(Y|X)) + Var(E(Y|X)); I was obtaining results that didn't match this formula because I corrected all the variances with the factor 20/19, but the total variance should have the factor 40/39, just as you pointed out. Thanks for the comments and the correction.
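For the record, a quick check that the identity works out exactly once the divide-by-N (population) variances are used, reusing the example data frame:

  pvar <- function(x) mean((x - mean(x))^2)          # population (divide-by-N) variance
  pvar(data$pref)                                    # 0.25   = Var(Y)
  mean(tapply(data$pref, data$state, pvar))          # 0.2475 = E(Var(Y|X))
  pvar(tapply(data$pref, data$state, mean))          # 0.0025 = Var(E(Y|X)); 0.2475 + 0.0025 = 0.25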

I just added another comment that relates analysis of variance to this post to show that there is no real paradox here.

Finally, the formula for the total variance above is related to my intuition that having some information (the data for each state) should make the mean of the within-group variances smaller than the total variance, because variance is related to lack of information. But analysis of variance suggests (see my other comment) that the state factor is not informative, because of the high variance within each group (each state) and the small difference between the group means and the total mean.
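By analysis of variance I mean something like this sketch (again with the example data frame):

  anova(lm(pref ~ factor(state), data = data))
  # F is about 0.38 on 1 and 38 degrees of freedom: the between-state difference
  # in means is small compared with the within-state variance, so the state
  # factor adds very little.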


A mixed model is not relevant here. A simple linear regression with one variable will achieve exactly the same results. Coding the state as -1 and 1 makes no difference compared to coding it as 0 and 1; you just stuff the rest into the intercept.

You would also want to be predicting 0.45 and 0.55, not 1 and 0, because we solve for squared error.
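A quick sketch of the recoding point, assuming the same example data frame used elsewhere in the thread:

  summary(lm(pref ~ state, data = data))$r.squared       # 0.01 with state coded 0/1
  data$state_pm <- ifelse(data$state == 0, -1, 1)
  summary(lm(pref ~ state_pm, data = data))$r.squared    # still 0.01 with state coded -1/1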



