Well-known paradox of R-squared is still buggin me (columbia.edu)





I don't see the problem. The R-squared is 0.01 for blue/red predicting individual votes, because both of the states in question are really just different shades of purple. The R-squared is 1.00 for predicting the total vote share and which party wins the state, because of course the red/blue binary completely determines those.

R-squared is very small when the effect is small because it's squared, as the name implies.

If he doesn't like that he should just use R by itself, which would turn that 0.01 into 0.1, and it'd turn that 0.16 into 0.4. R is the Pearson correlation coefficient of a univariable linear regression.


(Who came up with R squared anyway? Seems like someone wanted to remove the possible minus sign from R but didn't know about the ABS function, so he used the square which has this side effect of making the numbers too small.)

using just the ABS gives you the LAD. least absolute deviation is not used for a couple of reasons. LAD doesn’t have a unique solution - you can code it up and your program will give you a set of solutions which will all minimize the LAD. this looks like a feature, but for various pedagogical reasons, statisticians insist on BLUE. LAD can give you the LUE ie linear unbiased estimator, but not the B ie best. also, LAD doesn’t have a closed form. so you can’t write an equation and say this is the LAD, then differentiate it and derive interesting properties about the estimator. since there’s no closed form, you can use numerical methods to find something that’s good enough like error below 1e-6 and declare that to be the LAD. but that will happen at a bunch of points so no uniqueness.
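To make the non-uniqueness concrete, here is a minimal R sketch (my own toy numbers, nothing from the article): for the simplest possible fit, a single location parameter, every value between the two middle observations minimizes the absolute deviation, while the squared loss has a unique minimizer (the mean).

  y <- c(0, 1, 3, 10)                 # even number of observations
  sad <- function(b) sum(abs(y - b))  # sum of absolute deviations (LAD objective)
  sse <- function(b) sum((y - b)^2)   # sum of squared deviations (OLS objective)
  grid <- seq(0, 10, by = 0.25)
  grid[sapply(grid, sad) == min(sapply(grid, sad))]  # every b in [1, 3] minimizes the LAD
  grid[sapply(grid, sse) == min(sapply(grid, sse))]  # only b = 3.5 (the mean) minimizes the SSE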

The reason for the apparent paradox: a) in this case the model is a mixed model; b) the variables are nominal, so you have to select one of the pseudo-R^2 measures. For more information: (1) Pseudo R-squared: https://en.wikipedia.org/wiki/Pseudo-R-squared (2) R squared for mixed models – the easy way: https://ecologyforacrowdedplanet.wordpress.com/2013/08/27/r-...

c) The R^2 used with a linear model requires a constant term; in this case the constant term (bias) explains a lot about preferences (almost 50/50), so there is less information available for the slope term.

Hope this helps.


@justk What you're talking about might make sense if there were more independent variables to consider, but in this case there's only one, state. So in fact you could say that there are two conditional linear models in the example; one for the first state (state=0), and one for the second (state=1). The model does the best job with the information available (state).

Sorry, I edited my post several times and finally chose a short form with links to other sources. If you fix state=1 then there are no more random variables, so the R^2 doesn't have any meaning. Just for fun, what should the model predict for state = 0.5? That corresponds to a person who is 50% in the red state and 50% in the blue state. I think a mixed model is appropriate here when the state variable is discrete, so that each value of the state variable represents a different part of the population; the other model should be used when people move a lot and frequently change the state where they vote, but in that case you would have to consider the fluctuations in the total population of each state at the time of voting.

@justk The R^2 value of 0.01 calculated on that webpage uses both states, not just one: the variance of the predicted values across both states is 0.55^3+0.45^3 - (0.55^2+0.45^2) ≃ 0.497 ≃ 0.5. I don't think it makes sense to use a mixed model in this case since the variance is the same for each state. A mixed model is used when the observations have some structured heteroskedasticity, i.e. different variances for different values of the independent variables.

> The R^2 used with a linear model requires a constant term, in this case the constant term or bias explains a lot about preferences (almost 50/50) so there is less information available for the slope term.

This explains the paradox, basically. When you take the null model "preference = 50%" (i.e. intercept-only model), there simply isn't much residual variance left for the linear model to explain.

That's why you get an R^2 = 1 if you use the "R^2 = rho(state, preference)^2" formula (you are ignoring the role of the intercept in explaining most of the variance, and exploiting the translation-invariance of the Pearson correlation) vs. getting an R^2 = 0.01 when you use the (more correct) "R^2 = explained variance / total variance" formula.

TL;DR: It makes sense to get a very low R^2 when it is the intercept and not the predictor that is explaining most of the variance.


> there simply isn't much residual variance left for the linear model to explain.

I'd say that there is still quite a lot of residual variance to explain. You need a baseline - the worst choice would be to predict 0 (or 1) and the mean squared error would be 0.5. Using 0.5 as baseline halves the mean squared error to 0.25.


This, of course, will depend on how you code your variables, but if you try to fit a null, intercept-only and predictor-only model, you get this as residual variance:

  > data <- data.frame(state = c(0, 1), pref = c(0.45, 0.55))

  > sum(residuals(lm(pref ~ 0, data = data))^2) # null model
  [1] 0.505

  > sum(residuals(lm(pref ~ 1, data = data))^2) # intercept-only model
  [1] 0.005

  > sum(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model
  [1] 0.2025

So, it seems clear that you only get a "perfect" prediction with the full (intercept + predictor) model mostly because of the intercept (which explains (0.505-0.005)/0.505 = 0.99 = 99% of the variance).

Thus, it makes sense that the predictor is only explaining the rest (i.e. 1%) of the variance... hence, the R^2 = 0.01


This is a little strange: you are using a data.frame with only two points, so any linear model with two free parameters will be 100% accurate. It's just the line that connects the two points.

An affine model, yes (as I mentioned in my comment). A linear model, no ;)

But, anyway, seems like I interpreted things incorrectly.


Your calculation is not directly related to the model (and associated R²) discussed in the article which are about the prediction of individual votes using the state as predictor - not state averages using the state as predictor.

Maybe I'm completely missing your point but the calculations in the blog post are, adapting your code (I think you meant mean where you wrote sum):

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))

  > mean(residuals(lm(pref ~ 0, data = data))^2) # null model [NOT IN THE BLOG POST]
  [1] 0.5

  > mean(residuals(lm(pref ~ 1, data = data))^2) # BASELINE intercept-only model
  [1] 0.25

  > mean(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model [NOT IN THE BLOG POST]
  [1] 0.34875

  > mean(residuals(lm(pref ~ state, data = data))^2) # MODEL
  [1] 0.2475

  > summary(lm(pref ~ state, data = data))$r.squared # MODEL
  0.01
The blog post is about what you call the "intercept-only" model (MSE 0.25) and the full model (MSE 0.2475); the R² is (0.25-0.2475)/0.25 = 0.01. His calculation is slightly different: instead of 0.25-0.2475 he calculates 0.05^2 directly, which is the variance of the predictions (in this case the total variance 0.25 can be decomposed as the variance of the errors, 0.2475, plus the variance of the predictions, 0.0025).
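For completeness, that decomposition can be checked directly in R, continuing with the data frame defined above (my own addition, not code from the blog post):

  > fit <- lm(pref ~ state, data = data)
  > mean((data$pref - mean(data$pref))^2)   # total variance
  [1] 0.25
  > mean((fitted(fit) - mean(data$pref))^2) # variance of the predictions
  [1] 0.0025
  > mean(residuals(fit)^2)                  # variance of the errors
  [1] 0.2475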

(After re-reading the blog post with more care...) you are right, and thanks for the correction.

Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50, as you demonstrate with your code.

To me, this doesn't seem paradoxical... the predictor is indeed providing little information over the "let's flip a coin to predict someone's voting preference" null/baseline predictor, since people's preferences (in aggregate) are almost equivalent to "flipping a coin".

note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares


> Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50

Yes, I think we don't disagree. I was just puzzled by the "little variance left to explain" remark.

> note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares

You're right, sum of squares made sense if it was just for the ratio.


Thanks for taking the time to clarify my confusion.

It's not that there is "little variance left to explain", but actually that (no matter what) there will always be too much variance left to be explained, when the response is Bernoulli-distributed and the parameter is not too far from 0.5 (i.e., the data generating process is like flipping a slightly loaded coin).

If you use the expected value to predict the Bernoulli variable, you will always be somewhat wrong (0.45 and 0.55 are both far from 0 and from 1, which are the only possible responses).

If you use a binary response to predict, you will quite often be very wrong, even if you are right on average, and even if your prediction is to generate Bernoulli-distributed samples from the exact same distribution (i.e., you know exactly how the coin is loaded/biased and you can exactly replicate its data generation process).
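A quick way to see this numerically (my own illustration): even if you know the true probability p exactly and predict it, the expected squared error is the Bernoulli variance p(1-p), which stays close to 0.25 whenever p is near 0.5.

  p <- c(0.5, 0.55, 0.6, 0.75, 0.9, 0.99)
  data.frame(p = p, irreducible_mse = p * (1 - p))  # 0.25, 0.2475, 0.24, 0.1875, 0.09, 0.0099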

So... yeah... no "paradox" ;)


I don't think it's about not knowing the abs function, more about the fact that the first derivative would be discontinuous and the second doesn't exist at 0. Variance has much nicer mathematical properties than absolute deviation.

You can solve L1 regression using linear programming at fantastically large scales. In fact in many applications you do the opposite: go from squared to absolute error because the latter fits into an LP.
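For anyone curious, here is a rough sketch of that reduction (my own toy example, assuming the lpSolve package): each absolute residual |y_i - a - b*x_i| is split into non-negative parts e+_i and e-_i, and the intercept and slope are split the same way because lpSolve assumes non-negative variables.

  library(lpSolve)
  x <- c(1, 2, 3, 4, 5)
  y <- c(1.1, 1.9, 3.2, 3.9, 10)                 # last point is an outlier
  n <- length(y)
  # variable order: a+, a-, b+, b-, e+_1..e+_n, e-_1..e-_n
  obj <- c(0, 0, 0, 0, rep(1, 2 * n))            # minimize the total absolute error
  con <- cbind(1, -1, x, -x, diag(n), -diag(n))  # (a+ - a-) + x*(b+ - b-) + e+ - e- = y
  fit <- lp("min", obj, con, rep("=", n), y)
  c(intercept = fit$solution[1] - fit$solution[2],
    slope     = fit$solution[3] - fit$solution[4])  # LAD fit, barely moved by the outlier
  coef(lm(y ~ x))                                   # least squares gets pulled up by it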

When you’re making a linear regression you’re minimizing the sum of squared error which is 1:1 with R2. That is, you’re getting the best possible R2 achievable with a line.

The reason we do that is because we are assuming the errors are normally distributed, and finding the slope that gets the best possible R2 is equivalent to figuring out how to fit the line with the maximum likelihood estimator of the error (aka the mean of the distribution).

So ultimately it’s about curves. If you wanted to get a sense for why this is strongly desirable you should try to fit a linear regression using absolute error instead of squared error.


Fun story: We once had a robot at a very big company you've all heard of that kept getting too close to walls when moving down hallway-like areas. The controls engineer swore he had tuned the controller to death and no further improvements could be had.

A perception engineer took one look at the controller and saw linear error terms for distance left + distance right (distance to walls), changed it to distance left^2 + distance right^2, and the whole thing magically worked beautifully. Exercise for the reader: What position in the hallway minimizes sum of distances squared, vs what position(s) in the hallway minimize sum of distances without square.

This is essentially the same problem you pose.
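A tiny R check of the exercise (my own illustration), for a hallway of width 1 where d is the distance to the left wall: the linear sum is the same everywhere, so it gives the controller no preference at all, while the squared sum is minimized only at the center.

  d <- seq(0, 1, by = 0.1)
  cbind(d, linear = d + (1 - d), squared = d^2 + (1 - d)^2)  # linear is constant 1; squared bottoms out at d = 0.5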


Kind of funny but wouldn’t you want to just maximize(distance_left, distance_right) if you wanted the center?

Edit: no, derp, just walked into the same problem lol. Maximize the min should work though


That has a unique solution, so it is infinitely better than the linear error, but is not a nice differentiable signal suitable for controls, I'd bet. But that wasn't in the problem statement.

Seems like you'd want to minimize the distance to the midpoint, but I'm probably missing something.

Fun fact: that's what sum of square does

R squared has extremely nice properties. If you add a bunch of linearly uncorrelated variables together, then compute R² between each variable and the sum, the R² values will sum to 1.
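A quick simulation check of that property (my own code, with arbitrary distributions):

  set.seed(1)
  n <- 1e5
  x1 <- rnorm(n); x2 <- rnorm(n, sd = 2); x3 <- runif(n)  # (nearly) uncorrelated variables
  s <- x1 + x2 + x3
  cor(x1, s)^2 + cor(x2, s)^2 + cor(x3, s)^2              # approximately 1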

R (ρ) also has a nice property:

> if X,Y,Z are "generic" random variables such that ρ(X,Y)=a and ρ(Y,Z)=b, we should on average expect that ρ(X,Z)=ab.

https://www.lesswrong.com/posts/vfb5Seaaqzk5kzChb/when-is-co...


The idea of using the square is due to Carl Friedrich Gauss when calculating the orbit of Ceres. I suspect he was familiar with the idea of absolute value.

Not sure about the "R squared" name, but the use of this quantity as "coefficient of determination" (the proportion of the variance in the dependent variable that is predictable from the independent variables) can be traced back to Sewall Wright's "Correlation and causation" (1921).

> predicting individual votes

Isn't that around the top of the statistics no-nos list? Probability theory in general? Predicting an individual result based on the whole sample base? It was long ago and I am not in this field at all, but my recollection tells me it was mentioned at the beginning of the first class of probability theory 101.


No, it's the opposite. The sample space measurement gives you the probability of an individually selected item. This is the fundamental reason why probability works.

But it requires uniform sampling.


Yes, you are right and statistics is confusing from the outside.

My opinion is LLMs are just applied statistics and you see people losing their minds thinking the models have "come to life". Most people just really have no intuition for stats.


> The two states are much different in their politics!

Are they? Sounds like they're both swing states, pretty close to 50-50, so which state you're from doesn't have a big effect on what your politics are likely to be. Which is exactly what the R^2 tells us. Where's the paradox?


A ten point gap is not a swing state, that’s big. It's about the same as New Jersey vs Texas in the 2020 presidential vote.

In the US.

In other countries 45/55 could easily be a swing state.

For example the 2014 Scottish independence referendum was decided 45/55 and that is usually held to be a fairly close result, at least close enough that it hasn’t ended the question.

So the real story here is voter polarization. The reason that Texas isn’t a swing state is because there’s a big hard core of immoveable voters at either end. So the actual population of swing voters is small.


I'm not exactly sure I would call the current political climate in the US voter polarization, it's an odd framing for what is ostensibly party polarization. You have political conservatives and christian conservatives pulling the Republican party further to the right to the point where libertarians and capital-M Moderate Republicans like Romney, McCain, Fitzpatrick have more in common with Democrats than their own party which would sound insane to anyone in the 00's.

I genuinely don't think that outside of a small minority who drank the Trump kool-aid the political views of the country at large have really changed. The biggest shift is that my friends who were and still are run-of-the-mill "Republicans" don't feel particularly represented right now, and I get it.

With the murmurs of Biden maybe stepping down I think if the Democrats nominated a Moderate Republican they would win in a landslide. It's nuts how much I didn't appreciate having two genuinely good candidates in 2008/2012.


> A ten point gap is not a swing state, that’s big.

No, it isn't. Your "ten point gap" is something like a 5% delta. That's within the margin of error of some polls.


> Where's the paradox?

Exactly. I propose that the paradox is in first-past-the-post voting: a 5% swing leads to a 100% change in representation. How can that be?



Arrow’s impossibility theorem only applies to ranked choice voting systems.

Yes, but what I mean is that even if you move away from FPTP voting, the others all have compromises.

But for many the drawbacks of ranked choice systems are far preferable to those of FPTP. Additionally, Arrow's theorem only states that the spoiler effect cannot be completely eliminated by ranked choice - it says nothing about how often such an event actually occurs, and in practice spoiler candidates will occur less frequently with ranked choice systems compared to FPTP.

Additionally, rated choice voting systems are not subject to Arrow's theorem.


... which are almost never used in voting for governments.

FPTP is a ranked choice system.

No it isn't.

https://en.m.wikipedia.org/wiki/Ranked_voting

> The most commonly-used example of a ranked-choice system is the familiar plurality voting rule, which gives one "point" (vote) to the candidate ranked first, and zero points to all others (making additional marks unnecessary).


This is not what is usually meant by the term (plurality voting is different from ranked voting where there are non-trivial preference orderings) and in any case not what the Arrow theorem applies to.

Arrow’s theorem defines a ranked voting system as a function that takes a permutation of the candidates for each voter and outputs a single permutation of the candidates. As a special case, if you take the function which sorts the candidates by how many voters ranked them as first, you get plurality voting.

And that makes plurality voting different from ranked voting where the voters can submit arbitrary preference orderings. In any case, as I said, the Arrow theorem doesn't apply to plurality voting, as there can be no cycles in the aggregate.


Only if you have infinite voters.

That was my first thought as well.

Also, traditional R squared with binary variables, or maybe categorical variables, never made much sense to me. I don't think the "meaning" of the variance is quite the same. You generally have to nonlinearly transform (e.g., logit) any linear model quantities into something else to put them on an observed-variable scale.


My statistics are a little rusty, so I might be off here. Someone correct me if I have this wrong. R^2 = 1 would mean every voter in one state votes blue and every voter in the other votes red. R^2 = 0 would mean both states are exactly even between red and blue. The states are a lot closer to that. Again, my statistics are rusty so I'm not sure if this next part is valid, but the square root of .01 is .1, which doesn't seem like such a bad representation of the situation.

Part of this has to do with the fact that our intuitive sense of effect sizes doesn't really use proportions, and we subconsciously start including sample sizes.

If a state had an election with millions of voters and got a 55-45 result, it would be a decisive landslide victory; if an elementary school classroom held an election with 20 voters and got a 55-45 split, it'd be the narrowest possible margin of victory.

Most would likely say that the effect in the former 'feels' much larger even though proportions are identical, which suggests that under the hood we're factoring in sample size to our intuition about effect sizes (probably something chi-square-ish).

The result is that the framing of the problem can change our sense of how big the effects are. When we hear that these are state-level elections, we think it's a huge effect and feel that we should be able to do reverse inference. If it was reframed as an election on a much smaller sample, the paradox disappears and you'd say "of course you wouldn't be able to reverse that inference"
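One way to make that intuition concrete (my own toy numbers): test the same 55/45 proportion against an even split at the two sample sizes.

  prop.test(x = 11, n = 20)$p.value           # ~0.82: in a classroom, easily chance
  prop.test(x = 550000, n = 1000000)$p.value  # ~0: in a state-sized electorate, essentially impossible by chance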


This ties in to the difference between whether an effect is statistically significant ("does the effect exist?") and whether it's significant ("does the effect matter?").

It's very common to confuse the two ideas.

In particular, in an election with many millions of votes and a 55-45 margin, it's common to describe the winner as receiving a mandate to rule, because it was so easy to determine who the winner was, despite the fact that they appear to be extremely unpopular. That's not a mandate in any ordinary sense.


> whether an effect is statistically significant ("does the effect exist?") and whether it's significant ("does the effect matter?")

The more accurate terms to describe this are whether an effect is significant (i.e., "do we have enough information to be able to claim it is different from zero?") vs. whether an effect is relevant (i.e., "is [our estimate of] the size of the effect meaningfully different from zero?").

Statistics can only address the first issue (effect significance); the second issue (effect relevance) requires domain knowledge beyond statistics (in some contexts, a difference of 0.01 units can be irrelevant, while it may be relevant in other contexts).


R2 is more simply explained as the reduction in error variance achieved by the model, relative to the error of the best constant guess, which in this case is 0.5.

Guessing 0.5 will have you wrong by 0.5 100% of the time. SST is 25 for a 100-sample example.

Guessing 0.55 for the 0.55 state will have you wrong by 0.45 for 55% of the voters and by 0.55 for the other 45% (and symmetrically for the other state). SSE is 24.75.

1- 24.75 / 25 = 0.01
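Or, restating the same arithmetic as R (my own addition):

  sst <- 100 * 0.5^2                            # always off by 0.5 -> 25
  sse <- 100 * (0.55 * 0.45^2 + 0.45 * 0.55^2)  # 24.75
  1 - sse / sst                                 # 0.01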

Looking at it this way it’s not too hard to see why the R2 is bad. It barely explains any more difference in the individual behavior than the basic guess.

R2 is not a great metric for percentages or classification problems like this.


> Looking at it this way it’s not too hard to see why the R2 is bad. It barely explains any more difference in the individual behavior than the basic guess.

Right. R² is 1% because the prediction is bad - only marginally better than the basic guess.

> R2 is not a great metric for percentages or classification problems like this.

Using a different metric won't improve the prediction.

Is the Brier score a great metric for problems like this?

The Brier score for the model is 0.2475.

The Brier score for the "basic guess" is 0.25.

The improvement in the Brier score for the model relative to the basic guess is 1%.
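For reference, those Brier scores can be reproduced with the toy data frame used elsewhere in the thread (my own check):

  data <- data.frame(state = rep(c(0, 1), each = 20),
                     pref  = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  mean((ifelse(data$state == 1, 0.55, 0.45) - data$pref)^2)  # model: 0.2475
  mean((0.5 - data$pref)^2)                                  # basic guess: 0.25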


The math is correct, but I think the model used is not correct, since it doesn't reflect that the variable s is dichotomous; rather, a mixed model should be used. If we continue thinking of s as continuous, we could consider this example: s=state is encoded as a continuous variable between -1 and 1, where people change state frequently; s=-1 means the person will vote in the blue state with probability 1, s=1 that the person will vote in the red state with probability 1, and s=0 means that the person has the same probability of voting in the red or blue state. When s is near zero the model is not able to predict the preferences of the voter, and this is the reason for the low predictive power of this model for a continuous s. The extreme cases s=-1 or s=1 could be rare for populations that move from one state to the other frequently, so the initial intuition is misled into this paradox.

This.

R2 is not the correct measure to use.

This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful.


R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

Does another measure give substantially different results?


I think that you are using a different definition of R^2 here; for example, the way you are thinking of R^2 doesn't allow the interpretation of the constant term used in the linear model (in the formula for R^2) to be true. What you are thinking of is R^2 = 1 - mean(the variance in each state)/(total variance), but that is not the definition of R^2 for a linear model.

As the user fskfsk.... says in another comment, here the constant term explains a lot of the variance, so the slope term contains less information; that is not available using your definition or idea of R^2.


> I think that you are using here a different definition of R^2

Different from what?

According to wikipedia:

The most general definition of the coefficient of determination is R^2 = 1 - SS_res / SS_tot ( = 1 - 0.2475 / 0.25 = 0.01 in this case)

Edit to clarify the definition above:

SS_res is the sum of squares of residuals (also called the residual sum of squares) ∑( y_i - predicted_i )^2

SS_tot is the total sum of squares (proportional to the variance of the data) ∑( y_i - ∑y_i/N )^2


The most general definition of R^2 can produce a result that is negative, and we are talking about a paradox related to values of R^2 that one should expect. So it is common to use linear models and linear regression. I don't know if the variance of the total population can be computed as the sum of the variances in each state, and state is not a continuous variable.

The population variance is the sum of the Between Group Variance and the Within Group Variance weighted by the number of elements in each group.
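That decomposition can be verified on the toy data used elsewhere in the thread (my own code, using the divide-by-N "exact" variances so the pieces add up; the groups are equal-sized, so plain means suffice):

  data <- data.frame(state = rep(c(0, 1), each = 20),
                     pref  = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  p0 <- data$pref[data$state == 0]; p1 <- data$pref[data$state == 1]
  within  <- mean(c(mean((p0 - mean(p0))^2), mean((p1 - mean(p1))^2)))  # 0.2475
  between <- mean((c(mean(p0), mean(p1)) - mean(data$pref))^2)          # 0.0025
  within + between                                                      # 0.25, the total variance
  between / (within + between)                                          # 0.01, the R^2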


I don't understand what you mean. I'll just note that the value of R² in this case is 1% as the blog post explains and the code below confirms.

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  > summary(lm(pref ~ state, data = data))$r.squared
  0.01

The math is correct, I am referring to your comment: >> R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

I may be reading too much into your comment, but it seems that you relate R^2 to the reduction in the prediction error in each state, so it seems you are thinking of computing R^2 as (average variance in each state)/(total variance), which I think is not correct in general, since at the very least it would require the total variance to be the sum of the variances in each state. If you base your ideas on that formula then your intuition is not correct; that is my point. When I apply R^2 I am thinking of a multivariable linear model with continuous variables, and this is not the case here. I would measure this problem by how the entropy changes when we apply the information about the state, something like the cross entropy between the total distribution and the distribution by states.


When I wrote "In this case it measures the relative reduction in MSE" I meant exactly that.

The mean squared error of the baseline model which doesn't include the state as a regressor is 0.25 (it predicts always 0.5 - it's off by 0.5 in every case).

The mean squared error of the model which includes the state as a regressor is 0.2475 (it predicts 0.45 or 0.55 depending on the state - in both cases it's off by 0.45 with 55% probability and it's off by 0.55 with 45% probability).

The mean squared error is directly related to variance when the predictor is unbiased. The ratio of the sum of squares is the same as the ratio of the mean square errors.

Edit: http://brenocon.com/rsquared_is_mse_rescaled.pdf

"R2 can be thought of as a rescaling of MSE, comparing it to the variance of the outcome response."

https://dabruro.medium.com/you-mention-the-average-squared-e...

"Also it is worth mentioning that R-squared (coeff. of determination) is a rescaled version of MSE such that 100% is perfection and 0% implies the same MSE that you would get by simply always predicting the overall mean of the dataset."


Let d1 = data[state==0] and d2 = data[state==1]; then var(d1$pref) = 0.26, var(d2$pref) = 0.26 and var(data$pref) = 0.256 (using R and one of your dataframes). So the intuition is that knowing the state does not give information about the preferences of the voters, which suggests that any model based on state should give poor results, and so having R^2 = 0.01 is not a big paradox in this case.

There must be a formula to compute R^2 from the variances both among states and inside states, but anyway, when the variances inside any state are bigger than the total variance, that should imply that the feature that divides the population into groups is of little value for prediction, so it should have a small R^2 value.


That seems more or less what I said in the comment you replied to: the prediction of individual votes remains quite bad even if the state is taken into account. That's why the relative reduction in MSE is low. That's why the R² is low. I don't think there is any paradox.

I was replying to someone who claimed that "R2 is not the correct measure to use. This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful." I've not seen any comment from anyone getting "different results" with a different measure.

Edit: You used var(...), which includes a factor N/(N-1) and doesn't give exactly the variance of the data.

The example dataframe contains 40 observations (20 per state) and you get a higher variance estimate for the subsamples than for the aggregate sample, but if you put together a few copies of the data (for example doing "data <- rbind(data, data, data, data, data)") even the adjusted (unbiased) estimator of the variance is lower for the states.

You can calculate the "exact" values yourself doing mean((x-mean(x))^2) or undoing the adjustment:

  > var(data$pref)*39/40
  [1] 0.25
  > var(data[data$state==0, "pref"])*19/20
  [1] 0.2475
  > var(data[data$state==1, "pref"])*19/20
  [1] 0.2475
> when the variances inside any state are bigger than the total variance

They are not. But you're right in that a small difference shows that dividing the population in groups is of little value for prediction and that's why the R^2 value is small.


The correcting factor n/(n-1) in R is what explains my paradox about the law of total variance Var(Y) = E(Var(Y|X)) + Var(E(Y|X)); I was obtaining results that didn't match this formula because I corrected all the variances with the factor 20/19, but the total variance should have the factor 40/39, just as you pointed out. Thanks for the comments and the correction.

I just added another comment that relates analysis of variance to this post to show that there is no real paradox here.

Finally, the formula for the total variance above is related to my intuition that having some information (having the data for each state) should make the mean of the variances in each group smaller than the total variance, because variance is related to lack of information. But analysis of variance suggests (see another comment of mine) that the state factor is not representative, because of the high variance in each group (each state) and the low difference between the group means and the total mean.


A mixed model is not relevant here. A simple linear regression with one variable will achieve exactly the same results. Coding it as -1 and 1 makes no difference compared to coding it as 0 and 1. You just stuff the rest into the intercept.

You would also want to be predicting 0.45 and 0.55 not 1 and 0 because we solve for squared error.


I'm no statistician, but the whole premise seems mismatched. Why are we using a tool from regression to analyze a classification problem?

> Why are we using a tool from regression to analyze a classification problem?

Because classification is a regression problem.

Think about it for a second. You want to put together a tool to tell which class an input belongs to. You have training data you can use to build your tool around. Your training data is already divided into sets that belong to a specific class. Your goal is to put together a model that can tell you the closest class your input belongs to by comparing how close your input is to elements of the training data belonging to a specific class.

What's your strategy?

Well, one of the textbook strategies starts by specifying how you measure the distance between elements of your training set, and from that point you work on putting together a function that not only minimizes the distance between elements of your training set but also, when used to evaluate elements of a training set, works well at telling which elements of the training set are closest to them. Then you assume the class of your input element is the same as the class of the elements of the training set that are closest to it.

In the example above, the minimization step is... yes, regression. You use regression to fit your model to your training data so that it is able to tell how close your input element is to elements of a certain class, and then it outputs how close it is to each of the classes.


Classification models break down to regression problems under the hood, but regression metrics are not good tools to evaluate the efficacy of classification models.

> Classification models break down to regression problems under the hood, but regression metrics are not good tools to evaluate the efficacy of classification models.

You're simply wrong. Regression is a tried and true classification technique. Posting personal and baseless assertions doesn't refute that. I mean, pick up pretty much any textbook on supervised and unsupervised learning. You always end up with an approach which boils down to having training data, putting together a trial function, applying a minimizer to fit the trial function to the training data, and evaluating the resulting model by running trial data through it. Fitting trial functions to training data has a name: regression. Least squares has a very precise interpretation both in linear models and in probability. There is no way around it.


In most modern ML circles “regression” analysis refers to the prediction of continuous variables whereas “classification” refers to the estimation across discrete outputs. This is true even for “logistic regression”.

What I said was that regression metrics are not good for evaluating usefulness in classification problems. An R2 of 0.01. The fact that there are mitigating circumstances to justify why this might not be the case is not evidence that R2 is still a good thing to use. It’s actually evidence of the opposite.

With classification problems we are concerned more often with the ordinality of estimated probabilities vs outcomes and/or the calibration of a model.

> Posting personal and baseless assertions don't refute that.

Your mom didn’t refute me either.


> In most modern ML circles “regression” analysis refers to the prediction of continuous variables (...)

You're voicing very opinionated takes while showing considerable ignorance on the topic.

Classification problems are solved with trial functions adjusted to the training set through regression. These trial functions, once fitted, represent membership functions. They are essentially interpolation functions that, say, converge to 1 when close to elements of the training set of a specific class, and to 0 for members of all other classes. In grey areas where elements of the training set are sparsely distributed, these membership functions can output values in the middle, because they are interpolating.

That's literally data mining 101.

> What I said was that regression metrics are not good for evaluating usefulness in classification problems.

That's simply not true. It's like saying models that do not fit the data are good approximations of the data. Another way to put it is praising the accuracy of a broken clock because it's spot on two times a day.

> The fact that there are mitigating circumstances to justify why this might not be the case is not evidence that R2 is still a good thing to use. It’s actually evidence of the opposite.

I don't think you have an adequate grasp on the subject, neither classification problems nor linear models. Therefore, your personal baseless assertions don't mean anything nor bring any value to the discussion.


I’m a data scientist that builds ML models for a living.

You’ve twice now tried to explain how fitting a model works in response to me stating that regression “metrics” are not well suited to describing classification “efficacy”. If anyone is making a basic mistake here it seems to be you failing to understand the difference between a “regression model” like a linear regression, and a “regression metric” like R2.

I’ve stated my semantic meaning of regression vs classification. You can Google this to see it is not a fringe view. It’s been standard for over a decade. Eg https://math.stackexchange.com/questions/141381/what-is-the-...

> Another way to put it is praising the accuracy of a broken clock because it's spot on two times a day.

Accuracy is a classification metric kiddo

Here’s an introductory Wikipedia article if you would like to learn more about what metrics are appropriate for binary classification evaluations:

https://en.m.wikipedia.org/wiki/Evaluation_of_binary_classif...


I am a statistician, and you're right, for this kind of thing we would normally use a binary response model such as a logit or probit model that constrains the predicted probability to be between 0 & 1. However in this case it doesn't matter since there's only one independent variable (state), and it's binary, so there are only 2 different predictions the model could make (which will be the correct probabilities of 0.45 & 0.55, even with a linear model).

The normal R^2 formula can't be applied to a logit/probit model; instead you use an alternative such as McFadden's or Cox & Snell pseudo R-squared. I'd be interested to see what value they take for this example.
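Out of curiosity, a quick sketch (my own code, not the parent's) of McFadden's pseudo R-squared for a logit fit on the 40-observation toy data used elsewhere in the thread; it comes out tiny as well, so the low value is not an artifact of the linear model:

  data <- data.frame(state = rep(c(0, 1), each = 20),
                     pref  = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  full <- glm(pref ~ state, family = binomial, data = data)
  null <- glm(pref ~ 1, family = binomial, data = data)
  1 - as.numeric(logLik(full)) / as.numeric(logLik(null))  # McFadden's pseudo R^2, about 0.007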

Linear models are sometimes used even in models with many independent variables since it can be shown that the coefficients in a linear model are unbiased estimators for the average partial effects of any non-linear binary response model.


Yep. I'm also not a statistician, but the linear regression that the blogger is using predicts the mean for each state, and this is being conflated with trying to predict p(color | state). The goodness of fit here would be better measured by cross entropy and not by a standard deviation.

For what it's worth, the "blogger" is a statistician.

I mean, Andrew Gelman, he is quite the statistician.

Oh he only went to MIT, pfff. And wrote a textbook.

Haha, this is why I always Google whoever the author is to articles posted on HN before commenting. More than once I've thought "this person is an idiot" only to Google their name and find out they are a famous person in that field. Then I go back and re-read their article and realize I missed something more subtle going on.

This seems to be a special case.

The blog post was rebutted by Seth in the comments 2 weeks ago, same as Colin Percival's HN rebuttal, and Andrew didn't reply. It seems like a weird goof. Andrew was "buggin".


Yeah, the more I read the post the more confused I am actually. At first glance to me this seemed like a non-paradox. So I kept wondering if I'm missing something, but based on everyone's responses, maybe I'm not?

This comes up a lot in genetics. One crowd says "polygenic scores for education don't tell you much, because look how low the R-squared is!" Another crowd (including me) says "polygenic scores for education are a big deal, because look how big the effect size is!"

What paradox? People don't vote a particular way because they live in a state. The logic here would imply "Welp, I live in Kentucky, so I guess Red?" would be the expected mode at the voting booth.

Statistics does not require asserting causality.

Yes but you do actually have to understand the outcome you're measuring for.

Not to encounter the “paradox” of R2 being low while a real effect is identified

There are two ways to resolve the paradox:

1. If you insist on using the r-squared (i.e., a linear regression measure), then properly center and normalize your data, and model what you actually predict: the difference between the baseline (0.5) and the probability to vote for party 0 or party 1. If you model the outcomes as 0/1 without this, then you are using a model made for Gaussian variables on what should be a logistic regression.

2. If you can live with something that more accurately captures the idea of "explanatory power", you can use a GLM (logistic link function), do a logistic regression, and then use the log odds or another measure.

In both cases, the variance explained by the state that you are in is 1, because of course it is, that's how the thought experiment is constructed: p(vote for party 1) = 0.5 + δ(state).

"Paradoxes" like this are often interesting in the sense that they point to the math being the wrong math or you using it wrong, but instead people tend to assume that they are obviously understanding things correctly so it must be some weird property of the world (which then sometimes is used to construct some faulty conclusions as in some of the cited papers)


From (1): "On the other hand, if the variation between the group means and the grand mean is small, and the variation within groups is large, this suggests there are no real differences in the group means i.e. the variations we observe is just sampling variation."

The above is in the context of analysis of variance. In our example the means in each state are 0.55 and 0.45 and the total mean is 0.50, so the first summand is small, but the variances in the red and blue states are both 0.247 (a large summand), so the variations we observe are just sampling variations. Hence the state factor is not important, and that explains the low R^2 value. Note that in each state the predicted value from the model is the group mean of that group. So analysis of variance shows that the OP result is not a paradox or something strange.

https://saestatsteaching.tech/analysis-of-variance
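For reference, the corresponding ANOVA table for the toy data used elsewhere in the thread (my own code): the between-state sum of squares (0.1) is tiny compared to the within-state sum of squares (9.9), which is the same information the R^2 of 0.01 conveys.

  data <- data.frame(state = factor(rep(c(0, 1), each = 20)),
                     pref  = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  anova(lm(pref ~ state, data = data))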


Isn’t the phenomenon just related to the way the vote options are encoded? Use different methods and you will see different R^2 results. Aren’t the votes represented artificially on a continuous domain for the R^2 calculation but the actual values are categorical values?

Not really. No matter how you encode it, the extremes will be 0% and 100%, and one option will be 45% and the other 55%.

As other commenters have pointed out in one way or another, the problem seems to actually be that this simplistic model of voter choice can't capture all the structure of the real world that humans can quickly infer from the setup. Things like: state elections have millions of voters, 55/45 is actually a decisive, not a narrow win etc.

In a generic setup, imagine you have a binary classifier that outputs probabilities in the .45-.55 range - likely it won't be a really strong classifier. You would ideally like polarized predictions, not values around .5.

Come to think of it, could this be an issue of non-ergodicity too (hope I'm using the term right)? I.e., the state-level prior is not that informative wrt the individual vote?


No, you want your model to be well calibrated. If the model accurately assessed a 0.55 probability of going blue, then that is what you want.

People who try to correct for “unbalanced classes” and contort their model to give polarizing predictions are frankly being pretty dumb.

The correct answer is to take your well calibrated probabilities and use your brain on what to do with them.


This is not a matter of class balance that much. If you want to predict which of two parties somebody will vote with, the most natural framing is that of binary classification.

For that you need to threshold your predictions. Ideally you'd like your model to generate a bimodal distribution so that you can threshold without many false positives etc.


Yes but the prompt here states that all we know is the probability is either 0.55 or 0.45. By definition this is the best model you can produce.

If voters are split 60-40 on an issue, that doesn't mean that the odds are 60-40.

You should instead be asking: what are the odds that X voters could change their vote?


The states (and even more so the sub-state regions) really are much more different than what you would think just looking at R vs D. A Democrat in a city the Democrats win 90-10 is likely a very different Democrat from one where they lose 60-40.

If you think that's bad, the R^4 coefficient is even lower.

Nothing but endless Cloudflare captchas here for me.

Removing cookies for the domain doesn't help, because (doh) I've never visited it before.



