Thanks for posting this. The salient point for me was the following bit, which I'd never considered before but which explains quite well a feeling I've had about testing for a long time.
There’s a great observation called Goodhart’s Law that basically says any metric becomes useless once you start using it for control purposes. So the SAT, for example, is a good general test of academic aptitude. But since it’s used so much for admission to college, kids are trained and coached. They spend lots of time and effort, specifically to improve their SAT score at the expense of a well-rounded education, to the point where [the SAT] may not be such a good guide to general academic excellence, even though it used to be before students started optimizing.
This is true in most of social science. Metrics work when people don't know they're being measured. Once people learn they're being measured on something, they optimize to it.
On another note... If a metric is being used properly, it might not seem like it's working. For example: Harvard saw that GMATs didn't correlate with their MBAs' success, so they dropped it as an admission criterion. Years later they checked the GMAT scores for the classes where they weren't required. What happened? GMATs correlated with performance. What's the cause? Earlier on they were weighting it in properly. (Higher GPA? GMAT doesn't matter much. Lower GPA? It matters more. Grad degree in stats? Lower undergrad GPA and GMAT don't matter as much, etc.) If you're weighting it properly, the importance won't show. (In essence, a 780 is the right marginal score to let one person in, a 680 for another.)
All that said, in a way, the perfect measure is an unfair one, because you don't know what you're being measured on so you can't change it.
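A tiny simulation of that weighting/selection effect (the admission rule, coefficients, and threshold here are all made up, purely to illustrate the idea): if the committee already uses the test sensibly alongside GPA when admitting, then within the admitted class the test's remaining correlation with performance looks much weaker than it really is; ignore the test at admission time and the correlation reappears.

    import random

    random.seed(0)

    def simulate(use_test_in_admission):
        # Hypothetical, standardized GPA and test scores; "perf" is the
        # outcome the school cares about, driven equally by both.
        admitted = []
        for _ in range(100000):
            gpa = random.gauss(0, 1)
            test = random.gauss(0, 1)
            perf = 0.5 * gpa + 0.5 * test + random.gauss(0, 1)
            score = gpa + test if use_test_in_admission else gpa
            if score > 1.0:                     # admit only the top slice
                admitted.append((test, perf))
        return admitted

    def corr(pairs):
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        cov = sum((x - mx) * (y - my) for x, y in pairs)
        sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
        sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
        return cov / (sx * sy)

    # Test used (properly weighted) in admission: within the admitted
    # class its correlation with performance looks weak (roughly 0.15).
    print(corr(simulate(True)))
    # Test ignored at admission time: the correlation reappears (roughly 0.4).
    print(corr(simulate(False)))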
That is partially true. A metric should continue to "work" when people know they are being measured on it. A good metric is one which makes the measured person more effective when they explicitly try to optimize it.
So for example, a pro basketball player knows they are measured by their accuracy. However, when a player optimizes their accuracy, that does not compromise the effectiveness of that metric.
The important thing here is that a metric that is a _correlate_ of the thing you care about but can't measure directly might start to work less well once people know you're measuring it. And the reason is that the correlation with the thing you really care about might decrease.
To use your basketball example, say you have a player who takes a shot every time they get close enough to the basket, making 60% of them. You decide to measure accuracy. They get more conservative about when they take shots, so now half the time they run out of the shot clock (resulting in a turnover) and the other half of the time they take a shot and score with 100% accuracy.

Their accuracy went from 60% to 100%, but the fraction of possessions that result in scoring went from 60% to 50%.
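Spelling out that arithmetic (the 60%/100%/50% figures are just the hypothetical numbers from the example above):

    # Hypothetical shooter from the example above.
    # Before: shoots every time they're close, makes 60% of attempts.
    shot_rate_before, hit_rate_before = 1.0, 0.60
    accuracy_before = hit_rate_before                                  # 60%
    scoring_possessions_before = shot_rate_before * hit_rate_before   # 60%

    # After being judged on accuracy: shoots on only half of possessions
    # (the rest end in shot-clock turnovers) but makes every one.
    shot_rate_after, hit_rate_after = 0.5, 1.0
    accuracy_after = hit_rate_after                                    # 100%
    scoring_possessions_after = shot_rate_after * hit_rate_after      # 50%

    print(accuracy_before, accuracy_after)                        # 0.6 -> 1.0
    print(scoring_possessions_before, scoring_possessions_after)  # 0.6 -> 0.5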
The issue there is that you don't _really_ care about accuracy per se. What you care about is contribution to the score differential vs the opposing team. But that's hard to quantify, so people come up with various metrics that correlate to such contribution (shooting accuracy, rebound percentage, etc, etc), along with relative weights for these metrics. In fact, the weights depend on the exact team you're playing, who your current teammates are, etc.
In any case, it's actually fairly rare to be able to directly measure the thing you care about, especially in the social sciences, not least because a lot of the things we care about ("success in life", "college preparedness", "job satisfaction", "knowledge and understanding of the material covered by this class") are fairly difficult to quantify. So instead one ends up measuring various correlates and hoping that either people don't adjust their behavior too much to degrade the correlation before you're done measuring or that your measurement process attempts to account for such adjustments.
Indeed. Just yesterday a buddy told me how he knew someone a few years ago (great story, I know) who was gaming telephone subscriptions. He'd go into the system and check a box, say 'extra 500 minutes', with a deal that gave the first 3 months for free. This is normally what you do after you close such a sale: when a customer buys the extra deal, you go into the system and register it. Only now he'd just do it without the customer knowing. Three months later he'd cancel it. His sale was still registered, his customer never knew or ever paid, and he got his sales bump.
And another friend of mine works in sales at a big software company (SAP, Oracle, those sorts of companies). Basically he has to generate leads, pass them on, and if they're closed he hits his target. So what most people do is call the manager, suck up to him, and ask him to register a closed sale's lead as one of theirs. A lot of these sales are simply recurring sales, or sales closed when a customer called the company, as opposed to being closed after a cold call led to a lead. Long story short, you've got people hitting their targets without generating any leads or sales, making $40k a year and moving into the same closing function two years later without ever having done anything.
The idea seems pretty straightforward: once you create a metric, people will game it. And it seems a single person designing the system (the metrics, the governance of the metrics) continuously gets outsmarted over time by the hundreds or even thousands of people who are incentivized to find the loopholes in that individual's design.
Test preparation does not do much to increase SAT scores:
"For students that have taken the test before and would like to boost their scores, coaching
seems to help, but by a rather small amount. After controlling for group differences, the
average coaching boost on the math section of the SAT is 14 to 15 points. The boost is
smaller on the verbal section of the test, just 6 to 8 points. The combined effect of
coaching on the SAT for the NELS sample is about 20 points."
I went through the Princeton Review many years ago and it improved my score by about 200 points.
Maybe I'm an outlier, but I can tell you this: nobody would pay for or go through the course if the expected boost was 20 points.
On the other hand, if I hadn't taken the Princeton Review, I'd have done something else to prep. And who knows how much I'd have been able to improve on my own.
They did not measure whether it is possible to prepare for the SAT. They measured whether commercial courses boost your score. The comparison is not between a student who does no additional work for the SAT and one who studies a lot. The comparison is between a student enrolled with a paid coach and one who is not paying for a special SAT coach. The latter can still work a lot with the help of books and websites.
It does not mean you cannot prepare for the SAT; it just means that money spent on a commercial coach is probably wasted. It says nowhere that if you skip all preparation altogether, you will get a good result.
Quote: "Most studies have focused on estimating the effect of one specific type of test preparation, known as “coaching.” In this analysis, students Students have been coached if they have enrolled in a commercial preparation course not offered by their school but designed specifically for the SAT or ACT. The distinction made here is whether a test-taker has received systematic instruction over a short period of time. Preparation with books, videos and computers is excluded from the coaching definition because while the instruction may be systematic, it has no time constraint. Preparation with a tutor is excluded because while it may have a time constraint, it is difficult to tell if the instruction has been systematic."
The article's conclusions focus on coached students only.
Note the time constraint in the above quote. People who take those tests seriously not only take the short course, but probably focused on the test long before the course started. At least, that is how the well-performing students I knew years ago behaved. The course itself might have boosted their score only a little, but their preparation consisted of much more than just one course. After all, they are adults about to enter college - they are expected to be able to learn on their own.
Most coaching is really poor. I used to do math prep for the SAT and I consistently improved scores more than the data would predict. After sitting through Princeton Review, it was easy to see why.
I suspect those that benefit the most from prep aren't telling.
Sounds like Google's toolbar PageRank, which was a neat way of seeing the very generic "relevance" score Google assigned to a page.
Then Search Engine Optimization really started gaining ground, and some SEOs would obsess about that number, going through all kinds of work to manage it, and basing business decisions (such as: should we link to this site? should we request a link from that site?) far too heavily on that number, as opposed to the actual business relevance of what they were doing, and whether it would be a useful result for their own visitors (or those of the other site).
It is of course important to look at whether or not what you are measuring validly predicts what you are trying to predict. Despite much worry about SAT coaching courses, I get the very strong impression that the young people who score highest on the SAT are still, after all these decades, the young people who read widely for fun and who think about math problems for fun, just as they were in the 1970s when young people in my region of the United States were not coached for the SAT at all. The standardized conditions for administering the SAT have always included (in my lifetime) giving students registered for the test access to a complete previously administered test to serve as practice material. According to a psychometrician I know, precisely because the SAT allows everyone to practice in advance, it is an especially valid estimate of the test-taker's general ability to learn and to reason.
You can look up research about SAT correlations with general intelligence[1] and IQ test scores[2] and college grades[3] (what SAT scores are intended to predict) on Google Scholar and see for yourself what the empirical research says. No one-factor means of predicting college success is perfect (and if college admission officers were forced to use ONE factor to admit applicants to college, the factor they would be wise to choose to predict college grades is UNWEIGHTED high school grade average), but SAT scores (or ACT scores) are still used in the admission process of most colleges that are at all selective because they provide helpful information for comparing applicants who attended different high schools.
If you would like to put names of particular researchers into a Google Scholar search to find out what they have discovered about current use of SAT tests, I would suggest Nathan Kuncel or Paul Sackett, both of whom have studied quite a few detailed data sets about SAT-takers and what their scores predict.
AFTER EDIT: Terence Tao, in the interview kindly posted to open this thread, makes the important point that sometimes teachers entertain more than they actually teach. (I think a lot of commercial test-prep courses have this defect.) He said, "Sometimes making a class more enjoyable or entertaining is not the same as making it more educational. I remember once when I taught calculus, one of the sections was quadric surfaces, like ellipsoids, paraboloids and so forth. So there’s this thing called a hyperbolic paraboloid, and I wanted to demonstrate this. It turns out that a Pringle has the shape of a hyperbolic paraboloid. So I brought in a pack of Pringles to class and said, 'this is a hyperbolic paraboloid, this is what it looks like.' I ate the Pringle as I was writing equations on the board. Well many years later, I ran into someone on the street who said, 'Oh, I know you. I took one of your classes. I forget what it was, but there was a Pringle.' So I thought, 'you didn’t remember any of the math, but you remember the Pringle.' So it didn’t really work." Following up on that in the context of this subthread, I think a lot of students know who Joe Bloggs[4] is without being able to reason successfully to find the right answer choice on the harder SAT questions.
Here you have an implicit assumption that the system being controlled can react, that is, in the sense of control theory, change its plant dynamics due to the effort of the control. Well, for a lot of work in control, say, in most aerospace applications, that assumption is false or essentially so. Then getting a good metric and using it is fine.

E.g., get the power spectrum of ocean wave noise and use that in a control system to keep a submarine at essentially a constant depth. We don't expect that the metric of the ocean wave power spectrum and the control system in the submarine will change the ocean very much!

E.g., GPS is a terrific metric, but we don't expect that using it will change the shape of the earth.
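A minimal sketch of the kind of measurement being described (the "wave noise" here is synthetic and the parameters are invented; a real system would use pressure or depth sensor data):

    import numpy as np
    from scipy.signal import welch

    # Synthetic stand-in for ocean-wave noise: slow swell plus broadband
    # noise, sampled at 10 Hz for 10 minutes.
    fs = 10.0
    t = np.arange(0, 600, 1 / fs)
    rng = np.random.default_rng(0)
    x = (np.sin(2 * np.pi * 0.08 * t)           # ~12.5 s swell component
         + 0.5 * np.sin(2 * np.pi * 0.15 * t)   # shorter chop
         + 0.3 * rng.standard_normal(t.size))   # broadband noise

    # The power spectral density is the "metric" a depth-keeping
    # controller could be designed against; measuring it does not
    # change the ocean.
    f, Pxx = welch(x, fs=fs, nperseg=2048)
    print(f[np.argmax(Pxx)])    # dominant wave frequency, about 0.08 Hz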
Not really -- the IQ tests used by MENSA for example are quite reliable/stable. You can improve your result by 6-10 points with a lot of training, but that's usually the most that people can achieve.
A standard deviation on IQ tests is 15 points. A 6-10 point improvement from the median represents a jump of 15.5 to 24.75 percentile points. Even at the level MENSA is looking at, someone with a 10 point improvement could have a "true IQ" just above the 95th percentile and still clear their 98th percentile cutoff.
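The percentile arithmetic behind those numbers, modeling IQ as normal with mean 100 and SD 15:

    from scipy.stats import norm

    mean, sd = 100, 15   # conventional IQ scaling

    # Percentile-rank jump from the median for a 6 or 10 point gain
    for gain in (6, 10):
        jump = (norm.cdf(mean + gain, mean, sd) - 0.5) * 100
        print(gain, round(jump, 2))     # ~15.54 and ~24.75 percentile points

    # Someone whose "true IQ" sits just above the 95th percentile, plus a
    # 10 point training boost, clears a 98th-percentile cutoff (~130.8).
    true_iq = norm.ppf(0.95, mean, sd)   # ~124.7
    cutoff = norm.ppf(0.98, mean, sd)    # ~130.8
    print(true_iq + 10 > cutoff)         # True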
Maybe the reason MENSA's tests are reliable is that not many people are interested in joining MENSA - I am not denying that MENSA can be of interest, but admission would not be worth preparing for IQ tests like crazy.
On the other hand, if IQ tests were used for admissions in prestigious universities, you can bet some students will train specifically for IQ tests.
I'm truly impressed by the cognitive dissonance here. On the one hand, this is a wildly popular platitude for intellectuals to endorse, and Terry Tao would certainly suffer some social penalties if he strongly opposed it. On the other hand, if this claim were actually true, it would invalidate basically the entire field of mathematical optimization and decision theory.
It's also strange that proponents of this view rarely apply it uniformly. There are many metrics that are quite popular among proponents of Goodhart's law - for instance, CO2 emissions, capital reserve requirements, etc.
"On the other hand, if this claim were actually true, it would invalidate basically the entire field of mathematical optimization and decision theory."
I might be able to help by pointing out that a technology that depends on random sampling of small components of "intelligence" or whatever is going to be totally screwed up if the sampling is no longer random and now the outcome of the test depends mostly on the amount of money spent on prep classes.
From manufacturing I can predict a lot about the quality of something in general off just one graph of the distribution of length of one part, compared to other manufacturers' graphs of the same single measurement on the same part. But if the manufacturer "cooks the books" by optimizing that one measurement of the part to perfection while taking labor away from other areas of part quality, then my comparisons will be useless, especially when comparing optimizers vs non-optimizers.
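A made-up sketch of that manufacturing point (all makers, numbers, and the "care" variable are invented): when the one measured dimension reflects overall care in production, it ranks manufacturers correctly; once one manufacturer optimizes that dimension at the expense of everything else, the same comparison misleads.

    # Hypothetical manufacturers: "care" drives both the one measured
    # dimension (spread of part lengths) and the overall, unmeasured quality.
    def honest(care):
        length_sd = (1 - care) * 0.5     # more care -> tighter lengths
        true_quality = care
        return length_sd, true_quality

    makers = {f"honest_{i}": honest(c)
              for i, c in enumerate([0.3, 0.5, 0.7, 0.9])}

    # A book-cooker: pours all its labor into the measured dimension, so
    # lengths look perfect while overall quality stays mediocre.
    makers["optimizer"] = (0.01, 0.4)

    # Ranking by the single measured dimension now misleads: the optimizer
    # comes out on top even though two honest makers build better parts.
    for name, (length_sd, quality) in sorted(makers.items(),
                                             key=lambda kv: kv[1][0]):
        print(f"{name:10}  length_sd={length_sd:.2f}  true_quality={quality:.2f}")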
I'm not sure your strong criticism of Goodhart's Law is quite fair. It is stated so as to be pithy, but I think it is fairly obviously meant to be applied only to those situations where the thing being measured in some sense "judges itself" by that score, and adjusts its performance to try and improve that score.
This is quite likely the case with school children, but often not the case with, say, industrial processes. If, however, you make someone's job dependent on meeting some particular target for that process, then a feedback loop is created that can lead to the kinds of problems being discussed.
If the feedback loop isn't there, then the metric will remain as useful as it ever was. So the point is, be wary of unintentional feedbacks in your system, not some grandiose rejection of the possibility of optimisation.
Do you believe a company running an industrial process will not "judge itself" by its CO2 emissions, or that a bank will not adjust to meet capital reserve requirements?
Yes, I do believe that they will, and those judgments can reduce the effectiveness of legislation as it is originally conceived. I only see that as speaking to the point, not against it.
A: I used to have more. When you work and you have family, it’s tough. When I was younger I used to watch a lot of anime and play computer games and so forth, but I have no time for these things anymore.
He taught in Australia for a little while. I was in one of his classes and I watched some Rurouni Kenshin with him. He was probably the best math lecturer I've had. But I was a poor student.
I can't put my finger on exactly why, but this was the most interesting answer for me (not in the specifics of anime and computer games, but just in the ubiquity of the winnowing of interests over time due to career and family pressures; even Terence Tao didn't start out all-math, all-the-time!).
I mean you need a certain amount of base mathematics so that you can learn everything else quickly. But once you have the foundation, it’s fairly quick.
Shame the interviewer did not proceed to ask him what he considers this foundation to be; I think the question almost suggests itself.
I wouldn't suggest focusing on qualifying exam topics unless you need to pass a math qualifying exam.
In modern mathematics, particularly applied mathematics, these are NOT the most useful topics. That's just how things have always been done, and no one wants to argue about changing them in faculty meetings. See also the language exam:
http://web.math.princeton.edu/~templier/language-exam.txt
In my former career I never once used topics covered in "abstract algebra" (note that I'm distinguishing linear algebra from abstract), a common situation among analysts. Most people outside of analysis never use complex variables.
The only common core of topics I can identify that nearly every mathematician I know uses is:
Measure theoretic probability (this intersects with real analysis)
Linear Algebra
Algorithms and optimization (this is less common than the above two).
I learned proofs in a numerical analysis class. That doesn't mean a budding algebraist should study numerical analysis, it means you will learn how to do proofs in any rigorous math class.
Sorry, didn't mean to imply that one needs to learn abstract algebra to do proofs. Was just mentioning my experience, as it was the course where I really enjoyed doing proofs.
Wondering why this is voted down. He is quite right. Several math professors at my school who teach applied math do not know even the utmost basics - say, things like computing the kernel of a quotient map. I'm sure they must have known it at some point during their student years, or maybe during the 60s-70s you could get away without knowing these things, I dunno.
otoh, my abstract algebra professor was able to learn and then teach me Ito's lemma, all the way up to semimartingales - topics he had never set eyes upon in his entire life! So I'd say the abstract stuff makes you much stronger - it gives you tools to understand from scratch material you've never seen before. The applied people generally glaze over when you mention math topics outside their competence zone, especially topics that are "useless" for some definition of "use".
I don't know what you mean by "validity". I am absolutely certain that applications of all the "standard" fields (real, complex, algebra) can be found. They aren't bad things to learn.
I'm just suggesting that from what I've seen, the other fields I listed are more commonly useful. Everyone I know, across pure and especially applied math, has used probability and linear algebra. Lots of people haven't used complex analysis or algebra ever, except on the qual.
I have read his summary of his qual linked here before, and upon reading it this time I also read a few additional ones. My takeaway is that it seems like it would be a lot of fun for the profs to conduct these interviews (most of the time, with probably a few utterly disastrous ones thrown in for good measure).
Unrelated: why was your post not repliable when I first saw it? I refreshed the page a few minutes later and the reply button was there. Any idea why that happened?
> a lot of fun for the profs to conduct these interviews
They can be. But it can get boring if your colleague wants to see the details of some boring computation. And, more seriously, it can be painful if the candidate is doing poorly. Fortunately I have not yet had to fail anyone.
> why was your post not repliable when I first saw it?
I believe there is a delay, the length of which is a function of how deeply nested your comment is, to encourage more top-level comments. (In particular, it tends to defuse arguments if you have to wait a long time to reply...)
One way to help get through quals: The quals are, on the surface, in some common justifications, to be more sure the student can do the dissertation research.

Okay.

Well, there's another way to be "more sure", really, more reliable than any quals can ever be: have the student do the dissertation research independently before the quals. Now, in this case, just what are the quals for?

Or, for an engineering Ph.D., a guy writes a good dissertation in applied probability with careful attention to the tricky subject of measurable selection, and want to hold him back due to some qual with some tricky issue about Feller I probability?

Or, there's a qual in optimization, in part on the details of the Kuhn-Tucker conditions, but the student has already done original, clearly publishable work, e.g., in JOTA, in optimization and, in particular, the Kuhn-Tucker conditions?
E.g., for problems in functional form, are the Zangwill and Kuhn-Tucker constraint qualifications independent? Along the way, given a closed subset of R^n (usual topology), is there a function f: R^n --> R that is 0 on the closed set, positive otherwise, and infinitely differentiable (not quite the same as the Whitney extension theorem)? Is the Mandelbrot set closed and, thus, an example? What about a sample path of Brownian motion? So an infinitely differentiable function can have a bizarre level set? What does this say about a question, without an answer, in the famous paper in mathematical economics by Arrow, Hurwicz, and Uzawa? And in this case, the qual in optimization serves just what purpose?
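(For reference, the standard fact behind that smooth-function question - stated here for clarity, not quoted from the comment above - is:

    \forall\, C \subseteq \mathbb{R}^n \text{ closed},\quad
    \exists\, f \in C^{\infty}(\mathbb{R}^n),\ f \ge 0,\ f^{-1}(0) = C.

That is, every closed set - the Mandelbrot set or the range of a Brownian sample path on a compact time interval included - is exactly the zero set of some infinitely differentiable function.)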
The quals might be to see if the student is prepared to take advanced grad courses, but he's already done that and used some of the best content in his Ph.D. research? Now what are the quals for?

The quals can start to obscure a basic point about the three things important in high end academics: research, research, and research, that is, the publishable kind. If a guy is doing well in research, want to hold him back because of what?

Can there be such students? Yup. I have an existence proof.

Besides, quals can be awash in politics.

The usual criteria for publication are that the work be "new, correct, and significant". At some good research universities, there is no coursework requirement for a Ph.D., and the dissertation is supposed to be "an original contribution to knowledge worthy of publication" or some such. So, for a student, do some research and publish it.
As I recall, at one time the math department at Princeton said that the grad courses were introductions to research by experts in their fields, that no courses were given for preparation for the qualifying exams, that students were expected to prepare for the quals by independent study, and that students were expected to have some research underway in their first year. Good. I'd done a lot of independent study before I went to grad school.

I got accepted as a grad student at the Princeton math department but went elsewhere (where my wife was still in grad school), brought my own research problem with me to grad school, and did my dissertation research independently in my first summer, building on one of the courses in my first year. Then the quals were for WTF?
The funny thing about mathematics is that you don’t work with regular numbers so much. I never see a 37, I see ‘n’ – a lot of what I do involves a big number n that goes to infinity. Never any specific number.
An interesting answer to probably the most common question that mathematicians are asked. Radiolab[0] covered this and the significance of "favorite/lucky numbers" in a podcast a couple of months back.
Alexandre Grothendieck is a famous mathematician known for his incredibly abstract methodology. He once contrasted his approach to mathematics with the norm by likening standard mathematical methods to opening a nut with a sledgehammer. To him, his method was to slowly raise the water level and soak the nut until its shell became soft and he could open it with just a little pressure from his hand.
However, his highly abstract nature got him in trouble at least once.
One time in a discussion, a colleague asked Grothendieck to consider the argument for a concrete example of a prime number. "You mean an actual number?" replied Grothendieck. "Alright, well then, take 57."
I'd rather say something like "the Fischer–Griess monster" (it is a constant in the set of finite simple groups). Surely a constant, but not a numerical one. The interviewer would probably change the topic immediately to avoid embarrassment. ;-)
Surely a less diplomatic way than Tao's, admittedly.
For the thinking about SATs, there's a standard situation not commonly discussed:

Get a lot of multi-dimensional, empirical data and find some pattern, maybe an equation, that fits.

Okay.

But does this equation have to work in practice? Not really: the empirical data may not have actually represented all the possible dynamics of the real system. So, when applying the equation, we might not have maintained some crucial conditions that held when we found and fit the data.

Or, e.g., looking at academic performance data and SAT scores, etc., we have a lot of multidimensional data and might find a fit. But the data collection might have been done with some unstated, unclear conditions and might not have included all the cases that could occur. So, this fit might not hold when those conditions don't hold. And applying the fit in reality, where the conditions are free to change, might make the fit poor.
Or, if someone pulls too hard on the electric power cord of their vacuum cleaner, then they will have a broken vacuum cleaner. Well, that would be true for women like the mother of a girlfriend I once had! But not for me! I just did a little work with a screwdriver and pocket knife and fixed the problem. Maybe I impressed the mother of the girl! I was 14, the girl 12, and the prettiest human female I ever saw in person or otherwise. Maybe it was good I impressed the mother!

So, a condition of the good fit was that the user of the vacuum cleaner didn't know how to repair some broken electric power wiring; that condition held for the mother, but not once she had a drop-dead gorgeous, sweet daughter with a boyfriend whose father had long since been teaching him about basic work with hand tools.
For the SAT scores, early on one might argue that so far, in practice, they really did capture real, pure academic ability. Even if that was true, it need not mean that SAT scores must keep having anything very important to do with real, pure academic ability: some students, maybe most students, could, with some additional conditions, do well on the SATs and not have the coveted ability.

Standard, old problem in multidimensional data analysis!
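A toy version of that "fit found under unstated conditions" problem (all numbers invented): fit a line to data gathered while a hidden condition held, then apply the same equation where the condition no longer holds.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hidden condition during data collection: x stays small, and in that
    # regime the true relationship y = x^2 looks almost linear.
    x_train = rng.uniform(0, 1, 200)
    y_train = x_train ** 2 + rng.normal(0, 0.02, 200)

    slope, intercept = np.polyfit(x_train, y_train, 1)   # fit a line

    # In-sample, the fitted line looks tolerable...
    print(slope * 0.5 + intercept, 0.5 ** 2)   # ~0.33 vs 0.25, small error

    # ...but where the hidden condition (x <= 1) no longer holds,
    # the same fitted equation is badly wrong.
    print(slope * 5.0 + intercept, 5.0 ** 2)   # ~4.8 vs 25, huge error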
I like his response regarding strong AI. The idea that it is a moving target might have been expressed before, but I thought he formulated it very well:
"The funny thing about AI is that it’s a moving target. In the seventies, someone might ask “what are the goals of AI?” And you might say, “Oh, we want a computer who can beat a chess master, or who can understand actual language speech, or who can search a whole database very quickly.” We do all that now, like face recognition. All these things that we thought were AI, we can do them. But once you do them, you don’t think of them as AI."
So I think, almost by definition, we will never "have AI": either we never achieve the goals of AI, or once we catch up with them they cease to count as AI.
Never is such a long time. Maybe he is biased as an educator, i.e. he doesn't want to discourage thinking just because a machine might do it for you at some point in the future?
I think you missed the point. He's saying we won't reach the goals of AI because: "it’s a moving target"
For example: "In the seventies, someone might ask “what are the goals of AI?” And you might say, “Oh, we want a computer who can beat a chess master, or who can understand actual language speech, or who can search a whole database very quickly.” We do all that now, like face recognition. All these things that we thought were AI, we can do them."
As far as I know there are no limits on how intelligent something will get. Exploring the limits of AI is like trying to find the biggest possible number.
That's true, but nowadays it's a little more like, "Well, we want a self-improving program with enough intelligence and moral fiber to radically improve human life on a scale not usually conceived of outside the Book of Isaiah."
This may be absurdly ambitious, to the point of ridicule, but it's a concretely-defined target.