So I know people are probably tired of posts that just plug something into ChatGPT, but I think it might be interesting to see how GPT3.5/ChatGPT might handle it.
It got the answer to the hen problem wrong every time. I kept modifying the prompt to let it know it was a tricky problem, to think step by step, to self-critique and adjust, & tried at temperatures of 0, 0.7, and 1 multiple times and after a dozen tries it kept giving either 12, 36 (most common), or 144. (Starting a new session every time.)
> This is a tricky arithmetic word problem. Think step by step. After having an initial answer, critique it, double check it, and change it to the correct answer if necessary:
3 hens lay 3 eggs in 3 days. How many eggs do 12 hens give in 12 days?
Again, show your work and don’t just give the answer right away but reason through the problem.
That is, UNTIL I gave it the specific advice that the article gave, adding this to the beginning of the above prompt (with another fresh session):
>I know that humans are biased for intuitive understanding rather than logical one. Intuitions are shortcuts that are very valuable, but sometimes they are just plainly wrong.
…and then it got the right answer immediately, first try. So this trick isn’t just a neat blog post, but empirically useful, and I’ll probably include it in prompts from now on for these kinds of problems…
(And note this is a blog post from today, so the blog post was not in the training set.)
> So this trick isn’t just a neat blog post, but empirically useful, and I’ll probably include it in prompts from now on for these kinds of problems…
Alternative take: this is a very classic problem used to illustrate this kind of System 1 inaccuracy, so by deliberately including words that increase the conditional probability of us looking at a text that includes discussion on System 1 inaccuracies, you have essentially narrowed down the search space for our dear stochastic parrot to those texts that do contain the right answer.
This sounds dismissive, but I don't mean it that way. I'm just saying that "think about human biases" may not be a universally useful cue, it may only work specifically when trying to get the model to look like a discussion on human biases, and the typical sample problems used to illustrate that.
Very much yes to your first sentence, but I don’t think “just plug something into ChatGPT” is a fair summary of your (very interesting) post, so you’re fine with me.
I’ve seen several statements recently, including one in a talk by a Microsoft researcher working with OpenAI, that claim that all the RLHF safety training introduces human biases and makes the models worse at abstract reasoning. I really wonder why a relatively small amount of human feedback has that effect, while training on this gigantic human-written corpus doesn’t seem to introduce the same biases. That hints at something interesting going on.
I tried GPT-4 via ChatGPT and it nailed it without any prompt engineering.
---
Prompt: 3 hens lay 3 eggs in 3 days. How many eggs do 12 hens give in 12 days?
GPT-4:
Let's first determine how many eggs one hen can lay in 12 days.
We know that 3 hens can lay 3 eggs in 3 days. This means that each hen lays 1 egg in 3 days (3 hens / 3 eggs = 1 hen / 1 egg).
To find out how many eggs one hen can lay in 12 days, we need to multiply the number of eggs laid in 3 days by 4 (since 12 days is 4 times longer than 3 days). So, one hen will lay 1 egg * 4 = 4 eggs in 12 days.
Now, we have 12 hens. To find out how many eggs they can lay in 12 days, we just multiply the number of eggs one hen can lay in 12 days by the number of hens:
12 hens * 4 eggs per hen = 48 eggs in 12 days.
---
To be completely fair, this math problem can be googled, so it is probably in the training set.
Interestingly if you give GPT-4 the problem but append the instruction "answer only, no explanation" it gets it wrong (I got "108" as the answer in that case).
It is fascinating how much it needs to work step by step.
Somehow this answer reminded me of the good advice for starting to tighten a nut: first turn left till you feel the nut drop, then turn right. I feel there is a universal lesson here. Maybe going, as the crow flies, directly toward your goal isn’t always best.
Stupid question: What is the "tricky" part about it? How can someone who isn't completely retarded get this even wrong?
This puzzle is trivial, and especially doesn't require any kind of "thinking around the corner". So what is this about?
That language models can't do math (or logic) and can't even reliably tell which of two given written out numbers is the bigger one is imho a different story.
Language models are great for text. But when you need a tool for math and/or logic you should use an adequate tool. We have for example algebra systems for that. Or proof assistants. Or just good old Prolog. It makes no sense to use the wrong tool and then wonder that the results are terribly wrong.
But for the above "puzzle" you don't even need a calculator. So I really don't get the point.
The trickiness is that it was a part of a set of easy questions. You have 3 same numbers x x x, then change two of them, and ask what's the third y y ? It seems obvious it's y again.
Of course if you have some experience with math (apparently more experience than kids in school have), it's immediately apparent it's a multiplication and the product doesn't scale linearly… But I can imagine myself still answering this incorrectly just because I have no incentive to be careful and I'm in a state of mind to just answer the question quickly. Though while reading the article I solved it, in memory.
> The trickiness is that it was a part of a set of easy questions. You have 3 same numbers x x x, then change two of them, and ask what's the third y y ? It seems obvious it's y again.
OK… That's all?
I start to see the problem.
But I'm not going to write what my conclusion is as this would likely get down-voted into oblivion. (I was actually expecting down-votes already just for asking but still wanted to understand what the issue here is as I honestly don't see any "tricky part" in the "puzzle" at all.)
It is tricky because humans are biased for intuitive understanding rather than logical understanding. Intuitions are shortcuts that are very valuable, but sometimes they are just plainly wrong. Many people will try to apply intuition rather than logic because that works with many other similarly worded questions.
I think it is an issue of language. "3 hens lay 3 eggs in 3 days" can be interpreted as intended (each hen lays an egg only once every three days) or as meaning 3 eggs for each hen over the three days, i.e. 1 egg per hen per day (incidentally closer to what actual commercial hens lay, not that this matters).
In practice "each" is optional if it seems implied from context. I mean, if you say "Our employees work 40 hours/week" it would really be weird to interpret it as anything other than each employee working 40 hours a week rather than all employees teaming up to work 40 hours in total. Same thing with hens teaming up to lay 3 eggs. If you mean that hens lay an egg once every three days then say that and the whole thing is far clearer.
Your example nails it. Now I get where the confusion comes from.
I'm not a native speaker and I guess it's not intuitive for me to insert an implicit "each" into the "puzzle" sentence. For me it sounded completely unambiguous. But if you subtextually "hear" an "each" in this sentence the wrong calculations start to make sense.
I've just tried to translate all the variants of the sentences into German (my primary language) and indeed you would need a "jeweils" in both cases to be clear that you really mean "each". Your example with the employees would sound quite "unclean" without the "jeweils". One would understand it, but it sounds a bit wrong. But in English "Our employees work 40 hours/week" sounds just fine (at least for me).
I guess when I do calculations and "puzzles" in my head I do it in German… :-)
> "Ovulation (release of the yolk from the ovary) occurs every 24 – 26 hours regardless of fertilization (so a rooster is not needed). A hen ovulates a new yolk after the previous egg was laid. It takes 26 hours for an egg to fully form (white and shell added), so a hen will lay an egg later and later each day. Eventually the hen will lay too late in a day for ovulation to be signaled. She will then skip a day or more before laying another egg." (UWisconsin Livestock)
So, the 3 hens must have been at the least productive point in their egg-laying cycle over the initial three-day time period... Now if we have the two-hour daily offset, over 12 days, ummm, maybe two or three days skipped per hen? So, ah, 108-120 eggs is what the farmer could expect from the 12 hens?
I remember a story where a school teacher was showing plastic animals and asking kids what each one was. At some point all the kids wrote "cow" except for one farm boy who wrote "goat". Turns out the plastic cow had the wrong number of teats. Cows have four, goats have two.
I read it as the students being familiar with a similar "trick" question and erroneously pattern matching on that.
Another similar riddle goes like:
"If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?"
The correct answer there being "5 minutes", and the intuitive wrong answer being "100" as per your post.
That would make sense as to why the teacher also expected "3", since they should be familiar with the existence of their non-intuitive questions, even if they misremember the specific non-intuition.
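For anyone who wants to check the widget riddle mechanically, here is a minimal sketch. It assumes every machine works independently at the same constant rate; the function name `time_needed` is made up for illustration.

```python
from fractions import Fraction

def time_needed(machines, minutes, widgets, new_machines, new_widgets):
    """Minutes for new_machines to make new_widgets at a constant per-machine rate."""
    # Rate of a single machine, in widgets per minute (assumed constant).
    rate = Fraction(widgets, machines * minutes)
    # Total output per minute scales with the number of machines.
    return Fraction(new_widgets) / (rate * new_machines)

print(time_needed(5, 5, 5, 100, 100))  # 5
```

The per-machine rate is 1/5 widget per minute, so 100 machines produce 20 widgets per minute and need 5 minutes for 100 widgets.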
It took me surprisingly long to convince myself the author was correct: the answer is 48 because there's 4x as many hens and 4x as many days: 3 x 4 x 4 = 48. The "wrong" way to solve this problem is to compute hen-laying in terms of hen-egg-days, and then scaling.
I dunno, I immediately noted that 12 hens must lay 12 eggs in 3 days (it takes 3 days for a hen to lay an egg). Then I just asked myself how many 3-days are in 12-days--okay, 12 [eggs] * 4 [3-day egg periods] = 48...
I didn't need to resort to pen and paper, but I'm also not in primary school.
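The same reasoning as a small sketch, assuming a constant laying rate per hen per day (the function name `eggs_laid` is hypothetical). The second call shows the "each hen lays 3 eggs" reading discussed elsewhere in the thread, which is where 144 comes from.

```python
from fractions import Fraction

def eggs_laid(hens, days, eggs, new_hens, new_days):
    """Eggs produced by new_hens over new_days at the rate implied by the first three numbers."""
    # Eggs per hen per day (assumed constant).
    rate = Fraction(eggs, hens * days)
    return rate * new_hens * new_days

print(eggs_laid(3, 3, 3, 12, 12))  # 48: three hens share 3 eggs in 3 days
print(eggs_laid(1, 3, 3, 12, 12))  # 144: each hen lays 3 eggs in 3 days
```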
He misremembered the third task as well, since his solution to it, as stated, is wrong. If you aren't given that every card has a number on one side and a color on the other, then you need to check everything which does not satisfy the consequent.
The rules were clear before, what has changed is that they are now different rules. You were just wrong before.
You've still got a problem though, now when you ask the reader to verify the rule, you have to explicitly state that they're not trying to verify the first rule.
Can't pass by without making a public service announcement about the average speed example: don't compute integrals using the calculus definition of Sum(f(x_i) * delta). Look up quadrature or numerical integration methods instead.
Although it is true that you can approximate the average speed by taking an average of instantaneous speed measurements, that's usually a very bad way to do it in any real world situation. Numerical difference values are always noisier than the underlying quantity, sometimes to the point of being unusable, so of course if you can just read off the quantity you want directly (total difference over time), you should do that. But even if you can't, you should use a proper integration method instead of the calculus definition.
I have seen the Sum(f(x_i) * delta) calculation in a lot of real-world code. It has bad convergence properties, bad errors when the function has large derivatives, and bad performance when the data has noise. Some of the code I've seen produces garbage results, or has thousands of function evaluations when you need, like, four. "Quadrature? I think I heard that before, but I don't remember what it means."
In summary, please don't compute derivatives as (f(x_i+1)-f(x_i))/delta, or compute integrals as Sum(f(x_i) * delta), and especially, please don't do the first immediately followed by the second. Which also happens. Look up numerical methods instead.
Isn't it the case that other numerical integration methods only work if you have a f(x) that you can evaluate for any x you want (albeit possibly costly)?
It seems to me that in many practical applications, the only thing you have to work with is samples at discrete moments in time. It certainly seems to be the case here: "I would measure car's speed at every instant and produce an average of those measurements." We only know f(t_0), f(t_1), f(t_2), ... (and if we're lucky t_1-t_0 = t_2-t_1 = t_3-t_2 and so on); we have no way to compute things like f((t_0 + t_1)/2). In that case, how can we improve our calculation?
Even if you're limited to uniform sampling, something as simple as the trapezoid rule will give you quadratic convergence instead of linear for the naive Sum(f(t_i) * delta). In other words, error proportional to 1/n^2, instead of 1/n, where n is the number of samples, which is going to be a huge difference. There are many methods depending on the constraints of your problem - your ability to choose sampling intervals, knowledge of the bounds of your function or its derivatives, etc. The PSA is to study these things, instead of just writing the first thing that seems familiar from a long-ago calculus class.
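A minimal sketch of that convergence claim, using only uniformly spaced samples. The test function (exp on [0, 1]) and the helper names are my own choices for illustration, not anything from the article.

```python
import math

def left_riemann(samples, dt):
    # Naive Sum(f(t_i) * delta): every sample except the last gets weight dt.
    return sum(samples[:-1]) * dt

def trapezoid(samples, dt):
    # Same samples, but the two end points only get half weight.
    return (sum(samples) - 0.5 * (samples[0] + samples[-1])) * dt

def errors(n):
    """Absolute errors of both rules for exp on [0, 1] with n uniform intervals."""
    dt = 1.0 / n
    samples = [math.exp(i * dt) for i in range(n + 1)]
    exact = math.e - 1.0
    return (abs(left_riemann(samples, dt) - exact),
            abs(trapezoid(samples, dt) - exact))

for n in (10, 100, 1000):
    riemann_err, trap_err = errors(n)
    print(f"n={n:5d}  riemann={riemann_err:.2e}  trapezoid={trap_err:.2e}")
```

Each tenfold increase in n should shrink the Riemann error by roughly 10x and the trapezoid error by roughly 100x, for the price of changing two weights.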
Huh? Quadrature is a general term for "measuring area". In this context it's a synonym for integration.
I think you are trying to say that it's better to do weighted sums of fewer samples, instead of a uniformly weighted Riemann sum. Both are "calculus definition" integration, of course, since calculus is true.
When mathematicians say Quadrature, they mean that if your function is suitably approximated by projecting onto some orthogonal basis functions, you can get very cheap approximations by cleverly expressing those integrals exactly as a linear combination of their values at certain points along the interval. You need very few.
> In numerical analysis, a quadrature rule is an approximation of the definite integral of a function, usually stated as a weighted sum of function values at specified points within the domain of integration
Simpson's rule, taught in first year calculus, is exact for cubics, with 3 sample points.
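A quick check of the "exact for cubics" claim, as a sketch. The particular cubic and interval are arbitrary picks of mine.

```python
import math

def simpson(f, a, b):
    """Simpson's rule on [a, b] using three points: both ends and the midpoint."""
    m = (a + b) / 2
    return (b - a) / 6 * (f(a) + 4 * f(m) + f(b))

def cubic(x):
    return 2 * x**3 - x**2 + 3 * x + 5

def cubic_antiderivative(x):
    return x**4 / 2 - x**3 / 3 + 3 * x**2 / 2 + 5 * x

approx = simpson(cubic, 0.0, 2.0)
exact = cubic_antiderivative(2.0) - cubic_antiderivative(0.0)
print(approx, exact)  # both are 64/3, up to floating-point rounding
```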
This calculation amplifies any noise present in the values of the function, often to the point of the output being unusable. There are many methods that can be used to approximate derivatives, depending on the problem. Just as we shouldn't try to invent our cryptographic methods from scratch, we should take advantage of the extensive knowledge already in use for numeric methods.
I've seen naive numeric methods cause everything from jerky motion in video games to incorrect navigation data for cars.
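To make the noise amplification concrete, here is a rough sketch. The signal (a sine wave), the noise level, and the sample spacing are all assumptions chosen for illustration.

```python
import math
import random

random.seed(0)   # fixed seed so the run is repeatable
dt = 0.01        # sample spacing (assumed)
sigma = 0.01     # standard deviation of the measurement noise (assumed)

ts = [i * dt for i in range(1000)]
clean = [math.sin(t) for t in ts]
noisy = [y + random.gauss(0.0, sigma) for y in clean]

def forward_diff(ys, dt):
    # The naive (f(x_{i+1}) - f(x_i)) / delta estimate.
    return [(ys[i + 1] - ys[i]) / dt for i in range(len(ys) - 1)]

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

true_deriv = [math.cos(t) for t in ts[:-1]]
value_err = rms([n - c for n, c in zip(noisy, clean)])
deriv_err = rms([d - g for d, g in zip(forward_diff(noisy, dt), true_deriv)])

print(f"RMS error of the samples:    {value_err:.4f}")
print(f"RMS error of the derivative: {deriv_err:.4f}")
```

Dividing a difference of two noisy samples by a small dt scales the noise by roughly sqrt(2)/dt, so the derivative error here should come out about two orders of magnitude larger than the sample error.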
So what would you suggest as a general calculation for finite differences, especially in those cases when only forward differentiation is possible, e.g. with respect to time?
The closest thing to a universal approach would be a Kalman filter. It's usually where you start when you have noisy measurements coming in, and you need to maintain state such as value and derivative.
Since the original question was about computing the velocity of a car, and since I work in the automotive field, let's take a real example: you want to know the approximate position, acceleration, and velocity (linear and angular) of your car. Your inputs are driven wheel speed (noisy, affected by wheelspin), non-driven wheel speed (noisy), accelerometer output (inaccurate, only present for some axes), GPS position (updated occasionally, has errors), and steering angle (pretty accurate, can be put into a chassis dynamics model). Almost certainly, you would use a Kalman filter to estimate the state of the car. Naive approaches such as subtracting two wheel speed values to obtain acceleration will not work well.
My point is that we should remember that numerical algorithms are a developed field with a lot of knowledge, and we should take advantage of the proven approaches. Sometimes, programmers who are not specifically from the physics or numerical fields, and who need to perform some computation, reach for a very simple approach such as the rectangle-rule integrals, and get bad results.
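For the curious, a heavily simplified sketch of the idea: a two-state (position, velocity) constant-velocity Kalman filter fed only noisy position samples, compared with the naive difference. This is nowhere near the multi-sensor car filter described above; the noise parameters `meas_var` and `accel_var`, the initial covariance, and the simulated data are all assumptions for illustration.

```python
import math
import random

def kalman_velocity(measurements, dt, meas_var, accel_var):
    """Velocity estimates from noisy position samples (constant-velocity model)."""
    pos, vel = measurements[0], 0.0
    # Covariance P = [[p00, p01], [p01, p11]]; velocity starts very uncertain.
    p00, p01, p11 = meas_var, 0.0, 100.0
    # Process noise Q from an assumed random acceleration of variance accel_var.
    q00 = accel_var * dt**4 / 4
    q01 = accel_var * dt**3 / 2
    q11 = accel_var * dt**2
    estimates = []
    for z in measurements[1:]:
        # Predict: x = F x, P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        pos += dt * vel
        p00 = p00 + 2 * dt * p01 + dt * dt * p11 + q00
        p01 = p01 + dt * p11 + q01
        p11 = p11 + q11
        # Update with the position measurement z (H = [1, 0]).
        s = p00 + meas_var
        k0, k1 = p00 / s, p01 / s
        residual = z - pos
        pos += k0 * residual
        vel += k1 * residual
        # P = (I - K H) P
        p00, p01, p11 = (1 - k0) * p00, (1 - k0) * p01, p11 - k1 * p01
        estimates.append(vel)
    return estimates

random.seed(1)
dt, sigma, true_v = 0.1, 1.0, 10.0
zs = [true_v * i * dt + random.gauss(0.0, sigma) for i in range(300)]
kf = kalman_velocity(zs, dt, meas_var=sigma**2, accel_var=0.01)
naive = [(zs[i + 1] - zs[i]) / dt for i in range(len(zs) - 1)]

def rms_error(vs):
    tail = vs[100:]  # skip the start-up transient
    return math.sqrt(sum((v - true_v) ** 2 for v in tail) / len(tail))

print(f"naive difference RMS error: {rms_error(naive):.2f} m/s")
print(f"Kalman filter RMS error:    {rms_error(kf):.2f} m/s")
```

The filter trades a short start-up transient for a much smoother velocity estimate; tuning `accel_var` controls how quickly it follows real changes in speed.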
I see -- we are talking about two different things!
You are working on the problem of figuring out the hidden state based on noisy observations and a transition model.
I interpreted your statement much more broadly, so I was trying to discuss the problem of computing the explicit next state based on perfect knowledge of the transition model, in which case (y_1-y)/dt is a perfectly viable approach to estimate the derivative.
(You can do better by adding higher-order terms of course, but I haven't found that to be universally useful compared to making dt smaller.)
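A small sketch of what "adding higher-order terms" can look like when only forward samples are available: the plain forward difference next to a standard three-point one-sided formula. The noise-free test function (sin) and the step sizes are my own choices.

```python
import math

def forward_diff(f, x, h):
    # First-order: error shrinks proportionally to h.
    return (f(x + h) - f(x)) / h

def forward_diff_2nd_order(f, x, h):
    # Three-point one-sided formula: error shrinks proportionally to h**2.
    return (-3 * f(x) + 4 * f(x + h) - f(x + 2 * h)) / (2 * h)

x = 1.0
exact = math.cos(x)
for h in (1e-1, 1e-2, 1e-3):
    e1 = abs(forward_diff(math.sin, x, h) - exact)
    e2 = abs(forward_diff_2nd_order(math.sin, x, h) - exact)
    print(f"h={h:.0e}  first-order={e1:.2e}  second-order={e2:.2e}")
```

With exact function values the second-order formula wins clearly; with noisy samples both formulas amplify the noise, which is the other half of this discussion.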
> task is presented in context of social relations (or more broadly "evolutionarily familiar problems"), people tend to get it immediatelly, or rather intuitivelly. Bonkers, right?
That's obvious, not bonkers - if you have experience in some class of problems, it's easier to solve the same kind of problem, the patterns match in your head, so the solution pattern is used right away.
Just like you can solve these numeric problems quickly if you're trained on them
But the task is exactly the same in both situations – only the symbols on the cards have less/more familiar meaning. I see this as being not able to generalize from known examples.
It's not exactly the same exactly because the difference in symbols maps to different patterns - one is a pattern from school, which you forgot since it hasn't been used for many years (or never learned properly).
Another is a pattern that keeps getting refreshed throughout your life.
It's a very different cognitive workload, leading to a difference in required effort/correctness, especially in these simplistic low-value tests.
And why is your assessment of the ability to generalize limited to only this math representation which people don't use?
You could see that mostly because the first example does everything as that's what everyone knows, the second is similar and solidifies the pattern, which you could then match to the rest
The question is, what does it tell you when people struggle with Roman LCX*ML more vs the more familiar first pattern besides the fact that training/familiarity makes it easier?
Oh, I see your point now. Multiplication in Roman numerals is strikingly different (and much harder) than addition. So representation of the pattern might be so alien that it's unrecognizable even though the pattern is known.
I question whether it is a matter of not being able to generalize from known examples, as I suspect that most of the people who do well on the corresponding social questions did not know the answer beforehand, but nevertheless found their way to the correct one.
The article you link to seems to support this view: "Cosmides and Tooby argued that experimenters have ruled out alternative explanations, such as that people learn the rules of social exchange through practice and find it easier to apply these familiar rules than less-familiar rules."
I now wonder if people tend to do better on abstract Wason tasks after being prompted by logically-equivalent social-relations questions, as opposed to being prompted with additional abstract problems (without being told whether they gave the correct responses to the prompts.)
My guess at what might be underlying these outcomes is that people clearly see the purpose of the social rules but not of the abstract ones, and our intuition is attuned to assessing how various social scenarios will play out, taking into account the motives of the actors (this might seem to be the same as what Cosmides and Tooby claim has been ruled out, and maybe it is, but I think what they are ruling out is us having learned various schema to be applied mechanically in the corresponding social situations.)
Alternatively, could it be something as simple as this: having an understanding of the social rule's purpose helps the subjects avoid confusing conditional and biconditional rules? At least in the English language, acceptable usage can be quite ambiguous in this regard.
Wait, they're not - at least, in the example of the article. Allow me to digress:
There is a figure with 4 cards, showing: all orange, all blue, the number 2, the number 1.
The only condition is that an even number must have an all blue flipside. This condition is either irrelevant or satisfied for the blue card. For all other cards, we don't know. The card with "1" may have the number "4" on the flip side. So we need to check all other cards.
TL;DR: only a blue face guarantees compliance with the rule, all other cards need to be flipped.
In the second case, the unstated assumption that the cards all have one side with digits and one side with symbols is still unstated, but a lot stronger. (These are things to check for age-alcohol compliance, so logically they must contain both age and alcoholic state of consumption.)
> The only condition is that an even number must have an all blue flipside. This condition is either irrelevant or satisfied for the blue card. For all other cards, we don't know. The card with "1" may have the number "4" on the flip side. So we need to check all other cards.
Very good point, thank you. I have added an explicit statement that cards are consistent.
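The difference the consistency assumption makes can be spelled out in a short sketch. The function name and the `cards_are_consistent` flag are hypothetical; "consistent" here means every card has a number on one side and a color on the other.

```python
def cards_to_flip(visible_faces, cards_are_consistent):
    """Faces that must be turned over to test: 'an even number has a blue flip side'."""
    to_flip = []
    for face in visible_faces:
        is_number = isinstance(face, int)
        if is_number and face % 2 == 0:
            to_flip.append(face)  # the hidden side must be blue
        elif not is_number and face != "blue":
            to_flip.append(face)  # the hidden side must not be an even number
        elif is_number and not cards_are_consistent:
            to_flip.append(face)  # an odd number could be hiding an even number
    return to_flip

faces = ["orange", "blue", 2, 1]
print(cards_to_flip(faces, cards_are_consistent=True))   # ['orange', 2]
print(cards_to_flip(faces, cards_are_consistent=False))  # ['orange', 2, 1]
```

The blue card never needs flipping: whatever is on its other side, the rule is either satisfied or does not apply.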
This argument is invalid. These two things are easy to pattern match, but every theorem comes with an obligation to show that your current state matches your predicate, which is very hard, generally.
Could it be explained in terms of abstract vs concrete? I've heard that we humans can reason about concrete things like beer better than abstract concepts like numbers and colors? (I know that age is also a number, but it has more assigned meaning)
I think it could as dealing with concrete things is a direct evolutionary advantage grounded in reality. Usually, we are shit dealing with abstract concepts. A quote that beautifully demonstrates this:
"To deal with hyper-planes in 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everybody does it." — Geoffrey Hinton
I don’t think that’s a remotely fair characterization of the author’s self-evaluation. A fair bit of the article is spent expressing embarrassment, presuming they were in fact wrong, and surprise at discovering that presumption was their error.
What you’re describing would be an entirely different post, likely making a very different point. There’s value in exploring “sometimes your idiosyncratic intuitions turn out to be right”, without muddying that with the obvious intuition that this won’t always be the case. There’s value in exploring the idea through demonstrative anecdote without counter examples. If anything, counter examples might influence the kind of pattern matching biases which lead to such faulty intuitions.
Anyway, I think the author has demonstrated that they’re fallible and happy to be corrected. They’ve acknowledged more concrete problems with the post and made changes to address them at least a couple times.
> I think it would come across better if the author mixed times where their non-traditional intuition gave them the wrong answer.
Oh, there are a lot of instances where I was biased and my actions were dead wrong. (example story: https://mlajtos.mu/posts/i-told-my-crush-her-leaked-password ) This post was about rare moments when I was right. I'll try to write about my failures more often, as the source material is basically endless.
Especially because the post has several mistakes that commenters here corrected.
Including a case where he failed to understand the Fundamental Theorem of Calculus, and proposed an infinitely more inefficient solution:
> We were just presented with integral calculus and there it was – you can get an average of some function with definitive integral.
That is literally measuring car's speed at every instant!
I'm always amazed that people can get so hung up on events that happen in elementary schools. This piece feels like it could be posted to /r/iamverysmart.
> I'm always amazed that people can get so hung up on events that happen in elementary schools. This piece feels like it could be posted to /r/iamverysmart.
Ages were 10, 15 & ~23 respectively. Yes, the post suffers from selection bias. I will write more about my failures in the future.
And as per your request, I went and tried to post it to /r/iamverysmart, but only screenshot submissions are allowed and a screenshot of the whole post wouldn't be very effective.
There's a certain set of people who, by virtue of genetics, upbringing, or both, feel very strongly:
1. You should do your best to please authority figures.
2. You should do your best to be correct.
It's really distressing the first time they run into a clear example where those two deep-seated needs are in conflict. For a lot of gifted teacher's pet kind of kids, this is their Kobayashi Maru moment.
I love these kind of logic puzzles, if you're up for losing an evening the clips game [0] is one of my long time favourite ones. It starts easy but puzzles 6 and 7 are tricky to do within the assertion limit
It's because the puzzle has several facts that are relevant in the real world, but irrelevant in the puzzle, and the solution to the puzzle is a fact that is false in the real world. It's a terrible tutorial example.
I'm not very good at this but I'll see if I can explain my thought process:
(is tiger fed) requires (eats tiger ?food)
which means there must exist ?what such that (is ?what eatable) and (is ?what ?food)
But in the fact table, nothing is actually eatable! We could have our tiger eat plants by asserting (eats tiger plants) and (is grass eatable), but that's two assertions. Tigers eat animals, and tigers ARE animals, so since we need to make something eatable we might as well do it on the only animals we have: (is tiger eatable)
I got level 2 as well but only after checking for hints in the source. I have been thoroughly nerd-sniped.
No, you don't. You still only have to check those 2 cards. If you're asking about card 3 then that's the common mistake – it doesn't matter whether that card has a number or an emoji on the other side, because the only logical test here is that if it has an even number on one side, the other side must be blue.
It doesn't matter what blue cards have on the other side, we don't need to test them.
edit: sorry, I just realised you're referring to card 1 and not 3, yeah I suppose you'd need to check it too unless there's a rule about always needing to match a color to a number.
> The teacher was perplexed of why I said such wrong answer to such simple question.
This is extremely toxic on many levels. A teacher who doesn't know their subject matter or at least can't be bothered to refresh before a lesson. A teacher who submits to social pressure from their students when determining the answer. A teacher who shames students for being wrong. A teacher who shames students for being right!
The students who were wrong initially will, thanks to this teacher, have completely failed to learn the lesson the problem was teaching. They will remember only that they got the problem right (the teacher said so) and only due to some technicality in the answer book were they subjected to an embarrassing "lesson" from the smart kid.
Yes, this is making a mountain of a molehill, but molehills like this can have mountainous impacts on society.
I mean, there’s a lot going wrong with this “lesson.” The exercise is pretty clearly designed to trick people, which is very counterproductive if the goal is to teach people about multiplication or fractions or whatever. It’s really not that damning that a teacher would be fooled into skipping the actual work, since that deceit is clearly the entire point of choosing those numbers.
It’s a bit like if a proofreading exercise in an English class deliberately did that trick where you put two “the”s next to each other but across a line break. That will often fool a skilled English writer, but that says nothing about the writer’s skill.
While the teacher in the story obviously bungled it, I disagree with your interpretation of the lesson. The point is not to "teach people about multiplication or fractions or whatever". The point is to teach them to recognize dimensional parameters in a natural scenario. This can be counter-intuitive; we have a bias towards linear relationships. Somebody twice as tall is twice as heavy, thinks the child.
Tricks ("surprises" is a less negative descriptor) are important because they wake your brain up. Most people don't like learning arbitrary things for their own sake, but they don't like being fooled either.
Describe the base rate fallacy to a room of college students and it might stick for the rest of the semester, but give them this problem[0] and tell them how many licensed doctors it tricked (after you let them answer incorrectly, of course) and they might remember it for a while longer.
If the point of the lesson truly was to identify that the exercise was chosen deliberately to be deceptive, then sure. But I doubt it. Your example is very different, because there’s no obvious answer that one might spit out instantly by doing naive pattern matching.
The fact that there is an obvious but incorrect answer that you might spit out if you were just pattern matching is the whole point. It tricks you into being wrong, which "tricks" you into wanting to know why, where before you might not have cared about the answer or the underlying reasoning at all. You think the question just accidentally picked tricky numbers? Or you think the reason that it picked them was just malice towards children?
> You think the question just accidentally picked tricky numbers? Or you think the reason that it picked them was just malice towards children?
I think it's probably just abysmal pedagogy. This should be pretty clear given that almost the entire class and even the teacher fell for the trick, which shows that it's a terrible way to teach something and even a terrible way to test if they know how to do the thing you're wanting them to learn. The teacher and many of the children probably did in fact know how to do the calculation and solve the problem, and would have done so if the numbers weren't very deliberately chosen to invoke a hasty pattern-matching reaction.
So which is the correct way to choose parameters for a problem?
1. Deliberately choose numbers that have no notable relationships with each other that could mislead students.
2. Randomize parameters in an effort to introduce only as many trick questions as would occur randomly in nature (so, given small-valued problems like this one, maybe one question in a hundred will have tricky parameters by chance).
3. Deliberately choose parameters that are somewhat or even maximally tricky to ensure students' understanding is robust in the face of misdirection.
4. A deliberate mix of 1 and 3, with easier questions to start and trickier questions always included but never too often and never too early.
You seem to be advocating for #1, or possibly #2 (though I get the impression that if #2 were employed and trick parameters arose by chance on a quiz you would complain, bringing us back to #1). #4 seems like the best strategy to me, with an extra helping of tricks for my kids, please. It's not like you learn less when you get tricked. Quite the opposite!
It's not nearly as complicated as all that. Just don't deliberately try to trigger a misleading pattern match. Almost any conceivable way you would naturally come up with a math exercise like this would not end up fooling people into giving an immediate incorrect answer. You have to go out of your way (or get very unlucky if you're choosing random parameters) to construct an exercise that will fool a large portion of people.
> You have to go out of your way (or get very unlucky if you're choosing random parameters) to construct an exercise that will fool a large portion of people.
That's not remotely true. You're generally picking small integers, here, and frequently dealing with problems that have very few parameters.
"Species of birds in the genus Madeupicus all have the same proportions. Adult birds of one species in the genus have a wingspan of one meter and weigh one kilogram. How much do birds of a species in the genus with a two meter wingspan weigh?"
This is a very similar problem to the one in the original example. It has the same trick, which is that the integers on either side of the equation are the same in the first example, but they will not be the same in the second equation due to dimensional scaling.
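To make the scaling concrete, here is a rough sketch. It assumes weight is proportional to volume (same density) and that, with fixed proportions, volume grows with the cube of any linear dimension. The function name is made up for illustration.

```python
def scaled_weight(base_weight_kg, base_wingspan_m, new_wingspan_m):
    # Assumption: identical proportions and density, so weight ~ length cubed.
    scale = new_wingspan_m / base_wingspan_m
    return base_weight_kg * scale ** 3

# Doubling the wingspan multiplies the weight by 2**3, not by 2.
print(scaled_weight(1, 1, 2))  # 8.0
```

So under those assumptions the two-meter bird weighs about eight kilograms, not two.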
Students do hundreds of math problems per week. Randomly-selected small integers will collide frequently.
And that's all ignoring the benefits of surprising people, which you still tacitly deny. What are you optimizing your questions for, "fairness"? "Predictability"? That sounds like a great way to ensure as many people as possible who didn't actually understand the subject matter will pass your tests.
> That is literally measuring the car's speed at every instant!
Well, there's no way to measure speed instantaneously; it's always over some period, even if a small one (take some time interval and see how much distance has been covered). For example, at time zero you cannot know what speed the car has.
It's similar to when I was working with audio, producing a frequency diagram: the frequency cannot be obtained from a single sample of the waveform; you always need some interval, like 20 ms, to compute an FT.
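A small sketch of why the interval matters for the audio case, assuming a plain discrete Fourier transform over one window. The 48 kHz sample rate is just an example value, and the function name is hypothetical.

```python
def bin_spacing_hz(sample_rate_hz, window_ms):
    # Number of samples that fit in the window (integer arithmetic).
    n_samples = sample_rate_hz * window_ms // 1000
    # DFT bins are spaced sample_rate / N apart, i.e. 1 / window duration.
    return sample_rate_hz / n_samples

print(bin_spacing_hz(48_000, 20))   # 50.0
print(bin_spacing_hz(48_000, 100))  # 10.0
```

With a 20 ms window, frequencies closer than roughly 50 Hz land in the same bin, and a longer window is needed to tell them apart. A single sample gives no frequency information at all.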
1. I had a similar issue in the fifth grade of elementary school (in my country, elementary school is 6 years with an extra year at age about 5 before pre-elementary school, then three years of middle high, then three years of optional high school before university, but almost everyone does the full 12/13 years.), when we discussed Fahrenheit (foreign) vs Celsius (domestic). The teacher said that 50°C is 106°F because 212°F is 100°C and 212/2 = 106. At the time I said “this doesn't feel right, 106 is not in the middle between 32 and 212” but I couldn't phrase it mathematically yet, and the teacher didn't take my objection in a good manner. Afterwards I learnt that he understood his error but never said a word to me about it.
2. I don't see the colours on the “back sides” of the cards. I just see the cards flipped/rotated 180° around the Y-axis (probably through CSS), so all numbers are visible, just in cards 3 & 4 the numbers are mirrored.
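On the temperature point (1) above, a quick check with the standard conversion formula F = C × 9/5 + 32 shows where halving 212 goes wrong. The function name is just for illustration.

```python
def celsius_to_fahrenheit(celsius):
    # Standard conversion: the scales differ by a factor AND an offset of 32.
    return celsius * 9 / 5 + 32

print(celsius_to_fahrenheit(100))  # 212.0
print(celsius_to_fahrenheit(50))   # 122.0
print((32 + 212) / 2)              # 122.0, the midpoint between 0°C and 100°C in °F
```

Halving only works for the difference from the freezing point, which is why the answer sits in the middle between 32 and 212.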
2. Firefox user, right? It should be fixed now. Thank you for reporting this issue.
1. You reminded me of another nice example of wrong shortcuts. The average speed of a car from city A to city B is 120km/h. The average speed on the way back (B -> A) is 80km/h. What was the average speed of the car over the whole trip (A -> B -> A)? It's not 100km/h.
To be pedantic, 100km/h is the arithmetic mean. Of course this is the wrong tool to use (which is your point) but 100km/h is the correct answer if by mean you intend to say arithmetic mean.
How do you arrive at 100km/h if you're averaging over distance?
I don't doubt you're right, but I don't see it. My calculation is as follows: suppose the distance between both cities is 120km. The trip A -> B thus takes exactly one hour. The trip B -> A takes 1.5 hours (120km / 80km/h) = 1.5h. The total trip therefore takes 2.5h, the total distance is 240km, which averages at 96km/h (240/(1 + 1.5)).
A simple (V1 + V2)/2 is surely possible, but I don't see the value of that average, I can't think of a situation where it would be useful.
I think you are thinking about a case where someone goes 120km/h for 1 hour and then 80km/h for 1 hour instead of going over a fixed distance at 120km/h and then going over the same distance at 80km/h.
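Both readings can be put side by side in a few lines. This is only a sketch; the 120 km distance and the one-hour legs are the illustrative values used in the comments above, and the function names are made up.

```python
def average_speed_equal_distance(v_out, v_back, distance_km=120):
    # Same distance each way: total distance divided by total time.
    total_time_h = distance_km / v_out + distance_km / v_back
    return 2 * distance_km / total_time_h

def average_speed_equal_time(v1, v2, hours_each=1):
    # Same duration at each speed: here the arithmetic mean is correct.
    total_distance_km = v1 * hours_each + v2 * hours_each
    return total_distance_km / (2 * hours_each)

print(average_speed_equal_distance(120, 80))  # 96.0
print(average_speed_equal_time(120, 80))      # 100.0
```

The equal-distance result does not depend on the distance chosen; it is the harmonic mean of the two speeds.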
Sorry, unrelated to the writing (which I enjoyed), but related to the horizontal scrollbar! If you change w-screen to w-full on both the navbar and footer, it goes away. 100vw -> 100%. Just a tiny enhancement to an otherwise good looking website.
Why? Because if three hens laid three eggs in three days, then it means one hen lays one egg per day - so twelve hens, given twelve days, will lay twelve eggs. That's the immediate reasoning, right?
It's very easy to fall into that trap, and miss the fact that two of the variables were changed.
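For reference, the rate-based calculation that avoids the trap, as a small sketch. It assumes every hen lays at the same constant rate, and the function and parameter names are invented for this example.

```python
from fractions import Fraction

def eggs_laid(hens, days, ref_hens=3, ref_eggs=3, ref_days=3):
    # Rate in eggs per hen per day: 3 eggs / (3 hens * 3 days) = 1/3.
    rate = Fraction(ref_eggs, ref_hens * ref_days)
    return rate * hens * days

print(eggs_laid(3, 3))    # 3
print(eggs_laid(12, 12))  # 48
```

Both the number of hens and the number of days are multiplied by four, so the egg count goes up by a factor of sixteen: 3 × 16 = 48.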
Interesting. So we can tell from the last example that humans are perfectly good at doing this reasoning in theory and that therefore the trouble is actually in interpreting what the question is asking.
Put another way, if the question could be beamed into someone's head they'd get it right. The flaw is in the part of the mind that loads the words into their brain and not the reasoning after that.
I don’t think this is the most accurate way to describe it. It’s considering the mind as separate from the brain. There’s also no true version of the problem to “beam” into the brain. That would just be a third version.
Our brains aren’t purely general purpose problem solvers. There are parts of the brain that are specifically related to facial recognition versus all images. We’re better at recognizing differences in faces versus buildings.
There seems to be a similar “hardware acceleration” for problems about people following rules (can’t remember the source for this). People generally do better on problems if you can frame them as who is breaking a rule, who is lying, etc. I think the last example would still be easier to solve on average even if it was a rule unfamiliar to people.
I think it shows that we suck at generalization. Even though the task is the same, we are not able to see the same pattern in an arbitrary framing of the same problem. Beaming directly might help.
I didn't get why he deprecated distance/time as a direct way of determining average velocity.
Is this pedantry about "average" vs. "mean"? As I understand it, "average" is a vague word, one of whose meanings is "mean". And I imagine most of the other meanings would have distance/time as their value as well.
Another tricky question is the one about 3 doors, two with goats behind them, and the 3rd one is hiding the prize. The host opens door #1 with a goat behind it, the doors #2 and #3 are still closed. Now what's the chance of the door #2 hiding the prize behind it? The intuitive answer is 50%, but it's wrong.
You described a slightly wrong Monty Hall problem [0]. When I explain the solution to this problem, I always just add more doors. 1 million doors, 1 car, 999999 goats. You pick one door, and I show you 999998 goats. Only two doors are left. What is the chance you picked the car on the first guess?
That's not how it goes. And, yes, the answer is 50% in this case.
The problem relies on first choosing a door (1). The host then opens a door that you did NOT choose (3) and asks if you want to switch your choice to door #2. In this case, the probability that the goat is behind door 2 is not 50%.
That's still 50%, though. Now, if you had decided to open door #2 before the host opened #1, you may increase your chances by changing to door #3.
The chances of the prize being behind either door are exactly 50/50; it's the chance that you made the right choice in the beginning that is only 33%. It is called the Monty Hall PARADOX for a reason.
The answer is 0 because the prize is behind the third door #3, but I feel you're either missing a couple of details or you're trying to mess with people who know Monty-Hall.
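Since the thread disagrees on the numbers, here is an exact enumeration of the standard version as a sketch. It assumes the usual rules: the player picks a door first, the host knows where the prize is and always opens an unpicked door with a goat, choosing at random when two such doors are available. It does not cover the variant where no door is picked beforehand.

```python
from fractions import Fraction

DOORS = range(3)

def win_probabilities():
    stay = Fraction(0)
    switch = Fraction(0)
    for prize in DOORS:
        for pick in DOORS:
            # The host may only open an unpicked door that hides a goat.
            host_options = [d for d in DOORS if d != pick and d != prize]
            for opened in host_options:
                # Prize and pick are uniform (1/9); host choice is uniform too.
                p = Fraction(1, 9) / len(host_options)
                remaining = next(d for d in DOORS if d not in (pick, opened))
                if pick == prize:
                    stay += p
                if remaining == prize:
                    switch += p
    return stay, switch

print(*win_probabilities())  # 1/3 2/3
```

Under these rules, sticking with the first pick wins a third of the time and switching wins two thirds of the time.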