There is a long history of people assuming humans are special and better than both animals and technology. With animals, people genuinely believed they couldn't feel pain, and never even considered the ways in which they might be cognitively ahead of humans. Technology often follows the path from "working, but worse than a manual alternative" to "significantly better than any previous alternative", despite naysayers insisting that beating the manual alternative is literally impossible.
LLMs are different from humans, but they also reason and make mistakes in the most human way of any technology I am aware of. Asking yourself the question "how would a human respond to this prompt if they had to type it out without ever going back to edit it?" seems very effective to me. Sometimes thinking about LLMs (as a model / with a focus on how they are trained) explains behavior, but the anthropomorphism seems like it is more effective at actually predicting behavior.
The first iteration vectorized with numpy is the best solution imho. The only additional optimization is using the fact that a number is congruent to its digit sum mod 9; that lets you filter the array down to roughly 1/9th of the numbers up front. The digit summing is the slow part, so reducing the number of values it has to touch gives a large speedup. Numpy can do that filter pretty fast as `arr = arr[arr%9==3]`.
With that optimization it's about 3 times faster, and all of the non-numpy solutions are slower than the numpy one. In Python it almost never makes sense to manually iterate for speed.
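To make that concrete, here is a minimal sketch of the prefilter idea. The original problem isn't quoted in this thread, so `find_matches`, `digit_sums`, and the target value of 3 are hypothetical placeholders chosen to match the `arr % 9 == 3` filter above:

```python
import numpy as np

def digit_sums(arr):
    """Vectorized digit sum of each element of a non-negative integer array."""
    sums = np.zeros_like(arr)
    work = arr.copy()
    while np.any(work):
        sums += work % 10
        work //= 10
    return sums

def find_matches(n, target=3):
    """All numbers below n whose digit sum equals target (hypothetical task)."""
    arr = np.arange(n, dtype=np.int64)
    # A number is congruent to its digit sum mod 9, so only values with the
    # right residue can possibly match; this discards roughly 8/9ths of the
    # array before the comparatively slow digit-summing loop runs.
    arr = arr[arr % 9 == target % 9]
    return arr[digit_sums(arr) == target]
```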
What "O(1)-state shuffle" could you possibly be talking about? It takes `O(nlogn)` space to store a permutation of list of length n. Any smaller and some permutations will be unrepresentable. I am very aware of this because shuffling a deck of cards correctly on a computer requires at least 200 random bits.
If the requirements are softer than "n random permutations", there might be a lot of potential solutions. It is very easy to come up with "n permutations" if you have no requirements on their randomness. Pick the lowest `k` such that `n < k!`, permute only the first k elements leaving the rest in place, and now you have n distinct permutations storable in `O(log n)` bits (still not O(1), but close).
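A minimal sketch of that construction (the function name and the factorial-number-system decoding are my own illustration, not from the thread):

```python
from math import factorial

def nth_permutation(lst, i, n):
    """Return the i-th of n distinct permutations of lst (0 <= i < n).

    Only a short prefix is permuted and everything else stays in place, so
    the only state to store is the index i, i.e. O(log n) bits.
    """
    # Smallest k with n < k!, as in the parent comment.
    k = 0
    while factorial(k) <= n:
        k += 1
    prefix = list(lst[:k])
    out = []
    # Decode i in the factorial number system (Lehmer code) to pick elements.
    for pos in range(k, 0, -1):
        idx, i = divmod(i, factorial(pos - 1))
        out.append(prefix.pop(idx))
    return out + list(lst[k:])
```

For example, `nth_permutation(list("abcdef"), 3, 5)` permutes only the first three letters and returns `['b', 'c', 'a', 'd', 'e', 'f']`.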
I know this is not really your point, but misusing `O(1)` is a huge pet peeve of mine.
It's O(1) if you don't need access to every permutation (common in various monte carlo applications). 64-128 bits of entropy is good enough for a lot of applications, and that's all you get from any stdlib prng, so that's what I was comparing it to.
Those sorts of applications would tend to not work well with a solution leaving most elements in the same place or with the same relative ordering.
Based on the data in table 3, I would attribute most of the difference to the length of the advice. The LLMs' average word count (29.4) is more than double the humans' (13.25). Most other measures don't differ by a meaningful ratio; "difficult word count" is the only other one with a ratio above 2, and that is inherited from total word count.
I think it would be difficult to truly convince me to answer differently with only 14 words, whereas 30 words gives enough space to actually convey an argument.
I would be very interested to see the test rerun while limiting LLM response length or encouraging long responses from humans.
If you think writing more words will be more persuasive, just... write more words?
The test already incentivises being persuasive! If writing more words would do that, and the incentivised human persuaders don't write more words and the LLMs do, then I think it's fair to say that LLMs are more persuasive than incentivised human persuaders.
Sure. I am not contesting that LLMs are more persuasive in this context; that basic result comes through very clearly in the paper. It's not as clear how relevant this is to other situations, though. I think it's quite likely that humans given the instruction to increase word count might outperform LLMs. People are very unlikely to have practiced the specific task of giving advice on multiple-choice tests, whereas LLMs have likely had RLHF training that helps in this situation.
I always try to pick out as many tidbits as possible from papers that might be applicable in other situations. I think the large difference in word count may be overshadowing other insights that are more relevant to longer-form argumentation.
> I would be very interested to see the test rerun while limiting LLM response length or encouraging long responses from humans.
I don't know if that would have the effect you want. And if you're more likely to have hallucinations at lower word counts, that matters for those who are scrupulous, but many people trying to convince you of something believe the ends justify the means, and that honesty or correspondence to reality are not necessary, just nice to have.
I'm not sure what effect you think I want. The suggestion was just to increase the "interestingness" of the study. It seems to me like the main difference shown between LLMs and humans was the length of the response; controlling for that variable and rerunning the experiment would help surface other differences.
I do think it's distinctly possible that LLMs will be much less convincing at a low word count due to increased hallucinations. I also think that may matter less for dishonest suggestions: simply stating a lie confidently is relatively effective.
I would prefer advising humans to increase length rather than restricting LLMs because of the cited effects.
Advising humans to do the opposite doesn't make sense; 13 words is already tiny for convincing someone. The changes I had in mind were restricting LLM word count and increasing human word count. The goal is specifically to make the two more comparable.
The given study does not show any strength of humans over LLMs; both goal metrics (truthful and deceptive) are better for LLMs than for humans. If you are reading my suggestion as general advice for people outside the study's conditions, I would want to see the results of the proposed rerun before offering that.
However, if length of text is legitimately convincing regardless of content, I don't know why humans should avoid using that. If LLMs end up more convincing to humans than other humans simply because humans are too prideful to make their arguments longer, that seems like the worst possible future.
> If LLMs end up more convincing to humans than other humans simply because humans are too prideful to make their arguments longer, that seems like the worst possible future.
People aren't too proud to make long arguments; they just take more time and effort for humans to make. Historically, people have subconsciously treated longer arguments as more intellectually rigorous, whether they are or not, so the length of a written piece gets used as a lazy heuristic for quality. When we're comparing the output of humans to that of other humans, this kind of approach may work to a certain extent, but AI/LLMs seem to be better at writing long pieces of text on demand than humans are. That humans find LLM output more convincing when it is longer is not surprising to me, but I'll agree with you that it isn't a good sign either. The metric has become a target.
The study is testing a very specific type of "recognizing shapes", which the title of the article calls "geometric regularity". The "background stimuli" are shapes the crows would be expected to distinguish and are used to train them on the task, whereas the "probe stimuli" are the actual experiment.
As a sibling comment indicated, baboons cannot distinguish these shapes easily. Additionally, rather than a binary "crows can recognize shapes", the study shows how well crows process the shapes. One of the graphs in the paper (but not in the article) shows that two different crows have a similarly hard time with the rhombus.
In other studies, this same test was applied to humans to find that it is a fairly innate skill rather than developed by doing geometry in school.
I would agree that "don't blunder" and "punish opponents' blunders" are harder than endgame knowledge. However, knowing the basics of endgames really is important for closing out games; specifically, knowing KQvK, KRvK, and the "ladder technique" matters.
Without any tactics, "take free pieces" probably only gets you to around 1200 (chess.com), but if it includes knight and pawn forks, skewers, pins, and discovered attacks, it can get you to 1500. Playing perfectly every game is hard, though. I would recommend having more chess knowledge than just those 3 rules before you really try for 1500.
This seems like a very shallow way of thinking. "Losing all respect for the person" implies that you think this is NEVER an appropriate way to address someone. Phrasing a disagreement of opinion as a question of reasoning is often the best course of action.
In particular, if a choice has been made and reversing it would have significant costs, it is important not to say anything like "We should not be doing this" or "You made a mistake." Unless there is a good course of action for reversing the decision, that is simply being rude for no reason. Even in the case where there is a good way to reverse a decision, I would rather ask for the reasoning that led to it than strongly state the decision is wrong. If I am working with someone I respect at all, I must entertain the thought that I am wrong and they made the right decision with good reasoning.
What would you say to a superior who made a decision that you disagree with, but don't think is worth reversing? My best guess is either nothing or something that more strongly asserts your belief, but I can't think of any better option than phrasing it as a question.
> What would you say to a superior who made a decision that you disagree with, but don't think is worth reversing?
"I don't understand ... it seems it has the consequence of ... My professional opinion in that case would be... and I would advise to... because of... Is there something I'm not seeing here?"
Benefits:
- I'm not faking it.
- I already provide a lot of information up front to limit back-and-forth. This avoids assumptions and also works better for when you WFH.
- The person knows exactly where I stand and where I want to go. It's not chit-chat, it's not politics, it's purely technical and I want to move on the issue.
- If I'm wrong, I can get told right away. If I'm right, it's factual, and we can move on to solving the problem. And if the person's ego/social status is on the line, they can just BS their way out of it, and I'll just add nothing and move on.
- The template drives the conversation enough that they only need a short answer to let us decide if it's worth reversing. And we can conclude on the price / consequence of that and move on if needed.
I'll adjust that depending on the person. Some people are way better than me; in that case, I'll default to asking what I'm missing, because it's likely they see something I don't.
At the other end, if it's a junior, I'll assume they got it wrong and help them fix it (unless they can justify it).
And of course, the phrasing will depend on how close I am with the person. Good friends get a playful version; uptight clients get the more formal one.
Once you have done that several times and people know the routine and the relationship is good, you barely have to speak. You can just nod at something or raise an eyebrow, and start problem solving or get the info.
But note that I can do that also because my clients value my opinion enough, have respect for my professionalism, and also know, because of my past interactions with them, that I focus on the problem to solve rather than blaming.
That's essentially the same thing. The only difference is that you're putting your uncertainty at the end and I'm putting it at the beginning. The key is to explicitly acknowledge that you recognize the possibility that you might be wrong.
It is, however, important to note that allowing kings to be taken means a line can end with both kings traded off, which the scoring function would consider neutral. That possibility does not occur in normal games, so a search with that modification may produce the wrong value.
Games would also need to be stopped as soon as a king is taken for the search to be approximately correct.
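A hedged sketch of what that can look like in a toy negamax; `pseudo_legal_moves`, `make_move`, `material_eval`, and `pos.has_king(...)` are assumed helpers for illustration, not part of any real engine:

```python
KING_VALUE = 10_000  # larger than any possible material swing

def negamax(pos, depth):
    # Terminal check: if the side to move has no king, the previous move
    # captured it and the line is decided. Stopping here prevents a line
    # where "both kings get traded" from ever being reached and scored as
    # material-neutral.
    if not pos.has_king(pos.side_to_move):
        return -KING_VALUE
    if depth == 0:
        return material_eval(pos)
    best = -float("inf")
    for move in pseudo_legal_moves(pos):  # king captures are allowed here
        best = max(best, -negamax(make_move(pos, move), depth - 1))
    return best
```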
> "Completeness" is not about finishing in finite time, it also applies to completing in infinite time.
Can you point to a book or article where the definition of completeness allows infinite time? Every time I have encountered it, it has been defined as finding a solution, if one exists, in finite time.
> No breadth first search is still complete given an infinite branching factor (i.e. a node with infinite children).
In my understanding, DFS is complete for finite depth tree and BFS is complete for finite branching trees, but neither is complete for infinitely branching infinitely deep trees.
You would need an algorithm that iteratively deepens while exploring more children to be complete for the infinite x infinite trees. This is possible, but it is a little tricky to explain.
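One way to sketch it (my own illustration, not a quote from anywhere): bound both the depth and the number of children explored by the same limit k, and keep growing k. Assuming a `children(node)` helper that may return an infinite iterator:

```python
from itertools import count, islice

def dovetail_search(root, children, is_goal):
    """Search a tree that may be infinitely deep AND infinitely wide.

    At bound k we visit every node reachable within depth k using only the
    first k children at each level, then grow k. Any node whose depth and
    child indices are all finite is therefore reached at some finite bound.
    """
    for k in count(1):
        stack = [(root, 0)]
        while stack:
            node, depth = stack.pop()
            if is_goal(node):
                return node
            if depth < k:
                # islice keeps this safe even when children(node) is an
                # infinite generator.
                for child in islice(children(node), k):
                    stack.append((child, depth + 1))
```

In the A / B_n / C_n example in the next paragraph, this finds C_1 already at bound k = 2, even though plain BFS never finishes the first level.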
For a proof that BFS is not complete if it must find any particular node in finite time: imagine a tree whose root A has children B_n for every n, and each B_n has a single child C_n. BFS searching for C_1 would have to explore every B_n before visiting any C_n, so it would never reach C_1 in finite time.