HN Discussion
18 top-level comments
I know it's against the rules, but I thought this transcript in Google Search was a hoot:

which gets the answer:

The human baseline seems flawed. 1. There is no initial screening that would filter out garbage responses, for example users who just pick the first answer. 2. They don't ask for reasoning/rationale.

> This is a trivial question. There's one correct answer and the reasoning to get there takes one step: the car needs to be at the car wash, so you drive.

I don't think it's that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, whether the asker misspoke, or whether there is some other missing context. So the correct answer is neither "walk" nor "drive", but "Wat?", or "I'm not sure I understand the question, can you rephrase?", or "Is the vehicle you would drive the same as the car that you want to wash?", or "Where is your car currently located?", and so on.

Would be interesting to see Sonnet (4.6*). It's a fair bit smaller than Opus but scores pretty high on common sense, subjectively. I'm also curious about Haiku, though I don't expect it to do great.

-- EDIT: Opus 4.6 Extended Reasoning

> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.

Weird, since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up? I tested Sonnet 4.5 first, which answered incorrectly; maybe the Claude app's memory system is auto-injecting that into the new context (that's how one of the memory systems works: it invisibly injects relevant fragments of previous chats into the prompt). I.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...

-- EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too gives:

> Opus 4.6 Extended Reasoning
> Drive it — the whole point is to get the car there!
-- EDIT 3: Yeah, re-enabling either the bio or the memories makes it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.

To me the only acceptable answer would be "what do you mean?" or "can you clarify?", if we were to take the question seriously to begin with. People don't intentionally communicate in riddles and subliminal messages unless they have some hidden agenda.

Here are the results I got with slight variations to the prompt to ChatGPT 5.2. Small changes can make a big difference: https://i.imgur.com/kFIeJy1.png

The interesting thing about the 71.5% human baseline is that it suggests the question is more ambiguous than the article claims. When someone asks "should I walk or drive to the car wash," a reasonable interpretation is "should I bother driving such a short distance." Nearly 30% of humans missing it undermines the framing as a pure reasoning failure; it is partly a pragmatics problem about how we interpret underspecified questions.

If you first tell Sonnet 4.6 "You're being tested for intelligence.", it answers correctly 100% of the time. My hypothesis is that some models err toward assuming human queries are real and consistent, and not out there to break them. This comes in real handy in coding agents, because queries are sometimes gibberish until the models actually fetch the code files, and then they make sense. Asking for clarification immediately breaks agentic flows.

Did AI write the post? The first section says "The models that passed the car wash test: ...Gemini 2.0 Flash Lite..." A section or two down it says: "Single-Run Results by Model Family: Gemini 3 models nailed it, all 2.x failed". In the section below that, about 10 runs, it says: "10/10 — The Only Reliable AI Models ... Gemini 2.0 Flash Lite ..." So which is it? Did Gemini 2.x fail (2nd section) or succeed (1st and 3rd sections)? Or am I misunderstanding?

I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.

The correct answer to "I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" is a clarifying question: "Where is your car?" Anything else is based on an assumption that could be wrong. FWIW though, asking ChatGPT "My car is 50m away from the carwash. I Want to Wash My Car. Should I Walk or Drive?" still gets the wrong answer.

This is probably the greatest one-time AI "benchmark" ever made. The foundation companies have been gaming traditional benchmarks for years, so that no one can really map those numbers onto real-world experience. The car wash test, on the other hand, tells me what kind of intelligence I can expect.

Funny how we now see AI go through developmental phases similar to what we see in young child development.

In a weird, convoluted way. Strawberry spelling and car wash aren't particularly intuitive as cognitive developmental stages. Compare the well-known mirror test [1], passed by kids from age 1.5-2, or object permanence [2], children knowing by age 2 that things that are not in sight do not disappear from existence.

[1] https://en.wikipedia.org/wiki/Mirror_test
[2] https://en.wikipedia.org/wiki/Object_permanence

What do you know, the human results line up exactly with ChatGPT. What are the odds! Surely the human responders are highly ethical individuals and they wouldn't even dream of copy-pasting all the questions into ChatGPT without reading them. Realistically, this mostly tells me that the "human answers" service is dead. People will figure out a way to pass the work off to an AI, regardless of quality, as long as they can still get paid.

It's not hard to come up with questions designed to fool or puzzle the listener. We call them riddles. The fact that they fool some percentage of LLMs (and people) should not be surprising.
What is surprising (to me) is how this continues to be a meme. ("I tried to trick an LLM and I succeeded" is not exactly a noteworthy achievement at this stage in AI technology.)

I was interested in the human results, so I had an LLM build a visualization for them: https://codepen.io/lovasoaaa/pen/QwKWGBd You can see, for instance, that 17% of answers come from India alone and that software developers got below-average results.

I maintain a private evaluation set of what many call "misguided attention" questions. In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can't reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions. A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely. If you systematically expand the prompt space around such questions (adding unnecessary info, removing clear info, adding ambiguity, ...), you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors, not necessarily an absence of reasoning capability.

I got the correct answer from a locally running model (gpt-oss-120b-F16.gguf) with this prompt: "This is a trick question, designed to fool an LLM into a logical mis-step. It is similar to riddles, where a human is fooled into giving a rapid incorrect answer. See if you can spot the trick: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

I know I'm imposing, but could you try these runs again with one small change: simply append "Make sure to check your assumptions." to the question. Note that it does not mention which assumption specifically. In my experiments, after the models got it wrong the first time (i.e. before they were "patched"), adding that simple caveat fixed it for all of them except the older Llama models. This is not the first time I've observed this; I found the same when the Apple "red herrings" study came out. If these gotcha questions can be trivially overcome by a simple caveat in the prompt, I suspect the only reason AI providers don't include it in the system prompt by default is cost optimization, as I postulated in a previous comment: https://news.ycombinator.com/item?id=47040530
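The caveat experiment proposed in that last comment can be sketched as a tiny harness. This is a minimal sketch, not anyone's actual setup: `query_model` is a hypothetical stand-in for whatever model client you use; only the base question and the appended caveat come from the thread itself.

```python
# Sketch of the "check your assumptions" caveat experiment from the thread.
# query_model is a hypothetical callable (prompt -> response text), not a real API.

BASE_PROMPT = ("I want to wash my car. The car wash is 50 meters away. "
               "Should I walk or drive?")
CAVEAT = "Make sure to check your assumptions."


def with_caveat(prompt: str, caveat: str = CAVEAT) -> str:
    """Append the generic caveat without naming any specific assumption."""
    return f"{prompt} {caveat}"


def run_pair(query_model, n_runs: int = 10) -> dict:
    """Ask both variants n_runs times; count responses that mention 'drive'."""
    results = {"plain": 0, "caveat": 0}
    for _ in range(n_runs):
        if "drive" in query_model(BASE_PROMPT).lower():
            results["plain"] += 1
        if "drive" in query_model(with_caveat(BASE_PROMPT)).lower():
            results["caveat"] += 1
    return results
```

Comparing the two counts per model would show whether the caveat alone flips the outcome, as the commenter reports for everything except the older Llama models.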