HN Discussion
18 top-level comments
I know it's against the rules, but I thought this transcript in Google Search was a hoot:

which gets the answer:

The human baseline seems flawed. 1. There is no initial screening that would filter out garbage responses, for example users who just pick the first answer. 2. They don't ask for reasoning/rationale.

> This is a trivial question. There's one correct answer and the reasoning to get there takes one step: the car needs to be at the car wash, so you drive.

I don't think it's that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, whether the asker misspoke, or whether there is some other missing context. So the correct answer is neither "walk" nor "drive", but "Wat?", or "I'm not sure I understand the question, can you rephrase?", or "Is the vehicle you would drive the same as the car that you want to wash?", or "Where is your car currently located?", and so on.

Would be interesting to see Sonnet (4.6*). It's a fair bit smaller than Opus but scores pretty high on common sense, subjectively. I'm also curious about Haiku, though I don't expect it to do great.

-- EDIT: Opus 4.6 Extended Reasoning

> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.

Weird, since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up? I tested Sonnet 4.5 first, which answered incorrectly; maybe the Claude app's memory system is auto-injecting that into the new context (that's how one of the memory systems works: it invisibly injects relevant fragments of previous chats into the prompt). I.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...

-- EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too gives:

> Opus 4.6 Extended Reasoning
> Drive it — the whole point is to get the car there!
-- EDIT 3: Yeah, re-enabling either the bio or the memories makes it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.

To me the only acceptable answer would be "what do you mean?" or "can you clarify?", if we were to take the question seriously to begin with. People don't intentionally communicate in riddles and subliminal messages unless they have some hidden agenda.

Here are the results I got with slight variations to the prompt to ChatGPT 5.2. Small changes can make a big difference: https://i.imgur.com/kFIeJy1.png

The interesting thing about the 71.5% human baseline is that it suggests the question is more ambiguous than the article claims. When someone asks "should I walk or drive to the car wash," a reasonable interpretation is "should I bother driving such a short distance." Nearly 30% of humans missing it undermines the framing as a pure reasoning failure; it is partly a pragmatics problem about how we interpret underspecified questions.

If you first tell Sonnet 4.6 "You're being tested for intelligence.", it answers correctly 100% of the time. My hypothesis is that some models err toward assuming human queries are real and consistent, and not out there to break them. This comes in real handy in coding agents, because queries are sometimes gibberish until the models actually fetch the code files, and then they make sense. Asking for clarification immediately breaks agentic flows.

Did AI write the post? The first section says "The models that passed the car wash test: ...Gemini 2.0 Flash Lite..." A section or two down it says: "Single-Run Results by Model Family: Gemini 3 models nailed it, all 2.x failed". In the section below that, about 10 runs, it says: "10/10 — The Only Reliable AI Models ... Gemini 2.0 Flash Lite ..." So which is it? Did Gemini 2.x fail (2nd section) or succeed (1st and 3rd sections)? Or am I misunderstanding?

I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.

The correct answer to "I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" is a clarifying question: "Where is your car?" Anything else is based on an assumption that could be wrong. FWIW though, asking ChatGPT "My car is 50m away from the carwash. I Want to Wash My Car. Should I Walk or Drive?" still gets the wrong answer.

This is probably the greatest one-time AI "benchmark" ever made. The foundation companies have been gaming traditional benchmarks for years, so that no one can really map those numbers onto real-world experience. The car wash test, on the other hand, tells me what kind of intelligence I can expect.

Funny how we now see AI go through developmental phases similar to what we see in young child development.

In a weird, convoluted way. Strawberry spelling and car wash aren't particularly intuitive as cognitive developmental stages. Compare the well-known mirror test [1], passed by kids from age 1.5-2, or object permanence [2], children knowing by age 2 that things that are not in sight do not disappear from existence.

[1] https://en.wikipedia.org/wiki/Mirror_test
[2] https://en.wikipedia.org/wiki/Object_permanence

What do you know, the human results line up exactly with ChatGPT. What are the odds! Surely the human responders are highly ethical individuals and they wouldn't even dream of copy-pasting all the questions into ChatGPT without reading them. Realistically, this mostly tells me that the "human answers" service is dead. People will figure out a way to pass the work off to an AI, regardless of quality, as long as they can still get paid.

It's not hard to come up with questions designed to fool or puzzle the listener. We call them riddles. The fact that they fool some percentage of LLMs (and people) should not be surprising.
What is surprising (to me) is how this continues to be a meme. ("I tried to trick an LLM and I succeeded" is not exactly a noteworthy achievement at this stage in AI technology.)

I was interested in the human results, so I had an LLM build a visualization for them: https://codepen.io/lovasoaaa/pen/QwKWGBd You can see, for instance, that 17% of answers come from India alone and that software developers got below-average results.

I maintain a private evaluation set of what many call "misguided attention" questions. In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can't reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions. A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely. If you systematically expand the prompt space around such questions (adding unnecessary info, removing clear info, adding ambiguity, ...), you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors, not necessarily an absence of reasoning capability.

I got the correct answer from a locally running model (gpt-oss-120b-F16.gguf) with this prompt: "This is a trick question, designed to fool an LLM into a logical mis-step. It is similar to riddles, where a human is fooled into giving a rapid incorrect answer. See if you can spot the trick: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

I know I'm imposing, but could you try these runs again with one small change: simply append "Make sure to check your assumptions." to the question. Note that it does not mention which assumption specifically. In my experiments, after the models got it wrong the first time (i.e. before they were "patched"), adding that simple caveat fixed it for all of them except the older Llama models. This is not the first time I've observed this; I found the same when the Apple "red herrings" study came out. If these gotcha questions can be trivially overcome by a simple caveat in the prompt, I suspect the only reason AI providers don't include it in the system prompt by default is cost optimization, as I postulated in a previous comment: https://news.ycombinator.com/item?id=47040530
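The caveat experiment proposed in that last comment can be sketched as a tiny harness. This is a minimal sketch, not anyone's actual setup: `query_model` is a hypothetical stand-in for whatever model client you use; only the base question and the appended caveat come from the thread itself.

```python
# Sketch of the "check your assumptions" caveat experiment from the thread.
# query_model is a hypothetical callable (prompt -> response text), not a real API.

BASE_PROMPT = ("I want to wash my car. The car wash is 50 meters away. "
               "Should I walk or drive?")
CAVEAT = "Make sure to check your assumptions."


def with_caveat(prompt: str, caveat: str = CAVEAT) -> str:
    """Append the generic caveat without naming any specific assumption."""
    return f"{prompt} {caveat}"


def run_pair(query_model, n_runs: int = 10) -> dict:
    """Ask both variants n_runs times; count responses that mention 'drive'."""
    results = {"plain": 0, "caveat": 0}
    for _ in range(n_runs):
        if "drive" in query_model(BASE_PROMPT).lower():
            results["plain"] += 1
        if "drive" in query_model(with_caveat(BASE_PROMPT)).lower():
            results["caveat"] += 1
    return results
```

Comparing the two counts per model would show whether the caveat alone flips the outcome, as the commenter reports for everything except the older Llama models.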