colinfly — HRO

Alpha: This system is experimental. Scores and classifications are early-stage research and may be unreliable.

colinfly · 1 karma · 3d on HN

Coverage

We've seen 2 of ~2 submissions

Full eval: 0 · Lite-only: 0 · Unevaluated: 2

2 stories

1.		What broke when I tried to evaluate an AI agent in production
		I tried to evaluate an AI agent using a benchmark-style approach. It failed in ways I didn’t expec...
		1 point by colinfly 2 days ago | 1 comment | skipped
2.		Open-source LLM-as-judge eval suite with root cause analysis and failure mining (github.com)
		2 points by colinfly 5 days ago | 1 comment | skipped

build ee2b489+gzrb · deployed 2026-03-10 22:52 UTC · evaluated 2026-03-16 02:03:38 UTC