colinfly 1 karma 3d on HN
Coverage
We've seen 2 of ~2 submissions
Full eval: 0 Lite-only: 0 Unevaluated: 2
2 stories
1. What broke when I tried to evaluate an AI agent in production
I tried to evaluate an AI agent using a benchmark-style approach. It failed in ways I didn’t expec...
1 point by colinfly 2 days ago | 1 comment | skipped
2. Open-source LLM-as-judge eval suite with root cause analysis and failure mining (github.com)
2 points by colinfly 5 days ago | 1 comment | skipped