dial481 | 3 karma | 35d on HN
Coverage
We've seen 1 of ~3 submissions
Full eval: 0 | Lite-only: 0 | Unevaluated: 1
1 story
1. LoCoMo AI Benchmark: 6.4% of answer key wrong, judge accepts 63% of fake answers (github.com)
3 points by dial481 10 days ago | 2 comments | skipped