Study identifies weaknesses in how AI systems are evaluated

0.00	Study identifies weaknesses in how AI systems are evaluated (www.oii.ox.ac.uk)
	416 points by pseudolus 111 days ago | 192 comments on HN | Neutral ~lite vlite-1.4

Summary ~lite AI evaluation Neutral

Technical study on AI

EQ 0.50

SO 0.50

TD 0.50

Lite evaluation by llama-3.3-70b-wai · editorial channel only · no per-section breakdown available

Audit Trail 6 entries

2026-02-28 08:03	eval_success	Light evaluated: Neutral (0.00)
2026-02-28 08:03	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral)
2026-02-28 08:03	rater_validation_warn	Light validation warnings for model llama-4-scout-wai: 0W 1R
2026-02-28 07:52	eval_success	Light evaluated: Neutral (0.00)
2026-02-28 07:52	eval	Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
2026-02-28 07:52	rater_validation_warn	Light validation warnings for model llama-3.3-70b-wai: 0W 1R

build d1f8d9e+mpqz · deployed 2026-02-28 11:28 UTC · evaluated 2026-02-28 11:37:51 UTC