Book: The Emerging Science of Machine Learning Benchmarks

Alpha: This system is experimental. Scores and classifications are early-stage research and may be unreliable.

Model:
@cf/meta/llama-4-scout-17b-16e-instruct lite +0.10
@cf/meta/llama-4-scout-17b-16e-instruct lite ND
@cf/meta/llama-3.3-70b-instruct-fp8-fast lite ND
@cf/meta/llama-3.3-70b-instruct-fp8-fast lite +0.10

+0.10	Book: The Emerging Science of Machine Learning Benchmarks (mlbenchmarks.org)
	138 points by jxmorris12 6 days ago | 11 comments on HN | Moderate negative ~lite vlite-1.6

Summary ~lite Machine Learning Ethics Acknowledges

Preface of a book on machine learning benchmarks, discussing their implications and limitations.

EQ 0.50

SO 0.60

TD 0.40

Lite evaluation by llama-4-scout-wai · editorial channel only · no per-section breakdown available

Longitudinal 545 HN snapshots · 57 evals

Audit Trail 77 entries

2026-03-19 19:02	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 19:02	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 18:46	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 18:46	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 18:11	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 18:11	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 17:24	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 17:24	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 17:06	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 17:06	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 16:09	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 16:09	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 15:53	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 15:53	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 14:53	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 14:53	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 14:31	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 14:31	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 13:30	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 13:30	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 13:11	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 13:11	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 12:43	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 12:43	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 12:30	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 12:30	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 12:05	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 12:05	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 11:51	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 11:51	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) -0.34
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 11:26	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 11:26	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 11:16	eval_success	Lite evaluated: Neutral (0.00)	- -
2026-03-19 11:16	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral) +0.34
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 11:16	rater_validation_warn	Lite validation warnings for model llama-4-scout-wai: 1W 0R	- -
2026-03-19 10:47	eval_success	PSQ evaluated: g-PSQ=0.120 (3 dims)	- -
2026-03-19 10:47	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 10:36	eval_success	Lite evaluated: Moderate negative (-0.34)	- -
2026-03-19 10:36	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 10:09	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 09:58	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) -0.34
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 09:29	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 09:22	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral) +0.34
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 08:52	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 08:45	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 08:14	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 08:06	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 07:35	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 07:25	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 07:00	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 06:48	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 06:23	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 06:14	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 05:47	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 05:38	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) -0.34
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 04:53	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 04:46	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral) +0.34
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 03:39	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 03:34	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 02:21	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 02:16	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 00:59	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-19 00:56	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-19 00:08	eval	Evaluated by llama-3.3-70b-wai-psq: +0.32 (Moderate positive)
2026-03-19 00:02	eval	Evaluated by llama-3.3-70b-wai: -0.34 (Moderate negative)
	reasoning Technical content with implicit rights discussion
2026-03-18 23:58	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-18 23:53	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative) +0.06
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-18 23:23	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-18 23:18	eval	Evaluated by llama-4-scout-wai: -0.40 (Moderate negative) 0.00
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-18 22:46	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-18 22:43	eval	Evaluated by llama-4-scout-wai: -0.40 (Moderate negative) -0.40
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-18 21:34	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-18 21:31	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral) +0.40
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-18 20:24	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-18 20:21	eval	Evaluated by llama-4-scout-wai: -0.40 (Moderate negative) -0.06
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
2026-03-18 19:48	eval	Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive)
2026-03-18 19:47	eval	Evaluated by llama-4-scout-wai: -0.34 (Moderate negative)
	reasoning Technical book on machine learning benchmarks, discusses ethical objections and implications.
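
The trailing number on each eval row above appears to be the change from the same model's previous score (for example, -0.40 following -0.34 is logged with -0.06, and a model's first row has no trailing value). The sketch below parses eval rows and recomputes that change under this assumption. The row pattern and the names parse_eval_rows and recompute_deltas are hypothetical and are not taken from the system that produced this log.

```python
import re

# Hypothetical pattern for "eval" rows of the audit trail:
# timestamp, "eval", "Evaluated by <model>: <score> (<label>)", optional trailing delta.
EVAL_ROW = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+eval\s+"
    r"Evaluated by (?P<model>[\w.\-]+): (?P<score>[+-]?\d+\.\d+) "
    r"\((?P<label>[^)]+)\)(?:\s+(?P<delta>[+-]?\d+\.\d+))?\s*$"
)


def parse_eval_rows(text):
    """Extract eval rows; eval_success, warning, and reasoning lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = EVAL_ROW.match(line.strip())
        if m:
            rows.append({
                "ts": m["ts"],
                "model": m["model"],
                "score": float(m["score"]),
                "label": m["label"],
                "delta": None if m["delta"] is None else float(m["delta"]),
            })
    return rows


def recompute_deltas(rows):
    """Return (row, delta) pairs, oldest first, where delta is the change
    from the same model's previous score (None for a model's first row)."""
    last_score = {}
    out = []
    # Timestamps in "YYYY-MM-DD HH:MM" form sort correctly as strings.
    for row in sorted(rows, key=lambda r: r["ts"]):
        prev = last_score.get(row["model"])
        delta = None if prev is None else round(row["score"] - prev, 2)
        out.append((row, delta))
        last_score[row["model"]] = row["score"]
    return out


# Three rows copied from the oldest end of the audit trail (tab-separated).
sample = (
    "2026-03-18 20:21\teval\tEvaluated by llama-4-scout-wai: -0.40 (Moderate negative) -0.06\n"
    "2026-03-18 19:48\teval\tEvaluated by llama-4-scout-wai-psq: +0.12 (Mild positive)\n"
    "2026-03-18 19:47\teval\tEvaluated by llama-4-scout-wai: -0.34 (Moderate negative)\n"
)

for row, delta in recompute_deltas(parse_eval_rows(sample)):
    print(row["ts"], row["model"], row["score"], "logged:", row["delta"], "recomputed:", delta)
```

Comparing the logged and recomputed values row by row is one way to check whether the per-model interpretation holds across the whole trail.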

build ee2b489+gzrb · deployed 2026-03-10 22:52 UTC · evaluated 2026-03-16 02:03:38 UTC