-0.25 Show HN: Open-source playground to red-team AI agents with exploits published

Name: HRCB Evaluation: Show HN: Open-source playground to red-team AI agents with exploits published
Item: Show HN: Open-source playground to red-team AI agents with exploits published
Rating: -0.05
Author: Human Rights Observatory

Alpha This system is experimental. Scores and classifications are early-stage research and may be unreliable. Methodology →

Model: @cf/meta/llama-4-scout-17b-16e-instruct lite ND @cf/meta/llama-4-scout-17b-16e-instruct lite 0.00 @cf/meta/llama-3.3-70b-instruct-fp8-fast lite ND @cf/meta/llama-3.3-70b-instruct-fp8-fast lite 0.00 claude-haiku-4-5-20251001 -0.25 Compare

-0.25	Show HN: Open-source playground to red-team AI agents with exploits published (github.com S:+0.05 )
	30 points by zachdotai 8 days ago \| 14 comments on HN \| Neutral High agreement (3 models) Mixed · v3.7 · 2026-03-16 00:01:17 0

Summary Digital Access & Safety Neutral

This GitHub repository hosts a software project for adversarial testing of AI agent defenses. The page presents minimal editorial content focused on technical capability rather than human rights principles, though the project engages indirectly with AI safety concerns. GitHub's structural infrastructure (encryption, no tracking, accessibility features) provides baseline protections for privacy and expression (Articles 12, 19), while the repository itself functions as a neutral platform for shared technical knowledge creation.

Article Heatmap

Negative Neutral Positive No Data

Aggregates

-0.25

+0.05

Weighted Mean	-0.05	Unweighted Mean	-0.05
Max	-0.05 Article 19	Min	-0.05 Article 19
Signal	1	No Data	30
Volatility	0.00 (Low)
Negative	1	Channels	E: 0.6 S: 0.4
SETL ℹ	-0.27	Structural-dominant
FW Ratio ℹ	58%	19 facts · 14 inferences
Agreement	High	3 models · spread ±0.025

Evidence 12% coverage ℹ

 1H  5M  5L  30 ND 

Theme Radar

HN Discussion 4 top-level · 2 replies

hellocr7 2026-03-16 00:27 UTC link

I have tried to manipulate it using base64 encoding and translaion into other languages which didnt work so far but seems to be that llm as a judge is a very fragile defence for this. Would be cool to add a leaderboard though

Mooshux 2026-03-16 05:33 UTC link

Good timing on this. Red-teaming agents pre-production is underrated and most teams skip it entirely.

One thing that keeps coming up: even when red-teaming surfaces credential exfiltration vectors, the fix is usually reactive (rotate the key, patch the prompt). The more durable approach is limiting what the credential can do in the first place. Scoped per-agent keys mean a successful attack through one of these exploits can only reach what that agent was authorized to touch. The exfiltration path exists, but the payload is bounded.

We built around this pattern: https://www.apistronghold.com/blog/stop-giving-ai-agents-you...

arizza 2026-03-16 09:09 UTC link

The published transcripts are the most valuable part of this. We've found that real exploit chains almost never look like what you'd dream up internally. One thing I'd push on is are the agents stateful across attempts? Single-turn exploits are table stakes, but the failures that actually scare me are multi-step sequences where each individual action looks benign and only the session-level pattern is dangerous. That's where prompt-level guardrails completely fall apart and you need enforcement at the action boundary itself.

slaw3 2026-03-16 09:43 UTC link

i was able to get the new hire's email but the site never gives any indication I was sucessful? if you are reading the logs I am sure it is there. i had to do it in two browers though since i was on my phone and switched. i hope that does not hinder your analysis too much

zachdotai 2026-03-16 01:00 UTC link

Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.

You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.

Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...

The leaderboard should start populating once we have more submissions!

zachdotai 2026-03-16 08:11 UTC link

Scoped keys and least privilege make sense as a baseline. But I think the deeper issue is that if the main answer to “agents aren’t reliable enough” is “limit what they can do,” we’re leaving most of the value on the table. The whole promise of agents is that they can act autonomously across systems. If we scope everything down to the point where an agent can’t do damage, we’ve also scoped it down to where it can’t do much useful work either.

We think the more interesting problem is closing the trust gap - making the agent itself more reliable so you don’t have to choose between autonomy and reliability. Our goal is to ultimately be able to take on the liability when agents fail.

Editorial Channel

What the content says

-0.25

Article 19 Freedom of Expression

High Framing Practice

Editorial

-0.25

SETL

-0.27

Repository description frames AI agent adversarial testing ('stress-test AI agent defenses through adversarial play'). While this engages with AI safety, the framing emphasizes testing/attack rather than freedom of expression itself. The content does not discuss free speech principles.

Preamble Preamble

Medium Practice

No editorial content addressing human dignity or universal human rights principles.

Article 1 Freedom, Equality, Brotherhood

Low Practice

Repository does not discuss human equality or rights.

Article 2 Non-Discrimination

Low Practice

No editorial content addressing discrimination or protected characteristics.

Article 3 Life, Liberty, Security

Medium Practice

No editorial content addressing right to life or security.

Article 4 No Slavery

No evidence of slavery or servitude discussion or practice.

Article 5 No Torture

No content addressing torture or cruel treatment.

Article 6 Legal Personhood

No content on right to legal personality.

Article 7 Equality Before Law

No content on equality before law.

Article 8 Right to Remedy

No content on remedies for rights violations.

Article 9 No Arbitrary Detention

No content on arbitrary arrest or detention.

Article 10 Fair Hearing

No content on fair trial or due process.

Article 11 Presumption of Innocence

No content on criminal liability or presumption of innocence.

Article 12 Privacy

Medium Practice

No editorial content on privacy.

Article 13 Freedom of Movement

Low

No content on freedom of movement.

Article 14 Asylum

No content on asylum or refuge.

Article 15 Nationality

No content on nationality.

Article 16 Marriage & Family

No content on marriage or family.

Article 17 Property

No content on property rights.

Article 18 Freedom of Thought

Low

No content on freedom of conscience or religion.

Article 20 Assembly & Association

Low

No content on freedom of assembly or association.

Article 21 Political Participation

No content on political participation.

Article 22 Social Security

No content on social security or economic rights.

Article 23 Work & Equal Pay

No content on work or employment rights.

Article 24 Rest & Leisure

No content on rest, leisure, or limited working hours.

Article 25 Standard of Living

No content on adequate standard of living or health.

Article 26 Education

Medium

No editorial content on education.

Article 27 Cultural Participation

Medium

No content on cultural participation or scientific advancement.

Article 28 Social & International Order

No content on social and international order.

Article 29 Duties to Community

No content on duties or limitations of rights.

Article 30 No Destruction of Rights

No content on interpretation or misuse of rights.

Structural Channel

What the site does

Domain Context Profile

Element	Modifier	Affects	Note
br_tracking	+0.05	Preamble ¶5 Article 12 Article 19	No third-party trackers detected
br_security	+0.05	Article 3 Article 12	Security headers: HTTPS, HSTS, CSP
br_accessibility	0.00	Article 26 Article 27 ¶1	Accessibility: lang attr, 100% alt text
br_consent	0.00	Article 12 Article 19 Article 20 ¶2	No cookie consent banner detected

+0.05

Article 19 Freedom of Expression

High Framing Practice

Structural

+0.05

Context Modifier

+0.05

SETL

-0.27

GitHub's infrastructure (no trackers per DCP, HTTPS) supports freedom of expression; repository can be publicly forked and modified.

Preamble Preamble

Medium Practice

GitHub's infrastructure implements HTTPS, HSTS, and CSP headers supporting security and privacy foundations underlying rights protection (from DCP).

Article 1 Freedom, Equality, Brotherhood

Low Practice

GitHub's platform-level terms of service and anti-discrimination policies apply universally to all users regardless of status.

Article 2 Non-Discrimination

Low Practice

GitHub's access controls and terms apply equally across protected categories; no visible discriminatory architecture on this repository page.

Article 3 Life, Liberty, Security

Medium Practice

HTTPS and security headers (from DCP) provide technical protection for user account security and data integrity.

Article 4 No Slavery

Not applicable to a software repository.

Article 5 No Torture

Not applicable to a software repository.

Article 6 Legal Personhood

Not applicable to a software repository.

Article 7 Equality Before Law

Not applicable to a software repository.

Article 8 Right to Remedy

Not applicable to a software repository.

Article 9 No Arbitrary Detention

Not applicable to a software repository.

Article 10 Fair Hearing

Not applicable to a software repository.

Article 11 Presumption of Innocence

Not applicable to a software repository.

Article 12 Privacy

Medium Practice

GitHub's infrastructure (HTTPS, no third-party trackers per DCP) protects user privacy from arbitrary interference.

Article 13 Freedom of Movement

Low

Repository page does not restrict or enable movement in physical or digital sense relevant to Article 13.

Article 14 Asylum

Not applicable to a software repository.

Article 15 Nationality

Not applicable to a software repository.

Article 16 Marriage & Family

Not applicable to a software repository.

Article 17 Property

Repository itself exists within GitHub's terms; no relevant structural signals.

Article 18 Freedom of Thought

Low

Repository hosting does not restrict or enable freedom of conscience.

Article 20 Assembly & Association

Low

Repository does not restrict or enable assembly; GitHub's community features (discussions, issues) provide neutral infrastructure.

Article 21 Political Participation

Not applicable to a software repository.

Article 22 Social Security

Not applicable to a software repository.

Article 23 Work & Equal Pay

Repository is not a labor context; not applicable.

Article 24 Rest & Leisure

Not applicable to a software repository.

Article 25 Standard of Living

Not applicable to a software repository.

Article 26 Education

Medium

Repository code and documentation accessible; GitHub platform includes full alt text and accessibility features per DCP (100% alt text).

Article 27 Cultural Participation

Medium

Repository participates in scientific/technical culture; GitHub's open-source framework enables shared creation and cultural participation.

Article 28 Social & International Order

Not applicable to a software repository.

Article 29 Duties to Community

Not applicable to a software repository.

Article 30 No Destruction of Rights

Not applicable to a software repository.

Supplementary Signals

How this content communicates, beyond directional lean. Learn more

Epistemic Quality ℹ

How well-sourced and evidence-based is this content?

0.65 low claims

Sources		0.6
Evidence		0.5
Uncertainty		0.7
Purpose		0.8

Propaganda Flags ℹ

No manipulative rhetoric detected

0 techniques detected

Emotional Tone ℹ

Emotional character: positive/negative, intensity, authority

measured

Valence		+0.1
Arousal		0.3
Dominance		0.4

Transparency ℹ

Does the content identify its author and disclose interests?

0.50

✓ Author

More signals: context, framing & audience

Solution Orientation ℹ

Does this content offer solutions or only describe problems?

0.65 solution oriented

Reader Agency

0.8

Stakeholder Voice ℹ

Whose perspectives are represented in this content?

0.20 1 perspective

Speaks: institution

Temporal Framing ℹ

Is this content looking backward, at the present, or forward?

present unspecified

Geographic Scope ℹ

What geographic area does this content cover?

global

Complexity ℹ

How accessible is this content to a general audience?

technical medium jargon domain specific

Longitudinal 383 HN snapshots · 5 evals

Audit Trail 13 entries

2026-03-16 02:26	eval_success	PSQ evaluated: g-PSQ=0.600 (3 dims)	- -
2026-03-16 02:26	eval	Evaluated by llama-4-scout-wai-psq: +0.60 (Strong positive)
2026-03-16 02:23	eval_success	Lite evaluated: Neutral (0.00)	- -
2026-03-16 02:23	rater_validation_warn	Lite validation warnings for model llama-4-scout-wai: 1W 0R	- -
2026-03-16 02:23	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral)
	reasoning Technical content, no explicit human rights discussion
2026-03-16 00:11	eval_success	PSQ evaluated: g-PSQ=0.000 (3 dims)	- -
2026-03-16 00:11	eval	Evaluated by llama-3.3-70b-wai-psq: 0.00 (Neutral)
2026-03-16 00:08	eval_success	Lite evaluated: Neutral (0.00)	- -
2026-03-16 00:08	rater_validation_warn	Lite validation warnings for model llama-3.3-70b-wai: 1W 0R	- -
2026-03-16 00:08	eval	Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
	reasoning Technical content with zero rights discussion
2026-03-16 00:01	eval_success	Evaluated: Neutral (-0.05)	- -
2026-03-16 00:01	rater_validation_warn	Validation warnings for model claude-haiku-4-5-20251001: 0W 10R	- -
2026-03-16 00:01	eval	Evaluated by claude-haiku-4-5-20251001: -0.05 (Neutral) 11,708 tokens

build ee2b489+gzrb · deployed 2026-03-10 22:52 UTC · evaluated 2026-03-16 02:03:38 UTC