This article presents research demonstrating large-scale deanonymization capabilities using LLMs across multiple platforms, framing it as a technical achievement. The content directly engages privacy, freedom of expression, and security rights by describing a capability that systematically undermines anonymity—a foundational condition for safe exercise of those rights—without discussing ethical constraints, consent frameworks, or protections against misuse.
The best course of action to combat this correlation/profiling seems to be using a local LLM that rewrites the text while keeping the meaning untouched.
Additionally, you can open up copilot.microsoft.com or w/e and ask it to summarize any Reddit user's (and presumably any HN user's) posts. Not just the content, but their emotional state (without prompting).
[0] Note: last I tried this was months ago, things may have changed.
I'm not sure the practical implications are as dramatic as the paper suggests. Most adversaries who would want to deanonymize people at scale (governments, corporations) already have access to far more direct methods. The people most at risk from this are probably activists and whistleblowers in jurisdictions where those direct methods aren't available, not average users.
"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
and that was 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside the massive growth in various technology that enhances/enables various techniques.
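To make the quoted claim concrete, here is a minimal sketch of sparse-record matching in the spirit of the Netflix Prize attack. It is a simplification, not the paper's exact scoring: the names `match_score` and `best_match`, the rating tolerance of 1, the rarity weight, and the `min_ratio` acceptance rule are all assumptions chosen for illustration.

```python
import math
from typing import Dict, Optional

Ratings = Dict[str, int]  # movie title -> rating (1-5)

def match_score(aux: Ratings, record: Ratings, popularity: Dict[str, int]) -> float:
    """Sum rarity-weighted agreement between what the adversary knows and a record."""
    score = 0.0
    for movie, rating in aux.items():
        # Assumption: ratings within 1 star count as a match.
        if movie in record and abs(record[movie] - rating) <= 1:
            # Rarely rated movies are more identifying, so they weigh more.
            score += 1.0 / math.log(1 + popularity[movie])
    return score

def best_match(aux: Ratings, dataset: Dict[str, Ratings], min_ratio: float = 1.5) -> Optional[str]:
    """Return the best candidate only if it clearly stands out from the runner-up."""
    popularity: Dict[str, int] = {}
    for record in dataset.values():
        for movie in record:
            popularity[movie] = popularity.get(movie, 0) + 1
    scores = sorted(
        ((match_score(aux, rec, popularity), uid) for uid, rec in dataset.items()),
        reverse=True,
    )
    if not scores:
        return None
    best, uid = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    # Assumption: a simple ratio test stands in for a proper statistical check.
    if best > 0 and best >= min_ratio * runner_up:
        return uid
    return None

# Hypothetical toy data: three pseudonymous records.
dataset = {
    "u1": {"Popular Hit": 5, "Obscure Documentary": 4, "Cult Film": 2},
    "u2": {"Popular Hit": 4, "Blockbuster": 3},
    "u3": {"Popular Hit": 5, "Blockbuster": 5},
}
```

With this toy data, knowing two niche ratings singles out one record, while knowing only a rating for the popular title matches everyone equally and is rejected as ambiguous.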
i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and i'm not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.
As people will point out, the OSINT techniques described are nothing new - typically, in the past, you could de-anonymize based on writing style or niche topics/interests. Total deanonymization can occur if any of these accounts link to profiles containing pictures of their faces, which can then be web-searched to link to a real identity. It's astounding how many people re-use handles on stuff like porn sites linked very easily to their IRL identity.
While people will point out this isn't new, the implication of this paper (and something I have suspected for 2 years now but never played with) is that work which would take a human investigator a bit of time, even using common OSINT tooling, will become trivial.
You should never assume you have total anonymity on the open web.
I post under my real name here, pretty much the only place I post. It keeps me honest and straight in what I say when I choose to say it. I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration. I don't know what it will be but I would expect some adversarial stuff. Trying to keep clean is what I'd prefer for myself and my kids.
On the other hand, Neal Stephenson's book Fall; or, Dodge in Hell has an interesting idea in its early part, where a person agrees to what we now know as "flood the zone with sh*t" (Steve Bannon's sadly very effective strategy) to battle some trolls. Instead of trying to keep clean, the intent is just to spam like crazy with anything so nobody understands the core. It is cleverly explored in the book, albeit for too short a time before it moves into virtual reality. I think there are a few people out here right now practicing this.
I feel like this is one of those products OpenAI et al are quietly perfecting. Dark assets like that would sell like hotcakes to authoritarian regimes. That would explain how they eventually plan to reach profitability.
Despite being pseudonymous, I don’t take great pains to hide who I am. I am in my 50s and live on the West coast. I don’t have socials and I don’t post anywhere else. Have at it!
If you are semi-retired, you’re free from the threat of cancellation. As long as you aren’t posting about crimes, there are limits to what anyone can legally do to you. (Still, it’s good to be prudent and limit sharing.)
Everyone should really stop posting online unless their job requires it.
The platforms offer only castrated interactions designed not to accomplish anything. People online are useless, obnoxious shadows of their helpful and loving selves.
No one cares more what you say than those monitoring you and building that detailed profile with sinister motives. The ratio must be something like 1000:1 or worse.
I want to use "slower" methods of identification more. Like say for instance within a few blocks of you a human can identify who you are for any service that wants to do some kind of verification/proof you are/have XYZ.
We could designate specific individuals to do this for you and me, just like we do with today's trust authorities for website certificates.
No more verified profiles by uploading names, emails, passports and photographs (gosh!). Just turned 18 and want to access Insta? Go to the local high school teacher to get age verified. Finished a career path and want it on LinkedIn? Go to the company officer. Are you a new journalist who wants to be designated as such on X, but anonymously? Go to the notary public.
One can do this cryptographically with no PII exchanged between the person, the community or the webservice. And you can be anonymous yet people know you are real.
It can be all maintained on a tree of trust, every individual in the chain needs to be verified, and only designated individuals can do actions that are sensitive/important.
You only need to do this once every so often to access certain services. Bonus: you get to take a walk and meet a human being.
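One way the "no PII exchanged" step could work is a blind signature: the designated verifier signs a token without ever seeing it, so the web service can later check the signature but cannot link it back to the in-person visit. The sketch below uses textbook RSA with tiny numbers purely for illustration. It is insecure as written, every name in it is hypothetical, and it does not model the tree of trust itself (who is allowed to hold a signing key).

```python
import hashlib
import math
import secrets
from typing import Tuple

# Toy RSA key held by the verifier (e.g. the notary). Insecure by design:
# a real deployment would use a vetted library and full-size keys.
P, Q = 61, 53
N = P * Q                           # public modulus
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, known only to the verifier

def token_to_int(token: bytes) -> int:
    """Map a random token to an integer modulo N."""
    return int.from_bytes(hashlib.sha256(token).digest(), "big") % N

def blind(m: int) -> Tuple[int, int]:
    """User side: hide m behind a random factor r before handing it over."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return (m * pow(r, E, N)) % N, r

def sign_blinded(blinded: int) -> int:
    """Verifier side: sign after checking the person face to face; m stays hidden."""
    return pow(blinded, D, N)

def unblind(signed_blinded: int, r: int) -> int:
    """User side: strip r to obtain an ordinary signature on m."""
    return (signed_blinded * pow(r, -1, N)) % N

def verify(m: int, sig: int) -> bool:
    """Service side: check the signature using only the public key (N, E)."""
    return pow(sig, E, N) == m

# Usage: the user picks a token, gets it blind-signed, and shows it to the service.
token = secrets.token_bytes(16)
m = token_to_int(token)
blinded, r = blind(m)
sig = unblind(sign_blinded(blinded), r)
```

The verifier only ever sees `blinded`, and the service only ever sees `m` and `sig`, so neither can connect the credential to the person who walked in.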
Doesn’t all this deanonymization stuff depend on one fatal assumption: that people are actually being truthful with what they say about themselves?
If you’re basically LARPing a new personality every time and just making up details about where you live or what your life is like then how is this ever going to work? Someone could say they live in San Francisco while actually living in Indiana.
If you can deanonymize at scale with LLMs, then on a personal level you should also be able to figure out which posts are leading to this deanonymization and remove or modify them.
But with HN, I'd like to ask @dang and HN leadership to support deleting messages, or making them private (requiring an HN account to see your posts).
At first I thought of how this would impact employment. But then I thought about how ICE has been tapping Reddit, Facebook and other services to monitor dissenters. The whole Orwellian concern is no longer theoretical. I personally fear physical violence from my government as a result. But I will continue to criticize them; I just wish it wasn't so easy for them to retaliate.
Maybe I missed something, but I see little evidence that there is a concerning ability to deanonymize. Many people post under a pseudonym but then link to their GitHub etc.
In fact by construction the HN dataset _only_ consists of people who are comfortable with their real identity being linked to it.
The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.
We don't use (much) stylometry, so this won't help. This is totally something you could try, but we use interests and clues. Semantic information you reveal about yourself.
I don't think this is working any more, but there was a stylometric analysis of HN users a few years ago, and it was extremely effective (at least, for myself and people who felt the need to post in the comments): https://news.ycombinator.com/item?id=33755016
There is also a practical issue here: people usually don't write a lot on LinkedIn; most just have structured biographical information. We use very limited stylometry in section 6 for matching Reddit users whom we synthetically split according to time.
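As a rough illustration of what "synthetically split according to time" could look like, the sketch below cuts one user's comment history into an earlier and a later half, which can then be treated as two pseudo-accounts to re-link. The comment only says the split is by time; splitting at the midpoint of the sorted history, and the name `split_by_time`, are assumptions.

```python
from typing import List, Tuple

Comment = Tuple[int, str]  # (unix timestamp, comment text)

def split_by_time(comments: List[Comment]) -> Tuple[List[Comment], List[Comment]]:
    """Split one user's history into an earlier and a later pseudo-account."""
    ordered = sorted(comments, key=lambda c: c[0])
    mid = len(ordered) // 2  # assumption: cut at the midpoint of the history
    return ordered[:mid], ordered[mid:]

# Hypothetical example history, deliberately out of order.
history = [(30, "c"), (10, "a"), (20, "b"), (40, "d")]
early, late = split_by_time(history)
```

Because the two halves never overlap in time, any match between them has to come from the content or style of the comments rather than from shared threads.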
L33tsp34k also accomplishes this. The original anonymising hacker stylometry :)
I am intrigued by the idea that in the future, communities might create a merged brand voice that their members choose to speak in via LLMs, to protect individual anonymity.
Maybe only your close friends hear your real voice?
People who comment about their boss and workplaces?
People on HN who talk about their work but want to remain anonymous? People who don’t want to be spammed if they comment in a community? Or harassed if they comment in a community? Maybe someone doesn’t want others to find out they are posting in r/depression. (Or r/warhammer.)
Anonymity is a substantial aspect of the current internet. It’s the practical reason you can have a stance against age verification.
On the other hand, if anonymity can be pierced with relative ease, then arguments for privacy are non sequiturs.
I can imagine a lot of countries who want to control what their citizens say abroad. I know Iraq in Saddam Hussein's time did it in the UK, China does it now.
That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5. We also try to use similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see if LLMs are better than Narayanan's method purely on movie reviews. LLMs are still much better (getting about 8%, but the average person only had 2.5 movies and 48% only shared one movie, so it is very difficult to match).
I think the implication is this will become trivial and trivially automated, no human investigator needed. I bet there will be plugins in one year's time to right click on a post and get a full report on who the author is.
We test different methods; in section 2, we use LLM agents to agentically identify people. We don't share any code here, but you could try with various freely available agents on yourself.
If LLMs can identify a person across websites, I can ask an LLM to read his posts and write like him, impersonating him, and this then feeds back into the tools identifying him. I can probabilistically malign a person this way.
I actually think those most at risk are normal people the activists will harass. Soon it will be possible for anybody who works at the “wrong” business or expresses any opinion on any subject to be casus belli for unhinged, terminally online, mentally ill people who are mad about the thing of the day to start making threatening calls to your employer or making false reports to police or sending deep fake porn to your mom.
I think that we are close to a time where the Internet is so toxic and so policed that the only reasonable response is to unplug.
> I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration.
I don’t think you’re wrong, but the fact that people consider it inevitable we’ll all have an immutable social acceptance grade that includes everything from teenage shitposts to things you said after a loved one died, or getting diagnosed with cancer, makes me regret putting even a moment of my professional energies towards advancing tech in the US.
Throwaway accounts using "clever" turns of phrase can often be deanonymized by double-clicking, right-clicking -> googling their witty pun and seeing the sole other instance elsewhere, on Twitter, Facebook, etc.
If I see a couple of words I don't know in a row, I can infer a poster's real name.
I'd be more specific, but any example is doxxing, literally so.
Attacks can be chained, and this can all be automated. For example, imagine pigbutchering scams... except it's there, similar to some voice-cloning scams, just to get enough data to stylometrically fingerprint you for future reference. You make sure to never comment too much or spicily under your real name, but someone slides into your DMs with a thoughtful, informative, high-quality comment, and you politely strike up an interesting conversation which goes well and you think nothing of it and have forgotten it a week later - and 5 years later you're in jail or fired or have been doxed or been framed. 'Direct methods' can't deliver that kind of capability post hoc, even for actors who do have access to those methods (which is a vanishing percentage of all actors). No one has cheap enough intelligence and skilled labor to do this right now. But they will.
Clearly the CIA or another government institution. Its purpose is to create an irresistible honeypot so that anyone who figures out a working and time-feasible implementation of Shor's algorithm or another prime factorization technique would reveal their hand.
I am similar in that all of my interactions are with my real name and it is unique enough that just putting it into google will instantly identify me. There is one other 'jeff sponaugle' but I think he is far more annoyed with my presence than I would be with him.
On the plus side, someone will sometimes say while talking to me: oh, you are that Subaru guy, or that YouTube guy, or whatever, and that is a fun connection.
I have lived my life on the web under the assumption the other Tom Clancy will leave enough chaff in my wake to make things hard. But probably not because I make the same 5 or 6 jokes over and over.
Unless you're in the nebulous situation of being Hispanic in the US, in which case you might get profiled. Or you might have family with jobs that are subject to pressure -- and right now, that seems like most jobs, because calling employers spineless is an insult to worms. Or if you'd like to travel by air, because watchlists are back, and carriers may just refuse service.
How would "flooding the zone" actually work in that case?
AFAIK the strategy is usually used to divert attention from one subject that could be harmful to a person to some other stuff.
Wouldn’t spamming in that case provide more information about you?
Editorial Channel
What the content says
+0.20
Article 27 Cultural Participation
Medium Practice
Editorial
+0.20
SETL
+0.10
The article is published and freely accessible, supporting participation in the advancement of knowledge. However, the knowledge shared concerns a technology that could undermine the conditions for safe intellectual contribution.
FW Ratio: 67%
Observable Facts
Article shares research findings openly and is freely accessible.
Substack platform enables open publication without apparent barriers to contribution.
Inferences
Free access and open publication support participation in the advancement of knowledge.
-0.30
Article 19 Freedom of Expression
High Framing Practice
Editorial
-0.30
SETL
-0.41
Freedom of expression is both enabled (by free publication) and threatened (by deanonymization risk). The article celebrates the deanonymization capability without discussing how it chills free expression. Anonymous speech is often essential for vulnerable speakers.
FW Ratio: 60%
Observable Facts
Article is published freely and accessible without paywall.
The research demonstrates capacity to identify individuals from anonymous posts, creating a chilling effect on anonymous speech.
No discussion of how deanonymization undermines freedom of expression for whistleblowers, activists, or vulnerable individuals.
Inferences
Free publication supports Article 19, but the content describes a technology that undermines the conditions for secure anonymous speech.
The framing celebrates deanonymization without acknowledging its role in suppressing free expression through chilling effects.
-0.30
Article 21 Political Participation
Medium Framing
Editorial
-0.30
SETL
-0.30
Participation in government and public affairs often requires anonymity for safety. The capacity to deanonymize individuals undermines the conditions for free and fair participation, particularly in contexts of oppression or corporate surveillance.
FW Ratio: 67%
Observable Facts
The research identifies individuals based on their online posts and inferred attributes.
No discussion of how deanonymization affects political participation or whistleblowing.
Inferences
Systematic deanonymization constrains the conditions for free and fair participation in governance, particularly for marginalized voices.
-0.40
Article 1 Freedom, Equality, Brotherhood
Medium Framing
Editorial
-0.40
SETL
-0.40
Content does not engage with equality or non-discrimination frameworks. The technology described can be used to target individuals based on inferred attributes (location, profession, interests), raising discrimination risks that are not discussed.
FW Ratio: 67%
Observable Facts
The method 'identifies users' based on 'where you live, what you do, and your interests.'
No discussion of how this capability might be weaponized against marginalized groups or for discriminatory purposes.
Inferences
The framing of identifying individuals by inferred demographic and professional attributes creates conditions for discriminatory targeting, which is not acknowledged.
-0.40
Article 13 Freedom of Movement
Medium Framing
Editorial
-0.40
SETL
-0.40
The technology described constrains freedom of movement and association by making anonymity—a precondition for free movement online—difficult or impossible. The article does not address this constraint.
FW Ratio: 67%
Observable Facts
The method scales deanonymization 'to tens of thousands of candidates,' making evasion difficult.
Individuals seeking anonymity for safety reasons may now avoid online platforms or self-censor.
Inferences
Systematic deanonymization constrains freedom of movement and association online, particularly for vulnerable populations.
-0.40
Article 30 No Destruction of Rights
Medium Framing
Editorial
-0.40
SETL
-0.40
The article does not discuss safeguards against misuse of the deanonymization technology. By framing it as a capability to be scaled without discussing governance or protective measures, it risks enabling rights violations.
FW Ratio: 67%
Observable Facts
The research demonstrates scalable deanonymization capabilities across platforms.
No discussion of how to prevent misuse for surveillance, targeting, or discrimination.
Inferences
The absence of discussion about safeguards and governance increases the risk that this capability will be used to undermine rather than protect rights.
-0.50
Preamble
Medium Framing
Editorial
-0.50
SETL
-0.57
Content frames deanonymization technology as a capability to be measured and scaled, without prominent framing regarding threats to human dignity or privacy as foundational rights. The preamble's emphasis on universal dignity and freedom is not directly engaged.
FW Ratio: 60%
Observable Facts
Article headline states 'We measure the capabilities of LLMs to deanonymize users online.'
The article is marked isAccessibleForFree: true in schema metadata.
TL;DR explicitly states 'LLM agents can figure out who you are from your anonymous online posts.'
Inferences
The framing emphasizes technical capability and scale rather than the threat to privacy as a foundational human right.
Free access supports knowledge dissemination, but the knowledge being shared is about a capability that undermines anonymity protections.
-0.50
Article 3 Life, Liberty, Security
Medium Framing
Editorial
-0.50
SETL
-0.50
The right to security of person is directly threatened by large-scale deanonymization, which can expose individuals to targeting, harassment, or violence. The article does not address security implications or safeguards.
FW Ratio: 67%
Observable Facts
The research identifies individuals across multiple platforms (Hacker News, Reddit, LinkedIn, interviews).
No mention of security measures, warnings to users, or protections against misuse.
Inferences
The capability to identify individuals systematically increases vulnerability to targeting and harm, particularly for those seeking anonymity for protection.
-0.50
Article 28 Social & International Order
Medium Framing
Editorial
-0.50
SETL
-0.50
The article describes a technology that undermines the social and international order necessary for the realization of all UDHR rights. Deanonymization at scale threatens the conditions for safe exercise of rights, particularly for vulnerable populations.
FW Ratio: 67%
Observable Facts
The research scales deanonymization across multiple platforms and datasets.
No discussion of international norms, consent frameworks, or safeguards against misuse.
Inferences
The technology described could undermine the international order necessary for rights protection, particularly for vulnerable populations.
-0.60
Article 2 Non-Discrimination
High Framing
Editorial
-0.60
SETL
-0.60
The core subject—large-scale deanonymization—directly undermines the right to liberty. The article presents this as a technical achievement without grappling with the constraint on freedom of expression and movement that pervasive deanonymization imposes.
FW Ratio: 60%
Observable Facts
The research demonstrates that 'from a handful of comments, LLMs can infer where you live, what you do, and your interests.'
The method 'scales to tens of thousands of candidates,' indicating systemic application potential.
No discussion of consent, notice, or user choice in the deanonymization process.
Inferences
The capability to deanonymize at scale fundamentally constrains the exercise of liberty without explicit consent or knowledge.
The technical framing obscures the rights implications of systematic surveillance and identification.
-0.60
Article 29 Duties to Community
High Framing
Editorial
-0.60
SETL
-0.60
The article does not discuss the limitations and responsibilities that accompany the right to develop and share scientific knowledge. The deanonymization technology is presented as an achievement without explicit consideration of ethical limits, consent, or protection of vulnerable rights-holders.
FW Ratio: 50%
Observable Facts
The article presents deanonymization capabilities as research achievements without discussing ethical constraints.
No mention of informed consent, institutional review, or protections for research subjects.
Inferences
The framing treats deanonymization as a technical achievement without grappling with the ethical and rights-based limitations on knowledge production and dissemination.
The article does not acknowledge the rights of the individuals whose anonymity is being compromised.
-0.70
Article 12 Privacy
High Framing Practice
Editorial
-0.70
SETL
-0.53
The core subject is the systematic violation of privacy. The article presents deanonymization as an achievement without addressing the fundamental right to privacy in personal communications and data. No discussion of consent, minimization, or protective measures.
FW Ratio: 60%
Observable Facts
Research explicitly shows LLMs can deanonymize users 'from a handful of comments.'
The method infers personal attributes (location, profession, interests) from public posts without consent.
No privacy policy or user warning about deanonymization risk is present on the article page.
Inferences
The framing celebrates a capability that systematically violates privacy, treating it as a technical achievement rather than a rights violation.
The Substack platform enables the very surveillance described, without transparent protections or user agency.
ND (not directly engaged): Article 4 No Slavery; Article 5 No Torture; Article 6 Legal Personhood; Article 7 Equality Before Law; Article 8 Right to Remedy; Article 9 No Arbitrary Detention; Article 10 Fair Hearing; Article 11 Presumption of Innocence; Article 14 Asylum; Article 15 Nationality; Article 16 Marriage & Family; Article 17 Property; Article 18 Freedom of Thought; Article 20 Assembly & Association; Article 22 Social Security; Article 23 Work & Equal Pay; Article 24 Rest & Leisure; Article 25 Standard of Living; Article 26 Education.
Structural Channel
What the site does
Domain Context Profile
Element | Modifier | Affects | Note
Privacy | - | - | No explicit privacy policy visible on this article page; Substack platform TOS applies.
Terms of Service | - | - | Standard Substack terms; not directly observable on article.
Accessibility | - | - | Article text accessible; semantic HTML structure present.
Mission | - | - | Author tagline: 'Aligning AI is hard.' Suggests alignment/safety focus.
Editorial Code | - | - | Academic/technical publication format; no explicit editorial guidelines visible.
Ownership | - | - | Individual Substack publication by Simon Lermen; no corporate ownership signals.
Access Model | +0.15 | Article 19, Article 27 | Article marked isAccessibleForFree: true; open access lowers barriers to knowledge dissemination.
Ad/Tracking | - | - | Substack infrastructure may track users; not directly observable on this article.
+0.25
Article 19 Freedom of Expression
High Framing Practice
Structural
+0.25
Context Modifier
+0.15
SETL
-0.41
The article is freely published and accessible (positive for Article 19). However, the underlying platform dynamics enable surveillance that constrains free expression through the chilling effect of deanonymization risk.
+0.15
Preamble
Medium Framing
Structural
+0.15
Context Modifier
0.00
SETL
-0.57
Article is freely accessible (isAccessibleForFree: true), supporting information access. However, the platform itself enables the very threat discussed—deanonymization risk—without explicit protections or warnings.
+0.15
Article 27 Cultural Participation
Medium Practice
Structural
+0.15
Context Modifier
+0.15
SETL
+0.10
The article is freely accessible (isAccessibleForFree: true), lowering barriers to knowledge access and participation in scientific discourse.
0.00
Article 1 Freedom, Equality, Brotherhood
Medium Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.40
No observable structural discrimination in the article's presentation itself. Content is universally accessible in terms of format.
0.00
Article 2 Non-Discrimination
High Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.60
No structural issues with the article itself, but the technology it describes represents a systemic threat to Article 2 protections.
0.00
Article 3 Life, Liberty, Security
Medium Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.50
No structural safeguards observable on the article page itself.
0.00
Article 13 Freedom of Movement
Medium Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.40
No structural barriers to movement or association in the article itself.
0.00
Article 21 Political Participation
Medium Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.30
No structural barriers to participation observable on the article itself.
0.00
Article 28 Social & International Order
Medium Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.50
No observable structural violations on the article itself.
0.00
Article 29 Duties to Community
High Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.60
No observable community standards or ethical frameworks visible on the article page.
0.00
Article 30 No Destruction of Rights
Medium Framing
Structural
0.00
Context Modifier
0.00
SETL
-0.40
No observable safeguards or governance frameworks are mentioned on the article.
-0.30
Article 12 Privacy
High Framing Practice
Structural
-0.30
Context Modifier
0.00
SETL
-0.53
The platform itself (Substack/LLM ecosystem) enables the surveillance and deanonymization described, without transparent privacy safeguards or user control mechanisms observable on the article.
ND (not directly engaged): Article 4 No Slavery; Article 5 No Torture; Article 6 Legal Personhood; Article 7 Equality Before Law; Article 8 Right to Remedy; Article 9 No Arbitrary Detention; Article 10 Fair Hearing; Article 11 Presumption of Innocence; Article 14 Asylum; Article 15 Nationality; Article 16 Marriage & Family; Article 17 Property; Article 18 Freedom of Thought; Article 20 Assembly & Association; Article 22 Social Security; Article 23 Work & Equal Pay; Article 24 Rest & Leisure; Article 25 Standard of Living; Article 26 Education.
Supplementary Signals
Epistemic Quality
0.66 high claims
Sources
0.7
Evidence
0.8
Uncertainty
0.4
Purpose
0.8
Propaganda Flags
2 techniques detected
loaded language
Framing deanonymization as a 'capability' and 'achievement' that is 'increasingly practical', without corresponding language about rights violations or harm.
obfuscation
Technical framing of deanonymization obscures the rights implications and potential for misuse by focusing on methodology rather than impact.