This technical blog post documents a divergence between two Unicode specifications (confusables.txt and NFKC normalization) relevant to identifier validation security. Content contributes to scientific knowledge through detailed technical education, transparent correction practices, and submission of findings to the Unicode Consortium standards process. The post demonstrates commitment to equitable information access through free open-access publication, accessibility features, and open-source code release.
> The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it
That is a really bad and user-hostile thing to do. Many of those characters are perfectly valid characters in various non-Latin scripts. If you want to force everyone to use Latin script for identifiers, then own up to it and say so. But rejecting just some of them for being too similar to Latin characters makes the behaviour inconsistent and confusing for users.
Unicode is both the best thing that's ever happened to text encoding and the worst. The approach I take here is to treat any text coming from the user as toxic waste. Assume it will say "Administrator" or "Official Government Employee" or be 800 pixels tall because it was built only out of decorative combining characters. Then put it in a fixed box with overflow hidden, and use some other UI element to convey things like "this is an official account."
The worst part of normalizing and remapping characters, which this article doesn't even touch on, is the risk that your login form doesn't do it but your database does. Suddenly I can re-register an existing account by using a different set of codepoints that the login system doesn't think exists but the auth system maps to somebody else's record.
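One way to avoid that mismatch is to route every layer through a single canonicalization function. The sketch below is only an illustration: `canonicalKey`, `register`, `lookup`, and the in-memory `accounts` map are hypothetical names standing in for a real registration form and user table, and the choice of NFKC plus lowercasing is an assumption, not something the article prescribes.

```typescript
// Hypothetical helper: the one place where identifiers are canonicalized.
function canonicalKey(raw: string): string {
  // NFKC folds compatibility variants (fullwidth letters, ligatures);
  // lowercasing afterwards makes the comparison case-insensitive.
  return raw.normalize("NFKC").toLowerCase();
}

// In-memory stand-in for the user table (an assumption for this sketch).
const accounts = new Map<string, string>();

// Registration and lookup both go through canonicalKey, so no layer can
// see two spellings of the same name as two different accounts.
function register(rawName: string, owner: string): boolean {
  const key = canonicalKey(rawName);
  if (accounts.has(key)) return false; // taken under some other spelling
  accounts.set(key, owner);
  return true;
}

function lookup(rawName: string): string | undefined {
  return accounts.get(canonicalKey(rawName));
}

const first = register("admin", "alice");
// Fullwidth "admin" (U+FF41 U+FF44 U+FF4D U+FF49 U+FF4E) has the same NFKC form.
const second = register("\uFF41\uFF44\uFF4D\uFF49\uFF4E", "mallory");

console.log(first, second, lookup("ADMIN")); // true false alice
```

The point of the design is that the uniqueness constraint is enforced on the canonical key, not on the raw bytes the user typed.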
Does the "removing dead code" advantage outweigh the additional complexity of having to maintain 2 different confusables lists: one for when NFKC has been applied first and one without? It didn't sound like applying one after the other caused any errors, just that some previously reachable states are unreachable.
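To make the "unreachable states" point concrete, here is a minimal sketch using the long s, the one disagreement the thread singles out. The `lookalikes` map is a tiny hand-written excerpt for illustration, not the real confusables.txt data, and `skeleton` is a hypothetical name.

```typescript
// Illustrative excerpt only; the real confusables.txt has thousands of entries.
const lookalikes = new Map<string, string>([
  ["\u017F", "f"], // LATIN SMALL LETTER LONG S, treated as confusable with "f"
  ["\u0430", "a"], // CYRILLIC SMALL LETTER A
]);

// Replace each code point by its lookalike target, if it has one.
function skeleton(input: string): string {
  return Array.from(input, (ch) => lookalikes.get(ch) ?? ch).join("");
}

const raw = "congre\u017Fs";

// Confusables map alone: the long s becomes "f".
const mapOnly = skeleton(raw);

// NFKC first: the long s is already "s" before the map runs, so the
// long-s entry can never fire on normalized input.
const nfkcThenMap = skeleton(raw.normalize("NFKC"));

console.log(mapOnly, nfkcThenMap); // congrefs congress
```

Running the map after NFKC produces no error here; the long-s entry simply never matches, which is the "dead code" the question is about.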
Tangential - I'm aware of various types of, let's say, "swappability" that Unicode defines (broader than the Unicode concept of "equivalence"):
- Canonical (NF)
- Compatible (NFK)
- Composed vs decomposed
- Confusable (confusables.txt)
Does Unicode not define something like "fuzzy" equivalence? Like "confusable" but more broad, for search bar logic? The most obvious differences would be case and diacritic insensitivity (e, é). Case is easy since any string/regex API supports case insensitivity, but diacritic insensitivity is not nearly as common, and there are other categories of fuzzy equivalence too (e.g. ø, o).
I guess it makes sense for Unicode to not be interested in defining something like this, since it relates neither to true semantics nor security, but it's an incredibly common pattern, and if they offered some standard, I imagine more APIs would implement it.
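For what it's worth, the case and diacritic part of that "fuzzy" matching is often approximated by decomposing and dropping combining marks. The sketch below assumes a hypothetical `foldForSearch` helper and only strips marks in the U+0300 to U+036F block, so it is a rough search-bar fold rather than any standard equivalence.

```typescript
// Hypothetical search-key helper: case- and diacritic-insensitive fold.
function foldForSearch(text: string): string {
  return text
    .normalize("NFD")                 // "é" becomes "e" + U+0301
    .replace(/[\u0300-\u036f]/g, "")  // drop combining diacritical marks
    .toLowerCase();
}

console.log(foldForSearch("Caf\u00E9")); // cafe
// "ø" (U+00F8) has no decomposition, so this fold leaves it alone;
// that category needs its own explicit mapping.
console.log(foldForSearch("s\u00F8k") === "s\u00F8k"); // true
```

This shows why ø and o fall into a separate category from e and é: one is handled by normalization, the other needs a hand-maintained table.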
My theory: The "long S" in "Congreſs" is an f. They used f instead of s because without modern dental care, a lot of people in the 1600's and 1700's were miffing teeth and fpoke with a lifp.
> If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it)
That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string. It reorders and combines code points into equivalent code point sequences, e.g. to simplify equality tests.
In NFKC, the K for "Compatibility" means some characters are replaced with similar, simpler code points. I've found NFKC useful for making text search indexes where you want matches to be forgiving, but it would be both obvious and wrong to use it in most of the web because it would dramatically change what the user has entered. See the examples in https://www.unicode.org/reports/tr15/.
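The difference is easy to see with the built-in `String.prototype.normalize`. This is a small sketch with hand-picked sample characters; the variable names are only for illustration.

```typescript
// NFC only recombines: "e" + COMBINING ACUTE ACCENT becomes a single "é".
const decomposed = "e\u0301";
const composed = decomposed.normalize("NFC");

// NFC keeps compatibility characters; NFKC replaces them.
const ligature = "\uFB01";    // LATIN SMALL LIGATURE FI
const circledOne = "\u2460";  // CIRCLED DIGIT ONE

const ligatureNfc = ligature.normalize("NFC");
const ligatureNfkc = ligature.normalize("NFKC");
const circledNfkc = circledOne.normalize("NFKC");

console.log(composed === "\u00E9");   // true
console.log(ligatureNfc === ligature); // true: NFC leaves the ligature intact
console.log(ligatureNfkc);             // fi
console.log(circledNfkc);              // 1
```

The last two lines are the lossy part: after NFKC there is no way to tell that the user typed a ligature or a circled digit.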
What would make sense is to have a blacklist of usernames (like "admin" or "moderator"), then use the confusables map to see if a username or slug is visually confusable with a name from that blacklist.
I initially thought that must surely be what they are doing and they just worded it very, very poorly. But then of the 31 "disagreements" only one matters, the long s that's either f or s. All other disagreements map to visually similar symbols, like O and 0, which you should already treat as the same for this check.
Not to mention that this is mostly an issue for URL slugs, so after NFKC normalization. In HTML this is more robustly solved by styling conventions. Even old bb-style forums will display admin and moderator user names in a different color or in bold to show their status. The modern flourish is to put a little icon next to these kinds of names, which also scales well to other identifiers.
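The reserved-name check suggested above could look roughly like the following. Everything here is an assumption made for illustration: the `reserved` list, the few entries in `lookalikeMap` (the real confusables.txt is far larger), and the helper names `skeleton` and `isReservedLookalike`.

```typescript
// Hypothetical reserved names for this sketch.
const reserved = ["admin", "moderator"];

// Tiny illustrative lookalike table, not the real confusables.txt data.
const lookalikeMap = new Map<string, string>([
  ["\u0430", "a"], // CYRILLIC SMALL LETTER A
  ["\u0435", "e"], // CYRILLIC SMALL LETTER IE
  ["\u043E", "o"], // CYRILLIC SMALL LETTER O
  ["0", "o"],      // digit zero, treated like the letter o
]);

// Normalize, lowercase, then replace each code point by its lookalike target.
function skeleton(name: string): string {
  const folded = name.normalize("NFKC").toLowerCase();
  return Array.from(folded, (ch) => lookalikeMap.get(ch) ?? ch).join("");
}

// Compare skeletons on both sides so the reserved list is folded the same way.
const reservedSkeletons = new Set(reserved.map(skeleton));

function isReservedLookalike(name: string): boolean {
  return reservedSkeletons.has(skeleton(name));
}

console.log(isReservedLookalike("\u0430dmin")); // true (Cyrillic a)
console.log(isReservedLookalike("m0derat0r"));  // true (zeros for o)
console.log(isReservedLookalike("paul"));       // false
```

Note that this only flags names that collide with the reserved list; ordinary non-Latin names pass through untouched, which is the behaviour the comment is arguing for.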
This is an inexplicable, AI-written article and the obvious answer is no. There's no performance or complexity overhead to not removing a couple of dead characters. There is a complexity overhead to forking off the list or adding pointless special cases to your code.
For some sorts of "confusables", you don't even need Unicode in some cases. Depending on the cursed combination of font, kerning, rendering and display, `m` and `rn` are also very hard to distinguish.
I think we're expecting too much from an LLM generated article from a user that has been spending a lot of time spamming their content across multiple platforms and websites.
Thanks Josh - putting this article out there has pushed me to sharpen a lot of my thinking which hopefully should come across in my more recent work. I've updated the article to scope the NFKC recommendation to identifiers and added a note crediting your correction. Thanks for catching it.
I agree that rejecting valid non-Latin characters in valid contexts is user-hostile, but I should be clearer about scope: this is specifically about machine-readable identifiers (slugs, handles, ENS names) where the character set is intentionally restricted, not display names or user-facing text.
The approach there should be what wongarsu describes below (imo), to style the UI so official accounts are visually distinct (badges, colour, etc.) rather than policing the character set.
namespace-guard is deliberately opinionated for the slug/handle case where you've already decided the output should be ASCII-safe. If your use case is broader than that, confusables detection without rejection is the right call.
> or be 800 pixels tall because it was built only out of decorative combining characters
Also known as Zalgo. But it seems most renderers nowadays overlay multiple combining marks over each other rather than stack them, which makes it look far less eldritch than it used to.
Editorial Channel
What the content says
+0.40
Article 27: Cultural Participation
Medium Framing Practice
Editorial
+0.40
SETL
+0.14
Article 27 protects right to participation in cultural and scientific life and share in benefits of scientific progress. Content contributes to shared scientific/technical knowledge of Unicode standards by documenting divergences between two independent Unicode specifications. Author participates in Unicode Consortium review process (PRI #540) and submits findings for official consideration. Framing emphasizes collaborative knowledge-building and scientific transparency.
FW Ratio: 50%
Observable Facts
Author submitted findings to Unicode Consortium, documented as 'accepted into PRI #540 for review by the UTC at meeting #187.'
Content credits community contributors: 'Thanks to ficiek, v4ss42' and acknowledges correction from 'joshdata on HN.'
Implementation is provided as open-source TypeScript library (namespace-guard) freely available to developers.
Inferences
Submission to Unicode Consortium standards process demonstrates participation in collaborative scientific and technical governance.
Free open-source code release enables broader technical community to benefit from documented research.
Attribution of community contributors recognizes and honors collaborative knowledge development.
+0.30
Article 19: Freedom of Expression
Medium Framing Practice
Editorial
+0.30
SETL
-0.13
Article 19 protects freedom of opinion and expression, including receipt of information. Content provides detailed technical documentation on Unicode standards, security mechanisms, and normalization algorithms. Author shares knowledge freely and transparently, including acknowledged corrections and updates. Framing emphasizes clarity and accessibility of complex technical information.
FW Ratio: 60%
Observable Facts
Page content is freely accessible without registration or paywalls.
Author explicitly acknowledges corrections: 'Updated 25 February 2026: Clarified that NFKC is specifically recommended for identifiers, not general web text.'
Content includes detailed explanations of Unicode standards, security mechanisms, and practical implementation guidance.
Inferences
Open publication of technical security knowledge supports informed decision-making for developers implementing identifier validation systems.
Transparent acknowledgment of corrections demonstrates intellectual honesty about limitations of prior analysis.
+0.25
Article 26: Education
Medium Framing
Editorial
+0.25
SETL
+0.22
Article 26 protects right to education and development of human personality. Content provides detailed technical education on Unicode standards, normalization, and security mechanisms. Framing emphasizes learning and understanding: clear explanations, practical examples, step-by-step walkthroughs, and generator scripts enable knowledge transfer and skill development.
FW Ratio: 50%
Observable Facts
Content includes structured education: definitions (NFKC, confusables), worked examples (Long S, styled characters), data tables, and implementation guidance.
Author provides reproducible generator script and open-source implementation (namespace-guard) for practical learning and application.
Inferences
Detailed technical education format supports reader development of expertise in Unicode security practices.
Provision of reproducible scripts and open-source code enables hands-on learning and practical skill development.
ND
Preamble
Low
Preamble concerns dignity, equality, and freedom from arbitrary discrimination. Content is technical documentation on Unicode normalization with no explicit engagement with dignity or discrimination frameworks.
FW Ratio: 50%
Observable Facts
The blog is hosted on GitHub Pages without registration requirements or access restrictions.
A theme toggle button allows users to select dark or light mode display.
Inferences
Open access to technical documentation supports equitable information distribution.
Theme toggle accessibility feature suggests awareness of user needs beyond basic functionality.
ND
Article 1: Freedom, Equality, Brotherhood
Article 1 establishes universal equality and dignity. Content is technical; does not engage with equality or dignity frameworks.
ND
Article 2: Non-Discrimination
Article 2 prohibits discrimination. Content is neutral technical documentation with no discriminatory framing.
ND
Article 3: Life, Liberty, Security
Article 3 concerns right to life, liberty, and security of person. Not engaged by technical content.
ND
Article 4: No Slavery
Article 4 prohibits slavery and servitude. Not engaged.
ND
Article 5: No Torture
Article 5 prohibits torture and cruel treatment. Not engaged.
ND
Article 6: Legal Personhood
Article 6 concerns right to recognition as person before law. Not engaged.
ND
Article 7: Equality Before Law
Article 7 establishes equal protection before law. Not engaged.
ND
Article 8: Right to Remedy
Article 8 concerns right to remedy by competent authority. Not engaged.
ND
Article 9: No Arbitrary Detention
Article 9 prohibits arbitrary arrest/detention. Not engaged.
ND
Article 10: Fair Hearing
Article 10 concerns right to fair and public hearing. Not engaged.
ND
Article 11: Presumption of Innocence
Article 11 addresses criminal justice and presumption of innocence. Not engaged.
ND
Article 12: Privacy
Article 12 protects privacy, family, and correspondence. Not engaged.
ND
Article 13: Freedom of Movement
Article 13 establishes freedom of movement. Not engaged.
ND
Article 14: Asylum
Article 14 concerns right to seek asylum. Not engaged.
ND
Article 15: Nationality
Article 15 addresses right to nationality. Not engaged.
ND
Article 16: Marriage & Family
Article 16 protects right to marry and family. Not engaged.
ND
Article 17: Property
Article 17 concerns property rights. Not engaged.
ND
Article 18: Freedom of Thought
Article 18 protects freedom of thought, conscience, and religion. Not engaged.
ND
Article 20: Assembly & Association
Article 20 addresses right to peaceful assembly and association. Not engaged.
ND
Article 21: Political Participation
Article 21 concerns political participation and voting. Not engaged.
ND
Article 22: Social Security
Article 22 addresses economic, social, and cultural rights. Not engaged.
ND
Article 23: Work & Equal Pay
Article 23 protects right to work and just conditions. Not engaged.
ND
Article 24: Rest & Leisure
Article 24 addresses right to rest and leisure. Not engaged.
ND
Article 25: Standard of Living
Low Practice
Article 25 addresses right to adequate standard of living, including food, clothing, housing, and social services. Not directly engaged by technical content.
FW Ratio: 50%
Observable Facts
Content is freely accessible without cost barriers to readers seeking technical knowledge.
Inferences
Free access to security-relevant technical documentation supports equitable distribution of knowledge for system developers.
ND
Article 28: Social & International Order
Article 28 concerns social and international order for rights realization. Not engaged.
ND
Article 29: Duties to Community
Article 29 addresses community duties and limitations on rights. Not engaged.
ND
Article 30: No Destruction of Rights
Article 30 prohibits destruction of stated rights. Not engaged.
Structural Channel
What the site does
Domain Context Profile

| Element | Modifier | Affects | Note |
|---|---|---|---|
| Privacy | — | | No privacy-invasive tracking detected; static GitHub Pages site. |
| Terms of Service | — | | Not applicable for technical blog. |
| Accessibility | +0.05 | Article 19, Article 25, Article 26 | Theme toggle present (dark/light mode) supports accessibility. No explicit WCAG compliance statement visible. Content is text-heavy without apparent alt text for Unicode characters shown. |
| Mission | — | | Personal technical blog; no organizational mission statement. |
| Editorial Code | — | | No explicit editorial guidelines visible. |
| Ownership | — | | Author 'paultendo' identifiable from domain; ownership clear. |
| Access Model | +0.08 | Article 19, Article 27 | Free, open-access blog content supports information access. No paywall or registration barrier observed. |
| Ad/Tracking | — | | No advertising or tracking pixels detected in provided content. |
+0.35
Article 19: Freedom of Expression
Medium Framing Practice
Structural
+0.35
Context Modifier
+0.13
SETL
-0.13
Blog operates as open-access platform with no paywalls, registration barriers, or content restrictions. Free and unrestricted distribution of technical knowledge supports information access. Dark/light mode accessibility feature enables broader reader access. Updated date stamps and correction notes demonstrate transparency about knowledge updates.
+0.35
Article 27: Cultural Participation
Medium Framing Practice
Structural
+0.35
Context Modifier
+0.08
SETL
+0.14
Free open-access distribution of technical knowledge enables broader participation in scientific understanding of Unicode. Open-source implementation (namespace-guard) provides benefit-sharing. Author credits contributors and acknowledges influence of community feedback.
+0.08
Article 25: Standard of Living
Low Practice
Structural
+0.08
Context Modifier
+0.05
SETL
ND
Free open-access technical resource with accessibility features (dark/light mode) supports equitable access to information that could improve systems design and security practices.
+0.05
Preamble
Low
Structural
+0.05
Context Modifier
0.00
SETL
ND
Free open-access blog with dark/light mode toggle supports broad information access (Article 19 component). No paywalls or registration barriers.
+0.05
Article 26: Education
Medium Framing
Structural
+0.05
Context Modifier
+0.05
SETL
+0.22
Blog structure with free access supports educational access. Accessibility features (theme toggle) enable broader participation. No evidence of paywalls or registration barriers to educational content.
ND
Article 1: Freedom, Equality, Brotherhood
No structural signals related to equal treatment or dignity.
ND
Article 2: Non-Discrimination
No structural signals related to discrimination.
ND
Article 3: Life, Liberty, Security
No structural signals.
ND
Article 4: No Slavery
No structural signals.
ND
Article 5: No Torture
No structural signals.
ND
Article 6: Legal Personhood
No structural signals.
ND
Article 7: Equality Before Law
No structural signals.
ND
Article 8: Right to Remedy
No structural signals.
ND
Article 9: No Arbitrary Detention
No structural signals.
ND
Article 10: Fair Hearing
No structural signals.
ND
Article 11: Presumption of Innocence
No structural signals.
ND
Article 12: Privacy
No structural signals.
ND
Article 13: Freedom of Movement
No structural signals.
ND
Article 14: Asylum
No structural signals.
ND
Article 15: Nationality
No structural signals.
ND
Article 16: Marriage & Family
No structural signals.
ND
Article 17: Property
No structural signals.
ND
Article 18: Freedom of Thought
No structural signals.
ND
Article 20: Assembly & Association
No structural signals.
ND
Article 21: Political Participation
No structural signals.
ND
Article 22: Social Security
No structural signals.
ND
Article 23: Work & Equal Pay
No structural signals.
ND
Article 24: Rest & Leisure
No structural signals.
ND
Article 28: Social & International Order
No structural signals.
ND
Article 29: Duties to Community
No structural signals.
ND
Article 30: No Destruction of Rights
No structural signals.
Supplementary Signals
Epistemic Quality: 0.82 (medium claims)
Sources: 0.8
Evidence: 0.8
Uncertainty: 0.8
Purpose: 0.9
Propaganda Flags: 0 techniques detected
Solution Orientation: 0.81 (solution oriented)
Reader Agency: 0.8
Emotional Tone: measured
Valence: +0.1
Arousal: 0.3
Dominance: 0.4
Stakeholder Voice: 0.55 (3 perspectives)
Speaks: individuals, institution
About: institution, community
Temporal Framing: present, short term
Geographic Scope: global
Complexity: expert, high jargon, domain specific
Transparency: 0.75
✓ Author