+0.30 Confusables.txt and NFKC disagree on 31 characters

Name: HRCB Evaluation: Confusables.txt and NFKC disagree on 31 characters
Item: Confusables.txt and NFKC disagree on 31 characters
Rating: 0.298
Author: HN HRCB

+0.30	Confusables.txt and NFKC disagree on 31 characters (paultendo.github.io)
	57 points by pimterry 2 days ago | 38 comments on HN | Mild positive Editorial · v3.7 · 2026-02-26

Summary: Education & Technical Knowledge Access · Acknowledges

This technical blog post documents a divergence between two Unicode specifications (confusables.txt and NFKC normalization) relevant to identifier validation security. Content contributes to scientific knowledge through detailed technical education, transparent correction practices, and submission of findings to the Unicode Consortium standards process. The post demonstrates commitment to equitable information access through free open-access publication, accessibility features, and open-source code release.

Article Heatmap (figure not reproduced; legend: Negative, Neutral, Positive, No Data)

Aggregates

Weighted Mean	+0.30	Unweighted Mean	+0.26
Max	+0.46 Article 27	Min	+0.05 Preamble
Signal	5	No Data	26
Confidence	7%	Volatility	0.17 (Medium)
Negative	0	Channels	E: 0.6 S: 0.4
SETL	+0.08	Editorial-dominant
FW Ratio	52%	11 facts · 10 inferences

Evidence: High: 0 Medium: 3 Low: 2 No Data: 26

Theme Radar (figure not reproduced)

HN Discussion 7 top-level · 13 replies

brazzy 2026-02-25 14:35 UTC link

> The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it

That is a really bad and user-hostile thing to do. Many of those characters are perfectly valid characters in various non-Latin scripts. If you want to force everyone to use Latin script for identifiers, then own up to it and say so. But rejecting just some of them for being too similar to Latin characters just makes the behaviour inconsistent and confusing for users.

akersten 2026-02-25 14:52 UTC link

Unicode is both the best thing that's ever happened to text encoding and the worst. The approach I take here is to treat any text coming from the user as toxic waste. Assume it will say "Administrator" or "Official Government Employee" or be 800 pixels tall because it was built only out of decorative combining characters. Then put it in a fixed box with overflow hidden, and use some other UI element to convey things like "this is an official account."

The worst part of normalizing and remapping characters, which this article doesn't even touch on, is the risk that your login form doesn't do it but your database does. Suddenly I can re-register an existing account by using a different set of codepoints that the login system doesn't think exists but the auth system maps to somebody else's record.
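
A minimal sketch of the mismatch described above, assuming the storage layer applies NFKC while the availability check compares raw code points. The function names and the `admin` example are hypothetical and only illustrate the failure mode; they are not taken from the article or from any particular system.

```python
import unicodedata


def nfkc(name: str) -> str:
    """Normalize a username with NFKC (the assumed database behaviour)."""
    return unicodedata.normalize("NFKC", name)


def naive_is_taken(existing: set[str], candidate: str) -> bool:
    # Hypothetical signup check that compares raw code points only.
    return candidate in existing


def consistent_is_taken(existing: set[str], candidate: str) -> bool:
    # Same check, but both sides go through the normalization the store uses.
    return nfkc(candidate) in {nfkc(name) for name in existing}


existing_names = {"admin"}
lookalike = "\uff41dmin"  # FULLWIDTH LATIN SMALL LETTER A followed by "dmin"

print(naive_is_taken(existing_names, lookalike))       # False
print(consistent_is_taken(existing_names, lookalike))  # True
print(nfkc(lookalike) == "admin")                      # True
```

The naive check reports the name as free, yet the normalized key collides with the existing record. Applying one normalization at a single choke point, before both the check and the write, avoids the disagreement.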

kccqzy 2026-02-25 14:58 UTC link

If you allow users to submit an arbitrary Unicode string as text, why would you need to check confusables.txt? Whose confusion are you guarding against?

Liftyee 2026-02-25 15:18 UTC link

Does the "removing dead code" advantage outweigh the additional complexity of having to maintain 2 different confusables lists: one for when NFKC has been applied first and one without? It didn't sound like applying one after the other caused any errors, just that some previously reachable states are unreachable.

happytoexplain 2026-02-25 15:21 UTC link

Tangential - I'm aware of various types of, let's say, "swappability" that Unicode defines (broader than the Unicode concept of "equivalence"):

- Canonical (NF)

- Compatible (NFK)

- Composed vs decomposed

- Confusable (confusables.txt)

Does Unicode not define something like "fuzzy" equivalence? Like "confusable" but more broad, for search bar logic? The most obvious differences would be case and diacritic insensitivity (e, é). Case is easy since any string/regex API supports case insensitivity, but diacritic insensitivity is not nearly as common, and there are other categories of fuzzy equivalence too (e.g. ø, o).

I guess it makes sense for Unicode to not be interested in defining something like this, since it relates neither to true semantics nor security, but it's an incredibly common pattern, and if they offered some standard, I imagine more APIs would implement it.

csense 2026-02-25 15:30 UTC link

My theory: The "long S" in "Congreſs" is an f. They used f instead of s because without modern dental care, a lot of people in the 1600's and 1700's were miffing teeth and fpoke with a lifp.

joshdata 2026-02-25 15:37 UTC link

> If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it)

That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string. It reorders and combines code points into equivalent code point sequences, e.g. to simplify equality tests.

In NFKC, the K for "Compatibility" means some characters are replaced with similar, simpler code points. I've found NFKC useful for making text search indexes where you want matches to be forgiving, but it would be both obvious and wrong to use it in most of the web because it would dramatically change what the user has entered. See the examples in https://www.unicode.org/reports/tr15/.
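
The difference between the two forms can be seen with Python's standard `unicodedata` module. This is only an illustrative sketch; the sample characters are chosen here and do not come from the article.

```python
import unicodedata

# Sample inputs (picked for illustration).
samples = {
    "decomposed e-acute": "e\u0301",  # 'e' + COMBINING ACUTE ACCENT
    "fi ligature": "\ufb01",          # LATIN SMALL LIGATURE FI
    "long s": "\u017f",               # LATIN SMALL LETTER LONG S
    "fullwidth A": "\uff21",          # FULLWIDTH LATIN CAPITAL LETTER A
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    # !a prints an ASCII-safe repr so the code points stay visible.
    print(f"{label}: NFC={nfc!a} NFKC={nfkc!a}")
```

NFC only composes the first sample into the single code point U+00E9 and leaves the other three unchanged. NFKC additionally rewrites the ligature to `fi`, the long s to `s`, and the fullwidth letter to `A`, which is the loss of information the comment refers to.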

orthoxerox 2026-02-25 14:54 UTC link

The correct approach is to accept [a-z][a-z0-9]* as identifiers and forbid everything else.
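
As a rough sketch, that allow-list could be enforced as follows. The helper name `is_valid_identifier` is hypothetical; `re.fullmatch` is used so that the whole string, not just a prefix, has to match.

```python
import re

# ASCII-only allow-list: a lowercase letter, then lowercase letters or digits.
IDENTIFIER = re.compile(r"[a-z][a-z0-9]*")


def is_valid_identifier(candidate: str) -> bool:
    """Return True only if the entire string matches the allow-list."""
    return IDENTIFIER.fullmatch(candidate) is not None


print(is_valid_identifier("paultendo"))       # True
print(is_valid_identifier("congre\u017fs"))   # False: contains U+017F long s
print(is_valid_identifier("9lives"))          # False: starts with a digit
```

With this approach no confusables table or normalization step is needed, at the cost of excluding every non-ASCII identifier.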

wongarsu 2026-02-25 15:16 UTC link

What would make sense is to have a blacklist of usernames (like "admin" or "moderator"), then use the confusables map to see if a username or slug is visually confusable with a name from that blacklist.

I initially thought that must surely be what they are doing and they just worded it very, very poorly. But then of the 31 "disagreements" only one matters, the long s that's either f or s. All other disagreements map to visually similar symbols, like O and 0, which you should already treat as the same for this check.

Not to mention that this is mostly an issue for URL slugs, so after NFKC normalization. In HTML this is more robustly solved by styling conventions. Even old bb-style forums will display admin and moderator user names in a different color or in bold to show their status. The modern flourish is to put a little icon next to these kinds of names, which also scales well to other identifiers.
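
The reserved-name check described in this comment might look roughly like the sketch below. `SAMPLE_CONFUSABLES` is a tiny hand-picked table used purely for illustration; a real check would load the mappings from confusables.txt. The function names are hypothetical as well.

```python
RESERVED = {"admin", "moderator"}

# Illustrative subset only, NOT the real confusables.txt data.
SAMPLE_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "0": "o",       # digit zero, treated as the letter o for this check
    "1": "l",       # digit one, treated as the letter l for this check
}


def skeleton(name: str) -> str:
    """Lowercase the name and fold each character to its lookalike target."""
    return "".join(SAMPLE_CONFUSABLES.get(ch, ch) for ch in name.lower())


def is_reserved_lookalike(name: str) -> bool:
    """True if the name collapses to the same skeleton as a reserved name."""
    return skeleton(name) in {skeleton(reserved) for reserved in RESERVED}


print(is_reserved_lookalike("\u0430dmin"))       # True: Cyrillic a + "dmin"
print(is_reserved_lookalike("m0der\u0430tor"))   # True: zero and Cyrillic a
print(is_reserved_lookalike("alice"))            # False
```

Names are compared after folding on both sides, so an ordinary non-Latin username is only rejected when it actually collides with a reserved name.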

zahlman 2026-02-25 15:19 UTC link

I suppose: other users, if you store the first user's text and transmit it to another one.

lich_king 2026-02-25 15:24 UTC link

This is an inexplicable, AI-written article and the obvious answer is no. There's no performance or complexity overhead to not removing a couple of dead characters. There is a complexity overhead to forking off the list or adding pointless special cases to your code.

ElectricalUnion 2026-02-25 15:55 UTC link

For some sorts of "confusables", you don't even need Unicode. Depending on the cursed combination of font, kerning, rendering and display, `m` and `rn` are also very hard to distinguish.

nkrisc 2026-02-25 15:57 UTC link

https://en.wikipedia.org/wiki/Long_s

That’s not the case.

ZoneZealot 2026-02-25 16:26 UTC link

I think we're expecting too much from an LLM-generated article from a user that has been spending a lot of time spamming their content across multiple platforms and websites.

advisedwang 2026-02-25 17:33 UTC link

You should tell ChatGPT your theory, then maybe you'll find someone that thinks it's worthwhile.

bawolff 2026-02-25 19:41 UTC link

I feel like for search, NFKD and then remove all the combining characters would be a better bet than NFKC.

Of course there are also purpose specific algorithms for preparing text for search that would be even better.
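
A small sketch of the "NFKD, then drop combining characters" idea, using only the standard `unicodedata` module. The function name `search_key` and the final `casefold()` step are additions made here for illustration, not part of the comment.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a forgiving search key: NFKD, strip combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(search_key("Caf\u00e9") == "cafe")            # True
print(search_key("\ufb01anc\u00e9") == "fiance")    # True
print(search_key("\u00f8") == "o")                  # False
```

The last line shows the limit of this approach: U+00F8 (ø) has no decomposition, so it is not folded to `o`. Cases like that need a separate mapping or a search-tailored collation.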

bawolff 2026-02-25 19:46 UTC link

I think UCA using a collation tailored for search would be the closest to what you are looking for

paultendo 2026-02-25 19:55 UTC link

Thanks Josh - putting this article out there has pushed me to sharpen a lot of my thinking which hopefully should come across in my more recent work. I've updated the article to scope the NFKC recommendation to identifiers and added a note crediting your correction. Thanks for catching it.

paultendo 2026-02-25 20:01 UTC link

I agree that rejecting valid non-Latin characters in valid contexts is user-hostile, but I should be clearer about scope: this is specifically about machine-readable identifiers (slugs, handles, ENS names) where the character set is intentionally restricted, not display names or user-facing text.

The approach there should be what wongarsu describes below (imo), to style the UI so official accounts are visually distinct (badges, colour, etc.) rather than policing the character set.

namespace-guard is deliberately opinionated for the slug/handle case where you've already decided the output should be ASCII-safe. If your use case is broader than that, confusables detection without rejection is the right call.

chuckadams 2026-02-25 20:36 UTC link

> or be 800 pixels tall because it was built only out of decorative combining characters

Also known as Zalgo. But it seems most renderers nowadays overlay multiple combining marks over each other rather than stack them, which makes it look far less eldritch than it used to.

Editorial Channel

What the content says

Article 27 Cultural Participation: +0.40

Evidence: Medium (Framing, Practice) · Editorial: +0.40 · SETL: +0.14

Article 27 protects right to participation in cultural and scientific life and share in benefits of scientific progress. Content contributes to shared scientific/technical knowledge of Unicode standards by documenting divergences between two independent Unicode specifications. Author participates in Unicode Consortium review process (PRI #540) and submits findings for official consideration. Framing emphasizes collaborative knowledge-building and scientific transparency.

Article 19 Freedom of Expression: +0.30

Evidence: Medium (Framing, Practice) · Editorial: +0.30 · SETL: -0.13

Article 19 protects freedom of opinion and expression, including receipt of information. Content provides detailed technical documentation on Unicode standards, security mechanisms, and normalization algorithms. Author shares knowledge freely and transparently, including acknowledged corrections and updates. Framing emphasizes clarity and accessibility of complex technical information.

Article 26 Education: +0.25

Evidence: Medium (Framing) · Editorial: +0.25 · SETL: +0.22

Article 26 protects right to education and development of human personality. Content provides detailed technical education on Unicode standards, normalization, and security mechanisms. Framing emphasizes learning and understanding: clear explanations, practical examples, step-by-step walkthroughs, and generator scripts enable knowledge transfer and skill development.

Preamble

Low

Preamble concerns dignity, equality, and freedom from arbitrary discrimination. Content is technical documentation on Unicode normalization with no explicit engagement with dignity or discrimination frameworks.

Article 1 Freedom, Equality, Brotherhood

Article 1 establishes universal equality and dignity. Content is technical; does not engage with equality or dignity frameworks.

Article 2 Non-Discrimination

Article 2 prohibits discrimination. Content is neutral technical documentation with no discriminatory framing.

Article 3 Life, Liberty, Security

Article 3 concerns right to life, liberty, and security of person. Not engaged by technical content.

Article 4 No Slavery

Article 4 prohibits slavery and servitude. Not engaged.

Article 5 No Torture

Article 5 prohibits torture and cruel treatment. Not engaged.

Article 6 Legal Personhood

Article 6 concerns right to recognition as person before law. Not engaged.

Article 7 Equality Before Law

Article 7 establishes equal protection before law. Not engaged.

Article 8 Right to Remedy

Article 8 concerns right to remedy by competent authority. Not engaged.

Article 9 No Arbitrary Detention

Article 9 prohibits arbitrary arrest/detention. Not engaged.

Article 10 Fair Hearing

Article 10 concerns right to fair and public hearing. Not engaged.

Article 11 Presumption of Innocence

Article 11 addresses criminal justice and presumption of innocence. Not engaged.

Article 12 Privacy

Article 12 protects privacy, family, and correspondence. Not engaged.

Article 13 Freedom of Movement

Article 13 establishes freedom of movement. Not engaged.

Article 14 Asylum

Article 14 concerns right to seek asylum. Not engaged.

Article 15 Nationality

Article 15 addresses right to nationality. Not engaged.

Article 16 Marriage & Family

Article 16 protects right to marry and family. Not engaged.

Article 17 Property

Article 17 concerns property rights. Not engaged.

Article 18 Freedom of Thought

Article 18 protects freedom of thought, conscience, and religion. Not engaged.

Article 20 Assembly & Association

Article 20 addresses right to peaceful assembly and association. Not engaged.

Article 21 Political Participation

Article 21 concerns political participation and voting. Not engaged.

Article 22 Social Security

Article 22 addresses economic, social, and cultural rights. Not engaged.

Article 23 Work & Equal Pay

Article 23 protects right to work and just conditions. Not engaged.

Article 24 Rest & Leisure

Article 24 addresses right to rest and leisure. Not engaged.

Article 25 Standard of Living

Low Practice

Article 25 addresses right to adequate standard of living, including food, clothing, housing, and social services. Not directly engaged by technical content.

Article 28 Social & International Order

Article 28 concerns social and international order for rights realization. Not engaged.

Article 29 Duties to Community

Article 29 addresses community duties and limitations on rights. Not engaged.

Article 30 No Destruction of Rights

Article 30 prohibits destruction of stated rights. Not engaged.

Structural Channel

What the site does

Domain Context Profile

Element	Modifier	Affects	Note
Privacy	—		No privacy-invasive tracking detected; static GitHub Pages site.
Terms of Service	—		Not applicable for technical blog.
Accessibility	+0.05	Article 19 Article 25 Article 26	Theme toggle present (dark/light mode) supports accessibility. No explicit WCAG compliance statement visible. Content is text-heavy without apparent alt text for Unicode characters shown.
Mission	—		Personal technical blog; no organizational mission statement.
Editorial Code	—		No explicit editorial guidelines visible.
Ownership	—		Author 'paultendo' identifiable from domain; ownership clear.
Access Model	+0.08	Article 19 Article 27	Free, open-access blog content supports information access. No paywall or registration barrier observed.
Ad/Tracking	—		No advertising or tracking pixels detected in provided content.

Article 19 Freedom of Expression: +0.35

Evidence: Medium (Framing, Practice) · Structural: +0.35 · Context Modifier: +0.13 · SETL: -0.13

Blog operates as open-access platform with no paywalls, registration barriers, or content restrictions. Free and unrestricted distribution of technical knowledge supports information access. Dark/light mode accessibility feature enables broader reader access. Updated date stamps and correction notes demonstrate transparency about knowledge updates.

Article 27 Cultural Participation: +0.35

Evidence: Medium (Framing, Practice) · Structural: +0.35 · Context Modifier: +0.08 · SETL: +0.14

Free open-access distribution of technical knowledge enables broader participation in scientific understanding of Unicode. Open-source implementation (namespace-guard) provides benefit-sharing. Author credits contributors and acknowledges influence of community feedback.

Article 25 Standard of Living: +0.08

Evidence: Low (Practice) · Structural: +0.08 · Context Modifier: +0.05 · SETL: (none)

Free open-access technical resource with accessibility features (dark/light mode) supports equitable access to information that could improve systems design and security practices.

Preamble: +0.05

Evidence: Low · Structural: +0.05 · Context Modifier: 0.00 · SETL: (none)

Free open-access blog with dark/light mode toggle supports broad information access (Article 19 component). No paywalls or registration barriers.

Article 26 Education: +0.05

Evidence: Medium (Framing) · Structural: +0.05 · Context Modifier: +0.05 · SETL: +0.22

Blog structure with free access supports educational access. Accessibility features (theme toggle) enable broader participation. No evidence of paywalls or registration barriers to educational content.

Article 1 Freedom, Equality, Brotherhood

No structural signals related to equal treatment or dignity.

Article 2 Non-Discrimination

No structural signals related to discrimination.

Article 3 Life, Liberty, Security

No structural signals.

Article 4 No Slavery

No structural signals.

Article 5 No Torture

No structural signals.

Article 6 Legal Personhood

No structural signals.

Article 7 Equality Before Law

No structural signals.

Article 8 Right to Remedy

No structural signals.

Article 9 No Arbitrary Detention

No structural signals.

Article 10 Fair Hearing

No structural signals.

Article 11 Presumption of Innocence

No structural signals.

Article 12 Privacy

No structural signals.

Article 13 Freedom of Movement

No structural signals.

Article 14 Asylum

No structural signals.

Article 15 Nationality

No structural signals.

Article 16 Marriage & Family

No structural signals.

Article 17 Property

No structural signals.

Article 18 Freedom of Thought

No structural signals.

Article 20 Assembly & Association

No structural signals.

Article 21 Political Participation

No structural signals.

Article 22 Social Security

No structural signals.

Article 23 Work & Equal Pay

No structural signals.

Article 24 Rest & Leisure

No structural signals.

Article 28 Social & International Order

No structural signals.

Article 29 Duties to Community

No structural signals.

Article 30 No Destruction of Rights

No structural signals.

Supplementary Signals

Epistemic Quality

0.82 medium claims

Sources		0.8
Evidence		0.8
Uncertainty		0.8
Purpose		0.9

Propaganda Flags

0 techniques detected

Solution Orientation

0.81 solution oriented

Reader Agency

0.8

Emotional Tone

measured

Valence		+0.1
Arousal		0.3
Dominance		0.4

Stakeholder Voice

0.55 3 perspectives

Speaks: individuals, institution

About: institution, community

Temporal Framing

present short term

Geographic Scope

global

Complexity

expert high jargon domain specific

Transparency

0.75

✓ Author

Event Timeline 20 events

2026-02-26 06:04	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:04	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:04	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:04	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:03	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:03	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:03	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:02	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 06:00	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:59	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:59	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:58	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:57	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:57	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:56	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:55	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:55	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:55	credit_exhausted	Credit balance too low, retrying in 338s	- -
2026-02-26 05:55	dlq	Dead-lettered after 1 attempts: Confusables.txt and NFKC disagree on 31 characters	- -
2026-02-26 05:52	credit_exhausted	Credit balance too low, retrying in 311s	- -

build 59cf82e+tpso · deployed 2026-02-26 02:38 UTC · evaluated 2026-02-26 04:51:33 UTC