Summary Education & Information Access Acknowledges
Linguabase's 'Words with Spaces' page is a lexicographic educational resource that presents language information and translations freely without subscription barriers. The content advances Article 19 (free expression/information) and Article 26 (education) through open access to multilingual knowledge, supported by responsive design. However, undisclosed Google Analytics tracking without visible user consent creates a structural privacy concern under Article 12.
There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.”
I would hope that none of those examples were taking up space in a dictionary.
Collocation dictionaries are lists of collocations. They're absent from single-word dictionaries because there are about 25x more collocations than single words.
While 'this analysis would not have been possible without LLM', I am not sure the LLM analysis was well reviewed after it was done. From the obscure/familiar word list, some of the n-grams, e.g. "is resource", "seq size", "db xref", surely happen in the wild (as we well know), but I doubt we can argue they are missing from the dictionary. Knowing the realm, I would argue none of them are words, not even collocations. If "is resource" is, why not "has resource"?
So while the path is surely interesting, this analysis does miss scrutiny, which you would expect from a high-level LLM analysis.
A compound word isn't just a phrase. The latter is a group of words that indicate a single concept. The former is a new word that has a distinct meaning from the subwords that compose it. "I love you" is an example of a clausal phrase. The meaning is entirely evident from the words that compose it. In contrast, a "hot dog" is not a particularly warm canine, and has its own OED entry [0] as a compound word.
And some of the entries on this list are wrong. "Good night" exists in OED as "goodnight" [1] because there are multiple ways it's used. One is the clausal phrase "I hope you have a good night", which can be modified by changing the adjective, e.g. "great night" or "terrible night". "Goodnight" the bedtime ritual can't be modified the same way, so OED chooses to write it as a compound word without spaces.
We act as if some languages have "compound words" that can encompass entire sentences (subject & object attaching to the verb as prefixes or suffixes) while others don't form compounds, and most are somewhere in between. But these are all statements about lexicographic conventions and say nothing about the languages. In reality all languages are muddles sprawling across a multidimensional continuum, and they abso-frigging-lutely don't sit neatly in such pigeonholes.
> “Boiling water” isn’t “water that happens to be boiling.” It’s a hazard, a cooking stage, a state of matter
I guess we'll have to disagree then, because "boiling water" is "water that's boiling" to me. It's not a different state of matter to "water", that would be "steam". It being a hazard doesn't mean it's a singular concept, same as "wet floor"
As far as my limited knowledge of linguistics goes, the technical term is actually "collocations."
To me, any discussion of this topic that doesn't mention collocations signals an amateurish approach.
I also disagree with the premise that "this was not possible before LLM." That's nonsense. Linguists created many dictionaries of collocations for different languages, so that work is precisely what they did!
(Before any LLM zealots attack me, yes, it is now possible to have a more exhaustive list of collocations thanks to LLMs. This doesn't contradict my point.)
Surprised that no comment mentioned that there is a standard term (not a word :P) for the set of words that denominates a particular concept: nominal syntagm. Such as "boiling water" and also "that green parrot we saw yesterday over the left branch".
Also the slider examples are abysmal. "I love you", "Go home" and "How are you" are not words by any stretch of imagination. For someone who makes word games, I don't see a particularly deep love of words here.
The author of this article just hasn’t been taught how to use a dictionary. The words aren’t “missing”, they’re just indexed under one of their parts. For example “wait upon” would be located within the entry for “wait”.
> But roughly 15% are plausible: “wooden chair,” “morning coffee.” That’s still 30 billion sensible pairs.
(1) Who counted those? Whence those numbers?
(2) The examples are normal two-word phrases with one word modifying the other, often categorised as an adjective. The examples are counter-examples to the very claim made in that article.
(3) Using Clause to brainstorm s.t. is a weird thing to say...
(4) I would say the use of 'lexicalized' is wrong or at least uncommon. It usually refers to specialised semantics of something that could be interpreted generically, too. Like 'sleeping bag'. Or indeed 'cold feet'. Lexicalisation may involve deleting spaces, like 'hotdog'. And I am pretty sure lexicalised phrasal words are usually intensionally listed in dictionaries. And so 'ice' is not lexicalised 'frozen water', but it is not overtly a phrase but is a separate atomic word.
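A rough back-of-the-envelope check of the quoted figures, not the article's own method: if 15% of candidate pairs come to 30 billion, the full pool would be about 200 billion pairs. Assuming (hypothetically) that the pool is every ordered pair drawn from one word list, that implies a list of roughly 447,000 words. The function name and the pairing assumption are illustrative only.

```python
import math


def implied_totals(plausible_pairs: int, plausible_percent: int) -> tuple[int, int]:
    """Work backwards from a 'plausible' count and its percentage share.

    Assumption (not stated in the article): candidate pairs are all
    ordered pairs from a single word list, so pairs = vocab_size ** 2.
    """
    total_pairs = plausible_pairs * 100 // plausible_percent
    vocab_size = math.isqrt(total_pairs)  # integer square root, rounded down
    return total_pairs, vocab_size


# Figures quoted from the article: "roughly 15%" and "30 billion sensible pairs".
total, vocab = implied_totals(30_000_000_000, 15)
print(f"{total:,} candidate pairs, about {vocab:,} words")
```

Under that assumption the numbers are at least internally consistent, though this says nothing about how the 15% share itself was estimated.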
Off the top of my head, peanut butter, black hole, and amusement park are concepts that can't be easily intuited by just combining the two singular terms, but I also wouldn't consider them as phrases.
I'm currently reading Cormac McCarthy's Suttree (my first of his novels), just an exceptional polymath capable of painting complicated scenery with words dozenly scattered throughout paragraphs [0].
My favorite adjective he's coördinated is "burntwing", used to describe moths spiraling downwards after passing through candleflames. If I had crafted such a descriptive contraction, my former styling would've been "burnt-wing", had I even been capable of generating such concise imagery [1].
McCarthy's stylings have helped me to reduce hyphenations in my own writings — reducing their usage mainly to contractedwords which might be all-too-confusing without them.
[0] pg104 has ten words whose definitions I do not know, yet through context they work to advance the storyline of character racists (the book is set in the 1950s).
[1] decades ago, during college burnout, I was searching for the essence of "burntwing", reduced to writing a professor about "feeling like a burning airplane in tailspin." My trajectory back then was definitely burntwing.
In addition to what others have pointed out, many of these aren't actually missing from traditional dictionaries: they're just inflected differently. So your example lists phrases like "operating systems", "immune systems" and "solar systems" as missing from traditional dictionaries, but at least the online OED and M-W have "operating system", "immune system" and "solar system" in them. It's just that your script is apparently listing the plural as a separate phrase.
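The plural problem described above could be reduced by normalizing each phrase before the dictionary lookup. A minimal sketch, assuming a hypothetical set of headwords and a deliberately naive plural-stripping rule (a real pipeline would use a proper lemmatizer); none of these names come from the article's script:

```python
def singularize_head(phrase: str) -> str:
    """Naively strip a plural ending from the last word of a phrase.

    This is a crude heuristic for illustration: it mishandles words
    such as "bus" or "series" and all irregular plurals.
    """
    words = phrase.split()
    if not words:
        return phrase
    head = words[-1]
    if head.endswith("ies") and len(head) > 4:
        head = head[:-3] + "y"          # "galaxies" -> "galaxy"
    elif head.endswith("s") and not head.endswith("ss"):
        head = head[:-1]                # "systems" -> "system"
    return " ".join(words[:-1] + [head])


def missing_phrases(candidates: list[str], headwords: set[str]) -> list[str]:
    """Keep phrases found neither as written nor in singular form."""
    return [
        p for p in candidates
        if p not in headwords and singularize_head(p) not in headwords
    ]


# Hypothetical data: headwords the comment says the OED and M-W already list.
headwords = {"operating system", "immune system", "solar system"}
candidates = ["operating systems", "immune systems", "solar systems", "boiling water"]
print(missing_phrases(candidates, headwords))
```

With this check, the three plural phrases are no longer reported as missing, while "boiling water" still is.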
On languages other than English: in general, different languages do word division very differently. At least in German and Dutch, many of those phrasal verbs are separable, meaning that they are one word in the infinitive but are multiple words in the present tense. So for example, where in English you would say "I log in to the website", in Dutch it would be "Ik log in op de website". "Log in" is two words in both cases, but in Dutch it's the separated form of the single-word separable verb inloggen ("I must log in now" = "Ik moet nu inloggen"). The verb is indeed separable in that the two words often don't end up next to each other: "I log in quickly" = "Ik log snel in".
Dutch, like German, has lots of compounds. But there are also agglutinative languages, which have even more complex compound words, perhaps comprising a whole sentence in another language. Eg (from Wikipedia) Turkish "evlerinizdenmiş" = "(he/she/it) was (apparently/said to be) from your houses" or Plains Cree "paehtāwāēwesew" = "he is heard by higher powers"; and these aren't corner cases, that's how the language works.
One of the axes this analysis seems to be missing is the subtle spectrum from "multi-word expressions" to "idioms". Traditional lexicographers have long published separate idioms books, such as the Merriam-Webster New World American Idioms Handbook and the Oxford Dictionary of Idioms.
Wiktionary doesn't need to make that distinction between MWEs and Idioms and tends to conflate MWEs and Idioms as there is no separate "Wikidiom". Arguably, that multi-book confusion runs deep on the internet because Urban Dictionary should probably be fully titled the Urban Dictionary of Idioms and Slang.
It's not just page limits but also categorical limits and classic lexicographers would build multiple books/volumes, not just settle on one "dictionary". Classic scholars would often have a "reference shelf" with multiple dictionaries, books of idioms, thesauri, and more. The CD-ROM and then the internet has kind of tunnel visioned that this entire shelf can be merely "one app".
Dictionaries containing spaced compounds were not scalable with print media. The printed OED was encyclopedic in scale. Compound dictionaries are more than feasible now. Arguing whether a collection of commonly used words are expressions or concepts or even single "spaced words" is beside the point. Simply identify these differences and classify them in the compendium.
It appears to me that the author is trying too hard to make a point: "merry-go-round" is a single compound word that several dictionaries contain; "canned goods" is not commonly used[1] (more bureaucratic jargon), and people would just say "cans" (US) or "preserves" (UK); "household chores" is simply "chores", as the word is no longer commonly used outside the house context; "coffee break ritual" is not a concept in English-speaking countries so it would make no sense to have it in a dictionary, and so many of the examples are exactly that.
[1] I wonder how many here have ever been told something like "Prithee, husband, bring back a dozen canned goods from the market, for in the meanwhile I shall do my household chores".
The first two I kind of understand what the author means. But "help me" and "severe pain" made me think that I'm just not the right public for this text.
It's quite interesting that "boiling water" in many Slavic languages is actually a separate word (and not derived from "water", but from "boiling"; similar how the author mentions "ice" being used instead of "frozen water").
The very bottom of the slider is there to illustrate where LLM artifacts and Wiktionary noise live — it's not presented as legitimate vocabulary. The slider lets you see the full quality gradient, including where it breaks down.
This is a great comparison. We're arguing about the definition of "word", and attempting to expand it to include edge cases where two words with separate meanings have a different atomic meaning when combined.
We could have a similar debate about whether common suffixes and prefixes should be regarded as individual words.
Much like "planets" don't really exist as a separate natural object, words don't really exist in natural languages. They are artificial concepts, and therefore we will always have edge cases.
I would argue that it is still a useful discussion, as it sheds light on the nature of language (or of celestial bodies), even if the definitions defy the same rigour as mathematical concepts.
Your "to me" is actually problematic, because it legitimizes this nonsensical idea and turns words and their meaning into something purely individualistic, which cannot end well for the current, but even more so for the next generation.
I can confirm that "boiling water" definitively is "water that's boiling" and that two words, which are supposedly one word, definitely are not one word.
Yeah, if "boiling water" is one word, what about boiling sugar? Boiling milk? Boiling volcano? Boiling soup?
Adding two words together creates a new and different concept. The permutations necessary to represent every concept ever formed by combining two or more different words would be endless.
Some of them on the list, like black hole, do make sense. That's a very distinct thing. It's not a hole in the conventional sense and it's not really black. Boiling water, though, is water. And it's boiling.
AIUI, collocations are just "words that often go together". It doesn't signal any unconventional meaning to the construction, that would make it a proper idiom.
Boiling water is mostly the same as boiling anything. So I would just have "boiling". No need for "boiling water". I see no reason why boiling water could not be covered by whatever general entry covers boiling.
Added a note: "'I love you' isn't opaque, but it's tight enough to put on a tile." The familiar end of the spectrum picks up collocations that are transparent but loaded — I'm not claiming they're words in the traditional sense, but they're useful vocabulary for word games, which is where I'm coming from.
Ha — you're probably right that it would have been less controversial. But I kept it precisely because it's arguable. Added a parenthetical acknowledging the HN debate and framing it as on-the-fence by design
My bad. There's a little sidebar about it, but I put it lower after the chart because there wasn't room. You might still not find my logic on the 15% satisfying, but it's there.
> Traditional dictionaries skip almost all such phrases, because they contain spaces.
Yes, because they're phrases, not words. I don't even understand what's surprising about this. Sure, the entire article talks about how dictionaries contain _some_ phrases; but it's clear it's not many of them. Dictionaries are for words, not phrases.
"Peanut butter" would be dealt with by including a reference under the "butter" entry. Something like:
'N, culinary. A paste made of ground up nuts, sometimes with additional oils and other ingredients. E.g. "peanut butter", "almond butter".'
"Amusement park", same. Falls very much under the "place of recreation" definition of "park".
"Black hole" is maybe a bit different, because it's a scientific term - and certainly in a science dictionary would be included as a two-word item - but, for consistency, in a regular dictionary should be handled identically to the above, with a note on the word "hole".
While including noun phrases as singular entities in a word game is entirely appropriate, I don't think the OP has formed a rigorous definition of the concept that they are trying to describe. I agree with the other comment which suggests that they need some instruction / practice using a dictionary.
Oh, geeze. The progressive transparency effect on the words towards the "obscure" end of their spectrum made the later pages impossible for me to read.
I suspect the entire list was produced by an AI entity which had not been prompted to avoid giving offense. I predict a range of (tedious) opinions about whether a prohibition on that particular word is an appropriate inclusion in a system prompt.
That's also not a term I've - thankfully! - ever heard, so I've no idea if it's hallucinated. This is not an invitation, HN, to define or explain it to me.
Wait til you read Blood Meridian. The imagery he created with words, some of them his own creations, is just ... beyond compare. I'm reading The Road now, which comes from the same place. I can only read either in small doses. It's very intense, and the passages deserve to be read carefully.
Another contemporary writer who worked with new words in a very creative way was Gene Wolfe in The Book of the New Sun. Some were inventions using Greek, French, or Latin roots. Others were forgotten terms which he resurrected. Someone compiled a dictionary, Lexicon Urthus, which discusses the origins of certain terms and their placement within the series.
"Monkey wrench" is a word already found in the dictionary, so it wouldn't be a useful example. It already met the bar.
The article is questioning why some words don't meet the bar for inclusion in the dictionary. The word "boiling water" is one such word that it sees as being on the fence. The comments here demonstrate exactly why it is on the fence, but it remains unclear exactly what would be necessary for it to tip towards inclusion.
A nominal syntagm is a somewhat overlapping concept, but deviates slightly from the direct discussion taking place. The more appropriate standard term here is: open compound word. Or, as one might say casually: word.
Funnily enough, "nominal syntagm" is, itself, not in the OED or Wiktionary. But Wiktionary has "syntagme nominal" as the French translation for "noun phrase".
You really have to love the human messiness of language!
English has words with spaces. Boiling water isn't one of them, but in general, if you can't insert another adjective between an adjective-noun pair, it's linguistically a compound word that we happen to write with a space. "Fast food" is a good example. It's not simply an adjective-noun pair, as demonstrated by the fact that you sound like a crazy person if you try to insert literally anything between "fast" and "food" in "I eat too much fast food". The "fast food" can be modified all you like, as in "I eat too much lukewarm fast food", "I eat too much depressing fast food", but you can't treat "fast" as merely an adjective of "food", else "I eat too much fast, filling food" wouldn't strip the sentence of the implication I eat at McDonalds or whatever.
Editorial Channel
What the content says
+0.25
Article 26 (Education)
Medium Framing Practice
Editorial
+0.25
SETL
+0.11
Content directly addresses language education and lexicographic learning, supporting development of knowledge and skills. Multilingual content (translations shown) promotes understanding across language communities.
FW Ratio: 67%
Observable Facts
Content presents language information, translations, and lexicographic analysis.
Translation interface shows words translated across multiple languages, supporting multilingual education.
No subscription or registration required to access educational content.
Accessible design allows users with different technical capabilities to access material.
Inferences
Free access to multilingual lexicographic content supports equal educational opportunity regardless of economic status.
Multilingual presentation promotes cultural and linguistic understanding across communities.
+0.20
Article 19 (Freedom of Expression)
Medium Advocacy Framing
Editorial
+0.20
SETL
+0.10
Content disseminates lexicographic knowledge (language information, word definitions, translations) supporting informed communication and understanding across languages.
FW Ratio: 67%
Observable Facts
Page title indicates content about 'Words with Spaces,' a lexicographic topic.
No paywall, login requirement, or geographic restriction observed.
Responsive design with mobile breakpoints suggests accessibility across device types.
Content appears to disseminate linguistic and lexicographic information freely.
Inferences
Free access and responsive design facilitate information sharing and expression across diverse audiences.
Educational focus on language suggests support for informed communication.
ND  Preamble: Preamble principles (human dignity, equality, freedom) are not directly addressed in page content about lexicography.
ND  Article 1 (Freedom, Equality, Brotherhood): Content does not engage with equality and dignity themes.
ND  Article 2 (Non-Discrimination): No discussion of non-discrimination.
ND  Article 3 (Life, Liberty, Security): Life, liberty, and security not addressed in lexicographic content.
ND  Article 4 (No Slavery): Slavery and servitude not mentioned.
ND  Article 5 (No Torture): Torture and cruel treatment not addressed.
ND  Article 6 (Legal Personhood): Right to recognition as person not addressed.
ND  Article 7 (Equality Before Law): Equal protection before law not discussed.
ND  Article 8 (Right to Remedy): Remedy for violations not addressed.
ND  Article 9 (No Arbitrary Detention): Arbitrary arrest not addressed.
ND  Article 10 (Fair Hearing): Fair trial not addressed.
ND  Article 11 (Presumption of Innocence): Presumption of innocence not addressed.
ND
Article 12 (Privacy)
Medium Practice
Editorial content does not discuss privacy.
FW Ratio: 60%
Observable Facts
Page source contains gtag('config', 'G-H9BT5HPYQP') Google Analytics tracking code.
No visible cookie consent banner or privacy notice in provided page content.
No opt-in or opt-out mechanism for analytics tracking observed on page.
Inferences
Tracking code suggests user behavior data is collected without explicit user-facing disclosure, indicating reduced privacy protection.
Absence of visible consent mechanism suggests analytics may not comply with privacy preference standards.
ND  Article 13 (Freedom of Movement): Freedom of movement not addressed.
ND  Article 14 (Asylum): Asylum and refuge not addressed.
ND  Article 15 (Nationality): Nationality not addressed.
ND  Article 16 (Marriage & Family): Marriage and family not addressed.
ND  Article 17 (Property): Property rights not addressed.
ND  Article 18 (Freedom of Thought): Freedom of thought and conscience not addressed.
ND  Article 20 (Assembly & Association): Assembly and association not addressed.
ND  Article 21 (Political Participation): Participation in government not addressed.
ND  Article 22 (Social Security): Social security not addressed.
ND  Article 23 (Work & Equal Pay): Work and employment not addressed.
ND  Article 24 (Rest & Leisure): Rest and leisure not addressed.
ND
Article 25 (Standard of Living)
Low Practice
Health and well-being not directly addressed in lexicographic content.
FW Ratio: 67%
Observable Facts
Page CSS includes mobile-responsive breakpoints at 768px, 700px, 600px.
Semantic HTML structure with clearly labeled sections supports screen reader navigation.
Inferences
Responsive design facilitates access for users with visual or motor accessibility needs.
ND  Article 27 (Cultural Participation): Cultural participation not directly addressed.
ND  Article 28 (Social & International Order): Social order not addressed.
ND  Article 29 (Duties to Community): Duties not addressed.
ND  Article 30 (No Destruction of Rights): Limitation of rights not addressed.
Structural Channel
What the site does
Domain Context Profile
Element | Modifier | Affects | Note
Privacy | -0.15 | Article 12 | Google Analytics tracking (G-H9BT5HPYQP) implemented without explicit privacy policy visible in provided content. Data collection disclosed in code but no user consent mechanism observed.
Terms of Service | n/a | n/a | No Terms of Service or user agreement observed in provided content.
Accessibility | +0.10 | Article 25, Article 26 | Responsive design evident (mobile breakpoints at 768px, 700px, 600px); semantic HTML structure supports screen readers; alt text and ARIA attributes not visible in provided source.
Mission | +0.05 | Article 19 | Linguabase appears dedicated to lexicographic education and knowledge dissemination about language. No explicit mission statement in provided content, but structure suggests informational rather than commercial intent.
Editorial Code | n/a | n/a | No editorial code or policy observed.
Ownership | n/a | n/a | Ownership/authorship information not visible in provided content.
Access Model | +0.20 | Article 26 | No paywall, login requirement, or geofencing evident; content appears freely accessible to all users without registration or subscription.
Ad/Tracking | -0.10 | Article 12 | Google Analytics tracking present; no evidence of behavioral advertising pixels observed, but data collection for analytics purposes confirmed.
+0.20
Article 26 (Education)
Medium Framing Practice
Structural
+0.20
Context Modifier
+0.30
SETL
+0.11
Free access without registration supports equal educational access. Responsive design supports users with different device capabilities.
+0.15
Article 19 (Freedom of Expression)
Medium Advocacy Framing
Structural
+0.15
Context Modifier
+0.05
SETL
+0.10
Free access to information without login, paywall, or geofencing; responsive design; structured knowledge presentation supports accessibility of information.
+0.10
Article 25 (Standard of Living)
Low Practice
Structural
+0.10
Context Modifier
+0.10
SETL
ND
Responsive design and accessible layout support inclusive access to information for users with varying physical accessibility needs.
-0.25
Article 12 (Privacy)
Medium Practice
Structural
-0.25
Context Modifier
-0.25
SETL
ND
Google Analytics tracking (G-H9BT5HPYQP) implemented without explicit user consent mechanism visible on page. Data collection is undisclosed in user-facing interface.
ND  Preamble: No structural elements directly implement preamble commitments; site is informational resource without policy dimension.
ND  Article 1 (Freedom, Equality, Brotherhood): No observable structural signal regarding equal rights.
ND  Article 2 (Non-Discrimination): No discriminatory or anti-discriminatory design observed.
ND  Article 3 (Life, Liberty, Security): Not applicable to informational resource.
ND  Article 4 (No Slavery): Not applicable.
ND  Article 5 (No Torture): Not applicable.
ND  Article 6 (Legal Personhood): Not applicable.
ND  Article 7 (Equality Before Law): Not applicable.
ND  Article 8 (Right to Remedy): Not applicable.
ND  Article 9 (No Arbitrary Detention): Not applicable.
ND  Article 10 (Fair Hearing): Not applicable.
ND  Article 11 (Presumption of Innocence): Not applicable.
ND  Article 13 (Freedom of Movement): Not applicable.
ND  Article 14 (Asylum): Not applicable.
ND  Article 15 (Nationality): Not applicable.
ND  Article 16 (Marriage & Family): Not applicable.
ND  Article 17 (Property): Not applicable.
ND  Article 18 (Freedom of Thought): Not applicable.
ND  Article 20 (Assembly & Association): Not applicable.
ND  Article 21 (Political Participation): Not applicable.
ND  Article 22 (Social Security): Not applicable.
ND  Article 23 (Work & Equal Pay): Not applicable.
ND  Article 24 (Rest & Leisure): Not applicable.
ND  Article 27 (Cultural Participation): Not applicable.
ND  Article 28 (Social & International Order): Not applicable.
ND  Article 29 (Duties to Community): Not applicable.
ND  Article 30 (No Destruction of Rights): Not applicable.
Supplementary Signals
Epistemic Quality: 0.45 (medium claims)
Sources: 0.5
Evidence: 0.4
Uncertainty: 0.3
Purpose: 0.6
Propaganda Flags: 0 techniques detected
Solution Orientation: 0.64 (solution oriented)
Reader Agency: 0.6
Emotional Tone: measured
Valence: +0.3
Arousal: 0.2
Dominance: 0.4
Stakeholder Voice: 0.20 (1 perspective)
Speaks: institution
Temporal Framing: present, unspecified
Geographic Scope: global
Wiktionary, Wikipedia
Complexity: moderate, medium jargon, general
Transparency: 0.00
✗ Author
Event Timeline
20 events
2026-02-26 06:41
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:33
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:33
credit_exhausted
Credit balance too low, retrying in 244s
--
2026-02-26 06:33
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:31
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:31
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:28
credit_exhausted
Credit balance too low, retrying in 293s
--
2026-02-26 06:27
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:25
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:22
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:21
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:20
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:19
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:19
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:19
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:18
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:18
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:17
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:16
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries
--
2026-02-26 06:16
dlq
Dead-lettered after 1 attempts: Half million 'Words with Spaces' missing from dictionaries