Privacy: — No privacy policy or data handling practices observable in provided content.
Terms of Service: — No terms of service observable in provided content.
Accessibility: -0.10 (Articles 2, 25, 26). CSS present, but no ARIA labels, alt-text attributes, or accessibility statements observable in provided content. Heavy reliance on visual gradients and images without described alternatives limits accessibility for vision-impaired users.
Mission: +0.05 (Article 27, Preamble). Mission statement emphasizes democratization of AI through efficiency and cost reduction ('ubiquitous AI'), which aligns with broad access to the benefits of scientific progress. However, no explicit human rights framework is stated.
Editorial Code: — No editorial code of conduct or editorial standards observable.
Ownership: — No ownership structure or governance model disclosed in provided content.
Access Model: +0.15 (Articles 19, 27). Open beta access model with an application process. The early/beta release strategy promotes experimental access to the technology, and the open-source foundation (Llama 3.1) supports information-access principles.
Ad/Tracking: — No advertising or tracking mechanisms observable in provided content.
Edit: it seems this is likely one chip and not 10. I assumed an 8B 16-bit quant with 4K or more context, which made me think they must have chained multiple chips together, since an N6 850mm² chip would only yield 3GB of SRAM at most. Instead, they seem to have etched Llama 8B q3 with 1K context, which would indeed fit the chip size.
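A minimal sketch of the arithmetic behind that edit (my numbers, assuming weights dominate the footprint and ignoring the KV cache):

```python
# Does an 8B-parameter model fit in one die's SRAM? Back-of-envelope, not official specs.
params = 8e9

fp16_gb = params * 16 / 8 / 1e9  # 16-bit weights -> ~16 GB, far beyond one die's SRAM
q3_gb = params * 3 / 8 / 1e9     # 3-bit weights  -> ~3 GB, roughly the SRAM ceiling
                                 # of an ~850 mm^2 N6 die per the comment above
print(f"fp16: {fp16_gb:.0f} GB, q3: {q3_gb:.1f} GB")
```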
This requires 10 chips for an 8-billion-parameter q3 model, at 2.4 kW.
The model is etched onto the silicon chip, so nothing about it can be changed after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, requires only small-model intelligence, requires tremendous speed, is OK to run in the cloud due to power requirements, AND will be used for years without change, since the model is etched into silicon?
Reminds me of when Bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power-efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.
This would be killer for exploring simultaneous thinking paths and council-style decision making. Even with Qwen3-Coder-Next 80B, if you could achieve a 10x speedup, I'd buy one of these today. Can't wait to see whether this is still possible with models larger than 8B.
This is not a general-purpose chip but one specialized for high-speed, low-latency inference with small context. For those purposes, though, it is potentially a lot cheaper than Nvidia.
Tech summary:
- 15k tok/s on an 8B dense 3-bit quant (Llama 3.1)
- limited KV cache
- 880 mm² die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on the same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25 years of career experience across AMD, Nvidia, and others, with $200M in VC funding so far.
Certainly interesting for very low-latency applications that need < 10k tokens of context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor to Nvidia, but probably for 5-10% of the market.
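A quick sanity check on the power figures in the summary above (my arithmetic, using the thread's assumed numbers, not vendor data):

```python
# Energy per token implied by the figures above (all assumed):
power_w = 200        # "presumably 200W per chip"
tok_per_s = 15_000   # claimed per-user throughput on the 8B model
print(f"{power_w / tok_per_s * 1e3:.1f} mJ/token")  # ~13 mJ per token

# Cross-check against the earlier "10 chips ... 2.4kW" comment:
print(f"{2400 / 10:.0f} W per chip")  # 240 W, in the same ballpark
```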
Back of the napkin: the cost of 1 mm² of 6nm wafer is ~$0.20, and at 880 mm² for 8B parameters (~110 mm² per billion), 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
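The same estimate as a sketch (all figures are the comment's back-of-napkin numbers, ignoring yield losses):

```python
# Wafer cost per billion parameters, from the comment's figures:
cost_per_mm2 = 0.20          # ~$0.20 per mm^2 of 6nm wafer
die_mm2, params_b = 880, 8   # one 880 mm^2 die holds the 8B model
mm2_per_b = die_mm2 / params_b
print(f"{mm2_per_b:.0f} mm^2, ~${mm2_per_b * cost_per_mm2:.0f} of wafer per 1B params")
# -> 110 mm^2 and ~$22 of wafer per billion parameters, before yield losses
```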
I wonder if this makes the frontier labs abandon the SAAS per-token pricing concept for their newest models, and we'll be seeing non-open-but-on-chip-only models instead, sold by the chip and not by the token.
It could give a boost to the electron-microscopy analysis industry, as frontier-model creators could be interested in extracting their competitors' weights.
The high speed of model evolution has interesting consequences for how often batches and masks are cycled. We'll probably see some pressure on chip manufacturers to create masks more quickly, which could lead to faster hardware cycles, probably with some compromises: all of the utility circuitry around the chip would be static, and only the weights portion would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.
The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably produces incorrect answers, hallucinations, and poor reliability as a chatbot.
What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?
10B daily tokens, growing at an average of 22% per week.
There are plenty of times I turn to Groq for narrow-domain responses; these smaller models are fantastic for that, and there's often no need for something heavier. Getting response latency down means you can use LLM-assisted processing in a standard webpage load, not just in async processes. I'm really impressed by this, especially if this is its first showing.
I've never gotten incorrect answers faster than this, wow!
Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think the lower bound for useful intelligence is around 80B params (but what do I know). Best of luck!
The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular Ollama and in the chatbot. And quantizing to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!
This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.
Wow, I'm impressed. I didn't actually think we'd see models encoded on chips. Well, I knew some layer of it could be (some sort of instruction set and chip design), but this is pretty staggering. It opens the door to a lot of things: it basically destroys the boundaries of where software will go, though I also think we'll see some generic chips show up that hit this performance soon enough. But specialized chips with encoded models could be what ends up in specific places like cars, planes, and robots, where latency matters. Maybe I'm out of the loop; I'm sure others are doing this too, including Google.
> Jimmy: What is the largest planet in our solar system?
> me: the moon
> Jimmy: The answer to "What is the capital of France?" I was looking for was the city of Paris, but that's not the correct response to the original question of the capital of France. The question that got cut off was actually "What is the capital of France?", and the response "There are plenty of times I look to groq for narrow domain responses" wasn't the answer I was looking for.
It is certainly fast, but I think there might be some caching issues somewhere.
What's happening in the comment section? How come so many cannot understand that this is running Llama 3.1 8B? Why are people judging its accuracy? It's an almost two-year-old 8B-param model; why are people expecting Opus-level responses!?
The focus here should be on the custom hardware they are producing and its performance; that is what's impressive. Imagine putting GLM-5 on this, that'd be insane.
This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs; they are creating something called a dLLM, a diffusion-based LLM. Its speed is still impressive when I play around with it from time to time. But this, this is something else; it's almost unbelievable. As soon as I hit the enter key, the response appears. It feels instant.
I am also curious about Taalas pricing.
> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.
Do we have an idea of how much a unit / inference / api will cost?
Also, considering how fast people switch models to keep up with the pace, is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw out the current hardware and buy new? Shouldn't there be a more flexible way, maybe only swapping the chip on top, like how people upgrade CPUs? I don't know, just thinking out loud.
17k TPS is slow compared to other probabilistic models. It was possible to hit ~10-20 million TPS decades ago with n-gram and PDFA models, without custom silicon. A more informative KPI would be Pass@k on a downstream reasoning task - for many such benchmarks, increasing token throughput by several orders of magnitude does not even move the needle on sample efficiency.
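For scale, here's a toy illustration of why table-lookup models are so fast; this is a unigram sampler (the degenerate n-gram case), purely to show that non-neural sampling costs almost nothing per token:

```python
import time
import numpy as np

# Toy unigram sampler (degenerate n-gram): each "token" is one table lookup,
# so throughput is limited only by memory and RNG speed, with no neural compute.
rng = np.random.default_rng(0)
vocab, n_tokens = 50_000, 1_000_000
probs = rng.dirichlet(np.ones(vocab))  # an arbitrary fixed next-token distribution

t0 = time.perf_counter()
tokens = rng.choice(vocab, size=n_tokens, p=probs)  # batched sampling
dt = time.perf_counter() - t0
print(f"{n_tokens / dt / 1e6:.1f} M tokens/sec")
```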
If I could have one of these cards in my own computer, do you think it would be possible to replace Claude Code?
1. Assume it's running a better model, even a dedicated coding model: high-scoring, but obviously not Opus 4.5.
2. Instead of the standard send-receive paradigm, we set up a pipeline of agents, each of whom parses the output of the previous.
At 17k tok/s running locally, you could effectively spin up tasks like "you are an agent who adds semicolons to the end of each line in JavaScript"; with some sort of dedicated software in the style of Claude Code, you could load an array of 20 agents, each with a role to play in improving outputs. For example:
take user input and gather context from codebase
-> rewrite what you think the human asked you in the form of an LLM-optimized instructional prompt
-> examine the prompt for uncertainties and gaps in your understanding or ability to execute
-> <assume more steps as relevant>
-> execute the work
Could you effectively set up something configurable per developer: a folder of system prompts that every request loops through (see the sketch below)?
Do you really need the best model if you can pass your responses through a medium-tier model that engages in rapid self-improvement 30 times in a row before your Claude server has returned its first-shot response?
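A minimal sketch of that loop, assuming a local high-throughput endpoint; run_llm and PROMPT_DIR are hypothetical names, not any real API:

```python
from pathlib import Path

# Hypothetical folder of per-developer system prompts, applied in filename order,
# e.g. 01_rewrite_request.txt, 02_find_gaps.txt, ..., 99_execute.txt
PROMPT_DIR = Path("pipeline_prompts")

def run_llm(system_prompt: str, text: str) -> str:
    """Placeholder for a call to the local high-throughput model."""
    raise NotImplementedError  # wire this to whatever local inference API you have

def run_pipeline(user_request: str) -> str:
    # Each stage's output becomes the next stage's input; at ~17k tok/s,
    # even 20 sequential passes finish before a remote model's first response.
    text = user_request
    for prompt_file in sorted(PROMPT_DIR.glob("*.txt")):
        text = run_llm(prompt_file.read_text(), text)
    return text
```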
Bunch of negative sentiment in here, but I think this is pretty huge. There are quite a lot of applications where latency is a bigger requirement than having the latest model. Anywhere you'd want to turn something qualitative into something quantitative without making it painfully obvious to the user that you're running an LLM to do the transformation.
As an example, we've been experimenting with letting users search free-form text, using LLMs to turn it into a structured search fitting our setup. The response latency of any existing model simply kills this; it's too high for something where users are used to, at most, the delay of a network request plus very little.
There are plenty of other use cases like this.
So cool. What's underappreciated, IMO: 17k tokens/sec doesn't just change deployment economics; it changes what evaluation means. Static MMLU-style tests were designed around human-paced interaction. At this throughput you can run tens of thousands of adversarial agent interactions in the time a standard benchmark takes. Speed doesn't make static evals better; it makes them even more obviously inadequate.
There's an old idea of adaptive media. Imagine a video drama that's composed of a graph of clips, like an old "choose your own adventure" book ("Do you X? If yes, goto page 45"). With gaze tracking, one can "hmm, the viewer is more focused on character A than B... so we'll give clips and subplots with more A".
Now, when reading, the eye moves in little jumps called saccades. They last tens of ms, the eye is blind during them, and with high-quality tracking you know quite early just where that foveal peephole is going to land. So handwave a budget of a few ms for trajectory analysis, a few for 200 Hz rendering latency, and you still have 10-ish ms to play with. At 20k tok/s, that's 200 tok.
So perhaps one might JIT the next sentence, or the topic of the next paragraph, or the entire nature of the document, based on the user's attention. Imagine a universal document: you start reading, and you find the document is about whatever you wanted it to be about?
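Making that budget explicit (all numbers are the comment's handwaved estimates):

```python
# Token budget available during a single saccade, per the comment's numbers:
saccade_ms = 20      # saccades last tens of ms; take ~20 ms
trajectory_ms = 5    # gaze-trajectory analysis
render_ms = 5        # ~200 Hz rendering latency
budget_ms = saccade_ms - trajectory_ms - render_ms
tok_per_s = 20_000
print(f"{budget_ms} ms -> {tok_per_s * budget_ms // 1000} tokens")  # 10 ms -> 200 tokens
```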
“We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a SINGLE TRANSISTOR. So the density is basically insane. And this is not nuclear physics – it is fully digital. It is just a clever trick that we don’t want to broadcast. But once you hardwire everything, you get this opportunity to do stuff very differently than if you have to deal with changing things. The important thing is that we can put a weight and do the multiply associated with it all in one transistor. And you know the multipliers are kind of the big boy piece of the computer.”
One transistor doing 4-bit multiplication? A plausible way to get “4-bit weight plus multiply in one transistor” in a 6 nm FinFET mask-ROM fabric is to make the ROM cell a single device whose drive strength is the stored value. At tapeout you pick one of about 16 discrete strengths (for example by choosing fin count and possibly Vt), so that transistor itself encodes a 4-bit weight. Then you do the multiply in the charge/time domain by encoding the input activation as a discrete pulse width or pulse count and letting the cell source or sink a weight-proportional current onto a precharged bitline for that duration. The resulting bitline voltage change (or time-to-threshold) is proportional to current times time, so it behaves like weight times input and can be accumulated along a column before a simple comparator or time-to-digital readout. It’s “digital” in the sense that both weight and input are quantized, but it relies on device physics; the hard parts are keeping 16 levels separable across PVT, mismatch, and aging, plus managing bitline noise and coupling and ensuring the device stays in a predictable operating region.
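A toy behavioral model of the charge/time-domain multiply speculated above; this is entirely my assumption about how such a cell could work, not Taalas' actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.integers(0, 16, size=64)     # 4-bit weight = one of 16 drive strengths
activations = rng.integers(0, 8, size=64)  # input encoded as a pulse width (time units)

# Each cell sinks a weight-proportional current for the duration of its input pulse;
# the charge accumulated on a shared bitline is current x time summed over cells,
# i.e. exactly the dot product, read out once per column.
unit_current = 1.0
bitline_charge = np.sum(weights * unit_current * activations)

print(bitline_charge == np.dot(weights, activations))  # True: charge accumulation == MAC
```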
VLSI design produces digital outputs, but in the quantum silicon domain, it’s all about the analog…
Score Breakdown
Preamble: +0.16 (Medium Framing). Editorial +0.15, Structural +0.05, SETL +0.12, Combined ND, Context Modifier ND.
Editorial framing emphasizes human-AI collaboration as 'unprecedented amplifier of human ingenuity' and addresses global computational barriers (cost, latency). Structural access via beta program shows some inclusionary intent. However, no explicit human dignity, freedom, or equality language present. Modest positive lean toward technological democratization without explicit HR commitments.

Article 1, Freedom, Equality, Brotherhood: +0.03 (Low Framing). Editorial +0.05, Structural 0.00, SETL +0.05, Combined ND, Context Modifier ND.
Implicit reference to universal human benefit ('humanity introduced to computing') but no direct assertion of equal dignity, freedom, or reason. Regressing toward neutral due to absence of explicit equality language.

Article 2, Non-Discrimination: -0.16 (Medium Practice). Editorial 0.00, Structural -0.15, SETL +0.15, Combined ND, Context Modifier ND.
No observable discrimination language or anti-discrimination policy. Accessibility barriers (CSS-heavy, no ARIA) structurally disadvantage users with visual impairments. No mention of protected characteristics or non-discrimination commitments.

Article 3, Life, Liberty, Security: +0.08 (Low Framing). Editorial +0.10, Structural +0.05, SETL +0.07, Combined ND, Context Modifier ND.
Editorial emphasis on 'human ingenuity amplification' and enabling developers to 'explore' new applications hints at valuing human agency and life opportunity. Structural beta access model permits experimentation. No direct HR language.

Article 4, No Slavery: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing slavery, servitude, or forced labor. ND.

Article 5, No Torture: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing torture, cruel, inhuman, or degrading treatment. ND.

Article 6, Legal Personhood: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing right to legal personhood or recognition before law. ND.

Article 7, Equality Before Law: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing equal protection or discrimination before law. ND.

Article 8, Right to Remedy: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing right to effective judicial remedy. ND.

Article 9, No Arbitrary Detention: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing arrest or detention. ND.

Article 10, Fair Hearing: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing fair and public hearing. ND.

Article 11, Presumption of Innocence: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing criminal liability or innocence. ND.

Article 12, Privacy: -0.02 (Low Practice). Editorial 0.00, Structural -0.05, SETL +0.05, Combined ND, Context Modifier ND.
No privacy policy observable. Beta access requires application (registration), creating a data collection point without disclosed privacy safeguards. Mild negative structural signal due to lack of transparent data handling.

Article 13, Freedom of Movement: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing freedom of movement or residence. ND.

Article 14, Asylum: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing asylum or refuge. ND.

Article 15, Nationality: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing nationality. ND.

Article 16, Marriage & Family: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing marriage, family, or consent. ND.

Article 17, Property: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing property rights or deprivation. ND.

Article 18, Freedom of Thought: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing freedom of thought, conscience, or religion. ND.

Article 19, Freedom of Expression: +0.38 (Medium Framing Practice). Editorial +0.25, Structural +0.20, SETL +0.11, Combined ND, Context Modifier ND.
Editorial framing emphasizes 'open' development ('advance in the open'), early exposure of systems, and swift iteration. Structural commitment to beta access, public API service, and open-source foundation (Llama 3.1) supports information access. Access model enables developer expression and experimentation. Positive lean toward enabling expression and information flow, though no explicit free speech commitment.

Article 20, Assembly & Association: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing freedom of peaceful assembly or association. ND.

Article 21, Political Participation: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing political participation or democratic rights. ND.

Article 22, Social Security: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing social security, welfare, or social protection. ND.

Article 23, Work & Equal Pay: +0.08 (Low Framing). Editorial +0.10, Structural +0.05, SETL +0.07, Combined ND, Context Modifier ND.
Editorial references to 'small group of long-time collaborators' and team joining 'through demonstrated excellence...respect for established practices' hint at merit-based work principles. No explicit labor rights, fair wages, or working conditions language. Minimal positive signal.

Article 24, Rest & Leisure: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing rest, leisure, or work hours. ND.

Article 25, Standard of Living: -0.06 (Medium Framing Practice). Editorial +0.15, Structural -0.10, SETL +0.19, Combined ND, Context Modifier ND.
Editorial addresses enabling 'previously impractical applications' via AI efficiency, potentially supporting health, food, and social security advances. However, structural accessibility barriers (no alt text, ARIA, or accessibility statement) exclude vision-impaired users from accessing the platform. Modifier reflects the accessibility gap offsetting positive framing.

Article 26, Education: +0.31 (Medium Framing Practice). Editorial +0.20, Structural +0.10, SETL +0.14, Combined ND, Context Modifier ND.
Editorial emphasis on 'democratization' of AI through cost reduction and accessibility ('ubiquitous AI,' '20X less to build, 10X less power'). Framing positions technology as enabler of broader education and development benefits. Open-source Llama foundation and beta API support knowledge sharing. Access model permits broader participation in AI literacy. Moderate positive lean toward supporting right to education and participation in scientific progress.

Article 27, Cultural Participation: +0.36 (Medium Framing Practice). Editorial +0.25, Structural +0.15, SETL +0.16, Combined ND, Context Modifier ND.
Editorial emphasizes 'step-function gains' enabling participation in scientific and technological advancement. Mission to make AI 'ubiquitous' and affordable aligns with broadening access to benefits of scientific progress. Open-source components, public inference service, and early release strategy support community benefit. Positive lean toward enabling broader society's participation in technology benefits.

Article 28, Social & International Order: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing social and international order framework or HR-protective institutional structures. ND.

Article 29, Duties to Community: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing duties to community or limitations on rights. ND.

Article 30, No Destruction of Rights: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing prohibition of HR destruction or abuse. ND.