Privacy: — No privacy policy or data handling practices observable in provided content.
Terms of Service: — No terms of service observable in provided content.
Accessibility: -0.10 (Articles 2, 25, 26). CSS present, but no ARIA labels, alt-text attributes, or accessibility statements observable in provided content. Heavy reliance on visual gradients and images without described alternatives limits accessibility for vision-impaired users.
Mission: +0.05 (Article 27, Preamble). Mission statement emphasizes democratization of AI through efficiency and cost reduction ('ubiquitous AI'), which aligns with broad access to the benefits of scientific progress. However, no explicit human rights framework is stated.
Editorial Code: — No editorial code of conduct or editorial standards observable.
Ownership: — No ownership structure or governance model disclosed in provided content.
Access Model: +0.15 (Articles 19, 27). Open beta access model with an application process. The early/beta release strategy promotes experimental access to the technology, and the open-source foundation (Llama 3.1) supports information-access principles.
Ad/Tracking: — No advertising or tracking mechanisms observable in provided content.
Edit: it seems this is likely one chip and not 10. I assumed an 8B 16-bit quant with 4K or more context, which made me think they must have chained multiple chips together, since an N6 850mm² chip would only yield 3GB of SRAM at most. Instead, they seem to have etched Llama 8B q3 with 1K context, which would indeed fit the chip size.
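A minimal sketch of the arithmetic behind that edit (my numbers, assuming weights dominate the footprint and ignoring the KV cache):

```python
# Does an 8B-parameter model fit in one die's SRAM? Back-of-envelope, not official specs.
params = 8e9

fp16_gb = params * 16 / 8 / 1e9  # 16-bit weights -> ~16 GB, far beyond one die's SRAM
q3_gb = params * 3 / 8 / 1e9     # 3-bit weights  -> ~3 GB, roughly the SRAM ceiling
                                 # of an ~850 mm^2 N6 die per the comment above
print(f"fp16: {fp16_gb:.0f} GB, q3: {q3_gb:.1f} GB")
```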
This requires 10 chips for an 8-billion-parameter q3 model, at 2.4 kW.
The model is etched onto the silicon chip, so nothing about it can be changed after the chip has been designed and manufactured.
Interesting design for niche applications.
What is a task that is extremely high value, requires only small-model intelligence, requires tremendous speed, is OK to run in the cloud due to power requirements, AND will be used for years without change, since the model is etched into silicon?
Reminds me of when Bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power-efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.
This would be killer for exploring simultaneous thinking paths and council-style decision making. Even with Qwen3-Coder-Next 80B, if you could achieve a 10x speedup, I'd buy one of these today. Can't wait to see whether this is still possible with models larger than 8B.
This is not a general-purpose chip but one specialized for high-speed, low-latency inference with small context. For those purposes, though, it is potentially a lot cheaper than Nvidia.
Tech summary:
- 15k tok/s on an 8B dense 3-bit quant (Llama 3.1)
- limited KV cache
- 880 mm² die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on the same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25 years of career experience across AMD, Nvidia, and others, with $200M in VC funding so far.
Certainly interesting for very low-latency applications that need < 10k tokens of context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor to Nvidia, but probably for 5-10% of the market.
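A quick sanity check on the power figures in the summary above (my arithmetic, using the thread's assumed numbers, not vendor data):

```python
# Energy per token implied by the figures above (all assumed):
power_w = 200        # "presumably 200W per chip"
tok_per_s = 15_000   # claimed per-user throughput on the 8B model
print(f"{power_w / tok_per_s * 1e3:.1f} mJ/token")  # ~13 mJ per token

# Cross-check against the earlier "10 chips ... 2.4kW" comment:
print(f"{2400 / 10:.0f} W per chip")  # 240 W, in the same ballpark
```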
Back of the napkin: the cost of 1 mm² of 6nm wafer is ~$0.20, and at 880 mm² for 8B parameters (~110 mm² per billion), 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
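The same estimate as a sketch (all figures are the comment's back-of-napkin numbers, ignoring yield losses):

```python
# Wafer cost per billion parameters, from the comment's figures:
cost_per_mm2 = 0.20          # ~$0.20 per mm^2 of 6nm wafer
die_mm2, params_b = 880, 8   # one 880 mm^2 die holds the 8B model
mm2_per_b = die_mm2 / params_b
print(f"{mm2_per_b:.0f} mm^2, ~${mm2_per_b * cost_per_mm2:.0f} of wafer per 1B params")
# -> 110 mm^2 and ~$22 of wafer per billion parameters, before yield losses
```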
I wonder if this makes the frontier labs abandon the SAAS per-token pricing concept for their newest models, and we'll be seeing non-open-but-on-chip-only models instead, sold by the chip and not by the token.
It could give a boost to the electron-microscopy analysis industry, as frontier-model creators could be interested in extracting their competitors' weights.
The high speed of model evolution has interesting consequences for how often batches and masks are cycled. We'll probably see some pressure on chip manufacturers to create masks more quickly, which could lead to faster hardware cycles, probably with some compromises: all of the utility circuitry around the chip would be static, and only the weights portion would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.
The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably produces incorrect answers, hallucinations, and poor reliability as a chatbot.
What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?
10B daily tokens, growing at an average of 22% per week.
There are plenty of times I turn to Groq for narrow-domain responses; these smaller models are fantastic for that, and there's often no need for something heavier. Getting response latency down means you can use LLM-assisted processing in a standard webpage load, not just in async processes. I'm really impressed by this, especially if this is its first showing.
I've never gotten incorrect answers faster than this, wow!
Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think the lower bound for useful intelligence is around 80B params (but what do I know). Best of luck!
The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular Ollama and in the chatbot. And quantizing to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!
This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.
Wow, I'm impressed. I didn't actually think we'd see models encoded on chips. Well, I knew some layer of it could be (some sort of instruction set and chip design), but this is pretty staggering. It opens the door to a lot of things: it basically destroys the boundaries of where software will go, though I also think we'll see some generic chips show up that hit this performance soon enough. But specialized chips with encoded models could be what ends up in specific places like cars, planes, and robots, where latency matters. Maybe I'm out of the loop; I'm sure others are doing this too, including Google.
> Jimmy: What is the largest planet in our solar system?
> me: the moon
> Jimmy: The answer to "What is the capital of France?" I was looking for was the city of Paris, but that's not the correct response to the original question of the capital of France. The question that got cut off was actually "What is the capital of France?", and the response "There are plenty of times I look to groq for narrow domain responses" wasn't the answer I was looking for.
It is certainly fast, but I think there might be some caching issues somewhere.
What's happening in the comment section? How come so many cannot understand that this is running Llama 3.1 8B? Why are people judging its accuracy? It's an almost two-year-old 8B-param model; why are people expecting Opus-level responses!?
The focus here should be on the custom hardware they are producing and its performance; that is what's impressive. Imagine putting GLM-5 on this, that'd be insane.
This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs; they are creating something called a dLLM, a diffusion-based LLM. Its speed is still impressive when I play around with it from time to time. But this, this is something else; it's almost unbelievable. As soon as I hit the enter key, the response appears. It feels instant.
I am also curious about Taalas pricing.
> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.
Do we have an idea of how much a unit / inference / api will cost?
Also, considering how fast people switch models to keep up with the pace, is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw out the current hardware and buy new? Shouldn't there be a more flexible way, maybe only swapping the chip on top, like how people upgrade CPUs? I don't know, just thinking out loud.
17k TPS is slow compared to other probabilistic models. It was possible to hit ~10-20 million TPS decades ago with n-gram and PDFA models, without custom silicon. A more informative KPI would be Pass@k on a downstream reasoning task - for many such benchmarks, increasing token throughput by several orders of magnitude does not even move the needle on sample efficiency.
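For scale, here's a toy illustration of why table-lookup models are so fast; this is a unigram sampler (the degenerate n-gram case), purely to show that non-neural sampling costs almost nothing per token:

```python
import time
import numpy as np

# Toy unigram sampler (degenerate n-gram): each "token" is one table lookup,
# so throughput is limited only by memory and RNG speed, with no neural compute.
rng = np.random.default_rng(0)
vocab, n_tokens = 50_000, 1_000_000
probs = rng.dirichlet(np.ones(vocab))  # an arbitrary fixed next-token distribution

t0 = time.perf_counter()
tokens = rng.choice(vocab, size=n_tokens, p=probs)  # batched sampling
dt = time.perf_counter() - t0
print(f"{n_tokens / dt / 1e6:.1f} M tokens/sec")
```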
If I could have one of these cards in my own computer, do you think it would be possible to replace Claude Code?
1. Assume it's running a better model, even a dedicated coding model: high-scoring, but obviously not Opus 4.5.
2. Instead of the standard send-receive paradigm, we set up a pipeline of agents, each of whom parses the output of the previous.
At 17k tok/s running locally, you could effectively spin up tasks like "you are an agent who adds semicolons to the end of each line in JavaScript"; with some sort of dedicated software in the style of Claude Code, you could load an array of 20 agents, each with a role to play in improving outputs. For example:
take user input and gather context from codebase
-> rewrite what you think the human asked you in the form of an LLM-optimized instructional prompt
-> examine the prompt for uncertainties and gaps in your understanding or ability to execute
-> <assume more steps as relevant>
-> execute the work
Could you effectively set up something configurable per developer: a folder of system prompts that every request loops through (see the sketch below)?
Do you really need the best model if you can pass your responses through a medium-tier model that engages in rapid self-improvement 30 times in a row before your Claude server has returned its first-shot response?
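A minimal sketch of that loop, assuming a local high-throughput endpoint; run_llm and PROMPT_DIR are hypothetical names, not any real API:

```python
from pathlib import Path

# Hypothetical folder of per-developer system prompts, applied in filename order,
# e.g. 01_rewrite_request.txt, 02_find_gaps.txt, ..., 99_execute.txt
PROMPT_DIR = Path("pipeline_prompts")

def run_llm(system_prompt: str, text: str) -> str:
    """Placeholder for a call to the local high-throughput model."""
    raise NotImplementedError  # wire this to whatever local inference API you have

def run_pipeline(user_request: str) -> str:
    # Each stage's output becomes the next stage's input; at ~17k tok/s,
    # even 20 sequential passes finish before a remote model's first response.
    text = user_request
    for prompt_file in sorted(PROMPT_DIR.glob("*.txt")):
        text = run_llm(prompt_file.read_text(), text)
    return text
```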
Bunch of negative sentiment in here, but I think this is pretty huge. There are quite a lot of applications where latency is a bigger requirement than having the latest model. Anywhere you'd want to turn something qualitative into something quantitative without making it painfully obvious to the user that you're running an LLM to do the transformation.
As an example, we've been experimenting with letting users search free-form text, using LLMs to turn it into a structured search fitting our setup. The response latency of any existing model simply kills this; it's too high for something where users are used to, at most, the delay of a network request plus very little.
There are plenty of other use cases like this.
So cool. What's underappreciated, IMO: 17k tokens/sec doesn't just change deployment economics; it changes what evaluation means. Static MMLU-style tests were designed around human-paced interaction. At this throughput you can run tens of thousands of adversarial agent interactions in the time a standard benchmark takes. Speed doesn't make static evals better; it makes them even more obviously inadequate.
There's an old idea of adaptive media. Imagine a video drama that's composed of a graph of clips, like an old "choose your own adventure" book ("Do you X? If yes, goto page 45"). With gaze tracking, one can "hmm, the viewer is more focused on character A than B... so we'll give clips and subplots with more A".
Now, when reading, the eye moves in little jumps called saccades. They last tens of ms, the eye is blind during them, and with high-quality tracking you know quite early just where that foveal peephole is going to land. So handwave a budget of a few ms for trajectory analysis, a few for 200 Hz rendering latency, and you still have 10-ish ms to play with. At 20k tok/s, that's 200 tok.
So perhaps one might JIT the next sentence, or the topic of the next paragraph, or the entire nature of the document, based on the user's attention. Imagine a universal document: you start reading, and you find the document is about whatever you wanted it to be about?
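Making that budget explicit (all numbers are the comment's handwaved estimates):

```python
# Token budget available during a single saccade, per the comment's numbers:
saccade_ms = 20      # saccades last tens of ms; take ~20 ms
trajectory_ms = 5    # gaze-trajectory analysis
render_ms = 5        # ~200 Hz rendering latency
budget_ms = saccade_ms - trajectory_ms - render_ms
tok_per_s = 20_000
print(f"{budget_ms} ms -> {tok_per_s * budget_ms // 1000} tokens")  # 10 ms -> 200 tokens
```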
“We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a SINGLE TRANSISTOR. So the density is basically insane. And this is not nuclear physics – it is fully digital. It is just a clever trick that we don’t want to broadcast. But once you hardwire everything, you get this opportunity to do stuff very differently than if you have to deal with changing things. The important thing is that we can put a weight and do the multiply associated with it all in one transistor. And you know the multipliers are kind of the big boy piece of the computer.”
One transistor doing 4-bit multiplication? A plausible way to get “4-bit weight plus multiply in one transistor” in a 6 nm FinFET mask-ROM fabric is to make the ROM cell a single device whose drive strength is the stored value. At tapeout you pick one of about 16 discrete strengths (for example by choosing fin count and possibly Vt), so that transistor itself encodes a 4-bit weight. Then you do the multiply in the charge/time domain by encoding the input activation as a discrete pulse width or pulse count and letting the cell source or sink a weight-proportional current onto a precharged bitline for that duration. The resulting bitline voltage change (or time-to-threshold) is proportional to current times time, so it behaves like weight times input and can be accumulated along a column before a simple comparator or time-to-digital readout. It’s “digital” in the sense that both weight and input are quantized, but it relies on device physics; the hard parts are keeping 16 levels separable across PVT, mismatch, and aging, plus managing bitline noise and coupling and ensuring the device stays in a predictable operating region.
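A toy behavioral model of the charge/time-domain multiply speculated above; this is entirely my assumption about how such a cell could work, not Taalas' actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.integers(0, 16, size=64)     # 4-bit weight = one of 16 drive strengths
activations = rng.integers(0, 8, size=64)  # input encoded as a pulse width (time units)

# Each cell sinks a weight-proportional current for the duration of its input pulse;
# the charge accumulated on a shared bitline is current x time summed over cells,
# i.e. exactly the dot product, read out once per column.
unit_current = 1.0
bitline_charge = np.sum(weights * unit_current * activations)

print(bitline_charge == np.dot(weights, activations))  # True: charge accumulation == MAC
```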
VLSI design produces digital outputs, but in the quantum silicon domain, it’s all about the analog…
Score Breakdown
Preamble: +0.16 (Medium Framing). Editorial +0.15, Structural +0.05, SETL +0.12, Combined ND, Context Modifier ND.
Editorial framing emphasizes human-AI collaboration as 'unprecedented amplifier of human ingenuity' and addresses global computational barriers (cost, latency). Structural access via beta program shows some inclusionary intent. However, no explicit human dignity, freedom, or equality language present. Modest positive lean toward technological democratization without explicit HR commitments.

Article 1, Freedom, Equality, Brotherhood: +0.03 (Low Framing). Editorial +0.05, Structural 0.00, SETL +0.05, Combined ND, Context Modifier ND.
Implicit reference to universal human benefit ('humanity introduced to computing') but no direct assertion of equal dignity, freedom, or reason. Regressing toward neutral due to absence of explicit equality language.

Article 2, Non-Discrimination: -0.16 (Medium Practice). Editorial 0.00, Structural -0.15, SETL +0.15, Combined ND, Context Modifier ND.
No observable discrimination language or anti-discrimination policy. Accessibility barriers (CSS-heavy, no ARIA) structurally disadvantage users with visual impairments. No mention of protected characteristics or non-discrimination commitments.

Article 3, Life, Liberty, Security: +0.08 (Low Framing). Editorial +0.10, Structural +0.05, SETL +0.07, Combined ND, Context Modifier ND.
Editorial emphasis on 'human ingenuity amplification' and enabling developers to 'explore' new applications hints at valuing human agency and life opportunity. Structural beta access model permits experimentation. No direct HR language.

Article 4, No Slavery: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing slavery, servitude, or forced labor. ND.

Article 5, No Torture: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing torture, cruel, inhuman, or degrading treatment. ND.

Article 6, Legal Personhood: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing right to legal personhood or recognition before law. ND.

Article 7, Equality Before Law: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing equal protection or discrimination before law. ND.

Article 8, Right to Remedy: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing right to effective judicial remedy. ND.

Article 9, No Arbitrary Detention: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing arrest or detention. ND.

Article 10, Fair Hearing: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing fair and public hearing. ND.

Article 11, Presumption of Innocence: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing criminal liability or innocence. ND.

Article 12, Privacy: -0.02 (Low Practice). Editorial 0.00, Structural -0.05, SETL +0.05, Combined ND, Context Modifier ND.
No privacy policy observable. Beta access requires application (registration), creating a data collection point without disclosed privacy safeguards. Mild negative structural signal due to lack of transparent data handling.

Article 13, Freedom of Movement: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing freedom of movement or residence. ND.

Article 14, Asylum: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing asylum or refuge. ND.

Article 15, Nationality: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing nationality. ND.

Article 16, Marriage & Family: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing marriage, family, or consent. ND.

Article 17, Property: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing property rights or deprivation. ND.

Article 18, Freedom of Thought: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing freedom of thought, conscience, or religion. ND.

Article 19, Freedom of Expression: +0.38 (Medium Framing Practice). Editorial +0.25, Structural +0.20, SETL +0.11, Combined ND, Context Modifier ND.
Editorial framing emphasizes 'open' development ('advance in the open'), early exposure of systems, and swift iteration. Structural commitment to beta access, public API service, and open-source foundation (Llama 3.1) supports information access. Access model enables developer expression and experimentation. Positive lean toward enabling expression and information flow, though no explicit free speech commitment.

Article 20, Assembly & Association: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing freedom of peaceful assembly or association. ND.

Article 21, Political Participation: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing political participation or democratic rights. ND.

Article 22, Social Security: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing social security, welfare, or social protection. ND.

Article 23, Work & Equal Pay: +0.08 (Low Framing). Editorial +0.10, Structural +0.05, SETL +0.07, Combined ND, Context Modifier ND.
Editorial references to 'small group of long-time collaborators' and team joining 'through demonstrated excellence...respect for established practices' hint at merit-based work principles. No explicit labor rights, fair wages, or working conditions language. Minimal positive signal.

Article 24, Rest & Leisure: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing rest, leisure, or work hours. ND.

Article 25, Standard of Living: -0.06 (Medium Framing Practice). Editorial +0.15, Structural -0.10, SETL +0.19, Combined ND, Context Modifier ND.
Editorial addresses enabling 'previously impractical applications' via AI efficiency, potentially supporting health, food, and social security advances. However, structural accessibility barriers (no alt text, ARIA, or accessibility statement) exclude vision-impaired users from accessing the platform. Modifier reflects the accessibility gap offsetting positive framing.

Article 26, Education: +0.31 (Medium Framing Practice). Editorial +0.20, Structural +0.10, SETL +0.14, Combined ND, Context Modifier ND.
Editorial emphasis on 'democratization' of AI through cost reduction and accessibility ('ubiquitous AI,' '20X less to build, 10X less power'). Framing positions technology as enabler of broader education and development benefits. Open-source Llama foundation and beta API support knowledge sharing. Access model permits broader participation in AI literacy. Moderate positive lean toward supporting right to education and participation in scientific progress.

Article 27, Cultural Participation: +0.36 (Medium Framing Practice). Editorial +0.25, Structural +0.15, SETL +0.16, Combined ND, Context Modifier ND.
Editorial emphasizes 'step-function gains' enabling participation in scientific and technological advancement. Mission to make AI 'ubiquitous' and affordable aligns with broadening access to benefits of scientific progress. Open-source components, public inference service, and early release strategy support community benefit. Positive lean toward enabling broader society's participation in technology benefits.

Article 28, Social & International Order: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing social and international order framework or HR-protective institutional structures. ND.

Article 29, Duties to Community: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing duties to community or limitations on rights. ND.

Article 30, No Destruction of Rights: 0.00 (null). Editorial 0.00, Structural 0.00, SETL ND, Combined ND, Context Modifier ND. No observable content addressing prohibition of HR destruction or abuse. ND.