The startup Taalas is launching a demonstration chip that outperforms all existing AI accelerators in speed by a massive margin. This is possible thanks to a new, more affordable approach in which the AI model is hardcoded directly into the hardware.
Canadian startup Taalas is drawing attention with the launch of the Taalas HC1 demonstration chip. This AI accelerator for inference plays in a completely different league than the fastest AI chips on the market today.
Speed record
The HC1 generates just under 17,000 tokens per second (16,960). The Nvidia B200 manages 594 tokens per second. The fastest AI chip available until now, from Cerebras, is still more than eight times slower at 1,981 tokens per second.
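Putting Taalas's own numbers side by side, a quick back-of-the-envelope calculation (all figures as reported above) gives the speedup ratios:

```python
# Speedup ratios computed from the throughput figures reported above
# (all numbers as claimed by Taalas).
taalas_hc1 = 16_960   # tokens per second
nvidia_b200 = 594     # tokens per second
cerebras = 1_981      # tokens per second

vs_b200 = taalas_hc1 / nvidia_b200
vs_cerebras = taalas_hc1 / cerebras

print(f"HC1 vs B200:     {vs_b200:.1f}x")      # ~28.6x
print(f"HC1 vs Cerebras: {vs_cerebras:.1f}x")  # ~8.6x
```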

Note: these figures come from Taalas itself. They roughly align with what the competition indicates, although there are scenarios and models where the Nvidia B200, for example, performs faster. However, the orders of magnitude remain correct. Furthermore, the Taalas HC1 is significantly cheaper. The chip, featuring 53 billion transistors, is manufactured on the slightly older TSMC 6 nm process and is not packed with expensive HBM memory like other accelerators.
LLM as hardware vs. software
This is possible because Taalas has chosen a radically different approach. You cannot load your own LLMs onto the HC1 accelerator. Instead, Taalas engineers have hardcoded an LLM into the chip’s hardware. In this demonstration model, it is the slightly older Llama 3.1 8B model.
Taalas notes that AI models today are software and, as such, require a vast amount of computing power. This observation isn’t unfounded: increasingly powerful AI data centers from major AI companies are straining the available power supply and have placed extreme pressure on the RAM and SSD supply chain.
Furthermore, investments in such data centers run into the tens of billions and beyond, putting pressure on the business model for AI consumption. A realistic path to short-term returns on investment remains elusive.
The Canadians argue that things must change. To truly change the world with AI, efficiency must increase by a factor of 1,000. According to Taalas, this cannot be achieved by running LLMs as software on general systems. The model should not be simulated on a traditional computer but should become the computer itself.
No HBM required
The Taalas HC1 is a demonstration chip of that idea. Llama 3.1 8B is not a very large model, which made it the preferred choice for the first iteration of the concept. The model is hardcoded into the accelerator at the hardware level, eliminating the bridge between chip and memory as a bottleneck.

As a result, the Taalas HC1 does not need HBM memory to load the LLM, nor does it require 3D stacking or special I/O. Efficiency increases by a factor of 10 and the chip’s heat generation decreases, meaning liquid cooling is no longer a necessity.
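The memory-bandwidth argument above can be made concrete with a rough roofline estimate. The parameter count comes from the article (Llama 3.1 8B); the bytes-per-weight and bandwidth figures are our own ballpark assumptions, not Taalas's numbers:

```python
# Rough roofline estimate of why memory bandwidth caps the token rate
# when a model is served from external memory. Parameter count is from
# the article; bytes per weight and HBM bandwidth are assumptions.
params = 8e9            # weights in Llama 3.1 8B
bytes_per_weight = 2    # assuming FP16/BF16 storage
model_bytes = params * bytes_per_weight  # 16 GB read per generated token

hbm_bandwidth = 8e12    # ~8 TB/s, ballpark for a top HBM3e accelerator

# At batch size 1, every output token must stream all weights from
# memory, so bandwidth alone bounds the token rate:
max_tokens_per_second = hbm_bandwidth / model_bytes
print(f"Bandwidth-bound ceiling: {max_tokens_per_second:.0f} tokens/s")  # 500
```

Under these assumptions the ceiling lands in the same range as the conventional accelerators cited earlier, which illustrates why keeping the weights on the chip itself removes the dominant bottleneck.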
Taalas spent 2.5 years working on the chip with a team of 24 people and a budget of barely 30 million dollars. The company claims it can convert a new model into hardware on a custom chip in two months. To this end, it developed the Taalas Foundry. The production cost of the chips is said to be 20 times lower than that of traditional AI accelerators.
Answers in a heartbeat
In addition to lower costs, both operational and in production, speed is a massive advantage of the Taalas HC1. Taalas has launched an online chatbot (chatjimmy) where you can experience this for yourself. Even longer responses from the chatbot appear instantly.
For example, we asked it to explain the advantage of a hardware-optimized AI chip in about 300 words. A response of 256 words appeared as soon as we pressed Enter. Taalas displays a speed statistic: our prompt was processed at 15,735 tok/s, generating the answer in 0.031 seconds.
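The displayed statistics are internally consistent, as a quick check shows (rate, duration, and word count are from our test above; the tokens-per-word ratio that falls out is only a rough indicator):

```python
# Sanity check on the chatbot's reported statistics (figures from our
# test; the resulting tokens-per-word ratio is a rough estimate).
rate = 15_735        # tokens per second, as displayed by Taalas
duration = 0.031     # seconds to generate the full answer
words = 256          # length of the answer we received

tokens_generated = rate * duration          # ~488 tokens
tokens_per_word = tokens_generated / words  # ~1.9 tokens per word

print(f"{tokens_generated:.0f} tokens, ~{tokens_per_word:.1f} tokens/word")
```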
This eliminates the classic LLM chat experience, where you watch longer answers being drafted in real time before your eyes. With complex prompts, that can easily take several seconds or more.
AI that talks
For a written conversation, such a delay isn't a disaster, but Taalas is looking further ahead. Many AI applications require low latency; think first of spoken conversations. If you ask an AI system a question out loud, the experience is much better when there isn't a five-second pause before the answer.
Just because the model is hardcoded doesn't mean all flexibility is lost. The model can still be adapted via LoRAs (low-rank adapters), and parameters such as the context window remain adjustable. RAG (retrieval-augmented generation) also remains possible: the RAG component can run perfectly well as software on top of the LLM in the chip.
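The RAG-on-top idea can be sketched in a few lines. Everything here is hypothetical: `hc1_generate` stands in for whatever API Taalas actually exposes, and the string-similarity retrieval is a crude stand-in for real vector search; only the division of labor (retrieval in software, generation on the fixed chip) reflects the article:

```python
# Minimal sketch of RAG running as ordinary software in front of a
# hardcoded model. `hc1_generate` is a hypothetical stand-in for the
# real accelerator API; retrieval here is deliberately simplistic.
from difflib import SequenceMatcher

DOCUMENTS = [
    "The HC1 hardcodes Llama 3.1 8B directly into silicon.",
    "HBM memory is not required because weights live on the chip.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by crude string similarity (a stand-in for a
    real vector search) and return the top k."""
    scored = sorted(
        docs,
        key=lambda d: SequenceMatcher(None, query, d).ratio(),
        reverse=True,
    )
    return scored[:k]

def answer(query: str, generate) -> str:
    """Assemble context in software; only `generate` touches the chip."""
    context = "\n".join(retrieve(query, DOCUMENTS))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

# Stub generator in place of the real accelerator call:
def hc1_generate(prompt: str) -> str:
    return f"[model output for {len(prompt)} prompt chars]"

print(answer("Does the HC1 need HBM?", hc1_generate))
```

The design point is simply that the retrieval layer never needs to modify the model, so a fixed-function chip imposes no constraint on it.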
Disadvantages
There are, of course, disadvantages. By hardcoding the model, an investment in a chip is inextricably linked to a single model. Given the rapid evolution of the AI field, this is a significant drawback. An LLM can be state-of-the-art today but outdated within six months.
The test chip with Llama 3.1 8B is a speed demon but not a genius. The model, with a relatively modest eight billion parameters, cannot compete with large LLMs such as GPT 5.2.
Tailored to a specific use case
Taalas’s approach therefore requires customers to devise a business case linked to a single model, where that model remains adequate until the investment has been depreciated. This isn’t so strange: large LLMs steal the show with their capabilities, but operational efficiency gains come from targeted applications.
If an LLM is smart enough to support the functionality of a frequently used agent in a well-defined workflow, there isn’t much reason to tinker with it. For a mature and well-thought-out business case, Taalas’s approach offers a way to hardcode inference with maximum efficiency and therefore minimum cost.
The hardware-based AI chip is also potentially very interesting for robotics. The more efficient accelerator with negligible latency opens new doors. Think not only of exotic machines but also, for example, future cars.
Moving toward a larger model
Before that happens, Taalas still has some work to do. The Taalas HC1 is a demonstration product. Later this year, Taalas plans to launch a second demonstration chip, based on the HC1 but with a slightly larger model hardcoded.
Then comes the major work. In a second generation of its platform (HC2), Taalas plans to apply the technology to a frontier model: a modern and large LLM on the level of the GPT 5.2s and Opus 4.6s of the world.
Breakthrough
The numbers speak for themselves: with a small team, limited time, and limited resources, Taalas has demonstrated that there is enormous value in hardcoded LLMs. Such brain-chips excel in efficiency, both in production and in use. This is immediately clear from the Taalas HC1 and the accompanying chatbot demo.
Whether the approach will catch on at this stage remains to be seen. Taalas itself still has work to do to demonstrate that more capable models also function in this way. Conversely, the market must become mature enough to confidently link business cases to a single model.
In any case, the launch of the Taalas HC1 is a significant breakthrough in the further development and democratization of AI.
