Radically Accelerate Frontier Model Inference

Run the most advanced models up to 25x faster and at 1/10th the cost.

At Fractile, we are revolutionising compute to build the engine that can power the next generation of AI.

Maximum throughput per Watt (and per dollar)

>10x throughput efficiency (tokens/s/MW)*

The world will soon be bottlenecked on inference throughput. Available data centre power cannot scale as fast as our aspirations. Fractile’s chips are optimised for performance-per-Watt, meaning every kWh of energy generates 10x the tokens of a frontier GPU system, transforming the economics of AI inference.

* Projected performance per megawatt on reference models versus GB200 NVL72.

Chart: Fractile vs. Nvidia NVL72, tokens/s/MW

World-leading inference speed without sacrifices

25x faster*

Every other hardware platform serving AI models trades off throughput against latency. Fractile doesn’t. A Fractile system can serve hundreds or thousands of concurrent user queries while delivering generation latencies (TTFT, TBOT) up to two orders of magnitude lower, across frontier models from 7B to 7T+ parameters and from dense models to highly sparse MoEs.

* Projected performance on reference models versus GB200 NVL72.

Chart: Fractile vs. Nvidia NVL72, inference speed

The advantages

In-Memory Compute Technology

Turning GPUs inside-out: the world’s fastest memory system

Today’s frontier AI models have been created through the relentless pursuit of ‘scaling laws’. However, demand for inference has scaled massively, outstripping the hardware capacity to serve it. Today’s AI chips face a fundamental mismatch between compute and memory bandwidth, which hamstrings their performance and efficiency.

Over the past 25 years, peak compute (measured in FLOPS) has scaled 1,000,000x while bandwidth to DRAM (including HBM) has increased just 40x, meaning memory bandwidth has become the key bottleneck for GPUs and accelerators. The mitigation has been to construct workloads that re-use each memory access across hundreds or thousands of computations. AI training looks like this: every time we load a model’s weights from DRAM, we run thousands of tokens through them, balancing compute and memory access. Inference presents a very different challenge, because suddenly we care as much about latency (the time to return an answer to a single user, for instance) as throughput (the aggregate tokens crunched by our processors).
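To make the re-use argument concrete, here is a minimal back-of-the-envelope sketch. All figures are illustrative round numbers chosen for this example, not measurements of any specific chip. It compares the FLOPs-per-byte a chip needs to stay busy with what training-style batching and single-user decoding actually deliver.

```python
# Back-of-the-envelope arithmetic-intensity sketch of the compute/memory
# mismatch. All figures are illustrative round numbers chosen for this
# example, not measurements of any specific chip.

PEAK_FLOPS = 2.0e15        # assumed peak compute, FLOP/s
HBM_BANDWIDTH = 8.0e12     # assumed DRAM/HBM bandwidth, bytes/s
BYTES_PER_WEIGHT = 1       # assumed 8-bit weights
TRAIN_BATCH_TOKENS = 4096  # assumed tokens processed per weight load in training

# FLOPs the chip can execute per byte it can fetch: the break-even re-use
# factor a workload must exceed to be compute-bound rather than memory-bound.
breakeven = PEAK_FLOPS / HBM_BANDWIDTH

# Training: each weight fetched from DRAM is applied to thousands of tokens
# (~2 FLOPs per weight per token), so re-use sits far above break-even.
train_reuse = 2 * TRAIN_BATCH_TOKENS / BYTES_PER_WEIGHT

# Single-user decoding: every weight is re-fetched for every generated token,
# so re-use is ~2 FLOPs per byte and the compute units mostly wait on memory.
decode_reuse = 2 / BYTES_PER_WEIGHT

print(f"break-even re-use:       {breakeven:,.0f} FLOPs/byte")
print(f"training (4096 tokens):  {train_reuse:,.0f} FLOPs/byte -> compute-bound")
print(f"decoding (1 user):       {decode_reuse:,.0f} FLOPs/byte -> memory-bound")
```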

What was a growing computer architecture issue has now become a crisis.

Fractile is building the first of a new generation of processors, where memory and compute are physically interleaved and innately balanced, driving faster, more efficient compute without trade-offs.

Frontier AI inference: up and to the right

The number of tokens we are processing with frontier AI models is growing by more than 10x every year. The compounding effects of increasingly widespread applications, the advent of reasoning models that generate many more tokens in pursuit of the best-quality answers, and AI agents that autonomously crunch through thousands of tokens while carrying out tasks mean that this exponential is set to continue.

Frontier model inference has two critical requirements that existing hardware cannot satisfy simultaneously: low latency and high throughput. To meet even modest latency objectives (like generating enough tokens per second per user to provide search results in a couple of seconds), today’s hardware drops to below 1% of its peak efficiency. Again, the compute-memory mismatch is largely responsible for this.
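A rough sketch of why that happens, again using illustrative round numbers rather than figures for any specific system: at small batch sizes the chip stalls on memory and utilisation collapses, while the large batches needed to saturate compute drag down the tokens per second each individual user sees.

```python
# Minimal roofline model of batched decoding, showing the latency/throughput
# trade-off on conventional hardware. Figures are illustrative round numbers,
# not measurements of any specific system.

PEAK_FLOPS = 2.0e15      # assumed peak compute, FLOP/s
HBM_BANDWIDTH = 8.0e12   # assumed DRAM/HBM bandwidth, bytes/s
PARAMS = 70e9            # assumed dense model size, parameters
WEIGHT_BYTES = PARAMS    # assumed 8-bit weights, so 1 byte per parameter

def decode_step(batch: int):
    """Upper bounds for one decode step serving `batch` concurrent users."""
    t_mem = WEIGHT_BYTES / HBM_BANDWIDTH        # stream every weight once
    t_cmp = 2 * PARAMS * batch / PEAK_FLOPS     # ~2 FLOPs per param per token
    t = max(t_mem, t_cmp)                       # the slower limit wins
    total_tps = batch / t
    per_user_tps = total_tps / batch
    utilisation = (2 * PARAMS * batch) / (PEAK_FLOPS * t)
    return total_tps, per_user_tps, utilisation

for batch in (1, 64, 1024):
    total, per_user, util = decode_step(batch)
    print(f"batch {batch:5d}: {total:12,.0f} tok/s total, "
          f"{per_user:7,.1f} tok/s per user, utilisation {util:6.1%}")
```

Under these assumptions, a single-user batch leaves the compute units more than 99% idle, while the batch sizes needed to approach peak utilisation cut each user’s generation speed by roughly an order of magnitude.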

Fractile’s hardware delivers both simultaneously, serving thousands of tokens per second to thousands of concurrent users, for many tens of millions of tokens per second per rack of compute on a frontier MoE reasoning model.

Scalable, composable, compatible

We’re proud to be building a new paradigm for AI hardware that will transform the future of AI scaling. However, ‘better’ AI hardware is not better if it cannot adapt and grow as new models and architectures are developed and introduced. Algorithmic advances in AI are — for now — outstripping the pace of hardware improvement, delivering more than 10x performance-per-dollar improvements every year.

For Fractile’s advantages to be truly multiplicative with those of our customers, we have invested heavily in developing a hardware architecture and a software stack that let us compose hardware from a single chip up to data-centre-scale orchestration, and that provide a turnkey drop-in replacement for existing model inference stacks on competitor hardware.
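As a purely hypothetical illustration of what a drop-in replacement typically looks like in practice, an application already written against an OpenAI-compatible client might only need its base URL and model name changed. The endpoint, model name, and key below are invented placeholders, not Fractile’s published interface.

```python
# Hypothetical sketch only: the URL, model name, and API key are invented
# placeholders and do not describe Fractile's actual interface.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.invalid/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                           # placeholder credential
)

response = client.chat.completions.create(
    model="frontier-moe-reasoning",  # placeholder model name
    messages=[{"role": "user", "content": "Summarise the latest benchmark run."}],
)
print(response.choices[0].message.content)
```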

Team & Jobs

Join us and build the future of AI

Fractile’s hardware performance is only possible because of the full-stack approach we take to building the next class of processors for AI acceleration. Our team spans transistor-level circuit design up to cloud inference server logic, and everything in between.

Fractile is home to some of the world’s most talented, driven and energetic technologists and thinkers, who are inspired to take on some of the world’s most impactful technical challenges in a deeply collaborative environment.

If you are interested in being a part of the Fractile mission, then we would love to hear from you.