Kyutai: the twelve-person Paris nonprofit turning open releases into shared vocabulary.
Research velocity converted into reputational capital. A twelve-person Paris nonprofit ships weights every ten to twelve weeks, rewriting the vocabulary the open voice-AI field thinks in.
1. The day a twelve-person lab gave away an AI that can talk on the phone
Before Moshi, full-duplex voice AI lived only behind commercial APIs. On September 17, 2024, a Paris nonprofit put it on a single $2,400 GPU under permissive licenses.
On the morning of September 17, 2024, a twelve-person Paris nonprofit called Kyutai released a voice AI named Moshi in three places at once. The paper went to arXiv, the model weights to Hugging Face, the code to GitHub. Weights under CC-BY 4.0, code under Apache 2.0. Both are permissive licenses, meaning commercial reuse is allowed and you can drop the model straight into your own product. And the whole thing ran on a single off-the-shelf $2,400 GPU (NVIDIA L4).
What made this special: Moshi was the first fully open version of a voice AI that can listen and speak at the same time, the way a phone call works (full-duplex). Not the walkie-talkie pattern where one side goes quiet while the other talks, but a conversation where the model can hear you, give a backchannel, interrupt, and overlap. Before that day, that kind of voice AI only existed behind commercial APIs. OpenAI’s GPT-4o advanced voice mode had been announced four months earlier. Gemini Live was in preview. No one outside those companies could touch the weights or the code. That changed on September 17.
Eighteen months later, the main Kyutai repository has around 9,700 stars. Neil Zeghidour, an author on the Moshi paper and the audio research lead at the time, posted a short line when Moshi was first shown publicly: “going from 0 to Moshi in 6 months with 6-8 people has been the most challenging thing.” (Zeghidour X, 2024-07-03) Six to eight people, six months, the first open release of a voice AI that can hold a phone conversation.
The thesis of this piece up front. What Kyutai is doing is building a machine that turns research velocity into reputational capital. Write a paper, release the weights under a permissive license, let other researchers pick up the primitives as reference vocabulary, watch a family of downstream derivatives grow around it. Four steps, repeating every ten to twelve weeks, compounding a single paper into a share of “the vocabulary the field thinks in” rather than into revenue. A posture commercial labs cannot hold (they need to gate the distribution) at a cadence academia cannot match.
2. Six FAIR Paris researchers, given a decade of cover
Six senior researchers, most from FAIR Paris, moved as a group to Kyutai. No precedent existed before 2023 for that density of load-bearing authorship inside a twelve-person lab.
Without names on the page, the rest of the discussion abstracts too quickly. The six Kyutai co-founders left senior research posts in the same window, most of them at Meta FAIR Paris (the bios below note the exceptions). The point is not that they were recruited one at a time. They moved as a group.
Patrick Pérez (CEO). Led audio and vision research at FAIR Paris, then ran Valeo.ai (self-driving AI) from 2020 to 2023, and took the Kyutai CEO role in November 2023. He sets the research agenda and handles the institutional interface with donors. Pérez put the conditions for Moshi into one line in an ai-Pulse 2024 interview: “We built it from scratch in six months with a very small team using 1,000 GPUs.” (ai-Pulse 2024)
Alexandre Défossez (Chief Exploration Officer). First author on the Moshi paper. During his FAIR years he was also first author on Encodec (the de facto reference neural audio codec) and a senior author on MusicGen. He has carried the audio research line straight through Zeghidour’s departure.
Neil Zeghidour (Audio Research Advisor). Senior author on AudioLM (a foundational paper for audio language modeling) during his Google Brain Paris years. He left Kyutai in September 2025 to found his own startup, Gradium (detailed in §7), and remains as an advisor.
Hervé Jégou. A core contributor to FAISS (the vector-search library that anchors most modern embedding-retrieval pipelines); he now appears on the team page as alumni.
Edouard Grave. Co-author on the LLaMA paper; leads the language-modeling side.
Laurent Mazaré. The systems lead who turned Moshi into an implementation that streams reliably on a single L4.
Put concretely, all six founders of a twelve-person lab are first or senior authors on load-bearing papers. That density, concentrated in one building with ten years of cover, had no precedent before 2023.
3. The €300M structure that buys time
Endowment, talent concentration, and at-cost Scaleway H100 access. Remove any one of the three, and Kyutai does not produce §1’s cadence.
Reducing Kyutai’s sustainability to a single leg distorts the picture. Structurally there are three.
Leg 1: the endowment itself. In November 2023, Iliad (Xavier Niel, French telecoms), CMA CGM (Rodolphe Saadé, shipping), the Eric Schmidt foundation, and a small number of other donors contributed roughly €300 million in total. Against an estimated annual burn for a lab of this size (€20 to €30 million in salaries, compute, and operations), that is a ten-to-fifteen-year runway. The commercialization roadmap is explicitly lower priority.
“Big tech companies tolerate scientific publications less and less.”
— Xavier Niel, Kyutai launch press conference, 2023-11-17
Xavier Niel gave the reason in one line at the launch. If publication is becoming inconvenient inside commercial labs, build a lab outside them whose only output metric is publication. Eric Schmidt, speaking later on a Stanford SIEPR panel, put it this way: “open research needs patient capital, not quarterly earnings calls.” (SIEPR panel, 2024) He was not speaking about Kyutai specifically, but the logic is the same.
Leg 2: talent concentration. The six names from §2. What matters here is not the headcount but the fact that they moved together in the same window, most of them out of FAIR Paris. A coordinated move of that size does not happen without the endowment being in place first. The talent concentration is itself the endowment’s first output.
Leg 3: compute access. Kyutai uses Scaleway’s H100 cluster (Scaleway is the Iliad group’s cloud service) at cost. The arrangement works because Xavier Niel has a stake in both entities. Put plainly, they do not have to buy their own GPUs and they do not have to pay cloud list prices.
Remove any one of the three legs, and Kyutai does not exist. The endowment alone gives you runway without the people. The coordinated FAIR Paris exit alone gives you no ten-year cover. At-cost Scaleway alone is just a hosting arrangement. Défossez, on Practical AI #298, said that writing a preprint with the technical depth of the Moshi paper required “this kind of nonprofit mindset” (Practical AI #298, 2025-08). Not having a ten-year commercialization plan is the design choice that shapes the quality of the research right now.
| Lab | Budget (public) | Headcount | Open full-duplex STS shipped | Distribution |
|---|---|---|---|---|
| Kyutai (nonprofit, FR) | ~€300M endowment | ~12 core + visiting | 4 (Moshi, Hibiki, Hibiki-Zero, Unmute) | CC-BY / Apache / MIT |
| OpenAI | ~$60B+ raised | ~4,000 | 0 (GPT-4o voice closed) | API only |
| Google DeepMind | Alphabet-internal | ~several thousand | 0 (Gemini Live closed) | API + consumer |
| Meta FAIR Speech | Meta-internal | ~low hundreds | 0 FD dialogue; Spirit-LM, Seamless (FAIR-NC) | Gated research |
| Anthropic | ~$20B+ raised | ~1,000+ | 0 | API only |
4. How Moshi works: two brains, split by role
Moshi stacks two transformers split by role rather than by depth. The heavy 7B Temporal Transformer runs at 12.5 Hz; the smaller Depth Transformer handles intra-frame codebook expansion. That split is what makes single-GPU full-duplex tractable.
The Moshi architecture needs to be on the page. Once you see its shape, the rest of the Kyutai releases stop reading as parallel facts and start reading as a stack.
Moshi stacks two transformers (attention-based neural networks), split not by depth but by role. The upper Temporal Transformer (32 layers, around 7B parameters) advances one step every 80 milliseconds, ingesting seventeen token streams at each step: one “inner monologue” text token for Moshi, eight codebook streams for Moshi’s own output audio, and eight codebook streams for the user’s input audio. The lower Depth Transformer (6 layers, 1,024 dimensions) is a smaller model that expands the eight audio codebooks in sequence within each time step. The heavy model runs only at 12.5 Hz; the fine-grained intra-frame expansion is handed off to the lightweight side. That split is what makes single-GPU full-duplex tractable.
What “inner monologue” actually means. Every 80 ms frame, Moshi predicts a text token for “what it is about to say” before generating the audio for it. The order is: write the word internally, then speak the waveform. That ordering is what holds the content coherence of Moshi’s full-duplex output together.
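A minimal sketch helps fix the shape of that loop. Everything below is a toy stand-in (GRUs in place of the transformers, toy widths, a naive sum to fuse the seventeen streams, greedy decoding); only the stream layout and the ordering follow the paper: the heavy model steps once per 80 ms frame, the inner-monologue text token comes out first, and the Depth side then expands the eight codebooks in sequence.

```python
# Toy sketch of Moshi's per-frame generation loop. Module names, shapes, and
# the fusion of the 17 streams are illustrative stand-ins, not Kyutai's code;
# only the stream layout and the text-before-audio ordering follow the paper.
import torch
import torch.nn as nn

N_CODEBOOKS = 8                        # Mimi codebooks per audio stream
VOCAB_TEXT, VOCAB_AUDIO = 32_000, 2_048
DIM = 64                               # toy width; the real Temporal model is ~7B params

temporal = nn.GRU(DIM, DIM, batch_first=True)  # stand-in for the 32-layer Temporal Transformer
depth = nn.GRU(DIM, DIM, batch_first=True)     # stand-in for the 6-layer Depth Transformer
embed = nn.Embedding(VOCAB_TEXT + VOCAB_AUDIO, DIM)
to_text = nn.Linear(DIM, VOCAB_TEXT)
to_audio = nn.Linear(DIM, VOCAB_AUDIO)

def frame_step(prev_tokens, state):
    """One 80 ms step: 17 streams in (1 text + 8 own audio + 8 user audio),
    1 text token and 8 audio tokens out."""
    ctx = embed(prev_tokens).sum(dim=0).view(1, 1, DIM)  # naively fuse the 17 streams
    h, state = temporal(ctx, state)                      # heavy model: once per frame
    text_token = to_text(h[0, -1]).argmax()              # inner monologue first...
    audio_tokens, d_state, x = [], None, h
    for _ in range(N_CODEBOOKS):                         # ...then 8 codebooks, in order
        y, d_state = depth(x, d_state)
        tok = to_audio(y[0, -1]).argmax()
        audio_tokens.append(tok)
        x = embed(tok + VOCAB_TEXT).view(1, 1, DIM)      # each codebook conditions the next
    return text_token, torch.stack(audio_tokens), state

# One frame from dummy inputs: token 0 (text) plus 16 offset audio tokens.
prev = torch.cat([torch.zeros(1, dtype=torch.long),
                  torch.randint(VOCAB_TEXT, VOCAB_TEXT + VOCAB_AUDIO, (16,))])
text_tok, audio_toks, state = frame_step(prev, None)
```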
And Mimi is the neural audio codec (a converter that compresses audio signals into discrete tokens an AI can handle). It compresses 24 kHz mono audio to 12.5 Hz frames at 1.1 kbps. Mimi’s first codebook is distilled against WavLM (a self-supervised speech representation model), so the very first token carries linguistic meaning rather than pure acoustic residue. Mimi had to work before Moshi could work, and that ordering is why the codec keeps coming up whenever the Kyutai lineage is discussed.
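The 1.1 kbps figure follows directly from the numbers above, assuming the 2,048-entry codebooks reported in the Moshi paper (11 bits per token); a quick check:

```python
# Back-of-the-envelope check of Mimi's 1.1 kbps figure. The 2,048-entry
# codebook size is an assumption from the Moshi paper; the rest is arithmetic.
import math

frame_rate_hz = 12.5        # token frames per second (one every 80 ms)
n_codebooks = 8             # quantizer levels kept at inference
codebook_size = 2_048       # entries per codebook -> 11 bits per token

bits_per_token = math.log2(codebook_size)                   # 11.0
bitrate_bps = frame_rate_hz * n_codebooks * bits_per_token  # 12.5 * 8 * 11
print(f"{bitrate_bps:.0f} bps = {bitrate_bps / 1000:.1f} kbps")  # 1100 bps = 1.1 kbps
print(f"temporal downsampling: {24_000 / frame_rate_hz:.0f}x")   # 1920x
```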
5. The fourth mode: publish AND ship AND open-source
Most labs pick one of three release postures. Kyutai picks a fourth (paper plus permissive weights plus Apache code plus reference runtimes) and repeats it every ten to twelve weeks.
The category contrast is worth putting on the page explicitly. Voice AI labs fall into three familiar public-release postures. First, “publish only, keep code and weights internal,” which is most of academia. Second, “build a product, skip papers or keep them minimal,” which is most commercial voice companies. Third, “publish a paper and some weights, but keep the training corpus, training code, and the evaluation suite internal,” which is where most big-lab open releases sit today (much of Meta FAIR and Google DeepMind among them).
Kyutai occupies a fourth mode. Publish the paper. Release the weights under a permissive license. Release the inference code under Apache 2.0. Ship reference implementations in three runtimes (PyTorch, MLX, Rust) alongside. And repeat every ten to twelve weeks. Only when all four are in place can downstream researchers actually reach for the Moshi primitives and build the next thing.
Right after the Moshi release, NVIDIA’s Jim Fan posted a short reaction on X: “Open full-duplex voice is finally here — Moshi is a gift to the field.” (Jim Fan X, 2024-09) Around the same time, Nathan Lambert wrote at Interconnects: “Moshi reset what ‘open voice’ means in the speech-LM conversation.” (Interconnects, 2024-09) Both framings are about the release posture, not the paper’s metrics.
What this fourth mode produces, in three points.
Point 1: reference vocabulary. Mimi’s 12.5 Hz, Moshi’s dual-transformer split, the text-before-audio inner monologue, the seventeen-stream frame, and the Delayed Streams Modeling framework shipped alongside Unmute. These are no longer concepts inside one paper. Other groups now treat them as the “open full-duplex primitives.” The “Family 1: dual-stream plus codec” row in the STS series’ four-family taxonomy exists as a public-research category because Moshi exists as a public artefact.
Point 2: a measurable derivative lineage. NVIDIA’s PersonaPlex-7B-v1 (January 2026) is a Moshi fine-tune. Sesame’s CSM-1B reuses Mimi in a different architecture. Sesame itself raised $250M from Sequoia and Spark in October 2025 at a >$1B valuation and kept its larger models closed. Commercial derivatives accruing value does not change the fact that the parent release was public.
Point 3: coordination of consent-first audio infrastructure. The Voice Donation Project ran from June 2025 through early 2026, verifying 228 out of 374 submitted voices for inclusion in Kyutai TTS. An opt-in, consent-first voice dataset at that scale. No commercial voice-cloning service has run this in public yet. As the consent and licensing regimes covered in Article 10 tighten through 2026 and beyond, the value of infrastructure shaped like this compounds.
6. The Moshi-derived lineage: Hibiki and Hibiki-Zero
Hibiki ports the Moshi architecture to translation. Hibiki-Zero shows that extending to a new language has compressed to a few weeks of GRPO fine-tuning. The primitives were not a one-shot.
Pulling two releases out of the lineage and reading them carefully carries more signal than enumerating all ten.
Hibiki (February 2025, 2.7B-parameter simultaneous speech-to-speech translator). The Moshi architecture ported to translation. It translates French to English in real time while preserving the source speaker’s voice. Paper evaluation reported ASR-BLEU 30.5, beating SeamlessStreaming and StreamSpeech, with human naturalness at 3.73 out of 5 (against 4.12 for professional interpreters). What matters here is that the full-duplex machinery transferred cleanly to a different task. The Moshi primitives were not a one-shot.
Hibiki-Zero (February 2026). A year later, the same architecture retrained with GRPO reinforcement learning (group relative policy optimization, which samples a group of outputs per input and scores each against the group average, removing PPO’s separate learned value model). No aligned training data (no parallel source-target pairs). Spanish, Portuguese, German, and Italian added on the input side. Italian was bootstrapped from under 1,000 hours of audio. Put plainly, this release showed that the cost of extending open full-duplex technology to a new language can compress to a few weeks of GRPO fine-tuning. On the Audio-NTREX-4L long-form benchmark, it is reported as state-of-the-art on five X-to-English pairings.
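For concreteness, here is a minimal sketch of the group-relative step at GRPO’s core. It assumes each group is G sampled translations of the same source segment, scored by some automatic metric; Kyutai’s actual reward design for Hibiki-Zero is not spelled out here, so treat the reward values and the clipped objective below as illustrative.

```python
# Minimal GRPO sketch: advantages from within-group reward comparisons,
# no learned value model. Rewards and log-probs below are made-up toys.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (G,) scores for G outputs sampled from the same input."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO-style clipped objective, one log-prob per sampled output."""
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

# Toy usage: 8 sampled translations of one source segment, scored by an
# automatic quality metric instead of aligned source-target pairs.
rewards = torch.tensor([0.31, 0.55, 0.12, 0.48, 0.60, 0.22, 0.44, 0.50])
adv = grpo_advantages(rewards)
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
print(grpo_loss(logp_new, logp_old, adv))
```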
Given that, the path to the next non-English full-duplex release from Kyutai (Japanese, Mandarin, Arabic) reads as two or three releases away rather than two or three years. With a GRPO pipeline in place, the impossibility of collecting parallel data stops being a blocker.
The other artefacts, Unmute (May 2025, cascade-style with LLM + STT + TTS at 450 to 750 ms latency), Pocket TTS (January 2026, lightweight voice cloning that runs on CPU), Invincible Voice (February 2026, assistive communication for ALS patients), and OVIE (April 2026, single-image view synthesis), are lateral extensions, and they follow the same “publish AND ship AND open-source” mode. Seven industry-firsts in eighteen months means that the four-step loop has kept running without breaking.
7. Inside the 2026 landscape: Gradium as the first data point
Kyutai’s 2026 role reads at three touchpoints: field gravity, a non-leaderboard benchmark posture, and Gradium as the first donor-thesis data point.
Kyutai’s role in the 2026 landscape reads best at three touchpoints.
First: field gravity. When a fully open lab ships an artefact, the market decides who adopts it fastest. The fastest adopters of the Moshi architecture have been both well-capitalized commercial labs and academic followers. One lab becoming the template for an entire architectural family is territory open-science work does not usually reach. The concrete trace is that most full-duplex papers published from 2025 onward now carry a related-work line like “We follow the dual-stream formulation introduced by Moshi” in their opening section (example: arXiv 2510.07838 FDB v2). That is one cross-section of “which vocabulary the field thinks in” getting rewritten.
Second: benchmark posture. In April 2026, the Big Bench Audio top three were Step-Audio-R1.1 (97.0%), Gemini 3.1 Flash Live (95.9%), and Grok Voice (92.9%). Moshi is not in the top 10. What matters is that Kyutai is not chasing those numbers. No dedicated benchmark-submission team. The public artefacts are the canonical statement. Evaluation is left to outside groups. In fact, the Full-Duplex-Bench v2 paper was written by an external academic team using Kyutai’s open weights. Visibility drops when you do not chase the leaderboard, and for this lab that is a trade for a different payoff.
Third: the Gradium inflection. On December 2, 2025, Neil Zeghidour announced that Gradium (FirstMark + Eurazeo leading a $70M seed) had exited stealth. Gradium is a for-profit ultra-low-latency voice AI company, with Eric Schmidt participating as an angel. Kyutai’s own homepage calls Gradium its “first spin-off” and positions the company as a pathway from open research to production-ready systems.
How to read the split, in four points. (1) Eric Schmidt is a Kyutai donor and a Gradium angel, which is a direct data point against reading the two institutions as structurally adversarial. (2) The positioning is asymmetric. Kyutai ships CC-BY 4.0 open-weight foundation research. Gradium pursues productization under closer-to-closed distribution. They sell into different layers. (3) The research bench did not empty out when Zeghidour left. Défossez has continued to ship Hibiki-Zero, Invincible Voice, and OVIE through 2026. (4) From the donor’s side, this is a Bell Labs pattern. Foundational research yielding a commercial ecosystem around it. Kyutai functioning as an institution that produces talent looks, from the donor perspective, like a secondary payoff of the endowment rather than a drawback.
Gradium is the first real data point on Kyutai’s donor thesis. If this first case lands close to Bell-Labs-style complementarity, then the decade-horizon open-research endowment is a repeatable institutional template, not a single experiment.
8. Open questions, and a proposal from Fullduplex.ai
Moshi’s full-duplex behavior is credited to 2,000 hours of Fisher English Training Speech from 2004. The next generation of open full-duplex rises or falls on whether a new two-channel dyadic data supply appears.
Kyutai’s role in 2026 reads as an answer to a design question the field had not previously asked at this scale. Twelve people out of FAIR Paris, €300M, ten years of cover, a permissive-licenses-only charter, a European compute stack accessible from the same building. The combination produced seven industry-firsts in eighteen months and left a public reference architecture for open full-duplex behind as common infrastructure.
The biggest insight to take away is actually about data. Moshi’s full-duplex capability is credited in the paper to fine-tuning on Fisher English Training Speech (roughly 2,000 hours of two-channel telephone conversations published by the Linguistic Data Consortium in 2004 and 2005). Pretraining was roughly 7 million hours of web audio diarized into simulated two channels. Instruction fine-tuning was roughly 20,000 hours of synthetic audio from Kyutai TTS (itself trained on 170 hours of real two-channel recordings). The world’s most famous open full-duplex model could not have been built without a twenty-year-old paid LDC dataset. The bottleneck was not compute, and it was not architecture. It was the supply of two-channel dyadic audio.
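The disproportion is worth computing from the figures above:

```python
# Shares of Moshi's training audio, from the figures quoted above.
hours = {
    "pretraining (web audio, simulated two-channel)":   7_000_000,
    "instruction fine-tune (synthetic Kyutai TTS)":        20_000,
    "full-duplex fine-tune (Fisher, real two-channel)":     2_000,
}
total = sum(hours.values())
for stage, h in hours.items():
    print(f"{stage}: {h:>9,} h ({100 * h / total:.3f}%)")
# Fisher is ~0.03% of the hours, yet it is the stage the paper credits
# with the full-duplex behavior itself.
```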
Three signals to watch over the next five years.
First, whether a permissively licensed two-channel dyadic dataset in the 100,000 to 500,000 hour band appears. STS series Article 06 identifies that band as the foundation threshold. As of April 2026, two companies are visible on the record: David.ai and Fullduplex.ai. Fullduplex.ai’s February 2026 dataset release reached #3 on Hugging Face’s trending datasets, the first demand signal at that scale from the full-duplex data layer.
Second, whether a multilingual live-examiner benchmark for full-duplex STS in Japanese, Korean, and other languages appears. A public evaluation infrastructure that would let Moshi be honestly compared against Step-Audio-R1.1 or Gemini 3.1 Flash Live on non-English dialogue does not exist as of April 2026.
Third, whether the Kyutai endowment model produces more spin-offs on the Gradium pattern. The first raised $70M. Whether a second appears is the real test of whether the decade-horizon open-research endowment is a repeatable institutional template or a single experiment.
Kyutai’s founding wager was that ten years of cover for a small senior team produces open research infrastructure that outlasts startup cycles. The 2026 evidence (seven industry-firsts in eighteen months, an architecture shaping the open landscape, a consent-first audio project at a scale commercial services have not matched, the first Bell-Labs-pattern spin-off) is consistent with that wager paying off.
For anyone working on any of these three signals, hello@fullduplex.ai is enough.