---
title: "Kyutai: the twelve-person Paris nonprofit turning open releases into shared vocabulary"
description: "Research velocity converted into reputational capital. A twelve-person Paris nonprofit ships weights every ten to twelve weeks, rewriting the vocabulary the open voice-AI field thinks in."
article_number: "v01"
slug: v01-kyutai
published_at: 2026-04-22
reading_minutes: 17
tags: ["verticals", "kyutai", "open-source"]
canonical_url: https://fullduplex.ai/blog/v01-kyutai
markdown_url: https://fullduplex.ai/blog/v01-kyutai/md
series: "The Verticals"
series_position: 1
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Kyutai: the lab that gave full-duplex STS to everyone

*A Paris research lab without a product, a revenue target, or a customer list has set the public floor for real-time voice AI four times in eighteen months. This is what that looks like from inside the sector it is quietly rewriting.*

## 1. The September 2024 moment

*Bottom line: before Moshi, every full-duplex speech-to-speech model that could hold a real conversation lived behind a commercial API. After Moshi, anyone with a single GPU could run one.*

On September 17, 2024, a dozen-person French nonprofit called Kyutai posted a paper to arXiv, a set of model weights to Hugging Face, and a reference implementation to GitHub. The paper was titled [Moshi: a speech-text foundation model for real-time dialogue](https://arxiv.org/abs/2410.00037). The weights were released under CC-BY 4.0. The code was Apache 2.0. The whole package ran on a single NVIDIA L4, an entry-level data-center GPU.

What it did was unusual. Until that morning, every publicly demonstrable system that could listen and speak at the same time, handle a real interruption, and answer in under a second was wrapped in a commercial API. GPT-4o's advanced voice mode, announced four months earlier, was not yet generally available. Gemini Live was still a preview. The open research literature had shown [dGSLM](https://arxiv.org/abs/2203.16502) from Meta, which produced two-speaker dialogue without content grounding, and a scatter of academic prototypes that could not be reproduced without reconstructing the training corpus. Moshi closed the gap in one release. It shipped a 7B parameter temporal transformer, a 6-layer depth transformer for intra-frame codebook prediction, a new streaming codec called [Mimi](https://huggingface.co/kyutai/mimi) at 12.5 Hz and 1.1 kbps, and the first documented "inner monologue" mechanism where the model predicts a text token before its audio for every 80 ms frame. It claimed a theoretical latency of 160 milliseconds and a practical latency of about 200 milliseconds on the L4.

<div class="diagram" id="F1"><!-- Kyutai release cadence timeline, nov 2023 → apr 2026 --></div>

That moment set the posture this profile is written against. An open lab shipped something frontier labs had not yet put in users' hands, and released it under a license that let any researcher in the world read, reproduce, and build on it.

A scope note belongs here before the profile continues. Kyutai is an open-science research lab rather than a voice-only specialist. Its public output also covers [Helium 1](https://kyutai.org/helium), an open 2B-parameter LLM released under CC-BY 4.0, [MoshiVis](https://kyutai.org/moshivis), a Moshi variant that can discuss images, and lateral work in codecs and computer vision that culminated in April 2026's OVIE novel-view-synthesis release. This profile reads the lab through the voice frontier because that is where its field gravity is most visible. The lab itself is wider than speech.

## 2. The cadence

*Bottom line: a team that fits in a large conference room has shipped seven industry-first releases in eighteen months, a pace no commercial voice AI lab with ten times the staff has matched in public.*

A release every two to three months is a standard cadence for a university speech group. At Kyutai it has been the cadence of industry firsts.

[Hibiki](https://arxiv.org/abs/2502.03382) arrived in February 2025 as a 2.7B parameter simultaneous speech-to-speech translator that preserves the source speaker's voice into the target language. [Unmute](https://github.com/kyutai-labs/unmute) in May 2025 paired Kyutai's open STT and TTS with any text LLM as a modular cascade, and was open-sourced under MIT in July. [Pocket TTS](https://kyutai.org/blog/2026-01-13-pocket-tts) in January 2026 shipped voice cloning small enough to run on CPU, which makes local on-device synthesis practical on commodity hardware. [Hibiki-Zero](https://kyutai.org/blog/2026-02-12-hibiki-zero) in February 2026 rebuilt the Hibiki training pipeline with GRPO reinforcement learning and no aligned data, and picked up four new input languages in the process. [Invincible Voice](https://kyutai.org/blog/2026-02-24-invincible-voice) two weeks later turned the stack toward an assistive-communication demo for people living with ALS. OVIE in April 2026 stepped laterally into single-image novel view synthesis, a computer vision task that is not obviously speech at all.

<div class="diagram" id="F3"><!-- Moshi-derived lineage tree, apr 2026 --></div>

Seven industry-first releases in eighteen months. Each one reached an architectural milestone the open field had not previously crossed. Each one shipped under a license that let other labs use it the next morning. That cadence is the first thing worth understanding about Kyutai, because every other part of the story — the economic model, the talent posture, the field-gravity argument — is downstream of it.

## 3. The endowment and the people

*Bottom line: Kyutai's posture is an economic innovation as much as a research one. Six senior researchers, most of them out of FAIR Paris, with a ten-year runway and no product roadmap is a shape the field had not previously tried at this scale, and the people who chose that shape are most of the reason it works.*

Before the architecture and the releases, the lab is worth describing as an organization. Kyutai launched in November 2023 with an announced budget of roughly [€300 million](https://techcrunch.com/2023/11/17/kyutai-is-an-french-ai-research-lab-with-a-330-million-budget-that-will-make-everything-open-source/), contributed by Xavier Niel's Iliad, Rodolphe Saadé's CMA CGM, Eric Schmidt's foundation, and a small ring of other donors. The structure is a nonprofit. There is no product roadmap, no revenue target, and no fundraising clock. At the announced burn rate of a lab of this size, roughly €20 to €30 million a year including compute, the endowment is a ten-to-fifteen-year runway by design. The founders did not wire up a path to commercialization because they did not want one. The whole apparatus is an answer to the question of what happens if you give serious research scientists a decade of cover and tell them to ship work that any other lab can use.

The six founding scientists are drawn heavily from Meta's FAIR Paris office, and the concentration matters. Patrick Pérez, who joined as CEO after leading Valeo.ai, sets the research agenda and handles the institutional interface with the donors. Alexandre Défossez, first author on the Moshi and Encodec papers and a co-author on MusicGen, anchors the audio research line and now carries the title Chief Exploration Officer. Neil Zeghidour, who led the AudioLM work during his earlier Google Brain Paris stint and is a co-founder here, led the audio research program through the first two years. Hervé Jégou brought two decades of computer vision work including the FAISS vector-search library that most embedding retrieval pipelines in the world now depend on. Edouard Grave, a co-author on the original LLaMA paper, leads the language modeling side. Laurent Mazaré runs the systems and infrastructure work that turned Moshi from a paper into something that reliably streams on a single L4. That concentration of FAIR Paris senior staff inside a twelve-person nonprofit is itself the first data point on the endowment thesis: you do not normally get this many first authors of load-bearing papers in one organization unless someone has decoupled the research from the product cycle.

Three assumptions sit underneath the posture. The first is that the endowment is the product: a decade of runway with no quarterly scorecard is what lets the team pick hard research problems instead of incremental ones. The second is that open-source is distribution: Moshi weights under CC-BY 4.0, Mimi under the same license, code under Apache 2.0, three reference runtimes (PyTorch, MLX, Rust), and the same pattern repeated across Hibiki, Unmute, and the Voice Donation Project. None of this is casual. It is the part of the operation that most resembles a go-to-market function at a commercial company, except the distribution target is other researchers and other labs. The third is that the first mover in open-source can earn durable field gravity without a product. A Moshi fine-tune from NVIDIA, a Mimi reuse inside Sesame's CSM line, a growing academic follow-on cluster that thinks in Kyutai's vocabulary: none of them pay Kyutai a license fee, and Kyutai does not ask for one. The payback is a field that measures itself against Kyutai's public artifacts.

Two refinements to the founder snapshot above belong here. First, the lab has grown beyond the founding six into a fuller research institution as of April 2026. The [team page](https://kyutai.org/team) lists audio, language, and engineering staff plus an operations and partnerships layer, a cohort of postdocs and PhD students, and an intern program. Neil Zeghidour now carries the Audio Research Advisor title after his September 2025 departure to Gradium (covered in §6), and Hervé Jégou is listed as an alumnus. The launch scientific advisory board, which included Yejin Choi, Yann LeCun, and Bernhard Schölkopf, has not been publicly disbanded. The organization now reads as an institution in the making rather than the twelve-person founding collective it launched as. Second, endowment and FAIR-trained talent are only two of the three legs of the posture. The third is compute access: Kyutai uses Iliad's [Scaleway](https://www.scaleway.com/) H100 cluster at cost, an arrangement that runs through Xavier Niel's stake in both entities. Endowment plus FAIR-trained talent plus at-cost European compute is the three-part shape, and any two without the third would not have produced §2's cadence.

<div class="diagram" id="F5"><!-- three-legged posture: endowment + FAIR talent + at-cost H100 --></div>

<div class="diagram" id="F4"><!-- fund & headcount compare: kyutai vs 4 other labs --></div>

The model is finite, and it would be dishonest to pretend otherwise. €300 million is a ten-to-fifteen-year runway rather than an endowment in perpetuity, there is no diversified funder base, and the lab's continued existence sits on the ongoing political and financial commitment of three donors plus a small ring of anonymous backers. That said, €300 million is still larger than the ten-year budget of nearly any academic speech group in the world. Cambridge's MLMI group, CMU LTI's speech faculty, and JHU's CLSP operate at a fraction of that scale over the same horizon. Kyutai is not running out of money for a long time, and the decade-by-design horizon is the feature, not the bug, of the posture the founders chose.

## 4. The architecture Kyutai gave away

*Bottom line: Moshi's multistream dual-transformer and Mimi's 12.5 Hz codec are now the reference vocabulary for a whole family of open full-duplex systems. Kyutai did not ask for a license fee. It asked for a field.*

The central artifact is Moshi, and its shape matters because every other Kyutai release sits adjacent to it.

<div class="diagram" id="F2"><!-- Moshi architecture: Temporal + Depth transformers, 17 streams --></div>

Moshi is two transformers stacked by purpose rather than by depth. A 32-layer, 4096-dimension Temporal Transformer with roughly 7B parameters advances one step every 80 milliseconds. At each step it ingests seventeen parallel token streams: Moshi's single text "inner monologue" token, the eight residual vector quantization codebooks for Moshi's own audio output, and the eight codebooks for the user's audio. A smaller 6-layer, 1024-dimension Depth Transformer then expands that single temporal hidden state into the eight codebooks Moshi needs to emit for the current frame, running eight inner steps across codebook index rather than across time. Splitting time from codebook dimension this way is what makes 12.5 Hz full-duplex generation tractable on a single GPU. The heavy model runs only 12.5 times per second. The light model handles the intra-frame dependencies.
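
A minimal sketch of the per-frame loop makes the split concrete. The module choices, hidden sizes, and greedy sampling below are illustrative stand-ins, not Kyutai's implementation; only the stream count, the output codebook count, and the 80 ms / 12.5 Hz timing come from the paper.

```python
# Toy sketch of Moshi's frame loop, with small stand-in modules instead of the real
# 7B temporal transformer and 6-layer depth transformer. Stream counts and timing
# come from the paper; everything else here is an illustrative assumption.
import torch
import torch.nn as nn

N_STREAMS = 1 + 8 + 8    # inner-monologue text token + Moshi's 8 codebooks + user's 8 codebooks
N_OUT_CODEBOOKS = 8      # codebooks Moshi must emit for its own audio each frame
VOCAB = 2048             # assumed codebook size for the toy
D = 64                   # toy hidden size (4096 in the real temporal transformer)

embed = nn.Embedding(VOCAB, D)
temporal = nn.GRU(N_STREAMS * D, D, batch_first=True)   # stand-in: runs once per 80 ms frame
depth = nn.GRU(D, D, batch_first=True)                   # stand-in: runs 8 inner steps per frame
head = nn.Linear(D, VOCAB)

state = None
for frame in range(5):                                    # 5 frames = 400 ms of dialogue at 12.5 Hz
    tokens = torch.randint(0, VOCAB, (1, N_STREAMS))      # the 17 tokens observed this frame
    x = embed(tokens).reshape(1, 1, -1)                   # fuse the streams into one step input
    h, state = temporal(x, state)                         # heavy model advances one step per frame

    # Light model expands the frame's hidden state across the codebook dimension.
    inner, emitted = h, []
    for _ in range(N_OUT_CODEBOOKS):
        inner, _ = depth(inner)
        emitted.append(int(head(inner).argmax()))         # greedy pick, just for the sketch
    print(f"frame {frame}: {len(emitted)} output codebook tokens")
```

The asymmetry is the point: the expensive module steps 12.5 times per second regardless of how many codebooks exist, and the cheap one absorbs the intra-frame work.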

Mimi, the codec, is the enabling piece. It compresses 24 kHz mono audio into tokens at 12.5 Hz with a bitrate of 1.1 kbps and an 80 ms causal frame. The first codebook is distilled against WavLM semantic features so that the first token stream carries linguistic content rather than pure acoustic residue. The headline comparison in the Moshi paper is against SpeechTokenizer at 50 Hz and 4 kbps, and SemantiCodec at 50 Hz and 1.3 kbps. At one-quarter the token rate and roughly one-quarter the bitrate, Mimi is reported to match or beat both on reconstruction quality.

<div class="diagram" id="F6"><!-- Mimi vs SpeechTokenizer vs SemantiCodec: frame rate + bitrate bars --></div>
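
The bitrate figures are easy to sanity-check from frame rate and codebook count. The 2048-entry codebooks assumed for Mimi and the 1024-entry codebooks assumed for SpeechTokenizer are inputs to this arithmetic rather than numbers quoted above; the frame rates and the 1.1 kbps / 4 kbps headline figures are from the paper.

```python
# Back-of-envelope codec bitrate: frames per second x codebooks x bits per token.
# Codebook sizes (2048 for Mimi, 1024 for SpeechTokenizer) are assumptions here.
import math

def codec_bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * n_codebooks * bits_per_token

print(codec_bitrate_bps(12.5, 8, 2048))  # 1100.0 -> the ~1.1 kbps Mimi figure
print(codec_bitrate_bps(50.0, 8, 1024))  # 4000.0 -> the ~4 kbps SpeechTokenizer figure
```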

The family grows from there. Hibiki in February 2025 rebuilt the Moshi multistream architecture as a simultaneous French-to-English speech translator. On the paper's evaluation, Hibiki reaches an ASR-BLEU of 30.5, outperforming SeamlessStreaming and StreamSpeech on the same task. Human raters scored its naturalness at 3.73 on a 5-point scale against 4.12 for professional human interpreters. Unmute in May 2025 is the opposite architectural choice and reads as a second pillar rather than a lateral release: a modular cascade of Kyutai STT, any text LLM (Gemma-3-1B locally, GPT-OSS-120B on the kyutai.org demo), and Kyutai TTS, with latency around 450 to 750 ms. The lab's own positioning treats "Cascaded Voice AI" and "Speech Native Models" as two parallel research tracks, not one primary and one secondary, which is a different posture from most research labs with a favored architecture. Pocket TTS in January 2026 brought voice cloning down to CPU. Hibiki-Zero a month later replaced Hibiki's forced word-level alignment pipeline with GRPO reinforcement learning, which let the team add Spanish, Portuguese, German, and Italian as input languages, with Italian bootstrapped from under 1,000 hours of speech data. On the Audio-NTREX-4L long-form benchmark, Hibiki-Zero is reported as state-of-the-art across five X-to-English pairings. Invincible Voice and OVIE followed in February and April 2026, the first an assistive-communication demo for ALS and the second a lateral step into computer vision that signals the lab's scope is widening rather than narrowing.
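
Because Unmute is a cascade rather than a single model, its shape reduces to three pluggable stages. The function names below are placeholders rather than the Unmute API; the point the sketch makes is that the text LLM in the middle is swappable, which is the design choice that separates the cascaded track from the speech-native one.

```python
# Toy shape of a cascaded voice pipeline in the Unmute style: streaming STT feeds
# any text LLM, whose reply feeds streaming TTS. All three stage functions are
# placeholders, not Kyutai's API; the swappable middle stage is the point.
from typing import Callable, Iterator

def cascade(
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], Iterator[bytes]],
    audio_in: bytes,
) -> Iterator[bytes]:
    """Run one user turn through STT -> LLM -> TTS and stream audio back out."""
    transcript = stt(audio_in)      # Kyutai streaming STT in the real stack
    reply = llm(transcript)         # any text LLM: a small local model, a hosted one, etc.
    yield from tts(reply)           # Kyutai streaming TTS

# Dummy stages so the sketch runs end to end.
fake_stt = lambda audio: "what's the weather like"
fake_llm = lambda text: f"You asked: {text}. I don't have live weather data."
fake_tts = lambda text: (word.encode() for word in text.split())

chunks = list(cascade(fake_stt, fake_llm, fake_tts, b"\x00" * 320))
print(len(chunks), "audio chunks streamed back")
```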

Two pieces of supporting infrastructure round out the surface. The [Voice Donation Project](https://github.com/kyutai-labs/tts) ran from June 2025 through early 2026 and verified 228 voices out of 374 submissions for inclusion in Kyutai TTS. It is a consent-first, opt-in audio dataset on the Common Voice model, at a scale no commercial voice cloning service has publicly matched. And the Delayed Streams Modeling framework, released with Unmute, is the formal model behind Kyutai's streaming STT and TTS that lets downstream users build their own continuous listening and speaking components without reimplementing the plumbing.
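
The intuition behind Delayed Streams Modeling can be shown with a toy alignment: the same two time-aligned streams become streaming STT or streaming TTS depending on which stream is shifted later in time relative to the other. The frame grid, the two-frame delay, and the padding token below are illustrative assumptions, not the framework's interface.

```python
# Toy illustration of the delayed-streams idea: two time-aligned streams, one of
# which is shifted by a fixed number of frames. Delay the text stream and the model
# reads as streaming STT; delay the audio stream and it reads as streaming TTS.
PAD = "<pad>"

def delay(stream: list[str], frames: int) -> list[str]:
    """Shift a stream `frames` positions later in time, padding the front."""
    return [PAD] * frames + stream[: len(stream) - frames]

audio = ["a0", "a1", "a2", "a3", "a4", "a5"]   # one token per 80 ms frame
text  = ["t0", "t1", "t2", "t3", "t4", "t5"]

# Streaming STT: the text stream trails the audio it describes by two frames.
for a, t in zip(audio, delay(text, 2)):
    print(a, t)
```

Shift the audio stream instead of the text stream and the same alignment reads as streaming TTS: text arrives first and the audio for it trails by a fixed, known number of frames.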

Language coverage is the honest limit on this surface as of April 2026. Moshi is English only. Hibiki's output is English only. The Voice Donation Project collects voices in a small number of European languages. Japanese, Korean, Mandarin, Arabic, and Hindi are outside the public Kyutai output as of April 2026. That matters, because the plausible long-run use cases for full-duplex STS are multilingual. The encouraging signal is how fast the trajectory is moving: Hibiki-Zero in a single release expanded from French-only to French plus four European languages, with Italian bootstrapped from under 1,000 hours. If that rate of linguistic expansion holds, the next non-English full-duplex release from Kyutai is a two-to-three release problem, not a five-year problem. The GRPO pipeline Hibiki-Zero introduced is what makes that tractable without aligned training data.

One smaller technical caveat belongs with the architecture discussion too. Moshi's 200 ms latency is a model-latency measurement on an L4 GPU. End-to-end user-perceived latency adds network, audio buffering, voice activity detection, and jitter on top. The first-real-time-full-duplex title is real. The every-user-experiences-a-conversation-at-200-ms implication is not, and noting the gap matters because the number propagates downstream without the footnote.
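
A rough budget shows how the gap opens. Only the 200 ms model figure comes from the paper; every other line is an assumed placeholder, there to illustrate that the additions are individually small and collectively not.

```python
# Illustrative end-to-end latency budget for a hosted full-duplex session. Only the
# 200 ms model-latency figure comes from the Moshi paper; every other number is an
# assumed placeholder showing how the user-perceived figure drifts from the headline.
budget_ms = {
    "model latency (L4, from the paper)": 200,
    "client audio capture buffer (assumed)": 20,
    "network round trip (assumed)": 60,
    "server-side queueing / jitter buffer (assumed)": 40,
    "client playback buffer (assumed)": 20,
}
for stage, ms in budget_ms.items():
    print(f"{stage:45s} {ms:4d} ms")
print(f"{'user-perceived total (illustrative)':45s} {sum(budget_ms.values()):4d} ms")
```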

The architecture story this section tells is the visible part of what Moshi shipped. §7 returns to the less-visible part: the small supply of two-channel dyadic audio the architecture quietly rests on.

## 5. The field that Moshi built

*Bottom line: Kyutai is the lab that shaped the commercial landscape rather than being shaped by it. A follow-on fine-tune from NVIDIA is a citation, not a defection.*

The field gravity is real, and it is the clearest evidence that the open-source-as-distribution thesis is working. NVIDIA's January 2026 [PersonaPlex-7B-v1](https://huggingface.co/nvidia/PersonaPlex-7B-v1) is a Moshi fine-tune. [Sesame's CSM-1B](https://huggingface.co/sesame/csm-1b) reuses Mimi in a different architectural frame. Sesame itself raised $250 million from Sequoia and Spark in October 2025 at over a billion dollars in valuation, and kept its larger variants closed. A broader academic follow-on cluster, which the STS series aggregates as "Family 1: dual-stream plus codec" in its four-family taxonomy, is dominated by descendants of the Moshi idea. If Kyutai had not released Moshi with weights in September 2024, Family 1 as a public-research category would not exist. Some version of it would likely have arrived later, from a commercial lab, behind an API. The public reproducibility, the ability to fine-tune, and the educational value for downstream researchers would have been different in kind, not just in degree.

<div class="diagram" id="F9"><!-- benchmark postures: Big Bench Audio top-10 vs FDB-v2 open --></div>

The benchmark posture is the place where a superficial reading might find Kyutai lacking, and it is worth reading past the leaderboard to see the design choice. In the new reasoning-tuned STS sub-category, the top scores on Big Bench Audio in April 2026 belong to [Step-Audio-R1.1 at 97.0%, Gemini 3.1 Flash Live at 95.9%, and Grok Voice at 92.9%](https://artificialanalysis.ai/). Moshi does not appear in the top ten. The reason is not that Moshi cannot compete on those axes. The reason is that nobody at Kyutai is tasked with keeping Moshi's leaderboard submissions current, because the lab treats its public artifacts as the canonical statement and leaves evaluation to whoever wants to produce it. The Full-Duplex-Bench v2 paper was written by an external academic group using Kyutai's open weights. That is the point. A lab that does not gate its numbers behind a press cycle creates a field where many labs can refresh the numbers, which is closer to the open-science norm than the leaderboard-chasing alternative. Moshi's absence from the April 2026 top ten is not a quality signal. It is closer to a posture signal, and, to be fair, partly a resource-allocation result: a research lab of this size does not staff a continuous leaderboard-submission function the way a commercial product team with a Big Bench Audio launch roadmap does. Both readings are consistent with the evidence. The downstream consequence matters more than which one is load-bearing: the open field caught up to and refreshed the numbers on its own.

The follow-on economics follow the same logic. When a lab ships completely open, it does not get to choose who adopts fastest; the market does. The fastest adopters of Moshi's architecture have been well-capitalized commercial labs. A reasonable nonprofit might read that as a problem, because the commercial adopters capture most of the monetary value the architecture enables. The other reading, which fits Kyutai's posture better, is that the lab that becomes the template for an entire family of commercial releases has occupied field-gravity territory most open-science efforts never reach. "Field gravity" in this sense is a scientific and architectural claim, not an economic one. Kyutai is not collecting license fees and is not winning the commercial benchmark race, and conflating the two readings would flatten the argument the lab is actually making. The practical question is not whether a PersonaPlex fine-tune is a defection. It is whether the donors find that arrangement meaningful in year eight of the endowment. On the September 2024 to April 2026 evidence, the early read is yes.

## 6. The Gradium inflection

*Bottom line: Neil Zeghidour's September 2025 departure from Kyutai to found Gradium is the first real test of the founder-talent-retention question, and the early shape of it looks complementary rather than competitive.*

On December 2, 2025, Neil Zeghidour publicly announced that his new company Gradium had exited stealth with a [$70 million seed round](https://techcrunch.com/2025/12/02/gradium-70m-seed-ultra-low-latency-voice-ai/) led by FirstMark and Eurazeo, with DST Global, Korelya Capital, and Amplify Partners participating, plus Eric Schmidt as an angel. Gradium had formed three months earlier, in September 2025. Zeghidour was a co-founder of Kyutai, led the AudioLM work at Google Brain Paris before that, and through his first two years at Kyutai led the audio research program that produced Moshi, Mimi, and Hibiki. His new company is described as building ultra-low-latency voice AI audio language models, a for-profit commercial pursuit of the research line he helped define inside Kyutai. Kyutai's own [homepage](https://kyutai.org/) now describes Gradium as its "first spin-off" and positions the company as a path from open research to production-ready systems.

<div class="diagram" id="F7"><!-- kyutai vs gradium: nonprofit/open vs for-profit/closed, Schmidt bridge --></div>

The mechanical question is whether this is a founder-talent problem for Kyutai or something else. Four observations suggest it is something else.

**First**, Eric Schmidt is a donor to Kyutai and an angel in Gradium. If the two entities were read as competitive in the adversarial sense, one of the most diligent capital allocators in technology would not underwrite both. That does not prove the two are not competitive, but it is a data point against the simplest "founder left, now they compete for the same market" reading.

**Second**, the positioning is not symmetrical. Kyutai ships open-weights foundation research under CC-BY 4.0 with no product and no revenue. Gradium is a commercial entity pursuing productization of audio language models, presumably under closed or partially-closed distribution. These are different layers of the stack. Gradium customers are a category Kyutai does not serve. Kyutai outputs are a resource Gradium is likely to build on, directly or indirectly, along with the broader open literature. The relationship is closer to "Kyutai sets the open floor, Gradium productizes one commercial application of it" than to "Kyutai and Gradium fight for the same researcher or customer."

**Third**, the broader Kyutai research bench did not empty out with Zeghidour's exit. Alexandre Défossez, who carries the formal Chief Exploration Officer title, is still at Kyutai and has continued to lead the audio research line through Hibiki-Zero and Invincible Voice, both of which shipped after September 2025. The lab has continued its two-to-three-month release cadence across Pocket TTS, Hibiki-Zero, Invincible Voice, and OVIE since Gradium formed. The output signal does not look like a lab that lost its audio research capacity.

**Fourth**, from the donor's perspective, a founder leaving to start a for-profit company and raise $70 million on the strength of work initially done inside the lab is not obviously a failure of the endowment model. It is closer to the pattern where foundational research yields a commercial ecosystem around it, which the Bell Labs and early-DARPA eras produced at larger scale. If the endowment thesis is that serious researchers with decade-long cover will ship work that any other lab can use, one predictable consequence is that some of those researchers eventually build commercial entities on top of their own public output. That Gradium exists and raised at the scale it did is evidence the Kyutai thesis has legs, not evidence it is unraveling.

The honest caveat is that it is still early. Gradium does not yet have a public product as of April 2026. If the company ships something that looks like a direct commercial wrapper around a Kyutai architecture, the donor conversation gets more complex. If it ships something architecturally distinct that needed the $70 million to build from scratch, the complementarity reading holds. And there is a latent competitive surface worth naming: Gradium and Kyutai are both working on ultra-low-latency voice AI and the infrastructure underneath it, which on public positioning is the same technical territory rather than adjacent ones. The current complementarity comes from the research-versus-product split, not from the two organizations picking different technical problems. Either way, the Gradium launch is the first real data point on the talent-retention question for the Kyutai endowment model, and the early read is not the one the pessimistic version of this story would predict.

## 7. What comes next for the open full-duplex lineage

*Bottom line: Moshi's ability to do full-duplex traces back to roughly 2,000 hours of Fisher English Training Speech recorded in 2004. The next generation of open full-duplex models will rise or fall on whether a new two-channel data supply exists, and that question is where Kyutai and oto's work are adjacent rather than overlapping.*

The first-order takeaway from this profile is that the dual-stream-plus-codec branch of the public STS landscape is a Kyutai lineage. The four-family taxonomy in the STS model landscape article places PersonaPlex, CSM, and a cluster of academic follow-ons together as Family 1, and each of them traces back to Moshi or Mimi. Understanding Kyutai is most of the way toward understanding where open full-duplex STS came from.

The second-order takeaway is a data one, and it is the one that matters most for anyone building a next generation of open full-duplex models. Moshi's ability to do full-duplex is credited, in the paper, to a fine-tuning pass on Fisher English Training Speech, approximately 2,000 hours of two-channel telephone conversations published by the Linguistic Data Consortium in 2004 and 2005. The pretraining used roughly 7 million hours of web audio diarized into simulated two channels, and the instruction fine-tuning used roughly 20,000 hours of synthetic audio generated by a Kyutai TTS trained on 170 hours of real two-channel recordings. The specific chain matters. *The world's most famous open full-duplex model could not have been built without a twenty-year-old paid LDC dataset.* Article 04 of the STS series argues that YouTube and podcasts cannot train full-duplex STS, and Moshi is the load-bearing empirical case: even with 7 million hours of web audio, the full-duplex behavior had to come from the 2,000 hours of Fisher. The bottleneck is not compute, and it is not architecture. It is the two-channel dyadic audio supply.

<div class="diagram" id="F8"><!-- Moshi training pyramid: 7M h web / 20k h synth / 2k h Fisher --></div>
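
The proportions are worth spelling out. Taking the hours reported in the paper at face value, the Fisher data the full-duplex behavior rests on is a vanishingly thin slice of everything Moshi saw.

```python
# Relative scale of Moshi's training data tiers, using the hours reported in the paper.
tiers = {
    "web audio pretraining (diarized to simulated two-channel)": 7_000_000,
    "synthetic instruction audio (Kyutai TTS)": 20_000,
    "Fisher English two-channel telephone speech": 2_000,
}
total = sum(tiers.values())
for name, hours in tiers.items():
    print(f"{name:58s} {hours:>9,d} h  ({hours / total:.3%})")
```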

That observation locates the layer of the stack that has to exist alongside the Kyutai lineage. If the two-channel dyadic data supply is the real bottleneck, the interesting question is which companies are actually tackling it. Two are visible on the public record as of April 2026: [David.ai](https://www.david-ai.com/) and [oto](https://www.oto.earth/). Both are building permissively-licensed conversational audio supply at scale, and neither is trying to train a general-purpose voice foundation model. oto's February 2026 dataset release reached #3 on Hugging Face's trending datasets, which is a demand signal the full-duplex data layer had not previously produced at that scale. A permissively licensed dataset in the 100,000-to-500,000-hour band that Article 06 identifies as the foundation threshold for STS would remove Moshi's successor family's dependency on Fisher and its derivatives. Kyutai is one of the small number of labs in a structural position to use such a dataset effectively. The same observation applies to any Kyutai-derived architecture released by NVIDIA, Sesame, Gradium, or a new academic group over the next eighteen months.

Benchmark collaboration is the other natural seam. Kyutai has contributed to the Full-Duplex-Bench family at various points. An honest multilingual live examiner that would let Moshi be evaluated in Japanese or Korean against Step-Audio-R1.1 or Gemini 3.1 Flash Live does not exist in the public literature as of April 2026. Neither does a rigorous evaluation of paralinguistic output quality at audio level rather than through transcript proxies. Both are directly in the scope Kyutai would benefit from, and both are in the scope oto is working on.

If you are building voice AI, you use Kyutai's outputs. If you are building the infrastructure underneath voice AI, you think about what Kyutai's outputs assume and where the holes are. This profile is the second kind of piece. [oto](https://www.oto.earth/) is working on benchmarks and conversational speech datasets that would let the next generation of open full-duplex work sit on a cleaner foundation than Fisher 2004. If that is the kind of problem your lab or team cares about, `hello@fullduplex.ai`.

---

*This is verticals · v01 / 17. The Verticals is a companion series to the STS Series — long-form profiles of the labs, companies, and institutions shaping the open speech-to-speech landscape.*

---

_Originally published at [https://fullduplex.ai/blog/v01-kyutai](https://fullduplex.ai/blog/v01-kyutai)._
_Part of **The Verticals** · v01 / 17 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
