Fullduplex /
the verticals · v09 / 17 · #meta #fair-speech #open-weights · 8 sections · 5 figures

Meta FAIR Speech: six years, nine papers, and the field’s default citations.

Between June 2020 and October 2024, one corporate lab with no audio P&L shipped Wav2Vec2, HuBERT, dGSLM, MMS, Seamless, and Spirit-LM. By 2026, open full-duplex research thinks in the vocabulary Meta left behind. This piece walks the release cadence, the talent diaspora to Kyutai and Gradium, and why the floor this lab set outlasts its own release calendar.

verticals · v09 of 17 · subject profile
A corporate lab that does not monetize audio can still design the field’s base layer as a shared, openly published vocabulary. The loop has three moves — publish, release weights under commercially usable licenses, let the downstream lineage grow. Meta FAIR Speech ran that loop once a year for five years.
subject: Meta FAIR Speech · Menlo Park + Paris · 2020-2024 · 9 releases · 1,100+ languages (MMS) · Nature 2025

1. June 2020: the paper reviewers started asking why you did not cite

On June 20, 2020, four researchers at what was still called Facebook AI Research (Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, and Michael Auli) posted a paper titled Wav2Vec2 to arXiv. The headline claim looks modest on the surface. A transformer pretrained on unlabeled audio via self-supervised learning (learning features from the internal structure of the audio itself, without human-provided labels) and then finetuned on just ten minutes of labeled audio reaches a word error rate (WER) of 4.8 on the LibriSpeech ASR benchmark. No demo video. No product announcement.

Nearly six years later, in April 2026, the paper is the de facto default citation across the public audio-AI literature. Most speech-to-speech papers since 2021 either cite Wav2Vec2 directly or depend on it through a descendant tokenizer. Semantic Scholar flags Wav2Vec2 as highly cited. Its successor, HuBERT (2021), sits in the same position.

Put plainly, a researcher building audio AI in 2026 walks into a furnished room. The representation-layer weights are open. The tokenizer — which converts audio into a sequence of symbols, essentially a word splitter for speech — is open. The skeleton for two-speaker dialogue is open. Multilingual translation recipes are open. The lab that shipped most of these pieces was Meta’s, FAIR Speech.
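To make the furnished room concrete: the sketch below transcribes a clip with the open Wav2Vec2 weights through the Hugging Face transformers API. It assumes the published facebook/wav2vec2-base-960h checkpoint and greedy CTC decoding with no language model; it illustrates how low the barrier to entry is, not a benchmark setup.

```python
# Minimal ASR with the open Wav2Vec2 weights via Hugging Face transformers.
# Illustrative sketch, not a benchmark script: greedy CTC decoding, no LM.
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

CKPT = "facebook/wav2vec2-base-960h"  # one of the published checkpoints
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT)

def transcribe(waveform: np.ndarray, sample_rate: int = 16_000) -> str:
    """waveform: 1-D float array of mono audio, already resampled to 16 kHz."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, frames, vocab)
    ids = torch.argmax(logits, dim=-1)              # greedy CTC path
    return processor.batch_decode(ids)[0]
```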

What Meta FAIR Speech’s six years add up to is one institutional template: a corporate research lab that does not monetize audio directly can still design the field’s base layer as a shared, openly published vocabulary.

The loop has three moves. Point 1, publish papers. Point 2, release weights under commercially usable licenses like CC-BY 4.0 or Apache 2.0 (licenses that explicitly allow other companies to build products on top). Point 3, let a downstream lineage of derivative models grow. That loop turned over at roughly one major release per year from 2020 to 2024, and by 2026 open full-duplex research is thinking in the vocabulary Meta left behind.

Joelle Pineau, who led FAIR, left Meta at the end of May 2025 and became Cohere’s Chief AI Officer in August that year. She wrote a short line on X: “Helping advance cutting-edge research and product development.” (Pineau X, 2025-08). The departure is telling. The lead left, but the 2020-to-2024 release cadence did not stop. As of 2026, the reference vocabulary has settled into the field so thoroughly that it no longer depends on any single person.

fig.f1 · release cadence, 2020-2026
[Figure F1 chart: timeline of Meta FAIR audio releases, 2020-2026. Models: Wav2Vec2, GSLM, HuBERT, dGSLM, MMS, Voicebox, Seamless v1, Seamless v2 + Expressive + Streaming, Spirit-LM. Legend: fully open / license restricted / weights closed.]
Figure F1. Nine public audio foundation models shipped from June 2020 to October 2024. Color marks license posture (orange fully open, yellow restricted, red weights closed). Source: items enumerated in Sections 2 and 3.

2. Twelve researchers, and the six who left Paris together

FAIR Speech reads more accurately as a set of research lines carried by about a dozen named scientists than as a single team. Naming them here keeps the rest of the argument concrete.

Alexei Baevski is the first author of Wav2Vec2 and the person who opened the self-supervised representation line. Wei-Ning Hsu is the first author of HuBERT and appears on nearly every FAIR speech-LM paper since. In late 2024 he gave a talk at JHU CLSP titled “Large Scale Universal Speech Generative Models,” a signal that next-generation audio generative models are in progress internally. Michael Auli is the senior author on Wav2Vec2 and is now Principal Research Scientist and Director at Meta FAIR Menlo Park, the person who scaled speech technology past 1,000 languages through the MMS project. Tu Anh Nguyen is the first author of the dGSLM paper and the lead author on Spirit-LM. He is the person who pushed the speech-LM line toward dialogue and interleaved modality (interleaving means weaving text and audio tokens into a single stream, essentially stringing written and spoken tokens on one thread so the model can digest both at once).

The load-bearing fact here is that six researchers from the FAIR Paris office moved to Kyutai together in 2023 and 2024. They did not leave one by one. They moved as a group. Alexandre Défossez was the first author on Encodec (now the de facto standard neural audio codec) and on MusicGen at FAIR Paris, then joined Kyutai as Co-founder and Chief Exploration Officer. Eugene Kharitonov and Jade Copet were co-authors on GSLM and dGSLM and also went to Kyutai. Hervé Jégou, Edouard Grave (co-author on the LLaMA paper), and Laurent Mazaré joined in the same window. That collective move is the direct reason the Kyutai profiled in v01 exists at all.

fig.f2 · named scientists grid
[Figure F2 grid: FAIR Speech, ten researchers · lead releases · current affiliation]
Still at Meta (or adjacent):
· Alexei Baevski · Wav2Vec2 (2020), first author · Meta FAIR
· Wei-Ning Hsu · HuBERT (2021), Spirit-LM senior · Meta FAIR
· Tu Anh Nguyen · dGSLM (2022), Spirit-LM (2024), lead · Meta FAIR
· Abdel-rahman Mohamed · Wav2Vec2 + HuBERT senior co-author · Rembrand (ex-FAIR)
· Emmanuel Dupoux · speech-LM line, cognitive anchor · Meta FAIR / Inria joint
Meta to Kyutai or Gradium:
· Alexandre Défossez · Encodec, MusicGen (FAIR Paris) · Kyutai (co-founder)
· Neil Zeghidour · AudioLM lineage, Kyutai 2024 · Gradium (founder, 2025)
· Eugene Kharitonov · GSLM + dGSLM co-author · Kyutai
· Edouard Grave · LLaMA co-author (FAIR Paris) · Kyutai
· Jade Copet, Hervé Jégou, L. Mazaré · FAIR Paris speech + infra · Kyutai
Figure F2. Four still at Meta (left column, orange border) and six FAIR Paris movers plus Mohamed (right column). The left column carries the floor-setting continuity of Sections 3 and 4, the right column carries the Bell Labs pattern of Section 5.

Add to this Pineau's departure, Yann LeCun's announced exit in November 2025, and the May 2025 split between AI Products and AGI Foundations. The 2020-to-2024 release cadence kept running through every one of these organizational changes. The cadence was not tied to any specific director; the institutional setup was designed to produce research output beyond any individual.

3. Nine papers in four and a half years, across three layers

The release calendar reads more clearly when sorted into three layers. A layered read produces a denser signal than a flat list of individual releases.

Layer 1 — representation (2020 to 2023)

Wav2Vec2 is the anchor, HuBERT cleans it up, and MMS (Massively Multilingual Speech, May 2023) scales the approach past 1,100 languages. HuBERT uses a two-stage k-means teacher strategy: the first stage clusters classical MFCC acoustic features, the second stage uses that output to re-cluster the transformer's internal representation. Essentially, sort roughly first, then use that rough sort as a template to re-sort cleanly. The result is cleaner phonetic-unit labels than Wav2Vec2 produces (labels at the level of individual sounds: "this stretch is /a/, this stretch is /i/"). The multilingual payoff is measurable. MMS versus Whisper puts Meta at roughly half the WER while covering roughly eleven times the languages, and MIT Technology Review wrote in January 2025 that "Seamless can translate text with 23% more accuracy than the top existing models."
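The two-stage idea fits in a few lines. The sketch below is a toy under loud assumptions: the real recipe trains the transformer with a masked-prediction loss between the two passes, uses its own cluster counts, and runs over corpus-scale data. The function names and the choice of MiniBatchKMeans are ours for illustration, not HuBERT's.

```python
# Toy sketch of the two-stage k-means teacher. Stage 1 clusters classical
# MFCC frames into rough units; after a transformer is trained to predict
# those labels under masking, stage 2 re-clusters one of its hidden layers
# into cleaner units. Corpus scale and the masked-prediction training step
# are elided here.
import numpy as np
import librosa
from sklearn.cluster import MiniBatchKMeans

def stage1_labels(wav: np.ndarray, sr: int = 16_000, k: int = 100) -> np.ndarray:
    """Rough sort: k-means over MFCC frames -> one cluster id per frame."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T  # (frames, 13)
    return MiniBatchKMeans(n_clusters=k).fit_predict(mfcc)

def stage2_labels(hidden: np.ndarray, k: int = 500) -> np.ndarray:
    """Clean re-sort: k-means over transformer activations (frames, dim),
    e.g. an intermediate layer of the model trained on stage-1 labels."""
    return MiniBatchKMeans(n_clusters=k).fit_predict(hidden)
```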

Layer 2 — speech-LM and generative (2021 to 2024)

GSLM (2021) introduced the template: train a causal language model (the GPT-style "predict the next token" recipe) directly on discretized audio tokens, using no text at all. dGSLM (2022) extended that into two-speaker dialogue with a dual-tower cross-attention transformer (two towers that watch each other while speaking in parallel, one brain for speaker A and one for speaker B, each listening to the other) trained on about 2,000 hours of Fisher telephone conversation data. dGSLM is now recognized as the architectural ancestor of the dual-stream plus codec family that runs from Moshi to Sesame CSM to NVIDIA PersonaPlex. Voicebox (June 2023) handled synthesis, denoising, and style transfer in a single model using flow matching (a mathematical recipe for nudging noise into a target audio waveform in small steps). Spirit-LM (October 2024) is a continued-pretraining model built on Llama-2 7B that interleaves text BPE tokens with HuBERT audio tokens, and it shipped alongside an Expressive variant (adding pitch and style tokens so that prosody, the speed and intonation of the voice, survives the round trip between text and speech).
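The dual-tower shape is easy to miniaturize. The block below is a structural sketch of the idea (each speaker's stream self-attends to its own history, then cross-attends to the other stream), not the published dGSLM code; causal masking, feed-forward layers, and normalization are omitted.

```python
# Structural sketch of a dual-tower dialogue block in the dGSLM spirit.
# Names and sizes are illustrative; the real model adds causal masks,
# feed-forward sublayers, and layer norm.
import torch
import torch.nn as nn

class DualTowerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        # a, b: (batch, time, d_model) token streams for speakers A and B.
        a = a + self.self_a(a, a, a)[0]    # A attends to its own history
        b = b + self.self_b(b, b, b)[0]    # B attends to its own history
        a = a + self.cross_a(a, b, b)[0]   # A listens to B while speaking
        b = b + self.cross_b(b, a, a)[0]   # B listens to A while speaking
        return a, b
```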

Layer 3 — multilingual speech translation (2023)

The SeamlessM4T family. Covered in detail in Section 4.

fig.f3 · releases per year, by layer
[Figure F3 chart: Meta FAIR audio releases per year, 2020-2025, split by layer (representation / speech LM and generative / multilingual translation); 2023 is the peak year, 2025 shows none]
Figure F3. Releases per year, color-coded by layer. 2020 to 2022 is the representation layer and its first speech-LM descendants. 2023 is the peak year, with SeamlessM4T v1, v2, Expressive, and Streaming arriving in a single burst. Source: release notes for each model.

Nine papers in four and a half years. That is a striking pace for a research lab with no audio P&L. “Floor-setter” can sound abstract. The concrete version is this. If you write an audio paper in 2026 and cite none of these nine, a reviewer will ask why. That is the practical definition of a foundation model’s reach.

Two observations follow. First, the HuBERT tokenizer is the most widely inherited artefact in this portfolio, used in Spirit-LM and in most 2025 academic audio LMs. Second, Sesame CSM is the exception that does not use HuBERT. CSM inherits Kyutai’s Mimi codec (a descendant of Encodec, detailed in Section 4). The 2026 open full-duplex lineage splits between a HuBERT line (Spirit-LM and most academic audio LMs) and a Mimi-codec line (Moshi, CSM, PersonaPlex). Both ultimately trace back to Meta, but the route differs.

4. The SeamlessM4T bet, and the Encodec that preceded Mimi

SeamlessM4T is the largest-scale single release in this portfolio, and the cleanest worked example of what floor-setting looks like.

SeamlessM4T v1 shipped in August 2023 and targeted speech-to-text translation across roughly 100 languages and speech-to-speech translation from 100 sources to 35 targets. v2 followed in November 2023 with the UnitY2 architecture (which first predicts character-level text and then hierarchically upsamples to audio units), improving latency and translation quality at the same time. SeamlessExpressive preserves speaking rate, pauses, and voice quality through translation. SeamlessStreaming adds real-time streaming operation. No prior open speech model had attempted all these dimensions at once, and closed commercial systems had not published at this scope either.
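The two-pass shape of UnitY2 can be written as a three-line pipeline. The names below are illustrative stand-ins, not the released Seamless code; the point is only the ordering: characters first, discrete audio units second, waveform last.

```python
# Hedged sketch of the UnitY2 two-pass shape: decode target-language text
# at character level first, then hierarchically upsample into discrete
# acoustic units, then vocode. All names are hypothetical stand-ins.
from typing import Callable, List, Sequence

def translate_speech(
    src_audio: Sequence[float],
    text_decoder: Callable[[Sequence[float]], str],               # pass 1
    unit_upsampler: Callable[[str, Sequence[float]], List[int]],  # pass 2
    vocoder: Callable[[List[int]], List[float]],
) -> List[float]:
    chars = text_decoder(src_audio)           # target text, character level
    units = unit_upsampler(chars, src_audio)  # characters -> audio units
    return vocoder(units)                     # units -> waveform samples
```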

fig.f4 · SeamlessM4T in five cards
[Figure F4: the SeamlessM4T bet, in five cards]
· S2T scope: ~100 languages, near-full coverage of major modern languages
· S2S scope: 100 sources to 35 audio output targets
· Architecture: UnitY2, hierarchical upsampling from character units to audio units (v2)
· Variants: Expressive (rate, pauses, voice preserved) and Streaming (real-time operation)
· Credentialing signal: Nature, February 2025. Nature publication for audio foundation models is rare; it signals Meta's posture of scientific contribution, not product enablement
Figure F4. The SeamlessM4T bet broken into five cards: S2T scope, S2S scope, architecture, variants, and the February 2025 Nature publication. No prior open speech model had attempted all these dimensions at once. Sources: arXiv 2308.11596, arXiv 2312.05187, Nature 41586-024-08359-z.

The credentialing signal landed in February 2025: SeamlessM4T was published in Nature. Nature is a rare venue for an audio foundation model; this kind of work usually goes to Interspeech, ICASSP, or NeurIPS. Meta is treating the line as a scientific contribution rather than a product enablement. Spain's Science Media Centre offered a sharp counter-observation: "The paper does not seem to differ from what Meta has already made openly available on its github repository." That reaction questions the ritual of re-publishing open-source work in a journal, and we note it here as one of the two-sided responses an open-weights posture tends to generate.

A word on license posture. v1 shipped under a noncommercial research license on Hugging Face. Expressive and Streaming are more restrictive than v1. This is a defensible research-ethics posture focused on voice-cloning misuse risk at the generative frontier, not a retreat from the portfolio’s open-weights trajectory overall. Against a 2026 landscape where Moshi ships CC-BY 4.0 and Qwen3-Omni ships Apache 2.0, Meta’s choice reads as “open, but cautious at the frontier.”

An attribution that occasionally gets muddled in the field deserves to be stated clearly here. Sesame CSM-1B inherits the Mimi codec (built by Kyutai as a descendant of Encodec), not HuBERT units. On the architecture side, CSM borrows dGSLM’s dual-stream template by way of Moshi. In short, CSM’s acoustic tokenizer is Kyutai-lineage, and its dialogue skeleton is Meta-lineage. A mixed inheritance. Earlier drafts that described CSM as using HuBERT were factually wrong, and v6 corrects that explicitly.

5. The Bell Labs pattern: researchers who moved and extended the lines they started

The strongest evidence that Meta’s open-weights bet is paying off is not the lab’s own citation count. It is what the researchers who left are now building.

Alexandre Défossez shipped Encodec and MusicGen at FAIR Paris, then left to become Kyutai’s Chief Exploration Officer. Neil Zeghidour first-authored AudioLM at Google Brain Paris, joined Kyutai, then left in September 2025 to found Gradium on a $70M seed. Hervé Jégou, Edouard Grave, Laurent Mazaré, and Jade Copet left FAIR Paris in the same window and co-founded Kyutai. Eugene Kharitonov joined the Kyutai cohort. As of April 2026, Kyutai is effectively a FAIR Paris speech reunion tour, and Gradium is the commercial spinout from that reunion.

fig.f5 · talent diaspora flow
[Figure F5 diagram: FAIR Paris alumni flow and downstream open STS. Origin: FAIR Paris, speech + codec, 2015-2024. Destinations: Kyutai (2023, nonprofit lab; Défossez, Jégou, Grave, Kharitonov, Copet, Mazaré) and Gradium (2025, $70M seed; Zeghidour). Downstream artefacts: Moshi (CC BY 4.0), Hibiki / Hibiki-Zero, Sesame CSM (Mimi reuse), PersonaPlex (Moshi fine-tune), Gradium ultra-low-latency voice]
Figure F5. FAIR Paris alumni flow. Six moved to Kyutai in 2023 and 2024, Zeghidour later moved on to Gradium. Each destination extends a Meta-origin idea (Encodec to Mimi, dGSLM to Moshi dual-stream). Sources: Kyutai and Gradium founding announcements, model architecture papers.

The research lines being extended at each destination are lines Meta itself started. Moshi’s dual-stream architecture is a direct descendant of dGSLM. Mimi, the codec Moshi stands on, is a descendant of Encodec. Hibiki inherits the same DNA. NVIDIA PersonaPlex fine-tunes Moshi. Sesame CSM-1B reuses Mimi (not HuBERT units).

Read mechanically, this is the Bell Labs pattern. When a corporate lab keeps shipping foundational research under permissive licenses, the researchers who produced it become independently employable inside the ecosystem that research created.

Bell Labs did it with the transistor and Unix. Xerox PARC did it with personal computing. Meta FAIR is restaging the same structure for self-supervised audio representation, discrete audio tokenization, and the dual-stream full-duplex template. The sponsoring company does not capture all the commercial upside, but the field advances faster than the sponsor could alone.

In an October 2024 Business Today interview, Yann LeCun put it in a single line: “Open source will win.” He repeated it more tersely in a January 2025 post on X: “Open source models are surpassing closed ones.” Layer in Nathan Lambert’s State of Open Models 2025 on Interconnects and the picture becomes this. From 2025 onward, open foundation models are being treated as a serious route to ecosystem influence, and DeepSeek R1 accelerated the trend. What Meta FAIR Speech has been doing in parallel since 2020 is the audio version of the same bet.

The corollary for Meta is that the alumni network works as an amplifier. Every time Kyutai ships Hibiki-Zero, every time Gradium builds a commercial product, every time Sesame fine-tunes CSM on top of Mimi, the citation graph points back to research that started at FAIR. Meta’s floor-setting keeps setting the floor even when the new work ships under a different flag. That is a more durable payoff than trying to win the full-duplex STS market release by release.

6. How to read the 2024-to-2026 slowdown

Here we take the strongest counterargument head on. Compared with the 2020-to-2023 peak, release velocity in 2024 to 2026 has visibly slowed. Has the open-weights commitment gone stale?

On the facts: no public release in calendar 2025 matched Spirit-LM or SeamlessM4T in scope. Spirit-LM in October 2024 is the most recent visible foundation release. In the 18 months since, the organization went through Pineau's departure, LeCun's announced exit, and the AI Products / AGI Foundations split. Three readings follow.

Point 1 — possible structural shift

A corporate lab that does not treat audio as a P&L can maintain a decade-scale commitment only while the parent company’s strategic environment stays stable. Meta from 2024 to 2026 has been running a continuous cycle of GenAI integration and reorganization, and we cannot rule out that FAIR Speech’s priorities are being absorbed into product capabilities (Meta AI, Llama 4 omni, Ray-Ban Meta). Zuckerberg wrote in January 2025: “Llama 4 will be natively multimodal — it’s an omni-model” (Stratechery via Simon Willison). That points in the absorption direction. If speech capability gets absorbed as an internal feature of the Llama line, standalone “FAIR Speech releases” diminish.

Point 2 — normal variance

SeamlessM4T v2, Expressive, and Streaming all landed together in November 2023. Spirit-LM landed in October 2024. For a lab that ships one foundation-scope release per year, an 18-month gap is within normal variance. Wei-Ning Hsu’s 2024 JHU CLSP talk, “Large Scale Universal Speech Generative Models,” signals that next-generation speech generative work is in progress inside Meta.

Point 3 — the constructive read

The nine artefacts shipped from 2020 to 2024 keep contributing to field progress even while 2025 and 2026 look quiet. Moshi was released in September 2024. NVIDIA PersonaPlex fine-tuned Moshi and went public in January 2026. Sesame released CSM-1B on top of Mimi in March 2025, then raised $250M from Sequoia and Spark in October that year. None of these are new Meta releases, but they are evidence that Meta’s 2020-to-2024 releases are still moving the field.

Our reading is Point 2 layered on Point 3. The slowdown may be a signal of structural shift, but the influence of the 2020-to-2024 artefacts as reference vocabulary operates independently of whether new releases land. If Meta ships a Spirit-LM-scope or SeamlessM4T-scope release during 2026, Point 1 weakens. If it passes silently through 2027, Point 1 strengthens. Either way, the cumulative impact of 2020 to 2024 does not retroactively diminish.

7. The position in the 2026 STS landscape

Why the Meta FAIR Speech story matters beyond the lab itself, in three points.

First, the corporate-research template itself. The template of subsidizing an open foundation model line without expecting direct monetization from audio. It is now being imitated by Kyutai (nonprofit), Alibaba Qwen3-Omni (cloud and commerce revenue), and Tencent Covo-Audio (gaming and payments). Meta’s version is the oldest and the most richly documented. It is the reference point in the heads of researchers and investors evaluating any new open-foundation lab. Put plainly, FAIR Speech was the prototype for combining industrial scale (several to ten times Kyutai’s size) with an open-weights posture.

Second, the pretraining-versus-post-training data disconnect, as a worked example. The central claim of Article 04 in the STS series is that pretraining-stage and post-training-stage data have different requirements. Meta's internal portfolio illustrates that cleanly. Wav2Vec2 and HuBERT saturated on Libri-Light, 60,000 hours of monaural read speech. dGSLM had to leave the read-speech corpus behind for Fisher (two-channel telephone conversation recordings released by LDC in 2004). Pretraining accepts monaural read speech at web scale; post-training needs two-channel conversational speech, and that does not exist at web scale. The "foundation threshold" argument in Article 06 uses this disconnect as evidence of a scaling bottleneck.
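The difference is measurable in code. With one speaker per channel, overlap becomes a direct computation rather than an inference problem, which is precisely what a monaural corpus like Libri-Light cannot offer. A minimal sketch, assuming a stereo file with one speaker per channel and a crude energy threshold standing in for a real voice-activity detector:

```python
# Fraction of frames where both speakers are talking at once, computed
# directly from a two-channel recording. Crude energy-based activity
# detection; a real pipeline would use a proper VAD.
import numpy as np
import soundfile as sf

def overlap_ratio(path: str, frame_ms: float = 25.0, thresh_db: float = -35.0) -> float:
    """Assumes a two-channel file with one speaker per channel."""
    audio, sr = sf.read(path)
    assert audio.ndim == 2 and audio.shape[1] == 2, "needs one speaker per channel"
    frame = int(sr * frame_ms / 1000)
    n = (len(audio) // frame) * frame
    chunks = audio[:n].reshape(-1, frame, 2)                  # (frames, samples, 2)
    rms_db = 20 * np.log10(np.sqrt((chunks ** 2).mean(axis=1)) + 1e-9)
    active = rms_db > thresh_db                               # crude per-channel VAD
    return float((active[:, 0] & active[:, 1]).mean())        # overlapped speech
```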

Third, the product-integration side: rich ingredients and rich channels, with no bridge yet. Meta owns one of the most valuable audio hardware distribution channels in the industry. Ray-Ban Meta hit 1 million units in 2024, approximately 2 million cumulative units by February 2025, and more than tripled H1 sales in 2025. But the Meta AI assistant runs on a half-duplex “Hey Meta” prompt-response flow today. Gen 3 glasses are expected in 2026 with better microphones and a Qualcomm Snapdragon AR chipset. Meta’s Llama 4 announcement in April 2025 claimed natively multimodal “omni” capability including audio, but as of April 2026 there is no public evidence of a user-facing full-duplex deployment. Put plainly, the research portfolio supplies rich ingredients, the hardware channel supplies rich consumption, and the bridge between them has not been announced.

The implication for data infrastructure is specific. If Meta returns to aggressive investment in full-duplex speech modeling (through Llama 5 voice or on-device conversational AI on Ray-Ban Meta Gen 3), it will need training data that looks like Fisher and not like Libri-Light. And at a scale that neither LDC nor any single research consortium has ever supplied. The hundreds of thousands of hours of public two-channel conversational audio that Article 06 argues are required do not exist today. That supply gap is the central thesis of Fullduplex.ai’s dataset work.

8. Summing up: staying ancestral

The most accurate read on Meta FAIR Speech’s role in 2026 is ancestral rather than competitive. The lab that shipped Wav2Vec2, HuBERT, and dGSLM still anchors the tokenizer stack and the dual-stream dialogue template that almost every 2026 open full-duplex STS model stands on. The 2020-to-2024 artefacts, spanning representation, speech LM, generative, and multilingual translation, are the quiet infrastructure the next decade of STS research will stand on.

One takeaway is worth isolating. A corporate research lab that does not monetize audio directly can set the floor for an entire field, so long as it keeps shipping openly and consistently over a long horizon. Once the next generation of researchers can inherit ideas without asking permission, the lab's influence becomes independent of its release cadence. Meta's bet paid off twice. First, inside the academic citation graph, where Wav2Vec2 and HuBERT are treated as default base layers. Second, inside the talent diaspora, where FAIR Paris alumni are advancing the same architectural DNA at Kyutai and Gradium. The two readings do not compete. They compound.

Three signals to watch over the next five years. First, whether Meta ships a Spirit-LM-scope or SeamlessM4T-scope FAIR audio release during 2026. Shipping weakens Point 1, going silent through 2027 strengthens it. Second, whether Llama 5 voice or Ray-Ban Meta Gen 3 carries FAIR audio research onto a user-facing full-duplex surface. The hardware is ready, the research is ready, the bridge is not announced. Third, whether the FAIR Paris bench recovers or keeps thinning. The 2023-to-2025 outflow to Kyutai and Gradium visibly thinned the roster. Watch how hiring and retention move from here.

Whether the 2024-to-2026 slowdown and the Pineau / LeCun departures point to a structural shift or to normal variance in the research cycle will be decided by releases over the rest of 2026 and into 2027. Either way, the cumulative impact of 2020 to 2024 does not retroactively diminish. That is the single most important invariant in this story.

Benchmark collaboration. Fullduplex.ai is building permissively licensed two-channel full-duplex conversational datasets and honest full-duplex evaluation infrastructure so that next-generation STS models can stand on a broader foundation than Fisher 2004. If you are a Meta FAIR researcher or work in an affiliated academic lab on full-duplex benchmarks or post-training recipes that need two-channel data, a one-line email to hello@fullduplex.ai is welcome. Benchmark collaboration proposals on listener behavior, turn-taking, or paralinguistic output scoring are welcome.