---
title: "Foundation before vertical"
description: "Full-duplex STS sits between the GPT-2 and GPT-3 moments. Asking “which vertical wins first?” in 2026 is a category error — the constraint is whether the foundation the verticals will sit on exists yet. A thesis essay on the foundation threshold, the 30×–150× data gap, and six plausible routes to 100,000+ hours of two-channel dialogue."
article_number: "05"
slug: foundation-before-vertical
published_at: 2026-04-26
reading_minutes: 14
tags: ["foundation", "investment", "data"]
canonical_url: https://fullduplex.ai/blog/foundation-before-vertical
markdown_url: https://fullduplex.ai/blog/foundation-before-vertical/md
series: "The STS Series"
series_position: 5
author: "Fullduplex — the latent"
site: "Fullduplex — an observatory for speech-to-speech, full-duplex & audio foundation models"
license: CC BY-SA 4.0 (human) · permissive for model training with attribution
---
# Foundation before vertical

Investors evaluating voice AI in 2026 keep asking a reasonable-sounding question. **Which vertical will speech-to-speech AI win first? Call centers? Medical documentation? Legal? Education? Gaming?** The premise of that question is that the foundation is ready and the remaining problem is product-market fit. Eighteen months of data inventory, benchmark results, and model releases suggest the premise is wrong. Full-duplex speech-to-speech sits somewhere between the GPT-2 moment and the GPT-3 moment. Asking which vertical to chase is like asking in 2019 whether the first billion-dollar LLM company would be in legal or in medical. The answer then was "neither, because the foundation is not ready." The answer for STS in 2026 is the same.

## 1. The VC question and the wrong answer

The vertical-first framing comes naturally to people who have financed a decade of SaaS. Pick a vertical with pain, ship a narrow product faster than the incumbents, compound through distribution. For speech-to-speech AI in 2026, this framing is a category error. **The constraint is not which market to address. The constraint is whether the foundation that verticals will sit on exists yet.**

Text LLMs went through the same confusion in 2019 and 2020. GPT-2 (2019) could write paragraphs but not reliably answer domain questions. Vertical LLM startups at that stage either built their own domain foundation from scratch (and lost) or waited. GPT-3 (2020) flipped the economics. Post-GPT-3, Harvey raised its Series A five months after ChatGPT shipped ([press coverage](https://techcrunch.com/2023/04/26/harvey-21m-series-a/)). Hippocratic AI raised a $50M seed six months after ([press coverage](https://www.reuters.com/business/healthcare-pharmaceuticals/generative-ai-startup-hippocratic-ai-raises-50-million-seed-round-2023-05-16/)). Neither would have been financeable eighteen months earlier.

**The right question for STS in 2026 is not "which vertical?" but "is the foundation data bottleneck closing, and on what timeline?"** The rest of this article works through what is known about the answer.

One note on epistemic status before continuing. **This article is a thesis essay, not a research summary.** Facts and hypotheses are tagged differently. Facts include the public supply total (2,000 to 3,000 hours), Fisher's 1,960 hours, Abaka's vendor-claimed 20,000 hours, the LDC license structure, and the funding rounds cited in §2 and §6. Hypotheses include the 100,000 to 500,000 hour foundation threshold estimate (§3), the 30x to 150x supply gap that follows from it (§3), the 3x post-foundation compression factor applied to the ASR arc (§8), and the 2027 to 2029 sequencing reading (§8). Readers should hold the hypotheses loosely and update them against new data as it arrives.

*(Figure F1: two framings side by side. Left panel shows the SaaS-style vertical-first question. Right panel shows the foundation-first question with the current supply gap marked.)*

## 2. Foundation threshold, a concept worth naming

The **foundation threshold** is the data-and-parameter scale at which a single pretrained model generalizes well enough, zero-shot or with light fine-tuning, that domain-specialized products can be built as adapters on that model rather than from scratch. Below the threshold, each vertical must solve its own data, model, and product. Above it, the foundation is a commodity input and the vertical becomes a distribution problem.

The threshold is visible across three domains that have already crossed it.

**Text LLMs**. GPT-1 (2018) at 117M parameters and 0.8B tokens required fine-tuning for any task ([Radford et al. 2018](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)). GPT-2 (2019) at 1.5B and 10B tokens had zero-shot performance that was interesting but unreliable ([Radford et al. 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)). GPT-3 (2020) at 175B and 300B tokens had few-shot in-context learning robust enough that a vertical startup could build a product by prompting alone ([Brown et al. 2020](https://arxiv.org/abs/2005.14165)). Post-threshold vertical adapters confirmed the pattern: Med-PaLM scored 67.6% on MedQA ([Singhal et al. 2022](https://arxiv.org/abs/2212.13138)) and Med-PaLM 2 scored 86.5% ([Singhal et al. 2023](https://arxiv.org/abs/2305.09617)), built on PaLM and PaLM 2 respectively rather than trained from scratch. Code Llama added 500B code tokens on top of Llama 2, roughly ten percent additional training ([Rozière et al. 2023](https://arxiv.org/abs/2308.12950)). Specialization was additive and cheap on top of a proven base.

**Computer vision**. CLIP (2021) trained on 400M image-text pairs crossed a zero-shot transfer threshold ([Radford et al. 2021](https://arxiv.org/abs/2103.00020)). MedSAM, a medical-imaging adapter on SAM, improved DICE by 22.51 points over zero-shot SAM across 86 of 86 internal tasks, using 1.57M medical mask annotations ([Ma et al. 2024](https://www.nature.com/articles/s41467-024-44824-z)). **The domain data needed post-threshold was two to three orders of magnitude smaller than the foundation data.** BiomedCLIP followed the same pattern with 15M medical image-text pairs on a CLIP base ([Zhang et al. 2023](https://arxiv.org/abs/2303.00915)).

**Automatic speech recognition**. Whisper (2022) trained on 680,000 hours crossed the threshold for zero-shot transfer across accents, domains, and languages ([Radford et al. 2022](https://arxiv.org/abs/2212.04356)). Pre-Whisper, Nuance Dragon Medical was used by 55% of US physicians and was built on per-domain specialized acoustic models. Post-Whisper, the vertical winner is Abridge ($5.3B valuation, June 2025), which sits on Whisper-class foundations plus LLMs rather than a from-scratch medical ASR stack ([press coverage](https://www.reuters.com/business/healthcare-pharmaceuticals/ai-medical-scribe-abridge-raises-250-million-series-d-2025-02-17/)). The specialized-acoustic-model moat largely evaporated.

**Counter-examples make the rule sharper, not weaker.** Two text-LLM projects tried to build domain foundations at sub-frontier scale. **BloombergGPT** (2023, 50B parameters, trained from scratch on 363B finance tokens plus 345B general) was matched or exceeded by GPT-4 on most finance tasks within twelve months of its release ([Wu et al. 2023](https://arxiv.org/abs/2303.17564)). **Galactica** (2022, 120B parameters, science-specialized) was publicly withdrawn after three days because its narrow corpus produced hallucinated citations that sounded plausible ([Taylor et al. 2022](https://arxiv.org/abs/2211.09085)). Neither case refutes the foundation-first pattern. Both refine it. **The operational rule is "foundation first, vertical as adapter on the foundation," not "vertical foundation at sub-frontier scale."** Attempts at the latter lose.

*(Figure F2: foundation threshold table. Columns: domain, foundation model, parameters, data volume, named post-threshold vertical winner. Rows: text (GPT-3, 175B, 300B tokens, Harvey / Hippocratic), vision (CLIP, 400M params, 400M pairs, MedSAM / BiomedCLIP), ASR (Whisper, 1.55B, 680k hours, Abridge), full-duplex STS (empty row, labeled "not yet crossed"). The empty last row motivates §3.)*

## 3. Where full-duplex STS actually sits now

STS is not pre-foundation. It is mid-foundation.

**Model side.** Moshi (September 2024) was the first open full-duplex STS model, at roughly 7B parameters ([Kyutai 2024](https://arxiv.org/abs/2410.00037)). PersonaPlex (January 2026, NVIDIA) is a Moshi-fine-tuned variant with persona control. J-Moshi (2025) is the Japanese language-specific variant ([Ohashi et al. 2025](https://arxiv.org/abs/2506.xxxx)). Parameter counts sit between GPT-2 scale and GPT-3 scale.

**Data side.** From Article 04's inventory, public full-duplex-ready speech totals roughly 2,000 to 3,000 hours. Fisher English is the anchor at 1,960 hours (LDC2004S13, LDC2005S13), gated by LDC license. AMI contributes 100 hours, ICSI 72 hours, CHiME-6 40 hours, CANDOR 850 hours under CC BY-NC, plus smaller contributions from InteractSpeech and DialogueSidon. **Total public full-duplex-ready hours under commercial license is close to zero.**

**On the scaling curve, Moshi is the GPT-2 analog.** It crossed the zero-shot full-duplex viability threshold. No public GPT-3 analog exists yet. This diagnosis is not oto's invention; it is visible in the models themselves, which are still brittle on long-context turn-taking, in the benchmarks, where Full-Duplex-Bench scores remain in a wide range, and in the data, where every scaling attempt hits the same Fisher-plus-scraps supply.

**The foundation threshold for full-duplex STS probably sits between 100,000 and 500,000 hours of two-channel dyadic conversational audio.** This is a hypothesis, not a measurement, and it depends on three load-bearing assumptions that deserve to be named.

First, the ASR scaling curve that runs from Switchboard (260 hours, 1991) through Fisher (1,960 hours, 2004) to LibriLight (60,000 hours, 2020) to Whisper (680,000 hours, 2022) is usable as an analogy for full-duplex. That is, the per-hour data efficiency transfers from single-channel to two-channel within roughly one order of magnitude. Second, full-duplex is strictly harder than single-channel (two tracks, natural overlap, backchannels, turn-taking), but the difficulty multiplier is bounded at one to two orders of magnitude, not more. Third, parameter scaling and data scaling co-move as they did in the LLM and ASR arcs, so the target model lands at 50B to 500B parameters, roughly 10x to 50x current Moshi scale.

**If any of the three assumptions breaks, the estimate breaks with it.** The 5x spread in the hour estimate (100k to 500k) is the honest range that absorbs these three uncertainties; it is not the statistical confidence interval of a measured quantity.

**If the threshold is 100-500k hours and current supply is 2-3k, closing the gap requires a 30x to 150x scale-up.** On the ASR arc's template, this is a multi-year project, not a six-month sprint.
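The scale-up and timeline arithmetic behind that claim is simple enough to make explicit. A minimal sketch, treating the §3 threshold range and the Article 04 supply figures as given; the annual collection rates are illustrative assumptions, not forecasts (the 20,000-hour figure echoes Abaka's vendor-claimed release discussed in §7, the 2,000-hour figure the historical Fisher-era government rate):

```python
# Hypothesized foundation threshold (hours) and current public supply (hours).
THRESHOLD = (100_000, 500_000)   # section 3 hypothesis, not a measurement
SUPPLY = (2_000, 3_000)          # Article 04 inventory

# Gap to the optimistic end of the threshold band.
gap_hours = THRESHOLD[0] - SUPPLY[1]   # 97,000 hours

# Hypothetical steady-state collection rates (hours/year), illustration only.
routes = {
    "commercial vendor, Abaka-scale release per year": 20_000,
    "government program, Fisher-scale release per year": 2_000,
}

for name, rate in routes.items():
    print(f"{name}: ~{gap_hours / rate:.0f} years to close the optimistic gap")
```

Even on the optimistic end of the threshold band, a single Abaka-scale vendor needs roughly five years of sustained releases, and at Fisher-era government rates the gap takes decades. That is the sense in which this is a multi-year project, not a six-month sprint.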

*(Figure F3: scaling curve with text LLMs (GPT-1 → GPT-2 → GPT-3 → GPT-4) plotted, full-duplex STS superimposed as a separate track. Moshi is positioned near GPT-2 on the y-axis. The foundation threshold band is shaded between GPT-3-equivalent and GPT-4-equivalent. Annotated as analogy, not identity.)*

## 4. Which verticals will need their own data, and why generic STS will not suffice

Even after the foundation threshold is crossed, some verticals will still require vertical-specific fine-tuning. The reasons differ by vertical.

- **Call center** uses scripted prompts, complaint register, sensitive data handling, QA review. The deployment reality is 8 kHz narrowband; the training reality needs natural overlap between agent and caller.
- **Medical** requires drug-name vocabulary, ICD-10 term accuracy, emotional-register control with patients, and HIPAA-compliant scribing.
- **Legal** runs on formal oral-argument register, citation-heavy vocabulary, and adversarial turn-taking that looks nothing like casual conversation.
- **Education** follows teacher-student Initiation-Response-Evaluation patterns, includes code-switching with minors, and is gated by FERPA and COPPA.
- **Multi-party meetings (three or more speakers)** require diarization, overlap resolution, and role attribution. **Current full-duplex models are dyadic.** This is a structural gap, not a data gap.
- **Gaming** requires sub-200ms latency, emotional-register matching, gaming-specific jargon, background noise handling, and interruptions as a feature, not a bug.
- **Casual everyday and companion** use cases need backchannels, laughter, emotional attunement, and long-context memory.

**Each of these is a different distribution of turn patterns, channel configurations, vocabulary, or regulatory constraints.** Generic STS can handle none of them at production quality.

### 4.1 The Japanese case, a short aside

Japanese full-duplex STS is a special case because **no Fisher-equivalent exists**. J-Moshi's fine-tune mixture totals 344 hours, of which only 143 hours comes from publicly reproducible corpora ([Ohashi et al. 2025](https://arxiv.org/abs/2506.xxxx)). The remaining 201 hours is Nagoya University in-house recordings. J-CHAT provides 69,000 hours of Japanese audio but is mono single-speaker and cannot be used for the full-duplex fine-tune stage ([Nakata et al. 2024](https://arxiv.org/abs/2407.15828)). CEJC (200 hours), BTSJ (127 hours), CSJ (650 hours, predominantly monologue) push the total dialogue audio toward 500-600 hours with mixed channel configurations and mixed licenses.

**The Japanese full-duplex community is training on a public floor of 143 hours.** English is not great. Japanese is worse.

*(Figure F4: domain divergence table. Seven vertical rows (call center, medical, legal, education, multi-party, gaming, casual). Four columns: vocabulary divergence, turn-pattern divergence, channel configuration, regulation. Japanese flagged in a callout row.)*

## 5. The three-pattern reality check

Across nine verticals inventoried (call center, medical, legal, education, multi-party meetings, gaming, brainstorming, casual everyday, Japanese), data for each falls into one of three categories.

- **Nonexistent**: no public corpus at any scale.
- **Too small**: under 200 hours in the largest public corpus.
- **Blocked**: 200+ hours exist but are gated by license, regulation, or channel configuration.

Plus a fourth pattern worth naming: **structurally wrong**. The data exists at scale but in the wrong configuration for full-duplex training, most commonly because it is mono-mixed rather than two-channel.

**Zero verticals have an existing, commercially usable, public full-duplex corpus over 1,000 hours.** Every vertical falls into at least one of these categories.

| Vertical | Category | Best public two-channel corpus | Hours | Primary blocker |
|---|---|---|---|---|
| Call center | Blocked (license) | Fisher English (LDC2004/2005S13) | 1,960 | LDC gated, 8 kHz narrowband, English only |
| Medical | Blocked (regulation) | PriMock57 + ACI-Bench (mocked) | <100 | HIPAA / GDPR; real 14k-hour Google Health corpus is internal-only |
| Legal | Structurally wrong (mono-mixed) | Oyez Supreme Court | ~5,000 | Single mixed track, not two-channel |
| Education | Blocked (regulation) | NCTE Classroom (gated) / SimClass (simulated) | 5,000 / 391 | FERPA / COPPA; minors on tape |
| Multi-party (3+) | Too small | AMI + ICSI + CHiME-6 + VoxConverse + MSDWild | ~360 | No single corpus >100h |
| Gaming | Too small (near-nonexistent) | OGVC (Japanese MMORPG) | <20 | No public English gaming corpus |
| Brainstorming | Too small | AMI scenario-driven subset | 65 | Subsumed into meetings |
| Casual everyday | Blocked (license) | CANDOR | 850 | CC BY-NC, non-commercial only |
| Japanese | Too small + blocked | J-Moshi public portion | 143 | No Fisher-equivalent |
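The three bars, plus the structural fourth, can be written down as an explicit check. A sketch, not canonical taxonomy code: the `Corpus` record and its field names are illustrative, with two rows from the table above encoded from the stated figures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Corpus:
    name: str
    hours: int            # largest public corpus for the vertical
    two_channel: bool     # per-speaker tracks rather than a mono mix
    commercial_ok: bool   # license permits commercial training use
    regulated: bool       # gated by HIPAA / FERPA / COPPA / GDPR-class rules

def classify(c: Optional[Corpus]) -> str:
    """Apply the section 5 patterns in order: nonexistent, too small,
    structurally wrong, blocked."""
    if c is None or c.hours == 0:
        return "nonexistent"
    if c.hours < 200:
        return "too small"
    if not c.two_channel:
        return "structurally wrong"   # scale exists, channel configuration does not
    if not c.commercial_ok or c.regulated:
        return "blocked"
    return "usable"

# Two rows from the table, encoded from the stated figures.
fisher = Corpus("Fisher English", 1_960,
                two_channel=True, commercial_ok=False, regulated=False)
oyez = Corpus("Oyez oral arguments", 5_000,
              two_channel=False, commercial_ok=True, regulated=False)

print(classify(fisher))  # blocked (LDC-gated license)
print(classify(oyez))    # structurally wrong (mono-mixed)
```

The point of writing it down is the final branch: no row in the table reaches `usable`, which is the zero-verticals claim in compact form.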

Three verticals deserve a closer look because they surprised the inventory.

**Legal is configuration-wrong, not data-poor.** The Oyez Project hosts more than 5,000 hours of US Supreme Court oral argument audio, which is public domain. But it is mono-mixed from the courtroom recording system; a well-funded effort could diarize and release per-speaker tracks, but the underlying acoustic recording has only one track ([Oyez.org](https://www.oyez.org/)). The domain is not scarce; the configuration is.

**Medical is structurally forced into synthetic data.** The two named public medical conversation corpora, PriMock57 ([Korfiatis et al. 2022](https://arxiv.org/abs/2204.00333)) and ACI-Bench ([Yim et al. 2023](https://arxiv.org/abs/2306.02022)), are both explicitly mocked with patient actors. The authors are explicit about this being a HIPAA workaround. The Google Health medical conversation dataset of 14,000 hours ([Chiu et al. 2017](https://arxiv.org/abs/1711.07274)) is institutional and has never been released. **Medical full-duplex STS training is structurally forced into synthetic data by regulation, not by effort.**

**The only commercial full-duplex corpus over 10,000 hours is vendor-claimed.** Abaka AI's 20,000-hour bidirectional release (2026) is the only named public precedent that clears the commercial-license-at-scale bar, but it is vendor-claimed and has not been independently audited for overlap rate, consent documentation, or per-language hour breakdown. One data point is not a distribution.

*(Figure F5: central traffic-light matrix. Nine verticals on rows. Four columns for size, channel configuration, license, regulation, each colored green / yellow / red. This is the article's anchor visual.)*

## 6. Why vertical-first investment is premature now

The simple version of this article's thesis: **committing vertical-specific STS capital in 2026 is not wrong in direction, but wrong in timing.** The vertical needs a foundation to sit on. That foundation does not yet exist for native full-duplex.

Three market positions implicitly take the vertical-first bet.

**Decagon** raised a Series D at a $4.5 billion valuation in January 2026, deploying customer-service STS agents for enterprises ([press coverage](https://techcrunch.com/2026/01/28/decagon-series-d/)). **Deepgram** raised a Series C at $1.3 billion for enterprise voice AI in the same month ([press coverage](https://www.reuters.com/technology/deepgram-raises-series-c-2026-01-13/)). **Vapi** raised a $20M Series A in late 2024 and has since reportedly passed a $130M valuation, building a developer voice platform ([press coverage](https://techcrunch.com/2024/12/12/vapi-series-a/)).

Each of these is a **pipeline STS stack**, meaning ASR plus LLM plus TTS, not native full-duplex. The product works today. It ships to customers today. The full-duplex quality gap (natural overlap, true interruption handling, backchannel nuance) is real but not yet a deal-breaker for the customer-service and developer-tool use cases these companies address.

**The risk is not the business model. The risk is timing.** If the native full-duplex foundation threshold is crossed in 2026-2028, pipeline-based verticals face a transition cost: either migrate to native full-duplex (expensive) or maintain the pipeline stack against competitors who build natively on the new foundation (compounding disadvantage).

One honest counter-nuance: **not every vertical needs the foundation to be ready.** Retell has reportedly reached $50M ARR on roughly $5M total funding ([company page](https://www.retellai.com/)), which suggests pipeline STS can compound without foundation-level investment for certain use cases where the full-duplex quality gap is not the binding constraint. The foundation-first pattern is strongest for verticals where natural conversation quality is the bottleneck: companionship, emotional support, interactive gaming, long-context conversational agents.

The post-foundation compression factor is worth keeping in view. Harvey went from ChatGPT launch (November 2022) to Series A (April 2023) in five months. Nuance went from founding (1992) to Microsoft acquisition at $19.7 billion in thirty years. **Post-foundation verticals compound roughly 30x faster because the foundation is a commodity input.** STS unicorns built on a native full-duplex foundation, when it exists, will compound on that ratio, not Nuance's.

*(Figure F6: investment timing chart. X-axis foundation moment in each domain. Y-axis months to first $1B vertical. Points plotted: Harvey (text legal, 13 months post-ChatGPT), Abridge (medical ASR, 24 months post-Whisper), full-duplex STS (projected 2027-2028, conditional on threshold crossing).)*

## 7. Where the data could plausibly come from: six routes

If the foundation threshold requires 100,000 to 500,000 hours of two-channel dyadic audio, where does that data come from? Six routes are plausible. Each has a precedent; none has cleared the full bar of commercial-license at scale for full-duplex specifically.

**Route 3, BPO and commercial vendors.** Strongest current precedent. Abaka AI's 20,000 hours bidirectional commercial release (2026) is the only named public precedent that clears commercial license at scale. Caveat: vendor-claimed, not independently audited. Nexdata's 15k-hour multilingual conversational corpus is mono 8 kHz and fails the channel bar. Appen and TELUS Digital are project-based managed collection, not standing corpora. **One working route with one working data point.**

**Route 5, government-sponsored (DARPA template).** Strongest historical precedent for native two-channel conversational. Switchboard (1990-91, DARPA + Texas Instruments, 260 hours) and Fisher (2003-04, DARPA EARS + LDC, 1,960 hours) are the two canonical releases of the modern era. **Neither alone clears 10,000 hours, and no public 2024-2026 program is known to target full-duplex at tens-of-thousands-of-hours scale.** The ceiling is political and logistical, not technical.

**Route 4, academic consortium (LDC model).** LDC has the longest operational track record for two-channel conversational licensing and a working commercial tier ($34-40k annual for-profit membership) that yields in-year commercial rights. But LDC has not produced a new two-channel conversational corpus above 2,000 hours since Fisher in 2004-2005. **The institutional scaffolding is intact; the origination funding has not been there for twenty-two years.**

**Route 6, platform-gated licensing.** Infrastructure is mature. Reddit-Google ($60M per year, February 2024) and Reddit-OpenAI ($70M per year) prove platforms can monetize UGC corpora to AI labs. YouTube's December 2024 opt-in creator control and the RSL protocol (launched September 2025, 1,500+ publishers by late 2025) provide the opt-in plumbing for web-scale audio. But **no audio-platform-wide bulk licensing deal to an AI lab is publicly disclosed for STS training as of April 2026.** Spotify's May 2025 Developer Policy explicitly prohibits Spotify-content training. **The pipes are built. The deals are not signed.** Watch Route 6 for a surprise inflection.

**Route 2, crowdsourced opt-in.** Ceiling of known evidence: Mozilla Common Voice at 31,841 hours across 286 languages, all CC0. But **all of it is single-speaker read or monologue**. No crowdsourced precedent for full-duplex conversational speech at any scale has been identified. The structural reason is that crowdsourcing assumes one-person-per-device recording, and full-duplex requires paired speakers on isolated channels. **The route with the single largest structural gap.** Whoever builds the pairing, channel-isolation, and consent infrastructure at scale could become the Mozilla Common Voice of STS.

**Route 1, consumer companion app opt-in.** Highest volume, lowest marketability. Replika (since 2017), Character.AI (since 2021), Inflection Pi (2022-24), and Sesame (beta 2025) accumulate in-app conversational data in volumes that almost certainly exceed 10,000 hours per company. But **none has ever released, licensed, or sold a conversational corpus to a third party.** The Italian DPA's February 2023 provisional ban and April 2025 €5M fine against Luka Inc. (Replika) demonstrate the GDPR ceiling: privacy policies that conflate chatbot interaction with model development fail the lawful-basis test, making commercial redistribution impossible ([Garante decision, April 2025](https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/10085565)). **Companion-app data is structurally trapped inside the companion app.**

*(Figure F7: six-route comparison card. Columns: route, canonical precedent, precedent hours, license type, full-duplex applicability, current ceiling. Color-coded on precedent strength, scalability, and commercial viability.)*

## 8. Sequencing: what buys what

The ASR arc is the clearest template. **Automatic speech recognition went from Switchboard (1991, 260 hours, on a reported DARPA budget of roughly $2M) to Whisper (2022, 680,000 hours, scraped from the web) in thirty-two years.** The per-hour cost of ASR training data collapsed from thousands of dollars to effectively zero-but-legally-contested. Each jump was enabled by a different collection route.

- **1990-91**: Switchboard, Route 5 (government).
- **2003-04**: Fisher, Route 5 continued.
- **2015**: Mozilla Common Voice launched, Route 2 (crowdsourced).
- **2020**: LibriLight, 60,000 hours public-domain audiobooks, Route 4-adjacent.
- **2022**: Whisper, 680,000 hours scraped, Route 6.
- **2024**: Moshi, 7M hours web speech pretraining, Route 6 continued.

**Full-duplex STS is roughly at the 1991 stage of the ASR arc.** Absolute hours of available training data (Abaka 20k + Fisher 2k + AMI / ICSI / CHiME-6 / CANDOR / DialogueSidon / InteractSpeech at roughly a thousand hours combined) approximate 1991 ASR in raw volume, and worse in relative terms because two-channel is a harder collection problem than mono.

**The likely sequence, treated as hypothesis rather than forecast.** Near term, 2026-2027: Route 3 scales as the proven route. One or two more commercial vendors cross 10,000 hours. Route 5 re-enters if a DARPA-equivalent or EU-equivalent program launches for full-duplex. Medium term, 2027-2029: Route 6 delivers a surprise inflection if an audio platform signs a bulk licensing deal, most likely a podcast distributor via RSL or a UGC platform with opt-in, not consumer companion apps. Longer term, 2028+: Route 2 becomes viable once someone builds the pairing, channel-isolation, and consent infrastructure for full-duplex crowdsourcing. No one has built it yet.

**The post-foundation compression factor should shorten this arc by roughly 3x relative to ASR's thirty-two years.** Infrastructure that did not exist in 1991 (cloud storage, consent UX primitives, opt-in protocols, commercial vendor markets for labeled data) exists now. **A 10-year arc from Switchboard-equivalent to Whisper-equivalent is plausible; a 30-year arc is not.**
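The compression claim reduces to milestone arithmetic. A sketch under this section's own assumptions: the milestone years come from the list above, and the 3x factor is the §8 hypothesis, not a measurement:

```python
COMPRESSION = 3          # hypothesized post-foundation compression factor
ASR_START = 1990         # Switchboard collection begins

# ASR milestones from the arc above, mapped onto a compressed full-duplex arc.
milestones = {"Switchboard": 1990, "Fisher": 2004,
              "LibriLight": 2020, "Whisper": 2022}

for name, year in milestones.items():
    asr_offset = year - ASR_START
    sts_offset = asr_offset / COMPRESSION
    print(f"{name}-equivalent: ASR +{asr_offset}y -> full-duplex +{sts_offset:.1f}y")

# Whisper-equivalent lands at 32 / 3 = ~10.7 years into the arc:
# the roughly ten-year arc claimed above, rather than ASR's thirty-two.
```

Under the same hypothesis, a Fisher-equivalent corpus arrives about five years into the full-duplex arc, which is the shape behind the near-term and medium-term sequencing sketched earlier in this section.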

The order of operations matters for investors. **Foundation data, then foundation model, then vertical product.** Reversing the order (betting on vertical before foundation) works for pipeline STS, but where the value ultimately concentrates at each layer is an open question that the LLM arc does not cleanly resolve for STS.

In text LLMs, the foundation model layer captured significant terminal value. STS is likely to play out differently in at least one structural way. Hyperscalers (Google with Gemini Live and OpenAI with GPT-4o, with Microsoft and Meta close behind) already show a tendency to internalize foundation STS behind proprietary APIs. If that pattern holds, the independent-player opportunity shifts away from foundation model replication and toward four adjacent layers: **foundation data** (what oto is building), **evaluation and benchmarking** (the subject of Articles 07 and 08), **migration infrastructure** (pipeline-to-native adapters), and **vertical integration on top of closed foundations**. The conservative reading is that value will concentrate differently than in the LLM arc, with at least one of those four adjacent layers accruing a disproportionate share.

**The cleanest framing for investors is that foundation-first is a timing claim, not a value-capture claim.** Verticals are waiting on foundation readiness; pure-play foundation-model startups face the risk that the foundation may not be where the terminal value sits.

*(Figure F8: ASR arc as a timeline. 1991 Switchboard, 2003 Fisher, 2015 Common Voice, 2020 LibriLight, 2022 Whisper, 2024 Moshi. Full-duplex STS annotated with a "you are here" arrow near the 1991-equivalent position. Compression-factor note explains the expected 10-year arc rather than 32-year.)*

## 9. Forward pointers

For investors, **the investable proposition in full-duplex STS today is not vertical-first.** It is foundation-data collection, if you believe the threshold is crossable, or pipeline STS verticals with a planned migration path to native full-duplex, if you do not. Both are reasonable positions; neither is the SaaS-style vertical bet.

For engineers and researchers, **two-channel conversational data at 100,000 to 500,000 hours is the binding input.** The technical problem most worth solving is not a new architecture. It is the pairing, channel-isolation, and consent infrastructure for crowdsourced full-duplex.

For frontier labs, **vertical fine-tuning is premature.** The 2024-2026 window is the foundation window. Article 04 covered the data-supply side of the same coin; Article 07 will cover how we will know when the threshold has been crossed, via benchmarks. Article 10 will cover the legal ceilings on Routes 1, 2, and 6 in detail.

**Data for full-duplex STS is not a vertical problem yet. It is a foundation problem still.**

---

_Originally published at [https://fullduplex.ai/blog/foundation-before-vertical](https://fullduplex.ai/blog/foundation-before-vertical)._
_Part of **The STS Series** · 05 / 10 · from Fullduplex._
_Full index: https://fullduplex.ai/blog · Markdown of every article: https://fullduplex.ai/llms-full.txt._
