Fullduplex /
the sts series · 05 / 10 · #foundation #investment · 09 sections · 08 figures · 01 matrix

Foundation before vertical.

Speech-to-speech AI sits between the GPT-2 moment and the GPT-3 moment. “Which vertical wins first?” is, for now, the wrong question. A thesis essay on the foundation threshold, the 30×–150× data gap, and the six routes that could plausibly close it.

fig.00 · foundation-first vs vertical-first: where STS actually sits in 2026 (fullduplex / synthesized)

The VC question and the wrong answer

Investors evaluating voice AI in 2026 keep asking a reasonable-sounding question. Which vertical will speech-to-speech AI win first? Call centers? Medical documentation? Legal? Education? Gaming? The premise of that question is that the foundation is ready and the remaining problem is product-market fit. Eighteen months of data inventory, benchmark results, and model releases suggest the premise is wrong. Full-duplex speech-to-speech sits somewhere between the GPT-2 moment and the GPT-3 moment. Asking which vertical to chase is like asking in 2019 whether the first billion-dollar LLM company would be in legal or in medical. The answer then was “neither, because the foundation is not ready.” The answer for STS in 2026 is the same.

The vertical-first framing comes naturally to people who have financed a decade of SaaS. Pick a vertical with pain, ship a narrow product faster than the incumbents, compound through distribution. For speech-to-speech AI in 2026, this framing is a category error. The constraint is not which market to address. The constraint is whether the foundation that verticals will sit on exists yet.

Text LLMs went through the same confusion in 2019 and 2020. GPT-2 (2019) could write paragraphs but not reliably answer domain questions. Vertical LLM startups at that stage either built their own domain foundation from scratch (and lost) or waited. GPT-3 (2020) flipped the economics. Post-GPT-3, Harvey raised its Series A five months after ChatGPT shipped. Hippocratic AI raised a $50M seed six months after. Neither would have been financeable eighteen months earlier.

The right question for STS in 2026 is not “which vertical?” but “is the foundation data bottleneck closing, and on what timeline?” The rest of this article works through what is known about the answer.

One note on epistemic status before continuing. This article is a thesis essay, not a research summary. Facts and hypotheses are tagged differently. Facts include the public supply total (2,000 to 3,000 hours), Fisher's 1,960 hours, Abaka's vendor-claimed 20,000 hours, the LDC license structure, and the funding rounds cited in §2 and §6. Hypotheses include the 100,000 to 500,000 hour foundation threshold estimate (§3), the 30× to 150× supply gap that follows from it (§3), the 3× post-foundation compression factor applied to the ASR arc (§8), and the 2027 to 2029 sequencing reading (§8). Readers should hold the hypotheses loosely and update them against new data as it arrives.

fig.01 · two framings
[diagram: vertical-first framing asks “which market do we win?” (call centers, medical scribing, legal depositions, education tutoring, gaming companions, casual companions) and assumes the foundation is ready, with GTM as the remaining problem: a category error for STS in 2026. foundation-first framing asks “is the foundation ready?” and plots hours of two-channel data: ~2–3k h current vs a 100–500k h threshold, a 30×–150× gap; the foundation is not yet ready, and the gap closes before verticals compound]
Two framings of the STS investment question. The vertical-first framing inherits from a decade of SaaS logic and assumes the foundation is ready; the foundation-first framing treats two-channel conversational data volume as the binding constraint and marks the supply gap.

Foundation threshold, a concept worth naming

The foundation threshold is the data-and-parameter scale at which a single pretrained model generalizes well enough, zero-shot or with light fine-tuning, that domain-specialized products can be built as adapters on that model rather than from scratch. Below the threshold, each vertical must solve its own data, model, and product. Above it, the foundation is a commodity input and the vertical becomes a distribution problem.

The threshold is visible across three domains that have already crossed it.

Text LLMs. GPT-1 (2018) at 117M parameters and 0.8B tokens required fine-tuning for any task (Radford et al. 2018). GPT-2 (2019) at 1.5B and 10B tokens had zero-shot performance that was interesting but unreliable (Radford et al. 2019). GPT-3 (2020) at 175B and 300B tokens had few-shot in-context learning robust enough that a vertical startup could build a product by prompting alone (Brown et al. 2020). Post-threshold vertical adapters confirmed the pattern: Med-PaLM scored 67.6% on MedQA (Singhal et al. 2022) and Med-PaLM 2 scored 86.5% (Singhal et al. 2023), built on PaLM and PaLM 2 respectively, not trained from scratch. Code Llama added 500B code tokens on top of Llama 2, roughly ten percent additional training (Rozière et al. 2023). Specialization was additive and cheap on top of a proven base.

Computer vision. CLIP (2021) trained on 400M image-text pairs crossed a zero-shot transfer threshold (Radford et al. 2021). MedSAM, a medical-imaging adapter on SAM, improved DICE by 22.51 points over zero-shot SAM across 86 of 86 internal tasks, using 1.57M medical mask annotations (Ma et al. 2024). The domain data needed post-threshold was two to three orders of magnitude smaller than the foundation data. BiomedCLIP followed the same pattern with 15M medical image-text pairs on a CLIP base (Zhang et al. 2023).

Automatic speech recognition. Whisper (2022) trained on 680,000 hours crossed the threshold for zero-shot transfer across accents, domains, and languages (Radford et al. 2022). Pre-Whisper, Nuance Dragon Medical was used by 55% of US physicians and was built on per-domain specialized acoustic models. Post-Whisper, the vertical winner is Abridge ($5.3B valuation, June 2025), which sits on Whisper-class foundations plus LLMs rather than a from-scratch medical ASR stack. The specialized-acoustic-model moat largely evaporated.

Counter-examples make the rule sharper, not weaker. Two text-LLM projects tried to build domain foundations at sub-frontier scale. BloombergGPT (2023, 50B parameters, trained from scratch on 363B finance tokens plus 345B general) was matched or exceeded by GPT-4 on most finance tasks within twelve months of its release (Wu et al. 2023). Galactica (2022, 120B parameters, science-specialized) was publicly withdrawn after three days because its narrow corpus produced hallucinated citations that sounded plausible (Taylor et al. 2022). Neither case refutes the foundation-first pattern. Both refine it. The operational rule is “foundation first, vertical as adapter on the foundation,” not “vertical foundation at sub-frontier scale.” Attempts at the latter lose.

fig.02 · three domains, one pattern

Domain | Foundation reference | Params | Data | Post-threshold vertical
Text LLM | GPT-3 (2020) | 175B | 300B tokens | Harvey · Hippocratic
Vision | CLIP / SAM (2021–23) | ~400M / ~600M | 400M pairs / 1.1B masks | MedSAM · BiomedCLIP
ASR | Whisper large (2022) | 1.55B | 680,000 h | Abridge
Full-duplex STS | not yet crossed | ~7B (Moshi) | ~2–3k h public full-duplex-ready | none native
Foundation thresholds and their post-threshold vertical winners, across three domains that have crossed it. Full-duplex STS appears as an unfilled row; the rest of the article works through why.

Where full-duplex STS actually sits now

STS is not pre-foundation. It is mid-foundation.

Model side. Moshi (September 2024) was the first open full-duplex STS model, at roughly 7B parameters (Kyutai 2024). PersonaPlex (January 2026, NVIDIA) is a Moshi fine-tune with persona control. J-Moshi (2025) is the Japanese-language variant. Parameter counts sit between GPT-2 scale and GPT-3 scale.

Data side. From Article 04's inventory, public full-duplex-ready speech totals roughly 2,000 to 3,000 hours. Fisher English is the anchor at 1,960 hours (LDC2004S13, LDC2005S13), gated by LDC license. AMI contributes 100 hours, ICSI 72 hours, CHiME-6 40 hours, CANDOR 850 hours under CC BY-NC, plus smaller contributions from InteractSpeech and DialogueSidon. Total public full-duplex-ready hours under commercial license is close to zero.

On the scaling curve, Moshi is the GPT-2 analog. It crossed the zero-shot full-duplex viability threshold. No public GPT-3 analog exists yet. This diagnosis is not fullduplex's invention; it is visible in the models themselves (still brittle on long-context turn-taking), in the benchmarks (Full-Duplex-Bench scores remain in a wide range), and in the data (every scaling attempt hits the same Fisher-plus-scraps supply).

The foundation threshold for full-duplex STS probably sits between 100,000 and 500,000 hours of two-channel dyadic conversational audio. This is a hypothesis, not a measurement, and it depends on three load-bearing assumptions that deserve to be named.

First, the ASR scaling curve that runs from Switchboard (260 hours, 1991) through Fisher (1,960 hours, 2004) to LibriLight (60,000 hours, 2020) to Whisper (680,000 hours, 2022) is usable as an analogy for full-duplex. That is, the per-hour data efficiency transfers from single-channel to two-channel within roughly one order of magnitude. Second, full-duplex is strictly harder than single-channel (two tracks, natural overlap, backchannels, turn-taking), but the difficulty multiplier is bounded at one to two orders of magnitude, not more. Third, parameter scaling and data scaling co-move as they did in the LLM and ASR arcs, so the target model lands at 50B to 500B parameters, roughly 10× to 50× current Moshi scale.

If any of the three assumptions breaks, the estimate breaks with it. The 5× spread in the hour estimate (100k to 500k) is the honest range that absorbs these three uncertainties; it is not the statistical confidence interval of a measured quantity.

If the threshold is 100–500k hours and current supply is 2–3k, closing the gap requires a 30× to 150× scale-up. On the ASR arc's template, this is a multi-year project, not a six-month sprint.
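The gap arithmetic is short enough to make explicit. A minimal sketch, using the article's own ranges (both inputs are hypotheses, not measurements):

```python
# Supply-gap arithmetic from the ranges in this section; hypotheses, not measurements.
supply_hours = 3_000                              # upper bound of today's public supply
threshold_low, threshold_high = 100_000, 500_000  # hypothesized foundation threshold

print(f"required scale-up: ~{threshold_low / supply_hours:.0f}x "
      f"to ~{threshold_high / supply_hours:.0f}x")
# -> ~33x to ~167x against the 3,000 h supply bound; against the 2,000 h bound
#    the band stretches toward 250x. The text rounds this to the 30x-150x working figure.
```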

fig.03 · scaling curves, by analogy
[chart: data scale (log) on the x-axis, model capability on the y-axis; the text-LLM curve runs GPT-1 → GPT-2 → GPT-3 → GPT-4 through a shaded foundation-threshold band; the full-duplex STS curve shows Moshi 2024 and PersonaPlex · J-Moshi, with an open “? STS GPT-3 analog” point; annotation: Moshi ≈ GPT-2 moment]
Full-duplex STS plotted as an analogical trajectory on the text-LLM scaling curve. Moshi lands near the GPT-2 position; no public GPT-3 analog exists yet. The shaded band marks the plausible foundation-threshold region. Analogy, not identity.

Which verticals will need their own data

Even after the foundation threshold is crossed, some verticals will still require vertical-specific fine-tuning. The reasons differ by vertical.

Call center uses scripted prompts, complaint register, sensitive data handling, and QA review. The deployment reality is 8 kHz narrowband; the training reality needs natural overlap between agent and caller. Medical requires drug-name vocabulary, ICD-10 term accuracy, emotional-register control with patients, and HIPAA-compliant scribing. Legal runs on formal oral-argument register, citation-heavy vocabulary, and adversarial turn-taking that looks nothing like casual conversation. Education follows teacher-student Initiation-Response-Evaluation patterns, includes code-switching with minors, and is gated by FERPA and COPPA. Multi-party meetings (three or more speakers) require diarization, overlap resolution, and role attribution; current full-duplex models are dyadic, which is a structural gap, not a data gap. Gaming requires sub-200 ms latency, emotional-register matching, gaming-specific jargon, background-noise handling, and interruptions as a feature. Casual everyday and companion use cases need backchannels, laughter, emotional attunement, and long-context memory.

Each of these is a different distribution of turn patterns, channel configurations, vocabulary, or regulatory constraints. Generic STS can handle none of them at production quality.

4.1 The Japanese case, a short aside

Japanese full-duplex STS is a special case because no Fisher-equivalent exists. J-Moshi's fine-tune mixture totals 344 hours, of which only 143 hours come from publicly reproducible corpora; the remaining 201 hours are Nagoya University in-house recordings. J-CHAT provides 69,000 hours of Japanese audio but is mono, single-speaker, and cannot be used for the full-duplex fine-tune stage (Nakata et al. 2024). CEJC (200 hours), BTSJ (127 hours), and CSJ (650 hours, predominantly monologue) push the total dialogue audio toward 500–600 hours with mixed channel configurations and mixed licenses.

The Japanese full-duplex community is training on a public floor of 143 hours. English is not great. Japanese is worse.

fig.04 · seven verticals + japan

Vertical | Vocabulary | Turn pattern | Channel | Regulation
Call center | QA / complaint | Agent-caller | 8 kHz narrowband | Recording consent
Medical | Drug / ICD-10 | Patient-led | Wide-band | HIPAA / GDPR
Legal | Citation-heavy | Adversarial | Mono-mixed | Court rules
Education | Subject-domain | IRE | Classroom noise | FERPA / COPPA
Multi-party (3+) | General | Diarization | Role attribution | Context-dep.
Gaming | Jargon | Interruption-rich | Background noise | Moderation
Casual companion | General | Backchannels | Wide-band | GDPR · consent
Japanese (any) | Language split | Aizuchi | 143 h public | APPI
Seven verticals plus the Japanese cross-cutting case, scored on how far each diverges from general-purpose STS pretraining; each cell names the binding constraint or distributional gap on that dimension, and dimensions not flagged in the text are approximately covered by generic data.

The three-pattern reality check

Across the nine verticals inventoried (call center, medical, legal, education, multi-party meetings, gaming, brainstorming, casual everyday, Japanese), data for each falls into one of three categories: nonexistent (no public corpus at any scale), too small (under 200 hours in the largest public corpus), or blocked (200+ hours exist but are gated by license, regulation, or channel configuration). A fourth pattern is worth naming alongside them: structurally wrong, where the data exists at scale but in the wrong configuration for full-duplex training, most commonly because it is mono-mixed rather than two-channel.

Zero verticals have an existing, commercially usable, public full-duplex corpus over 1,000 hours. Every vertical fails at least one bar.
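The classification rule is mechanical enough to write down. A minimal sketch under a hypothetical schema; the 200-hour cutoff and the four patterns come from the inventory above, while the field names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Corpus:
    # Hypothetical schema; fields are illustrative, thresholds come from the text.
    hours: float                # largest public corpus for the vertical
    two_channel: bool           # per-speaker tracks, not mono-mixed
    commercially_usable: bool   # license and regulation permit commercial training

def pattern(c: Optional[Corpus]) -> str:
    """Classify a vertical's best public corpus into the four patterns."""
    if c is None or c.hours == 0:
        return "nonexistent"          # no public corpus at any scale
    if c.hours < 200:
        return "too small"            # under 200 h in the largest public corpus
    if not c.two_channel:
        return "structurally wrong"   # scale exists, configuration does not
    if not c.commercially_usable:
        return "blocked"              # 200+ h exist but gated
    return "usable"

# Legal via Oyez: ~5,000 h, public domain, but mono-mixed -> "structurally wrong"
print(pattern(Corpus(hours=5_000, two_channel=False, commercially_usable=True)))
# Call center via Fisher: 1,960 h, two-channel, LDC-gated -> "blocked"
print(pattern(Corpus(hours=1_960, two_channel=True, commercially_usable=False)))
```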

fig.05 · nine verticals · four bars

Vertical | Size | Channel config | License | Regulation
Call center | 1,960 h (Fisher English) | OK · two-channel telephone | Blocked · LDC gated | Mid · consent varies
Medical | <100 h (PriMock / ACI, mocked) | Mixed · wide-band, varies | Mid · paper-only | Blocked · HIPAA / GDPR
Legal | 5,000 h (Oyez SCOTUS) | Wrong · mono-mixed only | OK · public domain | Mid · court rules
Education | 5,000 h (NCTE, gated) | Mixed · classroom mic | Blocked · data-use agreement | Blocked · FERPA / COPPA
Multi-party (3+) | ~360 h (AMI + ICSI + CHiME-6) | Mixed · close-talk only (CHiME) | Mid · CC BY-NC-SA mostly | OK · low risk
Gaming | <20 h (OGVC only) | Mixed · voice-chat format | Mid · academic | Mid · platform ToS
Brainstorming | 65 h (AMI scenario subset) | OK · close-talk | Blocked · CC BY-NC-SA | OK · low risk
Casual everyday | 850 h (CANDOR) | OK · two-channel | Blocked · CC BY-NC | OK · consented
Japanese | 143 h (J-Moshi public portion) | Mixed · stereo/mono mix | Mid · mixed | Mid · APPI

Legend: OK = clears the bar · Mid / Mixed = partial or workable with effort · Blocked / Wrong = binding constraint
Nine verticals scored on four data-availability dimensions. Blocked and Wrong cells are binding constraints; Mid and Mixed cells are partial; OK cells clear the bar. Every vertical has at least one binding constraint.

Three verticals deserve a closer look because they surprised the inventory.

Legal is configuration-wrong, not data-poor. The Oyez Project hosts more than 5,000 hours of US Supreme Court oral argument audio, which is public domain. But it is mono-mixed from the courtroom recording system; a well-funded effort could diarize and release per-speaker tracks, but the underlying acoustic recording has only one track. The domain is not scarce; the configuration is.

Medical is structurally forced into synthetic data. The two named public medical conversation corpora, PriMock57 (Korfiatis et al. 2022) and ACI-Bench (Yim et al. 2023), are both mocked consultations with patient actors, and their authors are explicit that this is a HIPAA workaround. The Google Health medical conversation dataset of 14,000 hours (Chiu et al. 2017) is institutional and has never been released. The force here is regulation, not lack of effort.

The only commercial full-duplex corpus over 10,000 hours is vendor-claimed. Abaka AI's 20,000-hour bidirectional release (2026) is the only named public precedent that clears the commercial-license-at-scale bar, but it is vendor-claimed and has not been independently audited for overlap rate, consent documentation, or per-language hour breakdown. One data point is not a distribution.

Why vertical-first investment is premature now

The simple version of this article's thesis: committing vertical-specific STS capital in 2026 is not wrong in direction, but wrong in timing. The vertical needs a foundation to sit on. That foundation does not yet exist for native full-duplex.

Three market positions implicitly take the vertical-first bet. Decagon raised a Series D at a $4.5 billion valuation in January 2026, deploying customer-service STS agents for enterprises. Deepgram raised a Series C at $1.3 billion for enterprise voice AI in the same month. Vapi raised a $20M Series A in late 2024 and has reportedly crossed a $130M valuation since, building a developer voice platform.

Each of these is a pipeline STS stack, meaning ASR plus LLM plus TTS, not native full-duplex. The product works today. It ships to customers today. The full-duplex quality gap (natural overlap, true interruption handling, backchannel nuance) is real but not yet a deal-breaker for the customer-service and developer-tool use cases these companies address.

The risk is not the business model. The risk is timing. If the native full-duplex foundation threshold is crossed in 2026–2028, pipeline-based verticals face a transition cost: either migrate to native full-duplex (expensive) or maintain the pipeline stack against competitors who build natively on the new foundation (compounding disadvantage).

One honest counter-nuance: not every vertical needs the foundation to be ready. Retell has reportedly reached $50M ARR on roughly $5M total funding, which suggests pipeline STS can compound without foundation-level investment for certain use cases where the full-duplex quality gap is not the binding constraint. The foundation-first pattern is strongest for verticals where natural conversation quality is the bottleneck: companionship, emotional support, interactive gaming, long-context conversational agents.

The post-foundation compression factor is worth keeping in view. Harvey went from ChatGPT launch (November 2022) to Series A (April 2023) in five months. Nuance went from founding (1992) to Microsoft acquisition at $19.7 billion in thirty years. Post-foundation verticals compound roughly 30× faster because the foundation is a commodity input. STS unicorns built on a native full-duplex foundation, when it exists, will compound on that ratio, not Nuance's.

fig.06 · time to $1B, pre/post-foundation
[chart: time from a domain's foundation moment to its first $1B vertical; Nuance from 1992, 30 yr, pre-foundation; Harvey from 2022 ChatGPT, 13 mo to $1B; Abridge from 2022 Whisper, 28 mo to $1B; STS vertical projected from 2027–28?, ~30× compression post-foundation]
Time from a domain's foundation moment to its first billion-dollar vertical. Nuance (pre-foundation) took 30 years. Harvey (post-GPT-3) reached $1B+ in 13 months. The post-foundation compression factor is roughly 30×. Full-duplex STS vertical timing is projected conditional on threshold crossing.

Where the data could plausibly come from: six routes

If the foundation threshold requires 100,000 to 500,000 hours of two-channel dyadic audio, where does that data come from? Six routes are plausible. Each has a precedent; none has cleared the full bar of commercial-license at scale for full-duplex specifically.

Route 3, BPO and commercial vendors. Strongest current precedent. Abaka AI's 20,000 hours bidirectional commercial release (2026) is the only named public precedent that clears commercial license at scale. Caveat: vendor-claimed, not independently audited. Nexdata's 15k-hour multilingual conversational corpus is mono 8 kHz and fails the channel bar. Appen and TELUS Digital are project-based managed collection, not standing corpora. One working route with one working data point.

Route 5, government-sponsored (DARPA template). Strongest historical precedent for native two-channel conversational. Switchboard (1990–91, DARPA + Texas Instruments, 260 hours) and Fisher (2003–04, DARPA EARS + LDC, 1,960 hours) are the two canonical releases of the modern era. Neither alone clears 10,000 hours, and no public 2024–2026 program is known to target full-duplex at tens-of-thousands-of-hours scale. The ceiling is political and logistical, not technical.

Route 4, academic consortium (LDC model). LDC has the longest operational track record for two-channel conversational licensing and a working commercial tier ($34–40k annual for-profit membership) that yields in-year commercial rights. But LDC has not produced a new two-channel conversational corpus above 2,000 hours since Fisher in 2004–2005. The institutional scaffolding is intact; the origination funding has not been there for twenty-two years.

Route 6, platform-gated licensing. Infrastructure is mature. Reddit-Google ($60M per year, February 2024) and Reddit-OpenAI (~$70M per year) prove that platforms can license UGC corpora to AI labs. YouTube's December 2024 opt-in creator control and the RSL protocol (launched September 2025, 1,500+ publishers by late 2025) provide the opt-in plumbing for web-scale audio. But no audio-platform-wide bulk licensing deal to an AI lab is publicly disclosed for STS training as of April 2026. Spotify's May 2025 Developer Policy explicitly prohibits training on Spotify content. The pipes are built. The deals are not signed. Watch Route 6 for a surprise inflection.

Route 2, crowdsourced opt-in. Ceiling of known evidence: Mozilla Common Voice at 31,841 hours across 286 languages, all CC0. But all of it is single-speaker read or monologue. No crowdsourced precedent for full-duplex conversational speech at any scale has been identified. The structural reason is that crowdsourcing assumes one-person-per-device recording, and full-duplex requires paired speakers on isolated channels. Whoever builds the pairing, channel-isolation, and consent infrastructure at scale could become the Mozilla Common Voice of STS.

Route 1, consumer companion app opt-in. Highest volume, lowest marketability. Replika (since 2017), Character.AI (since 2021), Inflection Pi (2022–24), and Sesame (beta 2025) accumulate in-app conversational data in volumes that almost certainly exceed 10,000 hours per company. But none has ever released, licensed, or sold a conversational corpus to a third party. The Italian DPA's February 2023 provisional ban and April 2025 €5M fine against Luka Inc. (Replika) demonstrate the GDPR ceiling. Companion-app data is structurally trapped inside the companion app.

fig.07 · six routes, ranked
Route 3 · Strongest precedent
BPO / commercial vendor
Precedent: Abaka AI 20,000 h bidirectional (2026, vendor-claimed).
Ceiling: audit gap; single data point.
Route 5 · Historical template
Government-sponsored
Precedent: Switchboard 260 h (1991), Fisher 1,960 h (2004).
Ceiling: no active 2024–26 program at scale.
Route 4 · Institutional
Academic consortium
Precedent: LDC commercial tier; Fisher (2004).
Ceiling: 22-year gap since last 2,000 h+ release.
Route 6 · Infrastructure ready
Platform-gated licensing
Precedent: Reddit-Google $60M/yr (text). YouTube opt-in, RSL.
Ceiling: no audio-to-AI-lab deal signed.
Route 2 · Largest structural gap
Crowdsourced opt-in
Precedent: Common Voice 31k h CC0 but mono read.
Ceiling: no pairing / channel-isolation infrastructure exists.
Route 1 · Trapped volume
Consumer companion app
Volume: likely >10k h per app (Replika, Character.AI).
Ceiling: Italian DPA / GDPR block redistribution.
Six plausible routes to 100,000+ hours of two-channel full-duplex data, ranked by current precedent strength. Only Route 3 has a data point above 10,000 hours, and that point is vendor-claimed. Route 6 has the most mature infrastructure but no signed deals.

Sequencing: what buys what

The ASR arc is the clearest template. Automatic speech recognition went from Switchboard (1991, 260 hours, roughly $2M DARPA budget, reported) to Whisper (2022, 680,000 hours, scraped from the web) in thirty-two years. The per-hour cost of ASR training data collapsed from thousands of dollars to effectively zero-but-legally-contested. Each jump was enabled by a different collection route.

fig.08 · 32 years, one arc
[timeline: 32 years of ASR training-data collection, per-hour cost collapsed from ~$8k to effectively zero; 1991 Switchboard 260 h (Route 5) → 2004 Fisher 1,960 h (Route 5) → 2015 Common Voice 31k h (Route 2) → 2020 LibriLight 60k h (public domain) → 2022 Whisper 680k h (Route 6) → 2024 Moshi pretrain 7M h (Route 6); full-duplex STS is here, ≈ the 1991 Switchboard position]
Thirty-two years of ASR training-data collection. Each jump passed through a different collection route: government → academic → crowdsourced → public domain → scraped. Full-duplex STS is roughly at the 1991 Switchboard position. The post-foundation compression factor suggests a 10-year arc rather than 32.

Full-duplex STS is roughly at the 1991 stage of the ASR arc. Absolute hours of available training data (Abaka 20k + Fisher 2k + AMI / ICSI / CHiME-6 / CANDOR / DialogueSidon / InteractSpeech at roughly a thousand hours combined) approximate 1991 ASR in raw volume, and the position is worse in relative terms because two-channel is a harder collection problem than mono.

The likely sequence, treated as hypothesis rather than forecast. Near term, 2026–2027: Route 3 scales as the proven route. One or two more commercial vendors cross 10,000 hours. Route 5 re-enters if a DARPA-equivalent or EU-equivalent program launches for full-duplex. Medium term, 2027–2029: Route 6 delivers a surprise inflection if an audio platform signs a bulk licensing deal, most likely a podcast distributor via RSL or a UGC platform with opt-in, not consumer companion apps. Longer term, 2028+: Route 2 becomes viable once someone builds the pairing, channel-isolation, and consent infrastructure for full-duplex crowdsourcing. No one has built it yet.

The post-foundation compression factor should shorten this arc by roughly 3× relative to ASR's thirty-two years. Infrastructure that did not exist in 1991 (cloud storage, consent UX primitives, opt-in protocols, commercial vendor markets for labeled data) exists now. A 10-year arc from Switchboard-equivalent to Whisper-equivalent is plausible; a 30-year arc is not.
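Both compression claims reduce to one-line arithmetic; a sketch using only figures already cited in this article:

```python
# Compression arithmetic, using only figures cited in this article.
nuance_months_to_1b = 30 * 12   # pre-foundation: founding (1992) to $19.7B acquisition
harvey_months_to_1b = 13        # post-GPT-3: ChatGPT launch to $1B+ (fig.06)
compression = nuance_months_to_1b / harvey_months_to_1b
print(f"post-foundation compression: ~{compression:.0f}x")   # ~28x, quoted above as ~30x

asr_arc_years = 32              # Switchboard (1991) to Whisper (2022)
print(f"STS data arc at ~3x compression: ~{asr_arc_years / 3:.0f} years")  # ~11, i.e. a 10-year arc
```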

The order of operations matters for investors. Foundation data, then foundation model, then vertical product. Reversing the order (betting on vertical before foundation) works for pipeline STS, but where the value ultimately concentrates at each layer is an open question that the LLM arc does not cleanly resolve for STS.

In text LLMs, the foundation model layer captured significant terminal value. STS is likely to play out differently in at least one structural way. Hyperscalers (Google with Gemini Live and OpenAI with GPT-4o, with Microsoft and Meta close behind) already show a tendency to internalize foundation STS behind proprietary APIs. If that pattern holds, the independent-player opportunity shifts away from foundation model replication and toward four adjacent layers: foundation data (what fullduplex is tracking), evaluation and benchmarking, migration infrastructure (pipeline-to-native adapters), and vertical integration on top of closed foundations. The conservative reading is that value will concentrate differently than in the LLM arc, with at least one of those four adjacent layers accruing a disproportionate share.

The cleanest framing for investors is that foundation-first is a timing claim, not a value-capture claim. What verticals wait for is foundation readiness. What pure-play foundation model startups face is that the foundation may not be where the terminal value sits.

Forward pointers

For investors, the investable proposition in full-duplex STS today is not vertical-first. It is foundation-data collection, if you believe the threshold is crossable, or pipeline STS verticals with a planned migration path to native full-duplex, if you do not. Both are reasonable positions; neither is the SaaS-style vertical bet.

For engineers and researchers, two-channel conversational data at 100,000 to 500,000 hours is the binding input. The technical problem most worth solving is not a new architecture. It is the pairing, channel-isolation, and consent infrastructure for crowdsourced full-duplex.

For frontier labs, vertical fine-tuning is premature. The 2024–2026 window is the foundation window. Article 04 covered the data-supply side of the same coin; later articles will cover how we will know when the threshold has been crossed, via benchmarks, and the legal ceilings on Routes 1, 2, and 6 in detail.

Data for full-duplex STS is not a vertical problem yet. It is a foundation problem still.

Fullduplex is tracking this

We publish a weekly dispatch covering the foundation-data gap, the corpora that move the public floor, and the routes that actually deliver. If you are collecting, licensing, or building STS infrastructure, we want to hear from you. Read the data-ceiling companion (Article 04), then the consolidated references.

#foundation #investment #sts-series #data #moshi · filed under: the latent · sts 05