Signals · 2026-W25 — Fullduplex

What happened this week

This was the busiest week of the year on the foundational full-duplex side. Five distinct FD papers across labs converged on the same gaps the field has been pointing at since W21: native turn-taking without external VAD, interaction-level alignment that supervised loss doesn't reach, and interpretability of the listen/speak transition. On the product side, LiveKit shipped a single but consequential release.

Full-duplex — native architecture

BayLing-Duplex (Fang, Guo, Feng) is the headline FD paper. It argues that LLaMA-Omni- and GLM-4-Voice-class SpeechLMs are still turn-based because they rely on an external VAD module to mark the end of the user turn — and that this is the architectural limit on interactivity. The proposal: a single autoregressive LLM that decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. Joins the small set of native-FD backbones (Moshi, DuplexSLA, TML-Interaction-Small) without copying their dual-stream design.

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models (Ohashi, Zeghidour, Défossez; Kyutai) is the alignment paper to read alongside BayLing-Duplex. It targets a specific gap: current FD models are trained with token-level likelihood maximisation, which does not optimise interaction-level behaviours and so produces excessive silence and ill-timed turn-taking. Prior RL work on FD addressed only a narrow set of interactive behaviours. This paper proposes a post-training alignment method that improves interactivity across multiple facets at once, the missing piece between Moshi-style pretraining and a deployable FD agent.

Overcoming State Inertia (Chang, Chang, Liu) is the interpretability counterpart. It probes what FD-SLM hidden representations actually encode and finds stream-specific predictive patterns: during listening, the model preferentially predicts the incoming user stream; during speaking, it preferentially predicts its own output. Activation steering can then dynamically modulate the internal predictive focus between the two states — a concrete handle on the inertia problem that plagues every FD backbone.
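
For readers new to activation steering, here is a minimal sketch of the general technique, not the paper's method. It estimates a direction as the difference between the mean hidden state while speaking and the mean hidden state while listening, then adds a scaled copy of that direction to a hidden state. The function names, the mean-difference estimator, the toy 3-dimensional states, and the scale alpha are all illustrative assumptions.

```python
import math

def mean_vector(states):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

def steering_direction(listen_states, speak_states):
    """Unit vector pointing from the mean listening state to the mean speaking state."""
    mu_listen = mean_vector(listen_states)
    mu_speak = mean_vector(speak_states)
    diff = [s - l for s, l in zip(mu_speak, mu_listen)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; alpha > 0 pushes toward speaking."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def projection(vec, direction):
    """Scalar projection of vec onto a unit direction."""
    return sum(v * d for v, d in zip(vec, direction))

# Toy hidden states (hypothetical values, for illustration only).
listen_states = [[-1.0, 0.2, 0.0], [-0.8, -0.1, 0.1], [-1.2, 0.0, -0.1]]
speak_states = [[1.0, 0.1, 0.0], [0.9, -0.2, 0.1], [1.1, 0.1, -0.1]]

direction = steering_direction(listen_states, speak_states)
hidden = listen_states[0]
steered = steer(hidden, direction, alpha=2.0)

# Because the direction has unit length, the projection moves by alpha.
shift = projection(steered, direction) - projection(hidden, direction)
```

In a real model the states would be layer activations and the intervention would be applied inside the forward pass; how the paper chooses the layer, the direction, and the schedule for alpha is not covered by this sketch.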

Full-duplex — turn-taking and endpointing

Endpoint Anticipation for Low-Latency Spoken Dialogue (Udupa, Watanabe, Schwarz) shifts the endpointing problem from reactive detection to proactive forecasting, anticipating end-of-turn signals up to 2.56 seconds in advance so the LLM and TTS pipelines can speculatively execute on partial context. New metrics quantify the trade-off between realised latency reduction and computational redundancy. Integration with the Unmute framework validates the approach end-to-end. Direct relevance to anyone running a cascaded voice stack who is bottlenecked on VAD-style detection.
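
The latency-versus-redundancy trade-off can be illustrated with a toy calculation. The two metrics below are hypothetical stand-ins, not the paper's definitions: each speculative launch records how far ahead of the true end of turn it started and whether its partial context was still valid when the turn ended. Latency saved is the head start of the best usable launch, capped by the pipeline's own run time; redundancy is the share of launches whose work was discarded.

```python
from dataclasses import dataclass

@dataclass
class Speculation:
    lead_s: float   # seconds before the true end of turn at which the pipeline was launched
    usable: bool    # True if the partial context still matched at end of turn

def latency_saved(specs, pipeline_s):
    """Seconds of response latency hidden by the best usable speculation."""
    usable_leads = [s.lead_s for s in specs if s.usable]
    if not usable_leads:
        return 0.0
    # A head start cannot hide more than the pipeline's own run time.
    return min(max(usable_leads), pipeline_s)

def redundancy(specs):
    """Fraction of speculative launches whose work was thrown away."""
    if not specs:
        return 0.0
    # At most one launch is actually used for the reply.
    used = 1 if any(s.usable for s in specs) else 0
    return (len(specs) - used) / len(specs)

# One turn with three launches; the earliest fired too soon and was invalidated.
turn = [
    Speculation(lead_s=2.56, usable=False),
    Speculation(lead_s=1.2, usable=True),
    Speculation(lead_s=0.4, usable=True),
]
saved = latency_saved(turn, pipeline_s=0.8)
wasted = redundancy(turn)
```

Launching earlier raises the chance of hiding the full pipeline latency but also raises the chance that the partial context is invalidated, which is the tension the paper's metrics are meant to quantify.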

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents (Mitra, Pandey, Jain) introduces ModeratorLM, a role-playing voice agent that conditions turn-taking on an explicitly assigned role in multi-party settings, and RolePlayConv, a large-scale synthetic spoken multi-party dataset with diverse assistant roles. Adds a chain-of-thought reasoning variant over conversational context plus assigned role. The multi-party axis is the one EVA-Bench and FD-Bench have not yet covered — this paper opens that surface.

Paralinguistic and streaming translation

ParaBridge (Wang, Ni, Cai) tackles the paralinguistic perception-behavior gap from the dialogue side. SLMs can recognise paralinguistic cues but often ignore them in open-ended dialogue. The paper observes that a simple paralinguistic instruction scaffold at inference narrows this gap, suggesting the cues are already latent — and proposes ParaBridge, an on-policy self-distillation method that bakes that scaffold into the model. Pairs cleanly with W23 VoxParadox, which measured the gap.

NaturalFlow (Lee, Cho, Park) closes out the streaming S2ST thread that has been running since W21. The observation: excessive pursuit of low latency in simultaneous translation produces fragmented chunk-wise speech with frequent unnatural pauses, raising listener cognitive load. The fluency-aware optimisation framework discovers the sweet spot between low-latency benefits and natural acoustic flow — minimising inter-chunk silences without giving up the streaming property. Together with W22 Samsung's adaptive emit policy and W23 DOA's training-free decoder-only attention, the open SimulST stack now has three complementary papers in five weeks.

Product — LiveKit Agents 1.6.0 ships Asynchronous Tools

livekit-agents 1.6.0 is a minor-version bump with one headline feature: first-class asynchronous tools. While a long-running tool is in progress, the agent can hand control back to the LLM before the tool finishes and stream updates into the conversation as it progresses. Calling ctx.update(...) from inside a tool releases control and lets the agent say e.g. "Sure, searching flights, this'll take a minute." Later updates are coalesced into a deferred reply that is delivered when the agent is idle. The matching ctx.with_filler(...) API plays filler phrases during long silences. This is a proper fix for the long-tool-call silence problem that voice agents have so far papered over with hacks. Also in the release: per-FlushSentinel audio/text flush, single_peer_connection in JobContext.connect, and the Hamming plugin extensions.
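
The behaviour described above can be mimicked with plain asyncio. This sketch does not use the livekit-agents API: ToyToolContext, its update and flush_when_idle methods, and the search_flights tool are hypothetical stand-ins written only to show the pattern, in which the first update is spoken immediately and later updates are coalesced into one deferred reply.

```python
import asyncio

class ToyToolContext:
    """Hypothetical stand-in for a tool context; not the livekit-agents API."""

    def __init__(self):
        self.spoken = []     # what the agent has said so far
        self._pending = []   # later updates waiting for an idle moment
        self._first = True

    def update(self, message):
        if self._first:
            # The first update is spoken right away so the user is not left in silence.
            self.spoken.append(message)
            self._first = False
        else:
            self._pending.append(message)

    def flush_when_idle(self):
        # Coalesce all queued updates into a single deferred reply.
        if self._pending:
            self.spoken.append(" ".join(self._pending))
            self._pending.clear()

async def search_flights(ctx):
    ctx.update("Sure, searching flights, this will take a minute.")
    await asyncio.sleep(0.01)   # stands in for a slow backend call
    ctx.update("Found 12 options.")
    await asyncio.sleep(0.01)
    ctx.update("Cheapest is 420 USD.")
    return "done"

async def main():
    ctx = ToyToolContext()
    task = asyncio.create_task(search_flights(ctx))
    await asyncio.sleep(0)      # let the tool run up to its first await
    early = list(ctx.spoken)    # snapshot taken while the tool is still running
    result = await task
    ctx.flush_when_idle()
    return early, ctx.spoken, result

early, spoken, result = asyncio.run(main())
```

The snapshot in early shows the agent speaking before the tool has finished, and the final spoken list holds two entries rather than three because the later updates were merged. The real signatures and semantics of ctx.update and ctx.with_filler should be taken from the release notes.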

What is not here

Pipecat had no in-window release (v1.3.0 shipped May 29, before the window). Cartesia, Hume, Deepgram Voice Agent, and ElevenLabs Agents shipped no in-window technical changelog items. There was no dataset drop with a verifiable primary source. The audio-safety benchmarks page (/benchmarks#audio-safety) was added between W24 and W25, which is a site change rather than a digest entry.

Corrections to hi@fullduplex.ai.
