What happened this week
A dense week on the foundational side, with two themes pulling on each other: papers argue that large audio language models (LALMs) should stop being offline batch systems and become online interaction models, and that the full-duplex stack needs to survive realistic acoustic interference. On the product side, Microsoft and LiveKit both shipped substantive updates that move the application surface forward.
Foundational — the LALM goes online
Audio Interaction Model is the conceptual paper of the week. The authors observe that today's LALMs are offline while existing streaming audio models each handle only a single task such as streaming ASR or voice chatting. They formalise a regime they call the Audio Interaction Model — an always-on perceive-decide-respond loop that listens to sound, environment, and instructions in real time and reacts on the fly — and realise it with Audio-Interaction, a unified streaming model that retains offline task performance while adding the online interaction surface. It is more a problem statement than a leaderboard result, but the framing is the right one: the gap between an offline LALM and a deployed voice agent is exactly the always-on loop.
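To make the always-on loop concrete, here is a minimal sketch of a perceive-decide-respond cycle over a stream of audio frames. It is not the paper's architecture: the Frame features, the decide policy, and the threshold are all hypothetical stand-ins chosen only to show the control flow, in which the system makes a respond-or-stay-silent decision on every incoming frame rather than once per finished utterance.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, Iterator, Optional


class Action(Enum):
    STAY_SILENT = auto()
    RESPOND = auto()


@dataclass
class Frame:
    """One chunk of incoming audio, reduced to toy features (hypothetical)."""
    energy: float          # assumed loudness feature in [0, 1]
    is_instruction: bool   # assumed flag: the frame carries a spoken instruction


def decide(frame: Frame, energy_threshold: float = 0.5) -> Action:
    """Toy policy: respond to instructions or loud events, otherwise keep listening."""
    if frame.is_instruction or frame.energy >= energy_threshold:
        return Action.RESPOND
    return Action.STAY_SILENT


def interaction_loop(stream: Iterable[Frame]) -> Iterator[Optional[str]]:
    """Always-on loop: perceive each frame, decide, then respond or stay silent."""
    for index, frame in enumerate(stream):
        if decide(frame) is Action.RESPOND:
            yield f"response@{index}"   # placeholder for generated speech or text
        else:
            yield None                  # keep listening, emit nothing


if __name__ == "__main__":
    frames = [Frame(0.1, False), Frame(0.9, False), Frame(0.2, True)]
    for output in interaction_loop(frames):
        print(output)
```

An offline LALM would instead consume the whole clip and answer once; the per-frame decision is the part the framing adds.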
IRAF is the practical counterpart. End-to-end dual-channel full-duplex models can degrade in realistic acoustic environments because interfering speakers leaking into the user microphone get encoded as part of the user query, corrupting the LLM's conditioning and producing unstable turn-taking. Interference-Resilient Adaptive Fusion adds a lightweight, streaming front-end that separates interference before fusion. The paper is a direct answer to the deployment failure modes that DuplexSLA and similar dual-stream backbones expose when they leave the lab.
Foundational — open TTS foundations
VoxCPM2 Technical Report extends OpenBMB's hierarchical diffusion-autoregressive paradigm into a full multilingual foundation: 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning in a single backbone. The model upgrades the AudioVAE to an asymmetric 16 kHz encode / 48 kHz decode design, and the weights are released on Hugging Face under Apache 2.0. Trained on more than 2M hours of multilingual data, it is the most comprehensive open-source TTS foundation release since CosyVoice 2.
dots.tts Technical Report is the other open TTS drop — a 2B-parameter continuous autoregressive TTS that models speech in a continuous latent space rather than via discrete tokens. Three named innovations: an AudioVAE trained with multiple objectives to build a semantically structured and prediction-friendly latent space, full-history conditioning in the flow-matching head to preserve long-range consistency, and a reward-free supervised path that sidesteps RLHF for the TTS leg. Together with VoxCPM2, the continuous-latent TTS frontier just got two strong open anchor points in the same week.
Foundational — audio safety, extended
SpeechJBB probes safety alignment in LALMs under code-switched speech, addressing the gap that LALM safety is still primarily evaluated on monolingual text-based harmful prompts. The augmented setting introduces phonological perturbations to test whether safety alignment generalises across language mixing. Together with last week's Acoustic Interference paper, the field now has two distinct audio-jailbreak vectors documented in the literature — paralinguistic priors and code-switching — that LALM deployments need to defend against.
Product — Microsoft MAI-Voice-2 and LiveKit 1.5.17
Microsoft MAI-Voice-2 shipped at Build 2026 on June 2. The model covers 15+ languages and 18 locales with voice cloning from as little as 5 seconds of reference audio, named emotional categories (angry, confused, embarrassed, joyful, whispering), long-form generation via chunking with context carryover, and 24 kHz mono output. Pricing is $22 / 1M characters via Azure Speech, with a faster MAI-Voice-2 Flash variant alongside. Voice prompting is gated behind Microsoft approval and consent safeguards — a notable contrast with the open-default posture of VoxCPM2 and dots.tts above.
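For a rough sense of what the list price and the chunked long-form path mean in practice, the sketch below estimates cost from a character count and splits a script into chunks that carry the tail of the previous chunk forward as context. Only the $22 / 1M characters figure comes from the announcement. The chunking strategy, the function names, and the max_chars and carry parameters are assumptions for illustration, not the Azure Speech API, and the sketch does not settle whether carried-over context counts toward billed characters.

```python
from typing import List, Tuple

PRICE_PER_MILLION_CHARS_USD = 22.0  # list price quoted in the announcement


def estimate_cost_usd(char_count: int) -> float:
    """Estimated synthesis cost for a given number of input characters."""
    return char_count * PRICE_PER_MILLION_CHARS_USD / 1_000_000


def _tail(sentences: List[str], carry: int) -> str:
    """Join the last `carry` sentences; empty string when carry is 0."""
    return " ".join(sentences[-carry:]) if carry > 0 else ""


def chunk_with_carryover(
    sentences: List[str], max_chars: int, carry: int = 1
) -> List[Tuple[str, str]]:
    """Greedy sentence packing (hypothetical strategy).

    Returns (context, text) pairs, where context is the last `carry`
    sentences of the previous chunk and text is what to synthesise now.
    A single sentence longer than max_chars becomes its own chunk.
    """
    chunks: List[Tuple[str, str]] = []
    previous: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append((_tail(previous, carry), " ".join(current)))
            previous, current = current, [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append((_tail(previous, carry), " ".join(current)))
    return chunks


if __name__ == "__main__":
    script = ["Aa.", "Bb.", "Cc.", "Dd."]
    for context, text in chunk_with_carryover(script, max_chars=7):
        print(repr(context), "->", repr(text))
    print(estimate_cost_usd(500_000))
```

At this rate a 500,000-character audiobook script comes to about $11 of synthesis before any context overhead.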
livekit-agents 1.5.17 is the substantive in-window release on the agent SDK side. The highlights: a reasoning parameter passes through to OpenAI Realtime, so agent builders can opt into GPT-Realtime-2's higher reasoning tiers from inside the LiveKit session; AgentSession adds claim_user_turn (now private) to coordinate which agent participant owns the user turn in multi-agent setups; answering machine detection (AMD) is hardened against lost publishers, with realtime transcripts forwarded through; recorder-side fixes prevent close hangs and corrupt frame splits; and a model-literal update adds new LLM, STT, and TTS options. It is a lower-profile release than 1.5.12, which brought UserTurnLimitOptions, but still a meaningful one.
What is not here
No in-window dataset drop with a verifiable primary source. Pipecat v1.3.0 (the multi-agent worker framework with UIWorker for client-driven voice agents) shipped on May 29, in W23 — outside this window. W23 itself was skipped by the scheduled task; a separate backfill issue would need to capture it. Cartesia, Hume, Deepgram Voice Agent, and ElevenLabs Agents did not publish in-window technical changelog items.
Corrections to hello@fullduplex.ai.