What happened this week
The headline event is on the platform layer: OpenAI's Realtime API hits GA on May 7, and the three new audio models shipped alongside it move both the reasoning ceiling and the cost floor for production voice agents. On the foundational side, a tightly-themed paper cluster argues about where the speech-LLM modality gap actually lives, and a 0.1B-parameter open omni release pushes on the lower end of the size–capability frontier.
The platform headline
OpenAI Realtime API GA — GPT-Realtime-2 / Translate / Whisper is the single most consequential shipment in the window. GPT-Realtime-2 brings GPT-5-class reasoning into the realtime path, the context window expands from 32K to 128K tokens, and OpenAI reports a 15.2 pp lift on Big Bench Audio over Realtime-1.5 and a 13.8 pp lift on Audio MultiChallenge at the higher reasoning tier. GPT-Realtime-Translate translates 70+ input languages into 13 output languages at $0.034 per minute, and GPT-Realtime-Whisper offers streaming transcription at $0.017 per minute. The GA flag is the part to take seriously: teams that were holding back on production deploys because the surface kept moving are now on a stable contract.
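For orientation, here is a minimal session sketch. The model identifier follows the naming above and is our assumption, and the event shapes mirror the pre-GA beta surface; both may differ under the GA contract.

```python
# Minimal Realtime session sketch. Assumptions: the "gpt-realtime-2" model id
# (taken from the naming above) and beta-era event shapes; check GA docs.
import asyncio
import json
import os

import websockets  # pip install "websockets>=14" (uses additional_headers)

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # Configure the session; with the 128K window, long system prompts
        # and tool schemas no longer have to be trimmed for the realtime path.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "You are a concise voice agent.",
            },
        }))
        print(json.loads(await ws.recv())["type"])  # expect a session.* ack

asyncio.run(main())
```

At the listed rates, an hour of streaming transcription prices out around $1.02 and an hour of translation around $2.04.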
On the open-source agent stack, livekit-agents 1.5.8 is the more surgical release. The headline change is a barge-in cooldown window for corrections — a small but pointed addition that lets the agent distinguish a real user takeover from a quick mid-utterance self-correction, the exact failure mode that adaptive interruption handling left on the table in 1.5.0. The release also moves Fish Audio to a websocket inference path for lower latency, adds Soniox as a TTS plugin and a new Inworld model, and ships a long tail of fixes to AMD, warm transfer, and OpenAI Realtime error handling.
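To make the mechanic concrete, here is a toy policy in the shape we read the release notes. The class, thresholds, and names are our own illustration of one plausible reading, not livekit-agents' actual API.

```python
# Illustrative barge-in cooldown: after a real interruption, brief user speech
# inside the cooldown window is treated as a self-correction, not a takeover.
from dataclasses import dataclass

@dataclass
class BargeInPolicy:
    cooldown_s: float = 2.0        # window after an interruption in which we stay skeptical
    correction_max_s: float = 1.0  # user speech this short reads as a self-correction

    def __post_init__(self) -> None:
        self._last_interrupt_t = float("-inf")

    def classify(self, now: float, user_speech_s: float) -> str:
        """Decide whether user speech during agent playback is a real takeover."""
        in_cooldown = (now - self._last_interrupt_t) < self.cooldown_s
        if in_cooldown and user_speech_s <= self.correction_max_s:
            return "correction"  # brief burst right after a barge-in; resume playback
        self._last_interrupt_t = now
        return "takeover"        # genuine barge-in; cancel the agent's turn
```

The point of the two thresholds is exactly the distinction the release notes name: a one-second "uh, I mean Tuesday" should not cost the agent its whole turn.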
Foundational models and representations
MiniMind-O is a 0.1B-scale open omni model that accepts text, speech, and image and returns both text and streaming speech. The release is unusually complete: weights, code, and Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio are all published, so the full interaction loop is inspectable in one repo. Architecturally it sticks to a Thinker–Talker split with frozen SenseVoice-Small and SigLIP2 encoders, lightweight MLP projectors, and an autoregressive eight-codebook Mimi buffer; the technical contribution is a set of scale-critical design choices for small omni models rather than a leaderboard result.
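A shape-level sketch of that layout, with stand-in dimensions and a two-layer transformer in place of the real encoders and LM; nothing here is the repo's code.

```python
# Thinker–Talker skeleton: MLP projectors feed frozen-encoder features into a
# shared "thinker" LM; a "talker" head scores the eight Mimi codebooks.
import torch
import torch.nn as nn

class ThinkerTalkerSketch(nn.Module):
    def __init__(self, d_audio=512, d_vision=768, d_model=768,
                 n_codebooks=8, codebook_size=2048):
        super().__init__()
        # Lightweight MLP projectors; the real encoders (SenseVoice-Small,
        # SigLIP2) stay frozen and only these adapters train.
        self.audio_proj = nn.Sequential(
            nn.Linear(d_audio, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.vision_proj = nn.Sequential(
            nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.thinker = nn.TransformerEncoder(layer, num_layers=2)  # stand-in LM
        # One classifier per codebook; a real talker decodes these autoregressively.
        self.talker_heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_codebooks))

    def forward(self, audio_feats, vision_feats):
        tokens = torch.cat(
            [self.audio_proj(audio_feats), self.vision_proj(vision_feats)], dim=1)
        hidden = self.thinker(tokens)
        # Logits per codebook at each position: (B, T, n_codebooks, vocab).
        return torch.stack([head(hidden) for head in self.talker_heads], dim=2)
```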
WavCube tackles the long-standing split between semantic SSL features and acoustic reconstruction features by training a continuous latent that supports understanding, reconstruction, and generation jointly. At 8x dimensional compression, WavCube approaches WavLM on SUPERB, reaches state-of-the-art zero-shot TTS quality, and converges faster in training. It is one of the more concrete steps in the window toward a truly unified speech backbone instead of two stitched-together stacks.
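A toy rendering of the one-latent, three-jobs idea: a continuous bottleneck with 8x channel compression trained against both a semantic probe and a reconstruction loss. Everything here is illustrative, not WavCube's training code.

```python
# Continuous (quantizer-free) bottleneck shared by an understanding head and
# a reconstruction head; dims and heads are assumptions for illustration.
import torch.nn as nn

D_FEAT = 1024           # e.g. SSL-feature width
D_LATENT = D_FEAT // 8  # the 8x dimensional compression named above

class UnifiedLatentSketch(nn.Module):
    def __init__(self, n_classes=100):
        super().__init__()
        self.encode = nn.Linear(D_FEAT, D_LATENT)
        self.decode = nn.Linear(D_LATENT, D_FEAT)   # reconstruction path
        self.probe = nn.Linear(D_LATENT, n_classes) # understanding path

    def forward(self, x):                 # x: (B, T, D_FEAT)
        z = self.encode(x)                # continuous latent, no codebook
        return self.decode(z), self.probe(z), z

def joint_loss(model, x, labels):
    # Training both objectives through one latent is what forces it to carry
    # semantic and acoustic information at once.
    recon, logits, _ = model(x)
    return nn.functional.mse_loss(recon, x) + \
           nn.functional.cross_entropy(logits.mean(dim=1), labels)
```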
TextPro-SLM makes a complementary argument: prior work has mostly tried to close the speech-LLM modality gap from the output side, but the dominant remaining bottleneck is on the input side. TextPro-SLM pairs a WhisperPro encoder that produces synchronized text tokens and prosody embeddings with an LLM backbone trained to keep its original semantic capabilities while learning paralinguistic understanding, and claims the lowest modality gap among leading SLMs at 3B and 7B scales with only ~1,000 hours of audio.
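Concretely, the input-side idea looks something like the sketch below: each synchronized text token arrives with a prosody vector, and the two are fused before entering the LLM. Additive fusion is our simple stand-in; the paper may combine them differently.

```python
# Input-side fusion sketch: text-token embeddings plus a projected prosody
# residual. Names and dimensions are assumptions, not TextPro-SLM's code.
import torch.nn as nn

class ProsodyFusedInput(nn.Module):
    def __init__(self, vocab=32000, d_model=2048, d_prosody=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.pros_proj = nn.Linear(d_prosody, d_model)

    def forward(self, text_ids, prosody):  # (B, T), (B, T, d_prosody)
        # The backbone sees its familiar text embeddings plus a paralinguistic
        # residual, which is what lets it keep its original semantics.
        return self.tok_emb(text_ids) + self.pros_proj(prosody)
```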
TTS and evaluation
X-Voice is a 0.4B multilingual zero-shot voice cloning model trained on a 420K-hour corpus with the International Phonetic Alphabet as a unified representation. A two-stage training paradigm eliminates the reliance on prompt-text transcripts at inference, the architecture extends F5-TTS with dual-level language-ID injection and decoupled CFG scheduling, and the authors report cross-lingual cloning comparable to billion-scale systems like Qwen3-TTS with all resources open-sourced.
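One common way to read "decoupled CFG scheduling" is separate guidance weights for the text condition and the speaker/language condition, each on its own schedule over sampling steps. The sketch below shows that formulation; it is our assumption, not X-Voice's implementation.

```python
# Decoupled classifier-free guidance: independent weights per condition,
# combined against the same unconditional prediction.
import torch

def decoupled_cfg(eps_uncond: torch.Tensor, eps_text: torch.Tensor,
                  eps_spk: torch.Tensor, w_text: float, w_spk: float):
    """Each eps_* is the model output under one conditioning configuration."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_spk * (eps_spk - eps_uncond))

def linear_ramp(step: int, n_steps: int, w_max: float) -> float:
    # e.g. ramp text guidance up over steps while holding speaker guidance flat.
    return w_max * step / max(n_steps - 1, 1)
```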
MSEB on audio-native LLMs from Google evaluates leading LLMs — including Gemini and GPT family members — across the eight core MSEB capabilities. The clearest finding is that a meaningful modality gap still separates audio-native LLMs from specialized cascaded pipelines on both performance and robustness, and the paper resists declaring an optimal architecture: the choice between audio-native and cascaded designs depends on the latency, cost, and reasoning-depth assumptions baked into each deployment. This pairs naturally with the TextPro-SLM and WavCube papers above, which try to close exactly the gaps MSEB is measuring.
What is not here
No in-window dataset drop and no reclassification surfaced with a primary source we could verify. Cartesia, Hume, and Deepgram did not publish anything in scope this week; the lab-blog bucket is carried entirely by OpenAI.
Corrections to hello@fullduplex.ai.