Signals — Fullduplex

What happened this week

Four preprints and one dataset release worth forwarding to a researcher inbox. The headline item is Alibaba's Qwen3.5-Omni technical report, which formalises the architecture behind the late-March release and makes the scaling story public. Beyond that, the week is steady: an agentic spoken-dialogue system, two evaluation papers, and a low-resource S2ST dataset.

The method paper — Qwen3.5-Omni

Qwen3.5-Omni scales to hundreds of billions of parameters across a Hybrid Attention MoE for both the Thinker and Talker stacks, extends context to 256k tokens, and introduces ARIA to align speech and text units at generation time. The numbers are lab-internal — Qwen team claims SOTA on 215 audio and audio-visual subtasks — so treat the headlines as suggestive until third-party evals land. Open weights have not yet been published; access is via DashScope.

Agents on speech — VoxMind

VoxMind (ACL 2026) is an end-to-end spoken dialogue model with tool use. The interesting bits are the 470-hour AgentChat dataset and the Multi-Agent Dynamic Tool Management scheme that decouples inference latency from tool-inventory size. Reported task completion moves from 34.88 to 74.57 percent on their eval, with code and data released.

Evaluation — MINT-Bench and MoVE

Two papers push evaluation forward rather than capability:

MINT-Bench is a hierarchical, ten-language benchmark for instruction-following TTS. It separates content consistency, instruction-following, and perceptual quality, and finds that current frontier commercial systems still lead overall, but open-source models are competitive in localized settings like Chinese. Useful as a public leaderboard when choosing a controllable TTS stack.
MoVE tackles non-verbal vocalization preservation in speech-to-speech translation with a Mixture-of-LoRA-Experts router. The takeaway for anyone working on expressive S2ST is the data-efficiency result: 30 minutes of curated data was enough to reach 76 percent NV reproduction on English-Chinese, versus ≤14 percent for prior S2ST baselines.

Dataset — NaijaS2ST

NaijaS2ST releases roughly 50 hours per language of parallel speech across Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. It is a benchmark-plus-dataset release, and its empirical finding that audio-LLM few-shot beats fine-tuned cascaded and end-to-end systems for S-to-T but not for S-to-S translation is the kind of gap statement low-resource S2ST needed.

Corrections to hello@fullduplex.ai.

Signals · 2026-W17.

What happened this week

The method paper — Qwen3.5-Omni

Agents on speech — VoxMind

Evaluation — MINT-Bench and MoVE

Dataset — NaijaS2ST