VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Comprehensive evaluation of voice style adaptation capabilities in spoken language models, assessing not just what models say, but how they say it.

Overview

VStyle Framework
Hierarchical evaluation process for speech generation. The evaluator first checks textual adherence; a response that fails this check receives a score of 1. A response that passes is then judged on stylistic adherence: non-compliance scores 2, partial compliance scores 3, and full compliance advances to the naturalness stage. At this final stage, high naturalness earns a score of 5. The progressive design reflects the incremental requirements placed on speech models: content correctness first, then stylistic appropriateness, and finally natural, human-like speech capable of sustaining realistic interaction.
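The scoring cascade described above can be sketched as a small function. This is an illustrative sketch, not the benchmark's actual evaluation code: the names `StyleAdherence` and `vstyle_score` are hypothetical, and the score of 4 for a fully style-compliant response with low naturalness is an assumption inferred from the 1 to 5 scale, since the description only states the score for high naturalness.

```python
from enum import Enum


class StyleAdherence(Enum):
    """Hypothetical labels for the stylistic adherence stage."""
    NONE = "none"
    PARTIAL = "partial"
    FULL = "full"


def vstyle_score(text_ok: bool, style: StyleAdherence, natural: bool) -> int:
    """Return a 1-5 score following the hierarchical process described above."""
    # Stage 1: textual adherence. Failure ends the evaluation with a score of 1.
    if not text_ok:
        return 1
    # Stage 2: stylistic adherence. Only full compliance advances further.
    if style is StyleAdherence.NONE:
        return 2
    if style is StyleAdherence.PARTIAL:
        return 3
    # Stage 3: naturalness. High naturalness earns 5.
    # ASSUMPTION: low naturalness earns 4 (not stated explicitly in the text).
    return 5 if natural else 4


if __name__ == "__main__":
    print(vstyle_score(True, StyleAdherence.PARTIAL, natural=True))
```

Note that later stages are never consulted once an earlier stage fails, so a response with wrong content scores 1 regardless of how natural it sounds.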

Models Leaderboard

Examples

Main Categories

Subcategories

Results Submission

Have a new model to evaluate?

Join the VStyle leaderboard! Check out our Evaluation Guide to get started.

We provide detailed instructions on how to run evaluations, format your results, and submit them for inclusion in our benchmark leaderboard.