VoiceGenEval: A Benchmark for Controllable Speech Generation

A comprehensive evaluation of controllable speech generation in spoken language models, assessing not just what models say but how they say it.

Overview

VoiceGenEval Framework
Overview of the proposed VoiceGenEval framework. The upper part shows instruction examples in two languages across four data categories: Acoustic Attribute, Natural Language Instruction, Role-Playing, and Explicit Empathy. The lower part illustrates the three-level evaluation process (content, style, and naturalness), conducted with a large audio language model (LALM).

Models Leaderboard

Examples


Results Submission

Have a new model to evaluate?

Join the VoiceGenEval leaderboard! Check out our Evaluation Guide to get started.

We provide detailed instructions on how to run evaluations, format your results, and submit them for inclusion in our benchmark leaderboard.