Evaluating synthetic speech for audio description: insights from RNIB’s latest study
As synthetic speech continues to advance, its potential applications in accessibility are growing. For blind and partially sighted audiences, high-quality audio description (AD) plays a crucial role in making film and television more accessible.
But can synthetic voices deliver an AD experience comparable to human narration? To explore this question, RNIB worked with several broadcasters and the Acoustics Research Centre at the University of Salford to assess how synthetic AD performs across different content genres.
The study combined qualitative and quantitative methods to capture audience perceptions. Participants evaluated synthetic AD across six sample clips from entertainment, drama, sport, factual programming, and documentaries. The research revealed a nuanced picture: synthetic voices were considered acceptable for clarity and consistency, particularly in documentary and factual genres where the primary function is delivering information. However, they struggled to match human narrators in conveying emotion, spontaneity, and contextual sensitivity—key factors that enhance engagement in entertainment-focused content.
A key concern raised by participants was the importance of matching AD tone to the content’s emotional and cultural context. Technical factors, such as sound mixing and audio ducking (balancing AD against background audio), were also highlighted as areas requiring particular attention when synthetic speech is used. The feedback suggests that synthetic AD, while promising, must meet the minimum quality standards set out in the report to ensure a positive viewing experience.
As a next step, RNIB is proposing a set of industry-wide benchmarks for synthetic AD, covering intelligibility, prosody, and emotional adaptability. Further pilot projects will explore how synthetic voices can be optimised for different types of content, and whether the choice between manual and automated ducking affects the viewer experience. While synthetic speech offers scalability, this study reinforces the continuing value of human narration, particularly for emotionally rich storytelling. With further research and development, synthetic voices could complement, but not replace, human AD, providing greater accessibility without compromising quality.