TTS Caching

TTS (Text-to-Speech) caching is an administrator-controlled feature that lets a conversational AI system check incoming text against a stored TTS cache and return matching pre-generated audio instead of synthesizing a new speech response. By reusing previously synthesized audio, TTS caching improves system performance and reduces latency in voice interactions.

Because TTS vendors typically charge per character synthesized, caching repeated phrases or common responses can significantly reduce operational costs. TTS caching is disabled by default and has a configurable cache lifetime (24 hours by default). Once an administrator enables it, bot developers can control it further on a per-message basis.
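The lookup flow can be sketched as a minimal in-memory cache keyed on the message text, with a TTL matching the configurable cache lifetime. This is an illustrative sketch, not Cognigy's actual implementation; all names (`TTSCache`, `speak`, `synthesize`) are assumptions:

```python
import hashlib
import time


class TTSCache:
    """Minimal in-memory TTS cache keyed on a hash of the input text."""

    def __init__(self, ttl_seconds=24 * 3600):  # default lifetime: 24 hours
        self.ttl = ttl_seconds
        self._store = {}  # key -> (audio_bytes, expiry_timestamp)

    def _key(self, text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text):
        entry = self._store.get(self._key(text))
        if entry is None:
            return None
        audio, expires_at = entry
        if time.time() >= expires_at:      # lifetime elapsed: evict the entry
            del self._store[self._key(text)]
            return None
        return audio

    def put(self, text, audio):
        self._store[self._key(text)] = (audio, time.time() + self.ttl)


def speak(text, cache, synthesize):
    """Return cached audio if present; otherwise synthesize and cache it."""
    audio = cache.get(text)
    if audio is None:
        audio = synthesize(text)  # the paid vendor call happens only on a miss
        cache.put(text, audio)
    return audio
```

With this shape, the second request for the same text is served from the cache, so the vendor is billed only once per phrase per cache lifetime.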

Key Points

  • Checks text input against cached TTS audio and returns the matching pre-generated output
  • Reduces TTS processing time and improves voice response latency
  • Cuts costs by avoiding repeated charges for the same text content from TTS vendors
  • Default cache lifetime is 24 hours, but is fully configurable
  • Disabled by default; enabled by administrators and further controlled per message by bot developers

Why It Matters

In high-volume contact center environments, the same phrases — greetings, hold messages, menu prompts — are synthesized thousands of times per day. TTS caching eliminates redundant generation of these repeated outputs, delivering faster response times for customers while reducing the per-character API costs that accumulate at scale with TTS providers.
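A rough back-of-the-envelope calculation makes the savings concrete. The price and call volume below are assumptions for illustration, not vendor quotes:

```python
# Illustrative cost comparison; the per-character price and volumes are
# assumed values, not actual vendor pricing.
PRICE_PER_MILLION_CHARS = 16.00   # assumed USD per 1M characters synthesized
chars_per_prompt = 200            # e.g. a standard greeting
plays_per_day = 10_000

# Without caching: every playback triggers a fresh, billed synthesis call.
uncached_daily_cost = (
    chars_per_prompt * plays_per_day * PRICE_PER_MILLION_CHARS / 1_000_000
)

# With a 24-hour cache lifetime: the prompt is synthesized once per day.
cached_daily_cost = chars_per_prompt * 1 * PRICE_PER_MILLION_CHARS / 1_000_000

print(f"uncached: ${uncached_daily_cost:.2f}/day, "
      f"cached: ${cached_daily_cost:.4f}/day")
```

Under these assumed numbers, a single greeting drops from $32.00 to a fraction of a cent per day, and the gap widens with every additional repeated prompt.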

Best-Practice Perspective

Cognigy recommends enabling TTS caching for all static or frequently repeated voice responses, such as welcome messages, menu options, and standard confirmation phrases. Cache lifetime should be tuned to balance freshness with cost savings, and bot developers should selectively disable caching for dynamic, personalized, or time-sensitive messages where pre-cached audio would be inappropriate.
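The per-message control described above can be modeled as a flag that bypasses the cache for dynamic or personalized messages. The flag name and function below are illustrative, not documented Cognigy fields:

```python
import time


def speak(text, cache, synthesize, use_cache=True, ttl=24 * 3600):
    """Return audio for `text`; bypass the cache when use_cache is False.

    `cache` maps text -> (audio, expiry); `synthesize` is the vendor call.
    Dynamic or personalized messages should pass use_cache=False so stale
    pre-cached audio is never played and never stored.
    """
    now = time.time()
    if use_cache and text in cache and cache[text][1] > now:
        return cache[text][0]             # fresh cache hit: no vendor call
    audio = synthesize(text)
    if use_cache:
        cache[text] = (audio, now + ttl)  # static prompts get cached
    return audio
```

A static menu prompt would be synthesized once and then served from cache, while a personalized greeting like "Hi Alice" would be synthesized fresh on every turn.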