tts Plugin
A detailed guide to the tts built-in plugin, how it installs local synthesis dependencies, configures model and output parameters, generates audio files, and returns reusable audio tags
tts is one of the clearest built-in examples of an explicit output-generation plugin.
You give it text, and it produces:
- a local audio file
- a reusable audio file tag
That allows a plain text answer to expand naturally into a spoken output artifact.
What it does
tts is mainly responsible for:
- installing local text-to-speech dependencies
- managing default model, format, and speed configuration
- generating audio files from input text
- returning the result as a file path plus file tag
Its center of gravity is different from asr.
asr is input augmentation. tts is output artifact generation.
Which plugin capabilities it uses
tts uses:
- config
- setup
- usage
- availability
- actions
- system
It does not use pipeline hooks, and it does not use runtime HTTP.
That means:
- tts does not automatically jump into the main message chain
- it is best used through explicit calls when audio is actually needed
Typical scenarios
Scenario 1: the user wants the final answer sent as voice
This is the most typical tts workflow.
Generate the text answer first, then call tts.synthesize, and send the resulting audio file to the target channel.
Scenario 2: you want one default output format for a channel
For example, a channel might accept only wav, or you might default to flac for smaller files.
That kind of requirement fits plugin config better than repeating the same override in every synthesis request.
Scenario 3: you want multi-language narration or different voices
Then one synthesis request can override:
- language
- voice
- speed
without changing the project-wide defaults.
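The override behaviour can be pictured as a merge in which per-request values win over the configured defaults. The sketch below is illustrative only: SynthesisOptions and resolveSynthesisOptions are hypothetical names, not part of the SDK, and the plugin may resolve its defaults differently internally.

```typescript
// Hypothetical option shape, based on the config fields named on this page.
interface SynthesisOptions {
  format: "wav" | "flac";
  speed: number;
  language?: string;
  voice?: string;
}

// Build the effective options for one request.
// Per-request values win; anything not overridden falls back to the default.
// Neither input object is mutated.
function resolveSynthesisOptions(
  defaults: SynthesisOptions,
  overrides: Partial<SynthesisOptions> = {},
): SynthesisOptions {
  return {
    format: overrides.format ?? defaults.format,
    speed: overrides.speed ?? defaults.speed,
    language: overrides.language ?? defaults.language,
    voice: overrides.voice ?? defaults.voice,
  };
}
```

The resolved object could then be spread into the payload of a synthesize call, leaving the project-wide config untouched.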
How to register it in the SDK
import { Agent, ttsPlugin } from "@downcity/agent";

const agent = new Agent({
  id: "speech-agent",
  path: "/path/to/project",
  tools: {},
  plugins: [ttsPlugin],
});

Key config fields
| Config field | What it does | Notes |
|---|---|---|
| provider | active provider | currently local by default |
| modelId | default model | for example qwen3-tts-0.6b |
| format | default output format | wav or flac |
| speed | default speech speed | 1 means normal rate |
| language | default language hint | can be overridden per call |
| voice | default voice id | can be overridden per call |
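As a reading aid, the fields in the table can be written down as a TypeScript shape with a small sanity check. This is a sketch, not the plugin's actual schema: the TtsConfigSketch type, the checkTtsConfig helper, and the rule that speed must be a positive finite number are assumptions made for illustration.

```typescript
// Hypothetical mirror of the config table; not the plugin's real schema.
interface TtsConfigSketch {
  provider: string;   // "local" is the current default per the table
  modelId: string;    // for example "qwen3-tts-0.6b"
  format: string;     // the table lists "wav" or "flac"
  speed: number;      // 1 means normal rate
  language?: string;  // can be overridden per call
  voice?: string;     // can be overridden per call
}

// Return a list of human-readable problems; an empty list means the config
// passed every check in this sketch.
function checkTtsConfig(config: TtsConfigSketch): string[] {
  const problems: string[] = [];
  if (config.format !== "wav" && config.format !== "flac") {
    problems.push(`unsupported format: ${config.format}`);
  }
  // ASSUMPTION: speed must be a positive finite number.
  if (!Number.isFinite(config.speed) || config.speed <= 0) {
    problems.push(`invalid speed: ${config.speed}`);
  }
  if (config.modelId.trim() === "") {
    problems.push("modelId must not be empty");
  }
  return problems;
}
```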
Main actions
| Action | What it does | When to use it |
|---|---|---|
| status | inspect config and synthesizer status | confirm environment first |
| doctor | diagnose dependency readiness | when model or runtime setup looks broken |
| models | list available models | build a picker or control surface |
| install | install synthesis dependencies | first-time TTS setup |
| configure | update default config | change model, format, speed |
| on | enable the plugin, optionally install too | one-step capability enablement |
| off | disable the plugin | pause speech synthesis |
| use | switch the active model | multi-model workflows |
| synthesize | turn text into an audio file | the core action |
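If you call these actions from several places, a small typed request builder can catch misspelled action names at compile time. The sketch below uses the action names from the table and the request shape used by the runAction calls on this page; TTS_ACTIONS, isTtsAction, and ttsRequest are hypothetical helpers, not SDK exports.

```typescript
// Action names copied from the table above.
const TTS_ACTIONS = [
  "status", "doctor", "models", "install", "configure",
  "on", "off", "use", "synthesize",
] as const;

type TtsAction = (typeof TTS_ACTIONS)[number];

interface TtsActionRequest {
  plugin: "tts";
  action: TtsAction;
  payload?: Record<string, unknown>;
}

// Runtime check for action names that arrive as plain strings (e.g. from a UI).
function isTtsAction(name: string): name is TtsAction {
  return (TTS_ACTIONS as readonly string[]).indexOf(name) !== -1;
}

// Build a request object; the payload key is omitted when no payload is given.
function ttsRequest(
  action: TtsAction,
  payload?: Record<string, unknown>,
): TtsActionRequest {
  const request: TtsActionRequest = { plugin: "tts", action };
  if (payload !== undefined) {
    request.payload = payload;
  }
  return request;
}
```

A request built this way could be passed to agent.plugins.runAction, assuming that method accepts the same object shape used in the examples on this page.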
Scenario-driven usage examples
Check current status
const status = await agent.plugins.runAction({
  plugin: "tts",
  action: "status",
});

List available models
const models = await agent.plugins.runAction({
  plugin: "tts",
  action: "models",
});

Install TTS dependencies
const result = await agent.plugins.runAction({
  plugin: "tts",
  action: "install",
  payload: {
    modelIds: ["qwen3-tts-0.6b"],
    activeModel: "qwen3-tts-0.6b",
    format: "wav",
    installDeps: true,
  },
});

Switch the active model
const result = await agent.plugins.runAction({
  plugin: "tts",
  action: "use",
  payload: {
    modelId: "qwen3-tts-0.6b",
  },
});

Generate a speech file
const result = await agent.plugins.runAction({
  plugin: "tts",
  action: "synthesize",
  payload: {
    text: "Hello, welcome to Downcity",
    language: "en",
    format: "wav",
    speed: 1,
  },
});

On success, the result usually contains:
- outputPath
- fileTag
- bytes
The fileTag matters because it makes downstream audio delivery much easier.
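A caller will typically want to confirm those fields are present before handing the audio on. The sketch below assumes the three fields sit at the top level of the returned object, with outputPath and fileTag as strings and bytes as a number; the real envelope may differ (for example, nested under another key). SynthesizeResultSketch, isSynthesizeResult, and appendAudioTag are hypothetical, and appending the tag to the outgoing text is only one possible way to hand it downstream.

```typescript
// ASSUMPTION: the fields are top-level and have these primitive types.
interface SynthesizeResultSketch {
  outputPath: string;
  fileTag: string;
  bytes: number;
}

// Type guard: true only when all three fields are present with expected types.
function isSynthesizeResult(value: unknown): value is SynthesizeResultSketch {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  return (
    typeof record["outputPath"] === "string" &&
    typeof record["fileTag"] === "string" &&
    typeof record["bytes"] === "number"
  );
}

// Append the file tag on its own line after the text answer.
// If the result does not look like a successful synthesis, return the text unchanged.
function appendAudioTag(answerText: string, result: unknown): string {
  return isSynthesizeResult(result)
    ? `${answerText}\n${result.fileTag}`
    : answerText;
}
```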
Run dependency diagnosis
const diagnosis = await agent.plugins.runAction({
  plugin: "tts",
  action: "doctor",
});

Why this is an explicit generation plugin
Unlike the automatic augmentation shape of asr, the right mental model for tts is:
- call it when you explicitly need an audio artifact
That makes it closer to:
- a file generator
- an output transformer
- a delivery-preparation layer before channel sending
What system does here
The tts system text reminds the agent that:
- the environment has TTS capability
- tts.synthesize should be used for spoken output requests
- successful generation returns both an audio path and a <file type="audio">...</file> tag
That helps the agent move beyond text-only answers when the user asks for “a voice version.”
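Code that post-processes agent output may need to find these tags in a block of text. The sketch below assumes the tag appears literally as <file type="audio">...</file> with no extra attributes, and it makes no assumption about what sits between the opening and closing tag; extractAudioTags is a hypothetical helper.

```typescript
// Return the inner content of every <file type="audio">...</file> tag in the
// text, in order of appearance. Tags with any other type value are ignored.
function extractAudioTags(text: string): string[] {
  const pattern = /<file type="audio">([\s\S]*?)<\/file>/g;
  const bodies: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    bodies.push(match[1] ?? "");
  }
  return bodies;
}
```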
Its boundary relative to channel delivery
tts is responsible for turning text into an audio file.
It is not responsible for:
- how one IM channel uploads audio
- how one platform renders or sends an audio message
The cleaner mental model is:
- tts generates the artifact
- the channel layer delivers the artifact
Its boundary relative to services
tts may rely on local models and a Python runtime underneath, but architecturally it still fits better as a plugin than as a service:
- explicit invocation
- structured result return
- centralized config
If your main problem is “generate an audio artifact that can be delivered,” a plugin is the more natural boundary.
Important boundaries
When the plugin is unavailable, synthesis should not be treated as usable
If availability is false, synthesize will fail.
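One way to respect that boundary is to check status before synthesizing. The sketch below accepts any object with a runAction method shaped like the calls on this page, so it can be exercised with a fake runner. It assumes the status result exposes a boolean under the key available, which is a guess; adjust it to whatever the status action actually returns. ActionRunner and synthesizeIfAvailable are hypothetical names.

```typescript
// Minimal runner shape, modeled on the runAction calls shown on this page.
interface ActionRunner {
  runAction(request: {
    plugin: string;
    action: string;
    payload?: Record<string, unknown>;
  }): Promise<unknown>;
}

// Resolves to the synthesize result, or to null when the plugin reports
// itself unavailable.
// ASSUMPTION: the status result carries a boolean under the key "available".
async function synthesizeIfAvailable(
  runner: ActionRunner,
  text: string,
): Promise<unknown> {
  const status = await runner.runAction({ plugin: "tts", action: "status" });
  const available =
    typeof status === "object" &&
    status !== null &&
    (status as Record<string, unknown>)["available"] === true;
  if (!available) {
    return null; // skip synthesis instead of letting the call fail
  }
  return runner.runAction({
    plugin: "tts",
    action: "synthesize",
    payload: { text },
  });
}
```

Returning null keeps the caller free to fall back to a text-only answer.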
format and speed can be overridden per request
Do not change the whole project default just to satisfy a one-off narration request.
stderr does not always mean failure
The current implementation allows non-fatal stderr from the Python runner and returns a summary as extra context.
So:
- stderr output does not automatically mean synthesis failed
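In practice, success is better decided from the result itself than from whether stderr was empty. The sketch below assumes a result object with an optional outputPath and an optional stderr summary string. The outputPath name appears on this page, while stderrSummary, RunOutcomeSketch, and classifyOutcome are hypothetical, and the classification rule is illustrative only.

```typescript
// ASSUMPTION: field names and optionality are illustrative, not the real envelope.
interface RunOutcomeSketch {
  outputPath?: string;
  stderrSummary?: string;
}

type Verdict = "ok" | "ok-with-warnings" | "failed";

// Decide the verdict from whether an output file was produced.
// Non-empty stderr only downgrades a success to "ok-with-warnings".
function classifyOutcome(outcome: RunOutcomeSketch): Verdict {
  const produced =
    typeof outcome.outputPath === "string" && outcome.outputPath.length > 0;
  if (!produced) {
    return "failed";
  }
  const noisy =
    typeof outcome.stderrSummary === "string" &&
    outcome.stderrSummary.trim().length > 0;
  return noisy ? "ok-with-warnings" : "ok";
}
```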
When to use it as a reference
Use tts as a reference for:
- explicit artifact-generation plugins
- model and output-parameter configuration
- returning file paths and reusable file tags
- turning text output into audio output
Related docs
asr Plugin
A detailed guide to the asr built-in plugin, how it installs transcription dependencies, configures models, transcribes audio explicitly, and auto-augments inbound chat messages
workboard Plugin
A detailed guide to the workboard built-in plugin, how it exposes structured work snapshots through runtime HTTP, and why it belongs more to observability and control-plane capability