asr Plugin
A detailed guide to the asr built-in plugin: how it installs transcription dependencies, configures models, transcribes audio explicitly, and auto-augments inbound chat messages.
asr is one of the clearest built-in examples of a plugin being more than a collection of actions: it can participate directly in the runtime pipeline.
Its core goal is simple:
- turn voice or audio attachments into text
- let the agent handle spoken input the same way it handles normal text
What it does
asr is responsible for three kinds of work:
- install and configure speech-transcription dependencies
- provide an explicit action for local audio transcription
- automatically augment inbound chat messages by turning voice attachments into text sections
So it is both:
- an explicit tool capability
- a runtime middleware-style augmentation capability
Which plugin capabilities it uses
asr uses:
- `config`
- `setup`
- `usage`
- `availability`
- `actions`
- `hooks.pipeline`
- `system`
The most important piece here is hooks.pipeline.
That means asr does not only wait for you to call transcribe. It can actively participate before the message reaches the agent.
Why it is such an important reference
If you want to design a capability that works like this:
- preprocess the input first
- send the enriched result into the agent next
- avoid killing the whole main flow when preprocessing fails
then asr is an excellent reference.
Its current strategy is:
- if voice attachments exist, attempt transcription
- if transcription succeeds, append text into the message augmentation area
- if transcription fails, skip it on a best-effort basis and do not block the main flow
Typical runtime flow
In automatic augmentation mode, a spoken message usually flows like this:
- an inbound chat message arrives with `voice` or `audio` attachments
- the `asr` pipeline hook inspects the attachment list
- each valid audio file is passed through the transcription dependency
- successful transcription text is appended into `pluginSections`
- the augmented message continues into the normal agent flow
The key point is that it does not replace the original message. It adds readable text context on top of it.
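The flow above can be sketched roughly as follows. This is a minimal illustration rather than the plugin's actual implementation: the `InboundMessage`, `Attachment`, and `PluginSection` shapes, the `kind` field, and the `augmentWithTranscripts` helper are hypothetical names. Only `pluginSections` and the voice/audio distinction come from the description above.

```typescript
// Hypothetical shapes for illustration; the real @downcity/agent types may differ.
interface Attachment {
  kind: string; // e.g. "voice", "audio", "image"
  path: string; // local path to the attachment file
}

interface PluginSection {
  plugin: string;
  text: string;
}

interface InboundMessage {
  text: string;
  attachments: Attachment[];
  pluginSections: PluginSection[];
}

// Any function that turns a local audio file into text.
type Transcriber = (audioPath: string) => Promise<string>;

async function augmentWithTranscripts(
  message: InboundMessage,
  transcribe: Transcriber,
): Promise<InboundMessage> {
  const sections: PluginSection[] = [];
  for (const attachment of message.attachments) {
    // Only voice or audio attachments are considered.
    if (attachment.kind !== "voice" && attachment.kind !== "audio") continue;
    try {
      const text = (await transcribe(attachment.path)).trim();
      if (text.length > 0) sections.push({ plugin: "asr", text });
    } catch {
      // Best effort: a failed transcription is skipped, not rethrown.
    }
  }
  // The original message is preserved; transcripts are added alongside it.
  return {
    ...message,
    pluginSections: [...message.pluginSections, ...sections],
  };
}
```

Returning a new object instead of mutating the input keeps the original message intact, which mirrors the "add context, do not replace" behavior described above.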
Typical scenarios
Scenario 1: a Telegram voice message should be understood directly by the agent
This is the most typical use of asr.
The user sends a voice message, the system transcribes it first, and then the agent receives text.
That keeps planning, tool use, and reply generation in a text-oriented mental model.
Scenario 2: you already have one local audio file and only want a transcription result
You do not need the full chat pipeline. Just call the transcribe action explicitly.
Scenario 3: you want voice capability available, but do not want every inbound message auto-augmented
Keep the plugin enabled, but turn off augmentMessage.
Then explicit transcribe still works, while inbound messages are no longer automatically expanded with transcription text.
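As a sketch of Scenario 3, the helper below turns off automatic augmentation through the `configure` action. The `runAction` call shape matches the examples on this page, but the payload is an assumption: this page does not document what `configure` accepts, so treat `{ augmentMessage: false }` as illustrative. `ActionRunner` and `disableAutoAugment` are hypothetical names.

```typescript
// Minimal structural type for the runAction call used on this page.
interface ActionRunner {
  runAction(request: {
    plugin: string;
    action: string;
    payload?: Record<string, unknown>;
  }): Promise<unknown>;
}

// Assumption: the configure action accepts a partial config object as payload.
function disableAutoAugment(plugins: ActionRunner): Promise<unknown> {
  return plugins.runAction({
    plugin: "asr",
    action: "configure",
    payload: { augmentMessage: false },
  });
}
```

With a real agent this would be called as `await disableAutoAugment(agent.plugins)`, after which explicit `transcribe` calls should continue to work as before.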
How to register it in the SDK

    import { Agent, asrPlugin } from "@downcity/agent";

    const agent = new Agent({
      id: "voice-agent",
      path: "/path/to/project",
      tools: {},
      plugins: [asrPlugin],
    });

Key config fields
| Config field | What it does | Notes |
|---|---|---|
| `injectPrompt` | whether to inject ASR guidance | recommended on in most voice workflows |
| `augmentMessage` | whether to auto-augment inbound messages | the main switch for automatic transcription flow |
| `provider` | transcription provider | currently local-oriented by default |
| `modelId` | default model | for example `SenseVoiceSmall` |
| `language` | default language hint | for example `auto`, `zh`, `en` |
| `timeoutMs` | timeout limit | bounds transcription wait time |
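
To make the table concrete, here is one way the config could be modeled in TypeScript. The field names come from the table. The types, the example default values (including the `"local"` provider string and the 60000 ms timeout), and the `mergeAsrConfig` helper are assumptions for illustration, not the plugin's documented schema.

```typescript
// Field names follow the table; types and defaults are assumptions.
interface AsrConfig {
  injectPrompt: boolean;
  augmentMessage: boolean;
  provider: string;
  modelId: string;
  language: string;
  timeoutMs: number;
}

// Placeholder values for illustration only.
const exampleDefaults: AsrConfig = {
  injectPrompt: true,
  augmentMessage: true,
  provider: "local",
  modelId: "SenseVoiceSmall",
  language: "auto",
  timeoutMs: 60000,
};

// Hypothetical helper: apply a partial update without mutating the base config.
function mergeAsrConfig(base: AsrConfig, patch: Partial<AsrConfig>): AsrConfig {
  const merged: AsrConfig = { ...base, ...patch };
  if (!(merged.timeoutMs > 0)) {
    throw new Error("timeoutMs must be a positive number");
  }
  return merged;
}
```

For example, `mergeAsrConfig(exampleDefaults, { augmentMessage: false })` yields a config where explicit transcription settings are unchanged and only the automatic flow is switched off.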
Main actions
| Action | What it does | When to use it |
|---|---|---|
| `status` | inspect config and dependency status | understand the current environment |
| `install` | install transcription dependencies | first-time speech setup |
| `configure` | update ASR config | day-to-day model or strategy tuning |
| `on` | enable the plugin, optionally install too | one-step speech enablement |
| `off` | disable the plugin | pause speech capability |
| `use` | switch the active model | after multiple models exist |
| `transcribe` | transcribe one local audio file | explicit call path |
| `models` | list supported models | build a picker or control plane |
| `doctor` | diagnose plugin and dependency status | when the environment feels incomplete |
Scenario-driven usage examples
Check current status

    const status = await agent.plugins.runAction({
      plugin: "asr",
      action: "status",
    });

Install transcription dependencies

    const result = await agent.plugins.runAction({
      plugin: "asr",
      action: "install",
      payload: {
        modelIds: ["SenseVoiceSmall"],
        activeModel: "SenseVoiceSmall",
        installDeps: true,
      },
    });

Switch the active model

    const result = await agent.plugins.runAction({
      plugin: "asr",
      action: "use",
      payload: {
        modelId: "SenseVoiceSmall",
      },
    });

Explicitly transcribe one local audio file

    const result = await agent.plugins.runAction({
      plugin: "asr",
      action: "transcribe",
      payload: {
        audioPath: "/path/to/message.wav",
        language: "zh",
      },
    });

List supported models

    const models = await agent.plugins.runAction({
      plugin: "asr",
      action: "models",
    });

Run dependency diagnosis

    const diagnosis = await agent.plugins.runAction({
      plugin: "asr",
      action: "doctor",
    });

What pipeline really does here
This is the key to understanding asr.
At the chat inbound augmentation point, it does not “rewrite the raw channel message.” Instead, it:
- finds voice attachments
- transcribes them
- appends transcription text into the augmentation area
That means it is effectively telling the agent:
- besides the original message, here is a readable text section you can use
This design is good for:
- preserving the original message semantics
- adding extra readable context
- avoiding tight coupling to one specific channel implementation
Why transcription failure does not block the main flow
The current implementation intentionally uses a best-effort strategy.
That is practical because:
- transcription dependencies can fail occasionally
- one broken attachment does not necessarily mean the entire user message should fail
So asr prefers:
- transcribe when possible
- skip when it cannot
- let the main message continue into the agent
That fits how production systems need to behave: a flaky dependency should degrade one feature, not drop the user's message.
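One way to combine a timeout such as `timeoutMs` with the best-effort policy is a small wrapper like the one below. This is a generic sketch, not the plugin's actual code: `bestEffort` is a hypothetical helper that resolves to `null` when the task fails or takes too long, so the caller can skip the attachment and continue.

```typescript
// Resolves with the task's value, or with null if the task rejects
// or does not settle within timeoutMs. It never rejects.
function bestEffort<T>(task: Promise<T>, timeoutMs: number): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    const timer = setTimeout(() => resolve(null), timeoutMs);
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(null);
      },
    );
  });
}
```

Note that this only bounds how long the caller waits. It does not cancel the underlying work, so a real implementation would likely also need to stop the transcription process when the timer fires.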
Why system still matters here
Even if asr can already auto-augment messages, the system prompt still matters.
The agent benefits from knowing:
- the environment has speech-transcription capability
- audio-related requests can use the `asr` action surface
- augmented transcription text should be interpreted as supporting context
Its boundary relative to services
Although asr eventually calls models and local dependencies, its architectural role still looks like a plugin:
- expose actions
- participate in the pipeline
- manage config
If your main concern is “capability that participates in the message flow,” a plugin is often a better fit than a service.
Important boundaries
Automatic augmentation only targets voice or audio attachments
It is not a universal attachment preprocessing framework.
Turning off augmentMessage does not disable explicit transcription
Those are different behaviors.
Before switching models with use, make sure the model is actually installed
Otherwise config can point to a model that the environment is not ready to run.
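A small guard can make this check explicit. In the sketch below, the `use` payload matches the example earlier on this page. How the list of installed model ids is obtained is left to the caller, because the result shapes of `status` and `models` are not documented here. `ActionRunner` and `useModelIfInstalled` are hypothetical names.

```typescript
// Minimal structural type for the runAction call used on this page.
interface ActionRunner {
  runAction(request: {
    plugin: string;
    action: string;
    payload?: Record<string, unknown>;
  }): Promise<unknown>;
}

// Assumption: installedModelIds is supplied by the caller, for example
// derived from whatever the status or models action reports.
async function useModelIfInstalled(
  plugins: ActionRunner,
  installedModelIds: string[],
  modelId: string,
): Promise<boolean> {
  if (installedModelIds.indexOf(modelId) === -1) {
    return false; // Not installed: leave the active model unchanged.
  }
  await plugins.runAction({
    plugin: "asr",
    action: "use",
    payload: { modelId },
  });
  return true;
}
```

A `false` result is a signal to run `install` for that model first, rather than pointing the config at a model the environment cannot run.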
When to use it as a reference
Use asr as a reference for:
- inbound message augmentation
- best-effort pipeline hooks
- plugins that mix explicit actions and automatic runtime behavior
- turning spoken input into text
Related docs
- web Plugin: A detailed guide to the web built-in plugin, how it selects providers, installs dependencies, switches web modes, injects system prompts, and when to use web-access or agent-browser
- tts Plugin: A detailed guide to the tts built-in plugin, how it installs local synthesis dependencies, configures model and output parameters, generates audio files, and returns reusable audio tags