A detailed guide to the asr built-in plugin: how it installs transcription dependencies, configures models, transcribes audio explicitly, and auto-augments inbound chat messages.

asr Plugin

asr is one of the clearest built-in examples showing that a plugin is more than a collection of actions. It can participate directly in the runtime pipeline.

Its core goal is simple:

turn voice or audio attachments into text
let the agent handle spoken input the same way it handles normal text

What it does

asr is responsible for three kinds of work:

install and configure speech-transcription dependencies
provide an explicit action for local audio transcription
automatically augment inbound chat messages by turning voice attachments into text sections

So it is both:

an explicit tool capability
a runtime middleware-style augmentation capability

Which plugin capabilities it uses

asr uses:

config
setup
usage
availability
actions
hooks.pipeline
system

The most important piece here is hooks.pipeline.

That means asr does not just wait for you to call `transcribe`. It can act on an inbound message before that message reaches the agent.

Why it is such an important reference

If you want to design a capability that works like this:

preprocess the input first
send the enriched result into the agent next
avoid killing the whole main flow when preprocessing fails

then asr is an excellent reference.

Its current strategy is:

if voice attachments exist, attempt transcription
if transcription succeeds, append text into the message augmentation area
if transcription fails, skip it on a best-effort basis and do not block the main flow

Typical runtime flow

In automatic augmentation mode, a spoken message usually flows like this:

an inbound chat message arrives with voice or audio attachments
the asr pipeline hook inspects the attachment list
each valid audio file is passed through the transcription dependency
successful transcription text is appended into pluginSections
the augmented message continues into the normal agent flow

The key point is that it does not replace the original message. It adds readable text context on top of it.
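
The flow above can be sketched as one small function. This is a minimal illustration, not the plugin's implementation: the `InboundMessage` and `Attachment` shapes, the `kind` field, the `[asr]` prefix, and the `Transcriber` signature are all assumptions made for this example. Only the `pluginSections` name comes from the description above.

```typescript
// Hypothetical shapes for illustration; the real plugin types may differ.
interface Attachment {
  kind: "voice" | "audio" | "image" | "file";
  path: string;
}

interface InboundMessage {
  text: string;
  attachments: Attachment[];
  pluginSections: string[];
}

// Assumed transcription dependency: resolves with text or rejects on failure.
type Transcriber = (audioPath: string) => Promise<string>;

async function augmentWithTranscripts(
  message: InboundMessage,
  transcribe: Transcriber,
): Promise<InboundMessage> {
  const sections = [...message.pluginSections];

  for (const attachment of message.attachments) {
    // Only voice or audio attachments are candidates for transcription.
    if (attachment.kind !== "voice" && attachment.kind !== "audio") continue;

    try {
      const transcript = (await transcribe(attachment.path)).trim();
      if (transcript.length > 0) {
        sections.push(`[asr] ${transcript}`);
      }
    } catch {
      // Best effort: skip this attachment and keep the main flow going.
    }
  }

  // The original text and attachments are returned unchanged.
  return { ...message, pluginSections: sections };
}
```

The function returns a copy with extra sections instead of editing `text`, which mirrors the point that augmentation adds context and does not replace the original message.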

Typical scenarios

Scenario 1: a Telegram voice message should be understood directly by the agent

This is the most typical use of asr.

The user sends a voice message, the system transcribes it first, and then the agent receives text.

That keeps planning, tool use, and reply generation in a text-oriented mental model.

Scenario 2: you already have one local audio file and only want a transcription result

You do not need the full chat pipeline. Just call the `transcribe` action explicitly.

Scenario 3: you want voice capability available, but do not want every inbound message auto-augmented

Keep the plugin enabled, but turn off `augmentMessage`.

Then explicit transcribe still works, while inbound messages are no longer automatically expanded with transcription text.
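
One way to express Scenario 3 is through the `configure` action. The sketch below assumes that `configure` accepts config fields such as `augmentMessage` directly in its payload; that payload shape is not confirmed by this guide, so check it against the plugin's actual schema. The `PluginRunner` interface is a hypothetical structural type that mirrors the `runAction` calls used elsewhere on this page.

```typescript
// Minimal structural type matching the runAction calls used in this guide.
interface PluginRunner {
  runAction(request: {
    plugin: string;
    action: string;
    payload?: Record<string, unknown>;
  }): Promise<unknown>;
}

// Assumption: `configure` takes config fields directly in its payload.
async function disableAutoAugmentation(plugins: PluginRunner): Promise<unknown> {
  return plugins.runAction({
    plugin: "asr",
    action: "configure",
    payload: { augmentMessage: false },
  });
}
```

With a registered agent, this would presumably be called as `await disableAutoAugmentation(agent.plugins)`.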

How to register it in the SDK

import { Agent, asrPlugin } from "@downcity/agent";

const agent = new Agent({
  id: "voice-agent",
  path: "/path/to/project",
  tools: {},
  plugins: [asrPlugin],
});

Key config fields

Config field	What it does	Notes
`injectPrompt`	whether to inject ASR guidance into the system prompt	recommended to keep on in most voice workflows
`augmentMessage`	whether to auto-augment inbound messages	the main switch for automatic transcription flow
`provider`	transcription provider	currently local-oriented by default
`modelId`	default model	for example `SenseVoiceSmall`
`language`	default language hint	for example `auto`, `zh`, `en`
`timeoutMs`	timeout limit	bounds transcription wait time
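
To make the fields concrete, here is a hedged sketch of how they might be typed and merged with overrides. The field names come from the table above. The value types, the `EXAMPLE_DEFAULTS` values (in particular `provider: "local"` and the 60 second timeout), and the `resolveAsrConfig` helper are assumptions for illustration, not the plugin's real defaults or API.

```typescript
// Field names follow the config table; the types are assumed.
interface AsrConfig {
  injectPrompt: boolean;
  augmentMessage: boolean;
  provider: string;
  modelId: string;
  language: string;
  timeoutMs: number;
}

// Placeholder defaults for this example only; not the plugin's real defaults.
const EXAMPLE_DEFAULTS: AsrConfig = {
  injectPrompt: true,
  augmentMessage: true,
  provider: "local",
  modelId: "SenseVoiceSmall",
  language: "auto",
  timeoutMs: 60_000,
};

// Merge user overrides onto the defaults and reject an unusable timeout.
function resolveAsrConfig(overrides: Partial<AsrConfig> = {}): AsrConfig {
  const config: AsrConfig = { ...EXAMPLE_DEFAULTS, ...overrides };
  if (!Number.isFinite(config.timeoutMs) || config.timeoutMs <= 0) {
    throw new RangeError("timeoutMs must be a positive number of milliseconds");
  }
  return config;
}
```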

Main actions

Action	What it does	When to use it
`status`	inspect config and dependency status	understand the current environment
`install`	install transcription dependencies	first-time speech setup
`configure`	update ASR config	day-to-day model or strategy tuning
`on`	enable the plugin, optionally install too	one-step speech enablement
`off`	disable the plugin	pause speech capability
`use`	switch the active model	after multiple models exist
`transcribe`	transcribe one local audio file	explicit call path
`models`	list supported models	build a picker or control plane
`doctor`	diagnose plugin and dependency status	when the environment feels incomplete
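
If you build a picker or control plane on top of these actions, it can help to pin the action names in a type so that typos fail at compile time. The sketch below is an assumption-level helper, not part of the SDK: `ASR_ACTIONS`, `isAsrAction`, and `asrRequest` are hypothetical names, and the request shape simply mirrors the `runAction` examples in this guide.

```typescript
// Action names as listed in the table above.
const ASR_ACTIONS = [
  "status", "install", "configure", "on", "off",
  "use", "transcribe", "models", "doctor",
] as const;

type AsrAction = (typeof ASR_ACTIONS)[number];

interface AsrRequest {
  plugin: "asr";
  action: AsrAction;
  payload?: Record<string, unknown>;
}

// Narrow an arbitrary string (for example from a UI picker) to a known action.
function isAsrAction(name: string): name is AsrAction {
  return (ASR_ACTIONS as readonly string[]).indexOf(name) !== -1;
}

// Build a request object; the payload key is omitted when none is given.
function asrRequest(action: AsrAction, payload?: Record<string, unknown>): AsrRequest {
  return payload === undefined
    ? { plugin: "asr", action }
    : { plugin: "asr", action, payload };
}
```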

Scenario-driven usage examples

Check current status

const status = await agent.plugins.runAction({
  plugin: "asr",
  action: "status",
});

Install transcription dependencies

const result = await agent.plugins.runAction({
  plugin: "asr",
  action: "install",
  payload: {
    modelIds: ["SenseVoiceSmall"],
    activeModel: "SenseVoiceSmall",
    installDeps: true,
  },
});

Switch the active model

const result = await agent.plugins.runAction({
  plugin: "asr",
  action: "use",
  payload: {
    modelId: "SenseVoiceSmall",
  },
});
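
Because a model should be installed before you switch to it, a guard around `use` can be useful. The sketch below relies on a loud assumption: that the `models` action resolves to an array of entries with `id` and `installed` fields. This guide does not document the result shape, so treat `ModelEntry`, `isModelEntryArray`, and `useModelIfInstalled` as hypothetical and adapt them to what `models` actually returns.

```typescript
// Minimal structural type matching the runAction calls used in this guide.
interface PluginRunner {
  runAction(request: {
    plugin: string;
    action: string;
    payload?: Record<string, unknown>;
  }): Promise<unknown>;
}

// Assumed result entry for the `models` action; the real shape may differ.
interface ModelEntry {
  id: string;
  installed: boolean;
}

function isModelEntryArray(value: unknown): value is ModelEntry[] {
  return (
    Array.isArray(value) &&
    value.every(
      (m) =>
        typeof m === "object" &&
        m !== null &&
        typeof (m as { id?: unknown }).id === "string" &&
        typeof (m as { installed?: unknown }).installed === "boolean",
    )
  );
}

// Switch models only when the target is reported as installed.
// Returns true if `use` was called, false if the switch was skipped.
async function useModelIfInstalled(
  plugins: PluginRunner,
  modelId: string,
): Promise<boolean> {
  const listed = await plugins.runAction({ plugin: "asr", action: "models" });
  if (!isModelEntryArray(listed)) return false;

  const entry = listed.find((m) => m.id === modelId);
  if (!entry || !entry.installed) return false;

  await plugins.runAction({ plugin: "asr", action: "use", payload: { modelId } });
  return true;
}
```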

Explicitly transcribe one local audio file

const result = await agent.plugins.runAction({
  plugin: "asr",
  action: "transcribe",
  payload: {
    audioPath: "/path/to/message.wav",
    language: "zh",
  },
});

List supported models

const models = await agent.plugins.runAction({
  plugin: "asr",
  action: "models",
});

Run dependency diagnosis

const diagnosis = await agent.plugins.runAction({
  plugin: "asr",
  action: "doctor",
});

What `pipeline` really does here

This is the key to understanding asr.

At the chat inbound augmentation point, it does not “rewrite the raw channel message.” Instead, it:

finds voice attachments
transcribes them
appends transcription text into the augmentation area

That means it is effectively telling the agent:

besides the original message, here is a readable text section you can use

This design is good for:

preserving the original message semantics
adding extra readable context
avoiding tight coupling to one specific channel implementation
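
To picture what "original message plus a readable text section" can look like from the agent's side, here is a small rendering sketch. How the runtime actually presents augmentation sections to the agent is not described in this guide, so the `renderForAgent` function, the heading text, and the bullet layout are purely illustrative assumptions.

```typescript
// Illustrative only: combine the untouched original text with plugin sections.
function renderForAgent(originalText: string, pluginSections: string[]): string {
  // With no sections, the agent sees exactly the original message.
  if (pluginSections.length === 0) return originalText;

  return [
    originalText,
    "",
    "Additional context from plugins:",
    ...pluginSections.map((section) => `- ${section}`),
  ].join("\n");
}
```

Because the sections are appended after the original text, nothing channel-specific has to change: any channel that can deliver attachments gets the same text context.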

Why transcription failure does not block the main flow

The current implementation intentionally uses a best-effort strategy.

That is practical because:

transcription dependencies can fail occasionally
one broken attachment does not necessarily mean the entire user message should fail

So asr prefers:

transcribe when possible
skip when it cannot
let the main message continue into the agent
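
A best-effort step also needs an upper bound on waiting, which is the role the `timeoutMs` config field plays. The sketch below shows one common way to turn both failures and hangs into a skip, using `Promise.race`. It is an assumption about how such a bound could be written, not the plugin's internal code, and `transcribeOrSkip` is a hypothetical name.

```typescript
// Resolve with the transcript, or with null when transcription fails
// or does not finish within timeoutMs.
async function transcribeOrSkip(
  transcribe: () => Promise<string>,
  timeoutMs: number,
): Promise<string | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`transcription timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });

  try {
    // Whichever settles first wins: the transcript or the timeout rejection.
    return await Promise.race([transcribe(), timeout]);
  } catch {
    // Failure or timeout: report "nothing to add" instead of throwing.
    return null;
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

Note that the race only stops waiting. It does not cancel the underlying transcription work, which would need its own cancellation mechanism.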

That suits production use, where an occasional transcription failure should cost one transcript rather than the whole message.

Why `system` still matters here

Even if asr can already auto-augment messages, the system prompt still matters.

The agent benefits from knowing:

the environment has speech-transcription capability
audio-related requests can use the asr action surface
augmented transcription text should be interpreted as supporting context
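
The three points above could be turned into a prompt section along these lines. The wording is invented for illustration; the text that asr actually injects is not shown in this guide. `buildAsrGuidance` is a hypothetical helper that only demonstrates gating the section on the `injectPrompt` setting.

```typescript
// Illustrative wording only; not the plugin's real injected prompt.
function buildAsrGuidance(injectPrompt: boolean): string | null {
  // When injection is off, contribute nothing to the system prompt.
  if (!injectPrompt) return null;

  return [
    "Speech transcription is available in this environment.",
    "Audio-related requests can use the asr plugin actions, such as transcribe.",
    "Treat transcription text attached to a message as supporting context for the original message.",
  ].join("\n");
}
```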

Its boundary relative to services

Although asr eventually calls models and local dependencies, its architectural role still looks like a plugin:

expose actions
participate in the pipeline
manage config

If your main concern is “capability that participates in the message flow,” a plugin is often a better fit than a service.

Important boundaries

Automatic augmentation only targets voice or audio attachments
Turning off `augmentMessage` does not disable explicit transcription
Before switching models with `use`, make sure the model is actually installed

When to use it as a reference

asr is a useful reference when you are working on:

inbound message augmentation
best-effort pipeline hooks
plugins that mix explicit actions and automatic runtime behavior
turning spoken input into text