
asr Plugin

A detailed guide to the asr built-in plugin: how it installs transcription dependencies, configures models, transcribes audio explicitly, and auto-augments inbound chat messages.

asr is one of the clearest built-in examples of a plugin that is more than a collection of actions: it can participate directly in the runtime pipeline.

Its core goal is simple:

  • turn voice or audio attachments into text
  • let the agent handle spoken input the same way it handles normal text

What it does

asr is responsible for three kinds of work:

  1. install and configure speech-transcription dependencies
  2. provide an explicit action for local audio transcription
  3. automatically augment inbound chat messages by turning voice attachments into text sections

So it is both:

  • an explicit tool capability
  • a runtime middleware-style augmentation capability

Which plugin capabilities it uses

asr uses:

  • config
  • setup
  • usage
  • availability
  • actions
  • hooks.pipeline
  • system

The most important piece here is hooks.pipeline.

That means asr does not only wait for you to call transcribe. It can actively participate before the message reaches the agent.

Why it is such an important reference

If you want to design a capability that behaves like this:

  • preprocess the input first
  • send the enriched result into the agent next
  • avoid killing the whole main flow when preprocessing fails

then asr is an excellent reference.

Its current strategy is:

  • if voice attachments exist, attempt transcription
  • if transcription succeeds, append text into the message augmentation area
  • if transcription fails, skip it on a best-effort basis and do not block the main flow

Typical runtime flow

In automatic augmentation mode, a spoken message usually flows like this:

  1. an inbound chat message arrives with voice or audio attachments
  2. the asr pipeline hook inspects the attachment list
  3. each valid audio file is passed through the transcription dependency
  4. successful transcription text is appended into pluginSections
  5. the augmented message continues into the normal agent flow

The key point is that it does not replace the original message. It adds readable text context on top of it.
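The flow above can be sketched as a small best-effort function. This is a minimal illustration, not the plugin's actual implementation: the `Attachment` and `InboundMessage` shapes, the `Transcriber` type, the `[voice transcript]` label, and the idea that `pluginSections` is an array of strings are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real Downcity types may differ.
interface Attachment {
  kind: "voice" | "audio" | "image" | "file";
  path: string;
}

interface InboundMessage {
  text: string;
  attachments: Attachment[];
  pluginSections: string[]; // assumed to be a list of text sections
}

// Hypothetical stand-in for the transcription dependency.
type Transcriber = (audioPath: string) => Promise<string>;

// Best-effort augmentation: transcribe what we can, skip what fails,
// and never throw, so the main message flow is not blocked.
async function augmentWithTranscripts(
  message: InboundMessage,
  transcribe: Transcriber,
): Promise<InboundMessage> {
  const sections = [...message.pluginSections];
  for (const attachment of message.attachments) {
    // Only voice or audio attachments are considered.
    if (attachment.kind !== "voice" && attachment.kind !== "audio") continue;
    try {
      const text = (await transcribe(attachment.path)).trim();
      if (text.length > 0) sections.push(`[voice transcript] ${text}`);
    } catch {
      // Swallow the error on purpose: a failed attachment is simply skipped.
    }
  }
  // The original text and attachments are returned unchanged.
  return { ...message, pluginSections: sections };
}
```

Note that the function returns a copy with extra sections rather than editing the message text, which mirrors the "add context, do not replace" behavior described above.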

Typical scenarios

Scenario 1: a Telegram voice message should be understood directly by the agent

This is the most typical use of asr.

The user sends a voice message, the system transcribes it first, and then the agent receives text.

That keeps planning, tool use, and reply generation in a text-oriented mental model.

Scenario 2: you already have one local audio file and only want a transcription result

You do not need the full chat pipeline. Just call the transcribe action explicitly.

Scenario 3: you want voice capability available, but do not want every inbound message auto-augmented

Keep the plugin enabled, but turn off augmentMessage.

Then explicit transcribe still works, while inbound messages are no longer automatically expanded with transcription text.

How to register it in the SDK

import { Agent, asrPlugin } from "@downcity/agent";

const agent = new Agent({
  id: "voice-agent",
  path: "/path/to/project",
  tools: {},
  plugins: [asrPlugin],
});

Key config fields

Config field | What it does | Notes
injectPrompt | whether to inject ASR guidance | recommended on in most voice workflows
augmentMessage | whether to auto-augment inbound messages | the main switch for automatic transcription flow
provider | transcription provider | currently local-oriented by default
modelId | default model | for example SenseVoiceSmall
language | default language hint | for example auto, zh, en
timeoutMs | timeout limit | bounds transcription wait time
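As a rough picture of how these fields fit together, the sketch below writes them out as a typed object. Only the field names come from the table; the TypeScript types, the `"local"` provider value, and the timeout value are assumptions for illustration, and `AsrConfigSketch` is a hypothetical name rather than a type exported by the SDK.

```typescript
// Field names come from the config table; types and values are assumptions.
interface AsrConfigSketch {
  injectPrompt: boolean;
  augmentMessage: boolean;
  provider: string;
  modelId: string;
  language: string;
  timeoutMs: number;
}

// Example matching Scenario 3: voice capability stays available,
// but inbound messages are not auto-augmented.
const exampleConfig: AsrConfigSketch = {
  injectPrompt: true,
  augmentMessage: false, // explicit transcribe only
  provider: "local", // assumed literal value
  modelId: "SenseVoiceSmall",
  language: "auto",
  timeoutMs: 60000, // assumed value, in milliseconds
};
```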

Main actions

Action | What it does | When to use it
status | inspect config and dependency status | understand the current environment
install | install transcription dependencies | first-time speech setup
configure | update ASR config | day-to-day model or strategy tuning
on | enable the plugin, optionally install too | one-step speech enablement
off | disable the plugin | pause speech capability
use | switch the active model | after multiple models exist
transcribe | transcribe one local audio file | explicit call path
models | list supported models | build a picker or control plane
doctor | diagnose plugin and dependency status | when the environment feels incomplete

Scenario-driven usage examples

Check current status

const status = await agent.plugins.runAction({
  plugin: "asr",
  action: "status",
});

Install transcription dependencies

const result = await agent.plugins.runAction({
  plugin: "asr",
  action: "install",
  payload: {
    modelIds: ["SenseVoiceSmall"],
    activeModel: "SenseVoiceSmall",
    installDeps: true,
  },
});

Switch the active model

const result = await agent.plugins.runAction({
  plugin: "asr",
  action: "use",
  payload: {
    modelId: "SenseVoiceSmall",
  },
});

Explicitly transcribe one local audio file

const result = await agent.plugins.runAction({
  plugin: "asr",
  action: "transcribe",
  payload: {
    audioPath: "/path/to/message.wav",
    language: "zh",
  },
});

List supported models

const models = await agent.plugins.runAction({
  plugin: "asr",
  action: "models",
});

Run dependency diagnosis

const diagnosis = await agent.plugins.runAction({
  plugin: "asr",
  action: "doctor",
});

What the pipeline hook really does here

This is the key to understanding asr.

At the chat inbound augmentation point, it does not “rewrite the raw channel message.” Instead, it:

  • finds voice attachments
  • transcribes them
  • appends transcription text into the augmentation area

That means it is effectively telling the agent:

  • besides the original message, here is a readable text section you can use

This design is good for:

  • preserving the original message semantics
  • adding extra readable context
  • avoiding tight coupling to one specific channel implementation

Why transcription failure does not block the main flow

The current implementation intentionally uses a best-effort strategy.

That is practical because:

  • transcription dependencies can fail occasionally
  • one broken attachment does not necessarily mean the entire user message should fail

So asr prefers:

  • transcribe when possible
  • skip when it cannot
  • let the main message continue into the agent
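One way to picture "transcribe when possible, skip when it cannot" is a wrapper that turns both errors and timeouts into a skip. The helper below is a hypothetical sketch, not code from the plugin; `transcribeOrSkip` and its `run` callback are names invented here, and `timeoutMs` simply mirrors the config field of the same name.

```typescript
// Resolves to the transcription text, or to null on error or timeout.
// Hypothetical helper: the real plugin may bound waiting time differently.
async function transcribeOrSkip(
  run: () => Promise<string>,
  timeoutMs: number,
): Promise<string | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    // Whichever settles first wins: the transcription or the timeout.
    return await Promise.race([run(), timeout]);
  } catch {
    // A failing dependency becomes a skip instead of an exception.
    return null;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A caller can treat `null` as "no transcript for this attachment" and let the message continue. Note that this sketch stops waiting after the timeout but does not cancel the underlying transcription work.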

That matches how production systems need to behave: a failed transcription step should cost the agent some context, not the whole message.

Why system still matters here

Even if asr can already auto-augment messages, the system prompt still matters.

The agent benefits from knowing:

  • the environment has speech-transcription capability
  • audio-related requests can use the asr action surface
  • augmented transcription text should be interpreted as supporting context

Its boundary relative to services

Although asr eventually calls models and local dependencies, its architectural role still looks like a plugin:

  • expose actions
  • participate in the pipeline
  • manage config

If your main concern is “capability that participates in the message flow,” a plugin is often a better fit than a service.

Important boundaries

Automatic augmentation only targets voice or audio attachments

It is not a universal attachment preprocessing framework.

Turning off augmentMessage does not disable explicit transcription

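Those are different behaviors: augmentMessage only controls automatic augmentation of inbound messages, while the transcribe action stays available as long as the plugin is enabled.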

Before switching models with use, make sure the model is actually installed

Otherwise config can point to a model that the environment is not ready to run.
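A guard like the following can make that check explicit before switching. It is only a sketch under stated assumptions: the request shape follows the runAction calls shown earlier, but the `ModelInfo` result shape (an array of entries with `id` and `installed`) is hypothetical, so inspect what the models action really returns before relying on it.

```typescript
// Hypothetical result shape: the real `models` action may return something else.
interface ModelInfo {
  id: string;
  installed: boolean;
}

// Mirrors the { plugin, action, payload } request shape used in the examples.
type RunAction = (req: {
  plugin: string;
  action: string;
  payload?: unknown;
}) => Promise<unknown>;

// Switches the active model only if it is reported as installed.
// Returns true when the switch was attempted, false when it was skipped.
async function useIfInstalled(
  runAction: RunAction,
  modelId: string,
): Promise<boolean> {
  const models = (await runAction({ plugin: "asr", action: "models" })) as ModelInfo[];
  const ready = models.some((m) => m.id === modelId && m.installed);
  if (!ready) return false;
  await runAction({ plugin: "asr", action: "use", payload: { modelId } });
  return true;
}
```

In an application, `runAction` could be a thin wrapper such as `(req) => agent.plugins.runAction(req)`; when the guard returns false, running install first is the safer path.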

When to use it as a reference

Use asr as a reference for:

  • inbound message augmentation
  • best-effort pipeline hooks
  • plugins that mix explicit actions and automatic runtime behavior
  • turning spoken input into text