speechbot

Speech-driven bots and services (e.g. Telegram) with pluggable speech matching.

Project description

The speechbot package is a framework for building speech-first chatbots with a block tree. It is designed for bots where a user moves between options by speaking a word.

The bot is configured with:

a block tree that defines steps and word labels for moving between blocks
a speech engine that matches incoming voice messages to those labels

This repository also contains a Telegram service implementation so the bot can be run as a chat bot.

At a high level, speechbot runs a voice menu. A JSON file named tree.json lists the conversation steps and the spoken words that move between them.

CLI guide

This section focuses on running the bot and configuring tree.json.

Quick start

Install the package.
Set TELEGRAM_BOT_TOKEN.
Save a minimal tree.json.
Run the bot and complete the setup prompts for missing audio.

Minimal tree.json:

{
  "root_id": "root",
  "blocks": [
    {
      "id": "root",
      "prompt_text": "Say hello or help",
      "edges": [
        {"word": "hello", "to": "hello"},
        {"word": "help", "to": "help"}
      ]
    },
    {
      "id": "hello",
      "prompt_text": "Last word: {last_word}",
      "edges": []
    },
    {
      "id": "help",
      "prompt_text": "Help menu",
      "edges": []
    }
  ]
}

Run from the directory that contains tree.json:

export TELEGRAM_BOT_TOKEN="123456:ABC..."
python3 -m speechbot telegram --tree tree.json --data-dir data \
    --speech-engine speechmatching --debug-users <admin_user_id>

Expected messages (example):

Bot: <audio message for "Say hello or help">
Bot: Say hello or help
User: <voice message>
Bot: Heard: hello
Bot: Last word: hello

Requirements

Install the package from PyPI [pypi]:

pip install speechbot

The Telegram service requires a Telegram bot token. To get a token, create a bot with BotFather [botfather] in Telegram and copy the token it returns. BotFather can be found by searching for @BotFather in Telegram. Store the token in TELEGRAM_BOT_TOKEN before starting the bot.

The default speech engine uses the speechmatching package. Typical requirements include:

ffmpeg available on PATH (Telegram voice messages are compressed audio files);
the Python dependencies for speechmatching;
Docker access for the default Docker-based speech model.

Docker image

The CLI is available as the Docker image aukesch/speechbot. This can be used to run the bot without installing the Python package locally.

Example run:

docker pull aukesch/speechbot
docker run --rm \
    -e TELEGRAM_BOT_TOKEN="123456:ABC..." \
    -v "$PWD":/work \
    -w /work \
    --entrypoint speechbot \
    aukesch/speechbot \
    telegram --tree tree.json --data-dir data --speech-engine speechmatching \
        --debug-users <admin_user_id>

Example Dockerfile:

FROM aukesch/speechbot
COPY . /app
WORKDIR /app
ENTRYPOINT ["speechbot"]
CMD ["telegram", "--tree", "tree.json", "--data-dir", "data", \
     "--speech-engine", "speechmatching"]

Run the bot

Export the token, then start the Telegram service:

export TELEGRAM_BOT_TOKEN="123456:ABC..."
python3 -m speechbot telegram --tree examples/basic/tree.json \
    --data-dir data --speech-engine speechmatching \
    --debug-users <admin_user_id>

On startup, speechbot checks that all required assets exist. If word recordings, prompt recordings, or referenced media files are missing, the bot starts a guided setup process in Telegram to collect them.

While collecting recordings and uploads, you can send multiple voice messages per word recording. For prompt recordings and media files, sending another upload replaces the previous one. The /next command moves on, /skip moves on without saving, /status shows remaining items, and /done finishes setup.

When setup is active, the bot switches to a temporary setup tree. When --debug-users is set, setup is limited to those user identifiers. Only one user per chat can run setup at a time; other users see a busy message until setup completes. When all required assets exist, the bot returns to the main tree and normal interaction continues.

User commands

/start resets the state to the tree root.
/undo restores the previous state snapshot.

Text messages cannot be used to navigate the tree. They only replay the prompt for the current block.

Example setup transcript

This is a short example of the prompt recording setup process:

Bot: Prompt setup (1/2). Send a voice message (or audio file) reading this text aloud:
Bot:
Bot: Hello there
Bot:
Bot: Send /next to continue (after at least 1 recording), /skip to move on without saving, /status for progress. Sending another recording replaces the previous one.
User: <voice message>
Bot: Saved prompt recording. Send another recording to replace it, or /next for the next prompt.
User: /next
Bot: Prompt setup (2/2). Send a voice message (or audio file) reading this text aloud:
Bot:
Bot: Welcome
Bot:
Bot: Send /next to continue (after at least 1 recording), /skip to move on without saving, /status for progress. Sending another recording replaces the previous one.
User: <voice message>
Bot: Saved prompt recording. All required prompt recordings exist. Send another recording to replace it, or /done (or /next) to finish setup.
User: /done
Bot: Prompt setup complete. Continuing...

Shop builder

The shop builder is an interactive admin process that runs inside Telegram and writes a new tree to data/shop/tree.json:

python3 examples/shop_builder/main.py telegram --data-dir data

When the shop is published (/publish), the builder also creates a .zip package (data/shop/shop.zip) containing tree.json and all referenced shop media under data/. The builder sends that zip back via Telegram.

To run the generated shop directly from that zip, start speechbot without a local tree.json and upload the zip as a document:

python3 -m speechbot telegram --data-dir data --debug-users <admin_user_id>

The generated tree can also be run directly by referencing its path:

python3 -m speechbot telegram --tree data/shop/tree.json \
    --data-dir data --speech-engine speechmatching

CLI reference

The CLI accepts one required argument and several optional arguments.

Required argument:

service: service name. Only telegram is supported.

Optional arguments:

--token: service token. If not set, the TELEGRAM_BOT_TOKEN environment variable is used.
--poll-timeout-s: long polling timeout in seconds.
--tree: path to the block tree JSON file. If not set, the BOT_TREE environment variable is used; if that is also unset, the default is tree.json.
--data-dir: root data directory. If not set, the BOT_DATA environment variable is used; if that is also unset, the default is data.
--debug-users: comma-separated Telegram user identifiers that are allowed to run debug commands. If not set, the BOT_DEBUG_USERS environment variable is used.
--speech-engine: matcher engine under speechbot/matchers. If not set, the BOT_SPEECH_ENGINE environment variable is used; if that is also unset, the default is speechmatching.
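The fallback order described above (command-line flag, then environment variable, then default) can be sketched as follows. This is only an illustration and not the code in speechbot/cli.py; the helper names resolve_option and parse_debug_users are hypothetical, and treating user identifiers as integers is an assumption.

```python
import os


def resolve_option(cli_value, env_name, default=None, environ=None):
    """Pick the flag value, else the environment variable, else the default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return environ.get(env_name, default)


def parse_debug_users(value):
    """Turn a comma-separated string such as '123,456' into a set of ints."""
    if not value:
        return set()
    return {int(part) for part in value.split(",") if part.strip()}


# Hypothetical usage: no --tree flag given, BOT_TREE not set.
tree_path = resolve_option(None, "BOT_TREE", default="tree.json", environ={})
debug_users = parse_debug_users("123,456")
```

Passing environ explicitly keeps the helper easy to test without touching the real process environment.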

Tree.json reference

The tree JSON file defines blocks and how to move between blocks. Each block has a list of edges that map a word label to a destination block id.

Each block is a step in the conversation. The prompt_text is shown when the block is active. Each edge is a spoken option that moves to another block. Each word label needs at least one recording under data/recordings/<word>/. A minimal example appears in Quick start.
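Because every edge names a destination block id, it can help to check a tree before starting the bot. The sketch below is not part of speechbot; validate_tree is a hypothetical helper that only looks at the root_id, blocks, id, edges, word and to keys shown in the Quick start example.

```python
def validate_tree(tree):
    """Return a list of problems found in a parsed tree.json dict."""
    problems = []
    blocks = tree.get("blocks", [])
    ids = [block.get("id") for block in blocks]
    known = set(ids)
    if len(known) != len(ids):
        problems.append("duplicate block ids")
    if tree.get("root_id") not in known:
        problems.append("root_id {!r} is not a block id".format(tree.get("root_id")))
    for block in blocks:
        for edge in block.get("edges", []):
            if edge.get("to") not in known:
                problems.append(
                    "block {!r}: edge {!r} points to unknown block {!r}".format(
                        block.get("id"), edge.get("word"), edge.get("to")
                    )
                )
    return problems


tree = {
    "root_id": "root",
    "blocks": [
        {"id": "root", "prompt_text": "Say hello",
         "edges": [{"word": "hello", "to": "hello"}]},
        {"id": "hello", "prompt_text": "Last word: {last_word}", "edges": []},
    ],
}
print(validate_tree(tree))  # prints []
```

A tree that routes to blocks implemented in Python would need those block ids added to the known set before this check is meaningful.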

Structure

The top level keys are:

root_id: id of the entry block.
blocks: list of block objects.

Each block supports:

id: unique block id.
prompt_text: text shown to the user when the block is active. If empty, the default prompt "Say one of the available options." is used.
edges: list of edges in the form {"word": "...", "to": "..."}.
on_enter: optional section for actions and context updates.

The on_enter section supports:

text: a text message that is sent when entering the block.
photo: list of photo paths to send.
video: list of video paths to send.
audio: list of audio paths to send.
context: map to set context keys.
context_inc: map of context keys to numeric increments.
context_delete: list of keys to remove from context.
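As an illustration of the three context fields, the sketch below applies them to a plain dict. It is not the package's implementation: the function name apply_on_enter_context is hypothetical, and both the order (set, then increment, then delete) and the choice to start missing counters at 0 are assumptions.

```python
def apply_on_enter_context(context, on_enter):
    """Return a new context dict with context, context_inc and context_delete applied."""
    updated = dict(context)
    # context: set keys to the given values.
    updated.update(on_enter.get("context", {}))
    # context_inc: add the given amount to each key (assumed to start at 0).
    for key, amount in on_enter.get("context_inc", {}).items():
        updated[key] = updated.get(key, 0) + amount
    # context_delete: drop keys, ignoring any that are absent.
    for key in on_enter.get("context_delete", []):
        updated.pop(key, None)
    return updated


on_enter = {
    "context": {"item": "tea"},
    "context_inc": {"count": 2},
    "context_delete": ["tmp"],
}
print(apply_on_enter_context({"count": 1, "tmp": "x"}, on_enter))
# prints {'count': 3, 'item': 'tea'}
```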

If multiple photos or videos are provided, the service sends them as an album. Text fields in prompt_text and on_enter.text are formatted with text.format(**context).

Media paths are resolved from the current working directory, or relative to the tree JSON file location if the file is not found.
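The lookup order for media paths can be sketched like this. The function resolve_media_path and its cwd parameter are hypothetical and exist only to make the two-step lookup concrete; the package may handle details such as absolute paths differently.

```python
from pathlib import Path


def resolve_media_path(media_path, tree_path, cwd=None):
    """Try the working directory first, then the directory that holds the tree file."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    candidate = base / media_path
    if candidate.exists():
        return candidate
    # Fall back to a path relative to the tree JSON file location.
    return Path(tree_path).resolve().parent / media_path
```

For example, with a tree at shop/tree.json and a photo entry of media/tea.jpg, the sketch returns media/tea.jpg under the working directory if that file exists there, and shop/media/tea.jpg otherwise.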

Files on disk

Path	Purpose
tree.json	Block tree definition
data/state/	Per-user state JSON files
data/recordings/<word>/	Word recordings for matching
data/prompts/prompt_<sha256>/	Spoken recordings of prompt text
data/inbox/	Downloaded service media
data/media/	Media referenced by tree.json

Prompt recordings send a spoken version of prompt text. They are not used for word matching.
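Given the layout above, a quick way to see which word labels still lack a recording is to compare the edge words in the tree with the folders under data/recordings/. The helper below is a hypothetical sketch, not the startup check that speechbot itself performs, and it treats any file in the folder as a recording.

```python
from pathlib import Path


def missing_word_recordings(tree, data_dir="data"):
    """List edge words that have no file under <data_dir>/recordings/<word>/."""
    words = {
        edge["word"]
        for block in tree.get("blocks", [])
        for edge in block.get("edges", [])
    }
    missing = []
    for word in sorted(words):
        folder = Path(data_dir) / "recordings" / word
        has_file = folder.is_dir() and any(p.is_file() for p in folder.iterdir())
        if not has_file:
            missing.append(word)
    return missing
```

Calling missing_word_recordings(tree, "data") on the Quick start tree with an empty data directory would return ['hello', 'help'].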

State and debug

Context

Per-user context is stored in the user state file and is carried across blocks. The bot updates some keys automatically:

last_word
from_block_id
block_id

Text can include simple formatting expressions using text.format(**context). If formatting fails (for example because a key is missing), no user-visible error is raised and the original string is used unchanged.
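The fallback behaviour can be sketched in a few lines. The function name render is hypothetical, and the set of exceptions caught here is an assumption about which failures the package treats as "formatting fails".

```python
def render(text, context):
    """Format text with the context, keeping the original string on failure."""
    try:
        return text.format(**context)
    except (KeyError, IndexError, ValueError):
        # Missing key, positional field, or malformed format string.
        return text


print(render("Last word: {last_word}", {"last_word": "hello"}))  # prints Last word: hello
print(render("Last word: {last_word}", {}))  # prints Last word: {last_word}
```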

Undo history

The bot keeps a limited history of previous states. Users can restore the most recent state using the /undo command.
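One way to picture a limited undo history is a bounded stack of state snapshots. The class below is a hypothetical sketch; the limit of 20 and the use of deep copies are assumptions, not values taken from speechbot.

```python
import copy
from collections import deque


class StateHistory:
    """Keep at most `limit` snapshots; the oldest is dropped when full."""

    def __init__(self, limit=20):
        self._snapshots = deque(maxlen=limit)

    def push(self, state):
        # Store a copy so later changes to the live state do not alter history.
        self._snapshots.append(copy.deepcopy(state))

    def undo(self, current):
        # Return the most recent snapshot, or the current state if none is left.
        if not self._snapshots:
            return current
        return self._snapshots.pop()

    def __len__(self):
        return len(self._snapshots)


history = StateHistory(limit=2)
history.push({"block_id": "root"})
state = {"block_id": "hello"}
state = history.undo(state)
print(state)  # prints {'block_id': 'root'}
```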

Debug commands

Debug commands can be enabled for specific Telegram user identifiers, for example with:

python3 -m speechbot telegram --tree examples/basic/tree.json \
    --data-dir data --debug-users 123,456

When enabled, the following commands are available:

/debug shows the raw state information.
/where shows the current block id.
/context shows the current context map.
/history shows the history length.

[pypi] https://pypi.tw.martin98.com/project/speechbot/

[botfather] https://core.telegram.org/bots/features#botfather

Developer guide

This section is for extending speechbot in Python.

Custom blocks

Blocks can be implemented in Python by using speechbot.blocks.CustomBlock and setting a block_id class attribute. Custom blocks run inside the bot like normal blocks, but they should override handle to implement custom logic. The handle method receives the incoming message, the user state, and a runtime object. When running under the standard bot, that runtime object is the speechbot.bot.Bot instance.

Custom blocks are used by example code such as the shop builder, where the interactive logic is written in Python rather than purely in JSON.

Example:

from speechbot.blocks import CustomBlock
from speechbot.protocol import OutgoingText

class HelloBlock(CustomBlock):
    block_id = 'hello'

    def __init__(self, prompt_text='Say hello'):
        super().__init__(prompt_text)

    def handle(self, incoming, state, runtime):
        return ([OutgoingText(
            chat_id=incoming.chat_id,
            text='Hello from Python.'
        )], None)

    def on_enter_actions(self, incoming, state, runtime):
        return [OutgoingText(
            chat_id=incoming.chat_id,
            text='Entering the hello block.'
        )]

Custom services

The CLI only uses the Telegram service. To use another service, write a custom service that connects the platform to the bot message handler.

A custom service needs to:

receive messages and map them to Incoming classes from speechbot.protocol
include service, chat_id, user_id and message_id along with any metadata in meta
download media to disk and set path for IncomingVoice, IncomingAudio, IncomingPhoto, IncomingVideo and IncomingDocument
call the bot message handler and run every returned Outgoing action
map OutgoingMediaGroup to an album when supported, or send each item

Example:

from speechbot.protocol import IncomingText, OutgoingText

class DummyService:
    def __init__(self):
        self._message_handler = None

    def run(self, message_handler):
        self._message_handler = message_handler
        incoming = IncomingText(
            service='dummy',
            chat_id=1,
            user_id=1,
            message_id=1,
            data='hello'
        )
        actions = message_handler(incoming)
        self._send(actions)

    def _send(self, actions):
        for action in actions:
            if isinstance(action, OutgoingText):
                self._send_text(action.chat_id, action.text)

    def _send_text(self, chat_id, text):
        print('send to {}: {}'.format(chat_id, text))

Most services will also need a run loop like speechbot.services.telegram.TelegramService.run. The built-in CLI does not know about new services, so create a custom entrypoint or extend speechbot/cli.py to add a new service option.

Speech engines

speechmatching is the default matcher, but additional engines can be added. Create a module under speechbot/matchers that implements SpeechEngine from speechbot.matchers. The engine must provide add_recording and match. If debug output is needed, implement set_debug and get_last_debug similar to the speechmatching engine.

Add the engine to load_speech_engine in speechbot/matchers/__init__.py so the --speech-engine option can find it.

Example:

from speechbot.matchers import SpeechEngine

class DummyEngine(SpeechEngine):
    def __init__(self):
        self._labels = set()

    def add_recording(self, identifier, path):
        self._labels.add(identifier)

    def match(self, voice_path, identifiers=None):
        if identifiers is None:
            selected_identifiers = list(self._labels)
        else:
            selected_identifiers = [
                i for i in identifiers if i in self._labels
            ]
        if len(selected_identifiers) == 0:
            return None
        return selected_identifiers[0]

Project details

Release history

1.0.0 (this version): Dec 24, 2025
0.1.0: Dec 24, 2025

Download files

Source Distribution

speechbot-1.0.0.tar.gz (44.8 kB)

Uploaded Dec 24, 2025 Source

Built Distribution

speechbot-1.0.0-py3-none-any.whl (49.0 kB)

Uploaded Dec 24, 2025 Python 3

File details

Details for the file speechbot-1.0.0.tar.gz.

File metadata

Download URL: speechbot-1.0.0.tar.gz
Upload date: Dec 24, 2025
Size: 44.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for speechbot-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`0c53e91a1d0ac40f3c4fb81bf0ab47253f1d20b6fd2a7d560908dde3995d492d`
MD5	`b47f768a37ab5852521dfdde8774e277`
BLAKE2b-256	`c4c03ab499c78e4da3f0b5d48933802ebee647030438970a713b1367a72fac5b`

File details

Details for the file speechbot-1.0.0-py3-none-any.whl.

File metadata

Download URL: speechbot-1.0.0-py3-none-any.whl
Upload date: Dec 24, 2025
Size: 49.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for speechbot-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94a27f6b90628fc7efbe6535e3eca53c97b1e011f023c554c3742bd798d83abb`
MD5	`954aff6b9df65af26244ac6d99d0135d`
BLAKE2b-256	`be1bbd200617c2a68308b803bfd089f67cca50e4ac0b0f8b226f297b9cc0fd34`
