fix: agent capability matching in dispatch — only agent: labels are requirements

Previous bug: only code:* and review labels were checked, so agent:document,
agent:tests etc. were never filtered. Any agent could pick up any task.

Now: labels with agent: prefix are matched against agent capabilities.
Other labels are treated as metadata. Includes regression test.
Author: Zer4tul
Date: 2026-05-12 23:51:08 +08:00
Commit: a18cb2824e (parent: 1f351a1734)
6 changed files with 1271 additions and 8 deletions

README.md (new file, +261 lines)
# Agent Fleet Platform
Agent Fleet is a multi-agent orchestration system built with Rust, designed to coordinate AI agents for task execution across distributed environments. It integrates with [Forgejo](https://forgejo.org/) for task management and supports dual execution modes (SSH/CLI and HTTP pull).
## Overview
Agent Fleet acts as the central orchestrator that:
- Receives tasks from Forgejo Issues via webhooks
- Dispatches tasks to agents based on capabilities and load
- Tracks task lifecycle through a state machine
- Validates receipts and artifacts (e.g., PRs)
- Manages agent heartbeats and health
### Key Features
- **Dual Execution Modes**: `ssh_cli` (orchestrator-initiated) and `http_pull` (agent-initiated)
- **Event-Sourced State**: All task state transitions are recorded as events
- **Capability-Based Dispatch**: Tasks are routed to agents based on label matching
- **Auto-Retry**: Failed tasks can be retried up to `max_retries` times
- **Timeout Enforcement**: Tasks are marked `failed` if they exceed `task_timeout_secs`
- **Forgejo Integration**: Automatic task creation from labeled issues, PR lifecycle tracking
## Architecture
```
┌─────────────┐                    ┌─────────────────┐
│   Forgejo   │◄──webhook─────────┤   Agent Fleet   │
│  (Issues)   │                    │  Orchestrator   │
└─────────────┘                    └───────┬─────────┘
                                           │
        ┌──────────────────────────────────┼─────────────────────────┐
        │                                  │                         │
        ▼                                  ▼                         ▼
┌───────────────┐                 ┌───────────────┐         ┌───────────────┐
│ ssh_cli Hosts │                 │   http_pull   │         │  Dispatcher   │
│  (SSH/Local)  │                 │    Agents     │         │     Loop      │
└───────────────┘                 └───────────────┘         └───────────────┘
        │                                  │
        ▼                                  ▼
┌───────────────┐                 ┌───────────────┐
│  Agent CLIs   │                 │  Event Store  │
│ (codex, etc)  │                 │   (SQLite)    │
└───────────────┘                 └───────────────┘
```
### Components
- **Event Store** (`src/core/event_store.rs`): SQLite-backed persistent event store
- **State Machine** (`src/core/state_machine.rs`): Validates and executes state transitions
- **Task Queue** (`src/core/task_queue.rs`): HTTP pull task queue with capability matching
- **Dispatcher** (`src/dispatch.rs`): Periodic dispatch loop for `ssh_cli` tasks
- **SshExecutor** (`src/execution/mod.rs`): Executes agent CLIs via SSH or local subprocess
- **Forgejo Client** (`src/integrations/forgejo.rs`): Forgejo API integration and webhook handling
- **API Handlers** (`src/api.rs`): REST API for agents and task management
## Quick Start
### Prerequisites
- Rust 2024 edition
- cargo-zigbuild (for cross-compilation)
- Forgejo instance (or compatible forge)
### Development Setup
```bash
# Clone the repository
git clone https://git.0x08.org/zer4tul/agent-fleet.git
cd agent-fleet
# Copy example config
cp config.example.toml config.toml
# Edit config.toml with your settings
# - Forgejo URL and token
# - Webhook secret
# - Host configurations for ssh_cli mode
```
### Local Development
```bash
# Run tests
cargo test
# Run the server
cargo run
# Or with custom bind/port
cargo run -- --bind 127.0.0.1 --port 9090
```
### Building for aarch64
```bash
# Install cargo-zigbuild if not already installed
cargo install cargo-zigbuild
# Cross-compile for aarch64-unknown-linux-gnu
cargo zigbuild --target aarch64-unknown-linux-gnu --release
# Binary will be at: target/aarch64-unknown-linux-gnu/release/agent-fleet
```
## Configuration
Configuration is done via TOML file. See `config.example.toml` for a complete example.
### Server Settings
```toml
[server]
bind = "0.0.0.0" # Listen address
port = 9090 # HTTP port
```
### Forgejo Integration
```toml
[forgejo]
url = "https://git.0x08.org"
token = "your-api-token" # Forgejo API token
webhook_secret = "your-webhook-secret" # Shared secret for webhook validation
```
### Orchestrator Settings
```toml
[orchestrator]
db_path = "data/agent-fleet.db" # SQLite database path
heartbeat_interval_secs = 60 # Agent heartbeat interval
heartbeat_timeout_threshold = 3 # Missed heartbeats before offline
task_timeout_secs = 1800 # Default task timeout (30 min)
default_max_retries = 2 # Max retry attempts
dispatch_interval_secs = 10 # Dispatch loop interval
# http_pull_token = "optional-bearer-token" # Auth for http_pull agents
```
### SSH CLI Hosts
Configure remote hosts for `ssh_cli` execution:
```toml
[[hosts]]
host_id = "host-worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/deploy/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
{ agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust", "code:python"] },
{ agent_type = "claude-code", max_concurrency = 1, capabilities = ["code:rust"] },
]
# For local execution (same machine as orchestrator)
[[hosts]]
host_id = "local"
hostname = "localhost"
ssh_user = "runner"
work_dir = "/tmp/agent-workspace"
agents = [
{ agent_type = "codex-cli", max_concurrency = 1, capabilities = ["code:rust"] },
]
```
## API Summary
Agent Fleet exposes a REST API for agent registration, task management, and webhooks.
### Agent Endpoints
| Endpoint | Method | Description |
|----------|---------|-------------|
| `/api/v1/agents/register` | POST | Register or update an agent |
| `/api/v1/agents/heartbeat` | POST | Update agent heartbeat |
| `/api/v1/agents/deregister` | POST | Deregister an agent |
| `/api/v1/agents` | GET | List agents with filters |
### Task Endpoints
| Endpoint | Method | Description |
|----------|---------|-------------|
| `/api/v1/tasks` | GET | List tasks |
| `/api/v1/tasks/{task_id}` | GET | Get task details |
| `/api/v1/tasks/dequeue` | POST | Dequeue task (http_pull only) |
| `/api/v1/tasks/{task_id}/status` | POST | Update task status (http_pull only) |
| `/api/v1/tasks/{task_id}/complete` | POST | Complete task with receipt |
| `/api/v1/tasks/{task_id}/retry` | POST | Retry failed task |
### Other Endpoints
| Endpoint | Method | Description |
|----------|---------|-------------|
| `/healthz` | GET | Health check |
| `/api/v1/webhooks/forgejo` | POST | Forgejo webhook handler |
| `/api/v1/receipts` | POST | Submit task receipt |
For detailed API documentation, see [docs/agent-api-reference.md](docs/agent-api-reference.md).
## Deployment
See [docs/deployment.md](docs/deployment.md) for detailed deployment instructions including:
- Cross-compilation with cargo-zigbuild
- Systemd service configuration
- Caddy reverse proxy setup
## Architecture Details
For in-depth architectural information, see [docs/architecture.md](docs/architecture.md) covering:
- Dual execution model comparison
- Dispatch loop internals
- Task lifecycle and state machine
- Forgejo integration flow
## Agent Integration
See [docs/agent-onboarding-guide.md](docs/agent-onboarding-guide.md) for:
- Choosing between `ssh_cli` and `http_pull` modes
- Agent registration and heartbeat
- Task dequeue and completion workflows
## Development
### Running Tests
```bash
cargo test
```
### Code Style
- Rust 2024 edition
- `thiserror` for error types
- `serde` for serialization
- All DB operations go through `EventStore`
- `Arc<Mutex<EventStore>>` for shared state
### Project Structure
```
src/
├── main.rs # Entry point, server setup
├── config.rs # TOML configuration
├── api.rs # HTTP API handlers
├── dispatch.rs # Task dispatch loop
├── execution/ # SSH execution
├── integrations/ # Forgejo client
├── adapters/ # Agent adapter interface
└── core/ # Business logic
├── models.rs # Data models
├── event_store.rs # Event sourcing
├── state_machine.rs # State transitions
├── task_queue.rs # HTTP pull queue
├── timeout.rs # Timeout checker
└── retry.rs # Retry policy
```
## License
MIT

@@ -23,10 +23,10 @@ Affected endpoints: `POST /api/v1/tasks/dequeue`, `POST /api/v1/tasks/{task_id}/
### Webhook HMAC-SHA256
The `POST /api/v1/webhooks/forgejo` endpoint requires an `X-Gitea-Signature` or `X-Forgejo-Signature` header containing `sha256=<hex_hmac>` of the request body using the configured `webhook_secret`.
```
X-Forgejo-Signature: sha256=abcdef...
```
---

@@ -298,12 +298,17 @@ curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/deregister \
```
created → assigned → running → review_pending → completed
   ↓         ↓          ↓            ↓              ↓
cancelled cancelled  failed       failed        cancelled
                   (retry) → assigned
```
**Notes:**
- `failed` and `agent_lost` tasks can be retried via `POST /api/v1/tasks/{task_id}/retry` (transitions to `assigned`)
- Retry is limited by `max_retries` (default: 2)
- `agent_lost` is set internally by the heartbeat checker when an agent times out
- `review_pending` can transition back to `assigned`, `running`, `failed`, or `completed`
---

docs/architecture.md (new file, +514 lines)

# Agent Fleet Architecture
This document describes the internal architecture of Agent Fleet, including the dual execution model, dispatch loop, task lifecycle, Forgejo integration flow, and state machine.
# Agent Fleet Architecture
This document describes the internal architecture of Agent Fleet, including the dual execution model, dispatch loop, task lifecycle, Forgejo integration flow, and state machine.
## System Overview
Agent Fleet is an orchestrator for coordinating AI agents across distributed environments. It acts as the central hub that receives tasks from Forgejo, dispatches them to agents, and tracks their completion.
### Core Components
```
┌──────────────────────────────────────────────────────────┐
│                 Agent Fleet Orchestrator                 │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   ┌──────────────┐        ┌──────────────┐               │
│   │  HTTP API    │◄───────┤    Axum      │               │
│   │  Handlers    │        │   Router     │               │
│   └──────┬───────┘        └──────────────┘               │
│          │                                               │
│   ┌──────▼───────────┐    ┌──────────────────┐           │
│   │  State Machine   │◄───┤   Event Store    │           │
│   │  (transitions)   │    │    (SQLite)      │           │
│   └──────┬───────────┘    └──────────────────┘           │
│          │                                               │
│   ┌──────▼───────────┐    ┌──────────────────┐           │
│   │   Task Queue     │    │     Timeout      │           │
│   │   (http_pull)    │    │     Checker      │           │
│   └──────┬───────────┘    └──────────────────┘           │
│          │                                               │
│   ┌──────▼───────────┐    ┌──────────────────┐           │
│   │    Dispatcher    │    │    Heartbeat     │           │
│   │       Loop       │    │     Checker      │           │
│   │    (ssh_cli)     │    └──────────────────┘           │
│   └──────┬───────────┘                                   │
│          │                                               │
└──────────┼───────────────────────────────────────────────┘
           │
     ┌─────┴────────────────────────────┐
     │                                  │
     ▼                                  ▼
┌────────────────┐            ┌────────────────┐
│    Forgejo     │            │     Agents     │
│   (webhooks)   │            │   (ssh_cli &   │
│                │            │   http_pull)   │
└────────────────┘            └────────────────┘
```
## Dual Execution Model
Agent Fleet supports two fundamentally different execution modes, each suited for different deployment scenarios.
### ssh_cli Mode
In `ssh_cli` mode, the orchestrator initiates task execution by:
1. SSH-ing into a configured host
2. Executing an agent CLI binary with a structured prompt
3. Parsing the JSON output to determine task result
**Flow:**
```
Dispatcher Loop
      │
      ▼
Select task (status=created, execution_mode=ssh_cli)
      │
      ▼
Select host (capability match + lowest load)
      │
      ▼
Transition: created → assigned → running
      │
      ▼
SshExecutor executes CLI
      ├─► Success → Parse receipt → completed/review_pending
      └─► Failure → failed
```
**Characteristics:**
- **Initiator**: Orchestrator
- **Communication**: SSH or local subprocess
- **Control Flow**: Orchestrator-managed
- **Best For**: CLI-based agents (codex-cli, claude-code)
**Implementation Details:**
- `src/dispatch.rs` - Periodic dispatch loop
- `src/execution/mod.rs` - SSH execution
- `src/adapters/mod.rs` - CLI adapter configuration
### http_pull Mode
In `http_pull` mode, agents independently poll the orchestrator for work using the HTTP API.
**Flow:**
```
Agent Loop
      │
      ▼
Heartbeat: POST /api/v1/agents/heartbeat
      │
      ▼
Dequeue: POST /api/v1/tasks/dequeue
      ├─► No task → Wait, retry
      └─► Got task
            │
            ▼
          Update status: running
            │
            ▼
          Execute task
            │
            ▼
          Complete: POST /api/v1/tasks/{task_id}/complete
            └─► Receipt validated → completed/failed
```
**Characteristics:**
- **Initiator**: Agent
- **Communication**: HTTP REST API
- **Control Flow**: Agent-managed
- **Best For**: Self-scheduled agents (OpenClaw, Hermes, custom bot frameworks)
**Implementation Details:**
- `src/core/task_queue.rs` - HTTP pull task queue
- `src/api.rs` - Dequeue and status update endpoints
### Mode Comparison
| Aspect | ssh_cli | http_pull |
|--------|----------|-----------|
| Who initiates? | Orchestrator | Agent |
| Communication | SSH/Subprocess | HTTP REST API |
| Configuration | `[[hosts]]` in config.toml | Agent-side registration |
| Network topology | Orchestrator reaches agents | Agents reach orchestrator |
| Firewalls | Requires outbound SSH from orchestrator | Requires inbound HTTP to orchestrator |
| Latency | Bounded by dispatch-loop interval | Bounded by agent poll interval |
| Failure detection | Process exit code | Heartbeat timeout |
## Dispatch Loop
The dispatch loop (`src/dispatch.rs`) runs on a configurable interval (default: 10 seconds) and is responsible for assigning `ssh_cli` tasks to available hosts.
### Algorithm
```
1. Fetch all tasks where:
     - status = 'created'
     - execution_mode = 'ssh_cli'
2. For each task:
     a. Find available hosts whose agents match the task's agent:* labels
     b. Filter hosts where current_load < max_concurrency
     c. Sort by current_load (lowest first)
     d. Select first host (if any)
3. If host selected:
     a. Transition: created → assigned
     b. Transition: assigned → running
     c. Execute via SshExecutor
     d. Parse receipt and transition to final state
4. Process review_pending tasks:
     a. If review_count > max_retries, transition to failed
```
### Host Selection
Host selection uses a combination of:
1. **Capability matching**: Each `agent:`-prefixed task label must be matched by an agent capability; other labels (e.g. `priority:*`) are metadata and never filter hosts
2. **Load balancing**: Choose host with lowest current task count
3. **Concurrency limits**: Respect `max_concurrency` per agent type
```rust
// From src/dispatch.rs:select_host
for host in &config.hosts {
    for agent in &host.agents {
        // Check capability match: only `agent:`-prefixed labels are
        // requirements; other labels (priority:*, status:*, ...) are metadata
        let supports_caps = task.labels.iter().all(|label| {
            if let Some(cap) = label.strip_prefix("agent:") {
                agent.capabilities.iter().any(|agent_cap| agent_cap == cap || agent_cap == label)
            } else {
                true
            }
        });
        // Check concurrency
        let current = *load.get(&(host.host_id, agent.agent_type)).unwrap_or(&0);
        if supports_caps && current < agent.max_concurrency {
            candidates.push((host, agent, current));
        }
    }
}
// Sort by load and pick first
candidates.sort_by_key(|(_, _, current)| *current);
```
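The fixed predicate can be exercised standalone. Below is a minimal runnable sketch: the `Agent` struct and `supports` helper are simplified stand-ins for the real config types, not the actual API.

```rust
// Simplified stand-in for the [[hosts]] agent entries in config.toml.
struct Agent {
    capabilities: Vec<String>,
}

// Hypothetical helper showing the fixed rule: only `agent:`-prefixed labels
// are requirements; everything else (priority:*, status:*, ...) is metadata.
fn supports(labels: &[String], agent: &Agent) -> bool {
    labels.iter().all(|label| match label.strip_prefix("agent:") {
        Some(cap) => agent.capabilities.iter().any(|c| c == cap || c == label),
        None => true,
    })
}

fn main() {
    let coder = Agent { capabilities: vec!["code".into()] };
    // Metadata labels never filter; the agent: label must match a capability.
    assert!(supports(&["agent:code".into(), "priority:high".into()], &coder));
    assert!(!supports(&["agent:document".into()], &coder)); // filtered out
}
```

Under the old rule, `agent:document` would have passed for any agent; under the fixed rule it only passes when a matching capability is configured.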
## Task Lifecycle
A task progresses through a finite state machine from creation to completion or failure.
### State Machine
```
  ┌─────────┐
  │ Created │
  └────┬────┘
       │ (dispatch)
       ▼
┌──────────────┐
│   Assigned   │
└──────┬───────┘
       │ (start)
       ▼
┌──────────────┐
│   Running    │
└──────┬───────┘
   ┌───┴─────────────┐
   │                 │
(partial)        (complete)
   │                 │
   ▼                 ▼
┌───────────────┐ ┌───────────┐
│ Review Pending│ │ Completed │
└──────┬────────┘ └───────────┘
       │ (review loop)
       ▼
┌──────────────┐
│    Failed    │
└──────┬───────┘
       │ (retry)
       ▼
┌──────────────┐
│   Assigned   │
└──────────────┘
```
### State Definitions
| State | Description | Triggers |
|-------|-------------|-----------|
| `created` | Initial state, task exists but not assigned | Forgejo webhook, manual insertion |
| `assigned` | Task has been assigned to an agent/host | Dispatch loop, dequeue |
| `running` | Agent is actively working on the task | Agent start, SSH execution start |
| `review_pending` | Agent completed work, awaiting review/approval | Partial receipt, PR opened |
| `completed` | Task successfully finished | Full receipt, PR merged |
| `failed` | Task could not be completed | Error, timeout, retry limit |
| `agent_lost` | Agent stopped responding during task | Heartbeat timeout |
| `cancelled` | Task was cancelled | Manual cancellation |
### Valid Transitions
Implemented in `src/core/state_machine.rs:validate_transition()`:
```rust
Created → Assigned, Cancelled
Assigned → Running, Cancelled
Running → ReviewPending, Completed, Failed, AgentLost, Cancelled
ReviewPending → Assigned, Running, Completed, Failed, Cancelled
Failed → Assigned, Cancelled
AgentLost → Assigned, Cancelled
Completed → (terminal)
Cancelled → (terminal)
```
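The table above can be encoded as a single `matches!` over `(from, to)` pairs. A runnable sketch, assuming simplified enum and function names (the real signatures in `state_machine.rs` may differ):

```rust
// Runnable sketch of validate_transition's table; names follow the doc,
// but the real state_machine.rs API may differ.
#[derive(Clone, Copy)]
enum TaskState {
    Created,
    Assigned,
    Running,
    ReviewPending,
    Completed,
    Failed,
    AgentLost,
    Cancelled,
}

fn is_valid_transition(from: TaskState, to: TaskState) -> bool {
    use TaskState::*;
    matches!(
        (from, to),
        (Created, Assigned | Cancelled)
            | (Assigned, Running | Cancelled)
            | (Running, ReviewPending | Completed | Failed | AgentLost | Cancelled)
            | (ReviewPending, Assigned | Running | Completed | Failed | Cancelled)
            | (Failed, Assigned | Cancelled)
            | (AgentLost, Assigned | Cancelled)
        // Completed and Cancelled are terminal: no outgoing transitions.
    )
}

fn main() {
    assert!(is_valid_transition(TaskState::Failed, TaskState::Assigned)); // retry path
    assert!(!is_valid_transition(TaskState::Completed, TaskState::Assigned)); // terminal
}
```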
### Timeout and Retry
- **Task Timeout**: The `TimeoutChecker` (`src/core/timeout.rs`) monitors tasks and marks them `failed` if they exceed `task_timeout_secs`
- **Agent Timeout**: The `HeartbeatChecker` (`src/api.rs`) monitors agent heartbeats and marks them `offline` if `heartbeat_interval_secs * heartbeat_timeout_threshold` elapses without a heartbeat
- **Auto-Retry**: Tasks in `failed` or `agent_lost` can be retried via API, transitioning back to `assigned`
## Forgejo Integration
The Forgejo integration handles the bi-directional flow between Agent Fleet and Forgejo.
### Webhook Events
Forgejo sends webhook events to `POST /api/v1/webhooks/forgejo`:
| Event Type | Source | Action |
|------------|---------|---------|
| `issues` (opened) | Issue created with `agent:*` label | Create new task |
| `pull_request` (opened) | PR created on `task/*` branch | Mark task `review_pending` |
| `pull_request` (closed, merged=true) | PR merged | Mark task `completed` |
| `push` | Commit to `task/*` branch | Update `last_activity_at` |
### Task Creation Flow
```
Forgejo Issue (with agent:code label)
POST /api/v1/webhooks/forgejo
Verify HMAC signature
Parse event
Extract:
- task_type from agent:* label
- priority from priority:* label
- requirements from issue title + body
Create task with:
- task_id = "org/repo#42"
- source = "forgejo:org/repo#42"
- execution_mode = "ssh_cli"
- branch_name = "task/org%2Frepo%2342"
- pr_title = "feat: Title (#42)"
StateMachine: create_task()
Store in EventStore
```
### PR Lifecycle Flow
```
PR opened on task/* branch
POST /api/v1/webhooks/forgejo (pull_request event)
Extract task_id from branch name
StateMachine: transition(Running → ReviewPending)
Update Forgejo issue label to "status:doing"
```
```
PR merged
POST /api/v1/webhooks/forgejo (pull_request event, merged=true)
Extract task_id from branch name
StateMachine: transition(Running/ReviewPending → Completed)
Update Forgejo issue label to "status:done"
Auto-generate receipt comment
```
### Branch Naming Convention
Task branches follow the pattern:
```
task/{url_encoded_task_id}
```
Where `task_id` is `org/repo#42`, the branch becomes:
```
task/org%2Frepo%2342
```
This encoding allows the branch name to be safely used in Git while preserving the original task identifier.
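A minimal sketch of this encoding; the `branch_name` helper is hypothetical, and since task IDs have the fixed form `org/repo#42`, only `/` and `#` need escaping:

```rust
// Hypothetical helper mirroring the convention above; the real encoding
// function in the codebase may differ.
fn branch_name(task_id: &str) -> String {
    // Percent-encode the two characters that are unsafe in Git branch names.
    let encoded = task_id.replace('/', "%2F").replace('#', "%23");
    format!("task/{encoded}")
}

fn main() {
    assert_eq!(branch_name("org/repo#42"), "task/org%2Frepo%2342");
}
```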
### Label Conventions
| Label Prefix | Usage | Example |
|--------------|---------|----------|
| `agent:` | Task type | `agent:code`, `agent:review` |
| `code:` | Code capability requirement | `code:rust`, `code:python` |
| `priority:` | Priority level | `priority:urgent`, `priority:high`, `priority:low` |
| `status:` | Current status (managed by system) | `status:todo`, `status:doing`, `status:done` |
### Signature Verification
Webhook signatures are verified using HMAC-SHA256:
```rust
// From src/integrations/forgejo.rs
pub fn verify_webhook_signature(secret: &str, body: &[u8], signature: &str) -> Result<(), ForgejoError> {
let provided = signature.trim();
let provided = provided.strip_prefix("sha256=").unwrap_or(provided);
let mut mac = HmacSha256::new_from_slice(secret.as_bytes())?;
mac.update(body);
let expected = hex::encode(mac.finalize().into_bytes());
if expected == provided { Ok(()) } else { Err(ForgejoError::InvalidSignature) }
}
```
## Event Sourcing
Agent Fleet uses an event-sourced architecture for task state management.
### Event Store
The `EventStore` (`src/core/event_store.rs`) provides:
- Persistent storage in SQLite
- Event journaling for all state changes
- Task snapshot projection
- Agent registry
### Event Schema
Each state transition creates a `TaskEvent`:
```rust
pub struct TaskEvent {
pub event_id: String,
pub task_id: String,
pub event_type: String, // e.g., "task.created", "task.assigned"
pub agent_id: Option<String>,
pub timestamp: DateTime<Utc>,
pub payload: serde_json::Value,
}
```
### Benefits
1. **Audit Trail**: Complete history of all state changes
2. **Reproducibility**: Can reconstruct task state from events
3. **Debugging**: Timeline of all transitions for troubleshooting
4. **Integrations**: Event notifications can be sent to external systems
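Reproducibility in practice means folding the ordered event log into a snapshot. A minimal sketch of such a projection (event-type names beyond `task.created`/`task.assigned` are assumed for illustration; the real EventStore projection may differ):

```rust
// Fold an ordered event log into the task's current status.
fn project_status(event_types: &[&str]) -> &'static str {
    let mut status = "created";
    for ev in event_types {
        status = match *ev {
            "task.created" => "created",
            "task.assigned" => "assigned",
            "task.started" => "running",
            "task.review_pending" => "review_pending",
            "task.completed" => "completed",
            "task.failed" => "failed",
            _ => status, // unknown event types leave the projection unchanged
        };
    }
    status
}

fn main() {
    let log = ["task.created", "task.assigned", "task.started", "task.completed"];
    assert_eq!(project_status(&log), "completed");
}
```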
## Background Services
Agent Fleet runs several background services concurrently:
| Service | File | Function | Interval |
|----------|-------|-----------|-----------|
| Dispatcher Loop | `src/dispatch.rs` | Assign ssh_cli tasks to hosts | `dispatch_interval_secs` |
| Timeout Checker | `src/core/timeout.rs` | Detect task timeouts | 30 seconds |
| Heartbeat Checker | `src/api.rs` | Detect offline agents | `heartbeat_interval_secs` |
All background services run as Tokio tasks in `src/main.rs`:
```rust
#[tokio::main]
async fn main() {
// ... setup ...
// Spawn background services
tokio::spawn(async move { timeout_checker.run().await });
tokio::spawn(async move { heartbeat_checker.run().await });
tokio::spawn(async move { dispatcher.run().await });
// Start HTTP server
axum::serve(listener, app).await?;
}
```
## Error Handling
Agent Fleet uses structured error handling throughout:
```rust
#[derive(Debug, thiserror::Error)]
pub enum ApiError {
#[error("database error: {0}")]
Database(#[from] rusqlite::Error),
#[error("not found: {0}")]
NotFound(String),
#[error("bad request: {0}")]
BadRequest(String),
#[error("unauthorized: {0}")]
Unauthorized(String),
#[error("forgejo error: {0}")]
Forgejo(#[from] ForgejoError),
}
```
All errors are converted to appropriate HTTP status codes:
- `400 Bad Request` - Invalid input, bad state transition
- `401 Unauthorized` - Missing/invalid auth token or webhook signature
- `404 Not Found` - Task or agent not found
- `500 Internal Server Error` - Database errors, unexpected failures
## Concurrency Model
Agent Fleet uses `Arc<Mutex<T>>` for shared state:
```rust
// Shared event store across all handlers and background services
let store = Arc::new(Mutex::new(event_store));
// State machine gets a reference
let state_machine = Arc::new(StateMachine::new(store.clone()));
// Background services take clones
tokio::spawn(async move { dispatcher.run().await });
```
This ensures:
- Thread-safe access to shared state
- Background services don't block API handlers
- All state transitions are serialized through the mutex

docs/deployment.md (new file, +465 lines)

# Agent Fleet Deployment Guide
This guide covers deploying Agent Fleet Orchestrator to production, including cross-compilation, systemd service setup, and reverse proxy configuration with Caddy.
# Agent Fleet Deployment Guide
This guide covers deploying Agent Fleet Orchestrator to production, including cross-compilation, systemd service setup, and reverse proxy configuration with Caddy.
## Prerequisites
- Development machine with Rust and cargo
- Target server (e.g., aarch64 Linux)
- cargo-zigbuild for cross-compilation
- Caddy web server (optional, for reverse proxy)
## Building with cargo-zigbuild
cargo-zigbuild enables cross-compilation by using Zig as the C toolchain, avoiding the need for target-specific cross-compilation toolchains.
### Installing cargo-zigbuild
```bash
cargo install cargo-zigbuild
```
### Cross-Compiling for aarch64-unknown-linux-gnu
```bash
# Build the release binary
cargo zigbuild --target aarch64-unknown-linux-gnu --release
# The binary will be at:
# target/aarch64-unknown-linux-gnu/release/agent-fleet
```
### Other Target Architectures
```bash
# For x86_64 Linux (standard)
cargo build --release
# For aarch64 (ARM64 servers)
cargo zigbuild --target aarch64-unknown-linux-gnu --release
# For musl (static linking)
cargo zigbuild --target x86_64-unknown-linux-musl --release
```
### Building for Local Testing
For local development on the same architecture:
```bash
cargo build --release
```
## Deployment to aarch64
### 1. Transfer the Binary
After building, transfer the binary to your target server:
```bash
# Using scp
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/opt/agent-fleet/
# Or using rsync
rsync -avz target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/opt/agent-fleet/
```
### 2. Set Up Directory Structure
```bash
ssh user@target-host
# Create directory and user
sudo useradd -r -s /bin/false agent-fleet
sudo mkdir -p /opt/agent-fleet/{bin,data,config}
sudo chown -R agent-fleet:agent-fleet /opt/agent-fleet
# Copy binary
sudo cp /path/to/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
```
### 3. Create Configuration File
Create `/opt/agent-fleet/config/config.toml`:
```toml
[server]
bind = "127.0.0.1"
port = 9090
[forgejo]
url = "https://git.0x08.org"
token = "your-forgejo-api-token"
webhook_secret = "your-webhook-secret"
[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10
# Configure remote hosts for ssh_cli execution
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "agent"
ssh_port = 22
ssh_key_path = "/home/agent/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
{ agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```
### 4. Environment Variables (Not Currently Supported)
Referencing environment variables from `config.toml` (e.g. a `.env.local` file holding the Forgejo token and webhook secret) is not currently supported. Put sensitive values directly in the config file and restrict its permissions so only the `agent-fleet` user can read it.
## Systemd Service
### Create Systemd Service File
Create `/etc/systemd/system/agent-fleet.service`:
```ini
[Unit]
Description=Agent Fleet Orchestrator
After=network.target
Documentation=https://git.0x08.org/zer4tul/agent-fleet
[Service]
Type=simple
User=agent-fleet
Group=agent-fleet
WorkingDirectory=/opt/agent-fleet
ExecStart=/opt/agent-fleet/bin/agent-fleet --config /opt/agent-fleet/config/config.toml
Restart=always
RestartSec=10
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/agent-fleet/data
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=agent-fleet
[Install]
WantedBy=multi-user.target
```
### Enable and Start the Service
```bash
# Reload systemd
sudo systemctl daemon-reload
# Enable to start on boot
sudo systemctl enable agent-fleet
# Start the service
sudo systemctl start agent-fleet
# Check status
sudo systemctl status agent-fleet
# View logs
sudo journalctl -u agent-fleet -f
```
### Management Commands
```bash
# Restart
sudo systemctl restart agent-fleet
# Stop
sudo systemctl stop agent-fleet
# Disable
sudo systemctl disable agent-fleet
```
## Caddy Reverse Proxy
Using Caddy as a reverse proxy provides:
- Automatic HTTPS with Let's Encrypt
- Path-based routing
- Basic auth (optional)
- Request logging
### Install Caddy
```bash
# Ubuntu/Debian
sudo apt install caddy
# Or using the official package
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
```
### Configure Caddy
Create or edit `/etc/caddy/Caddyfile`:
```caddyfile
your-domain.example.com {
    reverse_proxy 127.0.0.1:9090

    # Optional: Basic authentication
    # basicauth {
    #     admin $2a$14$Zkx19YLhRnJ8O6l0ZPd.OqG9vXK4wQ6Y5wZQH5Y5x5x5x5x5x5x
    # }

    # Optional: Log requests
    log {
        output file /var/log/caddy/agent-fleet.log
        format json
    }
}

# Or under a path prefix (a site address with a path matches only that exact
# path, so use handle_path to match and strip the /agent-fleet prefix)
your-domain.example.com {
    handle_path /agent-fleet/* {
        reverse_proxy 127.0.0.1:9090
    }
}
```
### Test and Reload Caddy
```bash
# Test configuration
sudo caddy validate --config /etc/caddy/Caddyfile
# Reload Caddy
sudo systemctl reload caddy
# Or restart
sudo systemctl restart caddy
```
### Verify HTTPS
After Caddy starts, it will automatically provision an SSL certificate from Let's Encrypt. Verify:
```bash
curl https://your-domain.example.com/healthz
```
## Configuration Walkthrough
### Server Configuration
```toml
[server]
bind = "127.0.0.1" # Bind to localhost (Caddy handles external traffic)
port = 9090 # Internal port
```
**Notes:**
- Use `127.0.0.1` when behind a reverse proxy
- Use `0.0.0.0` if direct access is needed
### Forgejo Configuration
```toml
[forgejo]
url = "https://git.0x08.org"
token = "your-api-token"
webhook_secret = "your-webhook-secret"
```
**Setup Steps:**
1. Generate a Forgejo API token: User Settings → Applications → Generate New Token
2. Configure webhook in Forgejo repo settings:
- URL: `https://your-domain.com/api/v1/webhooks/forgejo`
- Secret: same as `webhook_secret`
- Events: Issues, Pull Requests, Push
### Orchestrator Configuration
```toml
[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10
```
**Explanation:**
- `heartbeat_interval_secs`: How often agents should send heartbeats
- `heartbeat_timeout_threshold`: How many missed heartbeats before marking agent offline (3 × 60 = 180 seconds)
- `task_timeout_secs`: Default timeout for tasks (1800 seconds = 30 minutes)
- `default_max_retries`: How many times to retry failed tasks
- `dispatch_interval_secs`: How often the dispatch loop checks for new `ssh_cli` tasks
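The offline threshold is simply the product of the two heartbeat settings; as a sketch:

```rust
// Sketch of the offline-detection arithmetic implied by the settings above.
fn offline_after_secs(heartbeat_interval_secs: u64, heartbeat_timeout_threshold: u64) -> u64 {
    heartbeat_interval_secs * heartbeat_timeout_threshold
}

fn main() {
    // Defaults from the config above: 3 missed 60-second heartbeats.
    assert_eq!(offline_after_secs(60, 3), 180);
}
```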
### Host Configuration for SSH CLI
```toml
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/agent-fleet/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
{ agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```
**SSH Key Setup:**
1. Generate SSH key on the orchestrator server:
```bash
sudo -u agent-fleet ssh-keygen -t ed25519 -f /home/agent-fleet/.ssh/id_ed25519
```
2. Add public key to remote host's `~/.ssh/authorized_keys`
3. Test SSH connection:
```bash
sudo -u agent-fleet ssh -p 22 deploy@192.168.1.100
```
**Agent CLI Setup on Remote Host:**
1. Ensure agent CLI is in `$PATH`
2. Verify: `ssh deploy@host "which codex"`
## Troubleshooting
### Service Won't Start
```bash
# Check service status
sudo systemctl status agent-fleet
# View logs
sudo journalctl -u agent-fleet -n 100
# Check file permissions
ls -la /opt/agent-fleet/
```
### Webhook Not Received
```bash
# Check Caddy logs
sudo journalctl -u caddy -f
# Check agent-fleet logs for webhook errors
sudo journalctl -u agent-fleet | grep webhook
# Verify webhook secret matches
# The secret in config.toml must match Forgejo webhook secret
```
### SSH Connection Fails
```bash
# Test SSH as the agent-fleet user
sudo -u agent-fleet ssh -v deploy@host
# Check SSH key path exists
sudo -u agent-fleet ls -la /home/agent-fleet/.ssh/
# Verify key permissions
sudo -u agent-fleet chmod 600 /home/agent-fleet/.ssh/id_ed25519
```
### Database Lock Issues
If the database becomes locked (e.g., after crash):
```bash
# Stop the service
sudo systemctl stop agent-fleet
# Backup and remove old database
mv /opt/agent-fleet/data/agent-fleet.db /opt/agent-fleet/data/agent-fleet.db.backup
# Restart (will create new database)
sudo systemctl start agent-fleet
```
## Monitoring
### Health Check
```bash
curl http://localhost:9090/healthz
# Expected output: "ok"
```
### Check Task Queue
```bash
curl http://localhost:9090/api/v1/tasks?status=running
```
### Check Agents
```bash
curl http://localhost:9090/api/v1/agents?status=online
```
### Log Monitoring
```bash
# Follow logs
sudo journalctl -u agent-fleet -f
# Search for errors
sudo journalctl -u agent-fleet | grep -i error
```
## Updates
### Updating the Binary
```bash
# Build new version locally
cargo zigbuild --target aarch64-unknown-linux-gnu --release
# Transfer to server
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/tmp/
# On the server, stop and replace
sudo systemctl stop agent-fleet
sudo cp /tmp/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
sudo systemctl start agent-fleet
```
### Zero-Downtime Updates
For production deployments with minimal downtime:
```bash
# Upload new binary alongside old one
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/opt/agent-fleet/bin/agent-fleet.new
# On server
sudo mv /opt/agent-fleet/bin/agent-fleet.new /opt/agent-fleet/bin/agent-fleet
sudo systemctl restart agent-fleet
```
Systemd restarts the process quickly, but there is still a brief gap while the binary swaps; for truly zero downtime you would need a second instance behind the reverse proxy.


@@ -92,8 +92,11 @@ impl Dispatcher
 for host in &self.config.hosts {
     for agent in &host.agents {
         let supports_caps = task.labels.iter().all(|label| {
-            !label.starts_with("code:") && !label.starts_with("review")
-                || agent.capabilities.iter().any(|cap| cap == label)
+            if let Some(cap) = label.strip_prefix("agent:") {
+                agent.capabilities.iter().any(|agent_cap| agent_cap == cap || agent_cap == label)
+            } else {
+                true
+            }
         });
         if !supports_caps {
             continue;
@@ -211,4 +214,19 @@ mod tests {
         let selected = dispatcher.select_host(&sample_task()).await.unwrap().unwrap();
         assert_eq!(selected.0.host_id, "h2");
     }
+
+    #[tokio::test]
+    async fn does_not_match_agent_label_without_capability() {
+        let dir = TempDir::new().unwrap();
+        let db = dir.path().join("test.db");
+        let store = Arc::new(Mutex::new(EventStore::open(&db).unwrap()));
+        let sm = Arc::new(StateMachine::new(store.clone()));
+        let dispatcher = Dispatcher::new(config(), store, sm);
+        let mut task = sample_task();
+        task.labels = vec!["agent:document".into(), "priority:urgent".into()];
+        let selected = dispatcher.select_host(&task).await.unwrap();
+        assert!(selected.is_none());
+    }
 }