# Agent Fleet Platform
Agent Fleet is a multi-agent orchestration system written in Rust. It coordinates AI agents that execute tasks across distributed hosts, integrates with [Forgejo](https://forgejo.org/) for task management, and supports two execution modes: `ssh_cli` (SSH or local CLI) and `http_pull` (HTTP pull).
## Overview
Agent Fleet acts as the central orchestrator that:
- Receives tasks from Forgejo Issues via webhooks
- Dispatches tasks to agents based on capabilities and load
- Tracks task lifecycle through a state machine
- Validates receipts and artifacts (e.g., PRs)
- Manages agent heartbeats and health
### Key Features
- **Dual Execution Modes**: `ssh_cli` (orchestrator-initiated) and `http_pull` (agent-initiated)
- **Event-Sourced State**: All task state transitions are recorded as events
- **Capability-Based Dispatch**: Tasks are routed to agents based on label matching
- **Auto-Retry**: Failed tasks can be retried up to `max_retries` times (see `default_max_retries` under Orchestrator Settings)
- **Timeout Enforcement**: Tasks are marked `failed` if they exceed `task_timeout_secs`
- **Forgejo Integration**: Automatic task creation from labeled issues, PR lifecycle tracking
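Capability-based dispatch can be sketched as a small label filter. This is a minimal sketch, not the project's actual dispatch code: it assumes that only labels with an `agent:` prefix are requirements and that all other labels are metadata. The function names and the `agent:document`, `agent:tests`, and `priority:high` labels are hypothetical.

```rust
use std::collections::HashSet;

/// Prefix that marks a label as a hard requirement (assumption: every
/// other label is metadata and never blocks dispatch).
const REQUIREMENT_PREFIX: &str = "agent:";

/// Returns only the labels that act as requirements.
fn required_labels<'a>(labels: &[&'a str]) -> Vec<&'a str> {
    labels
        .iter()
        .copied()
        .filter(|label| label.starts_with(REQUIREMENT_PREFIX))
        .collect()
}

/// True when every requirement label is present in the agent's capabilities.
/// A task with no requirement labels can be taken by any agent.
fn agent_can_take(labels: &[&str], capabilities: &HashSet<&str>) -> bool {
    required_labels(labels)
        .into_iter()
        .all(|label| capabilities.contains(label))
}
```

With capabilities `agent:document` and `agent:tests`, a task labeled `agent:document` and `priority:high` matches, while a task labeled `agent:review` does not.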
## Architecture
```
┌─────────────┐                  ┌─────────────────┐
│   Forgejo   │◄──webhook────────┤   Agent Fleet   │
│  (Issues)   │                  │  Orchestrator   │
└─────────────┘                  └───────┬─────────┘
              ┌──────────────────────────┼──────────────────────────┐
              │                          │                          │
              ▼                          ▼                          ▼
      ┌───────────────┐          ┌───────────────┐          ┌───────────────┐
      │ ssh_cli Hosts │          │   http_pull   │          │  Dispatcher   │
      │  (SSH/Local)  │          │    Agents     │          │     Loop      │
      └───────────────┘          └───────────────┘          └───────────────┘
              │                                                     │
              ▼                                                     ▼
      ┌───────────────┐                                     ┌───────────────┐
      │  Agent CLIs   │                                     │  Event Store  │
      │ (codex, etc)  │                                     │   (SQLite)    │
      └───────────────┘                                     └───────────────┘
```
### Components
- **Event Store** (`src/core/event_store.rs`): SQLite-backed persistent event store
- **State Machine** (`src/core/state_machine.rs`): Validates and executes state transitions
- **Task Queue** (`src/core/task_queue.rs`): HTTP pull task queue with capability matching
- **Dispatcher** (`src/dispatch.rs`): Periodic dispatch loop for `ssh_cli` tasks
- **SshExecutor** (`src/execution/mod.rs`): Executes agent CLIs via SSH or local subprocess
- **Forgejo Client** (`src/integrations/forgejo.rs`): Forgejo API integration and webhook handling
- **API Handlers** (`src/api.rs`): REST API for agents and task management
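The interplay between the event store and the state machine can be illustrated by replaying events to derive a task's current state. The sketch below is illustrative only: the state and event names are hypothetical and are not taken from `src/core/state_machine.rs`, and an in-memory slice stands in for the SQLite-backed store.

```rust
/// Hypothetical task states; the real state machine may define others.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskState {
    Queued,
    Running,
    Completed,
    Failed,
}

/// Hypothetical events recorded for a task.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskEvent {
    Dispatched,
    Completed,
    Failed,
    Retried,
}

/// Validates one transition; any pair not listed is rejected.
fn apply(state: TaskState, event: TaskEvent) -> Result<TaskState, String> {
    match (state, event) {
        (TaskState::Queued, TaskEvent::Dispatched) => Ok(TaskState::Running),
        (TaskState::Running, TaskEvent::Completed) => Ok(TaskState::Completed),
        (TaskState::Running, TaskEvent::Failed) => Ok(TaskState::Failed),
        (TaskState::Failed, TaskEvent::Retried) => Ok(TaskState::Queued),
        (s, e) => Err(format!("invalid transition: {:?} on {:?}", s, e)),
    }
}

/// Derives the current state by folding over the recorded events,
/// starting from `Queued` and stopping at the first invalid transition.
fn replay(events: &[TaskEvent]) -> Result<TaskState, String> {
    events
        .iter()
        .try_fold(TaskState::Queued, |state, &event| apply(state, event))
}
```

Because state is derived from the event log rather than stored directly, a failed and retried task keeps its full history: replaying `Dispatched, Failed, Retried, Dispatched, Completed` ends in `Completed`.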
## Quick Start
### Prerequisites
- Rust toolchain with 2024 edition support (Rust 1.85 or later)
- cargo-zigbuild (for cross-compilation)
- Forgejo instance (or compatible forge)
### Development Setup
```bash
# Clone the repository
git clone https://git.0x08.org/zer4tul/agent-fleet.git
cd agent-fleet
# Copy example config
cp config.example.toml config.toml
# Edit config.toml with your settings
# - Forgejo URL and token
# - Webhook secret
# - Host configurations for ssh_cli mode
```
### Local Development
```bash
# Run tests
cargo test
# Run the server
cargo run
# Or with custom bind/port
cargo run -- --bind 127.0.0.1 --port 9090
```
### Building for aarch64
```bash
# Install cargo-zigbuild if not already installed
cargo install cargo-zigbuild
# Cross-compile for aarch64-unknown-linux-gnu
cargo zigbuild --target aarch64-unknown-linux-gnu --release
# Binary will be at: target/aarch64-unknown-linux-gnu/release/agent-fleet
```
## Configuration
Agent Fleet is configured through a TOML file. See `config.example.toml` for a complete example.
### Server Settings
```toml
[server]
bind = "0.0.0.0" # Listen address
port = 9090 # HTTP port
```
### Forgejo Integration
```toml
[forgejo]
url = "https://git.0x08.org"
token = "your-api-token" # Forgejo API token
webhook_secret = "your-webhook-secret" # Shared secret for webhook validation
```
### Orchestrator Settings
```toml
[orchestrator]
db_path = "data/agent-fleet.db" # SQLite database path
heartbeat_interval_secs = 60 # Agent heartbeat interval
heartbeat_timeout_threshold = 3 # Missed heartbeats before offline
task_timeout_secs = 1800 # Default task timeout (30 min)
default_max_retries = 2 # Max retry attempts
dispatch_interval_secs = 10 # Dispatch loop interval
# http_pull_token = "optional-bearer-token" # Auth for http_pull agents
```
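To make the timing settings concrete, here is a rough sketch of how they might combine. It assumes an agent counts as offline once `heartbeat_interval_secs × heartbeat_timeout_threshold` has elapsed without a heartbeat (60 s × 3 = 180 s with the values above), and that a task may be retried while its retry count is below `default_max_retries`. The struct and method names are hypothetical, and the actual checks in the codebase may differ.

```rust
use std::time::Duration;

/// Subset of the `[orchestrator]` settings used by the checks below.
struct OrchestratorSettings {
    heartbeat_interval_secs: u64,
    heartbeat_timeout_threshold: u32,
    task_timeout_secs: u64,
    default_max_retries: u32,
}

impl OrchestratorSettings {
    /// Values copied from the example configuration.
    fn example() -> Self {
        Self {
            heartbeat_interval_secs: 60,
            heartbeat_timeout_threshold: 3,
            task_timeout_secs: 1800,
            default_max_retries: 2,
        }
    }

    /// Assumed rule: offline after `threshold` full intervals without a heartbeat.
    fn agent_is_offline(&self, since_last_heartbeat: Duration) -> bool {
        let limit = Duration::from_secs(self.heartbeat_interval_secs)
            * self.heartbeat_timeout_threshold;
        since_last_heartbeat > limit
    }

    /// A task has timed out once it has run longer than `task_timeout_secs`.
    fn task_timed_out(&self, running_for: Duration) -> bool {
        running_for > Duration::from_secs(self.task_timeout_secs)
    }

    /// Assumed rule: retry only while fewer than `default_max_retries` retries were used.
    fn can_retry(&self, retries_so_far: u32) -> bool {
        retries_so_far < self.default_max_retries
    }
}
```

Under these assumptions, an agent silent for 181 seconds is treated as offline, a task running for 1801 seconds is marked `failed`, and a task that has already been retried twice is not retried again.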
### SSH CLI Hosts
Configure remote hosts for `ssh_cli` execution:
```toml
[[hosts]]
host_id = "host-worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/deploy/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
{ agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust", "code:python"] },
{ agent_type = "claude-code", max_concurrency = 1, capabilities = ["code:rust"] },
]
# For local execution (same machine as orchestrator)
[[hosts]]
host_id = "local"
hostname = "localhost"
ssh_user = "runner"
work_dir = "/tmp/agent-workspace"
agents = [
{ agent_type = "codex-cli", max_concurrency = 1, capabilities = ["code:rust"] },
]
```
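The host configuration above suggests how load-aware selection could work. The following sketch assumes each configured agent entry is a slot with a hypothetical `running` counter, and that the dispatcher prefers the slot with the most free capacity among those that satisfy all required capabilities. The `AgentSlot` type and `pick_slot` function are illustrative, not part of the project's API.

```rust
/// One `agents = [...]` entry on a host, plus a hypothetical in-flight counter.
struct AgentSlot {
    host_id: &'static str,
    agent_type: &'static str,
    max_concurrency: u32,
    running: u32,
    capabilities: &'static [&'static str],
}

/// Picks the slot with the most free capacity that has spare concurrency
/// and offers every required capability; returns None if nothing fits.
fn pick_slot<'a>(slots: &'a [AgentSlot], required: &[&str]) -> Option<&'a AgentSlot> {
    slots
        .iter()
        // Skip slots that are already at their concurrency limit.
        .filter(|slot| slot.running < slot.max_concurrency)
        // Keep only slots that cover all required capabilities.
        .filter(|slot| {
            required
                .iter()
                .all(|needed| slot.capabilities.iter().any(|have| have == needed))
        })
        // Safe subtraction: the first filter guarantees running < max_concurrency.
        .max_by_key(|slot| slot.max_concurrency - slot.running)
}
```

For example, if `codex-cli` on `host-worker-01` is running two tasks and the `local` host is running one, a task that needs `code:rust` goes to `claude-code`, and a task that needs `code:python` stays queued because the only capable slot is full.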
## API Summary
Agent Fleet exposes a REST API for agent registration, task management, and webhooks.
### Agent Endpoints
| Endpoint | Method | Description |
|----------|---------|-------------|
| `/api/v1/agents/register` | POST | Register or update an agent |
| `/api/v1/agents/heartbeat` | POST | Update agent heartbeat |
| `/api/v1/agents/deregister` | POST | Deregister an agent |
| `/api/v1/agents` | GET | List agents with filters |
### Task Endpoints
| Endpoint | Method | Description |
|----------|---------|-------------|
| `/api/v1/tasks` | GET | List tasks |
| `/api/v1/tasks/{task_id}` | GET | Get task details |
| `/api/v1/tasks/dequeue` | POST | Dequeue task (http_pull only) |
| `/api/v1/tasks/{task_id}/status` | POST | Update task status (http_pull only) |
| `/api/v1/tasks/{task_id}/complete` | POST | Complete task with receipt |
| `/api/v1/tasks/{task_id}/retry` | POST | Retry failed task |
### Other Endpoints
| Endpoint | Method | Description |
|----------|---------|-------------|
| `/healthz` | GET | Health check |
| `/api/v1/webhooks/forgejo` | POST | Forgejo webhook handler |
| `/api/v1/receipts` | POST | Submit task receipt |
For detailed API documentation, see [docs/agent-api-reference.md](docs/agent-api-reference.md).
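As a quick way to exercise the API without extra dependencies, the sketch below probes `/healthz` over a plain TCP connection using only the standard library. It assumes the server speaks unencrypted HTTP/1.1 on the configured bind address and port, and it reads only the status code because the response body format is not documented here. The helper names are hypothetical.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Builds a minimal HTTP/1.1 GET request for the health endpoint.
fn healthz_request(host: &str, port: u16) -> String {
    format!(
        "GET /healthz HTTP/1.1\r\nHost: {}:{}\r\nConnection: close\r\n\r\n",
        host, port
    )
}

/// Extracts the numeric status code from the first line of a response.
fn parse_status_code(response: &str) -> Option<u16> {
    let status_line = response.lines().next()?;
    let mut parts = status_line.split_whitespace();
    let version = parts.next()?;
    if !version.starts_with("HTTP/") {
        return None;
    }
    parts.next()?.parse().ok()
}

/// Sends the request and returns the status code, if one could be parsed.
fn probe(host: &str, port: u16) -> std::io::Result<Option<u16>> {
    let mut stream = TcpStream::connect((host, port))?;
    stream.write_all(healthz_request(host, port).as_bytes())?;
    // `Connection: close` lets us read until the server closes the socket.
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;
    Ok(parse_status_code(&String::from_utf8_lossy(&raw)))
}
```

Calling `probe("127.0.0.1", 9090)` against a locally running server returns the HTTP status code of the health check. A production agent would more likely use an HTTP client crate and send the bearer token for authenticated endpoints.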
## Deployment
See [docs/deployment.md](docs/deployment.md) for detailed deployment instructions including:
- Cross-compilation with cargo-zigbuild
- Systemd service configuration
- Caddy reverse proxy setup
## Architecture Details
For in-depth architectural information, see [docs/architecture.md](docs/architecture.md) covering:
- Dual execution model comparison
- Dispatch loop internals
- Task lifecycle and state machine
- Forgejo integration flow
## Agent Integration
See [docs/agent-onboarding-guide.md](docs/agent-onboarding-guide.md) for:
- Choosing between `ssh_cli` and `http_pull` modes
- Agent registration and heartbeat
- Task dequeue and completion workflows
## Development
### Running Tests
```bash
cargo test
```
### Code Style
- Rust 2024 edition
- `thiserror` for error types
- `serde` for serialization
- All DB operations go through `EventStore`
- `Arc<Mutex<EventStore>>` for shared state
### Project Structure
```
src/
├── main.rs              # Entry point, server setup
├── config.rs            # TOML configuration
├── api.rs               # HTTP API handlers
├── dispatch.rs          # Task dispatch loop
├── execution/           # SSH execution
├── integrations/        # Forgejo client
├── adapters/            # Agent adapter interface
└── core/                # Business logic
    ├── models.rs        # Data models
    ├── event_store.rs   # Event sourcing
    ├── state_machine.rs # State transitions
    ├── task_queue.rs    # HTTP pull queue
    ├── timeout.rs       # Timeout checker
    └── retry.rs         # Retry policy
```
## License
MIT