agent-fleet/docs/deployment.md
Zer4tul a18cb2824e fix: agent capability matching in dispatch — only agent: labels are requirements
Previous bug: only code:* and review labels were checked, so agent:document,
agent:tests etc. were never filtered. Any agent could pick up any task.

Now: labels with agent: prefix are matched against agent capabilities.
Other labels are treated as metadata. Includes regression test.
2026-05-12 23:51:08 +08:00

# Agent Fleet Deployment Guide
This guide covers deploying the Agent Fleet Orchestrator to production, including cross-compilation, systemd service setup, and reverse proxy configuration with Caddy.
## Prerequisites
- Development machine with Rust and cargo
- Target server (e.g., aarch64 Linux)
- cargo-zigbuild and Zig for cross-compilation
- Caddy web server (optional, for reverse proxy)
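Before building, it can help to confirm that the tools listed above are actually on `PATH`. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, not part of Agent Fleet.

```bash
# Hypothetical helper: print each required command that is not found on PATH,
# one per line. Prints nothing when everything is available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
    return 0
}
```

For example, `missing_tools cargo cargo-zigbuild zig` prints the names that still need to be installed on the development machine.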
## Building with cargo-zigbuild
cargo-zigbuild enables cross-compilation by using Zig as the C toolchain, avoiding the need for target-specific cross-compilation toolchains.
### Installing cargo-zigbuild
```bash
cargo install cargo-zigbuild
```
### Cross-Compiling for aarch64-unknown-linux-gnu
```bash
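# Install the Rust standard library for the target (one-time step, assuming Rust is managed with rustup)
rustup target add aarch64-unknown-linux-gnu
# Build the release binary
cargo zigbuild --target aarch64-unknown-linux-gnu --release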
# The binary will be at:
# target/aarch64-unknown-linux-gnu/release/agent-fleet
```
### Other Target Architectures
```bash
# For x86_64 Linux (standard)
cargo build --release
# For aarch64 (ARM64 servers)
cargo zigbuild --target aarch64-unknown-linux-gnu --release
# For musl (static linking)
cargo zigbuild --target x86_64-unknown-linux-musl --release
```
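If you are unsure which target a server needs, the output of `uname -m` on that server can be mapped to a target triple. This is a small sketch covering only the two architectures mentioned in this guide; `target_for_arch` is a hypothetical helper name.

```bash
# Hypothetical helper: map `uname -m` output to a Rust target triple.
# Only the architectures discussed in this guide are covered.
target_for_arch() {
    case "$1" in
        aarch64|arm64) echo "aarch64-unknown-linux-gnu" ;;
        x86_64|amd64)  echo "x86_64-unknown-linux-gnu" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}
```

A possible use is `target_for_arch "$(ssh user@target-host uname -m)"`, passing the result to `cargo zigbuild --target`.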
### Building for Local Testing
For local development on the same architecture:
```bash
cargo build --release
```
## Deployment to aarch64
### 1. Transfer the Binary
After building, transfer the binary to your target server. `/opt/agent-fleet` is only created in the next step, so upload to a temporary location first:
```bash
# Using scp
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/tmp/
# Or using rsync
rsync -avz target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/tmp/
```
### 2. Set Up Directory Structure
```bash
ssh user@target-host
# Create directory and user
sudo useradd -r -s /bin/false agent-fleet
sudo mkdir -p /opt/agent-fleet/{bin,data,config}
sudo chown -R agent-fleet:agent-fleet /opt/agent-fleet
# Copy binary
sudo cp /path/to/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
```
### 3. Create Configuration File
Create `/opt/agent-fleet/config/config.toml`:
```toml
[server]
bind = "127.0.0.1"
port = 9090
[forgejo]
url = "https://git.0x08.org"
token = "your-forgejo-api-token"
webhook_secret = "your-webhook-secret"
[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10
# Configure remote hosts for ssh_cli execution
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/agent-fleet/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
    { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```
### 4. Create Environment Variables (Optional)
For sensitive values, use a `.env.local` file instead of config:
```bash
# /opt/agent-fleet/.env.local
FORGEJO_TOKEN="your-token"
WEBHOOK_SECRET="your-secret"
```
Referencing these variables from `config.toml` is not currently supported, so for now put the values directly in the config file.
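Because the API token and webhook secret end up in a file on disk, it is worth checking that the file is not readable by other users. The sketch below assumes GNU `stat` (as found on typical Linux servers); `is_private_file` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed only if the file grants no permissions to
# group or others (for example mode 600 or 400). Assumes GNU stat.
is_private_file() {
    local mode
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        *00) return 0 ;;
        *)   return 1 ;;
    esac
}
```

For example, `is_private_file /opt/agent-fleet/config/config.toml || echo "config is too permissive"`. Running `sudo chown agent-fleet:agent-fleet` and `sudo chmod 600` on the file makes the check pass while keeping it readable by the service user.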
## Systemd Service
### Create Systemd Service File
Create `/etc/systemd/system/agent-fleet.service`:
```ini
[Unit]
Description=Agent Fleet Orchestrator
After=network.target
Documentation=https://git.0x08.org/zer4tul/agent-fleet
[Service]
Type=simple
User=agent-fleet
Group=agent-fleet
WorkingDirectory=/opt/agent-fleet
ExecStart=/opt/agent-fleet/bin/agent-fleet --config /opt/agent-fleet/config/config.toml
Restart=always
RestartSec=10
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
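# ProtectHome=true hides /home from the service. If ssh_key_path points
# under /home, use ProtectHome=read-only instead or keep the key elsewhere
# (for example under /opt/agent-fleet).
ProtectHome=true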
ReadWritePaths=/opt/agent-fleet/data
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=agent-fleet
[Install]
WantedBy=multi-user.target
```
### Enable and Start the Service
```bash
# Reload systemd
sudo systemctl daemon-reload
# Enable to start on boot
sudo systemctl enable agent-fleet
# Start the service
sudo systemctl start agent-fleet
# Check status
sudo systemctl status agent-fleet
# View logs
sudo journalctl -u agent-fleet -f
```
### Management Commands
```bash
# Restart
sudo systemctl restart agent-fleet
# Stop
sudo systemctl stop agent-fleet
# Disable
sudo systemctl disable agent-fleet
```
## Caddy Reverse Proxy
Using Caddy as a reverse proxy provides:
- Automatic HTTPS with Let's Encrypt
- Path-based routing
- Basic auth (optional)
- Request logging
### Install Caddy
```bash
# Ubuntu/Debian
sudo apt install caddy
# Or using the official package
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
```
### Configure Caddy
Create or edit `/etc/caddy/Caddyfile`:
```caddyfile
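your-domain.example.com {
    reverse_proxy 127.0.0.1:9090
    # Optional: Basic authentication
    # (the hash below is a placeholder; generate one with `caddy hash-password`)
    # basicauth {
    #     admin $2a$14$Zkx19YLhRnJ8O6l0ZPd.OqG9vXK4wQ6Y5wZQH5Y5x5x5x5x5x5x
    # }
    # Optional: Log requests
    log {
        output file /var/log/caddy/agent-fleet.log
        format json
    }
}
# Or, to serve under a path prefix, use this site block instead of the one above.
# handle_path strips the /agent-fleet prefix before proxying; use `handle`
# instead if the backend expects to see the prefix.
# your-domain.example.com {
#     handle_path /agent-fleet/* {
#         reverse_proxy 127.0.0.1:9090
#     }
# }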
```
### Test and Reload Caddy
```bash
# Test configuration
sudo caddy validate --config /etc/caddy/Caddyfile
# Reload Caddy
sudo systemctl reload caddy
# Or restart
sudo systemctl restart caddy
```
### Verify HTTPS
After Caddy starts, it automatically provisions a TLS certificate from Let's Encrypt, provided the domain's DNS record points at this server and ports 80 and 443 are reachable. Verify:
```bash
curl https://your-domain.example.com/healthz
```
## Configuration Walkthrough
### Server Configuration
```toml
[server]
bind = "127.0.0.1" # Bind to localhost (Caddy handles external traffic)
port = 9090 # Internal port
```
**Notes:**
- Use `127.0.0.1` when behind a reverse proxy
- Use `0.0.0.0` if direct access is needed
### Forgejo Configuration
```toml
[forgejo]
url = "https://git.0x08.org"
token = "your-api-token"
webhook_secret = "your-webhook-secret"
```
**Setup Steps:**
1. Generate a Forgejo API token: User Settings → Applications → Generate New Token
2. Configure a webhook in the Forgejo repo settings:
   - URL: `https://your-domain.example.com/api/v1/webhooks/forgejo`
   - Secret: same as `webhook_secret`
   - Events: Issues, Pull Requests, Push
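To test the webhook endpoint by hand, a signed request has to carry a signature derived from the shared secret. Forgejo, like Gitea, signs the raw payload with HMAC-SHA256 and sends the hex digest in a signature header. Which header Agent Fleet checks is not stated in this guide, so treat the header name in the usage note as an assumption. `hmac_sha256_hex` is a hypothetical helper name; the sketch requires `openssl`.

```bash
# Hypothetical helper: compute the hex-encoded HMAC-SHA256 of a payload read
# from stdin, keyed with the webhook secret given as the first argument.
hmac_sha256_hex() {
    local secret="$1"
    # openssl prints "<label>= <hex>"; keep only the hex digest.
    openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}'
}
```

For example, `sig=$(printf '%s' "$payload" | hmac_sha256_hex "your-webhook-secret")`, after which the same payload can be posted with curl and a header such as `X-Gitea-Signature: $sig` (assumed header name). Use `printf '%s'` rather than `echo` so that no trailing newline changes the digest.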
### Orchestrator Configuration
```toml
[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10
```
**Explanation:**
- `heartbeat_interval_secs`: How often agents should send heartbeats
- `heartbeat_timeout_threshold`: How many missed heartbeats before marking agent offline (3 × 60 = 180 seconds)
- `task_timeout_secs`: Default timeout for tasks (1800 seconds = 30 minutes)
- `default_max_retries`: How many times to retry failed tasks
- `dispatch_interval_secs`: How often the dispatch loop checks for new `ssh_cli` tasks
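The offline window described above is simply the heartbeat interval multiplied by the threshold. The sketch below reproduces that arithmetic so alternative settings can be compared quickly; `offline_after_secs` is a hypothetical helper name.

```bash
# Hypothetical helper: seconds of silence before an agent is marked offline,
# computed as heartbeat_interval_secs * heartbeat_timeout_threshold.
offline_after_secs() {
    local interval_secs="$1" threshold="$2"
    echo $(( interval_secs * threshold ))
}
```

With the values from this guide, `offline_after_secs 60 3` prints `180`. Halving the interval to 30 seconds with the same threshold would shorten the window to 90 seconds.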
### Host Configuration for SSH CLI
```toml
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/agent-fleet/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
    { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```
**SSH Key Setup:**
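1. Generate an SSH key on the orchestrator server. The `agent-fleet` system user was created without a home directory, so create the `.ssh` directory first:
```bash
sudo install -d -m 700 -o agent-fleet -g agent-fleet /home/agent-fleet/.ssh
sudo -u agent-fleet ssh-keygen -t ed25519 -f /home/agent-fleet/.ssh/id_ed25519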
```
2. Add public key to remote host's `~/.ssh/authorized_keys`
3. Test SSH connection:
```bash
sudo -u agent-fleet ssh -p 22 deploy@192.168.1.100
```
**Agent CLI Setup on Remote Host:**
1. Ensure agent CLI is in `$PATH`
2. Verify: `ssh deploy@host "which codex"`
## Troubleshooting
### Service Won't Start
```bash
# Check service status
sudo systemctl status agent-fleet
# View logs
sudo journalctl -u agent-fleet -n 100
# Check file permissions
ls -la /opt/agent-fleet/
```
### Webhook Not Received
```bash
# Check Caddy logs
sudo journalctl -u caddy -f
# Check agent-fleet logs for webhook errors
sudo journalctl -u agent-fleet | grep webhook
# Verify webhook secret matches
# The secret in config.toml must match Forgejo webhook secret
```
### SSH Connection Fails
```bash
# Test SSH as the agent-fleet user
sudo -u agent-fleet ssh -v deploy@host
# Check SSH key path exists
sudo -u agent-fleet ls -la /home/agent-fleet/.ssh/
# Restrict key permissions (ssh rejects private keys that other users can read)
sudo -u agent-fleet chmod 600 /home/agent-fleet/.ssh/id_ed25519
```
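An interactive `ssh` session can hang on a password or host-key prompt, which the orchestrator cannot answer. A non-interactive probe fails fast instead. The sketch below uses the standard OpenSSH options `BatchMode` and `ConnectTimeout`; `probe_host` and the `SSH_BIN` override are hypothetical names introduced for illustration.

```bash
# Hypothetical helper: run a no-op command on the remote host without any
# interactive prompts. SSH_BIN can be overridden (e.g. for a dry run).
probe_host() {
    local user="$1" host="$2" port="${3:-22}" key="$4"
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
        -i "$key" -p "$port" "$user@$host" true
}
```

Run as the service user (for example from a `sudo -u agent-fleet bash` shell), `probe_host deploy 192.168.1.100 22 /home/agent-fleet/.ssh/id_ed25519` returns a non-zero status if key authentication or host-key verification would require a prompt.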
### Database Lock Issues
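If the database becomes locked (e.g., after a crash), move it aside and let the service create a fresh one. Note that the service then starts with empty state; the old data remains only in the backup file: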
```bash
# Stop the service
sudo systemctl stop agent-fleet
# Move the old database aside as a backup (sudo is needed because the data directory belongs to agent-fleet)
sudo mv /opt/agent-fleet/data/agent-fleet.db /opt/agent-fleet/data/agent-fleet.db.backup
# Restart (will create new database)
sudo systemctl start agent-fleet
```
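Repeating the recovery above would overwrite an earlier `.backup` file. One way to avoid that is to take a timestamped copy while the service is stopped. This is a minimal sketch; `backup_db` is a hypothetical helper name, and the `.bak` naming scheme is an assumption.

```bash
# Hypothetical helper: copy the database to a timestamped backup next to it
# and print the backup path. Run only while the service is stopped.
backup_db() {
    local db_path="$1" backup_path
    backup_path="${db_path}.$(date +%Y%m%d-%H%M%S).bak"
    cp -p -- "$db_path" "$backup_path" && printf '%s\n' "$backup_path"
}
```

For example, run `backup_db /opt/agent-fleet/data/agent-fleet.db` from a root shell after `systemctl stop agent-fleet` and before removing or replacing the database.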
## Monitoring
### Health Check
```bash
curl http://localhost:9090/healthz
# Expected output: "ok"
```
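Right after a start or restart, the endpoint may need a moment before it answers. A small polling loop can wait for it. The sketch below assumes the response body is exactly `ok`, as in the expected output above; `wait_healthy` is a hypothetical helper name.

```bash
# Hypothetical helper: run a command up to <attempts> times, <delay_secs>
# apart, and succeed as soon as its output is exactly "ok".
wait_healthy() {
    local attempts="$1" delay_secs="$2" i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        if [ "$("$@" 2>/dev/null)" = "ok" ]; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay_secs"
        fi
    done
    return 1
}
```

For example, `wait_healthy 10 2 curl -fsS http://localhost:9090/healthz` polls for up to roughly 20 seconds and returns a non-zero status if the service never reports healthy.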
### Check Task Queue
```bash
curl 'http://localhost:9090/api/v1/tasks?status=running'
```
### Check Agents
```bash
curl 'http://localhost:9090/api/v1/agents?status=online'
```
### Log Monitoring
```bash
# Follow logs
sudo journalctl -u agent-fleet -f
# Search for errors
sudo journalctl -u agent-fleet | grep -i error
```
## Updates
### Updating the Binary
```bash
# Build new version locally
cargo zigbuild --target aarch64-unknown-linux-gnu --release
# Transfer to server
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/tmp/
# On the server, stop and replace
sudo systemctl stop agent-fleet
sudo cp /tmp/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
sudo systemctl start agent-fleet
```
### Zero-Downtime Updates
For production deployments with minimal downtime:
```bash
# Upload new binary alongside old one
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/opt/agent-fleet/bin/agent-fleet.new
# On server
sudo mv /opt/agent-fleet/bin/agent-fleet.new /opt/agent-fleet/bin/agent-fleet
sudo systemctl restart agent-fleet
```
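Because `mv` within the same filesystem is an atomic rename, the running process keeps using the old binary until the restart. Downtime is therefore limited to the time systemd needs to restart the service, but it is not zero.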
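The stage-then-rename pattern can be wrapped in a small function so that the executable bit and the atomic swap are handled together. This is a sketch using the coreutils `install` and `mv` commands; `install_binary` is a hypothetical helper name.

```bash
# Hypothetical helper: copy <src> next to <dest> with mode 0755, then
# atomically rename it over <dest>. The staging file lives in the same
# directory so the rename stays on one filesystem.
install_binary() {
    local src="$1" dest="$2" tmp
    tmp="${dest}.new.$$"
    install -m 0755 -- "$src" "$tmp" && mv -f -- "$tmp" "$dest"
}
```

For example, from a root shell on the server: `install_binary /tmp/agent-fleet /opt/agent-fleet/bin/agent-fleet`, followed by `systemctl restart agent-fleet`.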