fix: agent capability matching in dispatch — only agent: labels are requirements
Previous bug: only code:* and review labels were checked, so agent:document, agent:tests etc. were never filtered. Any agent could pick up any task. Now: labels with agent: prefix are matched against agent capabilities. Other labels are treated as metadata. Includes regression test.
parent 1f351a1734
commit a18cb2824e
6 changed files with 1271 additions and 8 deletions
docs/deployment.md (Normal file, 465 lines)
@@ -0,0 +1,465 @@
# Agent Fleet Deployment Guide

This guide covers deploying Agent Fleet Orchestrator to production, including cross-compilation, systemd service setup, and reverse proxy configuration with Caddy.

## Prerequisites

- Development machine with Rust and cargo
- Target server (e.g., aarch64 Linux)
- cargo-zigbuild for cross-compilation (requires Zig to be installed)
- Caddy web server (optional, for reverse proxy)

## Building with cargo-zigbuild

cargo-zigbuild enables cross-compilation by using Zig as the linker and C toolchain, so no target-specific cross-compilation toolchain is needed.

### Installing cargo-zigbuild

```bash
cargo install cargo-zigbuild
```

### Cross-Compiling for aarch64-unknown-linux-gnu

```bash
# Install the Rust standard library for the target (one-time, assumes rustup)
rustup target add aarch64-unknown-linux-gnu

# Build the release binary
cargo zigbuild --target aarch64-unknown-linux-gnu --release

# The binary will be at:
# target/aarch64-unknown-linux-gnu/release/agent-fleet
```

### Other Target Architectures

```bash
# For x86_64 Linux (native build on an x86_64 host)
cargo build --release

# For aarch64 (ARM64 servers)
cargo zigbuild --target aarch64-unknown-linux-gnu --release

# For musl (static linking)
cargo zigbuild --target x86_64-unknown-linux-musl --release
```

### Building for Local Testing

For local development on the same architecture:

```bash
cargo build --release
```

## Deployment to aarch64

### 1. Transfer the Binary

After building, transfer the binary to a staging location on your target server (`/opt/agent-fleet` is only created in the next step):

```bash
# Using scp
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/tmp/

# Or using rsync
rsync -avz target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/tmp/
```
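
To confirm that the transfer did not corrupt the binary, the checksums on both ends can be compared. The sketch below assumes `sha256sum` (GNU coreutils) is available on both machines. `sha256_of` and `same_sha256` are hypothetical helper names, and `/path/to/agent-fleet` stands for wherever the binary was uploaded.

```bash
#!/usr/bin/env bash
# sha256_of FILE
# Prints only the hex SHA-256 digest of a local file.
sha256_of() {
  sha256sum "$1" | awk '{print $1}'
}

# same_sha256 DIGEST_A DIGEST_B
# Succeeds only if both digests are non-empty and identical.
same_sha256() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (commented out so that sourcing this file contacts nothing):
# local_sum=$(sha256_of target/aarch64-unknown-linux-gnu/release/agent-fleet)
# remote_sum=$(ssh user@target-host sha256sum /path/to/agent-fleet | awk '{print $1}')
# same_sha256 "$local_sum" "$remote_sum" && echo "transfer OK"
```

The empty-digest check matters because a failed `ssh` would otherwise yield two empty strings that compare equal.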

### 2. Set Up Directory Structure

```bash
ssh user@target-host

# Create the service user and directories
sudo useradd -r -s /bin/false agent-fleet
sudo mkdir -p /opt/agent-fleet/{bin,data,config}
sudo chown -R agent-fleet:agent-fleet /opt/agent-fleet

# Copy the binary from wherever you uploaded it in step 1
sudo cp /path/to/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
```

### 3. Create Configuration File

Create `/opt/agent-fleet/config/config.toml`:

```toml
[server]
bind = "127.0.0.1"
port = 9090

[forgejo]
url = "https://git.0x08.org"
token = "your-forgejo-api-token"
webhook_secret = "your-webhook-secret"

[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10

# Configure remote hosts for ssh_cli execution
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
# Must be readable by the agent-fleet service user
ssh_key_path = "/home/agent-fleet/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
    { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```

### 4. Create Environment Variables (Optional)

Sensitive values could instead be kept in a `.env.local` file:

```bash
# /opt/agent-fleet/.env.local
FORGEJO_TOKEN="your-token"
WEBHOOK_SECRET="your-secret"
```

Referencing these variables from `config.toml` is not currently supported, so for now set the values directly in the config file.

## Systemd Service

### Create Systemd Service File

Create `/etc/systemd/system/agent-fleet.service`:

```ini
[Unit]
Description=Agent Fleet Orchestrator
After=network.target
Documentation=https://git.0x08.org/zer4tul/agent-fleet

[Service]
Type=simple
User=agent-fleet
Group=agent-fleet
WorkingDirectory=/opt/agent-fleet
ExecStart=/opt/agent-fleet/bin/agent-fleet --config /opt/agent-fleet/config/config.toml
Restart=always
RestartSec=10

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
# read-only rather than true: "true" hides /home entirely, including the key set in ssh_key_path
ProtectHome=read-only
ReadWritePaths=/opt/agent-fleet/data

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=agent-fleet

[Install]
WantedBy=multi-user.target
```

### Enable and Start the Service

```bash
# Reload systemd
sudo systemctl daemon-reload

# Enable to start on boot
sudo systemctl enable agent-fleet

# Start the service
sudo systemctl start agent-fleet

# Check status
sudo systemctl status agent-fleet

# View logs
sudo journalctl -u agent-fleet -f
```

### Management Commands

```bash
# Restart
sudo systemctl restart agent-fleet

# Stop
sudo systemctl stop agent-fleet

# Disable
sudo systemctl disable agent-fleet
```

## Caddy Reverse Proxy

Using Caddy as a reverse proxy provides:
- Automatic HTTPS with Let's Encrypt
- Path-based routing
- Basic auth (optional)
- Request logging

### Install Caddy

```bash
# Ubuntu/Debian
sudo apt install caddy

# Or from the official Caddy apt repository
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
```

### Configure Caddy

Create or edit `/etc/caddy/Caddyfile`:

```caddyfile
your-domain.example.com {
    reverse_proxy 127.0.0.1:9090

    # Optional: Basic authentication
    # (generate the hash with `caddy hash-password`)
    # basicauth {
    #     admin <bcrypt-hash>
    # }

    # Optional: Log requests
    log {
        output file /var/log/caddy/agent-fleet.log
        format json
    }
}

# Or, instead of the block above, serve under a path prefix.
# handle_path strips /agent-fleet before proxying.
your-domain.example.com {
    handle_path /agent-fleet/* {
        reverse_proxy 127.0.0.1:9090
    }
}
```

### Test and Reload Caddy

```bash
# Test configuration
sudo caddy validate --config /etc/caddy/Caddyfile

# Reload Caddy
sudo systemctl reload caddy

# Or restart
sudo systemctl restart caddy
```

### Verify HTTPS

After Caddy starts, it automatically provisions a TLS certificate from Let's Encrypt. Verify:

```bash
curl https://your-domain.example.com/healthz
```

## Configuration Walkthrough

### Server Configuration

```toml
[server]
bind = "127.0.0.1"  # Bind to localhost (Caddy handles external traffic)
port = 9090         # Internal port
```

**Notes:**
- Use `127.0.0.1` when behind a reverse proxy
- Use `0.0.0.0` if direct access is needed

### Forgejo Configuration

```toml
[forgejo]
url = "https://git.0x08.org"
token = "your-api-token"
webhook_secret = "your-webhook-secret"
```

**Setup Steps:**
1. Generate a Forgejo API token: User Settings → Applications → Generate New Token
2. Configure webhook in Forgejo repo settings:
   - URL: `https://your-domain.example.com/api/v1/webhooks/forgejo`
   - Secret: same as `webhook_secret`
   - Events: Issues, Pull Requests, Push
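
To exercise the webhook endpoint without waiting for a real event, a request can be signed by hand. Forgejo is understood to sign the raw request body with HMAC-SHA256 using the webhook secret and to send the hex digest in an `X-Forgejo-Signature` header. Whether agent-fleet checks exactly that header is an assumption here, as are the helper name `hmac_sha256_hex` and the sample payload. A minimal sketch, assuming `openssl` is installed:

```bash
#!/usr/bin/env bash
# hmac_sha256_hex SECRET PAYLOAD
# Prints the hex HMAC-SHA256 of PAYLOAD keyed with SECRET.
# (Hypothetical helper name; not part of agent-fleet.)
hmac_sha256_hex() {
  local secret="$1" payload="$2"
  # printf '%s' avoids a trailing newline, which would change the digest.
  # openssl prints "<label>= <hex>"; keep only the last field.
  printf '%s' "$payload" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}'
}

# Example (commented out so that sourcing this file sends nothing).
# The header names and payload are assumptions for illustration:
# payload='{"action":"opened"}'
# curl -i -X POST "https://your-domain.example.com/api/v1/webhooks/forgejo" \
#   -H "Content-Type: application/json" \
#   -H "X-Forgejo-Event: issues" \
#   -H "X-Forgejo-Signature: $(hmac_sha256_hex 'your-webhook-secret' "$payload")" \
#   --data-binary "$payload"
```

If the secret used here differs from `webhook_secret` in `config.toml`, the request should be rejected, which makes this a quick check for a secret mismatch.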

### Orchestrator Configuration

```toml
[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10
```

**Explanation:**
- `heartbeat_interval_secs`: How often agents should send heartbeats
- `heartbeat_timeout_threshold`: How many missed heartbeats before marking agent offline (3 × 60 = 180 seconds)
- `task_timeout_secs`: Default timeout for tasks (1800 seconds = 30 minutes)
- `default_max_retries`: How many times to retry failed tasks
- `dispatch_interval_secs`: How often the dispatch loop checks for new `ssh_cli` tasks
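
When tuning these values it can help to recompute the derived figures. The helpers below are a small sketch of the arithmetic in the list above (interval × threshold, and seconds to minutes). The function names are made up for illustration and do not correspond to anything in agent-fleet.

```bash
#!/usr/bin/env bash
# offline_after_secs INTERVAL_SECS THRESHOLD
# Seconds of silence after which an agent would be marked offline,
# assuming the window is simply interval x threshold as described above.
offline_after_secs() {
  local interval="$1" threshold="$2"
  echo $(( interval * threshold ))
}

# secs_to_minutes SECS
# Whole minutes (integer division).
secs_to_minutes() {
  echo $(( $1 / 60 ))
}

offline_after_secs 60 3   # prints 180
secs_to_minutes 1800      # prints 30
```

For example, halving `heartbeat_interval_secs` to 30 while keeping the threshold at 3 would shorten the offline window to 90 seconds.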

### Host Configuration for SSH CLI

```toml
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/agent-fleet/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
    { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```

**SSH Key Setup:**
1. Generate SSH key on the orchestrator server (the first command creates the `.ssh` directory, because `useradd -r` does not create a home directory):
   ```bash
   sudo install -d -m 700 -o agent-fleet -g agent-fleet /home/agent-fleet/.ssh
   sudo -u agent-fleet ssh-keygen -t ed25519 -f /home/agent-fleet/.ssh/id_ed25519
   ```

2. Add public key to remote host's `~/.ssh/authorized_keys`

3. Test SSH connection:
   ```bash
   sudo -u agent-fleet ssh -p 22 deploy@192.168.1.100
   ```

**Agent CLI Setup on Remote Host:**
1. Ensure agent CLI is in `$PATH`
2. Verify: `ssh deploy@host "which codex"`

## Troubleshooting

### Service Won't Start

```bash
# Check service status
sudo systemctl status agent-fleet

# View logs
sudo journalctl -u agent-fleet -n 100

# Check file permissions
ls -la /opt/agent-fleet/
```

### Webhook Not Received

```bash
# Check Caddy logs
sudo journalctl -u caddy -f

# Check agent-fleet logs for webhook errors
sudo journalctl -u agent-fleet | grep -i webhook

# Verify webhook secret matches
# The secret in config.toml must match the Forgejo webhook secret
```

### SSH Connection Fails

```bash
# Test SSH as the agent-fleet user
sudo -u agent-fleet ssh -v deploy@host

# Check SSH key path exists
sudo -u agent-fleet ls -la /home/agent-fleet/.ssh/

# Fix key permissions (the private key must be readable only by its owner)
sudo -u agent-fleet chmod 600 /home/agent-fleet/.ssh/id_ed25519
```

### Database Lock Issues

If the database becomes locked (e.g., after a crash):

```bash
# Stop the service
sudo systemctl stop agent-fleet

# Move the old database aside as a backup
sudo mv /opt/agent-fleet/data/agent-fleet.db /opt/agent-fleet/data/agent-fleet.db.backup

# Restart (will create a new, empty database)
sudo systemctl start agent-fleet
```

## Monitoring

### Health Check

```bash
curl http://localhost:9090/healthz
# Expected output: "ok"
```
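
Right after a start or restart the endpoint may not answer yet, so a single `curl` can fail spuriously. One way to handle this is a small retry loop. `wait_until` below is a hypothetical helper, not part of agent-fleet, and the attempt count and delay in the example are arbitrary.

```bash
#!/usr/bin/env bash
# wait_until ATTEMPTS DELAY_SECS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECS between tries. Returns 0 on success, 1 if every try failed.
wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll the health endpoint for up to about 30 seconds.
# wait_until 30 1 curl -fsS http://localhost:9090/healthz \
#   || echo "agent-fleet did not become healthy"
```

`curl -f` makes HTTP error responses count as failures, so the loop only stops once the endpoint answers successfully.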

### Check Task Queue

```bash
curl 'http://localhost:9090/api/v1/tasks?status=running'
```

### Check Agents

```bash
curl 'http://localhost:9090/api/v1/agents?status=online'
```

### Log Monitoring

```bash
# Follow logs
sudo journalctl -u agent-fleet -f

# Search for errors
sudo journalctl -u agent-fleet | grep -i error
```

## Updates

### Updating the Binary

```bash
# Build new version locally
cargo zigbuild --target aarch64-unknown-linux-gnu --release

# Transfer to server
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/tmp/

# On the server, stop and replace
sudo systemctl stop agent-fleet
sudo cp /tmp/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
sudo systemctl start agent-fleet
```

### Zero-Downtime Updates

For production deployments with minimal downtime:

```bash
# Upload new binary alongside old one
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/opt/agent-fleet/bin/agent-fleet.new

# On server
sudo mv /opt/agent-fleet/bin/agent-fleet.new /opt/agent-fleet/bin/agent-fleet
sudo systemctl restart agent-fleet
```

Because the `mv` is a rename within the same directory, the binary is replaced atomically. This is not strictly zero downtime: the service is unavailable for the duration of the restart, but no longer.
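
The swap can be made reversible by keeping the previous binary around. The following is a sketch under stated assumptions, not part of agent-fleet: `swap_binary` and `rollback_binary` are hypothetical helpers, and the `.prev` suffix is an arbitrary choice.

```bash
#!/usr/bin/env bash
# swap_binary NEW CURRENT
# Saves CURRENT as CURRENT.prev (if it exists), then renames NEW over CURRENT.
swap_binary() {
  local new="$1" current="$2"
  [ -f "$new" ] || return 1
  if [ -f "$current" ]; then
    cp -p "$current" "$current.prev" || return 1
  fi
  chmod +x "$new" && mv -f "$new" "$current"
}

# rollback_binary CURRENT
# Restores CURRENT from CURRENT.prev; fails if no previous copy exists.
rollback_binary() {
  local current="$1"
  [ -f "$current.prev" ] || return 1
  mv -f "$current.prev" "$current"
}

# Example (on the server, as root; give the service a moment to start
# before checking its health):
# bin=/opt/agent-fleet/bin/agent-fleet
# swap_binary "$bin.new" "$bin" && systemctl restart agent-fleet
# sleep 5
# curl -fsS http://localhost:9090/healthz \
#   || { rollback_binary "$bin" && systemctl restart agent-fleet; }
```

Copying rather than moving the old binary keeps the live path valid at every moment, so a failure partway through leaves the running version in place.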