# Agent Fleet Deployment Guide

This guide covers deploying Agent Fleet Orchestrator to production, including cross-compilation, systemd service setup, and reverse proxy configuration with Caddy.

## Prerequisites

- Development machine with Rust and cargo
- Target server (e.g., aarch64 Linux)
- cargo-zigbuild for cross-compilation (requires a Zig installation on the development machine)
- Caddy web server (optional, for reverse proxy)

## Building with cargo-zigbuild

cargo-zigbuild enables cross-compilation by using Zig as the linker and C toolchain, avoiding the need for target-specific cross-compilation toolchains.

### Installing cargo-zigbuild

```bash
cargo install cargo-zigbuild
```

### Cross-Compiling for aarch64-unknown-linux-gnu

```bash
# Install the Rust standard library for the target (one-time step, if using rustup)
rustup target add aarch64-unknown-linux-gnu

# Build the release binary
cargo zigbuild --target aarch64-unknown-linux-gnu --release

# The binary will be at:
# target/aarch64-unknown-linux-gnu/release/agent-fleet
```

### Other Target Architectures

```bash
# For the host architecture (e.g., x86_64 Linux)
cargo build --release

# For aarch64 (ARM64 servers)
cargo zigbuild --target aarch64-unknown-linux-gnu --release

# For musl (static linking)
cargo zigbuild --target x86_64-unknown-linux-musl --release
```

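If you are unsure which `--target` a server needs, you can script the mapping from its `uname -m` output to a Rust target triple. The following is a minimal sketch: `target_for_arch` and `build_command` are hypothetical helper names, not part of agent-fleet or cargo-zigbuild, and only the two GNU/Linux targets are covered.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map `uname -m` output from the target server
# to the Rust target triple used with `cargo zigbuild --target`.
target_for_arch() {
  case "$1" in
    aarch64|arm64) echo "aarch64-unknown-linux-gnu" ;;
    x86_64|amd64)  echo "x86_64-unknown-linux-gnu" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the build command for that architecture.
build_command() {
  local target
  target="$(target_for_arch "$1")" || return 1
  echo "cargo zigbuild --target ${target} --release"
}

# Example: query the server, then print the command to run locally.
# build_command "$(ssh user@target-host uname -m)"
```

Printing the command instead of running it keeps the helper safe to try out. Pipe the output to `bash` once the result looks right.
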
### Building for Local Testing

For local development on the same architecture:

```bash
cargo build --release
```

## Deployment to aarch64

### 1. Transfer the Binary

After building, transfer the binary to a staging location on your target server (the install directory is created in the next step):

```bash
# Using scp
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/tmp/

# Or using rsync
rsync -avz target/aarch64-unknown-linux-gnu/release/agent-fleet user@target-host:/tmp/
```

### 2. Set Up Directory Structure

```bash
ssh user@target-host

# Create user and directories
# (-m creates /home/agent-fleet, which holds the SSH key used later)
sudo useradd -r -m -s /bin/false agent-fleet
sudo mkdir -p /opt/agent-fleet/{bin,data,config}
sudo chown -R agent-fleet:agent-fleet /opt/agent-fleet

# Copy binary from the staging location
sudo cp /tmp/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
```

### 3. Create Configuration File

Create `/opt/agent-fleet/config/config.toml`:

```toml
[server]
bind = "127.0.0.1"
port = 9090

[forgejo]
url = "https://git.0x08.org"
token = "your-forgejo-api-token"
webhook_secret = "your-webhook-secret"

[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10

# Configure remote hosts for ssh_cli execution
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/agent-fleet/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
    { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```

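Before starting the service, a quick pre-flight check can catch a config file that is unreadable or missing a table. The sketch below only looks for the three table headers shown in the example config. `check_config_sections` is a hypothetical helper, not an agent-fleet command, and it assumes each header sits alone on its line.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: confirm that the config file is readable
# and contains the table headers used in the example config.
check_config_sections() {
  local file="$1" section missing=0
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
  for section in '[server]' '[forgejo]' '[orchestrator]'; do
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq -- "$section" "$file"; then
      echo "missing section: $section" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# check_config_sections /opt/agent-fleet/config/config.toml && echo "config looks complete"
```

This does not validate TOML syntax or values. It only guards against an obviously incomplete file.
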
### 4. Create Environment Variables (Optional)

Sensitive values can be kept in a `.env.local` file:

```bash
# /opt/agent-fleet/.env.local
FORGEJO_TOKEN="your-token"
WEBHOOK_SECRET="your-secret"
```

Referencing these variables from `config.toml` is not currently supported, so for now set the values directly in the config file.

## Systemd Service

### Create Systemd Service File

Create `/etc/systemd/system/agent-fleet.service`:

```ini
[Unit]
Description=Agent Fleet Orchestrator
After=network.target
Documentation=https://git.0x08.org/zer4tul/agent-fleet

[Service]
Type=simple
User=agent-fleet
Group=agent-fleet
WorkingDirectory=/opt/agent-fleet
ExecStart=/opt/agent-fleet/bin/agent-fleet --config /opt/agent-fleet/config/config.toml
Restart=always
RestartSec=10

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
# ProtectHome=true would hide /home from the service entirely;
# read-only keeps the SSH key under /home/agent-fleet/.ssh readable
ProtectHome=read-only
ReadWritePaths=/opt/agent-fleet/data

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=agent-fleet

[Install]
WantedBy=multi-user.target
```

### Enable and Start the Service

```bash
# Reload systemd
sudo systemctl daemon-reload

# Enable to start on boot
sudo systemctl enable agent-fleet

# Start the service
sudo systemctl start agent-fleet

# Check status
sudo systemctl status agent-fleet

# View logs
sudo journalctl -u agent-fleet -f
```

### Management Commands

```bash
# Restart
sudo systemctl restart agent-fleet

# Stop
sudo systemctl stop agent-fleet

# Disable
sudo systemctl disable agent-fleet
```

## Caddy Reverse Proxy

Using Caddy as a reverse proxy provides:

- Automatic HTTPS with Let's Encrypt
- Path-based routing
- Basic auth (optional)
- Request logging

### Install Caddy

```bash
# Ubuntu/Debian
sudo apt install caddy

# Or using the official Caddy apt repository
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
```

### Configure Caddy

Create or edit `/etc/caddy/Caddyfile`:

```caddyfile
your-domain.example.com {
    reverse_proxy 127.0.0.1:9090

    # Optional: Basic authentication
    # (generate the hash with `caddy hash-password`)
    # basicauth {
    #     admin <bcrypt-hash>
    # }

    # Optional: Log requests
    log {
        output file /var/log/caddy/agent-fleet.log
        format json
    }
}

# Or with path prefix (use instead of the block above, not alongside it).
# handle_path strips the /agent-fleet prefix before proxying.
# your-domain.example.com {
#     handle_path /agent-fleet/* {
#         reverse_proxy 127.0.0.1:9090
#     }
# }
```

### Test and Reload Caddy

```bash
# Test configuration
sudo caddy validate --config /etc/caddy/Caddyfile

# Reload Caddy
sudo systemctl reload caddy

# Or restart
sudo systemctl restart caddy
```

### Verify HTTPS

After Caddy starts, it will automatically provision a TLS certificate from Let's Encrypt. This requires the domain's DNS record to point at the server and ports 80 and 443 to be reachable. Verify:

```bash
curl https://your-domain.example.com/healthz
```

## Configuration Walkthrough

### Server Configuration

```toml
[server]
bind = "127.0.0.1"  # Bind to localhost (Caddy handles external traffic)
port = 9090         # Internal port
```

**Notes:**

- Use `127.0.0.1` when behind a reverse proxy
- Use `0.0.0.0` if direct access is needed

### Forgejo Configuration

```toml
[forgejo]
url = "https://git.0x08.org"
token = "your-api-token"
webhook_secret = "your-webhook-secret"
```

**Setup Steps:**

1. Generate a Forgejo API token: User Settings → Applications → Generate New Token
2. Configure webhook in Forgejo repo settings:
   - URL: `https://your-domain.example.com/api/v1/webhooks/forgejo`
   - Secret: same as `webhook_secret`
   - Events: Issues, Pull Requests, Push

### Orchestrator Configuration

```toml
[orchestrator]
db_path = "/opt/agent-fleet/data/agent-fleet.db"
heartbeat_interval_secs = 60
heartbeat_timeout_threshold = 3
task_timeout_secs = 1800
default_max_retries = 2
dispatch_interval_secs = 10
```

**Explanation:**

- `heartbeat_interval_secs`: How often agents should send heartbeats
- `heartbeat_timeout_threshold`: How many missed heartbeats are tolerated before an agent is marked offline (with the values above, 3 × 60 = 180 seconds)
- `task_timeout_secs`: Default timeout for tasks (1800 seconds = 30 minutes)
- `default_max_retries`: How many times to retry failed tasks
- `dispatch_interval_secs`: How often the dispatch loop checks for new `ssh_cli` tasks

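The arithmetic behind these settings is easy to re-check when you change them. The sketch below recomputes the two derived figures from the explanation. The function names `offline_after_secs` and `secs_to_minutes` are hypothetical and used only for illustration.

```bash
#!/usr/bin/env bash
# Values copied from the [orchestrator] example.
heartbeat_interval_secs=60
heartbeat_timeout_threshold=3
task_timeout_secs=1800

# Seconds of missed heartbeats before an agent is marked offline:
# interval × threshold.
offline_after_secs() {
  echo $(( $1 * $2 ))
}

# Convert a timeout in seconds to whole minutes.
secs_to_minutes() {
  echo $(( $1 / 60 ))
}

echo "agent offline after: $(offline_after_secs "$heartbeat_interval_secs" "$heartbeat_timeout_threshold")s"
echo "task timeout: $(secs_to_minutes "$task_timeout_secs") min"
```

With the example values this prints 180 seconds and 30 minutes, matching the figures in the explanation.
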
### Host Configuration for SSH CLI

```toml
[[hosts]]
host_id = "worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/agent-fleet/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
    { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust"] },
]
```

**SSH Key Setup:**

1. Generate an SSH key on the orchestrator server:

   ```bash
   sudo -u agent-fleet ssh-keygen -t ed25519 -f /home/agent-fleet/.ssh/id_ed25519
   ```

2. Add the public key to `~/.ssh/authorized_keys` for the `ssh_user` account on the remote host

3. Test the SSH connection:

   ```bash
   sudo -u agent-fleet ssh -p 22 deploy@192.168.1.100
   ```

**Agent CLI Setup on Remote Host:**

1. Ensure the agent CLI is in `$PATH`
2. Verify: `ssh deploy@host "which codex"`

## Troubleshooting

### Service Won't Start

```bash
# Check service status
sudo systemctl status agent-fleet

# View logs
sudo journalctl -u agent-fleet -n 100

# Check file permissions
ls -la /opt/agent-fleet/
```

### Webhook Not Received

```bash
# Check Caddy logs
sudo journalctl -u caddy -f

# Check agent-fleet logs for webhook errors
sudo journalctl -u agent-fleet | grep webhook

# Verify webhook secret matches
# The secret in config.toml must match Forgejo webhook secret
```

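When checking whether the secret matches, it can help to compute a signature by hand. Forgejo, like Gitea, signs each webhook payload with HMAC-SHA256 keyed by the webhook secret and sends the hex digest in a signature header. Treat the exact header name (commonly `X-Forgejo-Signature` or `X-Gitea-Signature`) as an assumption and confirm it against your Forgejo version. The helpers `hmac_sha256_hex` and `signature_matches` below are hypothetical and rely on the `openssl` CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helper: hex HMAC-SHA256 of a payload under a secret.
# Requires the openssl CLI.
hmac_sha256_hex() {
  local secret="$1" payload="$2"
  printf '%s' "$payload" \
    | openssl dgst -sha256 -hmac "$secret" \
    | awk '{print $NF}'   # keep only the hex digest (last field)
}

# Succeeds when the received signature equals the locally computed one.
signature_matches() {
  local secret="$1" payload="$2" received="$3"
  [ "$(hmac_sha256_hex "$secret" "$payload")" = "$received" ]
}

# Example, using the request body and signature shown in the webhook delivery log:
# signature_matches "your-webhook-secret" "$body" "$signature" && echo "secret matches"
```

The payload must be byte-for-byte identical to what Forgejo sent. Shell command substitution strips trailing newlines, so a mismatch here does not always mean the secret is wrong.
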
### SSH Connection Fails

```bash
# Test SSH as the agent-fleet user
sudo -u agent-fleet ssh -v deploy@host

# Check SSH key path exists
sudo -u agent-fleet ls -la /home/agent-fleet/.ssh/

# Fix key permissions (the private key must be readable by its owner only)
sudo -u agent-fleet chmod 600 /home/agent-fleet/.ssh/id_ed25519
```

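OpenSSH refuses a private key that other users can read, so it is worth checking the mode before changing it. The following sketch reports whether a key file has an owner-only mode. `key_mode_ok` is a hypothetical helper and it relies on GNU `stat`, as found on Linux.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the private key is accessible
# to its owner alone (mode 600 or 400). Uses GNU `stat` (Linux).
key_mode_ok() {
  local mode
  mode="$(stat -c '%a' "$1")" || return 1
  case "$mode" in
    600|400) return 0 ;;
    *) echo "insecure mode $mode on $1" >&2; return 1 ;;
  esac
}

# Example (run as a user that can read the .ssh directory):
# key_mode_ok /home/agent-fleet/.ssh/id_ed25519 || echo "run chmod 600 on the key"
```
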
### Database Lock Issues

If the database becomes locked (e.g., after a crash), move it aside and let the service create a fresh one. The new database starts empty, so keep the backup:

```bash
# Stop the service
sudo systemctl stop agent-fleet

# Back up the old database by moving it aside
sudo mv /opt/agent-fleet/data/agent-fleet.db /opt/agent-fleet/data/agent-fleet.db.backup

# Restart (will create new database)
sudo systemctl start agent-fleet
```

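Moving the database to a fixed `.backup` name overwrites any earlier backup. A small variation is to add a timestamp to the backup name. This is a sketch only: `backup_db` is a hypothetical helper and the naming scheme is an assumption, not something agent-fleet expects.

```bash
#!/usr/bin/env bash
# Hypothetical helper: move a database file aside under a timestamped
# name so an earlier backup is never overwritten. Prints the new path.
backup_db() {
  local db_path="$1"
  local backup_path
  backup_path="${db_path}.$(date +%Y%m%d-%H%M%S).backup"
  [ -f "$db_path" ] || { echo "no database at $db_path" >&2; return 1; }
  mv -- "$db_path" "$backup_path" && echo "$backup_path"
}

# Example (with the service stopped, as root or the agent-fleet user):
# backup_db /opt/agent-fleet/data/agent-fleet.db
```
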
## Monitoring

### Health Check

```bash
curl http://localhost:9090/healthz
# Expected output: "ok"
```

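After a start or restart the endpoint may take a moment to respond, so a single `curl` can fail spuriously. The sketch below polls the health endpoint until it answers. `wait_for_healthy` is a hypothetical helper, and it assumes the response body is exactly `ok` without quotes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll the health endpoint until it answers "ok"
# or the attempts run out. Arguments: URL, attempts, delay in seconds.
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-2}"
  local i body
  for (( i = 1; i <= attempts; i++ )); do
    # -f: fail on HTTP errors, -sS: silent except for real errors
    body="$(curl -fsS --max-time 5 "$url" 2>/dev/null)" || body=""
    if [ "$body" = "ok" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not healthy after $attempts attempt(s)" >&2
  return 1
}

# Example:
# wait_for_healthy http://localhost:9090/healthz 10 2
```

The non-zero exit status on failure makes the helper usable in deployment scripts, for example right after `systemctl restart`.
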
### Check Task Queue

```bash
# Quote the URL so the shell does not interpret the "?"
curl "http://localhost:9090/api/v1/tasks?status=running"
```

### Check Agents

```bash
curl "http://localhost:9090/api/v1/agents?status=online"
```

### Log Monitoring

```bash
# Follow logs
sudo journalctl -u agent-fleet -f

# Search for errors
sudo journalctl -u agent-fleet | grep -i error
```

## Updates

### Updating the Binary

```bash
# Build new version locally
cargo zigbuild --target aarch64-unknown-linux-gnu --release

# Transfer to server
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/tmp/

# On the server, stop and replace
sudo systemctl stop agent-fleet
sudo cp /tmp/agent-fleet /opt/agent-fleet/bin/
sudo chmod +x /opt/agent-fleet/bin/agent-fleet
sudo systemctl start agent-fleet
```

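Before replacing the running binary, you may want to confirm that the upload arrived intact. The sketch below compares SHA-256 digests. `file_sha256` and `same_sha256` are hypothetical helpers built on `sha256sum` from GNU coreutils.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print only the SHA-256 hex digest of a file.
# Uses `sha256sum` from GNU coreutils.
file_sha256() {
  sha256sum -- "$1" | awk '{print $1}'
}

# Succeeds when both files exist and have identical contents.
same_sha256() {
  local a b
  a="$(file_sha256 "$1")"
  b="$(file_sha256 "$2")"
  # An empty digest means the file could not be read.
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example: compare the local build with the uploaded copy.
# local_sum="$(file_sha256 target/aarch64-unknown-linux-gnu/release/agent-fleet)"
# remote_sum="$(ssh user@host sha256sum /tmp/agent-fleet | awk '{print $1}')"
# [ "$local_sum" = "$remote_sum" ] && echo "checksums match"
```
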
### Minimal-Downtime Updates

For production deployments with minimal downtime:

```bash
# Upload new binary alongside old one
scp target/aarch64-unknown-linux-gnu/release/agent-fleet user@host:/opt/agent-fleet/bin/agent-fleet.new

# On server
sudo mv /opt/agent-fleet/bin/agent-fleet.new /opt/agent-fleet/bin/agent-fleet
sudo systemctl restart agent-fleet
```

The upload happens while the old binary is still running, and `mv` within one filesystem is an atomic rename, so the service is unavailable only for the duration of the restart.