# Agent Fleet — Agent Onboarding Guide

`FLEET_API_URL:PORT` means the address of your Agent Fleet Orchestrator (for example, `100.102.101.43:9090`). If you do not know it, ask your user for the Fleet API address before using these examples.

This guide explains how to integrate an agent with the Agent Fleet Orchestrator.

---

## Execution Modes

Agent Fleet supports two execution modes. The mode is set per task at creation time and defaults to `ssh_cli`.

| Aspect | `ssh_cli` | `http_pull` |
|--------|-----------|-------------|
| Who initiates? | Orchestrator (via SSH or local subprocess) | Agent (via HTTP API) |
| Control flow | Orchestrator builds prompt, runs CLI, collects output | Agent decides when to dequeue and execute |
| Agent requirements | CLI binary on a configured host | HTTP client that can call the REST API |
| Auth needed? | No (Orchestrator manages) | Yes (Bearer token) |
| Best for | Agents with CLIs: Codex CLI, Claude Code, OpenCode | Agents with their own schedulers: OpenClaw/Jeeves, Hermes |
| Task creation trigger | Forgejo Issue webhook (default) | Same, or API call |

---

## How to Choose an Execution Mode

Start by asking: **"What is the address of the Agent Fleet Orchestrator?"**
If you do not know the address, ask your user.

Then decide based on connectivity and control model:

1. **Use `ssh_cli`** if:
   - You have a CLI binary that can run non-interactively
   - The Fleet Orchestrator can SSH into your machine or server
   - You want the Orchestrator to initiate execution for each task

2. **Use `http_pull`** if:
   - The Fleet Orchestrator cannot SSH into your machine
   - You are behind NAT, a firewall, or on a network the Orchestrator cannot reach
   - You already have your own scheduler, worker loop, or bot runtime
   - You want your agent to decide when to pull work

3. **If unsure**, ask your user:
   - **"Can the Fleet Orchestrator SSH into this machine?"**
     - If yes → `ssh_cli`
     - If no → `http_pull`

---

### Self-Determination Decision Flow

As an agent, you can self-determine your mode by answering these questions:

```
1. Do I have a CLI binary that can execute tasks non-interactively?
   └─ NO → Use http_pull
   └─ YES → Continue

2. Can the Fleet Orchestrator SSH into my machine?
   └─ NO → Use http_pull
   └─ YES → Continue

3. Do I have my own scheduler, worker loop, or bot runtime?
   └─ YES → Use http_pull (you prefer to control your own task flow)
   └─ NO → Continue

4. Do I want the Orchestrator to initiate execution remotely?
   └─ YES → Use ssh_cli
   └─ NO → Use http_pull
```

**Quick reference:** If you have a non-interactive CLI, can be reached via SSH, and don't have a custom scheduler, use `ssh_cli`. Otherwise, use `http_pull`.

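The four questions above can be written as a small function. This is a minimal sketch for agents that want to make the choice in code; the name `choose_execution_mode` and its parameter names are illustrative and are not part of Agent Fleet.

```python
def choose_execution_mode(
    has_cli: bool,
    ssh_reachable: bool,
    has_own_scheduler: bool,
    wants_remote_launch: bool,
) -> str:
    """Answer the four questions in order and return "ssh_cli" or "http_pull"."""
    if not has_cli:          # Question 1: no non-interactive CLI
        return "http_pull"
    if not ssh_reachable:    # Question 2: Orchestrator cannot SSH in
        return "http_pull"
    if has_own_scheduler:    # Question 3: agent controls its own task flow
        return "http_pull"
    # Question 4: should the Orchestrator launch the process remotely?
    return "ssh_cli" if wants_remote_launch else "http_pull"
```

For example, `choose_execution_mode(True, True, False, True)` returns `"ssh_cli"`, while any agent that is not reachable over SSH gets `"http_pull"` regardless of the other answers.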
### Decision Table

| Question | If Yes | If No |
|----------|--------|-------|
| Do you have a CLI binary that can execute tasks non-interactively? | Continue evaluating for `ssh_cli` | Use `http_pull` |
| Can the Fleet Orchestrator SSH into this machine? | Use `ssh_cli` | Use `http_pull` |
| Is this agent behind NAT, a firewall, or otherwise unreachable from the Orchestrator? | Use `http_pull` | Continue evaluating |
| Does the agent already run its own scheduler or task loop? | Use `http_pull` | Either mode may fit |
| Do you want the Orchestrator to launch the agent process remotely? | Use `ssh_cli` | Use `http_pull` |

### Common Scenarios

| Scenario | Recommended Mode | Why |
|----------|------------------|-----|
| Codex / Claude Code / OpenCode on a reachable server | `ssh_cli` | Fleet can SSH in and run the CLI directly |
| OpenClaw / Hermes Agent / bot framework | `http_pull` | The agent already has a runtime and should pull work itself |
| Agent running on a laptop behind NAT | `http_pull` | Fleet cannot reach it reliably over SSH |
| Shared VM with a well-known SSH host and installed CLI | `ssh_cli` | Centralized orchestration is simpler |

### Simple Rule of Thumb

- If the Fleet server can **reach you**, `ssh_cli` is usually simpler.
- If you must **reach the Fleet server**, use `http_pull`.

---

## ssh_cli Workflow

### 1. Configure a Host

Add a `[[hosts]]` section to `config.toml` on the Orchestrator:

```toml
[[hosts]]
host_id = "host-worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/deploy/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
  { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust", "code:python"] },
]
```

For local execution (same machine as the Orchestrator), use `hostname = "localhost"`; the Orchestrator then uses a local subprocess instead of SSH.

### 2. Install the Agent CLI

The CLI binary must be available on the target host in `$PATH`. The Orchestrator checks availability with `which <binary>`.

Built-in CLI templates:

| Agent Type | CLI Command |
|------------|-------------|
| `codex-cli` | `codex exec --json '{prompt}'` |
| `claude-code` | `claude -p '{prompt}' --output-format json --dangerously-skip-permissions` |

Custom templates can be defined in `config.toml` under `[adapters]`.

### 3. Orchestrator Handles Everything

When a Forgejo Issue with an `agent:*` label arrives:

1. The Orchestrator creates a task (`execution_mode = ssh_cli`)
2. The dispatch loop picks the task and selects a host by capability and load
3. SSH (or a local subprocess) executes the CLI with a structured prompt
4. The output is parsed (Codex JSON or Claude JSON format)
5. The task status updates: `created` → `assigned` → `running` → `completed` (or `failed`)

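This guide does not spell out how capability and load are combined in step 2. One plausible reading, sketched below, is to keep only the agent entries whose capabilities cover everything the task requires and then prefer the entry with the most free slots. `AgentSlot`, `running`, and `select_slot` are hypothetical names; only `host_id`, `agent_type`, `max_concurrency`, and `capabilities` come from the host configuration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentSlot:
    host_id: str
    agent_type: str
    max_concurrency: int
    capabilities: set[str] = field(default_factory=set)
    running: int = 0  # hypothetical counter: tasks currently executing here

    @property
    def free(self) -> int:
        return self.max_concurrency - self.running


def select_slot(slots: list[AgentSlot], required: set[str]) -> AgentSlot | None:
    """Return the capable slot with the most free capacity, or None if none fits."""
    candidates = [s for s in slots if required <= s.capabilities and s.free > 0]
    return max(candidates, key=lambda s: s.free, default=None)
```

The real dispatch loop may weigh hosts differently, so treat this only as a way to reason about why a task was or was not assigned.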
### 4. What the Agent Receives (Structured Prompt)

The Orchestrator constructs this prompt and passes it as the `{prompt}` variable:

```
Task ID: org/repo#42
Type: code
Goal:
Implement the feature described in the issue body

Constraints:
- Execution mode: ssh_cli
- Labels: code:rust
- Branch: task/org%2Frepo%2342
- Expected output: JSON receipt

Validation:
- Run relevant tests if code changed
- Summarize changes and artifacts
```

### 5. Expected CLI Output

The CLI must write JSON to stdout. The format depends on the parser:

**Codex JSON:**
```json
{"status": "completed", "summary": "done", "duration_seconds": 120, "artifacts": [{"artifact_type": "pr", "url": "https://..."}]}
```

**Claude JSON:**
```json
{"status": "completed", "summary": "done", "duration_seconds": 95, "error": null}
```

If the output is not valid JSON, the task is marked `failed`.

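When wrapping a custom CLI, it can help to check its output against this rule before wiring it into the Orchestrator. The sketch below is an illustration, not the Orchestrator's parser: `parse_cli_output`, the shape of the failure record, and the decision to treat non-object JSON as a failure are all assumptions. The field names are taken from the two examples above.

```python
import json


def parse_cli_output(stdout: str) -> dict:
    """Parse a CLI's stdout into a receipt-like dict.

    Output that is not a JSON object yields a failed result.
    """
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return {"status": "failed", "summary": "", "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(data, dict):  # assumption: only JSON objects are accepted
        return {"status": "failed", "summary": "", "error": "output is not a JSON object"}
    # Fill the optional fields so both formats look alike to the caller.
    data.setdefault("artifacts", [])
    data.setdefault("error", None)
    return data
```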
---

## http_pull Workflow

### 1. Register

```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/register \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "worker-03", "agent_type": "openclaw", "hostname": "arm0", "capabilities": ["code:rust"], "max_concurrency": 2}'
```

The response contains a `registry_token`. Keep it for subsequent API calls (if `http_pull_token` is configured, use that shared token instead).

Recommended immediately after registration:
- Persist `FLEET_API_URL`, your `agent_id`, and the returned `registry_token`
- Start the heartbeat loop before your first dequeue request

### 2. Heartbeat (periodic)

The heartbeat must run as a background loop, not as a one-shot call.

- Default heartbeat interval: every 60 seconds
- Recommended behavior: start the loop immediately after registration, before the first dequeue
- If the Orchestrator does not receive a heartbeat within `heartbeat_interval_secs × heartbeat_timeout_threshold` (default: 180 seconds), the agent is marked offline
- When an agent is marked offline, its assigned tasks are requeued
- The heartbeat loop should run for the entire lifetime of the agent

```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/heartbeat \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "worker-03"}'
```

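As a worked example of the timeout rule: with the default 60-second interval, the 180-second default implies a threshold of 3 missed intervals. That threshold is inferred from the two defaults rather than stated in this guide. The sketch below also shows one way to run the heartbeat as a background loop; `start_heartbeat` and `offline_after_secs` are illustrative names, and `send_heartbeat` stands for whatever function performs the HTTP call shown above.

```python
import threading

HEARTBEAT_INTERVAL_SECS = 60     # default interval from this guide
HEARTBEAT_TIMEOUT_THRESHOLD = 3  # assumption: inferred from 180 s / 60 s


def offline_after_secs(interval=HEARTBEAT_INTERVAL_SECS,
                       threshold=HEARTBEAT_TIMEOUT_THRESHOLD):
    """Seconds of silence after which the agent is marked offline."""
    return interval * threshold


def start_heartbeat(send_heartbeat, interval_secs=HEARTBEAT_INTERVAL_SECS):
    """Call send_heartbeat() now and then once per interval in a daemon thread.

    Returns a threading.Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            try:
                send_heartbeat()
            except Exception as exc:  # keep the loop alive across transient errors
                print(f"heartbeat failed: {exc}")
            stop.wait(interval_secs)  # returns early once stop is set

    threading.Thread(target=loop, name="fleet-heartbeat", daemon=True).start()
    return stop
```

Starting the thread right after registration and stopping it only at shutdown matches the recommendations in the list above.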
### 3. Dequeue a Task

```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/dequeue \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{"agent_id": "worker-03", "capabilities": ["code:rust"]}'
```

Returns `200 OK` with a Task object, or `204 No Content` if no task is available.

Only tasks with `execution_mode = http_pull` are returned.

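For agents written in Python, the same request can be made with the standard library alone. This is a minimal sketch: `build_dequeue_request` and `dequeue` are hypothetical helper names, `base_url` stands for `http://FLEET_API_URL:PORT`, and `token` is the `registry_token` (or the shared `http_pull_token`) described under registration.

```python
from __future__ import annotations

import json
import urllib.request


def build_dequeue_request(base_url: str, token: str, agent_id: str,
                          capabilities: list[str]) -> urllib.request.Request:
    """Build the POST request without sending it, so it can be inspected or logged."""
    body = json.dumps({"agent_id": agent_id, "capabilities": capabilities}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/tasks/dequeue",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def dequeue(base_url: str, token: str, agent_id: str,
            capabilities: list[str], timeout: float = 10.0) -> dict | None:
    """Return the Task object, or None when the server answers 204 No Content."""
    request = build_dequeue_request(base_url, token, agent_id, capabilities)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status == 204:
            return None
        return json.loads(response.read().decode("utf-8"))
```

A worker loop would call `dequeue(...)` repeatedly, sleep briefly whenever it returns `None`, and otherwise execute the returned task.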
### 4. Update Status While Working

```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/org%2Frepo%2342/status \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{"status": "running"}'
```

### 5. Complete the Task

```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/org%2Frepo%2342/complete \
  -H 'Content-Type: application/json' \
  -d '{
    "task_id": "org/repo#42",
    "agent_id": "worker-03",
    "status": "completed",
    "duration_seconds": 180,
    "summary": "Fixed the issue",
    "artifacts": [{"artifact_type": "pr", "url": "https://git.example/org/repo/pulls/15"}],
    "error": null
  }'
```

Alternatively, post the same receipt body to the receipts endpoint:

```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/receipts \
  -H 'Content-Type: application/json' \
  -d '<same receipt body>'
```

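The receipt body can also be assembled in code instead of by hand. The sketch below assumes the fields in the example above are the complete set. `build_receipt` is an illustrative name, and deriving `"failed"` from a non-empty `error` is my assumption rather than documented behavior.

```python
from __future__ import annotations

import json


def build_receipt(task_id: str, agent_id: str, duration_seconds: int, summary: str,
                  artifacts: list[dict] | None = None, error: str | None = None) -> str:
    """Serialize a receipt with the same fields as the curl example."""
    receipt = {
        "task_id": task_id,
        "agent_id": agent_id,
        # Assumption: a receipt that carries an error reports status "failed".
        "status": "failed" if error else "completed",
        "duration_seconds": duration_seconds,
        "summary": summary,
        "artifacts": artifacts or [],
        "error": error,
    }
    return json.dumps(receipt)
```

The returned string can be sent as the request body to either the `/complete` endpoint or `/api/v1/receipts`.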
### 6. Deregister When Done

```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/deregister \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "worker-03"}'
```

---

## Forgejo Integration

### How Issues Become Tasks

1. A Forgejo Issue is opened with a label matching `agent:*` (e.g. `agent:code`)
2. Forgejo sends an `issues` webhook to `POST /api/v1/webhooks/forgejo`
3. The `agent:*` label value becomes `task_type` (e.g. `code`)
4. Priority is inferred from labels: `priority:urgent`, `priority:high`, `priority:low` (default: `normal`)
5. A task is created with:
   - `task_id` = `{repo_full_name}#{issue_number}` (e.g. `org/repo#42`)
   - `execution_mode` = `ssh_cli` (default for Forgejo-originated tasks)
   - `branch_name` = `task/{url_encoded_task_id}` (e.g. `task/org%2Frepo%2342`)
   - `pr_title` = `feat: {issue_title} (#{issue_number})`

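The mapping in the steps above can be sketched as a pure function, which is handy for predicting what task a given issue will produce. This is not the Orchestrator's code: `task_fields_from_issue` is a hypothetical name, labels are assumed to be plain strings, and "the first matching label wins" is an assumption for issues that carry several `agent:*` or `priority:*` labels.

```python
from __future__ import annotations

from urllib.parse import quote

KNOWN_PRIORITIES = ("urgent", "high", "low")


def task_fields_from_issue(repo_full_name: str, issue_number: int,
                           issue_title: str, labels: list[str]) -> dict | None:
    """Derive task fields from an issue, or None if it has no agent:* label."""
    task_type = next(
        (label[len("agent:"):] for label in labels
         if label.startswith("agent:") and len(label) > len("agent:")),
        None,
    )
    if task_type is None:
        return None  # issues without an agent:* label are ignored
    priority = next(
        (label[len("priority:"):] for label in labels
         if label.startswith("priority:") and label[len("priority:"):] in KNOWN_PRIORITIES),
        "normal",
    )
    task_id = f"{repo_full_name}#{issue_number}"
    return {
        "task_id": task_id,
        "task_type": task_type,
        "priority": priority,
        "execution_mode": "ssh_cli",  # default for Forgejo-originated tasks
        "branch_name": "task/" + quote(task_id, safe=""),
        "pr_title": f"feat: {issue_title} (#{issue_number})",
    }
```

Passing `safe=""` to `quote` matters here: by default `quote` leaves `/` unescaped, which would not produce the `%2F` shown in the example branch name.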
### Branch Naming Convention

- Branch: `task/{url_encoded_task_id}`
- Example: task `org/repo#42` → branch `task/org%2Frepo%2342`

### PR Lifecycle

| Event | Effect |
|-------|--------|
| PR opened (branch = `task/*`) | Task → `review_pending` |
| PR merged | Task → `completed`, auto receipt generated |
| Push to `task/*` branch | Task `last_activity_at` updated |

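Because the branch name encodes the task ID, a PR or push event on a `task/*` branch can be traced back to its task by reversing the encoding. A small sketch, assuming plain percent-decoding is all that is needed; `task_id_from_branch` is an illustrative name.

```python
from __future__ import annotations

from urllib.parse import unquote

BRANCH_PREFIX = "task/"


def task_id_from_branch(branch: str) -> str | None:
    """Return the task ID encoded in a task/* branch, or None for other branches."""
    if not branch.startswith(BRANCH_PREFIX):
        return None
    return unquote(branch[len(BRANCH_PREFIX):])
```

With this helper, `task_id_from_branch("task/org%2Frepo%2342")` yields `org/repo#42`, and a branch such as `main` yields `None`.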
### Task Status Flow

```
created → assigned → running → review_pending → completed
   ↓          ↓         ↓             ↓             ↓
cancelled cancelled  failed        failed       cancelled
                        ↓
                     (retry) → assigned
```

**Notes:**
- `failed` and `agent_lost` tasks can be retried via `POST /api/v1/tasks/{task_id}/retry` (transitions to `assigned`)
- Retry is limited by `max_retries` (default: 2)
- `agent_lost` is set internally by the heartbeat checker when an agent times out
- `review_pending` can transition back to `assigned`, `running`, `failed`, or `completed`

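The retry rules in the notes reduce to a short check. This is a sketch under the assumption that retries are allowed while `retry_count < max_retries`, which is the condition the FAQ gives for automatic retries; `can_retry` is an illustrative name.

```python
RETRYABLE_STATUSES = {"failed", "agent_lost"}
DEFAULT_MAX_RETRIES = 2  # default stated in the notes above


def can_retry(status: str, retry_count: int, max_retries: int = DEFAULT_MAX_RETRIES) -> bool:
    """True if a task in this status may still transition back to `assigned`."""
    return status in RETRYABLE_STATUSES and retry_count < max_retries
```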
---

## Structured Prompt Format (ssh_cli)

When the Orchestrator executes an agent via SSH, it constructs a structured prompt:

```
Task ID: <task_id>
Type: <task_type>
Goal:
<requirements>

Constraints:
- Execution mode: ssh_cli
- Labels: <comma-separated labels or <none>>
- Branch: <branch_name>
- Expected output: JSON receipt

Validation:
- Run relevant tests if code changed
- Summarize changes and artifacts
```

The prompt is injected into the CLI template as the `{prompt}` variable. Other available variables: `{work_dir}`, `{task_id}`, `{branch}`.

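To test a custom CLI locally against a realistic prompt, the format above can be reproduced with a few lines of code. This is a sketch that mirrors the template as documented here; `build_prompt` is a hypothetical helper, and the Orchestrator's exact whitespace may differ.

```python
from __future__ import annotations


def build_prompt(task_id: str, task_type: str, requirements: str,
                 labels: list[str], branch_name: str) -> str:
    """Render the structured prompt for an ssh_cli task."""
    label_text = ", ".join(labels) if labels else "<none>"
    return "\n".join([
        f"Task ID: {task_id}",
        f"Type: {task_type}",
        "Goal:",
        requirements,
        "",
        "Constraints:",
        "- Execution mode: ssh_cli",
        f"- Labels: {label_text}",
        f"- Branch: {branch_name}",
        "- Expected output: JSON receipt",
        "",
        "Validation:",
        "- Run relevant tests if code changed",
        "- Summarize changes and artifacts",
    ])
```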
---

## FAQ

**Q: How do I know which execution mode to use?**
A: First determine the Fleet API address, then ask whether the Orchestrator can SSH into the machine. If it can, `ssh_cli` is usually the best fit. If it cannot, use `http_pull`.

**Q: Do I need to register for ssh_cli mode?**
A: No. The Orchestrator manages `ssh_cli` tasks entirely. Registration is only for `http_pull` agents.

**Q: What happens if my agent crashes during ssh_cli execution?**
A: The task is marked `failed`. If `retry_count < max_retries`, the dispatch loop will retry automatically.

**Q: What happens if my http_pull agent stops sending heartbeats?**
A: After `heartbeat_interval_secs × heartbeat_timeout_threshold` seconds, the agent is marked offline and all its tasks are requeued with status `created`.

**Q: Can a task switch between execution modes?**
A: No. The `execution_mode` is set at creation time and cannot be changed.

**Q: How do I create a task manually?**
A: Use the Forgejo webhook flow (open an Issue with an `agent:*` label), or insert the task directly into the database. There is no public "create task" API endpoint.

**Q: What label format triggers task creation?**
A: Issues must have a label starting with `agent:` (e.g. `agent:code`, `agent:review`). The value after `agent:` becomes the task type. Issues without such a label are ignored.

**Q: How does the review loop work?**
A: When a PR is opened (not merged), the task goes to `review_pending`. If the PR is not merged and the review cycle count exceeds `max_retries`, the task is marked `failed`. For `ssh_cli`, the Orchestrator re-dispatches automatically.