# Agent Fleet — Agent Onboarding Guide
`FLEET_API_URL:PORT` means the address of your Agent Fleet Orchestrator (for example, `100.102.101.43:9090`). If you do not know it, ask your user for the Fleet API address before using these examples.
This guide explains how to integrate an agent with the Agent Fleet Orchestrator.
---
## Execution Modes
Agent Fleet supports two execution modes. The mode is set per-task at creation time (defaults to `ssh_cli`).
| Aspect | `ssh_cli` | `http_pull` |
|--------|-----------|-------------|
| Who initiates? | Orchestrator (via SSH or local subprocess) | Agent (via HTTP API) |
| Control flow | Orchestrator builds prompt, runs CLI, collects output | Agent decides when to dequeue and execute |
| Agent requirements | CLI binary on a configured host | HTTP client, can call REST API |
| Auth needed? | No (Orchestrator manages) | Yes (Bearer token) |
| Best for | Codex CLI, Claude Code, OpenCode — agents with CLIs | OpenClaw/Jeeves, Hermes — agents with their own schedulers |
| Task creation trigger | Forgejo Issue webhook (default) | Same, or API call |
---
## How to Choose an Execution Mode
Start by asking: **"What is the address of the Agent Fleet Orchestrator?"**
If you do not know the address, ask your user.
Then decide based on connectivity and control model:
1. **Use `ssh_cli`** if:
   - You have a CLI binary that can run non-interactively
   - The Fleet orchestrator can SSH into your machine or server
   - You want the Orchestrator to initiate execution for each task
2. **Use `http_pull`** if:
   - The Fleet orchestrator cannot SSH into your machine
   - You are behind NAT, a firewall, or on a network the Orchestrator cannot reach
   - You already have your own scheduler, worker loop, or bot runtime
   - You want your agent to decide when to pull work
3. **If unsure**, ask your user:
   - **"Can the Fleet orchestrator SSH into this machine?"**
     - If yes → `ssh_cli`
     - If no → `http_pull`
---
### Self-Determination Decision Flow
As an agent, you can self-determine your mode by answering these questions:
```
1. Do I have a CLI binary that can execute tasks non-interactively?
   └─ NO → Use http_pull
   └─ YES → Continue
2. Can the Fleet orchestrator SSH into my machine?
   └─ NO → Use http_pull
   └─ YES → Continue
3. Do I have my own scheduler, worker loop, or bot runtime?
   └─ YES → Use http_pull (you prefer to control your own task flow)
   └─ NO → Continue
4. Do I want the Orchestrator to initiate execution remotely?
   └─ YES → Use ssh_cli
   └─ NO → Use http_pull
```
**Quick reference:** If you can be reached via SSH and don't have a custom scheduler, use `ssh_cli`. Otherwise, use `http_pull`.
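The four questions above can be encoded as a small function. This is a minimal sketch, not part of Agent Fleet; the function name and parameter names are hypothetical and simply mirror the questions in order.

```python
def choose_execution_mode(
    has_noninteractive_cli: bool,
    orchestrator_can_ssh: bool,
    has_own_scheduler: bool,
    wants_remote_launch: bool,
) -> str:
    """Walk the four questions in order and return "ssh_cli" or "http_pull"."""
    if not has_noninteractive_cli:
        return "http_pull"  # question 1: no CLI to run
    if not orchestrator_can_ssh:
        return "http_pull"  # question 2: Orchestrator cannot reach this machine
    if has_own_scheduler:
        return "http_pull"  # question 3: the agent controls its own task flow
    # question 4: only remaining case where ssh_cli applies
    return "ssh_cli" if wants_remote_launch else "http_pull"
```

For example, a reachable server with a CLI and no scheduler of its own resolves to `ssh_cli`, while a laptop behind NAT resolves to `http_pull` at question 2.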
### Decision Table
| Question | If Yes | If No |
|----------|--------|-------|
| Do you have a CLI binary that can execute tasks non-interactively? | Continue evaluating for `ssh_cli` | Use `http_pull` |
| Can the Fleet orchestrator SSH into this machine? | Use `ssh_cli` | Use `http_pull` |
| Is this agent behind NAT, a firewall, or otherwise unreachable from the Orchestrator? | Use `http_pull` | Continue evaluating |
| Does the agent already run its own scheduler or task loop? | Use `http_pull` | Either mode may fit |
| Do you want the Orchestrator to launch the agent process remotely? | Use `ssh_cli` | Use `http_pull` |
### Common Scenarios
| Scenario | Recommended Mode | Why |
|----------|------------------|-----|
| Codex / Claude Code / OpenCode on a reachable server | `ssh_cli` | Fleet can SSH in and run the CLI directly |
| OpenClaw / Hermes Agent / bot framework | `http_pull` | The agent already has a runtime and should pull work itself |
| Agent running on a laptop behind NAT | `http_pull` | Fleet cannot reach it reliably over SSH |
| Shared VM with a well-known SSH host and installed CLI | `ssh_cli` | Centralized orchestration is simpler |
### Simple Rule of Thumb
- If the Fleet server can **reach you**, `ssh_cli` is usually simpler.
- If you must **reach the Fleet server**, use `http_pull`.
---
## ssh_cli Workflow
### 1. Configure a Host
Add a `[[hosts]]` section to `config.toml` on the Orchestrator:
```toml
[[hosts]]
host_id = "host-worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/deploy/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
  { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust", "code:python"] },
]
```
For local execution (same machine as Orchestrator), use `hostname = "localhost"` — the Orchestrator uses a local subprocess instead of SSH.
### 2. Install the Agent CLI
The CLI binary must be available on the target host in `$PATH`. The Orchestrator checks availability with `which <binary>`.
Built-in CLI templates:
| Agent Type | CLI Command |
|------------|-------------|
| `codex-cli` | `codex exec --json '{prompt}'` |
| `claude-code` | `claude -p '{prompt}' --output-format json --dangerously-skip-permissions` |
Custom templates can be defined in `config.toml` under `[adapters]`.
### 3. Orchestrator Handles Everything
When a Forgejo Issue with an `agent:*` label arrives:
1. Orchestrator creates a task (`execution_mode = ssh_cli`)
2. Dispatch loop picks the task, selects a host by capability + load
3. SSH (or local subprocess) executes the CLI with a structured prompt
4. Output is parsed (Codex JSON or Claude JSON format)
5. Task status updates: `created` → `assigned` → `running` → `completed` (or `failed`)
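Step 2 selects a host "by capability + load". The guide does not say how load is measured, so the following is only a sketch under stated assumptions: the `AgentSlot` type, the `running` counter, and the load ratio (`running / max_concurrency`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSlot:
    host_id: str
    agent_type: str
    max_concurrency: int
    capabilities: set = field(default_factory=set)
    running: int = 0  # hypothetical: tasks currently executing on this slot


def select_slot(slots, required):
    """Return the least-loaded slot that covers every required capability, or None."""
    required = set(required)
    candidates = [
        slot
        for slot in slots
        if required <= slot.capabilities and slot.running < slot.max_concurrency
    ]
    if not candidates:
        return None
    # Assumed load metric: fraction of the slot's concurrency already in use.
    return min(candidates, key=lambda slot: slot.running / slot.max_concurrency)
```

A slot that is already at `max_concurrency` is skipped, so a task stays queued when no capable slot has free capacity.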
### 4. What the Agent Receives (Structured Prompt)
The Orchestrator constructs this prompt and passes it as the `{prompt}` variable:
```
Task ID: org/repo#42
Type: code
Goal:
Implement the feature described in the issue body
Constraints:
- Execution mode: ssh_cli
- Labels: code:rust
- Branch: task/org%2Frepo%2342
- Expected output: JSON receipt
Validation:
- Run relevant tests if code changed
- Summarize changes and artifacts
```
### 5. Expected CLI Output
The CLI must output JSON to stdout. The format depends on the parser:
**Codex JSON:**
```json
{"status": "completed", "summary": "done", "duration_seconds": 120, "artifacts": [{"artifact_type": "pr", "url": "https://..."}]}
```
**Claude JSON:**
```json
{"status": "completed", "summary": "done", "duration_seconds": 95, "error": null}
```
If output is not valid JSON, the task is marked `failed`.
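The rule above (invalid JSON means `failed`) can be sketched as follows. This is an illustration only: the function name is hypothetical, and treating a non-object JSON value as a failure is an assumption that the guide does not state.

```python
import json


def parse_cli_output(stdout: str) -> dict:
    """Parse a CLI receipt; fall back to a failed receipt when stdout is not usable JSON."""
    try:
        receipt = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return {"status": "failed", "summary": "", "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(receipt, dict):
        # Assumption: a receipt must be a JSON object, as in both examples above.
        return {"status": "failed", "summary": "", "error": "receipt is not a JSON object"}
    return receipt
```

In practice this means a CLI wrapper should print nothing but the receipt to stdout and send logs to stderr.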
---
## http_pull Workflow
### 1. Register
```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/register \
-H 'Content-Type: application/json' \
-d '{"agent_id": "worker-03", "agent_type": "openclaw", "hostname": "arm0", "capabilities": ["code:rust"], "max_concurrency": 2}'
```
The response contains a `registry_token`. Keep it for subsequent API calls. If `http_pull_token` is configured, use that shared token instead.
Recommended immediately after registration:
- Persist `FLEET_API_URL`, your `agent_id`, and the returned `registry_token`
- Start the heartbeat loop before your first dequeue request
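For agents written in Python, the registration call might look like the sketch below, using only the standard library. The function names are hypothetical, and reading `registry_token` as a top-level key of the JSON response is an assumption; the guide only says the response contains it.

```python
import json
import urllib.request


def build_register_request(base_url, agent_id, agent_type, hostname, capabilities, max_concurrency):
    """Build (but do not send) the POST request for /api/v1/agents/register."""
    body = {
        "agent_id": agent_id,
        "agent_type": agent_type,
        "hostname": hostname,
        "capabilities": list(capabilities),
        "max_concurrency": max_concurrency,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/agents/register",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register(request) -> str:
    """Send the request and return the token (assumed to be a top-level JSON key)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["registry_token"]
```

Here `base_url` stands for `http://FLEET_API_URL:PORT`. Persist the returned token together with the base URL and `agent_id` before starting the heartbeat loop.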
### 2. Heartbeat (periodic)
Heartbeat must be a background loop, not a one-shot call.
- Default heartbeat interval: every 60 seconds
- Recommended behavior: start the loop immediately after registration, before the first dequeue
- If the Orchestrator does not receive a heartbeat within `heartbeat_interval_secs × heartbeat_timeout_threshold` (default: 180 seconds), the agent is marked offline
- When an agent is marked offline, its assigned tasks are requeued
- The heartbeat loop should run for the entire lifetime of the agent
```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/heartbeat \
-H 'Content-Type: application/json' \
-d '{"agent_id": "worker-03"}'
```
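A background heartbeat loop could be sketched like this. The `HeartbeatLoop` class and the `send` callback are hypothetical; `send` is expected to perform the POST shown above. The threshold default of 3 is inferred from the stated defaults (60 seconds × 3 = 180 seconds) and is an assumption.

```python
import threading


def offline_after_secs(heartbeat_interval_secs: int = 60, heartbeat_timeout_threshold: int = 3) -> int:
    """Seconds without a heartbeat before the agent is marked offline."""
    return heartbeat_interval_secs * heartbeat_timeout_threshold


class HeartbeatLoop:
    """Call `send` immediately, then once per interval, until stop() is called."""

    def __init__(self, send, interval_secs: float = 60.0):
        self._send = send
        self._interval = interval_secs
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self._send()
            except Exception as exc:
                # One failed heartbeat should not kill the loop; the next tick retries.
                print(f"heartbeat failed: {exc}")
            # wait() returns early when stop() is called, so shutdown is prompt.
            self._stop.wait(self._interval)
```

Start the loop right after registration and call `stop()` only when the agent shuts down, ideally just before deregistering.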
### 3. Dequeue a Task
```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/dequeue \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <token>' \
-d '{"agent_id": "worker-03", "capabilities": ["code:rust"]}'
```
Returns `200 OK` with a Task object, or `204 No Content` if no task is available.
Only tasks with `execution_mode = http_pull` are returned.
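The two documented outcomes (`200` with a task, `204` with nothing) can be handled as in this sketch. Function names are hypothetical, and raising on any other status code is an assumption about how an agent might want to react.

```python
import json
import urllib.request


def build_dequeue_request(base_url, token, agent_id, capabilities):
    """Build (but do not send) the POST request for /api/v1/tasks/dequeue."""
    body = {"agent_id": agent_id, "capabilities": list(capabilities)}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/tasks/dequeue",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def parse_dequeue_response(status: int, body: bytes):
    """Return the task dict for 200, None for 204, and raise for anything else."""
    if status == 204:
        return None
    if status == 200:
        return json.loads(body)
    raise RuntimeError(f"unexpected dequeue status {status}")


def dequeue(request):
    """Send the request; returns a task dict or None when the queue is empty."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_dequeue_response(response.status, response.read())
```

A worker loop would typically sleep for a short interval when `dequeue` returns `None` before asking again.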
### 4. Update Status While Working
```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/org%2Frepo%2342/status \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <token>' \
-d '{"status": "running"}'
```
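Note that the task ID in the path is percent-encoded (`org/repo#42` becomes `org%2Frepo%2342`). A sketch of building such URLs in Python follows; the helper names are hypothetical, while `urllib.parse.quote` with `safe=""` is the standard-library call that also encodes `/`.

```python
import json
import urllib.parse
import urllib.request


def task_url(base_url: str, task_id: str, action: str) -> str:
    """Build /api/v1/tasks/{encoded_task_id}/{action} with '/' and '#' percent-encoded."""
    encoded = urllib.parse.quote(task_id, safe="")
    return f"{base_url}/api/v1/tasks/{encoded}/{action}"


def build_status_request(base_url, token, task_id, status):
    """Build (but do not send) the POST request that reports a status change."""
    return urllib.request.Request(
        url=task_url(base_url, task_id, "status"),
        data=json.dumps({"status": status}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Leaving `safe` at its default would keep `/` unescaped and produce a path with an extra segment, so `safe=""` matters here.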
### 5. Complete the Task
```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/org%2Frepo%2342/complete \
-H 'Content-Type: application/json' \
-d '{
"task_id": "org/repo#42",
"agent_id": "worker-03",
"status": "completed",
"duration_seconds": 180,
"summary": "Fixed the issue",
"artifacts": [{"artifact_type": "pr", "url": "https://git.example/org/repo/pulls/15"}],
"error": null
}'
```
Or use the receipts endpoint:
```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/receipts \
-H 'Content-Type: application/json' \
-d '<same receipt body>'
```
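The receipt body used by both endpoints can be modelled as a small data class. The field names below come from the example above; the `Receipt` and `Artifact` class names and the `to_json` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Artifact:
    artifact_type: str
    url: str


@dataclass
class Receipt:
    task_id: str
    agent_id: str
    status: str
    duration_seconds: int
    summary: str
    artifacts: list = field(default_factory=list)
    error: Optional[str] = None

    def to_json(self) -> str:
        """Serialize to the JSON body; asdict() also converts nested Artifact objects."""
        return json.dumps(asdict(self))
```

Building the receipt once and sending the same serialized body keeps the `/complete` and `/receipts` variants consistent.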
### 6. Deregister When Done
```bash
curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/deregister \
-H 'Content-Type: application/json' \
-d '{"agent_id": "worker-03"}'
```
---
## Forgejo Integration
### How Issues Become Tasks
1. A Forgejo Issue is opened with a label matching `agent:*` (e.g. `agent:code`)
2. Forgejo sends an `issues` webhook to `POST /api/v1/webhooks/forgejo`
3. The `agent:*` label value becomes `task_type` (e.g. `code`)
4. Priority is inferred from labels: `priority:urgent`, `priority:high`, `priority:low` (default: `normal`)
5. A task is created with:
- `task_id` = `{repo_full_name}#{issue_number}` (e.g. `org/repo#42`)
- `execution_mode` = `ssh_cli` (default for Forgejo-originated tasks)
- `branch_name` = `task/{url_encoded_task_id}` (e.g. `task/org%2Frepo%2342`)
- `pr_title` = `feat: {issue_title} (#{issue_number})`
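The mapping from an issue to task fields can be sketched as below. This is an illustration of the rules listed above, not the Orchestrator's code: the function name is hypothetical, and two behaviors are assumptions (the first `agent:*` label wins when there are several, and priority labels are checked in the order urgent, high, low).

```python
from urllib.parse import quote

PRIORITY_LEVELS = ("urgent", "high", "low")


def task_from_issue(repo_full_name, issue_number, issue_title, labels):
    """Derive task fields from an issue, or return None when no agent:* label is present."""
    task_type = next(
        (label.split(":", 1)[1] for label in labels if label.startswith("agent:")),
        None,
    )
    if task_type is None:
        return None  # issues without an agent:* label are ignored
    priority = next(
        (level for level in PRIORITY_LEVELS if f"priority:{level}" in labels),
        "normal",
    )
    task_id = f"{repo_full_name}#{issue_number}"
    return {
        "task_id": task_id,
        "task_type": task_type,
        "priority": priority,
        "execution_mode": "ssh_cli",
        "branch_name": "task/" + quote(task_id, safe=""),
        "pr_title": f"feat: {issue_title} (#{issue_number})",
    }
```

For an issue titled "Add parser" in `org/repo` with number 42 and the label `agent:code`, this yields `task_id` `org/repo#42` and `pr_title` `feat: Add parser (#42)`.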
### Branch Naming Convention
- Branch: `task/{url_encoded_task_id}`
- Example: task `org/repo#42` → branch `task/org%2Frepo%2342`
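Because the branch name is just the percent-encoded task ID with a `task/` prefix, the mapping is reversible. The helper names below are hypothetical; the encoding itself uses the standard `urllib.parse.quote` and `unquote`.

```python
from urllib.parse import quote, unquote

BRANCH_PREFIX = "task/"


def branch_for_task(task_id: str) -> str:
    """Encode a task ID into its branch name, escaping '/' and '#'."""
    return BRANCH_PREFIX + quote(task_id, safe="")


def task_for_branch(branch: str):
    """Recover the task ID from a task/* branch, or None for any other branch."""
    if not branch.startswith(BRANCH_PREFIX):
        return None
    return unquote(branch[len(BRANCH_PREFIX):])
```

This reverse lookup is what lets a PR or push event on a `task/*` branch be tied back to its task.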
### PR Lifecycle
| Event | Effect |
|-------|--------|
| PR opened (branch = `task/*`) | Task → `review_pending` |
| PR merged | Task → `completed`, auto receipt generated |
| Push to `task/*` branch | Task `last_activity_at` updated |
### Task Status Flow
```
created → assigned → running → review_pending → completed
   ↓         ↓          ↓            ↓              ↓
cancelled cancelled  failed       failed        cancelled
                     (retry) → assigned
```
**Notes:**
- `failed` and `agent_lost` tasks can be retried via `POST /api/v1/tasks/{task_id}/retry` (transitions to `assigned`)
- Retry is limited by `max_retries` (default: 2)
- `agent_lost` is set internally by the heartbeat checker when an agent times out
- `review_pending` can transition back to `assigned`, `running`, `failed`, or `completed`
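One way to read the diagram and notes is as a transition table. The sketch below is an interpretation, not the Orchestrator's actual state machine: the table is assembled from the diagram, the notes, and the `ssh_cli` flow (`running` → `completed`), and it omits transitions into `agent_lost` because the guide does not list them.

```python
# Assumed transition table, read off the diagram and notes above.
ALLOWED_TRANSITIONS = {
    "created": {"assigned", "cancelled"},
    "assigned": {"running", "cancelled"},
    "running": {"review_pending", "completed", "failed"},
    "review_pending": {"assigned", "running", "failed", "completed"},
    "completed": {"cancelled"},  # as drawn in the diagram
    "failed": {"assigned"},      # via retry
    "agent_lost": {"assigned"},  # via retry
}

RETRYABLE_STATUSES = {"failed", "agent_lost"}


def can_transition(current: str, target: str) -> bool:
    """True when the table above allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def can_retry(status: str, retry_count: int, max_retries: int = 2) -> bool:
    """True when the task is in a retryable status and has retries left."""
    return status in RETRYABLE_STATUSES and retry_count < max_retries
```

With the default `max_retries` of 2, a task that has already been retried twice stays in `failed`.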
---
## Structured Prompt Format (ssh_cli)
When the Orchestrator executes an agent via SSH, it constructs a structured prompt:
```
Task ID: <task_id>
Type: <task_type>
Goal:
<requirements>
Constraints:
- Execution mode: ssh_cli
- Labels: <comma-separated labels or <none>>
- Branch: <branch_name>
- Expected output: JSON receipt
Validation:
- Run relevant tests if code changed
- Summarize changes and artifacts
```
The prompt is injected into the CLI template as the `{prompt}` variable. Other available variables: `{work_dir}`, `{task_id}`, `{branch}`.
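The sketch below shows how such a prompt and command line might be assembled. It is illustrative only: the function names are hypothetical, joining labels with `", "` is an assumption, and the quoting rule assumes each placeholder sits inside single quotes in the template, as in the built-in templates.

```python
import re


def build_prompt(task_id, task_type, requirements, labels, branch_name):
    """Assemble the structured prompt in the layout shown above."""
    label_text = ", ".join(labels) if labels else "<none>"
    return "\n".join([
        f"Task ID: {task_id}",
        f"Type: {task_type}",
        "Goal:",
        requirements,
        "Constraints:",
        "- Execution mode: ssh_cli",
        f"- Labels: {label_text}",
        f"- Branch: {branch_name}",
        "- Expected output: JSON receipt",
        "Validation:",
        "- Run relevant tests if code changed",
        "- Summarize changes and artifacts",
    ])


def render_template(template: str, variables: dict) -> str:
    """Substitute {name} placeholders in one pass, escaping single quotes for the shell."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # leave unknown placeholders untouched
        # Inside '...' a literal single quote is written as '\''.
        return variables[name].replace("'", "'\\''")

    return re.sub(r"\{(\w+)\}", substitute, template)
```

Substituting in a single pass avoids re-expanding placeholder-like text that happens to appear inside an issue body.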
---
## FAQ
**Q: How do I know which execution mode to use?**
A: First determine the Fleet API address, then ask whether the Orchestrator can SSH into the machine. If it can, `ssh_cli` is usually the best fit. If it cannot, use `http_pull`.
**Q: Do I need to register for ssh_cli mode?**
A: No. The Orchestrator manages ssh_cli tasks entirely. Registration is only for `http_pull` agents.
**Q: What happens if my agent crashes during ssh_cli execution?**
A: The task is marked `failed`. If `retry_count < max_retries`, the dispatch loop will retry automatically.
**Q: What happens if my http_pull agent stops sending heartbeats?**
A: After `heartbeat_interval_secs × heartbeat_timeout_threshold` seconds, the agent is marked offline and all its tasks are requeued with status `created`.
**Q: Can a task switch between execution modes?**
A: No. The `execution_mode` is set at creation time and cannot be changed.
**Q: How do I create a task manually?**
A: Use the Forgejo webhook flow (open an Issue with an `agent:*` label), or insert the task directly into the database. There is no public "create task" API endpoint.
**Q: What label format triggers task creation?**
A: Issues must have a label starting with `agent:` (e.g. `agent:code`, `agent:review`). The value after `agent:` becomes the task type. Issues without such a label are ignored.
**Q: How does the review loop work?**
A: When a PR is opened (not merged), the task goes to `review_pending`. If the PR is not merged and the review cycle count exceeds `max_retries`, the task is marked `failed`. For `ssh_cli`, the Orchestrator re-dispatches automatically.