agent-fleet/docs/agent-onboarding-guide.md
Zer4tul 1f351a1734 docs: replace localhost with FLEET_API_URL, add persistence and heartbeat guidance
- Replace all localhost:9090 with FLEET_API_URL:PORT placeholder
- Add Step 0: persist Fleet API URL to agent memory
- Clarify heartbeat must be periodic loop (60s interval)
- Add execution mode self-selection decision flow
- Add persisting configuration section (URL, agent_id, token)
2026-05-12 18:34:23 +08:00


Agent Fleet — Agent Onboarding Guide

FLEET_API_URL:PORT means the address of your Agent Fleet Orchestrator (for example, 100.102.101.43:9090). If you do not know it, ask your user for the Fleet API address before using these examples.

This guide explains how to integrate an agent with the Agent Fleet Orchestrator.


Execution Modes

Agent Fleet supports two execution modes. The mode is set per-task at creation time (defaults to ssh_cli).

| Aspect | ssh_cli | http_pull |
| --- | --- | --- |
| Who initiates? | Orchestrator (via SSH or local subprocess) | Agent (via HTTP API) |
| Control flow | Orchestrator builds prompt, runs CLI, collects output | Agent decides when to dequeue and execute |
| Agent requirements | CLI binary on a configured host | HTTP client, can call REST API |
| Auth needed? | No (Orchestrator manages) | Yes (Bearer token) |
| Best for | Codex CLI, Claude Code, OpenCode (agents with CLIs) | OpenClaw/Jeeves, Hermes (agents with their own schedulers) |
| Task creation trigger | Forgejo Issue webhook (default) | Same, or API call |

How to Choose an Execution Mode

Start by asking: "What is the address of the Agent Fleet Orchestrator?" If you do not know the address, ask your user.

Then decide based on connectivity and control model:

  1. Use ssh_cli if:

    • You have a CLI binary that can run non-interactively
    • The Fleet orchestrator can SSH into your machine or server
    • You want the Orchestrator to initiate execution for each task
  2. Use http_pull if:

    • The Fleet orchestrator cannot SSH into your machine
    • You are behind NAT, a firewall, or on a network the Orchestrator cannot reach
    • You already have your own scheduler, worker loop, or bot runtime
    • You want your agent to decide when to pull work
  3. If unsure, ask your user:

    • "Can the Fleet orchestrator SSH into this machine?"
    • If yes → ssh_cli
    • If no → http_pull

Self-Determination Decision Flow

As an agent, you can self-determine your mode by answering these questions:

1. Do I have a CLI binary that can execute tasks non-interactively?
   └─ NO → Use http_pull
   └─ YES → Continue

2. Can the Fleet orchestrator SSH into my machine?
   └─ NO → Use http_pull
   └─ YES → Continue

3. Do I have my own scheduler, worker loop, or bot runtime?
   └─ YES → Use http_pull (you prefer to control your own task flow)
   └─ NO → Continue

4. Do I want the Orchestrator to initiate execution remotely?
   └─ YES → Use ssh_cli
   └─ NO → Use http_pull

Quick reference: If you can be reached via SSH and don't have a custom scheduler, use ssh_cli. Otherwise, use http_pull.
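
The four questions above can be sketched as a small function. This is an illustrative sketch only: the function name and its boolean parameters are hypothetical and not part of Agent Fleet; the branch order simply mirrors the numbered questions.

```python
def choose_execution_mode(
    has_cli: bool,
    ssh_reachable: bool,
    has_own_scheduler: bool,
    wants_remote_launch: bool,
) -> str:
    """Walk the four questions in order and return "ssh_cli" or "http_pull"."""
    # 1. No non-interactive CLI binary: the Orchestrator has nothing to run.
    if not has_cli:
        return "http_pull"
    # 2. The Orchestrator cannot SSH in (NAT, firewall, unreachable network).
    if not ssh_reachable:
        return "http_pull"
    # 3. The agent already controls its own task flow.
    if has_own_scheduler:
        return "http_pull"
    # 4. Remote launch by the Orchestrator is the deciding factor.
    return "ssh_cli" if wants_remote_launch else "http_pull"


if __name__ == "__main__":
    # A reachable server with a CLI and no custom scheduler.
    print(choose_execution_mode(True, True, False, True))
```

Only an agent that answers "yes, yes, no, yes" ends up on ssh_cli; every other path falls back to http_pull, which matches the quick reference.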

Decision Table

| Question | If Yes | If No |
| --- | --- | --- |
| Do you have a CLI binary that can execute tasks non-interactively? | Continue evaluating for ssh_cli | Use http_pull |
| Can the Fleet orchestrator SSH into this machine? | Use ssh_cli | Use http_pull |
| Is this agent behind NAT, a firewall, or otherwise unreachable from the Orchestrator? | Use http_pull | Continue evaluating |
| Does the agent already run its own scheduler or task loop? | Use http_pull | Either mode may fit |
| Do you want the Orchestrator to launch the agent process remotely? | Use ssh_cli | Use http_pull |

Common Scenarios

| Scenario | Recommended Mode | Why |
| --- | --- | --- |
| Codex / Claude Code / OpenCode on a reachable server | ssh_cli | Fleet can SSH in and run the CLI directly |
| OpenClaw / Hermes Agent / bot framework | http_pull | The agent already has a runtime and should pull work itself |
| Agent running on a laptop behind NAT | http_pull | Fleet cannot reach it reliably over SSH |
| Shared VM with a well-known SSH host and installed CLI | ssh_cli | Centralized orchestration is simpler |

Simple Rule of Thumb

  • If the Fleet server can reach you, ssh_cli is usually simpler.
  • If you must reach the Fleet server, use http_pull.

ssh_cli Workflow

1. Configure a Host

Add a [[hosts]] section to config.toml on the Orchestrator:

[[hosts]]
host_id = "host-worker-01"
hostname = "192.168.1.100"
ssh_user = "deploy"
ssh_port = 22
ssh_key_path = "/home/deploy/.ssh/id_ed25519"
work_dir = "/opt/agent-workspace"
agents = [
  { agent_type = "codex-cli", max_concurrency = 2, capabilities = ["code:rust", "code:python"] },
]

For local execution (same machine as Orchestrator), use hostname = "localhost" — the Orchestrator uses a local subprocess instead of SSH.

2. Install the Agent CLI

The CLI binary must be available on the target host in $PATH. The Orchestrator checks availability with which <binary>.

Built-in CLI templates:

| Agent Type | CLI Command |
| --- | --- |
| codex-cli | codex exec --json '{prompt}' |
| claude-code | claude -p '{prompt}' --output-format json --dangerously-skip-permissions |

Custom templates can be defined in config.toml under [adapters].

3. Orchestrator Handles Everything

When a Forgejo Issue with an agent:* label arrives:

  1. Orchestrator creates a task (execution_mode = ssh_cli)
  2. Dispatch loop picks the task, selects a host by capability + load
  3. SSH (or local subprocess) executes the CLI with a structured prompt
  4. Output is parsed (Codex JSON or Claude JSON format)
  5. Task status updates: created → assigned → running → completed (or failed)
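
Step 2 says a host is selected "by capability + load" but does not define either term precisely. The sketch below shows one plausible reading, not the Orchestrator's actual algorithm: a slot qualifies if it advertises every required capability and has spare concurrency, and the least-utilized qualifying slot wins. `HostSlot`, `select_host`, and the `running` counter are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HostSlot:
    host_id: str
    agent_type: str
    max_concurrency: int
    capabilities: set = field(default_factory=set)
    running: int = 0  # tasks currently executing on this slot (assumed counter)


def select_host(slots: list, required: set) -> Optional[HostSlot]:
    """Return the least-loaded slot that covers all required capabilities."""
    candidates = [
        slot
        for slot in slots
        if required <= slot.capabilities and slot.running < slot.max_concurrency
    ]
    if not candidates:
        return None  # nothing can take the task right now
    # Assumption: "load" means the fraction of concurrency already in use.
    return min(candidates, key=lambda slot: slot.running / slot.max_concurrency)


if __name__ == "__main__":
    slots = [
        HostSlot("host-worker-01", "codex-cli", 2, {"code:rust", "code:python"}, running=1),
        HostSlot("host-worker-02", "codex-cli", 2, {"code:rust"}, running=0),
    ]
    chosen = select_host(slots, {"code:rust"})
    print(chosen.host_id if chosen else "no host available")
```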

4. What the Agent Receives (Structured Prompt)

The Orchestrator constructs this prompt and passes it as the {prompt} variable:

Task ID: org/repo#42
Type: code
Goal:
Implement the feature described in the issue body

Constraints:
- Execution mode: ssh_cli
- Labels: code:rust
- Branch: task/org%2Frepo%2342
- Expected output: JSON receipt

Validation:
- Run relevant tests if code changed
- Summarize changes and artifacts
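
As a rough illustration of how a prompt with this layout could be assembled, the sketch below builds the same text from task fields. `build_prompt` is a hypothetical helper, not the Orchestrator's code, and joining labels with ", " is an assumption (the guide only says "comma-separated").

```python
def build_prompt(task_id: str, task_type: str, requirements: str,
                 labels: list, branch_name: str) -> str:
    """Assemble a structured prompt in the layout shown above."""
    # The guide's generic format uses the literal "<none>" when no labels exist.
    label_text = ", ".join(labels) if labels else "<none>"
    lines = [
        f"Task ID: {task_id}",
        f"Type: {task_type}",
        "Goal:",
        requirements,
        "",
        "Constraints:",
        "- Execution mode: ssh_cli",
        f"- Labels: {label_text}",
        f"- Branch: {branch_name}",
        "- Expected output: JSON receipt",
        "",
        "Validation:",
        "- Run relevant tests if code changed",
        "- Summarize changes and artifacts",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "org/repo#42",
        "code",
        "Implement the feature described in the issue body",
        ["code:rust"],
        "task/org%2Frepo%2342",
    ))
```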

5. Expected CLI Output

The CLI must output JSON to stdout. The format depends on the parser:

Codex JSON:

{"status": "completed", "summary": "done", "duration_seconds": 120, "artifacts": [{"artifact_type": "pr", "url": "https://..."}]}

Claude JSON:

{"status": "completed", "summary": "done", "duration_seconds": 95, "error": null}

If output is not valid JSON, the task is marked failed.
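
The "invalid JSON means failed" rule can be sketched as follows. This is a minimal illustration, assuming the whole of stdout is a single JSON object; `parse_cli_output` and the wording of the error message are hypothetical.

```python
import json


def parse_cli_output(stdout: str) -> dict:
    """Return the receipt parsed from stdout, or a failed receipt if it is not JSON."""
    try:
        receipt = json.loads(stdout)
    except json.JSONDecodeError as exc:
        # Mirrors the rule above: unparseable output marks the task failed.
        return {"status": "failed", "summary": "", "error": f"invalid JSON output: {exc}"}
    if not isinstance(receipt, dict):
        # Valid JSON that is not an object (e.g. a bare list) is also unusable here.
        return {"status": "failed", "summary": "", "error": "output is not a JSON object"}
    return receipt


if __name__ == "__main__":
    good = '{"status": "completed", "summary": "done", "duration_seconds": 95, "error": null}'
    print(parse_cli_output(good)["status"])
    print(parse_cli_output("Traceback (most recent call last): ...")["status"])
```

A CLI wrapper should therefore keep progress logs on stderr so that stdout contains only the final JSON.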


http_pull Workflow

1. Register

curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/register \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "worker-03", "agent_type": "openclaw", "hostname": "arm0", "capabilities": ["code:rust"], "max_concurrency": 2}'

Response contains a registry_token. Keep it for subsequent API calls (if http_pull_token is configured, use that shared token instead).

Recommended immediately after registration:

  • Persist FLEET_API_URL, your agent_id, and the returned registry_token
  • Start the heartbeat loop before your first dequeue request
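
One way to persist these three values is a small JSON file readable only by the agent's user. This is a sketch under assumptions: the file location, the key names other than registry_token, and the helper names are all hypothetical, and an agent with its own memory or secret store should use that instead.

```python
import json
import os


def save_agent_config(path: str, fleet_api_url: str, agent_id: str,
                      registry_token: str) -> None:
    """Write the connection details to a file created with mode 0600."""
    payload = {
        "fleet_api_url": fleet_api_url,
        "agent_id": agent_id,
        "registry_token": registry_token,
    }
    # Create with owner-only permissions because the file holds a bearer token.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)


def load_agent_config(path: str):
    """Return the saved details, or None if the agent has not registered yet."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    config_path = "fleet-agent.json"  # hypothetical location
    if load_agent_config(config_path) is None:
        save_agent_config(config_path, "http://fleet.example:9090", "worker-03", "example-token")
    print(load_agent_config(config_path)["agent_id"])
```

Loading the file at startup lets a restarted agent reuse its agent_id and token without asking the user for the Fleet API address again.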

2. Heartbeat (periodic)

Heartbeat must be a background loop, not a one-shot call.

  • Default heartbeat interval: every 60 seconds
  • Recommended behavior: start the loop immediately after registration, before the first dequeue
  • If the Orchestrator does not receive a heartbeat within heartbeat_interval_secs × heartbeat_timeout_threshold (default: 180 seconds), the agent is marked offline
  • When an agent is marked offline, its assigned tasks are requeued
  • The heartbeat loop should run for the entire lifetime of the agent

curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/heartbeat \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "worker-03"}'
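
A background loop along these lines keeps heartbeats flowing while the agent works. It is a sketch, not a reference client: `start_heartbeat` is a hypothetical helper, and `send_heartbeat` stands for whatever function performs the POST shown above.

```python
import threading


def start_heartbeat(send_heartbeat, interval_secs: float = 60.0):
    """Call send_heartbeat() every interval_secs on a daemon thread.

    Returns (stop_event, thread); set the event to end the loop.
    """
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            try:
                send_heartbeat()
            except Exception as exc:
                # One failed request should not kill the loop; try again next tick.
                print(f"heartbeat failed: {exc}")
            # wait() doubles as the sleep and returns early when stop is requested.
            stop_event.wait(interval_secs)

    thread = threading.Thread(target=loop, name="fleet-heartbeat", daemon=True)
    thread.start()
    return stop_event, thread


if __name__ == "__main__":
    import time

    stop, worker = start_heartbeat(lambda: print("heartbeat sent"), interval_secs=1.0)
    time.sleep(3.5)  # stand-in for the agent doing real work
    stop.set()
    worker.join()
```

The 180-second default timeout equals three 60-second intervals, so a single delayed heartbeat should not by itself mark the agent offline. Stop the loop only when the agent is shutting down.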

3. Dequeue a Task

curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/dequeue \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{"agent_id": "worker-03", "capabilities": ["code:rust"]}'

Returns 200 OK with a Task object, or 204 No Content if no task is available.

Only tasks with execution_mode = http_pull are returned.
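
The two response cases can be handled as in the sketch below, which uses only the Python standard library. `dequeue_task` is a hypothetical helper; its `urlopen` parameter exists only so the function can be exercised without a live Orchestrator, and the base URL in the usage example is a placeholder.

```python
import json
import urllib.request


def dequeue_task(base_url: str, token: str, agent_id: str, capabilities: list,
                 urlopen=urllib.request.urlopen):
    """POST to the dequeue endpoint; return the task dict, or None on 204."""
    body = json.dumps({"agent_id": agent_id, "capabilities": capabilities}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/v1/tasks/dequeue",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx, so only 2xx reaches here.
    with urlopen(request) as response:
        if response.status == 204:
            return None  # queue is empty; back off and poll again later
        return json.loads(response.read())


if __name__ == "__main__":
    # Replace the placeholder address and token before running against a real server.
    try:
        task = dequeue_task("http://fleet.example:9090", "<token>", "worker-03", ["code:rust"])
        print("no work available" if task is None else task)
    except OSError as exc:
        print(f"dequeue failed: {exc}")
```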

4. Update Status While Working

curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/org%2Frepo%2342/status \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{"status": "running"}'
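
The task ID in the URL path is percent-encoded (org/repo#42 becomes org%2Frepo%2342) because "/" and "#" have special meaning in URLs. A minimal sketch of building such paths, where `task_endpoint` is a hypothetical helper and the base URL is a placeholder:

```python
from urllib.parse import quote


def task_endpoint(base_url: str, task_id: str, action: str) -> str:
    """Build a per-task URL such as .../tasks/org%2Frepo%2342/status."""
    # safe="" forces "/" to be encoded too; quote() leaves it alone by default.
    encoded_id = quote(task_id, safe="")
    return f"{base_url}/api/v1/tasks/{encoded_id}/{action}"


if __name__ == "__main__":
    print(task_endpoint("http://fleet.example:9090", "org/repo#42", "status"))
```

Passing the raw ID would split the path at "/" and truncate the URL at "#", so the request would not reach the intended endpoint.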

5. Complete the Task

curl -X POST http://FLEET_API_URL:PORT/api/v1/tasks/org%2Frepo%2342/complete \
  -H 'Content-Type: application/json' \
  -d '{
    "task_id": "org/repo#42",
    "agent_id": "worker-03",
    "status": "completed",
    "duration_seconds": 180,
    "summary": "Fixed the issue",
    "artifacts": [{"artifact_type": "pr", "url": "https://git.example/org/repo/pulls/15"}],
    "error": null
  }'

Or use the receipts endpoint:

curl -X POST http://FLEET_API_URL:PORT/api/v1/receipts \
  -H 'Content-Type: application/json' \
  -d '<same receipt body>'
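
The receipt body sent to either endpoint could be assembled as sketched here. The field names come from the example above; `build_receipt` itself is hypothetical, as is the choice to derive status from whether an error message is present and to round the duration down to whole seconds.

```python
import json
import time


def build_receipt(task_id: str, agent_id: str, started_at: float, finished_at: float,
                  summary: str, artifacts=None, error=None) -> dict:
    """Return a receipt dict with the fields used in the curl example."""
    return {
        "task_id": task_id,
        "agent_id": agent_id,
        # Assumption: any error message means the task did not complete.
        "status": "failed" if error else "completed",
        "duration_seconds": int(finished_at - started_at),
        "summary": summary,
        "artifacts": artifacts or [],
        "error": error,
    }


if __name__ == "__main__":
    start = time.monotonic()
    # ... the agent does its work here ...
    receipt = build_receipt(
        "org/repo#42", "worker-03", start, time.monotonic(),
        "Fixed the issue",
        artifacts=[{"artifact_type": "pr", "url": "https://git.example/org/repo/pulls/15"}],
    )
    print(json.dumps(receipt, indent=2))
```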

6. Deregister When Done

curl -X POST http://FLEET_API_URL:PORT/api/v1/agents/deregister \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "worker-03"}'

Forgejo Integration

How Issues Become Tasks

  1. A Forgejo Issue is opened with a label matching agent:* (e.g. agent:code)
  2. Forgejo sends an issues webhook to POST /api/v1/webhooks/forgejo
  3. The agent:* label value becomes task_type (e.g. code)
  4. Priority is inferred from labels: priority:urgent, priority:high, priority:low (default: normal)
  5. A task is created with:
    • task_id = {repo_full_name}#{issue_number} (e.g. org/repo#42)
    • execution_mode = ssh_cli (default for Forgejo-originated tasks)
    • branch_name = task/{url_encoded_task_id} (e.g. task/org%2Frepo%2342)
    • pr_title = feat: {issue_title} (#{issue_number})
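
The mapping from an issue to task fields can be sketched as below. This illustrates the rules listed above and is not the Orchestrator's implementation: `task_from_issue` is a hypothetical helper, and picking the first matching label when several agent:* or priority:* labels are present is an assumption.

```python
from urllib.parse import quote

PRIORITY_LABELS = {
    "priority:urgent": "urgent",
    "priority:high": "high",
    "priority:low": "low",
}


def task_from_issue(repo_full_name: str, issue_number: int, issue_title: str, labels: list):
    """Derive task fields from an issue, or return None if no agent:* label is set."""
    task_type = next(
        (label.split(":", 1)[1] for label in labels if label.startswith("agent:")),
        None,
    )
    if task_type is None:
        return None  # issues without an agent:* label are ignored
    priority = next(
        (PRIORITY_LABELS[label] for label in labels if label in PRIORITY_LABELS),
        "normal",
    )
    task_id = f"{repo_full_name}#{issue_number}"
    return {
        "task_id": task_id,
        "task_type": task_type,
        "priority": priority,
        "execution_mode": "ssh_cli",
        "branch_name": f"task/{quote(task_id, safe='')}",
        "pr_title": f"feat: {issue_title} (#{issue_number})",
    }


if __name__ == "__main__":
    print(task_from_issue("org/repo", 42, "Add parser", ["agent:code", "priority:high"]))
```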

Branch Naming Convention

  • Branch: task/{url_encoded_task_id}
  • Example: task org/repo#42 → branch task/org%2Frepo%2342

PR Lifecycle

| Event | Effect |
| --- | --- |
| PR opened (branch = task/*) | Task → review_pending |
| PR merged | Task → completed, auto receipt generated |
| Push to task/* branch | Task last_activity_at updated |

Task Status Flow

created → assigned → running → review_pending → completed
                               ↘ failed
                  ↘ agent_lost
         ↘ cancelled

Any failed or agent_lost task can be retried via POST /api/v1/tasks/{task_id}/retry (transitions to assigned). Retry is limited by max_retries (default: 2).
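
The retry rule can be expressed as a small check. This sketch assumes a task record carries status and retry_count fields and that each retry increments retry_count by one; `can_retry` and `retry_task` are hypothetical names, not API functions.

```python
RETRYABLE_STATUSES = {"failed", "agent_lost"}


def can_retry(status: str, retry_count: int, max_retries: int = 2) -> bool:
    """A task is retryable if it failed or lost its agent and has retries left."""
    return status in RETRYABLE_STATUSES and retry_count < max_retries


def retry_task(task: dict, max_retries: int = 2) -> dict:
    """Return a copy of the task moved back to "assigned", or raise if not allowed."""
    if not can_retry(task["status"], task["retry_count"], max_retries):
        raise ValueError(f"task {task['task_id']} cannot be retried")
    # Assumption: the counter goes up by one per retry.
    return {**task, "status": "assigned", "retry_count": task["retry_count"] + 1}


if __name__ == "__main__":
    task = {"task_id": "org/repo#42", "status": "failed", "retry_count": 0}
    print(retry_task(task))
```

With the default max_retries of 2, a task that has already been retried twice stays in its terminal status.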


Structured Prompt Format (ssh_cli)

When the Orchestrator executes an agent via SSH, it constructs a structured prompt:

Task ID: <task_id>
Type: <task_type>
Goal:
<requirements>

Constraints:
- Execution mode: ssh_cli
- Labels: <comma-separated labels or <none>>
- Branch: <branch_name>
- Expected output: JSON receipt

Validation:
- Run relevant tests if code changed
- Summarize changes and artifacts

The prompt is injected into the CLI template as the {prompt} variable. Other available variables: {work_dir}, {task_id}, {branch}.
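
Placeholder substitution of this kind can be sketched as follows. How the Orchestrator actually escapes values is not documented here, so treat this as an assumption-laden illustration: `render_command` and `escape_single_quoted` are hypothetical, and the escaping assumes placeholders sit inside single quotes, as in the built-in templates.

```python
import re


def escape_single_quoted(value: str) -> str:
    """Escape a value for use inside a single-quoted POSIX shell string."""
    # Close the quote, emit an escaped quote, then reopen: ' becomes '\''
    return value.replace("'", "'\\''")


def render_command(template: str, variables: dict) -> str:
    """Replace {name} placeholders in one pass; unknown names are left untouched."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)
        return escape_single_quoted(variables[name])

    # A single regex pass avoids re-expanding braces that appear inside a value.
    return re.sub(r"\{(\w+)\}", substitute, template)


if __name__ == "__main__":
    template = "codex exec --json '{prompt}'"
    print(render_command(template, {"prompt": "Fix the parser; don't break tests"}))
```

Without escaping, a prompt containing a single quote would end the quoted argument early and the remainder would be interpreted by the shell.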


FAQ

Q: How do I know which execution mode to use? A: First determine the Fleet API address, then ask whether the Orchestrator can SSH into the machine. If it can, ssh_cli is usually the best fit. If it cannot, use http_pull.

Q: Do I need to register for ssh_cli mode? A: No. The Orchestrator manages ssh_cli tasks entirely. Registration is only for http_pull agents.

Q: What happens if my agent crashes during ssh_cli execution? A: The task is marked failed. If retry_count < max_retries, the dispatch loop will retry automatically.

Q: What happens if my http_pull agent stops sending heartbeats? A: After heartbeat_interval_secs × heartbeat_timeout_threshold seconds, the agent is marked offline and all its tasks are requeued with status created.

Q: Can a task switch between execution modes? A: No. The execution_mode is set at creation time and cannot be changed.

Q: How do I create a task manually? A: Use the Forgejo webhook flow (open an Issue with agent:* label), or directly insert into the database. There is no public "create task" API endpoint.

Q: What label format triggers task creation? A: Issues must have a label starting with agent: (e.g. agent:code, agent:review). The value after agent: becomes the task type. Issues without such a label are ignored.

Q: How does the review loop work? A: When a PR is opened (not merged), the task goes to review_pending. If the PR is not merged and the review cycle count exceeds max_retries, the task is marked failed. For ssh_cli, the Orchestrator re-dispatches automatically.