Harbor Integration

Harbor is an agent evaluation framework that runs AI agents against tasks in containerized sandbox environments. Each task provides an instruction, a sandbox (Docker/Daytona/Modal), and a verification script that produces a reward. Harbor handles the full lifecycle: spinning up the sandbox, running the agent, verifying the result, and tearing everything down.

The SkyRL + Harbor integration uses Harbor as the environment and reward source for RL training. SkyRL generates model outputs via vLLM, Harbor executes the agent in a sandbox and verifies correctness, and the resulting reward drives policy optimization.

Quick Start

cd SkyRL

# 1. Set credentials
export WANDB_API_KEY=your_wandb_api_key
# Pick your sandbox provider:
export DAYTONA_API_KEY=your_daytona_api_key
# export MODAL_TOKEN_ID=your_modal_token_id
# export MODAL_TOKEN_SECRET=your_modal_token_secret

# 2. Prepare datasets (downloads from HuggingFace, extracts tasks to ~/data/harbor/)
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
    --dataset open-thoughts/CodeContests
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
    --dataset open-thoughts/OpenThoughts-TB-dev

# 3. Launch training
bash examples/train_integrations/harbor/run_codecontest.sh

How the Integration Works

SkyRL's architecture separates training into a Trainer (PPO optimization) and a Generator (trajectory generation). The Generator is the only component that needs to change to support a new environment. See the system overview for more detail.

The Harbor integration plugs into this boundary by implementing a custom HarborGenerator that replaces the default generator. It also provides a HarborTaskDataset that loads Harbor task directories instead of text prompts, and a HarborExp entrypoint that wires everything together.

SkyRL Training Loop (unchanged)
       |
       v
HarborGenerator (implements GeneratorInterface)
       |
       v
Harbor Trial  -->  sandbox + agent + verifier  -->  chat_history and reward

The key insight is that SkyRL's GeneratorInterface is minimal:

class GeneratorInterface(ABC):
    @abstractmethod
    async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput:
        ...

A GeneratorInput provides a batch of prompts with trajectory IDs. A GeneratorOutput returns tokenized prompt/response IDs, per-trajectory rewards, and loss masks. HarborGenerator implements this interface by running Harbor trials and converting the results into the expected format.
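
As a rough illustration, the contract amounts to the following shapes (field names here are illustrative, not SkyRL's exact dataclass schema):

# Illustrative shapes only; the real GeneratorInput/GeneratorOutput types in
# SkyRL may use different field names.
example_input = {
    "prompts": ["/path/to/task-dir-A", "/path/to/task-dir-B"],  # one per trajectory
    "trajectory_ids": ["0-0", "0-1"],
}
example_output = {
    "prompt_token_ids": [[101, 2054], [101, 2129]],  # tokenized prompts
    "response_ids": [[2003, 102], [1037, 102]],      # tokenized agent responses
    "rewards": [1.0, 0.0],                           # per-trajectory verifier rewards
    "loss_masks": [[1, 1], [1, 0]],                  # 1 = assistant token, 0 = other
}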

Code Structure

The integration lives in examples/train_integrations/harbor/:

examples/train_integrations/harbor/
  harbor_generator.py          # HarborGenerator: core bridge between SkyRL and Harbor
  dataset.py                   # HarborTaskDataset: loads task directory paths
  harbor_trial_config/
    default.yaml               # Harbor TrialConfig template
  entrypoints/
    main_harbor.py             # HarborExp: full training entrypoint
    main_harbor_generate.py    # Generation-only debug entrypoint

Entrypoint

HarborExp extends SkyRL's BasePPOExp with three overrides:

class HarborExp(BasePPOExp):
    def get_generator(self, cfg, tokenizer, inference_engine_client):
        return HarborGenerator(
            generator_cfg=cfg.generator,
            harbor_cfg=cfg.harbor_trial_config,
            inference_engine_client=inference_engine_client,
            tokenizer=tokenizer,
            max_seq_len=cfg.trainer.algorithm.max_seq_len,
        )

    def get_train_dataset(self):
        return HarborTaskDataset(data_files=self.cfg.data.train_data)

    def get_eval_dataset(self):
        if self.cfg.trainer.eval_interval > 0 and self.cfg.data.val_data:
            return HarborTaskDataset(data_files=self.cfg.data.val_data)
        return None

No other changes to SkyRL are needed.

Dataset

HarborTaskDataset replaces SkyRL's standard PromptDataset. Instead of text prompts, each dataset item is a path to a Harbor task directory. It scans directories for subdirectories containing instruction.md and yields them as dataset items:

# Each item returned by HarborTaskDataset:
{"prompt": "/path/to/task-dir", "env_class": None, "env_extras": {...}, "uid": "0"}

HarborGenerator Code Flow

Initialization

When HarborGenerator is created, it:

  1. Builds a base_url from the vLLM HTTP endpoint host and port.
  2. Converts the Harbor YAML config into a Python dict (the config template).
  3. Injects the model name (hosted_vllm/{served_model_name}) and API base URL ({base_url}/v1) into the template. These stay constant across all trials.
  4. Creates a rate limiter from the generator.rate_limit config (passed via Hydra +generator.rate_limit.* overrides, separate from the Harbor TrialConfig).
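
A sketch of the init-time injection in steps 1-3 (the exact key paths inside the TrialConfig template are assumptions for illustration):

import yaml

def build_config_template(harbor_yaml_path, served_model_name, host, port):
    with open(harbor_yaml_path) as f:
        template = yaml.safe_load(f)
    base_url = f"http://{host}:{port}"
    # These two values stay constant across all trials in the run.
    template["agent"]["kwargs"]["model_name"] = f"hosted_vllm/{served_model_name}"
    template["agent"]["kwargs"]["api_base"] = f"{base_url}/v1"
    return template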

Per-Batch Generation

When generate() is called with a batch of prompts:

  1. Creates one async task (harbor_agent_loop) per prompt.
  2. Runs them all concurrently via tqdm.gather().
  3. Calls _mask_failed_instances_and_compute_metrics() to handle failures.
  4. Assembles a GeneratorOutput with tokenized IDs, rewards, and loss masks.
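
Roughly, the batch-level flow looks like this (a sketch; tqdm.gather wraps asyncio.gather with a progress bar):

import asyncio

async def generate_batch(prompts, trajectory_ids, harbor_agent_loop):
    # One async task per prompt; all trials run concurrently.
    tasks = [harbor_agent_loop(p, tid) for p, tid in zip(prompts, trajectory_ids)]
    agent_outputs = await asyncio.gather(*tasks)
    return agent_outputs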

Per-Trajectory Execution (harbor_agent_loop)

For each trajectory, the loop:

  1. Deep-copies the config template.
  2. Injects task.path (from the dataset prompt) and a unique session_id (via uuid4().hex).
  3. Validates the config into a TrialConfig Pydantic model.
  4. Creates a Trial instance and runs it: await trial.run().
  5. Extracts the reward from results.verifier_result.rewards["reward"] and the chat history from results.agent_result.metadata["all_messages"].
  6. Tokenizes the chat history: the first user message becomes prompt_ids, and remaining messages become response_ids with a loss_mask (1 for assistant tokens, 0 for user/system tokens).
  7. Returns a HarborAgentOutput with reward, token IDs, loss mask, and stop reason.
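
The tokenization in step 6 can be sketched as follows (illustrative; the real implementation uses the model's chat template and may differ in detail):

def tokenize_chat_history(tokenizer, messages):
    # First user message (the task instruction) becomes the prompt.
    prompt_ids = tokenizer.apply_chat_template(
        messages[:1], add_generation_prompt=True, tokenize=True
    )
    response_ids, loss_mask = [], []
    for msg in messages[1:]:
        ids = tokenizer.apply_chat_template([msg], tokenize=True)
        response_ids.extend(ids)
        # Only assistant tokens contribute to the policy-gradient loss.
        is_assistant = 1 if msg["role"] == "assistant" else 0
        loss_mask.extend([is_assistant] * len(ids))
    return prompt_ids, response_ids, loss_mask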

What Harbor Does (Trials)

From SkyRL's perspective, Harbor is a black box. SkyRL calls trial.run() and gets back a reward and chat history. Internally, a Harbor Trial runs the following steps:

  1. Start environment: Spins up a sandbox (Daytona, Docker, Modal, etc.) from the task's Dockerfile.
  2. Run agent: The agent (typically terminus-2, a tool-use coding agent) reads the task's instruction.md, then iterates: calling the LLM via the vLLM HTTP endpoint, executing commands in the sandbox, and observing results. This continues for up to max_turns iterations.
  3. Run verifier: Executes tests/test.sh inside the sandbox and reads the reward from /logs/verifier/reward.txt.
  4. Cleanup: Tears down the sandbox and returns results.

A Harbor task directory follows this structure:

task-dir/
  instruction.md          # Natural language task description
  task.toml               # Config: timeouts, resources, metadata
  environment/
    Dockerfile            # Container image
  tests/
    test.sh               # Verification script -> writes reward

SkyRL never manages the sandbox or agent directly. The vLLM inference engine is exposed as an HTTP endpoint, and the Harbor agent calls it through LiteLLM as if it were any OpenAI-compatible API.

Error Handling, Masking, and Retries

Per-Trajectory Retries

Each trajectory gets up to 2 attempts (MAX_NUM_RETRIES_PER_TRIAL = 2). The retry behavior depends on the error type:

Error                       | Retries? | Reward        | Trained on?
----------------------------|----------|---------------|--------------------------
Success                     | N/A      | From verifier | Yes
ContextLengthExceededError  | No       | 0             | Configurable (see below)
AgentTimeoutError           | No       | -             | No (loss-masked)
Missing verifier result     | Yes      | -             | -
Other exceptions            | Yes      | -             | -

When a trajectory hits ContextLengthExceededError, its reward is set to 0. What happens next depends on the generator.apply_overlong_filtering setting:

  • apply_overlong_filtering=true: The loss mask is zeroed out, so the trajectory does not contribute gradients. This prevents the model from training on truncated, incomplete trajectories that hit the context limit.
  • apply_overlong_filtering=false (default): The trajectory is trained with reward=0, treating context-length exceeded as a learnable signal.
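
In Python, the two branches above amount to something like this (a sketch, not the actual generator code):

def handle_context_length_exceeded(loss_mask, apply_overlong_filtering):
    reward = 0.0  # context-length-exceeded trajectories always receive reward 0
    if apply_overlong_filtering:
        # Zero the mask so the truncated trajectory contributes no gradients.
        loss_mask = [0] * len(loss_mask)
    return reward, loss_mask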

Instance-Level Masking

After all trajectories in a batch complete, _mask_failed_instances_and_compute_metrics() scans the results. If any trajectory for a given prompt fails (timeout or error), all trajectories for that prompt are zeroed out (loss_mask=[0]) as a conservative approach.
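
A sketch of that pass (illustrative; the field names on the per-trajectory outputs are assumptions):

from collections import defaultdict

def mask_failed_instances(agent_outputs):
    # Group trajectories by their originating prompt (task directory).
    by_prompt = defaultdict(list)
    for out in agent_outputs:
        by_prompt[out["prompt"]].append(out)
    num_masked_instances = 0
    for group in by_prompt.values():
        if any(o["failed"] for o in group):
            # Conservatively drop the whole instance from the loss.
            for o in group:
                o["loss_mask"] = [0] * len(o["loss_mask"])
            num_masked_instances += 1
    return num_masked_instances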

Rate Limiting

The generator includes built-in rate limiting to avoid overloading sandbox providers (some impose rate limits on sandbox creation):

  • trajectories_per_second: Throttles trial submission rate (e.g., 5/sec).
  • max_concurrency: Caps parallel trial.run() calls (e.g., 512).

These are configured via Hydra overrides on the generator config (not in the Harbor TrialConfig YAML). For example:

+generator.rate_limit.enabled=true \
+generator.rate_limit.trajectories_per_second=5 \
+generator.rate_limit.max_concurrency=512

If +generator.rate_limit is omitted entirely, no rate limiting is applied.
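
A minimal sketch of how these two knobs could be combined (illustrative, not the actual implementation):

import asyncio
import time

class RateLimiter:
    def __init__(self, trajectories_per_second, max_concurrency):
        self.min_interval = 1.0 / trajectories_per_second
        self.submit_lock = asyncio.Lock()
        self.last_submit = 0.0
        self.semaphore = asyncio.Semaphore(max_concurrency)

    async def run(self, trial_coro_fn, *args):
        async with self.submit_lock:
            # Throttle submissions to at most trajectories_per_second per second.
            wait = self.min_interval - (time.monotonic() - self.last_submit)
            if wait > 0:
                await asyncio.sleep(wait)
            self.last_submit = time.monotonic()
        async with self.semaphore:
            # Cap the number of trial.run() calls in flight.
            return await trial_coro_fn(*args)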

TrialConfig and Key Knobs

The Harbor config template (harbor_trial_config/default.yaml) maps directly to Harbor's TrialConfig. SkyRL injects four values at runtime; everything else is user-configurable.

Agent Configuration

agent:
  name: terminus-2                    # Which Harbor agent to use
  override_timeout_sec: 1200          # Time (seconds) given for a single Trial to run
  kwargs:
    max_turns: 32                     # Max agent iterations per trial
    store_all_messages: true          # Required for SkyRL to extract training data
    temperature: 1.0                  # Sampling temperature (higher = more exploration)
    enable_summarize: false           # Context summarization when nearing token limits
    model_info:
      max_input_tokens: 32768        # Should match generator.engine_init_kwargs.max_model_len
      max_output_tokens: 32768

store_all_messages: true is required for training. Without it, SkyRL cannot extract the chat history needed to compute loss masks and train the model.

Key Knobs for RL Training

Knob                                           | Where         | Effect
-----------------------------------------------|---------------|-------------------------------------------------------------
agent.kwargs.max_turns                         | Harbor config | More turns = longer trajectories, richer signal, but slower and more expensive
agent.kwargs.temperature                       | Harbor config | Higher temperature increases exploration; typical RL value is 1.0
agent.kwargs.model_info.max_input_tokens       | Harbor config | Controls the agent's context budget; should match vLLM's max_model_len
generator.n_samples_per_prompt                 | SkyRL config  | Number of trajectories per prompt for GRPO advantage estimation (e.g., 8)
trainer.algorithm.max_seq_len                  | SkyRL config  | Maximum total sequence length (context window); required for Harbor and should match vLLM's max_model_len
generator.apply_overlong_filtering             | SkyRL config  | When true, zeroes the loss mask for context-length-exceeded trajectories (default: false)
trainer.algorithm.advantage_estimator          | SkyRL config  | Advantage method: grpo, rloo, reinforce_pp, etc.
trainer.train_batch_size                       | SkyRL config  | Number of unique prompts per training batch
environment.type                               | Harbor config | Sandbox provider: daytona, docker, modal, e2b, gke
+generator.rate_limit.max_concurrency          | Launch script | Parallel trial cap; tune based on sandbox provider capacity
+generator.rate_limit.trajectories_per_second  | Launch script | Submission rate; prevents overloading the sandbox provider
timeout_multiplier                             | Harbor config | Scales all default timeouts; increase for harder tasks

Environment and Verifier

environment:
  type: daytona                       # Sandbox provider
  override_cpus: 1
  override_memory_mb: 1024
  kwargs:
    auto_stop_interval_mins: 30       # Daytona-specific: minutes of inactivity before sandbox auto-stops
    # sandbox_timeout_secs: 1800      # Modal-specific: sandbox timeout

verifier:
  disable: false                      # Set to true to skip verification (debugging)

LLM Request Settings

agent:
  kwargs:
    llm_kwargs:
      timeout: 900                    # LLM request timeout (seconds)
      max_retries: 0                  # OpenAI SDK retries (0 = disabled)
      top_p: 1.0
      top_k: -1
      min_p: 0.0

Preparing Datasets

Harbor Task Format

A Harbor dataset is a directory of task directories. Each task directory contains the files Harbor needs to spin up a sandbox, run the agent, and verify the result. The minimal structure is:

task-dir/
  instruction.md          # Natural language task description
  task.toml               # Config: timeouts, resources, metadata
  environment/
    Dockerfile            # Container image for the sandbox
  tests/
    test.sh               # Verification script → writes reward to /logs/verifier/reward.txt
  solution/               # Optional reference solution
    solve.sh

HarborTaskDataset scans the top-level directory for subdirectories containing instruction.md and treats each one as a valid task. So your dataset directory should look like:

my-dataset/
  task-001/
    instruction.md
    task.toml
    environment/
      Dockerfile
    tests/
      test.sh
  task-002/
    instruction.md
    ...

For the full specification of Harbor's task format and all supported fields in task.toml, see the Harbor documentation.

Using prepare_harbor_dataset.py

Most publicly available Harbor datasets are hosted on HuggingFace Hub. The prepare_harbor_dataset.py script handles downloading and extracting them into the task directory layout that HarborTaskDataset expects.

The script handles two dataset formats automatically:

  • Parquet-based datasets (e.g., CodeContests): Tasks are stored as tar archives in a parquet file with path and task_binary columns. The script extracts each archive into its own task directory.
  • Direct task directories (e.g., OpenThoughts-TB-dev): The dataset already contains task directories. The script creates a symlink to the downloaded snapshot.

cd SkyRL

# Download and extract training datasets
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
    --dataset open-thoughts/CodeContests

# Download an eval dataset (symlinked, no extraction needed)
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
    --dataset open-thoughts/OpenThoughts-TB-dev

The output directory defaults to ~/data/harbor/<repo-name>, derived from the dataset name. You can override with --output_dir ~/my-custom-path.

You must run prepare_harbor_dataset.py before launching training. The launch scripts (e.g. run_codecontest.sh) assume datasets are already prepared at ~/data/harbor/.

Running Training

Prerequisites

  • Sandbox access: Daytona API key (DAYTONA_API_KEY) or Modal credentials (MODAL_TOKEN_ID and MODAL_TOKEN_SECRET), or other sandbox providers that Harbor supports.
  • Dataset: Harbor task directories, each with instruction.md and tests/test.sh
  • GPUs: Typically 4-8 for combined training and inference
  • vLLM HTTP endpoint: Enabled via generator.enable_http_endpoint=true

Launch

cd SkyRL

# Full RL training
bash examples/train_integrations/harbor/run_codecontest.sh

# Generation-only debugging (no training, useful for testing the integration)
bash examples/train_integrations/harbor/run_harbor_gen.sh

The scripts use Hydra to configure the run. A typical invocation looks like:

uv run --isolated --extra fsdp --extra harbor \
  -m examples.train_integrations.harbor.entrypoints.main_harbor \
  data.train_data=$TRAIN_DATA \
  trainer.policy.model.path=Qwen/Qwen3-8B \
  generator.served_model_name=Qwen3-8B \
  hydra.searchpath=['file://examples/train_integrations/harbor'] \
  +harbor_trial_config=default \
  ++harbor_trial_config.trials_dir=$TRIALS_DIR \
  ++harbor_trial_config.environment.type=daytona \
  trainer.algorithm.max_seq_len=32768 \
  generator.apply_overlong_filtering=true \
  generator.n_samples_per_prompt=8 \
  trainer.algorithm.advantage_estimator=grpo \
  trainer.train_batch_size=64 \
  generator.enable_http_endpoint=true \
  +generator.rate_limit.enabled=true \
  +generator.rate_limit.trajectories_per_second=5 \
  +generator.rate_limit.max_concurrency=512

Metrics

The following metrics are logged to W&B:

Metric                                         | Description
-----------------------------------------------|--------------------------------------------------------
generate/avg_reward                            | Average reward across successful trajectories
generate/avg_response_length                   | Average response length in tokens
generate/avg_num_turns                         | Average agent interaction depth
generate/num_timeout_trajectories              | Trajectories that hit agent timeout
generate/num_error_trajectories                | Trajectories that hit sandbox/agent errors
generate/num_masked_instances                  | Instances excluded from training due to failures
generate/trajectories_context_length_exceeded  | Trajectories that exceeded context window
generate/trajectories_summarized               | Trajectories where context summarization was triggered
