Harbor Integration
Harbor is an agent evaluation framework that runs AI agents against tasks in containerized sandbox environments. Each task provides an instruction, a sandbox (Docker/Daytona/Modal), and a verification script that produces a reward. Harbor handles the full lifecycle: spinning up the sandbox, running the agent, verifying the result, and tearing everything down.
The SkyRL + Harbor integration uses Harbor as the environment and reward source for RL training. SkyRL generates model outputs via vLLM, Harbor executes the agent in a sandbox and verifies correctness, and the resulting reward drives policy optimization.
Quick Start
cd SkyRL
# 1. Set credentials
export WANDB_API_KEY=your_wandb_api_key
# Pick your sandbox provider:
export DAYTONA_API_KEY=your_daytona_api_key
# export MODAL_TOKEN_ID=your_modal_token_id
# export MODAL_TOKEN_SECRET=your_modal_token_secret
# 2. Prepare datasets (downloads from HuggingFace, extracts tasks to ~/data/harbor/)
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
--dataset open-thoughts/CodeContests
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
--dataset open-thoughts/OpenThoughts-TB-dev
# 3. Launch training
bash examples/train_integrations/harbor/run_codecontest.sh

How the Integration Works
SkyRL's architecture separates training into a Trainer (PPO optimization) and a Generator (trajectory generation). The Generator is the only component that needs to change to support a new environment. See the system overview for more detail.
The Harbor integration plugs into this boundary by implementing a custom HarborGenerator that replaces the default generator. It also provides a HarborTaskDataset that loads Harbor task directories instead of text prompts, and a HarborExp entrypoint that wires everything together.
SkyRL Training Loop (unchanged)
|
v
HarborGenerator (implements GeneratorInterface)
|
v
Harbor Trial --> sandbox + agent + verifier --> chat_history and reward

The key insight is that SkyRL's GeneratorInterface is minimal:
class GeneratorInterface(ABC):
    @abstractmethod
    async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput:
        ...

A GeneratorInput provides a batch of prompts with trajectory IDs. A GeneratorOutput returns tokenized prompt/response IDs, per-trajectory rewards, and loss masks. HarborGenerator implements this interface by running Harbor trials and converting the results into the expected format.
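To make the boundary concrete, here is a rough sketch of the data crossing it. The field names below are illustrative assumptions, not SkyRL's actual GeneratorInput/GeneratorOutput definitions; only the overall shape matters.

# Illustrative sketch only -- field names are assumptions, not SkyRL's exact schema.
# Input: a batch of prompts, each tagged with a trajectory ID.
generator_input = {
    "prompts": ["/path/to/task-001", "/path/to/task-002"],  # Harbor task directories (see Dataset)
    "trajectory_ids": ["0_0", "0_1"],
}

# Output: everything the trainer needs for the policy update.
generator_output = {
    "prompt_token_ids": [[101, 2023], [101, 2024]],  # tokenized first user message per trajectory
    "response_ids": [[345, 678], [345, 679]],        # tokenized remainder of the chat history
    "rewards": [1.0, 0.0],                           # one scalar per trajectory, from the verifier
    "loss_masks": [[1, 0], [1, 1]],                  # 1 = assistant token, 0 = user/system token
}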
Code Structure
The integration lives in examples/train_integrations/harbor/:
examples/train_integrations/harbor/
  harbor_generator.py            # HarborGenerator: core bridge between SkyRL and Harbor
  dataset.py                     # HarborTaskDataset: loads task directory paths
  harbor_trial_config/
    default.yaml                 # Harbor TrialConfig template
  entrypoints/
    main_harbor.py               # HarborExp: full training entrypoint
    main_harbor_generate.py      # Generation-only debug entrypoint

Entrypoint
HarborExp extends SkyRL's BasePPOExp with three overrides:
class HarborExp(BasePPOExp):
    def get_generator(self, cfg, tokenizer, inference_engine_client):
        return HarborGenerator(
            generator_cfg=cfg.generator,
            harbor_cfg=cfg.harbor_trial_config,
            inference_engine_client=inference_engine_client,
            tokenizer=tokenizer,
            max_seq_len=cfg.trainer.algorithm.max_seq_len,
        )

    def get_train_dataset(self):
        return HarborTaskDataset(data_files=self.cfg.data.train_data)

    def get_eval_dataset(self):
        if self.cfg.trainer.eval_interval > 0 and self.cfg.data.val_data:
            return HarborTaskDataset(data_files=self.cfg.data.val_data)
        return None

No other changes to SkyRL are needed.
Dataset
HarborTaskDataset replaces SkyRL's standard PromptDataset. Instead of text prompts, each dataset item is a path to a Harbor task directory. It scans the given data directories for subdirectories containing instruction.md and yields each one as a dataset item:
# Each item returned by HarborTaskDataset:
{"prompt": "/path/to/task-dir", "env_class": None, "env_extras": {...}, "uid": "0"}

HarborGenerator Code Flow
Initialization
When HarborGenerator is created, it:
- Builds a base_url from the vLLM HTTP endpoint host and port.
- Converts the Harbor YAML config into a Python dict (the config template).
- Injects the model name (hosted_vllm/{served_model_name}) and API base URL ({base_url}/v1) into the template. These stay constant across all trials (see the sketch after this list).
- Creates a rate limiter from the generator.rate_limit config (passed via Hydra +generator.rate_limit.* overrides, separate from the Harbor TrialConfig).
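A minimal sketch of that template preparation, assuming a plain YAML load; the helper name and the key names under agent.kwargs (model_name, api_base) are illustrative, not the exact code in harbor_generator.py.

import yaml

def build_config_template(harbor_yaml_path: str, served_model_name: str, base_url: str) -> dict:
    """Hypothetical helper: load the Harbor TrialConfig YAML and inject the run-wide constants."""
    with open(harbor_yaml_path) as f:
        template = yaml.safe_load(f)
    # The agent reaches the vLLM HTTP endpoint through LiteLLM, so the model is
    # addressed as hosted_vllm/<served_model_name> at <base_url>/v1.
    template["agent"]["kwargs"]["model_name"] = f"hosted_vllm/{served_model_name}"
    template["agent"]["kwargs"]["api_base"] = f"{base_url}/v1"
    return template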
Per-Batch Generation
When generate() is called with a batch of prompts:
- Creates one async task (harbor_agent_loop) per prompt.
- Runs them all concurrently via tqdm.gather().
- Calls _mask_failed_instances_and_compute_metrics() to handle failures.
- Assembles a GeneratorOutput with tokenized IDs, rewards, and loss masks (see the sketch below).
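In simplified form, the per-batch flow looks roughly like the following sketch. Error handling, retries, and SkyRL's actual types are omitted, the field names and return signatures are approximations, and tqdm.gather is tqdm's asyncio-aware wrapper around asyncio.gather.

import asyncio

async def generate(self, input_batch):
    # One concurrent agent loop per prompt (i.e., per Harbor task directory).
    tasks = [
        self.harbor_agent_loop(prompt, traj_id)
        for prompt, traj_id in zip(input_batch["prompts"], input_batch["trajectory_ids"])
    ]
    outputs = await asyncio.gather(*tasks)  # the real code uses tqdm.gather for a progress bar

    # Zero out loss masks for failed instances and collect batch-level metrics.
    outputs, metrics = self._mask_failed_instances_and_compute_metrics(outputs)

    # Assemble the GeneratorOutput (field names illustrative, as above).
    return {
        "prompt_token_ids": [o.prompt_ids for o in outputs],
        "response_ids": [o.response_ids for o in outputs],
        "rewards": [o.reward for o in outputs],
        "loss_masks": [o.loss_mask for o in outputs],
    }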
Per-Trajectory Execution (harbor_agent_loop)
For each trajectory, the loop:
- Deep-copies the config template.
- Injects task.path (from the dataset prompt) and a unique session_id (via uuid4().hex).
- Validates the config into a TrialConfig Pydantic model.
- Creates a Trial instance and runs it: await trial.run().
- Extracts the reward from results.verifier_result.rewards["reward"] and the chat history from results.agent_result.metadata["all_messages"].
- Tokenizes the chat history: the first user message becomes prompt_ids, and remaining messages become response_ids with a loss_mask (1 for assistant tokens, 0 for user/system tokens). This step is sketched below.
- Returns a HarborAgentOutput with reward, token IDs, loss mask, and stop reason.
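The tokenization step is what turns a Harbor chat history into RL training data. A simplified sketch, assuming a HuggingFace tokenizer with a chat template; the real code formats each message with the chat template and handles truncation against max_seq_len.

def tokenize_chat_history(messages, tokenizer):
    """Sketch: split Harbor's all_messages chat history into prompt IDs, response IDs, and a loss mask."""
    # Everything up to and including the first user message is the prompt.
    first_user = next(i for i, m in enumerate(messages) if m["role"] == "user")
    prompt_ids = tokenizer.apply_chat_template(messages[: first_user + 1], tokenize=True)

    response_ids, loss_mask = [], []
    for msg in messages[first_user + 1:]:
        ids = tokenizer.encode(msg["content"], add_special_tokens=False)
        response_ids.extend(ids)
        # Train only on assistant tokens; mask out user/system (e.g., tool output) tokens.
        loss_mask.extend([1 if msg["role"] == "assistant" else 0] * len(ids))
    return prompt_ids, response_ids, loss_mask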
What Harbor Does (Trials)
From SkyRL's perspective, Harbor is a black box. SkyRL calls trial.run() and gets back a reward and chat history. Internally, a Harbor Trial runs the following steps:
- Start environment: Spins up a sandbox (Daytona, Docker, Modal, etc.) from the task's Dockerfile.
- Run agent: The agent (typically terminus-2, a tool-use coding agent) reads the task's instruction.md, then iterates: calling the LLM via the vLLM HTTP endpoint, executing commands in the sandbox, and observing results. This continues for up to max_turns iterations.
- Run verifier: Executes tests/test.sh inside the sandbox and reads the reward from /logs/verifier/reward.txt.
- Cleanup: Tears down the sandbox and returns results.
A Harbor task directory follows this structure:
task-dir/
  instruction.md        # Natural language task description
  task.toml             # Config: timeouts, resources, metadata
  environment/
    Dockerfile          # Container image
  tests/
    test.sh             # Verification script -> writes reward

SkyRL never manages the sandbox or agent directly. The vLLM inference engine is exposed as an HTTP endpoint, and the Harbor agent calls it through LiteLLM as if it were any OpenAI-compatible API.
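For example, hitting the endpoint directly with the openai Python client (for illustration only; Harbor's agent goes through LiteLLM, and the host/port here are hypothetical and depend on how the generator exposes the endpoint):

from openai import OpenAI

# Hypothetical address -- in practice this is the {base_url}/v1 that HarborGenerator
# injects into the Harbor TrialConfig template.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-8B",  # must match generator.served_model_name
    messages=[{"role": "user", "content": "List the files in the working directory."}],
    temperature=1.0,
)
print(resp.choices[0].message.content)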
Error Handling, Masking, and Retries
Per-Trajectory Retries
Each trajectory gets up to 2 attempts (MAX_NUM_RETRIES_PER_TRIAL = 2). The retry behavior depends on the error type:
| Error | Retries? | Reward | Trained on? |
|---|---|---|---|
| Success | N/A | From verifier | Yes |
| ContextLengthExceededError | No | 0 | Configurable (see below) |
| AgentTimeoutError | No | - | No (loss-masked) |
| Missing verifier result | Yes | - | - |
| Other exceptions | Yes | - | - |
When a trajectory hits ContextLengthExceededError, its reward is set to 0. What happens next depends on the generator.apply_overlong_filtering setting:
- apply_overlong_filtering=true: The loss mask is zeroed out, so the trajectory does not contribute gradients. This prevents the model from training on truncated, incomplete trajectories that hit the context limit (see the sketch below).
- apply_overlong_filtering=false (default): The trajectory is trained with reward=0, treating context-length exceeded as a learnable signal.
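In rough pseudocode (not the literal code in harbor_generator.py), the branch looks like:

# Sketch of the context-length-exceeded branch.
if isinstance(error, ContextLengthExceededError):
    reward = 0.0  # not retried; reward forced to 0
    if generator_cfg.apply_overlong_filtering:
        # Zero the loss mask so the truncated trajectory contributes no gradients.
        loss_mask = [0] * len(response_ids)
    # else: keep the loss mask and train on the trajectory with reward 0.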
Instance-Level Masking
After all trajectories in a batch complete, _mask_failed_instances_and_compute_metrics() scans the results. If any trajectory for a given prompt fails (timeout or error), all trajectories for that prompt are zeroed out (loss_mask=[0]) as a conservative approach.
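A sketch of that rule, assuming each trajectory result carries its prompt's uid and a failed flag (the names and dict layout are illustrative):

from collections import defaultdict

def mask_failed_instances(trajectories):
    """Sketch: if any trajectory of a prompt failed, exclude all of that prompt's trajectories."""
    by_prompt = defaultdict(list)
    for traj in trajectories:
        by_prompt[traj["uid"]].append(traj)

    for group in by_prompt.values():
        if any(t["failed"] for t in group):  # timeout or unrecoverable error
            for t in group:
                t["loss_mask"] = [0]         # zeroed out, as described above
    return trajectories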
Rate Limiting
The generator includes built-in rate limiting to avoid overloading sandbox providers (some impose rate limits on sandbox creation):
- trajectories_per_second: Throttles the trial submission rate (e.g., 5/sec).
- max_concurrency: Caps parallel trial.run() calls (e.g., 512).
These are configured via Hydra overrides on the generator config (not in the Harbor TrialConfig YAML). For example:
+generator.rate_limit.enabled=true \
+generator.rate_limit.trajectories_per_second=5 \
+generator.rate_limit.max_concurrency=512

If +generator.rate_limit is omitted entirely, no rate limiting is applied.
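Conceptually, the two knobs combine a submission throttle with a concurrency cap, along the lines of this simplified sketch (not the generator's actual rate limiter):

import asyncio

class RateLimiterSketch:
    """Throttle trial submission to ~trajectories_per_second and cap in-flight trial.run() calls."""

    def __init__(self, trajectories_per_second: float, max_concurrency: int):
        self.interval = 1.0 / trajectories_per_second
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self._submit_lock = asyncio.Lock()

    async def run(self, coro_fn, *args):
        async with self._submit_lock:           # serialize submissions...
            await asyncio.sleep(self.interval)  # ...so roughly 1/interval trials start per second
        async with self.semaphore:              # cap concurrent trial.run() calls
            return await coro_fn(*args)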
TrialConfig and Key Knobs
The Harbor config template (harbor_trial_config/default.yaml) maps directly to Harbor's TrialConfig. SkyRL injects four values at runtime; everything else is user-configurable.
Agent Configuration
agent:
  name: terminus-2              # Which Harbor agent to use
  override_timeout_sec: 1200    # Time (seconds) given for a single Trial to run
  kwargs:
    max_turns: 32               # Max agent iterations per trial
    store_all_messages: true    # Required for SkyRL to extract training data
    temperature: 1.0            # Sampling temperature (higher = more exploration)
    enable_summarize: false     # Context summarization when nearing token limits
    model_info:
      max_input_tokens: 32768   # Should match generator.engine_init_kwargs.max_model_len
      max_output_tokens: 32768

store_all_messages: true is required for training. Without it, SkyRL cannot extract the chat history needed to compute loss masks and train the model.
Key Knobs for RL Training
| Knob | Where | Effect |
|---|---|---|
| agent.kwargs.max_turns | Harbor config | More turns = longer trajectories, richer signal, but slower and more expensive |
| agent.kwargs.temperature | Harbor config | Higher temperature increases exploration; typical RL value is 1.0 |
| agent.kwargs.model_info.max_input_tokens | Harbor config | Controls the agent's context budget; should match vLLM's max_model_len |
| generator.n_samples_per_prompt | SkyRL config | Number of trajectories per prompt for GRPO advantage estimation (e.g., 8) |
| trainer.algorithm.max_seq_len | SkyRL config | Maximum total sequence length (context window length). Required for Harbor; set to match vLLM's max_model_len |
| generator.apply_overlong_filtering | SkyRL config | When true, zero out loss mask for context-length-exceeded trajectories (default: false) |
| trainer.algorithm.advantage_estimator | SkyRL config | Advantage method: grpo, rloo, reinforce_pp, etc. |
| trainer.train_batch_size | SkyRL config | Number of unique prompts per training batch |
| environment.type | Harbor config | Sandbox provider: daytona, docker, modal, e2b, gke |
| +generator.rate_limit.max_concurrency | Launch script | Parallel trial cap; tune based on sandbox provider capacity |
| +generator.rate_limit.trajectories_per_second | Launch script | Submission rate; prevents overloading the sandbox provider |
| timeout_multiplier | Harbor config | Scales all default timeouts; increase for harder tasks |
Environment and Verifier
environment:
  type: daytona               # Sandbox provider
  override_cpus: 1
  override_memory_mb: 1024
  kwargs:
    auto_stop_interval_mins: 30    # Daytona-specific: minutes of inactivity before sandbox auto-stops
    # sandbox_timeout_secs: 1800   # Modal-specific: sandbox timeout

verifier:
  disable: false              # Set to true to skip verification (debugging)

LLM Request Settings
agent:
  kwargs:
    llm_kwargs:
      timeout: 900        # LLM request timeout (seconds)
      max_retries: 0      # OpenAI SDK retries (0 = disabled)
      top_p: 1.0
      top_k: -1
      min_p: 0.0

Preparing Datasets
Harbor Task Format
A Harbor dataset is a directory of task directories. Each task directory contains the files Harbor needs to spin up a sandbox, run the agent, and verify the result. The minimal structure is:
task-dir/
  instruction.md        # Natural language task description
  task.toml             # Config: timeouts, resources, metadata
  environment/
    Dockerfile          # Container image for the sandbox
  tests/
    test.sh             # Verification script → writes reward to /logs/verifier/reward.txt
  solution/             # Optional reference solution
    solve.sh

HarborTaskDataset scans the top-level directory for subdirectories containing instruction.md and treats each one as a valid task. So your dataset directory should look like:
my-dataset/
  task-001/
    instruction.md
    task.toml
    environment/
      Dockerfile
    tests/
      test.sh
  task-002/
    instruction.md
    ...

For the full specification of Harbor's task format and all supported fields in task.toml, see the Harbor documentation.
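The scanning behavior of HarborTaskDataset described above is simple enough to sketch. Roughly (not the exact dataset.py code; the item fields mirror the example shown in the Dataset section):

from pathlib import Path

def find_harbor_tasks(dataset_dir: str) -> list[dict]:
    """Sketch: treat every immediate subdirectory containing instruction.md as one task."""
    items = []
    for uid, task_dir in enumerate(sorted(Path(dataset_dir).iterdir())):
        if task_dir.is_dir() and (task_dir / "instruction.md").exists():
            items.append({
                "prompt": str(task_dir),  # the "prompt" is the task directory path
                "env_class": None,
                "env_extras": {},
                "uid": str(uid),
            })
    return items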
Using prepare_harbor_dataset.py
Most publicly available Harbor datasets are hosted on HuggingFace Hub. The prepare_harbor_dataset.py script handles downloading and extracting them into the task directory layout that HarborTaskDataset expects.
The script handles two dataset formats automatically:
- Parquet-based datasets (e.g., CodeContests): Tasks are stored as tar archives in a parquet file with path and task_binary columns. The script extracts each archive into its own task directory (see the sketch after this list).
- Direct task directories (e.g., OpenThoughts-TB-dev): The dataset already contains task directories. The script creates a symlink to the downloaded snapshot.
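For the parquet-based format, extraction amounts to untarring each row's task_binary into its own directory. A rough sketch, assuming pandas and the column names above (not the script's exact code):

import io
import tarfile
from pathlib import Path

import pandas as pd

def extract_parquet_tasks(parquet_path: str, output_dir: str) -> None:
    """Sketch: expand tar-archived tasks from a parquet with path and task_binary columns."""
    df = pd.read_parquet(parquet_path)
    for _, row in df.iterrows():
        task_dir = Path(output_dir) / Path(row["path"]).stem  # one directory per task
        task_dir.mkdir(parents=True, exist_ok=True)
        with tarfile.open(fileobj=io.BytesIO(row["task_binary"])) as tar:
            tar.extractall(task_dir)  # yields instruction.md, environment/, tests/, ...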
cd SkyRL
# Download and extract training datasets
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
--dataset open-thoughts/CodeContests
# Download an eval dataset (symlinked, no extraction needed)
uv run examples/train_integrations/harbor/prepare_harbor_dataset.py \
--dataset open-thoughts/OpenThoughts-TB-dev

The output directory defaults to ~/data/harbor/<repo-name>, derived from the dataset name. You can override with --output_dir ~/my-custom-path.
You must run prepare_harbor_dataset.py before launching training. The launch scripts (e.g. run_codecontest.sh) assume datasets are already prepared at ~/data/harbor/.
Running Training
Prerequisites
- Sandbox access: a Daytona API key (DAYTONA_API_KEY), Modal credentials (MODAL_TOKEN_ID and MODAL_TOKEN_SECRET), or credentials for another sandbox provider that Harbor supports.
- Dataset: Harbor task directories, each with instruction.md and tests/test.sh.
- GPUs: Typically 4-8 for combined training and inference.
- vLLM HTTP endpoint: Enabled via generator.enable_http_endpoint=true.
Launch
cd SkyRL
# Full RL training
bash examples/train_integrations/harbor/run_codecontest.sh
# Generation-only debugging (no training, useful for testing the integration)
bash examples/train_integrations/harbor/run_harbor_gen.sh

The scripts use Hydra to configure the run. A typical invocation looks like:
uv run --isolated --extra fsdp --extra harbor \
-m examples.train_integrations.harbor.entrypoints.main_harbor \
data.train_data=$TRAIN_DATA \
trainer.policy.model.path=Qwen/Qwen3-8B \
generator.served_model_name=Qwen3-8B \
hydra.searchpath=['file://examples/train_integrations/harbor'] \
+harbor_trial_config=default \
++harbor_trial_config.trials_dir=$TRIALS_DIR \
++harbor_trial_config.environment.type=daytona \
trainer.algorithm.max_seq_len=32768 \
generator.apply_overlong_filtering=true \
generator.n_samples_per_prompt=8 \
trainer.algorithm.advantage_estimator=grpo \
trainer.train_batch_size=64 \
generator.enable_http_endpoint=true \
+generator.rate_limit.enabled=true \
+generator.rate_limit.trajectories_per_second=5 \
+generator.rate_limit.max_concurrency=512

Metrics
The following metrics are logged to W&B:
| Metric | Description |
|---|---|
| generate/avg_reward | Average reward across successful trajectories |
| generate/avg_response_length | Average response length in tokens |
| generate/avg_num_turns | Average agent interaction depth |
| generate/num_timeout_trajectories | Trajectories that hit agent timeout |
| generate/num_error_trajectories | Trajectories that hit sandbox/agent errors |
| generate/num_masked_instances | Instances excluded from training due to failures |
| generate/trajectories_context_length_exceeded | Trajectories that exceeded context window |
| generate/trajectories_summarized | Trajectories where context summarization was triggered |