SkyRL
Tinker API

Architecture

This page describes how SkyRL implements the Tinker API, including the system architecture, training and sampling request lifecycles, and concurrency model.

System Architecture

The integration is organized in three high-level layers:

  1. API Layer (skyrl.tinker.api) - FastAPI HTTP server that accepts Tinker API requests, stores them in a database, and returns future IDs for async polling
  2. Engine Layer (skyrl.tinker.engine) - Background subprocess that polls the database, batches pending requests, and dispatches them to the backend
  3. Backend Layer (skyrl.backends) - Translates Tinker operations into training and inference calls, managing Ray workers, FSDP/Megatron training, and vLLM inference

SkyRL + Tinker Architecture

Model Creation

The Tinker call create_lora_training_client() triggers the full initialization of the SkyRL backend, spinning up training and inference workers, loading the base model (optionally with LoRA adaptors), and initializing weight sync state like NCCL transfer channels

Training

Training requests (forward_backward, optim_step, forward) go through the following lifecycle:

Client (tinker SDK)


API Server (FastAPI)
    │  Writes request to SQLite DB

Background Engine (subprocess)
    │  Polls DB, batches requests

SkyRL-Train Backend
    │  Converts Tinker batch → SkyRL TrainingInputBatch

RayPPOTrainer.dispatch
    │  Distributes work across Ray workers

GPU Workers (FSDP or Megatron)
    │  Execute forward/backward/optim

Results aggregated → DB updated → Client polls future

forward_backward()

Calls trainer.dispatch.forward_backward("policy", batch, loss_fn=loss_fn) which distributes computation across FSDP/Megatron workers. The dispatch returns:

  • loss_fn_outputs: Per-example dicts containing logprobs and elementwise_loss
  • Aggregate metrics: loss, policy_loss, policy_entropy, response_length

forward()

Calls trainer.dispatch.forward("policy", batch) for a gradient-free forward pass. Returns only logprobs per example (no loss computation).

optim_step()

Calls dispatch.optim_step("policy") to apply accumulated gradients.

Sampling

Sampling requests (sample) go through the following lifecycle:

Client calls sample()


SkyRL-Train Backend
    │  Converts Tinker SamplingParams → vLLM params

InferenceEngineClient
    │  Distributes prompts across vLLM engines

Inference Workers (vLLM)
    │  Generate tokens with logprobs

Results aggregated → GeneratedSequence objects returned

The sample() call is a fairly lightweight wrapper around SkyRL-Train's vLLM inference engines. The backend translates Tinker SamplingParams to vLLM format (e.g., stop_stringsstop, stop_tokensstop_token_ids) and delegates prompts to the InferenceEngineClient, which handles load balancing and sticky routing across vLLM workers.

Weight Sync

Training workers and inference engines hold separate copies of the model weights. After training updates the policy, the client must explicitly call save_weights_for_sampler() to broadcast the new weights to the inference engines before sampling. Multiple training steps (forward_backward + optim_step) can accumulate before a single sync.

Client calls save_weights_for_sampler()


SkyRL-Train Backend


RayPPOTrainer.dispatch.save_weights_for_sampler()
    │  NCCL broadcast

Training Workers ──→ Inference Engines (vLLM)
    (source weights)     (receive updated weights)

Weight Sync Modes: Ephemeral vs Persistent

The Tinker SDK provides two paths for syncing weights to inference engines, depending on whether the caller needs a durable checkpoint on disk.

Persistent mode

Triggered by save_weights_for_sampler(name="..."). This syncs the latest training weights to the inference engines and writes a full HuggingFace model checkpoint to disk. The call returns a tinker:// path that can be loaded later via load_checkpoint. Use persistent mode for checkpointing at milestones (e.g., end of epoch, best-so-far evaluation score).

Ephemeral mode

Triggered by save_weights_and_get_sampling_client(name="..."). This syncs weights to the inference engines only and skips the disk write entirely. Instead of returning a checkpoint path, it returns a sampling client directly. Use ephemeral mode in hot RL loops where you sync weights every batch but do not need to persist every iteration.

How the server distinguishes them

The Tinker SDK sends a sampling_session_seq_id field when using the ephemeral path. When the server sees this field (and no explicit checkpoint path or name), it skips the expensive disk write.

Why it matters

Persistent saves can be very expensive because they write full model weights to disk on every call. In RL training loops that sync weights every batch, ephemeral mode avoids this overhead entirely. In typical RL loops (e.g., tinker-cookbook's rl_loop), every iteration uses ephemeral mode before sampling, and persistent saves are reserved for periodic checkpointing.

Multiple LoRA tenants

On the Megatron backend, SkyRL supports multiple LoRA adapters trained and sampled concurrently against a single server. Each tenant's adapter weights and optimizer state live in pinned-CPU slots; the live GPU adapter is swapped on demand at the top of every per-model dispatch entry point (forward, forward_backward, optim_step, save_weights_for_sampler). On the inference side, vLLM serves each tenant's adapter by model_id after save_weights_for_sampler registers it via load_lora_adapter. See Multi-tenancy for the design and operator contract.

Full-parameter fine-tuning and the FSDP backend remain single-tenant — calling create_model a second time on those paths returns an error.

Checkpointing

The backend supports two checkpoint types:

  • Full checkpoint (save_checkpoint / load_checkpoint): Saves model weights, optimizer state, and LR scheduler as an uncompressed tar archive. Used for resuming training.
  • Sampler checkpoint (save_weights_for_sampler / save_weights_and_get_sampling_client): Syncs weights to inference engines. In persistent mode, also exports a HuggingFace model to disk; in ephemeral mode, skips the disk write (see Weight Sync Modes above).

Loss Functions

The following loss functions are validated through the Tinker API:

Loss FunctionDescriptionUse Case
cross_entropyStandard next-token prediction lossSupervised fine-tuning
importance_samplingOff-policy policy gradient: -(exp(logp - old_logp) * advantage)RL training (GRPO, REINFORCE)

SkyRL-Train's PolicyLossRegistry also contains additional loss functions (regular, dual_clip, gspo, sapo, cispo, clip_cov, kl_cov) used by SkyRL's native trainer. These are not yet wired through the Tinker data conversion path, which does not currently populate the required advantages and old_log_probs fields in the training batch for these loss types.

Concurrency Model

The Tinker API is inherently asynchronous:

  1. Clients submit requests and receive a request_id (future)
  2. The background engine batches compatible requests (e.g., multiple forward_backward calls for the same model)
  3. Barrier operations (optim_step, load_checkpoint) block until prior operations complete
  4. Clients poll retrieve_future to get results

This design allows the engine to batch small requests for better GPU utilization and to pipeline operations when possible.

Batching

Tinker represents training data as Datum objects with a ModelInput (containing one or more EncodedTextChunks of token IDs) and loss_fn_inputs (a flexible dictionary of TensorData fields whose keys vary by loss function — e.g., target_tokens and weights for SFT, or target_tokens, logprobs, advantages, and mask for RL). The backend converts these to SkyRL's TrainingInputBatch format:

  • Left-pads sequences to uniform length (SkyRL-Train expects padded tensors)
  • Shifts tokens: Tinker pre-shifts inputs/targets, but SkyRL-Train shifts internally, so the backend appends the last target token to reconstruct full sequences
  • Builds attention_mask, loss_mask, and response_mask tensors from token weights

The engine layer also batches multiple client requests together before passing them to the backend.

On this page