feat(supervisor): compute workload manager by nicktrn · Pull Request #3114 · triggerdotdev/trigger.dev

nicktrn · 2026-02-23T12:14:37Z

Adds the ComputeWorkloadManager for routing task execution through the compute gateway, including full checkpoint/restore support, OTel trace integration, and template pre-warming.

Changes

Compute workload manager (apps/supervisor/src/workloadManager/compute.ts)

Routes instance create, snapshot, delete, and restore through the compute gateway API
Wide event logging on create with full timing and context
Configurable gateway timeout, auth token, image digest stripping
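
The description does not say how image digest stripping works. As a rough illustration only, assuming it means dropping a trailing `@sha256:…` digest so the repository and tag remain, a helper could look like this (`stripImageDigest` is a hypothetical name, not taken from the PR):

```typescript
// Hypothetical helper: name and exact behaviour are assumptions, not the PR's code.
// An image reference may end in "@<algorithm>:<hex>"; everything before the "@"
// is the repository plus an optional tag.
function stripImageDigest(imageRef: string): string {
  const at = imageRef.indexOf("@");
  return at === -1 ? imageRef : imageRef.slice(0, at);
}

// Example: "registry.example.com/org/app:v1@sha256:abc123"
// becomes  "registry.example.com/org/app:v1"
```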

Compute snapshot service (apps/supervisor/src/services/computeSnapshotService.ts)

Timer wheel for delayed snapshot dispatch (avoids wasted work on short-lived waitpoints)
Configurable dispatch concurrency limit (COMPUTE_SNAPSHOT_DISPATCH_LIMIT)
Snapshot-complete callback handler with suspend completion reporting
Trace context management and OTel span emission for snapshot operations
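
The description does not show how the timer wheel is built. As a minimal sketch, assuming a single-level wheel that is advanced by an external tick and keyed by a run identifier (all class and method names here are hypothetical), delayed dispatch with cancellation could look like this:

```typescript
// Illustrative single-level timer wheel. Names are assumptions for this sketch,
// not the API of ComputeSnapshotService.
class TimerWheel<T> {
  private slots: Map<string, T>[];
  private slotOfKey = new Map<string, number>();
  private cursor = 0;
  private readonly slotCount: number;
  private readonly onExpire: (key: string, value: T) => void;

  constructor(slotCount: number, onExpire: (key: string, value: T) => void) {
    this.slotCount = slotCount;
    this.onExpire = onExpire;
    this.slots = Array.from({ length: slotCount }, () => new Map<string, T>());
  }

  // Schedule (or reschedule) `key` to fire after `delayTicks` calls to tick().
  schedule(key: string, value: T, delayTicks: number): void {
    if (!Number.isInteger(delayTicks) || delayTicks < 1 || delayTicks > this.slotCount) {
      throw new RangeError(`delayTicks must be an integer in 1..${this.slotCount}`);
    }
    this.cancel(key);
    const slot = (this.cursor + delayTicks) % this.slotCount;
    this.slots[slot]?.set(key, value);
    this.slotOfKey.set(key, slot);
  }

  // Cancel a pending entry, e.g. when a waitpoint resolves before the delay elapses.
  cancel(key: string): boolean {
    const slot = this.slotOfKey.get(key);
    if (slot === undefined) return false;
    this.slots[slot]?.delete(key);
    this.slotOfKey.delete(key);
    return true;
  }

  // Advance one slot and fire everything that became due.
  tick(): void {
    this.cursor = (this.cursor + 1) % this.slotCount;
    const due = this.slots[this.cursor] ?? new Map<string, T>();
    this.slots[this.cursor] = new Map<string, T>();
    due.forEach((value, key) => {
      this.slotOfKey.delete(key);
      this.onExpire(key, value);
    });
  }
}
```

With a 100 ms tick, for instance, a 5 s delay is 50 ticks. An entry cancelled before its slot comes around never dispatches, which is how a short-lived waitpoint would avoid a snapshot in this sketch.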

OTel trace service (apps/supervisor/src/services/otlpTraceService.ts)

Fire-and-forget OTLP span emission for compute operations (provision, restore, snapshot)
BigInt nanosecond conversion preserving sub-ms precision for span ordering
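
To illustrate why BigInt matters here: a Unix timestamp in nanoseconds is roughly 1.7e18, well above `Number.MAX_SAFE_INTEGER` (about 9.0e15), so multiplying milliseconds by 1e6 as a plain number rounds away the low digits. The following is a sketch under assumptions, not the service's actual code; `epochMsToUnixNano` is a hypothetical name, and it assumes the input is epoch milliseconds that may carry a fractional part.

```typescript
// Hypothetical helper (name and signature are assumptions). Converts epoch
// milliseconds, possibly fractional, to a nanosecond string. OTLP/JSON carries
// 64-bit integers such as timeUnixNano as decimal strings.
function epochMsToUnixNano(epochMs: number): string {
  const wholeMs = Math.floor(epochMs);
  // Sub-millisecond remainder, rounded to whole nanoseconds.
  const fractionNs = Math.round((epochMs - wholeMs) * 1000000);
  // Do the multiplication in BigInt space so no digits are lost.
  return (BigInt(wholeMs) * BigInt(1000000) + BigInt(fractionNs)).toString();
}
```

Keeping the whole-millisecond part and the sub-millisecond remainder separate until both are BigInt values is what lets two spans that start within the same millisecond keep distinct, correctly ordered timestamps.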

Template creation (apps/webapp/app/v3/services/computeTemplateCreation.server.ts)

Three-mode rollout: required (MICROVM projects), shadow (feature flag / percentage), skip
Integrated into deploy finalize flow
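
The three modes can be pictured as a small decision function. This is a sketch under stated assumptions: the input field names are hypothetical, and treating the feature flag and the percentage as alternatives (either one selects shadow mode) is a guess at what "feature flag / percentage" means, not something the description confirms.

```typescript
type TemplateMode = "required" | "shadow" | "skip";

// All field names below are hypothetical, chosen for illustration.
interface RolloutInput {
  projectWorkloadType: string; // "MICROVM" projects are the required case per the description
  hasComputeAccess: boolean;   // feature flag named in the Database section
  rolloutPercentage: number;   // 0..100
  sample: number;              // value in [0, 100), e.g. derived from a stable hash
}

function resolveTemplateMode(input: RolloutInput): TemplateMode {
  if (input.projectWorkloadType === "MICROVM") return "required";
  // Assumption: flag OR percentage bucket is enough for shadow mode.
  if (input.hasComputeAccess || input.sample < input.rolloutPercentage) return "shadow";
  return "skip";
}
```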

Shared compute package (internal-packages/compute/)

Gateway client with namespace-based API (instances, templates, snapshots)
Zod schemas for all gateway request/response types

Database

COMPUTE variant added to TaskRunCheckpointType enum
WorkloadType enum and column on WorkerInstanceGroup
hasComputeAccess feature flag

Env / config

Compute gateway URL, auth token, timeout
Snapshot enable flag, delay, dispatch limit
Dedicated OTLP endpoint for compute spans (COMPUTE_TRACE_OTLP_ENDPOINT)

Add a third WorkloadManager implementation that creates sandboxes via the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which takes priority over Kubernetes and Docker providers.

The fetch() call had no timeout, causing infinite hangs when the gateway accepted requests but never returned responses. Adds AbortSignal.timeout (30s) and consolidates all logging into a single structured event per create() call with timing, status, and error context.
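
A minimal sketch of the timeout fix described above, using the standard `AbortSignal.timeout()`. The function name, the injectable `fetchImpl` parameter, and the payload shape are assumptions made so the example is self-contained; the PR's actual code may be structured differently.

```typescript
// Sketch only: names and payload shape are hypothetical.
type FetchLike = (url: string, init: RequestInit) => Promise<Response>;

async function postToGateway(
  url: string,
  body: unknown,
  timeoutMs: number = 30000,
  fetchImpl: FetchLike = fetch
): Promise<Response> {
  return fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
    // Aborts the request with a TimeoutError once timeoutMs elapses, so a
    // gateway that accepts the connection but never answers cannot hang the caller.
    signal: AbortSignal.timeout(timeoutMs),
  });
}
```

Because the signal is attached to the request itself, the promise rejects on timeout rather than staying pending, which gives the caller a definite error to log.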

Emit a single canonical log line in a finally block instead of scattered log calls at each early return. Adds business context (envId, envType, orgId, projectId, deploymentVersion, machine) and instanceName to the event. Always emits at info level with ok=true/false for queryability.
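
The pattern described above can be sketched as a wrapper that builds one event object and emits it from a `finally` block. This is an illustration under assumptions: `withWideEvent` and every field name other than `ok` are hypothetical, and the real create() presumably inlines this logic rather than using a helper.

```typescript
// Sketch only: helper name and field names other than `ok` are assumptions.
type WideEvent = Record<string, unknown>;

async function withWideEvent<T>(
  base: WideEvent,
  emit: (event: WideEvent) => void,
  operation: (event: WideEvent) => Promise<T>
): Promise<T> {
  const event: WideEvent = { ...base, ok: false };
  const startedAt = Date.now();
  try {
    // The operation may attach more context (status, instanceName, ...) as it goes.
    const result = await operation(event);
    event.ok = true;
    return result;
  } catch (error) {
    event.error = error instanceof Error ? error.message : String(error);
    throw error;
  } finally {
    // Exactly one log line per call, whether it succeeded, failed, or returned early.
    event.durationMs = Date.now() - startedAt;
    emit(event);
  }
}
```

Emitting at a single level with `ok` set to true or false keeps successes and failures in the same stream, so a query can filter on one field instead of joining across log levels.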

Pass business context (runId, envId, orgId, projectId, machine, etc.) as metadata on CreateSandboxRequest instead of relying on env vars. This enables wide event logging in the compute stack without parsing env or leaking secrets.

Passes machine preset cpu and memory as top-level fields on the CreateSandboxRequest so the compute stack can use them for admission control and resource allocation.

Thread timing context from queue consumer through to the compute workload manager's wide event:

- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration

All fields are optional to avoid breaking existing consumers.

- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore

…re request Restore calls now send a request body with the runner name, env override metadata, cpu, and memory so the agent can inject them before the VM resumes. The runner fetches these overrides from TRIGGER_METADATA_URL at restore time. runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix}, matching iceman's pattern.

Gates snapshot/restore behaviour independently of compute mode. When disabled, VMs won't receive the metadata URL and suspend/restore are no-ops. Defaults to off so compute mode can be used without snapshots.

changeset-bot · 2026-02-23T12:14:41Z

🦋 Changeset detected

Latest commit: 5b188a5

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages

Name	Type
trigger.dev	Patch
d3-chat	Patch
references-d3-openai-agents	Patch
references-nextjs-realtime	Patch
references-realtime-hooks-test	Patch
references-realtime-streams	Patch
references-telemetry	Patch
@trigger.dev/build	Patch
@trigger.dev/core	Patch
@trigger.dev/python	Patch
@trigger.dev/react-hooks	Patch
@trigger.dev/redis-worker	Patch
@trigger.dev/rsc	Patch
@trigger.dev/schema-to-json	Patch
@trigger.dev/sdk	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch
@internal/cache	Patch
@internal/clickhouse	Patch
@internal/llm-model-catalog	Patch
@internal/redis	Patch
@internal/replication	Patch
@internal/run-engine	Patch
@internal/schedule-engine	Patch
@internal/testcontainers	Patch
@internal/tracing	Patch
@internal/tsql	Patch
@internal/zod-worker	Patch
@internal/sdk-compat-tests	Patch

coderabbitai · 2026-02-23T12:14:55Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds end-to-end compute support: a new internal package @internal/compute (client, types, imageRef), supervisor compute workload manager and wiring (create/snapshot/restore), OTLP trace payload/dispatch, timer-wheel-based delayed snapshot orchestration and HTTP callback route, environment schema extensions, webapp compute template creation service with feature-flag and rollout logic, a DB migration adding WorkloadType and WorkerInstanceGroup.workloadType, propagation of dequeue/polling timing through the run queue, a CLI local-build --load behavior fix, and new tests and logging verbosity adjustments.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(supervisor): compute workload manager' accurately and clearly describes the main change: adding a ComputeWorkloadManager to the supervisor component.
Description check	✅ Passed	The PR description provides comprehensive technical details about ComputeWorkloadManager, checkpoint/restore support, OTel integration, and template pre-warming, with clear sections outlining major changes across supervisor, webapp, compute package, database, and env/config.

…nabled Remove the silent `localhost` fallback for the snapshot callback URL, which would be unreachable from external compute gateways. Add env validation and a runtime guard matching the existing metadata URL pattern.

Delay compute snapshot requests to avoid wasted work on short-lived waitpoints (e.g. triggerAndWait resolving in <5s). Configurable via COMPUTE_SNAPSHOT_DELAY_MS (default 5s).

nicktrn · 2026-03-29T12:50:14Z

ready

nicktrn added 17 commits February 11, 2026 09:44

chore: merge main into feat/compute-workload-manager

ccc8fe2

fix(supervisor): strip image digest in ComputeWorkloadManager

3175a10

feat: make gateway fetch timeout configurable

1bccd1e

feat(supervisor): send machine cpu/memory in compute sandbox requests

ac3dadf

Merge branch 'main' into HEAD

7e251d4

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

e4915c4

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

9466a47

fix(cli): fix --load flag on local/self-hosted builds

c1511f9

feat(supervisor): add flag to enable compute snapshots

4332743

feat(supervisor): require metadata URL when compute snapshots enabled

5089bba

nicktrn added 8 commits March 2, 2026 19:35

fix(supervisor): don't destroy compute instance after snapshot

e9b5fd3

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

0531a23

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

9572c7d

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

5032b7f

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

f3e0cb8

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

0edc308

feat(supervisor): add snapshot delay for compute path via timer wheel

63424fa

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

80b62d4

refactor: convert remaining compute types to zod schemas

641d6a3

fix: bound trace context map, gate on compute mode, use machine preset for templates

c1021f2

nicktrn added 2 commits March 27, 2026 16:51

fix: register trace context before restore/warm-start, sanitize logged headers

1005428

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

5ffc7d4

fix: shadow mode for org-level compute access, require only for MICROVM projects

64729bb

nicktrn added 13 commits March 27, 2026 22:47

fix: wrap writer.write in try/catch to handle client disconnect during deploy

e9bcbe4

feat: add OtlpTraceService

061c2fb

refactor: move otlp trace tests to services/

8711f5b

refactor: remove env import from compute workload manager

9d72ae2

refactor: use OtlpTraceService in workload server

18eb7bb

refactor: wire up OtlpTraceService to workload server, delete old otlpTrace module

91f9fa3

refactor: inline payload builder into trace service, extract traceparent helper

36ecdb5

fix: skip k8s integration tests by default, require K8S_INTEGRATION_TESTS=1

30df9e2

fix: review fixes - COMPUTE checkpoint type, memory_gb standardization, OTLP endpoint, snapshot concurrency

05a6721

fix: make snapshot dispatch limit configurable via COMPUTE_SNAPSHOT_DISPATCH_LIMIT

cacee1e

refactor: extract ComputeSnapshotService from workload server, fix zod version in compute package

680f156

fix: remove unnecessary re-export, import schema directly from @internal/compute, add zod pinning rule

5142954

Merge remote-tracking branch 'origin/main' into feat/compute-workload-manager

9925c72

nicktrn added 2 commits March 29, 2026 12:13

fix: use BigInt for OTLP nanosecond timestamps to avoid precision loss

48bbb87

docs: add server-changes for compute template pre-warming

5b188a5

nicktrn mentioned this pull request Mar 29, 2026

OTLP nanosecond timestamp overflow in webapp event repository #3292

Open

nicktrn added the ready label Mar 29, 2026

ericallam approved these changes Mar 29, 2026

nicktrn wants to merge 70 commits into main from feat/compute-workload-manager
