Host RPC timeout for environmentExecute ignores per-call timeoutMs (caps at 30s) #8

Closed
opened 2026-04-28 12:30:28 +00:00 by cpfarhood · 0 comments
cpfarhood commented 2026-04-28 12:30:28 +00:00 (Migrated from github.com)

Summary

executePluginEnvironmentCommand in server/src/services/plugin-environment-driver.ts calls the worker manager without forwarding params.timeoutMs, so the host RPC timeout always falls back to DEFAULT_RPC_TIMEOUT_MS (30 s). Any environment-driver execute call that legitimately runs longer than 30 s (e.g. a claude inference exec via the K8s or e2b sandbox provider) is abandoned by the host with a timeout error, even though the plugin's worker eventually returns the correct result.

The worker manager already supports a per-call timeout (callInternal at plugin-worker-manager.ts:1004-1044, with Math.min(timeoutMs ?? rpcTimeoutMs, MAX_RPC_TIMEOUT_MS)), so this is purely a missing forward at the call site.
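
To make the clamping concrete, here is a minimal sketch of the expression quoted above. The helper name effectiveRpcTimeoutMs is hypothetical (the issue only quotes the Math.min(...) expression), and the constant values are the ones stated in this issue rather than read from the source file.

```typescript
// Values as quoted in this issue; the real constants live in plugin-worker-manager.ts.
const DEFAULT_RPC_TIMEOUT_MS = 30_000;
const MAX_RPC_TIMEOUT_MS = 5 * 60 * 1000;

// Hypothetical helper wrapping the expression quoted from callInternal.
function effectiveRpcTimeoutMs(
  timeoutMs: number | undefined,
  rpcTimeoutMs: number = DEFAULT_RPC_TIMEOUT_MS,
): number {
  // A per-call value wins over the default, but never exceeds the host-side cap.
  return Math.min(timeoutMs ?? rpcTimeoutMs, MAX_RPC_TIMEOUT_MS);
}

console.log(effectiveRpcTimeoutMs(undefined)); // 30000 (what environmentExecute gets today)
console.log(effectiveRpcTimeoutMs(120_000));   // 120000 (per-call value honoured)
console.log(effectiveRpcTimeoutMs(600_000));   // 300000 (clamped to MAX_RPC_TIMEOUT_MS)
```

Because the call site never passes a per-call value, only the first case is ever reached for environmentExecute.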

Symptom

Plugin-worker logs show:

RPC call "environmentExecute" timed out after 30000ms
   at Timeout.<anonymous> (server/src/services/plugin-worker-manager.ts:1039:11)
WARN: received response for unknown request id  { id: 9 }

i.e. the worker DID finish and respond, but the host had already abandoned the request 30 s in.
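
The two log lines fit a host that tracks in-flight requests by id and drops the entry when its timer fires. The sketch below is an illustrative model only, not the actual plugin-worker-manager.ts code; the class and method names are hypothetical, and the timer is triggered by hand to keep the example deterministic.

```typescript
type PendingEntry = {
  resolve: (value: unknown) => void;
  reject: (error: Error) => void;
};

// Hypothetical model of a host-side pending-request table.
class PendingRequests {
  private readonly pending = new Map<number, PendingEntry>();
  readonly warnings: string[] = [];

  register(id: number, entry: PendingEntry): void {
    this.pending.set(id, entry);
  }

  // Stands in for the host-side timer firing.
  timeout(id: number, method: string, ms: number): void {
    const entry = this.pending.get(id);
    if (!entry) return;
    this.pending.delete(id); // the host forgets the request here
    entry.reject(new Error(`RPC call "${method}" timed out after ${ms}ms`));
  }

  // Stands in for a response message arriving from the worker.
  handleResponse(id: number, result: unknown): void {
    const entry = this.pending.get(id);
    if (!entry) {
      this.warnings.push(`received response for unknown request id ${id}`);
      return;
    }
    this.pending.delete(id);
    entry.resolve(result);
  }
}

const host = new PendingRequests();
let outcome = "pending";
host.register(9, {
  resolve: () => { outcome = "resolved"; },
  reject: (error) => { outcome = error.message; },
});
host.timeout(9, "environmentExecute", 30_000); // host gives up at 30 s
host.handleResponse(9, { exitCode: 0 });       // worker answers afterwards
```

In this model the caller sees only the timeout error, and the late result is discarded with a warning, which matches the log excerpt above.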

Affected

  • @farhoodlabs/paperclip-plugin-k8s (and paperclip-adapter-claude-k8s) running real claude inference inside the lease pod
  • @paperclipai/plugin-e2b would hit this for any command that runs >30 s in the sandbox. It follows the same pattern: it declares timeoutMs in config (default 300 000 ms) and threads params.timeoutMs ?? config.timeoutMs through to the SDK call, but the host RPC timeout never sees that value.
  • Likely any future environment-driver plugin running real LLM/agent inference

Reproduction

  1. Configure a K8s sandbox provider with default execTimeoutMs: 300000.
  2. Trigger a heartbeat run that produces a real claude inference call (~25–35 s wall time; only runs that exceed 30 s hit the timeout).
  3. Plugin worker logs: RPC call "environmentExecute" timed out after 30000ms even though the lease pod's claude process exited cleanly with stdout.

Minimal fix

server/src/services/plugin-environment-driver.ts:205 (the line that compiles to the call below):

// before
return await input.workerManager.call(plugin.id, "environmentExecute", input.params);

// after — forward the per-call timeout (capped to MAX_RPC_TIMEOUT_MS host-side)
return await input.workerManager.call(
  plugin.id,
  "environmentExecute",
  input.params,
  input.params.timeoutMs,
);

PluginEnvironmentExecuteParams.timeoutMs already exists in the protocol (packages/plugins/sdk/src/protocol.ts), and MAX_RPC_TIMEOUT_MS = 5 * 60 * 1000 in plugin-worker-manager.ts:61 will continue to enforce a sane upper bound. Plugins that don't pass a timeoutMs keep the existing 30 s default.
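
One way to guard the forward against regressions is a fake worker manager that records the fourth argument. This is a sketch under assumptions: the interfaces below are hypothetical reductions of the real types, the command field is invented for illustration, and the four-argument call(...) signature is taken from the proposed fix above rather than verified against the source.

```typescript
// Hypothetical minimal shapes; the real types live in the server and SDK packages.
interface ExecuteParamsLike {
  command: string; // illustrative field, not taken from the protocol
  timeoutMs?: number;
}

interface WorkerManagerLike {
  call(pluginId: string, method: string, params: unknown, timeoutMs?: number): Promise<unknown>;
}

// Mirrors the "after" call site: the per-call timeout is passed through unchanged
// and any clamping is left to the worker manager.
function executeWithForwardedTimeout(
  workerManager: WorkerManagerLike,
  pluginId: string,
  params: ExecuteParamsLike,
): Promise<unknown> {
  return workerManager.call(pluginId, "environmentExecute", params, params.timeoutMs);
}

// Fake manager that records the timeout each call was made with.
const recordedTimeouts: Array<number | undefined> = [];
const fakeManager: WorkerManagerLike = {
  call(_pluginId, _method, _params, timeoutMs) {
    recordedTimeouts.push(timeoutMs);
    return Promise.resolve({ exitCode: 0 });
  },
};

void executeWithForwardedTimeout(fakeManager, "k8s", { command: "claude", timeoutMs: 300_000 });
void executeWithForwardedTimeout(fakeManager, "k8s", { command: "true" });
```

After these two calls recordedTimeouts holds 300000 and undefined, so a plugin that omits timeoutMs still falls through to the 30 s default inside the worker manager.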

Notes

  • The same lift is probably worth applying to the other long-running executePluginEnvironment* RPCs (Acquire, Resume, RealizeWorkspace) for consistency, since they can also legitimately run > 30 s when an init container has to chown a large PVC or the sandbox provisions for the first time. Up to you whether to do them in the same PR.
  • This is independent of the tarExcludeFlags discussion in sandbox-managed-runtime.ts; that flow is correct as designed.