trebuchet/.claude/commands/debug.md at 4cbc4bc5e4b25e4e2d0cd29c631855e9d35d9ea5


ezl-keygraph bc8fd203ed feat: add npx CLI with monorepo, CI/CD, and ephemeral worker architecture (#256 )


2026-03-27 02:34:29 +05:30


description: Systematically debug errors using context analysis and structured recovery

You are debugging an issue. Follow this structured approach to avoid spinning in circles.

Step 1: Capture Error Context

- Read the full error message and stack trace
- Identify the layer where the error originated:
  - CLI/Args - Input validation, path resolution
  - Config Parsing - YAML parsing, JSON Schema validation (src/config-parser.ts)
  - Session Management - Agent definitions (src/session-manager.ts), mutex (src/utils/concurrency.ts)
  - DI Container - Container initialization/lookup (src/services/container.ts)
  - Services - AgentExecutionService, ConfigLoaderService, ExploitationCheckerService, error-handling (src/services/)
  - Audit System - Logging, metrics tracking, atomic writes (src/audit/)
  - Claude SDK - Agent execution, MCP servers, turn handling (src/ai/claude-executor.ts)
  - Git Operations - Checkpoints, rollback, commit (src/services/git-manager.ts)
  - Validation - Deliverable checks, queue validation (src/services/queue-validation.ts)
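The layer lookup above can be sketched as a path match against the stack trace. This is a rough illustration only: `LAYER_FRAGMENTS`, `identifyLayer`, and the sample stack are hypothetical and not part of Shannon, and it assumes stack frames show the source paths listed above rather than compiled output paths.

```typescript
// Hypothetical helper: map the first recognizable path in a stack trace to a layer.
// Path fragments come from the layer list above; specific files are listed before
// the broader src/services/ directory so they match first.
const LAYER_FRAGMENTS: ReadonlyArray<readonly [string, string]> = [
  ["src/config-parser.ts", "Config Parsing"],
  ["src/session-manager.ts", "Session Management"],
  ["src/utils/concurrency.ts", "Session Management"],
  ["src/services/container.ts", "DI Container"],
  ["src/services/git-manager.ts", "Git Operations"],
  ["src/services/queue-validation.ts", "Validation"],
  ["src/services/", "Services"],
  ["src/audit/", "Audit System"],
  ["src/ai/claude-executor.ts", "Claude SDK"],
];

function identifyLayer(stack: string): string {
  // Walk frames top-down so the innermost (originating) frame wins.
  for (const frame of stack.split("\n")) {
    for (const [fragment, layer] of LAYER_FRAGMENTS) {
      if (frame.includes(fragment)) return layer;
    }
  }
  return "unknown";
}

// Invented sample stack, for illustration only.
const sampleStack = [
  "Error: lock already released",
  "    at release (/app/src/utils/concurrency.ts:42:11)",
  "    at run (/app/src/services/agent-execution.ts:88:5)",
].join("\n");

console.log(identifyLayer(sampleStack)); // Session Management
```

The innermost frame decides the layer, so an error thrown in the mutex code is attributed to Session Management even though a service called it.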

Step 2: Check Relevant Logs

Session audit logs:

# Find most recent session
ls -lt workspaces/ | head -5

# Check session metrics and errors
jq '.errors, .agentMetrics' workspaces/<session>/session.json

# Check agent execution logs
ls -lt workspaces/<session>/agents/
cat workspaces/<session>/agents/<latest>.log

Step 3: Trace the Call Path

For Shannon, trace through these layers:

Worker + Client → src/temporal/worker.ts - Combined worker + workflow submission
Workflow → src/temporal/workflows.ts - Pipeline orchestration
Activities → src/temporal/activities.ts - Thin wrappers: heartbeat, error classification
Container → src/services/container.ts - Per-workflow DI
Services → src/services/agent-execution.ts - Agent lifecycle
Config → src/config-parser.ts via src/services/config-loader.ts
Prompts → src/services/prompt-manager.ts
Audit → src/audit/audit-session.ts - Logging facade, metrics tracking
Executor → src/ai/claude-executor.ts - SDK calls, MCP setup, retry logic
Validation → src/services/queue-validation.ts - Deliverable checks

Step 4: Identify Root Cause

Common Shannon-specific issues:

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Agent hangs indefinitely | MCP server crashed, Playwright timeout | Check Playwright logs in `/tmp/playwright-*` |
| "Validation failed: Missing deliverable" | Agent didn't create expected file | Check `deliverables/` dir, review prompt |
| Git checkpoint fails | Uncommitted changes, git lock | Run `git status`, remove `.git/index.lock` |
| "Session limit reached" | Claude API billing limit | Not retryable - check API usage |
| Parallel agents all fail | Shared resource contention | Check mutex usage, stagger startup timing |
| Cost/timing not tracked | Metrics not reloaded before update | Add `metricsTracker.reload()` before updates |
| session.json corrupted | Partial write during crash | Delete and restart, or restore from backup |
| YAML config rejected | Invalid schema or unsafe content | Run through AJV validator manually |
| Prompt variable not replaced | Missing `{{VARIABLE}}` in context | Check `src/services/prompt-manager.ts` interpolation |
| Service returns Err result | Check `ErrorCode` in Result | Trace through `classifyErrorForTemporal()` in `src/services/error-handling.ts` |
| Container not found | `getOrCreateContainer()` not called | Check activity setup code in `src/temporal/activities.ts` |
| ActivityLogger undefined | `createActivityLogger()` not called | Must be called at top of each activity function |
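For the "session.json corrupted" row, it can help to confirm that the file really fails to parse before deleting it. A minimal sketch, assuming the file's top-level value is a JSON object (the `jq` query in Step 2 suggests this); `checkSessionJson` is a hypothetical helper, not part of Shannon:

```typescript
type SessionCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: takes the raw text of session.json and reports whether it
// parses as a JSON object. It never modifies anything.
function checkSessionJson(text: string): SessionCheck {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch (error) {
    const detail = error instanceof Error ? error.message : String(error);
    return { ok: false, reason: `not valid JSON: ${detail}` };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "top-level value is not an object" };
  }
  return { ok: true };
}

const intact = '{"errors": [], "agentMetrics": {}}';
const truncated = '{"errors": [], "agentMe'; // simulates a partial write during a crash

console.log(checkSessionJson(intact).ok);    // true
console.log(checkSessionJson(truncated).ok); // false
```

A file that passes this check is probably not the cause of the failure, so deleting it would only discard session state.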

MCP Server Issues:

# Ensure the Playwright Chromium browser is installed (downloads it if missing)
npx playwright install chromium

# Check MCP server startup (look for connection errors)
grep -i "mcp\|playwright" workspaces/<session>/agents/*.log

Git State Issues:

# Check for uncommitted changes
git status

# Check for git locks
ls -la .git/*.lock

# View recent git operations from Shannon
git reflog | head -10

Step 5: Apply Fix with Retry Limit

- CRITICAL: Track consecutive failed attempts
- After 3 consecutive failures on the same issue, STOP and:
  - Summarize what was tried
  - Explain what's blocking progress
  - Ask the user for guidance or additional context
- After a successful fix, reset the failure counter
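The retry limit above amounts to a small counter that resets on success. A minimal sketch; the `FailureTracker` class and the attempt descriptions are hypothetical illustrations, not code from Shannon:

```typescript
// Hypothetical tracker for the "stop after 3 consecutive failures" rule.
class FailureTracker {
  private consecutiveFailures = 0;
  private readonly attempts: string[] = [];
  private readonly limit: number;

  constructor(limit = 3) {
    this.limit = limit;
  }

  // Record a failed fix attempt. Returns true once the limit is reached,
  // meaning debugging should stop and the user should be asked for guidance.
  recordFailure(whatWasTried: string): boolean {
    this.consecutiveFailures += 1;
    this.attempts.push(whatWasTried);
    return this.consecutiveFailures >= this.limit;
  }

  // A successful fix resets both the counter and the attempt history.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.attempts.length = 0;
  }

  // Numbered summary of what was tried, to hand back to the user when stopping.
  summary(): string {
    return this.attempts.map((attempt, i) => `${i + 1}. ${attempt}`).join("\n");
  }
}

const tracker = new FailureTracker();
tracker.recordFailure("removed .git/index.lock"); // false
tracker.recordFailure("re-ran with --pipeline-testing"); // false
const mustStop = tracker.recordFailure("restarted the worker"); // true: third consecutive failure

if (mustStop) {
  console.log(`Stopping after repeated failures. Tried so far:\n${tracker.summary()}`);
}
```

Keeping the attempt descriptions alongside the count means the summary of what was tried is already available when the stop condition triggers.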

Step 6: Validate the Fix

For code changes:

# Compile TypeScript
npx tsc --noEmit

# Quick validation run
shannon <URL> <REPO> --pipeline-testing

For audit/session issues:

- Verify session.json is valid JSON after fix
- Check that atomic writes complete without errors
- Confirm mutex release in finally blocks
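When confirming mutex release, the shape to look for is a release call inside `finally`, so a throwing critical section cannot block later waiters. A minimal sketch of that pattern; `SimpleMutex` and `withLock` are hypothetical stand-ins, and the actual implementation in src/utils/concurrency.ts may be structured differently:

```typescript
// Hypothetical promise-chain mutex, used only to illustrate the finally pattern.
class SimpleMutex {
  private tail: Promise<void> = Promise.resolve();

  // Resolves with a release function once every earlier holder has released.
  async acquire(): Promise<() => void> {
    let release: () => void = () => {};
    const next = new Promise<void>((resolve) => {
      release = resolve;
    });
    const previous = this.tail;
    this.tail = previous.then(() => next);
    await previous;
    return release;
  }
}

async function withLock<T>(mutex: SimpleMutex, work: () => Promise<T>): Promise<T> {
  const release = await mutex.acquire();
  try {
    return await work();
  } finally {
    // Runs on success and on throw, so a failing holder cannot deadlock later waiters.
    release();
  }
}

async function demo(): Promise<string[]> {
  const mutex = new SimpleMutex();
  const events: string[] = [];
  const failing = withLock(mutex, async () => {
    events.push("first: start");
    throw new Error("simulated crash");
  }).catch(() => {
    events.push("first: failed");
  });
  const following = withLock(mutex, async () => {
    events.push("second: start");
  });
  await Promise.all([failing, following]);
  return events;
}

demo().then((events) => console.log(events.join("\n")));
```

If `release()` were placed after `work()` instead of inside `finally`, the simulated crash would leave the second waiter pending forever, which is one way the "agent hangs" and "parallel agents all fail" symptoms can arise.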

For agent issues:

- Verify deliverable files are created in correct location
- Check that validation functions return expected results
- Confirm retry logic triggers on appropriate errors

Anti-Patterns to Avoid

- Don't delete session.json without checking if session is active
- Don't modify git state while an agent is running
- Don't retry billing/quota errors (they're not retryable)
- Don't ignore PentestError type - it indicates the error category
- Don't make random changes hoping something works
- Don't fix symptoms without understanding root cause
- Don't bypass mutex protection for "quick fixes"

Quick Reference: Error Types

The ErrorCode enum in src/types/errors.ts provides a finer-grained classification, which classifyErrorForTemporal() in src/services/error-handling.ts uses.

| PentestError Type | Meaning | Retryable? |
|-------------------|---------|------------|
| `config` | Configuration file issues | No |
| `network` | Connection/timeout issues | Yes |
| `tool` | External tool (nmap, etc.) failed | Yes |
| `prompt` | Claude SDK/API issues | Sometimes |
| `filesystem` | File read/write errors | Sometimes |
| `validation` | Deliverable validation failed | Yes (via retry) |
| `billing` | API quota/billing limit | No |
| `unknown` | Unexpected error | Depends |
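The table can be read as a lookup from error type to next debugging action. A minimal sketch that mirrors it; `NEXT_ACTION` and `nextAction` are hypothetical names, and the authoritative retry logic is whatever classifyErrorForTemporal() actually does, which may differ from this summary:

```typescript
// Hypothetical mirror of the table above, for illustration only.
type PentestErrorType =
  | "config"
  | "network"
  | "tool"
  | "prompt"
  | "filesystem"
  | "validation"
  | "billing"
  | "unknown";

type NextAction = "retry" | "stop" | "investigate";

const NEXT_ACTION: Record<PentestErrorType, NextAction> = {
  config: "stop", // No
  network: "retry", // Yes
  tool: "retry", // Yes
  prompt: "investigate", // Sometimes
  filesystem: "investigate", // Sometimes
  validation: "retry", // Yes (via retry)
  billing: "stop", // No
  unknown: "investigate", // Depends
};

function nextAction(type: string): NextAction {
  // Anything outside the known set is treated like `unknown`.
  const known = Object.prototype.hasOwnProperty.call(NEXT_ACTION, type);
  return known ? NEXT_ACTION[type as PentestErrorType] : "investigate";
}

console.log(nextAction("billing")); // stop
console.log(nextAction("network")); // retry
console.log(nextAction("prompt")); // investigate
```

The "Sometimes" and "Depends" rows map to `investigate` here because the table does not say which cases are retryable; that decision needs the specific ErrorCode.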

Now analyze the error and begin debugging systematically.
