# PAL MCP: Many Workflows. One Context.

_Formerly known as Zen MCP_

[PAL in action](https://github.com/user-attachments/assets/0d26061e-5f21-4ab1-b7d0-f883ddc2c3da)

👉 **[Watch more examples](#-watch-tools-in-action)**

### Your CLI + Multiple Models = Your AI Dev Team

**Use the 🤖 CLI you love:** [Claude Code](https://www.anthropic.com/claude-code) · [Gemini CLI](https://github.com/google-gemini/gemini-cli) · [Codex CLI](https://github.com/openai/codex) · [Qwen Code CLI](https://qwenlm.github.io/qwen-code-docs/) · [Cursor](https://cursor.com) · _and more_

**With multiple models within a single prompt:** Gemini · OpenAI · Anthropic · Grok · Azure · Ollama · OpenRouter · DIAL · On-Device Model
## 🆕 Now with CLI-to-CLI Bridge

The new `clink` (CLI + Link) tool connects external AI CLIs directly into your workflow:
- **External CLIs** - Connect CLIs like Gemini CLI, Codex CLI, and Claude Code directly into your workflow
- **CLI Subagents** - Launch isolated CLI instances from within your current CLI. Claude Code can spawn Codex subagents, Codex can spawn Gemini CLI subagents, and so on. Offload heavy tasks (code reviews, bug hunting) to fresh contexts while your main session's context window stays unpolluted. Each subagent returns only final results.
- **Context Isolation** - Run separate investigations without polluting your primary workspace
- **Role Specialization** - Spawn `planner`, `codereviewer`, or custom role agents with specialized system prompts
- **Full CLI Capabilities** - Web search, file inspection, MCP tool access, latest documentation lookups
- **Seamless Continuity** - Sub-CLIs participate as first-class members with full conversation context between tools
```text
# Codex spawns a Codex subagent for an isolated code review in a fresh context
clink with codex codereviewer to audit auth module for security issues
# The subagent reviews in isolation and returns a final report, so your context
# stays clean while codex reads each file and walks the directory structure
```

```text
# Consensus from different AI models → implementation handoff with full context preserved between tools
Use consensus with gpt-5 and gemini-pro to decide: dark mode or offline support next
Continue with clink gemini - implement the recommended feature
# Gemini receives the full debate context and starts coding immediately
```
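The isolation pattern behind these subagents can be sketched in plain Python: the parent session shells out to a separate process and keeps only its final output, so everything the child reads along the way never enters the parent's context. This is an illustrative sketch, not clink's actual implementation; the `echo` command stands in for a real CLI invocation.

```python
import subprocess

def run_subagent(cmd: list[str]) -> str:
    """Run an isolated subprocess and return only its final output.

    Intermediate work (file reads, directory walks) happens in the
    child's context; the parent keeps just the end result.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Stand-in for a real subagent invocation such as a codex review run;
# a shell echo keeps the sketch runnable anywhere.
report = run_subagent(["echo", "FINAL REPORT: no critical issues found"])
print(report)
```

The key property is that `capture_output=True` confines the child's chatter to the child: the parent only ever sees the summarized result.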
## Why PAL MCP?

**Why rely on one AI model when you can orchestrate them all?**
A Model Context Protocol server that supercharges tools like Claude Code, Codex CLI, and IDE clients such as Cursor or the Claude Dev VS Code extension. PAL MCP connects your favorite AI tool to multiple AI models for enhanced code analysis, problem-solving, and collaborative development.
### True AI Collaboration with Conversation Continuity
PAL supports conversation threading so your CLI can discuss ideas with multiple AI models, exchange reasoning, get second opinions, and even run collaborative debates between models to help you reach deeper insights and better solutions.
Your CLI always stays in control but gets perspectives from the best AI for each subtask. Context carries forward seamlessly across tools and models, enabling complex workflows like: code reviews with multiple models → automated planning → implementation → pre-commit validation.
You're in control. Your CLI of choice orchestrates the AI team, but you decide the workflow. Craft powerful prompts that bring in Gemini Pro, GPT-5, Flash, or local offline models exactly when needed.
## Reasons to Use PAL MCP
A typical workflow, with Claude Code as an example:

1. **Multi-Model Orchestration** - Claude coordinates with Gemini Pro, O3, GPT-5, and 50+ other models to get the best analysis for each task
2. **Context Revival Magic** - Even after Claude's context resets, continue conversations seamlessly by having other models "remind" Claude of the discussion
3. **Guided Workflows** - Enforces systematic investigation phases that prevent rushed analysis and ensure thorough code examination
4. **Extended Context Windows** - Break Claude's limits by delegating to Gemini (1M tokens) or O3 (200K tokens) for massive codebases
5. **True Conversation Continuity** - Full context flows across tools and models - Gemini remembers what O3 said 10 steps ago
6. **Model-Specific Strengths** - Extended thinking with Gemini Pro, blazing speed with Flash, strong reasoning with O3, privacy with local Ollama
7. **Professional Code Reviews** - Multi-pass analysis with severity levels, actionable feedback, and consensus from multiple AI experts
8. **Smart Debugging Assistant** - Systematic root cause analysis with hypothesis tracking and confidence levels
9. **Automatic Model Selection** - Claude intelligently picks the right model for each subtask (or you can specify)
10. **Vision Capabilities** - Analyze screenshots, diagrams, and visual content with vision-enabled models
11. **Local Model Support** - Run Llama, Mistral, or other models locally for complete privacy and zero API costs
12. **Bypass MCP Token Limits** - Automatically works around MCP's 25K limit for large prompts and responses

**The Killer Feature:** When Claude's context resets, just ask to "continue with O3" - the other model's response magically revives Claude's understanding without re-ingesting documents!

#### Example: Multi-Model Code Review Workflow

1. `Perform a codereview using gemini pro and o3 and use planner to generate a detailed plan, implement the fixes and do a final precommit check by continuing from the previous codereview`
2. This triggers a [`codereview`](docs/tools/codereview.md) workflow where Claude walks the code, looking for all kinds of issues
3. After multiple passes, Claude collects relevant code and notes issues along the way
4. It maintains a `confidence` level (`exploring`, `low`, `medium`, `high`, `certain`) to track how confidently it has been able to find and identify issues
5. It generates a detailed list of issues, critical → low
6. It shares the relevant files and findings with **Gemini Pro** for a second, deeper [`codereview`](docs/tools/codereview.md)
7. Claude takes Gemini's response and then does the same with O3, extending the prompt if a new discovery comes to light
8. When done, Claude combines all the feedback into a single list of critical → low issues, including good patterns in your code. The final list includes new findings or revisions in case Claude misunderstood or missed something crucial and one of the other models pointed this out
9. It then uses the [`planner`](docs/tools/planner.md) workflow to break the work down into simpler steps if a major refactor is required
10. Claude performs the actual work of fixing the highlighted issues
11. When done, Claude returns to Gemini Pro for a [`precommit`](docs/tools/precommit.md) review

All within a single conversation thread! Gemini Pro in step 11 _knows_ what was recommended by O3 in step 7, and takes that context and review into consideration to aid its final pre-commit review.

**Think of it as Claude Code _for_ Claude Code.** This MCP isn't magic. It's just **super-glue**.

> **Remember:** Claude stays in full control — but **YOU** call the shots.
> PAL is designed to have Claude engage other models only when needed — and to follow through with meaningful back-and-forth.
> **You're** the one who crafts the powerful prompt that makes Claude bring in Gemini, Flash, O3 — or fly solo.
> You're the guide. The prompter. The puppeteer.
>
> #### You are the AI - **Actually Intelligent**

## Recommended AI Stack
### For Claude Code Users

For best results when using [Claude Code](https://claude.ai/code):

- **Sonnet 4.5** - All agentic work and orchestration
- **Gemini 3.0 Pro** OR **GPT-5.2 / Pro** - Deep thinking, additional code reviews, debugging and validations, pre-commit analysis

### For Codex Users

For best results when using [Codex CLI](https://developers.openai.com/codex/cli):

- **GPT-5.2 Codex Medium** - All agentic work and orchestration
- **Gemini 3.0 Pro** OR **GPT-5.2-Pro** - Deep thinking, additional code reviews, debugging and validations, pre-commit analysis

## Quick Start (5 minutes)
**Prerequisites:** Python 3.10+, Git, and `uv` installed
**1. Get API Keys** (choose one or more):

- **OpenRouter** - Access multiple models with one API
- **Gemini** - Google's latest models
- **OpenAI** - O3, GPT-5 series
- **Azure OpenAI** - Enterprise deployments of GPT-4o, GPT-4.1, GPT-5 family
- **X.AI** - Grok models
- **DIAL** - Vendor-agnostic model access
- **Ollama** - Local models (free)
**2. Install** (choose one):

**Option A: Clone and Automatic Setup** (recommended)

```shell
git clone https://github.com/BeehiveInnovations/pal-mcp-server.git
cd pal-mcp-server

# Handles everything: setup, config, API keys from system environment.
# Auto-configures Claude Desktop, Claude Code, Gemini CLI, Codex CLI, Qwen CLI
# Enable / disable additional settings in .env
./run-server.sh
```
**Option B: Instant Setup with uvx**

```json
// Add to ~/.claude/settings.json or .mcp.json
// Don't forget to add your API keys under "env"
{
  "mcpServers": {
    "pal": {
      "command": "bash",
      "args": ["-c", "for p in $(which uvx 2>/dev/null) $HOME/.local/bin/uvx /opt/homebrew/bin/uvx /usr/local/bin/uvx uvx; do [ -x \"$p\" ] && exec \"$p\" --from git+https://github.com/BeehiveInnovations/pal-mcp-server.git pal-mcp-server; done; echo 'uvx not found' >&2; exit 1"],
      "env": {
        "PATH": "/usr/local/bin:/usr/bin:/bin:/opt/homebrew/bin:~/.local/bin",
        "GEMINI_API_KEY": "your-key-here",
        "DISABLED_TOOLS": "analyze,refactor,testgen,secaudit,docgen,tracer",
        "DEFAULT_MODEL": "auto"
      }
    }
  }
}
```
**3. Start Using!**

```text
"Use pal to analyze this code for security issues with gemini pro"
"Debug this error with o3 and then get flash to suggest optimizations"
"Plan the migration strategy with pal, get consensus from multiple models"
"clink with cli_name=\"gemini\" role=\"planner\" to draft a phased rollout plan"
```
👉 **Complete Setup Guide** - detailed installation, configuration for Gemini / Codex / Qwen, and troubleshooting

👉 **Cursor & VS Code Setup** - IDE integration instructions

📺 **Watch tools in action** - real-world examples
## Provider Configuration
PAL activates any provider that has credentials in your `.env`. See `.env.example` for deeper customization.
## Core Tools
**Note:** Each tool comes with its own multi-step workflow, parameters, and descriptions that consume valuable context window space even when not in use. To optimize performance, some tools are disabled by default. See Tool Configuration below to enable them.
### Collaboration & Planning (Enabled by default)
- clink - Bridge requests to external AI CLIs (Gemini planner, codereviewer, etc.)
- chat - Brainstorm ideas, get second opinions, validate approaches. With capable models (GPT-5.2 Pro, Gemini 3.0 Pro), generates complete code / implementation
- thinkdeep - Extended reasoning, edge case analysis, alternative perspectives
- planner - Break down complex projects into structured, actionable plans
- consensus - Get expert opinions from multiple AI models with stance steering
### Code Analysis & Quality
- debug - Systematic investigation and root cause analysis
- precommit - Validate changes before committing, prevent regressions
- codereview - Professional reviews with severity levels and actionable feedback
- analyze (disabled by default) - Understand architecture, patterns, dependencies across entire codebases
### Development Tools (Disabled by default)
- refactor - Intelligent code refactoring with decomposition focus
- testgen - Comprehensive test generation with edge cases
- secaudit - Security audits with OWASP Top 10 analysis
- docgen - Generate documentation with complexity analysis
### Utilities
- apilookup - Forces current-year API/SDK documentation lookups in a sub-process (saves tokens within the current context window), prevents outdated training data responses
- challenge - Prevent "You're absolutely right!" responses with critical analysis
- tracer (disabled by default) - Static analysis prompts for call-flow mapping
👉 Tool Configuration
### Default Configuration

To optimize context window usage, only essential tools are enabled by default.

**Enabled by default:**

- `chat`, `thinkdeep`, `planner`, `consensus` - Core collaboration tools
- `codereview`, `precommit`, `debug` - Essential code quality tools
- `apilookup` - Rapid API/SDK information lookup
- `challenge` - Critical thinking utility

**Disabled by default:**

- `analyze`, `refactor`, `testgen`, `secaudit`, `docgen`, `tracer`

### Enabling Additional Tools

To enable additional tools, remove them from the `DISABLED_TOOLS` list:

**Option 1: Edit your .env file**

```shell
# Default configuration (from .env.example)
DISABLED_TOOLS=analyze,refactor,testgen,secaudit,docgen,tracer

# To enable specific tools, remove them from the list
# Example: enable the analyze tool
DISABLED_TOOLS=refactor,testgen,secaudit,docgen,tracer

# To enable ALL tools
DISABLED_TOOLS=
```
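Conceptually, the server just subtracts the comma-separated `DISABLED_TOOLS` value from its tool registry, with essential tools always kept. A simplified sketch of that behavior (illustrative only, not PAL's actual code):

```python
ALL_TOOLS = {
    "chat", "thinkdeep", "planner", "consensus", "codereview", "precommit",
    "debug", "apilookup", "challenge", "analyze", "refactor", "testgen",
    "secaudit", "docgen", "tracer",
}
ESSENTIAL = {"version", "listmodels"}  # cannot be disabled

def active_tools(disabled_env: str) -> set[str]:
    """Compute the enabled tool set from a DISABLED_TOOLS-style string."""
    disabled = {t.strip() for t in disabled_env.split(",") if t.strip()}
    return (ALL_TOOLS - disabled) | ESSENTIAL

# Default .env value: six tools off
enabled = active_tools("analyze,refactor,testgen,secaudit,docgen,tracer")
assert "chat" in enabled and "analyze" not in enabled

# Empty value enables everything
assert ALL_TOOLS <= active_tools("")
```

Note how listing an essential tool in `DISABLED_TOOLS` has no effect, matching the rule that `version` and `listmodels` cannot be disabled.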
**Option 2: Configure in MCP settings**

```json
// In ~/.claude/settings.json or .mcp.json
{
  "mcpServers": {
    "pal": {
      "env": {
        // Tool configuration
        "DISABLED_TOOLS": "refactor,testgen,secaudit,docgen,tracer",
        "DEFAULT_MODEL": "pro",
        "DEFAULT_THINKING_MODE_THINKDEEP": "high",
        // API configuration
        "GEMINI_API_KEY": "your-gemini-key",
        "OPENAI_API_KEY": "your-openai-key",
        "OPENROUTER_API_KEY": "your-openrouter-key",
        // Logging and performance
        "LOG_LEVEL": "INFO",
        "CONVERSATION_TIMEOUT_HOURS": "6",
        "MAX_CONVERSATION_TURNS": "50"
      }
    }
  }
}
```
**Option 3: Enable all tools**

```json
// Remove or empty DISABLED_TOOLS to enable everything
{
  "mcpServers": {
    "pal": {
      "env": {
        "DISABLED_TOOLS": ""
      }
    }
  }
}
```
**Note:**
- Essential tools (`version`, `listmodels`) cannot be disabled
- After changing tool configuration, restart your Claude session for changes to take effect
- Each tool adds to context window usage, so only enable what you need
## 📺 Watch Tools In Action
### Chat Tool - Collaborative decision making and multi-turn conversations

**Picking Redis vs Memcached:** [Chat Redis or Memcached_web.webm](https://github.com/user-attachments/assets/41076cfe-dd49-4dfc-82f5-d7461b34705d)

**Multi-turn conversation with continuation:** [Chat With Gemini_web.webm](https://github.com/user-attachments/assets/37bd57ca-e8a6-42f7-b5fb-11de271e95db)

### Consensus Tool - Multi-model debate and decision making

**Multi-model consensus debate:** [PAL Consensus Debate](https://github.com/user-attachments/assets/76a23dd5-887a-4382-9cf0-642f5cf6219e)

### PreCommit Tool - Comprehensive change validation

**Pre-commit validation workflow:**

### API Lookup Tool - Current vs outdated API documentation

**Without PAL - outdated APIs:** [API without PAL](https://github.com/user-attachments/assets/01a79dc9-ad16-4264-9ce1-76a56c3580ee)

**With PAL - current APIs:** [API with PAL](https://github.com/user-attachments/assets/5c847326-4b66-41f7-8f30-f380453dce22)

### Challenge Tool - Critical thinking vs reflexive agreement

**Without PAL:**

**With PAL:**

## Key Features
### AI Orchestration

- **Auto model selection** - Claude picks the right AI for each task
- **Multi-model workflows** - Chain different models in single conversations
- **Conversation continuity** - Context preserved across tools and models
- **Context revival** - Continue conversations even after context resets

### Model Support

- **Multiple providers** - Gemini, OpenAI, Azure, X.AI, OpenRouter, DIAL, Ollama
- **Latest models** - GPT-5, Gemini 3.0 Pro, O3, Grok-4, local Llama
- **Thinking modes** - Control reasoning depth vs cost
- **Vision support** - Analyze images, diagrams, screenshots

### Developer Experience

- **Guided workflows** - Systematic investigation prevents rushed analysis
- **Smart file handling** - Auto-expand directories, manage token limits
- **Web search integration** - Access current documentation and best practices
- **Large prompt support** - Bypass MCP's 25K token limit
## Example Workflows
**Multi-model Code Review:**

> "Perform a codereview using gemini pro and o3, then use planner to create a fix strategy"

→ Claude reviews code systematically → Consults Gemini Pro → Gets O3's perspective → Creates unified action plan

**Collaborative Debugging:**

> "Debug this race condition with max thinking mode, then validate the fix with precommit"

→ Deep investigation → Expert analysis → Solution implementation → Pre-commit validation

**Architecture Planning:**

> "Plan our microservices migration, get consensus from pro and o3 on the approach"

→ Structured planning → Multiple expert opinions → Consensus building → Implementation roadmap
👉 Advanced Usage Guide for complex workflows, model configuration, and power-user features
## Quick Links
### 📖 Documentation

- Docs Overview - High-level map of major guides
- Getting Started - Complete setup guide
- Tools Reference - All tools with examples
- Advanced Usage - Power user features
- Configuration - Environment variables, restrictions
- Adding Providers - Provider-specific setup (OpenAI, Azure, custom gateways)
- Model Ranking Guide - How intelligence scores drive auto-mode suggestions

### 🔧 Setup & Support

- WSL Setup - Windows users
- Troubleshooting - Common issues
- Contributing - Code standards, PR process
## License
Apache 2.0 License - see LICENSE file for details.
## Acknowledgments
Built with the power of Multi-Model AI collaboration 🤝

- Actual Intelligence by real Humans
- MCP (Model Context Protocol)
- Codex CLI
- Claude Code
- Gemini
- OpenAI
- Azure OpenAI
## Star History
### Parameters

- **`prompt`** (required) - Your question or idea for collaborative thinking to be sent to the external model. Provide detailed context, including your goal, what you've tried, and any specific challenges. WARNING: Large inline code must NOT be shared in the prompt. Provide full paths to files on disk as a separate parameter.
- **`absolute_file_paths`** - Full, absolute file paths to relevant code to share with the external model
- **`images`** - Image paths (absolute) or base64 strings for optional visual context.
- **`working_directory_absolute_path`** (required) - Absolute path to an existing directory where generated code artifacts can be saved.
- **`model`** (required) - Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
- **`temperature`** - 0 = deterministic · 1 = creative.
- **`thinking_mode`** - Reasoning depth: minimal, low, medium, high, or max.
- **`continuation_id`** - Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
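As a sketch, a call with these parameters is just a JSON object whose required keys must be present. The helper below is illustrative (not part of PAL's API); the parameter names come from the table above, and the prompt and paths are made-up examples.

```python
REQUIRED = {"prompt", "working_directory_absolute_path", "model"}

def missing_required(args: dict) -> list[str]:
    """Return the required parameters absent from a tool call, sorted."""
    return sorted(REQUIRED - args.keys())

request = {
    "prompt": "Compare Redis and Memcached for session storage",
    "working_directory_absolute_path": "/tmp/pal-artifacts",  # hypothetical path
    "model": "auto",
    "temperature": 0.2,
    "thinking_mode": "medium",
}
assert missing_required(request) == []
assert missing_required({"prompt": "hi"}) == ["model", "working_directory_absolute_path"]
```

Optional keys like `continuation_id` simply pass through when present; only the three required ones gate the call.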
### Parameters

- **`prompt`** (required) - User request forwarded to the CLI (conversation context is pre-applied).
- **`cli_name`** (required) - Configured CLI client name (from conf/cli_clients). Available: claude, codex, gemini
- **`role`** - Optional role preset defined for the selected CLI (defaults to 'default'). Roles per CLI: claude: codereviewer, default, planner; codex: codereviewer, default, planner; gemini: codereviewer, default, planner
- **`absolute_file_paths`** - Full paths to relevant code
- **`images`** - Optional absolute image paths or base64 blobs for visual context.
- **`continuation_id`** - Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
### Parameters

- **`step`** (required) - Current work step content and findings from your overall work
- **`step_number`** (required) - Current step number in work sequence (starts at 1)
- **`total_steps`** (required) - Estimated total steps needed to complete work
- **`next_step_required`** (required) - Whether another work step is needed. When false, aim to reduce total_steps to match step_number to avoid mismatch.
- **`findings`** (required) - Important findings, evidence and insights discovered in this step
- **`files_checked`** - List of files examined during this work step
- **`relevant_files`** - Files identified as relevant to issue/goal (FULL absolute paths to real files/folders - DO NOT SHORTEN)
- **`relevant_context`** - Methods/functions identified as involved in the issue
- **`issues_found`** - Issues identified with severity levels during work
- **`confidence`** - Confidence level: exploring (just starting), low (early investigation), medium (some evidence), high (strong evidence), very_high (comprehensive understanding), almost_certain (near complete confidence), certain (100% confidence locally - no external validation needed)
- **`hypothesis`** - Current theory about issue/goal based on work
- **`use_assistant_model`** - Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
- **`temperature`** - 0 = deterministic · 1 = creative.
- **`thinking_mode`** - Reasoning depth: minimal, low, medium, high, or max.
- **`continuation_id`** - Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
- **`images`** - Optional absolute image paths or base64 blobs for visual context.
- **`model`** (required) - Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
- **`problem_context`** - Additional context about problem/goal. Be expressive.
- **`focus_areas`** - Focus aspects (architecture, performance, security, etc.)
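The step bookkeeping above (`step_number`, `total_steps`, `next_step_required`) carries a simple invariant: when `next_step_required` goes false, `total_steps` should be brought down to match `step_number`. A minimal sketch of that rule (illustrative, not the server's code):

```python
def finalize_step(step_number: int, total_steps: int,
                  next_step_required: bool) -> tuple[int, int]:
    """Clamp total_steps to the current step when the workflow ends,
    as the parameter docs above recommend."""
    if not next_step_required and total_steps != step_number:
        total_steps = step_number  # avoid a step-count mismatch
    return step_number, total_steps

# Mid-investigation: the estimate stands
assert finalize_step(2, 5, next_step_required=True) == (2, 5)
# Final step reached earlier than estimated: clamp 5 -> 3
assert finalize_step(3, 5, next_step_required=False) == (3, 3)
```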
### Parameters

- **`step`** (required) - Planning content for this step. Step 1: describe the task, problem and scope. Later steps: capture updates, revisions, branches, or open questions that shape the plan.
- **`step_number`** (required) - Current step number in work sequence (starts at 1)
- **`total_steps`** (required) - Estimated total steps needed to complete work
- **`next_step_required`** (required) - Whether another work step is needed. When false, aim to reduce total_steps to match step_number to avoid mismatch.
- **`use_assistant_model`** - Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
- **`continuation_id`** - Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
- **`model`** (required) - Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
- **`is_step_revision`** - Set true when you are replacing a previously recorded step.
- **`revises_step_number`** - Step number being replaced when revising.
- **`is_branch_point`** - True when this step creates a new branch to explore an alternative path.
- **`branch_from_step`** - If branching, the step number that this branch starts from.
- **`branch_id`** - Name for this branch (e.g. 'approach-A', 'migration-path').
- **`more_steps_needed`** - True when you now expect to add additional steps beyond the prior estimate.
### Parameters

- **`step`** (required) - Consensus prompt. Step 1: write the exact proposal/question every model will see (use 'Evaluate…', not meta commentary). Steps 2+: capture internal notes about the latest model response—these notes are NOT sent to other models.
- **`step_number`** (required) - Current step index (starts at 1). Step 1 is your analysis; steps 2+ handle each model response.
- **`total_steps`** (required) - Total steps = number of models consulted plus the final synthesis step.
- **`next_step_required`** (required) - True if more model consultations remain; set false when ready to synthesize.
- **`findings`** (required) - Step 1: your independent analysis for later synthesis (not shared with other models). Steps 2+: summarize the newest model response.
- **`relevant_files`** - Optional supporting files that help the consensus analysis. Must be absolute, full, non-abbreviated paths.
- **`use_assistant_model`** - Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
- **`continuation_id`** - Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
- **`images`** - Optional absolute image paths or base64 references that add helpful visual context.
- **`models`** - User-specified roster of models to consult (provide at least two entries). Each entry may include model, stance (for/against/neutral), and stance_prompt. Each (model, stance) pair must be unique, e.g. [{'model':'gpt5','stance':'for'}, {'model':'pro','stance':'against'}]. When the user names a model, you MUST use that exact value or report the provider error—never swap in another option. Use the `listmodels` tool for the full roster. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
- **`current_model_index`** - 0-based index of the next model to consult (managed internally).
- **`model_responses`** - Internal log of responses gathered so far.
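The roster rules for `models` (at least two entries, each (model, stance) pair unique, stance defaulting to neutral) can be sketched as a small validator. This is an illustrative check, not PAL's actual validation code:

```python
def validate_roster(models: list[dict]) -> None:
    """Enforce the consensus roster rules described above; raise on violation."""
    if len(models) < 2:
        raise ValueError("provide at least two entries")
    seen = set()
    for entry in models:
        pair = (entry["model"], entry.get("stance", "neutral"))
        if pair in seen:
            raise ValueError(f"duplicate (model, stance) pair: {pair}")
        seen.add(pair)

# Valid: two distinct (model, stance) pairs
validate_roster([{"model": "gpt5", "stance": "for"},
                 {"model": "pro", "stance": "against"}])

# Invalid: both entries default to the same ('gpt5', 'neutral') pair
try:
    validate_roster([{"model": "gpt5"}, {"model": "gpt5"}])
except ValueError as e:
    print("rejected:", e)
```

Note that the same model may appear twice with different stances; only identical (model, stance) pairs are rejected.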
### Parameters

- **`step`** (required) - Review narrative. Step 1: outline the review strategy. Later steps: report findings. MUST cover quality, security, performance, and architecture. Reference code via `relevant_files`; avoid dumping large snippets.
- **`step_number`** (required) - Current review step (starts at 1) – each step should build on the last.
- **`total_steps`** (required) - Number of review steps planned. External validation: two steps (analysis + summary). Internal validation: one step. Use the same limits when continuing an existing review via continuation_id.
- **`next_step_required`** (required) - True when another review step follows. External validation: step 1 → True, step 2 → False. Internal validation: set False immediately. Apply the same rule on continuation flows.
- **`findings`** (required) - Capture findings (positive and negative) across quality, security, performance, and architecture; update each step.
- **`files_checked`** - Absolute paths of every file reviewed, including those ruled out.
- **`relevant_files`** - Step 1: list all files/dirs under review. Must be absolute, full, non-abbreviated paths. Final step: narrow to files tied to key findings.
- **`relevant_context`** - Methods/functions identified as involved in the issue
- **`issues_found`** - Issues with severity (critical/high/medium/low) and descriptions.
- **`confidence`** - Confidence level: exploring (just starting), low (early investigation), medium (some evidence), high (strong evidence), very_high (comprehensive understanding), almost_certain (near complete confidence), certain (100% confidence locally - no external validation needed)
- **`hypothesis`** - Current theory about issue/goal based on work
- **`use_assistant_model`** - Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
- **`temperature`** - 0 = deterministic · 1 = creative.
- **`thinking_mode`** - Reasoning depth: minimal, low, medium, high, or max.
- **`continuation_id`** - Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
- **`images`** - Optional diagram or screenshot paths that clarify review context.
- **`model`** (required) - Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
- **`review_validation_type`** - Set 'external' (default) for expert follow-up or 'internal' for local-only review.
- **`review_type`** - Review focus: full, security, performance, or quick.
- **`focus_on`** - Optional note on areas to emphasise (e.g. 'threading', 'auth flow').
- **`standards`** - Coding standards or style guides to enforce.
- **`severity_filter`** - Lowest severity to include when reporting issues (critical/high/medium/low/all).
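`severity_filter` names the lowest severity that still gets reported. Conceptually it is a threshold over the ordered scale low < medium < high < critical, with 'all' disabling the cutoff. An illustrative sketch (not the server's implementation; the sample issues are made up):

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

def filter_issues(issues: list[dict], severity_filter: str) -> list[dict]:
    """Keep issues at or above the requested severity ('all' keeps everything)."""
    if severity_filter == "all":
        return issues
    threshold = SEVERITY_ORDER.index(severity_filter)
    return [i for i in issues
            if SEVERITY_ORDER.index(i["severity"]) >= threshold]

issues = [
    {"severity": "critical", "description": "SQL injection in login"},
    {"severity": "low", "description": "inconsistent naming"},
]
assert [i["severity"] for i in filter_issues(issues, "high")] == ["critical"]
assert len(filter_issues(issues, "all")) == 2
```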
Parameters
step
Step 1: outline how you'll validate the git changes. Later steps: report findings. Review diffs and impacts, use `relevant_files`, and avoid pasting large snippets.
required
step_number
Current pre-commit step number (starts at 1).
required
total_steps
Planned number of validation steps. External validation: use at most three (analysis → follow-ups → summary). Internal validation: a single step. Honour these limits when resuming via continuation_id.
required
next_step_required
True to continue with another step, False when validation is complete. CRITICAL: If total_steps>=3 or when `precommit_type = external`, set to True until the final step. When continuation_id is provided: Follow the same validation rules based on precommit_type.
required
findings
Record git diff insights, risks, missing tests, security concerns, and positives; update previous notes as you go.
required
files_checked
Absolute paths for every file examined, including ruled-out candidates.
relevant_files
Absolute paths of files involved in the change or validation (code, configs, tests, docs). Must be absolute full non-abbreviated paths.
relevant_context
Methods/functions identified as involved in the issue
issues_found
List issues with severity (critical/high/medium/low) plus descriptions (bugs, security, performance, coverage).
confidence
Confidence level: exploring (just starting), low (early investigation), medium (some evidence), high (strong evidence), very_high (comprehensive understanding), almost_certain (near complete confidence), certain (100% confidence locally - no external validation needed)
hypothesis
Current theory about issue/goal based on work
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
temperature
0 = deterministic · 1 = creative.
thinking_mode
Reasoning depth: minimal, low, medium, high, or max.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
images
Optional absolute paths to screenshots or diagrams that aid validation.
model
Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
required
precommit_type
'external' (default, triggers expert model) or 'internal' (local-only validation).
path
Absolute path to the repository root. Required in step 1.
compare_to
Optional git ref (branch/tag/commit) to diff against; falls back to staged/unstaged changes.
include_staged
Whether to inspect staged changes (ignored when `compare_to` is set).
include_unstaged
Whether to inspect unstaged changes (ignored when `compare_to` is set).
focus_on
Optional emphasis areas such as security, performance, or test coverage.
severity_filter
Lowest severity to include when reporting issues.
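Under the parameter definitions above, a step-1 precommit invocation might look like the following sketch. All values (repo path, branch name, step text) are illustrative, not taken from a real session:

```python
# Hypothetical step-1 arguments for the precommit workflow tool.
# Every value here is illustrative.
precommit_step1 = {
    "step": "Validate the pending auth changes before commit.",
    "step_number": 1,
    "total_steps": 2,
    "next_step_required": True,
    "findings": "Starting validation of the pending diff.",
    "path": "/home/dev/myrepo",    # repository root, absolute (required in step 1)
    "compare_to": "main",          # diff against a ref instead of staged/unstaged changes
    "precommit_type": "external",  # default: hand off to the expert model at the end
    "severity_filter": "low",      # report every severity from low upward
}
```

Because `compare_to` is set, `include_staged` and `include_unstaged` would be ignored for this call.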
Parameters
step
Investigation step. Step 1: state the issue and the direction of investigation. Symptoms may be misleading, and 'no bug found' is a valid conclusion. Trace dependencies and verify hypotheses. Use relevant_files for code references; keep this field to text only.

required
step_number
Current step index (starts at 1). Build upon previous steps.
required
total_steps
Estimated total steps needed to complete the investigation. Adjust as new findings emerge. IMPORTANT: When continuation_id is provided (continuing a previous conversation), set this to 1 as we're not starting a new multi-step investigation.
required
next_step_required
True if you plan to continue the investigation with another step. False means root cause is known or investigation is complete. IMPORTANT: When continuation_id is provided (continuing a previous conversation), set this to False to immediately proceed with expert analysis.
required
findings
Discoveries: clues, code/log evidence, disproven theories. Be specific. If no bug found, document clearly as valid.
required
files_checked
All examined files (absolute paths), including ruled-out ones.
relevant_files
Files directly relevant to issue (absolute paths). Cause, trigger, or manifestation locations.
relevant_context
Methods/functions identified as involved in the issue
issues_found
Issues identified with severity levels during work
confidence
Your confidence in the hypothesis: exploring (starting out), low (early idea), medium (some evidence), high (strong evidence), very_high (very strong evidence), almost_certain (nearly confirmed), certain (100% confidence - root cause and fix are both confirmed locally with no need for external validation). WARNING: Do NOT use 'certain' unless the issue can be fully resolved with a fix, use 'very_high' or 'almost_certain' instead when not 100% sure. Using 'certain' means you have ABSOLUTE confidence locally and PREVENTS external model validation.
hypothesis
Concrete root cause theory from evidence. Can revise. Valid: 'No bug found - user misunderstanding' or 'Symptoms unrelated to code' if supported.
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
temperature
0 = deterministic · 1 = creative.
thinking_mode
Reasoning depth: minimal, low, medium, high, or max.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
images
Optional screenshots/visuals clarifying issue (absolute paths).
model
Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
required
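The continuation rules above (total_steps set to 1 and next_step_required set to False when continuation_id is present) can be illustrated with a hypothetical resumed debug call; the ID and findings are placeholders:

```python
# Illustrative arguments for resuming a debug investigation in an existing
# conversation thread. Per the schema above, a continuation sets total_steps
# to 1 and next_step_required to False so expert analysis runs immediately.
debug_continuation = {
    "step": "Confirmed the missing null check in the session handler.",
    "step_number": 1,
    "total_steps": 1,             # continuing, not starting a multi-step investigation
    "next_step_required": False,  # proceed straight to expert analysis
    "findings": "Crash reproduces only when the session cookie is absent.",
    "confidence": "high",         # strong evidence, but leave room for external validation
    "continuation_id": "thread-placeholder-id",  # reuse the last ID you were given
}
```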
Parameters
step
Step 1: outline the audit strategy (OWASP Top 10, auth, validation, etc.). Later steps: report findings. MANDATORY: use `relevant_files` for code references and avoid large snippets.
required
step_number
Current security-audit step number (starts at 1).
required
total_steps
Expected number of audit steps; adjust as new risks surface.
required
next_step_required
True while additional threat analysis remains; set False once you are ready to hand off for validation.
required
findings
Summarize vulnerabilities, auth issues, validation gaps, compliance notes, and positives; update prior findings as needed.
required
files_checked
Absolute paths for every file inspected, including rejected candidates.
relevant_files
Absolute paths for security-relevant files (auth modules, configs, sensitive code).
relevant_context
Methods/functions identified as involved in the issue
issues_found
Security issues with severity (critical/high/medium/low) and descriptions (vulns, auth flaws, injection, crypto, config).
confidence
exploring/low/medium/high/very_high/almost_certain/certain. 'certain' blocks external validation—use only when fully complete.
hypothesis
Current theory about issue/goal based on work
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
temperature
0 = deterministic · 1 = creative.
thinking_mode
Reasoning depth: minimal, low, medium, high, or max.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
images
Optional absolute paths to diagrams or threat models that inform the audit.
model
Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
required
security_scope
Security context (web, mobile, API, cloud, etc.) including stack, user types, data sensitivity, and threat landscape.
threat_level
Assess the threat level: low (internal/low-risk), medium (customer-facing/business data), high (regulated or sensitive), critical (financial/healthcare/PII).
compliance_requirements
Applicable compliance frameworks or standards (SOC2, PCI DSS, HIPAA, GDPR, ISO 27001, NIST, etc.).
audit_focus
Primary focus area: owasp, compliance, infrastructure, dependencies, or comprehensive.
severity_filter
Minimum severity to include when reporting security issues.
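A step-1 secaudit call under these definitions might be sketched as follows; the scope, frameworks, and list shape of compliance_requirements are illustrative assumptions:

```python
# Hypothetical step-1 arguments for the secaudit workflow (values illustrative).
secaudit_step1 = {
    "step": "Audit plan: map auth flows, then check OWASP A01-A10 systematically.",
    "step_number": 1,
    "total_steps": 3,
    "next_step_required": True,
    "findings": "Scoping the audit; no findings yet.",
    "security_scope": "Public REST API, Python backend, handles customer PII.",
    "threat_level": "high",                     # regulated or sensitive data
    "compliance_requirements": ["GDPR", "SOC2"],  # assumed list-of-strings shape
    "audit_focus": "owasp",
    "severity_filter": "medium",                # omit low-severity noise
}
```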
Parameters
step
Current work step content and findings from your overall work
required
step_number
Current step number in work sequence (starts at 1)
required
total_steps
Estimated total steps needed to complete work
required
next_step_required
Whether another work step is needed. When false, set total_steps to match step_number to avoid a mismatch.
required
findings
Important findings, evidence and insights discovered in this step
required
relevant_files
Files identified as relevant to issue/goal (FULL absolute paths to real files/folders - DO NOT SHORTEN)
relevant_context
Methods/functions identified as involved in the issue
issues_found
Issues identified with severity levels during work
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
document_complexity
Include algorithmic complexity (Big O) analysis when True (default).
required
document_flow
Include call flow/dependency notes when True (default).
required
update_existing
True (default) to polish inaccurate or outdated docs instead of leaving them untouched.
required
comments_on_complex_logic
True (default) to add inline comments around non-obvious logic.
required
num_files_documented
Count of files finished so far. Increment only when a file is fully documented.
required
total_files_to_document
Total files identified in discovery; completion requires matching this count.
required
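The two counters above define the docgen completion condition: the tool is done only when the documented-file count reaches the discovery total. A minimal sketch, with illustrative numbers:

```python
# Illustrative docgen bookkeeping: completion requires the documented-file
# counter to reach the total identified during discovery.
docgen_state = {
    "num_files_documented": 3,      # increment only when a file is fully documented
    "total_files_to_document": 5,   # fixed by discovery
    "document_complexity": True,    # include Big O notes (default)
    "document_flow": True,          # include call-flow notes (default)
}
docgen_done = (
    docgen_state["num_files_documented"] == docgen_state["total_files_to_document"]
)
```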
Parameters
step
The analysis plan. Step 1: State your strategy, including how you will map the codebase structure, understand business logic, and assess code quality, performance implications, and architectural patterns. Later steps: Report findings and adapt the approach as new insights emerge.
required
step_number
The index of the current step in the analysis sequence, beginning at 1. Each step should build upon or revise the previous one.
required
total_steps
Your current estimate for how many steps will be needed to complete the analysis. Adjust as new findings emerge.
required
next_step_required
Set to true if you plan to continue the investigation with another step. False means you believe the analysis is complete and ready for expert validation.
required
findings
Summary of discoveries from this step, including architectural patterns, tech stack assessment, scalability characteristics, performance implications, maintainability factors, and strategic improvement opportunities. IMPORTANT: Document both strengths (good patterns, solid architecture) and concerns (tech debt, overengineering, unnecessary complexity). In later steps, confirm or update past findings with additional evidence.
required
files_checked
List all files examined (absolute paths). Include even ruled-out files to track exploration path.
relevant_files
Subset of files_checked directly relevant to analysis findings (absolute paths). Include files with significant patterns, architectural decisions, or strategic improvement opportunities.
relevant_context
Methods/functions identified as involved in the issue
issues_found
Issues or concerns identified during analysis, each with severity level (critical, high, medium, low)
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
temperature
0 = deterministic · 1 = creative.
thinking_mode
Reasoning depth: minimal, low, medium, high, or max.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
images
Optional absolute paths to architecture diagrams or visual references that help with analysis context.
model
Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
required
confidence
Your confidence in the analysis: exploring, low, medium, high, very_high, almost_certain, or certain. 'certain' indicates the analysis is complete and ready for validation.
analysis_type
Type of analysis to perform (architecture, performance, security, quality, general)
output_format
How to format the output (summary, detailed, actionable)
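Combining the parameters above, an architecture-focused step-1 analyze call might look like this sketch (all values are examples, not from a real session):

```python
# Illustrative analyze-workflow arguments: an architecture pass that asks
# for actionable output (values are examples only).
analyze_args = {
    "step": "Map module boundaries, then assess coupling and layering.",
    "step_number": 1,
    "total_steps": 2,
    "next_step_required": True,
    "findings": "Beginning the architecture survey.",
    "analysis_type": "architecture",  # one of: architecture, performance, security, quality, general
    "output_format": "actionable",    # one of: summary, detailed, actionable
}
```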
Parameters
step
The refactoring plan. Step 1: State strategy. Later steps: Report findings. CRITICAL: Examine code for smells and opportunities for decomposition, modernization, and organization. Use 'relevant_files' for code. FORBIDDEN: Large code snippets.
required
step_number
The index of the current step in the refactoring investigation sequence, beginning at 1. Each step should build upon or revise the previous one.
required
total_steps
Your current estimate for how many steps will be needed to complete the refactoring investigation. Adjust as new opportunities emerge.
required
next_step_required
Set to true if you plan to continue the investigation with another step. False means you believe the refactoring analysis is complete and ready for expert validation.
required
findings
Summary of discoveries from this step, including code smells and opportunities for decomposition, modernization, or organization. Document both strengths and weaknesses. In later steps, confirm or update past findings.
required
files_checked
List all files examined (absolute paths). Include even ruled-out files to track exploration path.
relevant_files
Subset of files_checked with code requiring refactoring (absolute paths). Include files with code smells, decomposition needs, or improvement opportunities.
relevant_context
Methods/functions identified as involved in the issue
issues_found
Refactoring opportunities as dictionaries with 'severity' (critical/high/medium/low), 'type' (codesmells/decompose/modernize/organization), and 'description'. Include all improvement opportunities found.
confidence
Your confidence in refactoring analysis: exploring (starting), incomplete (significant work remaining), partial (some opportunities found, more analysis needed), complete (comprehensive analysis finished, all major opportunities identified). WARNING: Use 'complete' ONLY when fully analyzed and can provide recommendations without expert help. 'complete' PREVENTS expert validation. Use 'partial' for large files or uncertain analysis.
hypothesis
Current theory about issue/goal based on work
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
temperature
0 = deterministic · 1 = creative.
thinking_mode
Reasoning depth: minimal, low, medium, high, or max.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
images
Optional list of absolute paths to architecture diagrams, UI mockups, design documents, or visual references that help with refactoring context. Only include if they materially assist understanding or assessment.
model
Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
required
refactor_type
Type of refactoring analysis to perform (codesmells, decompose, modernize, organization)
focus_areas
Specific areas to focus on (e.g., 'performance', 'readability', 'maintainability', 'security')
style_guide_examples
Optional existing code files to use as style/pattern reference (must be FULL absolute paths to real files / folders - DO NOT SHORTEN). These files represent the target coding style and patterns for the project.
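The issues_found shape described above (dictionaries with 'severity', 'type', and 'description') can be sketched with illustrative entries:

```python
# Illustrative issues_found entries for the refactor workflow; each dictionary
# carries severity, type, and description per the schema above.
refactor_issues = [
    {
        "severity": "high",
        "type": "decompose",
        "description": "500-line function mixes parsing, validation, and I/O.",
    },
    {
        "severity": "low",
        "type": "modernize",
        "description": "String formatting uses % interpolation instead of f-strings.",
    },
]
```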
Parameters
step
Current work step content and findings from your overall work
required
step_number
Current step number in work sequence (starts at 1)
required
total_steps
Estimated total steps needed to complete work
required
next_step_required
Whether another work step is needed. When false, set total_steps to match step_number to avoid a mismatch.
required
findings
Important findings, evidence and insights discovered in this step
required
files_checked
List of files examined during this work step
relevant_files
Files identified as relevant to issue/goal (FULL absolute paths to real files/folders - DO NOT SHORTEN)
relevant_context
Methods/functions identified as involved in the issue
confidence
Confidence level: exploring (just starting), low (early investigation), medium (some evidence), high (strong evidence), very_high (comprehensive understanding), almost_certain (near complete confidence), certain (100% confidence locally - no external validation needed)
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
images
Optional paths to architecture diagrams or flow charts that help understand the tracing context.
model
Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
required
trace_mode
Type of tracing: 'ask' (default - prompts user to choose mode), 'precision' (execution flow) or 'dependencies' (structural relationships)
required
target_description
Description of what to trace and WHY. Include context about what you're trying to understand or analyze.
required
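A tracer call using the 'precision' mode above might be sketched like this; the target function and step text are hypothetical:

```python
# Illustrative tracer arguments: precision mode follows execution flow from a
# named entry point; 'dependencies' would map structural relationships instead.
tracer_args = {
    "step": "Trace how login requests reach the token issuer.",
    "step_number": 1,
    "total_steps": 2,
    "next_step_required": True,
    "findings": "Starting the trace.",
    "trace_mode": "precision",
    "target_description": "Trace authenticate() to understand where tokens are minted.",
}
```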
Parameters
step
Test plan for this step. Step 1: outline how you'll analyse structure, business logic, critical paths, and edge cases. Later steps: record findings and new scenarios as they emerge.
required
step_number
Current test-generation step (starts at 1) — each step should build on prior work.
required
total_steps
Estimated number of steps needed for test planning; adjust as new scenarios appear.
required
next_step_required
True while more investigation or planning remains; set False when test planning is ready for expert validation.
required
findings
Summarise functionality, critical paths, edge cases, boundary conditions, error handling, and existing test patterns. Cover both happy and failure paths.
required
files_checked
Absolute paths of every file examined, including those ruled out.
relevant_files
Absolute paths of code that requires new or updated tests (implementation, dependencies, existing test fixtures).
relevant_context
Methods/functions identified as involved in the issue
issues_found
Issues identified with severity levels during work
confidence
Indicate your current confidence in the test generation assessment. Use: 'exploring' (starting analysis), 'low' (early investigation), 'medium' (some patterns identified), 'high' (strong understanding), 'very_high' (very strong understanding), 'almost_certain' (nearly complete test plan), 'certain' (100% confidence - test plan is thoroughly complete and all test scenarios are identified with no need for external model validation). Do NOT use 'certain' unless the test generation analysis is comprehensively complete, use 'very_high' or 'almost_certain' instead if not 100% sure. Using 'certain' means you have complete confidence locally and prevents external model validation.
hypothesis
Current theory about issue/goal based on work
use_assistant_model
Use assistant model for expert analysis after workflow steps. False skips expert analysis, relies solely on your personal investigation. Defaults to True for comprehensive validation.
temperature
0 = deterministic · 1 = creative.
thinking_mode
Reasoning depth: minimal, low, medium, high, or max.
continuation_id
Unique thread continuation ID for multi-turn conversations. Works across different tools. ALWAYS reuse the last continuation_id you were given—this preserves full conversation context, files, and findings so the agent can resume seamlessly.
images
Optional absolute paths to diagrams or visuals that clarify the system under test.
model
Currently in auto model selection mode. CRITICAL: When the user names a model, you MUST use that exact name unless the server rejects it. If no model is provided, you may use the `listmodels` tool to review options and select an appropriate match. Top models: gpt-5.2 (score 100, 400K ctx, thinking, code-gen); gpt-5.1-codex (score 100, 400K ctx, thinking, code-gen); gemini-2.5-pro (score 100, 1.0M ctx, thinking, code-gen); gemini-3-pro-preview (score 100, 1.0M ctx, thinking, code-gen); gpt-5.2-pro (score 100, 400K ctx, thinking, code-gen); +26 more via `listmodels`.
required
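A testgen step-1 call under these definitions could be sketched as follows; the module, path, and scenarios are illustrative, and the findings cover both happy and failure paths as the schema requires:

```python
# Hypothetical step-1 arguments for the testgen workflow (values illustrative).
testgen_step1 = {
    "step": "Outline tests for the rate limiter: window rollover, bursts, clock skew.",
    "step_number": 1,
    "total_steps": 2,
    "next_step_required": True,
    "findings": (
        "Happy path: steady traffic under the limit. "
        "Failure paths: burst at window edge, clock skew across restarts."
    ),
    "relevant_files": ["/home/dev/myrepo/src/rate_limiter.py"],  # hypothetical path
}
```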
Parameters
prompt
Statement to scrutinize. If you invoke `challenge` manually, strip the word 'challenge' and pass just the statement. Automatic invocations send the full user message as-is; do not modify it.
required
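The stripping rule above (manual invocations drop the leading 'challenge' keyword) can be sketched in a few lines; the user message is invented for illustration:

```python
# Sketch of preparing a manual challenge invocation: strip the leading
# 'challenge' keyword and pass only the bare statement.
user_message = "challenge Surely we can cache this globally?"
prompt = user_message.removeprefix("challenge ").strip()
challenge_args = {"prompt": prompt}
```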
Parameters
prompt
The API, SDK, library, framework, or technology you need current documentation, version info, breaking changes, or migration guidance for.
required
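A minimal invocation, with an illustrative library and version range:

```python
# Illustrative apilookup prompt; the library and versions are examples only.
apilookup_args = {
    "prompt": "httpx: breaking changes between 0.27 and 0.28, migration notes",
}
```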
Security Review
Integration: Zen
Repository: https://github.com/beehiveinnovations/zen-mcp-server
Commit: latest
Scan Date: 2026-03-13 13:03 UTC
Security Score
35 / 100
Tier Classification
Reject
OWASP Alignment
OWASP Rubric
- Standard: OWASP Top 10 (2021) aligned review
- Core methodology: architecture context, trust boundaries, data-flow tracing, threat modeling, control verification, and evidence-backed validation
- Key characteristics considered: exploitability, impact, likelihood, attacker preconditions, and business context
OWASP Security Category Mapping
- A01 Broken Access Control: none
- A02 Cryptographic Failures: 4 finding(s)
- A03 Injection: 1 finding(s)
- A04 Insecure Design: none
- A05 Security Misconfiguration: 21 finding(s)
- A06 Vulnerable and Outdated Components: 1 finding(s)
- A07 Identification and Authentication Failures: none
- A08 Software and Data Integrity Failures: none
- A09 Security Logging and Monitoring Failures: 87 finding(s)
- A10 Server-Side Request Forgery: none
Static Analysis Findings (Bandit)
High Severity
- Use of weak MD5 hash for security. Consider usedforsecurity=False in tests/http_transport_recorder.py:326 (confidence: HIGH)
- Use of weak MD5 hash for security. Consider usedforsecurity=False in tests/http_transport_recorder.py:389 (confidence: HIGH)
- Use of weak MD5 hash for security. Consider usedforsecurity=False in tests/test_cassette_semantic_matching.py:75 (confidence: HIGH)
- Use of weak MD5 hash for security. Consider usedforsecurity=False in tests/test_cassette_semantic_matching.py:76 (confidence: HIGH)
Medium Severity
- Probable insecure usage of temp file/directory. in tests/conftest.py:17 (confidence: MEDIUM)
- Possible binding to all interfaces. in tests/pii_sanitizer.py:98 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_auto_mode.py:211 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_chat_simple.py:62 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_chat_simple.py:71 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_chat_simple.py:79 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_chat_simple.py:109 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_chat_simple.py:127 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_chat_simple.py:306 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_chat_simple.py:316 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_docker_claude_desktop_integration.py:190 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_path_traversal_security.py:51 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_path_traversal_security.py:52 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_path_traversal_security.py:57 (confidence: MEDIUM)
- Probable insecure usage of temp file/directory. in tests/test_path_traversal_security.py:58 (confidence: MEDIUM)
- Possible binding to all interfaces. in tests/test_pii_sanitizer.py:96 (confidence: MEDIUM)
- Possible SQL injection vector through string-based query construction. in tools/docgen.py:348 (confidence: LOW)
- Audit url open for permitted schemes. Allowing use of file:/ or custom schemes is often unexpected. in tools/version.py:97 (confidence: HIGH)
Low Severity
- Consider possible security implications associated with the subprocess module. in communication_simulator_test.py:74 (confidence: HIGH)
- subprocess call - check for execution of untrusted input. in communication_simulator_test.py:448 (confidence: HIGH)
- Consider possible security implications associated with the subprocess module. in docker/scripts/healthcheck.py:7 (confidence: HIGH)
- Starting a process with a partial executable path in docker/scripts/healthcheck.py:22 (confidence: HIGH)
- subprocess call - check for execution of untrusted input. in docker/scripts/healthcheck.py:22 (confidence: HIGH)
- Try, Except, Pass detected. in providers/gemini.py:401 (confidence: HIGH)
- Try, Except, Continue detected. in providers/openai_compatible.py:84 (confidence: HIGH)
- Try, Except, Pass detected. in providers/openai_compatible.py:797 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:580 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:583 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:662 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:756 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:772 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:872 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:1062 (confidence: HIGH)
- Try, Except, Pass detected. in server.py:1285 (confidence: HIGH)
- Consider possible security implications associated with the subprocess module. in simulator_tests/base_test.py:11 (confidence: HIGH)
- subprocess call - check for execution of untrusted input. in simulator_tests/base_test.py:169 (confidence: HIGH)
- subprocess call - check for execution of untrusted input. in simulator_tests/base_test.py:276 (confidence: HIGH)
- Consider possible security implications associated with the subprocess module. in simulator_tests/log_utils.py:10 (confidence: HIGH)
Hardcoded Secrets
3 potential hardcoded secret(s) detected.
Build Status
SKIPPED
Build step was skipped to avoid running untrusted build commands by default.
Tests
Detected (pytest)
Documentation
README: Present
Dependency file: Present
AI Security Review
Security Code Review Report for repository: Zen
1) OWASP Review Methodology Applied
- Orientation: I inspected repository layout, the main server entry (server.py), providers, tools, clink agents, and utilities. I reviewed static analysis notes and prioritized files flagged by the scanner.
- Entry Points: I examined server.py (MCP stdio server & tool registry), tools (SimpleTool / BaseTool), provider implementations (providers/openai_compatible.py, providers/custom.py), clink agent execution (clink/agents/base.py and clink/registry.py), and file access & path validation utilities (utils/file_utils.py and utils/security_config.py).
- Data flows: Traced user-supplied/external inputs (MCP tool arguments including absolute_file_paths, CUSTOM_API_URL, CLI client config files) through validation and into sinks (file I/O, subprocess execution, network calls).
- Trust boundaries & entry points: MCP stdio messages -> server.call_tool -> tool code (SimpleTool/BaseTool) -> provider/client resolvers -> provider network calls (OpenAI/OpenRouter/Custom) and clink -> subprocess exec of configured CLIs.
- Threat modelling: Focused on path traversal, arbitrary command execution, SSRF, secret leakage in logs, insecure configuration loading, and unsafe deserialization in tests.
- Verification: Confirmed behavior by reading critical code implementing validation, path handling, provider base URL validation, CLI command execution, and logging sanitization.
2) OWASP Top 10 2021 Category Mapping
- A01: Broken Access Control: clink registry/agent executing configured local commands (clink/registry.py, clink/agents/base.py)
- A02: Cryptographic Failures: not directly observed.
- A03: Injection: potential command execution / CLI injection based on configured commands (clink).
- A04: Insecure Design: permissive acceptance of absolute paths in various config resolution functions; design choices allow operators to configure execution of arbitrary local commands.
- A05: Security Misconfiguration: .env override (utils/env.py) and logging configuration may leak sensitive information if mishandled; default debug logging enabled.
- A06: Vulnerable and Outdated Components: dependency review was not performed exhaustively here; the code uses httpx and the OpenAI SDK, so pinned versions should be validated in pyproject/requirements.
- A07: Identification and Authentication Failures: not prominent in code reviewed (relying on environment-provided API keys), but operator-configured keys may be accidentally exposed in logs if not sanitized.
- A08: Software and Data Integrity Failures: no runtime plugin signing / integrity verification for custom provider endpoints; ModelProviderRegistry allows custom provider factories.
- A09: Security Logging and Monitoring Failures: some try/except/pass blocks in critical shutdown/cleanup code (server.py cleanup_providers) could hide failures; however, proper mcp_activity logging exists.
- A10: Server-Side Request Forgery (SSRF): provider base_url (_validate_base_url/_is_localhost_url) validation is limited; custom endpoints (CUSTOM_API_URL) can point to internal hosts and will be used (providers/openai_compatible.py, providers/custom.py).
3) Critical Vulnerabilities (RCE, injection, auth bypass, unsafe deserialization)
- No immediate unauthenticated remote RCE was found in code executed on the MCP server directly from untrusted network inputs. The critical risk is configuration-driven local command execution: configured CLI clients can run arbitrary local programs.
- Unsafe deserialization: I found test code using pickle (simulator_tests/test_secaudit_validation.py), but only in tests. No production code unpickling untrusted data was found.
4) High Severity Issues
1. Arbitrary local command execution via clink configuration (potential local RCE / command injection)
- Files: clink/agents/base.py (create_subprocess_exec usage) and clink/registry.py (configuration parsing)
- Evidence: clink/registry.py -> _resolve_executable returns shlex.split(command) (no whitelist or strict validation) and configs are loaded from conf/cli_clients and user config directories (ClinkRegistry._iter_config_files). clink/agents/base.py then resolves the executable via shutil.which and executes the full command via asyncio.create_subprocess_exec (safe from shell=True injection, but will run whatever the configured executable+args are). See clink/agents/base.py at the process.launch call (search result: clink/agents/base.py:111) and clink/registry.py:_resolve_executable (search result reference).
- Severity: High (A01/A03)
- Exploitability: High if attacker can influence config files (e.g., user config dir or environment that points to config) or if operator config contains malicious/untrusted values. An attacker who can write a config JSON can cause arbitrary local execution with operator privileges.
- Remediation:
- Restrict CLIs that can be executed to a configured allow-list in code or config (whitelist of allowed executables/paths), or require executables to be absolute paths under an allowed directory.
- Validate and canonicalize executable paths and arguments during config load; disallow dangerous flags or redirections and disallow arbitrary output flag templates that may write to arbitrary paths without checks.
- Require operator confirmation / secure deployment process for CLI client definitions and treat them as high privilege.
- Consider running CLI agents in a sandboxed process / chroot or under reduced privileges.
- Suggested code changes:
- clink/registry.py::_resolve_executable: validate against a whitelist and force absolute/realpath checks. E.g. replace shlex.split(command) with parsing + validation. Add logging when config overrides occur.
- clink/agents/base.py: before executing, re-validate resolved_executable is under a safe directory and that role.args/config_args are within allowed set. (clink/agents/base.py around create_subprocess_exec call at line ~111)
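The whitelist-plus-realpath validation suggested above could be sketched as follows. This is an illustrative standalone function, not the project's actual `_resolve_executable`; the `ALLOWED_EXECUTABLES` and `SAFE_BIN_DIRS` names and their contents are assumptions an operator would configure.

```python
import shlex
import shutil
from pathlib import Path

# Hypothetical operator-maintained allow-list of CLI names and safe directories.
ALLOWED_EXECUTABLES = {"gemini", "codex", "claude"}
SAFE_BIN_DIRS = (Path("/usr/bin"), Path("/usr/local/bin"))


def resolve_executable(command: str) -> list[str]:
    """Split a configured command and validate the executable against an allow-list."""
    parts = shlex.split(command)
    if not parts:
        raise ValueError("empty CLI command in config")

    name = parts[0]
    if Path(name).name not in ALLOWED_EXECUTABLES:
        raise ValueError(f"executable {name!r} is not on the allow-list")

    located = shutil.which(name)
    if located is None:
        raise ValueError(f"executable {name!r} not found on PATH")

    # Canonicalize and confirm the binary lives under an approved directory.
    real = Path(located).resolve()
    if not any(real.is_relative_to(d) for d in SAFE_BIN_DIRS):
        raise ValueError(f"{real} is outside the allowed binary directories")

    return [str(real), *parts[1:]]
```

Rejecting at config-load time (rather than at execution time) surfaces a misconfigured or malicious entry immediately instead of when a tool call first spawns the subprocess.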
2. Server can be pointed at an arbitrary CUSTOM_API_URL (SSRF-like / internal network access)
- Files: providers/openai_compatible.py (base_url validation uses urlparse but does not perform DNS resolution to ban internal addresses), providers/custom.py (initialization) and server.py (configure_providers accepts CUSTOM_API_URL from env). Specifically, _validate_base_url (providers/openai_compatible.py) checks scheme/hostname/port only; _is_localhost_url detects localhost/private IPs but does not block them.
- Evidence: providers/openai_compatible.py: _validate_base_url only checks scheme, hostname, and port (search match). clients are then created with base_url assigned to OpenAI client (client_kwargs['base_url']). CUSTOM_API_URL is used without network isolation. Search results: providers/openai_compatible.py:_validate_base_url and _is_localhost_url.
- Severity: High (A10 SSRF)
- Exploitability: Medium - requires attacker control of CUSTOM_API_URL environment variable (or for multi-tenant deployments where an attacker can influence it). If that is possible, attacker can route model calls to internal services or exfiltrate data.
- Remediation:
- Harden URL validation: perform DNS resolution and block internal/private IP ranges by default unless explicitly whitelisted. Validate against e.g., ip.is_private, ip.is_loopback, link-local ranges, and also disallow IPv6 internal ranges unless explicitly allowed.
- Add an explicit operator opt-in to allow local/private addresses for CUSTOM_API_URL, and log/alert when such addresses are configured.
- Consider adding an allowlist of safe hostnames or require HTTPS with certificate verification for remote endpoints.
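A minimal sketch of the DNS-resolution check described above, assuming an opt-in flag analogous to the suggested CUSTOM_API_ALLOW_PRIVATE setting; the function name and parameter are illustrative, not the provider's existing API.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def validate_base_url(url: str, allow_private: bool = False) -> None:
    """Resolve the hostname and reject private/loopback/link-local targets
    unless the operator explicitly opts in."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"invalid base URL: {url!r}")

    # Resolve every address the hostname maps to (A and AAAA records).
    infos = socket.getaddrinfo(
        parsed.hostname, parsed.port or 443, proto=socket.IPPROTO_TCP
    )
    for info in infos:
        # Strip any IPv6 zone suffix (e.g. fe80::1%eth0) before parsing.
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        if (ip.is_private or ip.is_loopback or ip.is_link_local) and not allow_private:
            raise ValueError(
                f"{parsed.hostname} resolves to non-public address {ip}; "
                "enable the private-address opt-in to permit this"
            )
```

Resolving and checking every returned address (rather than only the first) matters because a hostname can map to both public and internal addresses, and DNS-rebinding-style tricks exploit exactly that gap.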
5) Medium Severity Issues
1. Potential log leakage of sensitive data
- Files: server.py (extensive debug/info logging), providers/openai_compatible.py (_sanitize_for_logging mitigates API-key logging, but other fields may leak), utils/env.py (loads .env and allows overriding the system environment). Many call sites log large prompt content or client_info; for example, server.py logs incoming client info to mcp_activity.
- Severity: Medium (A09/A05)
- Remediation:
- Ensure all logs strip or mask potential API keys or sensitive tokens beyond 'api_key' and 'authorization' keys. Consider a centralized sanitizer for any dict logged.
- Default log level to INFO in production (server code uses LOG_LEVEL env that defaults to DEBUG) and document safe settings in README. Ensure log files are created with secure file permissions (600) and rotate/lock files appropriately.
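The centralized sanitizer recommended above could look like the sketch below; the SENSITIVE_KEYS set and redaction marker are illustrative choices, not existing project code.

```python
# Hypothetical key names to redact; extend per deployment.
SENSITIVE_KEYS = {"api_key", "authorization", "token", "secret", "password"}


def sanitize_for_logging(value, max_len: int = 256):
    """Recursively mask likely secrets and truncate long strings before logging."""
    if isinstance(value, dict):
        return {
            k: "***REDACTED***"
            if k.lower() in SENSITIVE_KEYS
            else sanitize_for_logging(v, max_len)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [sanitize_for_logging(v, max_len) for v in value]
    if isinstance(value, str) and len(value) > max_len:
        return value[:max_len] + "...[truncated]"
    return value
```

Routing every logged dict through one function like this is easier to audit than scattering per-call-site masking, which is the gap the finding describes.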
2. Path resolution trusts absolute paths in config and prompt files
- Files: clink/registry.py:_resolve_prompt_path/_resolve_path allows absolute candidate paths to be returned directly (no dangerous path checks). server uses BaseTool.get_input_schema and tools call read_file_content which enforces absolute paths and checks with resolve_and_validate_path.
- Evidence: clink/registry.py:_resolve_prompt_path -> _resolve_path simply returns absolute Path directly; no cross-check to disallow system prompt_path pointing to sensitive system files. (clink/registry.py:_resolve_prompt_path/_resolve_path documented in file.)
- Severity: Medium (A04/A01)
- Remediation:
- Validate prompt_path is within expected configuration directories or ensure it doesn’t point to system-critical files. When accepting absolute paths from config, do explicit allow-listing or canonicalization checks.
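The allow-listing check described in the remediation could be sketched like this; ALLOWED_PROMPT_DIRS and the function name are hypothetical, standing in for whatever directories the registry actually treats as trusted.

```python
from pathlib import Path

# Hypothetical trusted locations for prompt files.
ALLOWED_PROMPT_DIRS = (
    Path("/etc/pal/prompts"),
    Path.home() / ".config" / "pal" / "prompts",
)


def resolve_prompt_path(candidate: str) -> Path:
    """Canonicalize a config-supplied prompt path and reject anything
    outside the allowed directories."""
    resolved = Path(candidate).expanduser().resolve()
    for base in ALLOWED_PROMPT_DIRS:
        if resolved.is_relative_to(base.resolve()):
            return resolved
    raise ValueError(f"prompt path {resolved} is outside allowed directories")
```

Canonicalizing with resolve() before the containment check closes the classic `../` and symlink escapes that a plain string-prefix comparison would miss.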
3. Greedy JSON extraction from exception strings and parsing heuristics
- Files: providers/openai_compatible.py around line ~770: the code searches exception text via re.search(r"{.*}", str(error)) and then uses ast.literal_eval(json_like_str), falling back to replacing single quotes with double quotes and calling json.loads. The regex is greedy and may capture trailing content; literal_eval is safer than eval, but feeding it untrusted strings from remote model/provider exceptions may still cause unexpected parsing failures.
- Severity: Medium-Low (A03/A09)
- Remediation:
- Use a non-greedy regex and robust JSON extraction (e.g., use a small parser or try to find balanced braces), and prefer json.loads with strict validation. If literal_eval use is retained, ensure the string is strictly validated to be a literal.
- Add exception handling and avoid depending on heuristics that may silently mask the root error.
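The stack-based balanced-brace approach suggested above could be sketched as follows; this is an illustrative replacement, not the provider's current implementation.

```python
import json


def extract_first_json_object(text: str):
    """Find the first balanced {...} span and parse it with json.loads.
    Returns None instead of guessing when nothing parses."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                # Track string state so braces inside strings are ignored.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next opening brace
        start = text.find("{", start + 1)
    return None
```

Unlike the greedy regex, this never captures trailing text past the matching brace, and it fails closed (returns None) rather than falling back to quote-swapping heuristics.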
6) Low Severity Issues / Best-practice gaps
1. Swallowed exceptions and “except: pass” in cleanup code
- Files: server.py cleanup_providers (atexit handler) and various try/except: pass patterns reported by static analysis. Swallowing errors at shutdown can hide resource closure issues.
- Severity: Low (A09)
- Remediation: Log exceptions at debug level instead of silently passing.
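A minimal sketch of the logged-instead-of-swallowed pattern; the logger name and provider interface are assumptions, not the server's actual shutdown code.

```python
import logging

logger = logging.getLogger("server.cleanup")


def cleanup_providers(providers):
    """Close each provider, logging failures at DEBUG instead of swallowing them."""
    for provider in providers:
        try:
            provider.close()
        except Exception:
            # exc_info=True preserves the traceback for troubleshooting.
            logger.debug("cleanup failed for %r", provider, exc_info=True)
```

The key difference from `except: pass` is that one failing provider neither aborts cleanup of the rest nor disappears without a trace.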
2. Test-only insecure code flagged by static analysis (subprocess usage, pickle)
- Files: simulator_tests and tests contain subprocess usage and pickle.loads in test code. These are test-only and not part of production; confirm they remain confined to test suites and are not used in production endpoints.
- Severity: Low (test-only)
- Remediation: Keep these in test suites and do not enable in production.
7) Key Risk Characteristics (Exploitability, Impact, Likelihood, Preconditions)
- Arbitrary local CLI execution (clink): Exploitability: High if attacker can modify CLI config or place files in the user config path. Impact: High (local code execution as service user, exfiltrate secrets, modify workspace). Likelihood: Medium in an environment where multiple users can drop files into user config directories; Low in single-operator deployments. Preconditions: Ability to write or modify CLI client config JSON (USER_CONFIG_DIR, conf/cli_clients or environment override path).
- SSRF via CUSTOM_API_URL: Exploitability: Medium (requires ability to set env var or influence env). Impact: Moderate-High (information disclosure from internal services, lateral movement). Likelihood: Low in secure deployments; Higher in ephemeral/containerized CI or misconfigured deployments that pull env from untrusted sources. Preconditions: Ability to set CUSTOM_API_URL in environment or .env file.
- Log leakage: Exploitability: Medium (an attacker who can read logs). Impact: Moderate (exposure of API keys, prompts). Likelihood: Medium (debug logs default). Preconditions: Access to logs or ability to craft data that gets logged.
- Path access from configs: Exploitability: Medium-Low (requires config modification). Impact: Moderate (leakage of system files used as prompts). Preconditions: ability to specify absolute prompt files in CLI config or to place prompt files in registries.
8) Positive Security Practices Observed
- File access hardening: utils/file_utils.resolve_and_validate_path enforces absolute paths, forbids dangerous system roots and home-root scanning, resolves symlinks, and checks against DANGEROUS_PATHS (utils/security_config.py). This is a strong defense in depth for file access from MCP tool requests. (utils/file_utils.py: resolve_and_validate_path, utils/security_config.py: is_dangerous_path)
- Logging sanitization: providers/openai_compatible.py implements _sanitize_for_logging to remove api_key and authorization entries and truncate long text before logging. This reduces risk of credential leakage in many API call logs.
- Timeout and proxy hardening: OpenAI-compatible provider avoids proxy env vars when creating HTTP client and configures reasonable timeouts, reducing some SSRF/proxy abuse risk.
- Prompt size validation: BaseTool._validate_token_limit enforces MCP_PROMPT_SIZE_LIMIT for user content crossing MCP boundary.
9) Recommendations (concrete fixes with file:line references)
NOTE: Line numbers are approximate and come from code locations discovered during review; follow references by file and function names below.
Critical / High priority fixes
- Harden CLINK command execution
- Files: clink/registry.py::_resolve_executable (where shlex.split is used); clink/agents/base.py (process creation at asyncio.create_subprocess_exec near line ~111).
- Fix: Implement a whitelist of allowed executables or require absolute path and validate it against a safe directory. Validate and sanitize arguments in registry load instead of executing them blindly. Example: on registry load, validate resolved_executable = Path(shutil.which(executable_name)).resolve(); ensure it is under /usr/bin or an operator-defined safe list; otherwise reject config with explicit error.
- OWASP mapping: A01 (Broken Access Control), A03 (Injection)
- Harden provider base_url handling (SSRF)
- Files: providers/openai_compatible.py:_validate_base_url and _is_localhost_url; server.py configure_providers (CUSTOM_API_URL handling at server.py:~479).
- Fix: Extend _validate_base_url to perform DNS resolution and reject addresses in private / link-local / loopback ranges by default, unless an explicit opt-in is set (e.g., CUSTOM_API_ALLOW_PRIVATE=true). Example: resolve hostname to IP(s) and for each ip do ipaddress.ip_address(ip).is_private or is_loopback checks; if so, require opt-in setting.
- OWASP mapping: A10 (SSRF)
Medium priority fixes
- Improve logging sanitization and default log level
- Files: server.py (logging setup), providers/openai_compatible.py:_sanitize_for_logging
- Fix: Ensure all logged dictionaries pass through a sanitizer that strips common secrets (API tokens, Authorization headers, environment secrets) and avoid logging full prompts or user-provided files at DEBUG in production. Default LOG_LEVEL to INFO in production or detect CI.
- OWASP mapping: A05 / A09
- Avoid fragile JSON extraction from exception text
- File: providers/openai_compatible.py (regex extraction around line ~770)
- Fix: Replace the greedy re.search(r"{.*}", ...) with a robust parser: try json.loads directly on candidate substrings, use a stack-based brace matching to find the first balanced JSON object, and do not attempt ast.literal_eval fallback unless absolutely necessary. Surround with try/except and log parsing failures, not silently converting malformed text.
- OWASP mapping: A03
Low priority fixes / best-practices
- Replace silent except: pass with logged debug exceptions in server cleanup
- Files: server.py cleanup_providers and other swallowed-exception sites
- Fix: Log exception stacktrace at DEBUG when cleanup fails to aid troubleshooting.
- Document operator responsibilities for CLI config and .env
- Files: README.md, SECURITY.md
- Fix: Add explicit warnings that CLI client configurations are powerful and must be managed as high-privilege config; document secure defaults for LOG_LEVEL and file permissions of logs and .env.
10) Next Tier Upgrade Plan (integration security posture)
- Current likely tier: Silver
- Rationale: The codebase demonstrates many strong security practices (robust file path validation, prompt-size checks, logging sanitization hooks, timeout/proxy hardening). However, executing configured CLIs without whitelisting, the permissive handling of custom provider endpoints, and config-sourced absolute paths are significant configuration-driven risks.
- Target next tier: Gold
- Required prioritized actions to reach Gold (highest priority first):
1. Harden CLINK execution path (whitelist executables, validate args, sandbox execution). (High priority)
2. Harden CUSTOM_API_URL and provider base_url validation (DNS resolution, reject internal ranges by default, opt-in for localhost). (High priority)
3. Centralize logging sanitization and default to INFO in production; ensure logs are created with secure permissions. (Medium)
4. Validate configuration file paths and disallow using absolute system file paths as prompts or CLI role files unless explicitly allowed. (Medium)
5. Add operational documentation and deployment security checks (CI scanning of env and config files). (Low)
Summary of concrete file:line remediation pointers (as discovered during review):
- clink/agents/base.py (around line ~111): validate resolved_executable and sanitize arguments before asyncio.create_subprocess_exec. Implement whitelist and sandboxing.
- clink/registry.py::_resolve_executable (function): do not accept arbitrary commands via shlex.split without validation; enforce absolute paths or whitelisted names.
- providers/openai_compatible.py (around lines ~752-780): replace greedy JSON extraction, avoid ast.literal_eval heuristics; improve _validate_base_url to resolve hostnames and block internal IPs by default.
- utils/file_utils.py: resolve_and_validate_path (start at function def around utils/file_utils.py:282) is a strong control — ensure all code paths that read files call this function (clink registry when resolving prompt paths should call resolve_and_validate_path or similar check).
- server.py cleanup_providers: remove silent suppression of exceptions; log at debug level.
Final notes & actionable next steps for maintainers
- Short-term (1-2 days): Implement quick hardening steps: (a) prevent CLI configs from referencing absolute prompt files outside config directories; (b) default LOG_LEVEL to INFO and ensure API keys are removed from logs.
- Medium-term (1-2 weeks): Implement CLIAgent allow-list or sandboxing; implement DNS-based validation for CUSTOM_API_URL and flag/risk when internal addresses are configured.
- Long-term (1-2 months): Perform dependency CVE scan (pyproject/requirements), add runtime tests for SSRF and clink configuration safety, adopt signed configuration or RBAC for config editing in multi-user contexts.
If you want, I can produce small code patches / diff suggestions for the highest-priority items (clink command validation, provider base_url DNS checks, logging sanitization), referencing exact lines and proposed code.
-- End of review --
Summary
- Security Score: 35/100 (Reject)
- Static analysis found 4 high, 18 medium, and 2851 low severity issues.
- Build step skipped for safety.
- Tests detected.