The definitive testing framework for Cursor plugins — with an autonomous AI assistant
Purpose-built for the MCP ecosystem. No retrofitting generic LLM evals onto plugin workflows.
Speaks the Model Context Protocol natively — tests tool registration, gateway routing, and schema validation the way Cursor actually uses your plugin.
From static analysis to LLM-driven skill evaluation, every layer of plugin quality is covered with dedicated evaluators.
1770+ tests across production plugins. Red-teaming, prompt injection defense, guardrails, and regression detection built in.
A progressive pipeline from static checks to full LLM-powered skill evaluation and protocol conformance.
Manifest validation, schema checks, file structure
Tool registration, handler isolation, input parsing
MCP gateway, end-to-end tool calls, fixture replay
Latency budgets, throughput limits, cost tracking
Tool selection accuracy, response quality, multi-turn
Collision detection, activation accuracy, conversation simulation
MCP protocol spec compliance, 25 checks across 9 categories
An autonomous AI agent that handles your entire eval lifecycle. Point it at your plugin — it does the rest.
Discovers every MCP tool, skill, rule, agent, command, and hook in your plugin. Reads every file to understand what each component does.
Writes a comprehensive plugin-eval.yaml covering every component across all 7 layers — static, unit, integration, performance, LLM, skill, and conformance.
Creates Docker Compose, test data seeds, .env files, CI workflows, and orchestration scripts. Everything needed for e2e testing.
Executes all suites, analyzes failures, fixes YAML and assertions, and re-runs. Iterates up to 5 times per layer until CI passes.
After tests pass, tightens CI gates based on actual scores. Ensures quality never regresses — thresholds track real capability.
Detects new tools and skills as you add them. Auto-generates tests, updates coverage, and monitors for regressions — without being asked.
Purpose-built capabilities for Cursor plugin quality.
10 attack categories including prompt injection and jailbreak
AI-powered test generation from tool schemas
Auto-improve tool descriptions for better selection
Multi-turn dialogue testing with synthetic users
Real-time quality dashboards and trend tracking
Token usage analysis and budget enforcement
Catch quality drops before they ship
Safety boundaries with automated enforcement
Evaluate any plugin without committing eval files to the target repo
Measure robustness to prompt variations
Find overlapping skills before they confuse the LLM
Visual execution traces for debugging failures
Test across GPT, Claude, and open-source models
Version, share, and reuse evaluation datasets
Slack and webhook alerts on quality changes
Deterministic offline testing with recorded calls
Semantic caching for fast, cheap re-runs
MCP, plain-llm, cursor-cli, claude-sdk, gemini-cli, headless-coder, otel-trace
19 deterministic + 16 LLM-as-judge + multi-judge panel
Quality gates that block merges below your bar
A single A–F grade for every eval run
pass@k and pass^k: industry-standard probabilistic metrics from Anthropic's agent eval methodology. Named presets for smoke, reliable, and regression testing.
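The page does not say which estimator sits behind these metrics. As a rough sketch, assuming the common combinatorial estimators over `n` recorded trials with `c` passes (the function names here are illustrative, not the framework's API), pass@k asks "does at least one of k attempts succeed?" while pass^k asks "do all k attempts succeed?":

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled trials passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k sampled trials pass (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 trials, 5 passes, k = 2
    print(round(pass_at_k(10, 5, 2), 3))   # 0.778
    print(round(pass_hat_k(10, 5, 2), 3))  # 0.222
```

The gap between the two numbers is the point: a skill can look fine on pass@k and still be unreliable when every attempt has to succeed.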
Prove your skill adds value with statistical A/B comparison using Welch's t-test.
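For readers unfamiliar with Welch's t-test, here is a minimal standard-library sketch of the statistic and its Welch-Satterthwaite degrees of freedom. The function name and the sample scores are made up for illustration; how the framework computes or thresholds the result is not described on this page.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, degrees of freedom) for two independent samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two scores")
    # Squared standard errors of each sample mean (sample variance / n).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


if __name__ == "__main__":
    with_skill = [0.90, 0.80, 0.85, 0.95]     # hypothetical eval scores
    without_skill = [0.60, 0.70, 0.65, 0.75]  # hypothetical eval scores
    t, df = welch_t(with_skill, without_skill)
    print(f"t = {t:.3f}, df = {df:.1f}")
```

Turning `t` and `df` into a p-value needs the t-distribution. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test and returns the p-value directly.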
Auto-generate eval.yaml from SKILL.md — just point and run. AI-powered recommendations improve your config after every run.
Judge caching, per-evaluator model tiers, call deduplication, and pre-run cost estimation. Cut eval costs by 50-80%.
Execution-based scoring: run generated queries against live Elasticsearch, match patterns with equivalence classes, compare result sets against golden queries.
Send eval traces to Elasticsearch via OpenTelemetry for production observability dashboards.
Capture and compare tool responses with configurable sanitizers for timestamps, UUIDs, and IDs.
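To illustrate the idea behind sanitizers, the sketch below replaces volatile values with stable placeholders so two captures of the same tool response compare equal. The patterns, placeholder names, and `sanitize` function are assumptions for illustration, not the framework's actual configuration.

```python
import re

# (pattern, replacement) pairs, applied in order. All hypothetical.
SANITIZERS = [
    # UUIDs such as 123e4567-e89b-12d3-a456-426614174000
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    # ISO 8601 timestamps such as 2024-05-01T12:00:00Z
    (re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
                r"(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"), "<timestamp>"),
    # Numeric values of a JSON "id" field
    (re.compile(r'("id"\s*:\s*)\d+'), r"\1<id>"),
]


def sanitize(text: str) -> str:
    """Replace volatile values so snapshots can be compared byte for byte."""
    for pattern, replacement in SANITIZERS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    first = ('{"id": 42, "at": "2024-05-01T12:00:00Z", '
             '"trace": "123e4567-e89b-12d3-a456-426614174000"}')
    second = ('{"id": 97, "at": "2025-01-09T08:30:15.250+02:00", '
              '"trace": "9b2f6c1e-0a3d-4f5b-8c7d-1e2f3a4b5c6d"}')
    print(sanitize(first) == sanitize(second))  # True
```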
Function-level observation with TraceCollector for detailed agent execution analysis.
Single 0-100 score combining quality and cost. Letter grades A-F for every skill.
Describe what to check in plain English — gets auto-converted to an LLM evaluator.
Detect bidi overrides, homoglyphs, zero-width chars, YAML tag injection, and anchor bombs.
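A simplified sketch of what three of these checks might look like, covering bidi controls, zero-width characters, and a crude mixed-script heuristic for homoglyphs. The `scan` function and its finding labels are hypothetical. YAML tag injection and anchor bombs are left out, and real homoglyph detection is considerably more involved than this.

```python
import re
import unicodedata

# Bidirectional embedding/override/isolate controls: U+202A..U+202E, U+2066..U+2069
BIDI_CONTROLS = {chr(c) for c in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (offset, kind) findings for suspicious characters in text."""
    findings = []
    for i, ch in enumerate(text):
        if ch in BIDI_CONTROLS:
            findings.append((i, "bidi-control"))
        elif ch in ZERO_WIDTH:
            findings.append((i, "zero-width"))
    # Heuristic: a single word mixing Latin with Cyrillic or Greek letters.
    for match in re.finditer(r"\S+", text):
        scripts = {
            unicodedata.name(ch, "").split(" ")[0]
            for ch in match.group()
            if ch.isalpha()
        }
        if "LATIN" in scripts and scripts & {"CYRILLIC", "GREEK"}:
            findings.append((match.start(), "mixed-script"))
    return findings


if __name__ == "__main__":
    print(scan("name: p\u0430ypal-helper"))  # Cyrillic 'а' inside a Latin word
    print(scan("run\u202e this"))            # right-to-left override
```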
Pre-built evals for instruction-following, ambiguity handling, multi-step reasoning, and safety.
Test positive and negative triggers — does the right skill activate for matching prompts?
Score skill descriptions on clarity, specificity, actionability, and uniqueness vs other skills.
Per-skill token estimation with bloat detection. Know exactly how much context each skill costs.
Compare two skill versions with Welch's t-test for statistically significant differences.
Test multi-skill scenarios for interference, chaining, and compatibility.
Enterprise-grade 5-dimension scoring: Cost, Latency, Efficacy, Assurance, Reliability. Pareto efficiency detection.
9-category failure diagnosis: plan adherence, tool misuse, hallucination, loops, safety violations + critical step localization.
Per-tool precision/recall/F1, MRR, token waste ratio, planning latency. Research-backed MCP benchmarking.
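As a reference for the tool-selection metrics named above, the sketch below computes per-tool precision, recall, and F1 from expected versus selected tools, and mean reciprocal rank (MRR) from ranked candidate lists. The function names, the tool names, and the one-tool-per-case input shape are assumptions for illustration.

```python
def per_tool_prf(expected: list[str], selected: list[str]) -> dict[str, tuple[float, float, float]]:
    """Map each tool to (precision, recall, F1) over paired test cases."""
    pairs = list(zip(expected, selected))
    scores = {}
    for tool in sorted(set(expected) | set(selected)):
        tp = sum(e == tool and s == tool for e, s in pairs)
        fp = sum(e != tool and s == tool for e, s in pairs)
        fn = sum(e == tool and s != tool for e, s in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tool] = (precision, recall, f1)
    return scores


def mean_reciprocal_rank(expected: list[str], rankings: list[list[str]]) -> float:
    """Average of 1/rank of the expected tool; 0 when it is not ranked at all."""
    if not expected:
        raise ValueError("need at least one test case")
    total = 0.0
    for tool, ranked in zip(expected, rankings):
        if tool in ranked:
            total += 1.0 / (ranked.index(tool) + 1)
    return total / len(expected)


if __name__ == "__main__":
    expected = ["search", "search", "index", "delete"]   # hypothetical tools
    selected = ["search", "index", "index", "delete"]
    for tool, (p, r, f1) in per_tool_prf(expected, selected).items():
        print(f"{tool}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```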
See what comprehensive eval suites look like on real-world plugins.
Up and running in under two minutes.
Built for Cursor plugins, not retrofitted from generic LLM evals.
| Capability | cursor-plugin-evals | Skillgrade | DeepEval | Braintrust | MCP-Eval |
|---|---|---|---|---|---|
| Autonomous AI assistant | ✓ | ✗ | ✗ | ✗ | ✗ |
| MCP-native protocol testing | ✓ | ✗ | ✗ | ✗ | ✓ |
| Cursor-aware (skills, rules) | ✓ | ✓ | ✗ | ✗ | ✗ |
| Skill collision detection | ✓ | ✗ | ✗ | ✗ | ✗ |
| Red-teaming / guardrails | ✓ | ✗ | ✓ | ✗ | ✗ |
| Fixture record & replay | ✓ | ✗ | ✗ | ✗ | ✗ |
| pass@k / pass^k metrics | ✓ | ✓ | ✗ | ✗ | ✗ |
| Inline script graders | ✓ | ✓ | ✗ | ✗ | ✗ |
| Ablation testing (A/B) | ✓ | ✓ | ✗ | ✗ | ✗ |
| Execution-based evaluators (ES\|QL) | ✓ | ✗ | ✗ | ✗ | ✗ |
| 7 task adapters | ✓ | ✗ | ✓ | ✓ | ✗ |
| 7 testing layers | ✓ | ✗ | ✗ | ✗ | ✗ |
| CI quality gates | ✓ | ✗ | ✓ | ✓ | ✗ |
| Conversation simulation | ✓ | ✗ | ✓ | ✗ | ✗ |
| Cost tracking | ✓ | ✗ | ✗ | ✓ | ✗ |
| Zero-config skill eval | ✓ | ✗ | ✗ | ✗ | ✗ |