cursor-plugin-evals

The definitive testing framework for Cursor plugins — with an autonomous AI assistant

7 layers · 35 evaluators · 6 adapters · 22 security rules · 154 community tests
$ cursor-plugin-evals run
plugin-structure ·················· 7/7 passed
unit-registration ················ 3/3 passed
integration-gateway ·············· 12/12 passed
llm-tool-selection ··············· 8/8 passed
30 passed 0 failed Quality: A (96.2%)

Why cursor-plugin-evals?

Purpose-built for the MCP ecosystem. No retrofitting generic LLM evals onto plugin workflows.

🔌

Only MCP-Native

Speaks the Model Context Protocol natively — tests tool registration, gateway routing, and schema validation the way Cursor actually uses your plugin.

🧪

7 Testing Layers

From static analysis to LLM-driven skill evaluation, every layer of plugin quality is covered with dedicated evaluators.

🛡️

Battle-Tested

1770+ tests across production plugins. Red-teaming, prompt injection defense, guardrails, and regression detection built in.

7 Testing Layers

A progressive pipeline from static checks to full LLM-powered skill evaluation and protocol conformance.

1. Static: Manifest validation, schema checks, file structure
2. Unit: Tool registration, handler isolation, input parsing
3. Integration: MCP gateway, end-to-end tool calls, fixture replay
4. Performance: Latency budgets, throughput limits, cost tracking
5. LLM: Tool selection accuracy, response quality, multi-turn
6. Skill: Collision detection, activation accuracy, conversation sim
7. Conformance: MCP protocol spec compliance, 25 checks across 9 categories

Meet the Framework Assistant

An autonomous AI agent that handles your entire eval lifecycle. Point it at your plugin — it does the rest.

1. Deep Scan: Discovers every MCP tool, skill, rule, agent, command, and hook in your plugin. Reads every file to understand what each component does.
2. Generate Coverage: Writes a comprehensive plugin-eval.yaml covering every component across all 7 layers: static, unit, integration, performance, LLM, skill, and conformance.
3. Set Up Infrastructure: Creates Docker Compose, test data seeds, .env files, CI workflows, and orchestration scripts. Everything needed for e2e testing.
4. Run → Fix → Converge: Executes all suites, analyzes failures, fixes YAML and assertions, and re-runs. Iterates up to 5 times per layer until CI passes.
5. Calibrate Thresholds: After tests pass, tightens CI gates based on actual scores. Ensures quality never regresses; thresholds track real capability.
6. Stay Proactive: Detects new tools and skills as you add them. Auto-generates tests, updates coverage, and monitors for regressions without being asked.

Everything You Need

Purpose-built capabilities for Cursor plugin quality.

🔴

Red-Teaming

10 attack categories including prompt injection and jailbreak

Smart Gen

AI-powered test generation from tool schemas

🎯

Prompt Optimization

Auto-improve tool descriptions for better selection

💬

Conversation Sim

Multi-turn dialogue testing with synthetic users

📊

Monitoring

Real-time quality dashboards and trend tracking

💰

Cost Advisor

Token usage analysis and budget enforcement

📉

Regression Detection

Catch quality drops before they ship

🛡️

Guardrails

Safety boundaries with automated enforcement

🔗

External Eval

Evaluate any plugin without committing eval files to the target repo

🔬

Prompt Sensitivity

Measure robustness to prompt variations

💥

Collision Detection

Find overlapping skills before they confuse the LLM

🔍

Trace Viewer

Visual execution traces for debugging failures

🤖

Model Comparison

Test across GPT, Claude, and open-source models

📦

Dataset Management

Version, share, and reuse evaluation datasets

🔔

Notifications

Slack and webhook alerts on quality changes

📼

Fixture Record/Replay

Deterministic offline testing with recorded calls

🗄️

LLM Cache

Semantic caching for fast, cheap re-runs

🔌

7 Adapters

MCP, plain-llm, cursor-cli, claude-sdk, gemini-cli, headless-coder, otel-trace

35 Evaluators

19 deterministic + 16 LLM-as-judge + multi-judge panel

🚦

CI Thresholds

Quality gates that block merges below your bar

🏆

Quality Score

A single A–F grade for every eval run

📈

pass@k / pass^k Metrics

Industry-standard probabilistic metrics from Anthropic's agent eval methodology. Named presets for smoke, reliable, and regression testing.
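
For readers unfamiliar with the two metrics, here is a minimal sketch of the commonly used estimators, assuming `n` recorded trials of which `c` passed. The function names are illustrative and are not the framework's API; the preset names mentioned above are not modeled here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn without replacement
    from n recorded trials (c of which passed), is a pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn without replacement from n
    recorded trials (c of which passed), are passes (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when fewer than k trials passed, giving 0.0.
    return comb(c, k) / comb(n, k)
```

pass@k rewards a skill that succeeds at least once in k attempts, while pass^k rewards one that succeeds every time, so the two diverge sharply for flaky behavior: with 2 passes in 4 trials and k = 2, pass@k is 5/6 but pass^k is only 1/6.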

🔬

Ablation Testing

Prove your skill adds value with statistical A/B comparison using Welch's t-test.
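
As a rough illustration of the statistic behind this comparison, the sketch below computes Welch's t and the Welch-Satterthwaite degrees of freedom for two score samples using only the standard library. The function name and the sample data are assumptions for illustration, not the framework's API or real results.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, degrees of freedom) for two independent
    samples without assuming equal variances."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two scores")
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    if se2 == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical per-run scores with and without the skill enabled.
with_skill = [0.82, 0.91, 0.88, 0.79, 0.93]
without_skill = [0.71, 0.64, 0.80, 0.69, 0.75]
t_stat, dof = welch_t(with_skill, without_skill)
```

Turning `t_stat` and `dof` into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one option if SciPy is available.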

🪄

Zero-Config Skill Eval

Auto-generate eval.yaml from SKILL.md — just point and run. AI-powered recommendations improve your config after every run.

💰

LLM Cost Optimization

Judge caching, per-evaluator model tiers, call deduplication, and pre-run cost estimation. Cut eval costs by 50-80%.

🔎

ES|QL Evaluators

Execution-based scoring: run generated queries against live Elasticsearch, match patterns with equivalence classes, compare result sets against golden queries.

📡

Elastic OTEL Export

Send eval traces to Elasticsearch via OpenTelemetry for production observability dashboards.

📸

Snapshot Testing

Capture and compare tool responses with configurable sanitizers for timestamps, UUIDs, and IDs.
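
To make the idea of a sanitizer concrete, here is a minimal sketch that masks UUIDs and ISO 8601 timestamps before comparing a response with a stored snapshot. The patterns, placeholder strings, and function names are assumptions for illustration; the framework's actual sanitizer configuration is not shown in this document.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUIDs.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
# ISO 8601 date-times with optional fractional seconds and UTC offset.
ISO_TS_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


def sanitize(text: str) -> str:
    """Replace volatile values with stable placeholders."""
    text = UUID_RE.sub("<uuid>", text)
    return ISO_TS_RE.sub("<timestamp>", text)


def matches_snapshot(actual: str, snapshot: str) -> bool:
    """Compare two serialized responses after masking volatile values."""
    return sanitize(actual) == sanitize(snapshot)
```

Two responses that differ only in their request ID or creation time then compare equal, while a change to any other field still fails the snapshot.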

🔎

Deep Trajectory Tracing

Function-level observation with TraceCollector for detailed agent execution analysis.

💯

Cost-Efficiency Score

Single 0-100 score combining quality and cost. Letter grades A-F for every skill.

📝

NL Scorer

Describe what to check in plain English — gets auto-converted to an LLM evaluator.

🛡️

Unicode & YAML Security

Detect bidi overrides, homoglyphs, zero-width chars, YAML tag injection, and anchor bombs.
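
The sketch below shows what a check for two of these classes, bidirectional control characters and zero-width characters, might look like. The rule labels and function name are illustrative assumptions; homoglyph and YAML checks are out of scope for this example.

```python
# Bidi embedding/override controls (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = {chr(c) for c in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}
# Zero-width space, non-joiner, joiner, word joiner, and no-break space (BOM).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_suspicious_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, rule, code point) for each invisible control character."""
    findings = []
    for index, ch in enumerate(text):
        if ch in BIDI_CONTROLS:
            findings.append((index, "bidi-control", f"U+{ord(ch):04X}"))
        elif ch in ZERO_WIDTH:
            findings.append((index, "zero-width", f"U+{ord(ch):04X}"))
    return findings
```

Running this over skill descriptions, rules, and YAML files flags text that renders differently from what the model actually receives.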

📦

20 Benchmark Tests

Pre-built evals for instruction-following, ambiguity handling, multi-step reasoning, and safety.

🎯

Skill Routing Accuracy

Test positive and negative triggers — does the right skill activate for matching prompts?

📋

Description Quality

Score skill descriptions on clarity, specificity, actionability, and uniqueness vs other skills.

📐

Context Budget

Per-skill token estimation with bloat detection. Know exactly how much context each skill costs.

🔄

Skill A/B Testing

Compare two skill versions with Welch's t-test for statistically significant differences.

🧩

Composability Testing

Test multi-skill scenarios for interference, chaining, and compatibility.

📊

CLEAR Framework

Enterprise-grade 5-dimension scoring: Cost, Latency, Efficacy, Assurance, Reliability. Pareto efficiency detection.
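
Pareto efficiency can be illustrated with a small sketch over two of the five dimensions, assuming lower cost and higher efficacy are better. The function name, the two-dimensional simplification, and the sample figures are assumptions for illustration only.

```python
def pareto_front(candidates: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of candidates that no other candidate dominates.

    Each value is (cost, efficacy): lower cost is better, higher efficacy is better.
    """
    def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
        # a is at least as good on both axes and strictly better on one.
        return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

    return sorted(
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


# Hypothetical configurations: (cost in USD per run, efficacy score).
configs = {
    "small-model": (1.0, 0.80),
    "large-model": (2.0, 0.90),
    "large-no-cache": (2.5, 0.85),
    "small-no-skill": (1.0, 0.70),
}
front = pareto_front(configs)
```

Here `front` keeps `large-model` and `small-model`: each of the other two configurations is matched or beaten on both axes by one of them.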

🔬

AgentRx Failure Taxonomy

9-category failure diagnosis: plan adherence, tool misuse, hallucination, loops, safety violations + critical step localization.

📡

MCP-Radar Scoring

Per-tool precision/recall/F1, MRR, token waste ratio, planning latency. Research-backed MCP benchmarking.

Battle-Tested in Production

See what comprehensive eval suites look like on real-world plugins.

elastic-cursor-plugin
Elasticsearch, Kibana, and Elastic Cloud — fully integrated into Cursor
The Elastic Cursor Plugin connects Cursor to the entire Elastic stack — Elasticsearch queries, Kibana dashboards, Cloud management, observability setup, security detection rules, and AI assistant agent building. Its evaluation suite is the reference implementation for cursor-plugin-evals.
38 MCP Tools
31 Test Suites
12 Layers
Quality Grade: A
Static · Unit · Integration · Performance · LLM · Conformance · Security · Chaos · Fuzz · Schema Drift · SAFE-MCP · Multi-Server
attack-emulation-cursor-plugin
MITRE ATT&CK emulation with Caldera — security testing for Elastic
Full evaluation suite for a security-focused MCP plugin with 15 tools across 3 packages (infrastructure, Caldera, detection). Includes OWASP adversarial tests, trajectory evaluation, and multi-judge blind assessment.
15 MCP Tools
24 Test Suites
5 Layers
Quality Grade: A
Static · Unit · Integration · Performance · LLM

Quick Start

Up and running in under two minutes.

terminal
# Install
npm install cursor-plugin-evals --save-dev

# Initialize config
npx cursor-plugin-evals init

# Run all layers
npx cursor-plugin-evals run

# Run with red-teaming
npx cursor-plugin-evals run --red-team

How We Compare

Built for Cursor plugins, not retrofitted from generic LLM evals.

Capability | cursor-plugin-evals | Skillgrade | DeepEval | Braintrust | MCP-Eval
Autonomous AI assistant
MCP-native protocol testing
Cursor-aware (skills, rules)
Skill collision detection
Red-teaming / guardrails
Fixture record & replay
pass@k / pass^k metrics
Inline script graders
Ablation testing (A/B)
Execution-based evaluators (ES|QL)
6 task adapters
7 testing layers
CI quality gates
Conversation simulation
Cost tracking
Zero-config skill eval