cursor-plugin-evals

The definitive testing framework for Cursor plugins — with an autonomous AI assistant

7 layers · 35 evaluators · 7 adapters · 22 security rules · 154 community tests

$ cursor-plugin-evals run
● plugin-structure ·················· 7/7 passed
● unit-registration ················ 3/3 passed
● integration-gateway ·············· 12/12 passed
● llm-tool-selection ··············· 8/8 passed
✓ 30 passed  0 failed  Quality: A (96.2%)

Why cursor-plugin-evals?

Purpose-built for the MCP ecosystem. No retrofitting generic LLM evals onto plugin workflows.

🔌

Only MCP-Native

Speaks the Model Context Protocol natively — tests tool registration, gateway routing, and schema validation the way Cursor actually uses your plugin.

🧪

7 Testing Layers

From static analysis to LLM-driven skill evaluation, every layer of plugin quality is covered with dedicated evaluators.

🛡️

Battle-Tested

1770+ tests across production plugins. Red-teaming, prompt injection defense, guardrails, and regression detection built in.

7 Testing Layers

A progressive pipeline from static checks to full LLM-powered skill evaluation and protocol conformance.

Static

Manifest validation, schema checks, file structure

→

Unit

Tool registration, handler isolation, input parsing

→

Integration

MCP gateway, end-to-end tool calls, fixture replay

→

Performance

Latency budgets, throughput limits, cost tracking

→

LLM

Tool selection accuracy, response quality, multi-turn

→

Skill

Collision detection, activation accuracy, conversation sim

→

Conformance

MCP protocol spec compliance, 25 checks across 9 categories

Meet the Framework Assistant

An autonomous AI agent that handles your entire eval lifecycle. Point it at your plugin — it does the rest.

Deep Scan

Discovers every MCP tool, skill, rule, agent, command, and hook in your plugin. Reads every file to understand what each component does.

Generate Coverage

Writes a comprehensive plugin-eval.yaml covering every component across all 7 layers — static, unit, integration, performance, LLM, skill, and conformance.

Set Up Infrastructure

Creates Docker Compose, test data seeds, .env files, CI workflows, and orchestration scripts. Everything needed for e2e testing.

Run → Fix → Converge

Executes all suites, analyzes failures, fixes YAML and assertions, and re-runs. Iterates up to 5 times per layer until CI passes.

Calibrate Thresholds

After tests pass, tightens CI gates based on actual scores. Ensures quality never regresses — thresholds track real capability.

Stay Proactive

Detects new tools and skills as you add them. Auto-generates tests, updates coverage, and monitors for regressions — without being asked.

Everything You Need

Purpose-built capabilities for Cursor plugin quality.

🔴

Red-Teaming

10 attack categories including prompt injection and jailbreak

⚡

Smart Gen

AI-powered test generation from tool schemas

🎯

Prompt Optimization

Auto-improve tool descriptions for better selection

💬

Conversation Sim

Multi-turn dialogue testing with synthetic users

📊

Monitoring

Real-time quality dashboards and trend tracking

💰

Cost Advisor

Token usage analysis and budget enforcement

📉

Regression Detection

Catch quality drops before they ship

🛡️

Guardrails

Safety boundaries with automated enforcement

🔗

External Eval

Evaluate any plugin without committing eval files to the target repo

🔬

Prompt Sensitivity

Measure robustness to prompt variations

💥

Collision Detection

Find overlapping skills before they confuse the LLM

🔍

Trace Viewer

Visual execution traces for debugging failures

🤖

Model Comparison

Test across GPT, Claude, and open-source models

📦

Dataset Management

Version, share, and reuse evaluation datasets

🔔

Notifications

Slack and webhook alerts on quality changes

📼

Fixture Record/Replay

Deterministic offline testing with recorded calls

🗄️

LLM Cache

Semantic caching for fast, cheap re-runs

🔌

7 Adapters

MCP, plain-llm, cursor-cli, claude-sdk, gemini-cli, headless-coder, otel-trace

✅

35 Evaluators

19 deterministic + 16 LLM-as-judge + multi-judge panel

🚦

CI Thresholds

Quality gates that block merges below your bar

🏆

Quality Score

A single A–F grade for every eval run

📈

pass@k / pass^k Metrics

Industry-standard probabilistic metrics from Anthropic's agent eval methodology. Named presets for smoke, reliable, and regression testing.
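As a rough illustration of what these two metrics measure, the sketch below uses the combinatorial estimators commonly cited for pass@k (at least one of k sampled trials passes) and pass^k (all k sampled trials pass). It assumes n recorded trials with c passes; the function names are hypothetical, and whether cursor-plugin-evals uses these exact formulas is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials passes) from c passes in n trials."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k draws (without replacement) are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials pass) from c passes in n trials."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Chance that all k draws (without replacement) come from the c passes.
    return comb(c, k) / comb(n, k)
```

With 7 passes out of 10 trials and k = 3, pass@3 is about 0.992 while pass^3 is about 0.292, which is why pass^k is the stricter choice when consistency matters.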

🔬

Ablation Testing

Prove your skill adds value with statistical A/B comparison using Welch's t-test.
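For readers unfamiliar with Welch's t-test, here is a minimal sketch of the statistic itself, assuming two lists of per-run scores (with and without the skill). The function name is hypothetical and this is not the framework's implementation. It returns the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value needs a t-distribution CDF, for example `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (na - 1) + se2_b**2 / (nb - 1))
    return t, df
```

A positive t with a small p-value would suggest the first sample (say, runs with the skill enabled) scores higher than the second by more than run-to-run noise explains.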

🪄

Zero-Config Skill Eval

Auto-generate eval.yaml from SKILL.md — just point and run. AI-powered recommendations improve your config after every run.

💰

LLM Cost Optimization

Judge caching, per-evaluator model tiers, call deduplication, and pre-run cost estimation. Cut eval costs by 50-80%.

🔎

ES|QL Evaluators

Execution-based scoring: run generated queries against live Elasticsearch, match patterns with equivalence classes, compare result sets against golden queries.

📡

Elastic OTEL Export

Send eval traces to Elasticsearch via OpenTelemetry for production observability dashboards.

📸

Snapshot Testing

Capture and compare tool responses with configurable sanitizers for timestamps, UUIDs, and IDs.
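The idea behind sanitizers can be sketched in a few lines: volatile values are replaced with stable placeholders before two responses are compared. The patterns, placeholder strings, and function name below are illustrative assumptions, not the framework's built-in sanitizers or configuration format.

```python
import re

# Hypothetical sanitizer table: (pattern, placeholder). UUIDs are replaced first.
SANITIZERS = [
    (
        re.compile(
            r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
            r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
        ),
        "<UUID>",
    ),
    (
        # ISO 8601 timestamps with optional fractional seconds and offset.
        re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"),
        "<TIMESTAMP>",
    ),
]


def sanitize(text: str) -> str:
    """Replace volatile values so snapshots only differ on meaningful content."""
    for pattern, placeholder in SANITIZERS:
        text = pattern.sub(placeholder, text)
    return text
```

Two tool responses that differ only in their request ID and timestamp sanitize to the same string, so the snapshot comparison passes.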

🔎

Deep Trajectory Tracing

Function-level observation with TraceCollector for detailed agent execution analysis.

💯

Cost-Efficiency Score

Single 0-100 score combining quality and cost. Letter grades A-F for every skill.

📝

NL Scorer

Describe what to check in plain English — gets auto-converted to an LLM evaluator.

🛡️

Unicode & YAML Security

Detect bidi overrides, homoglyphs, zero-width chars, YAML tag injection, and anchor bombs.
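To make two of these checks concrete, the sketch below flags bidirectional control characters and zero-width characters in a string. The character sets and function name are assumptions chosen for illustration; homoglyph detection and the YAML checks need different techniques and are not covered here.

```python
import unicodedata

# Bidirectional embedding, override, and isolate controls.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}
# Characters that render with no visible width.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_hidden_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, kind, unicode_name) for each suspicious character."""
    findings = []
    for index, ch in enumerate(text):
        if ch in BIDI_CONTROLS:
            kind = "bidi-control"
        elif ch in ZERO_WIDTH:
            kind = "zero-width"
        else:
            continue
        name = unicodedata.name(ch, f"U+{ord(ch):04X}")
        findings.append((index, kind, name))
    return findings
```

Running this over skill files, rules, and tool descriptions would surface text that displays differently from what the model actually reads.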

📦

20 Benchmark Tests

Pre-built evals for instruction-following, ambiguity handling, multi-step reasoning, and safety.

🎯

Skill Routing Accuracy

Test positive and negative triggers — does the right skill activate for matching prompts?

📋

Description Quality

Score skill descriptions on clarity, specificity, actionability, and uniqueness vs other skills.

📐

Context Budget

Per-skill token estimation with bloat detection. Know exactly how much context each skill costs.

🔄

Skill A/B Testing

Compare two skill versions with Welch's t-test for statistically significant differences.

🧩

Composability Testing

Test multi-skill scenarios for interference, chaining, and compatibility.

📊

CLEAR Framework

Enterprise-grade 5-dimension scoring: Cost, Latency, Efficacy, Assurance, Reliability. Pareto efficiency detection.

🔬

AgentRx Failure Taxonomy

9-category failure diagnosis: plan adherence, tool misuse, hallucination, loops, safety violations + critical step localization.

📡

MCP-Radar Scoring

Per-tool precision/recall/F1, MRR, token waste ratio, planning latency. Research-backed MCP benchmarking.
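The first two metrics named above follow standard definitions, sketched here under simple assumptions: each test case has one expected tool and one selected tool, and MRR is computed from a ranked candidate list per query. Function names and input shapes are hypothetical; token waste ratio and planning latency are not shown.

```python
from collections import Counter


def per_tool_prf(pairs: list[tuple[str, str]]) -> dict[str, tuple[float, float, float]]:
    """Map each tool to (precision, recall, F1) from (expected, selected) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, selected in pairs:
        if expected == selected:
            tp[expected] += 1
        else:
            fp[selected] += 1  # selected tool was called when it should not have been
            fn[expected] += 1  # expected tool was missed
    scores = {}
    for tool in set(tp) | set(fp) | set(fn):
        called = tp[tool] + fp[tool]
        wanted = tp[tool] + fn[tool]
        p = tp[tool] / called if called else 0.0
        r = tp[tool] / wanted if wanted else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[tool] = (p, r, f1)
    return scores


def mean_reciprocal_rank(rankings: list[list[str]], targets: list[str]) -> float:
    """Average of 1/rank of the correct tool; a missing tool contributes 0."""
    if not rankings or len(rankings) != len(targets):
        raise ValueError("rankings and targets must be non-empty and equal in length")
    total = 0.0
    for ranked, target in zip(rankings, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)
```

Low precision for a tool means it is being selected for prompts that belong to other tools; low recall means prompts that should reach it are going elsewhere.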

Battle-Tested in Production

See what comprehensive eval suites look like on real-world plugins.

elastic-cursor-plugin

Elasticsearch, Kibana, and Elastic Cloud — fully integrated into Cursor

The Elastic Cursor Plugin connects Cursor to the entire Elastic stack — Elasticsearch queries, Kibana dashboards, Cloud management, observability setup, security detection rules, and AI assistant agent building. Its evaluation suite is the reference implementation for cursor-plugin-evals.

38 MCP Tools

31 Test Suites

12 Layers

Quality Grade: A

Layers: Static, Unit, Integration, Performance, LLM, Conformance, Security, Chaos, Fuzz, Schema Drift, SAFE-MCP, Multi-Server

View Showcase → Plugin Repo

attack-emulation-cursor-plugin

MITRE ATT&CK emulation with Caldera — security testing for Elastic

Full evaluation suite for a security-focused MCP plugin with 15 tools across 3 packages (infrastructure, Caldera, detection). Includes OWASP adversarial tests, trajectory evaluation, and multi-judge blind assessment.

15 MCP Tools

24 Test Suites

5 Layers

Quality Grade: A

Layers: Static, Unit, Integration, Performance, LLM

View Showcase → Plugin Repo

Quick Start

Up and running in under two minutes.

# Install
npm install cursor-plugin-evals --save-dev

# Initialize config
npx cursor-plugin-evals init

# Run all layers
npx cursor-plugin-evals run

# Run with red-teaming
npx cursor-plugin-evals run --red-team

How We Compare

Built for Cursor plugins, not retrofitted from generic LLM evals.

Capability	cursor-plugin-evals	Skillgrade	DeepEval	Braintrust	MCP-Eval
Autonomous AI assistant	✓	✗	✗	✗	✗
MCP-native protocol testing	✓	✗	✗	✗	✓
Cursor-aware (skills, rules)	✓	✓	✗	✗	✗
Skill collision detection	✓	✗	✗	✗	✗
Red-teaming / guardrails	✓	✗	✓	✗	✗
Fixture record & replay	✓	✗	✗	✗	✗
pass@k / pass^k metrics	✓	✓	✗	✗	✗
Inline script graders	✓	✓	✗	✗	✗
Ablation testing (A/B)	✓	✓	✗	✗	✗
Execution-based evaluators (ES|QL)	✓	✗	✗	✗	✗
7 task adapters	✓	✗	✓	✓	✗
7 testing layers	✓	✗	✗	✗	✗
CI quality gates	✓	✗	✓	✓	✗
Conversation simulation	✓	✗	✓	✗	✗
Cost tracking	✓	✗	✗	✓	✗
Zero-config skill eval	✓	✗	✗	✗	✗