How We Evaluate 930+ AI Skills: The FAOS Skill Quality Framework
When we open-sourced 930+ AI skills for Claude Code, Codex, Gemini CLI, Copilot, and Perplexity, the first question people asked was: "How do you know these skills actually work?"
Fair question. Most skill and prompt libraries have zero quality assurance — someone wrote a prompt, it seemed to work once, and it got committed. That's not how we do it.
Here's how we evaluate, structure, and maintain quality across 930+ skills at FAOS.
The Problem With "Trust Me, It Works"
Most open-source prompt collections suffer from three quality issues:
- No structured metadata — just a title and a blob of text. No way to categorize, discover, or validate.
- No evaluation criteria — nobody defined what "correct output" looks like, so nobody can test it.
- No regression testing — skills rot silently as models change. What worked on GPT-4 may fail on Claude Opus.
We built a system that solves all three.
Layer 1: The Schema — Structure Before Content
Every FAOS skill starts with a strict frontmatter schema (v1.3.0). This isn't optional metadata — it's a validated contract that our build tooling enforces.
```yaml
---
name: langgraph          # Must match folder name (kebab-case)
description: "Production-grade LangGraph patterns for stateful agent
  workflows. Use when implementing multi-step agent graphs."
priority: high           # critical | high | medium | low
domain: ai-ml            # Must match parent directory
tags: [agents, workflows, state-management, llm]
provider: community      # faos | anthropic | microsoft | community
license: Apache-2.0      # Required for non-FAOS skills
source: antigravity-awesome-skills
pattern: tool-wrapper    # One of 6 content design patterns
---
```
What the schema enforces
Our build-skill-index.py validator runs 22 validation rules on every skill:
| Rule | What It Catches |
|---|---|
| `name-matches-folder` | Skill named `langchain` but in a `langchain-v2/` folder |
| `domain-matches-path` | Skill says `domain: backend` but lives in `security/` |
| `description-trigger` | Missing the "Use when:" trigger phrase, so the AI has no cue to load it |
| `license-required` | External skill without license attribution |
| `no-duplicate-names` | Two skills claiming the same name |
| `metrics-stale` | Eval results older than 90 days; needs re-evaluation |
If validation fails, the skill doesn't make it into the index. No exceptions.
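To make the idea concrete, a couple of these rules can be sketched in a few lines of Python. This is a hypothetical simplification, not the actual build-skill-index.py: the naive `load_frontmatter` parser stands in for a real YAML library, and only four of the 22 rules are shown.

```python
from pathlib import Path

def load_frontmatter(skill_md: Path) -> dict:
    """Naive 'key: value' parser for the demo; real tooling would use PyYAML."""
    _, fm, _ = skill_md.read_text(encoding="utf-8").split("---", 2)
    meta = {}
    for line in fm.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta

def validate(skill_md: Path) -> list[str]:
    """Return rule violations for a few of the checks (empty list = pass)."""
    fm = load_frontmatter(skill_md)
    errors = []
    # name-matches-folder: skill name must equal its folder name
    if fm.get("name") != skill_md.parent.name:
        errors.append("name-matches-folder")
    # domain-matches-path: declared domain must equal the parent directory
    if fm.get("domain") != skill_md.parent.parent.name:
        errors.append("domain-matches-path")
    # description-trigger: description must carry a "Use when" phrase
    if "Use when" not in fm.get("description", ""):
        errors.append("description-trigger")
    # license-required: non-FAOS skills need license attribution
    if fm.get("provider") != "faos" and not fm.get("license"):
        errors.append("license-required")
    return errors
```

A skill that passes returns an empty list; anything else names the violated rules, which is what lets the index build fail loudly instead of silently shipping a broken skill.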
Six content design patterns
We don't just categorize skills by topic — we categorize by how the skill structures its logic. Inspired by the emerging agent skills ecosystem, we defined six canonical patterns:
| Pattern | How It Works | Example |
|---|---|---|
| Tool Wrapper | On-demand library/API context. Loads conventions when working with a technology. | langgraph, fastapi-templates, azure-cosmos-db-py |
| Reviewer | Separates WHAT to check from HOW. Stores a rubric, agent scores against it. | code-review, owasp-top10, api-security-patterns |
| Generator | Enforces consistent output format. Forces step-by-step execution. | deep-research, release-notes, adr-template |
| Inversion | Flips agent/user dynamic — agent interviews user before acting. | requirements-elicitation, discovery-prep, pricing-strategy |
| System | Background governance rules. Always loaded, never user-invoked. | naming-conventions, license-compliance, coding-standards |
| Composite | Combines multiple patterns. Documents which sub-patterns and why. | full-stack-review (reviewer + generator) |
Knowing the pattern matters because it determines how the skill should be tested.
Layer 2: The Eval Framework — Automated Quality Testing
Every skill can (and critical/high-priority skills should) have an evals/eval-spec.yaml that defines test cases. Here's a real example from our LangGraph skill:
```yaml
version: "1.0"
skill: "langgraph"
evals:
  - id: "langgraph-eval-01"
    prompt: "Build a simple ReAct-style agent with LangGraph that has
      a tool-calling loop."
    success_criteria:
      - type: contains
        value: "StateGraph"
      - type: regex
        pattern: "(add_node|add_edge)"
      - type: regex
        pattern: "(tool_calls|ToolNode)"
      - type: llm_judge
        rubric: "Does the response build a LangGraph agent with
          StateGraph, define agent and tool nodes, add conditional
          edges for the tool loop, and compile the graph?"
        judge_model: "claude-sonnet-4-6"
```
Four types of success criteria
| Type | What It Does | When to Use |
|---|---|---|
| `contains` | Checks that output contains an exact string | Must-have API names, function calls, patterns |
| `not_contains` | Checks that output does NOT contain a string | Catch "I don't know" or deprecated patterns |
| `regex` | Pattern matching with flags | Flexible structural checks |
| `llm_judge` | A separate LLM call evaluates the output against a rubric | Semantic quality: "Does this actually make sense?" |
The key insight: deterministic checks (contains, regex) catch structural issues. LLM judges catch semantic issues. You need both.
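The deterministic criteria are simple enough to sketch in one function. This is a hypothetical `check` helper, not our actual runner; the `llm_judge` branch is stubbed out because it requires a model call.

```python
import re

def check(criterion: dict, output: str) -> bool:
    """Apply one success criterion from an eval spec to a model output."""
    kind = criterion["type"]
    if kind == "contains":
        return criterion["value"] in output
    if kind == "not_contains":
        return criterion["value"] not in output
    if kind == "regex":
        return re.search(criterion["pattern"], output) is not None
    if kind == "llm_judge":
        # Would send the rubric plus the output to the judge model.
        raise NotImplementedError("llm_judge requires a model call")
    raise ValueError(f"unknown criterion type: {kind}")
```

Running the LangGraph spec's deterministic checks against a candidate output is then just a loop over `success_criteria`, with `llm_judge` entries routed to a separate, slower path.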
The A/B comparison mode
Our eval runner has a --compare flag that runs each eval case twice — once with the skill loaded as a system prompt, once without. This measures the comparison_delta: how much better does the AI perform with the skill vs. without it?
```bash
python scripts/run-skill-evals.py --skill fastapi-pro --compare
```
If a skill doesn't measurably improve output quality, it shouldn't exist. We've deprecated skills that scored negative deltas — they were actively making the AI worse by adding irrelevant context.
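The delta itself is a simple difference of averages. A minimal sketch, assuming each eval case yields a pass rate in [0, 1] (the function name `comparison_delta` mirrors the metric, but this is an illustration, not our runner's code):

```python
def comparison_delta(with_skill: list[float], without_skill: list[float]) -> float:
    """Mean pass-rate improvement when the skill is loaded as a system prompt.

    Positive: the skill helps. Negative: it adds noise and should be deprecated.
    """
    mean = lambda xs: sum(xs) / len(xs)
    return round(mean(with_skill) - mean(without_skill), 4)
```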
Metrics tracking in frontmatter
After eval runs, results get written back into the skill's frontmatter:
```yaml
metrics:
  eval_count: 5
  last_eval_pass_rate: 0.95
  last_eval_date: "2026-03-08"
  last_eval_model: "claude-haiku-4-5"
  avg_tokens: 1200
  comparison_delta: 0.15   # 15% improvement with skill loaded
```
This creates a living quality record — you can see at a glance whether a skill has been tested, when, on which model, and whether it's actually useful.
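The last_eval_date field is also what drives the metrics-stale validation rule mentioned earlier: results older than 90 days flag the skill for re-evaluation. A sketch of that check (hypothetical `is_stale` helper, not the validator's actual code):

```python
from datetime import date, timedelta

def is_stale(last_eval_date: str, max_age_days: int = 90) -> bool:
    """Flag skills whose eval results are older than the freshness window."""
    last = date.fromisoformat(last_eval_date)  # e.g. "2026-03-08"
    return date.today() - last > timedelta(days=max_age_days)
```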
Layer 3: Priority-Based Coverage — Test What Matters
With 930+ skills, we can't evaluate everything equally. Our coverage strategy is priority-driven:
| Priority | When to Assign | Eval Requirement |
|---|---|---|
| Critical | Security, compliance, blocking dependencies | Mandatory eval spec. Must pass before export. |
| High | Core patterns used weekly by agents | Should have eval spec. Coverage gap flagged. |
| Medium | Useful but not essential | Optional eval spec. |
| Low | Reference/niche skills | No eval required. |
Our EVAL-COVERAGE.md is auto-generated and tracks this:
- Critical skills with evals: 33%
- High skills with evals: 3.9%
- Total coverage: 2.3% (16 of 696 skills at last count)
We're honest about this: coverage is low. We prioritize critical and high-priority skills first, and we're building it out. The framework is mature — the coverage is what we're scaling.
Layer 4: Cross-Platform Export — One Source, Five Formats
Skills are authored once in our canonical format and exported to five platforms:
```text
.faos/custom/skills/{domain}/{name}/SKILL.md   (source of truth)
                │
                ▼ export-skills.py
                │
                ├── skills/cowork/     → Claude Code (SKILL.md)
                ├── skills/codex/      → OpenAI Codex (SKILL.md + openai.yaml)
                ├── skills/gemini/     → Gemini CLI (TOML commands)
                ├── skills/copilot/    → GitHub Copilot (.instructions.md)
                └── skills/perplexity/ → Perplexity Computer (SKILL.md)
```
Each platform has format-specific requirements:
- Gemini CLI needs TOML with specific command metadata
- GitHub Copilot needs `applyTo` glob patterns in YAML frontmatter
- OpenAI Codex needs a companion `openai.yaml` for UI metadata
The export pipeline handles all of this. Authors write one skill, the tooling handles the rest.
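The fan-out step of such a pipeline can be sketched as a mapping from platform to output filename. The directory layout and per-platform filenames below are illustrative assumptions, not the exact output of export-skills.py:

```python
from pathlib import Path

# Hypothetical target map; "{name}" is substituted with the skill name.
EXPORT_TARGETS = {
    "cowork": "SKILL.md",                   # Claude Code
    "codex": "SKILL.md",                    # plus an openai.yaml companion
    "gemini": "{name}.toml",                # Gemini CLI TOML command
    "copilot": "{name}.instructions.md",    # GitHub Copilot instructions
    "perplexity": "SKILL.md",               # Perplexity Computer
}

def export_paths(name: str, out_root: Path = Path("skills")) -> list[Path]:
    """Compute where one canonical skill lands for each platform."""
    return [
        out_root / platform / name / fname.format(name=name)
        for platform, fname in EXPORT_TARGETS.items()
    ]
```

With this shape, adding a sixth platform is one dictionary entry plus a format converter, which is the point of authoring skills once in a canonical format.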
Layer 5: The Claude Code Bridge — Skills That Load Themselves
For skills that are actively used in our development workflow, we have a bridge that exports them as native Claude Code skills:
```yaml
claude_code:
  enabled: true
  tier: slash-command          # User-invocable via /name
  allowed_tools: [Read, Grep, Glob, Bash]
  argument_hint: "[file-or-pr-url]"
```
Three tiers control how skills surface in Claude Code:
| Tier | Who Invokes | In `/` Menu | Use Case |
|---|---|---|---|
| `slash-command` | User types `/name` | Yes | Code review, deploy, story creation |
| `auto-load` | Claude discovers it | Yes | Patterns, conventions, principles |
| `background` | Claude only | No | Compliance rules, domain knowledge |
There's a token budget constraint: Claude Code allocates ~16K characters for skill descriptions. With 930+ skills, we can't load them all. We limit the bridge to ~50-60 skills, selected by priority.
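Selecting which skills make the cut is a greedy fill against the character budget. A minimal sketch under assumed inputs (each skill dict carries `priority` and `description`; the function name and greedy strategy are illustrative, not the bridge's actual algorithm):

```python
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def select_for_bridge(skills: list[dict], char_budget: int = 16_000) -> list[dict]:
    """Greedily pick highest-priority skills until the description budget is spent."""
    chosen, used = [], 0
    for skill in sorted(skills, key=lambda s: PRIORITY_RANK[s["priority"]]):
        cost = len(skill["description"])
        if used + cost <= char_budget:
            chosen.append(skill)
            used += cost
    return chosen
```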
What We Learned
Building an evaluation framework for 930+ skills taught us a few things:
1. Schema enforcement is non-negotiable. Without structured metadata, skills become an unsearchable pile of markdown. The 22 validation rules catch issues before they ship.
2. LLM judges are necessary but not sufficient. Deterministic checks (contains, regex) are fast, cheap, and repeatable. LLM judges catch semantic quality issues but cost money and can be non-deterministic. Use both.
3. Comparison mode is the real test. A skill that doesn't measurably improve output quality is dead weight. The comparison_delta metric is the most honest signal we have.
4. Coverage is a journey. We're at 2.3% eval coverage. That sounds bad, but it's 2.3% more than most skill libraries have. Critical skills are covered first, and we're building outward.
5. Content design patterns matter. A tool-wrapper skill needs different eval criteria than a reviewer skill. Knowing the pattern tells you how to test it.
Try It Yourself
The skills are open-source. The eval framework is part of our internal tooling, but the eval spec format is straightforward — you could build your own runner against it.
Get the skills: github.com/frank-luongt/faos-skills-marketplace
Want to contribute an eval? We'd love it. Open a PR with an evals/eval-spec.yaml for any skill. Start with the format above — contains, regex, and llm_judge criteria. We'll run it through our pipeline and update the metrics.
What's Next
This is part of our 4-layer open-source strategy. Skills are Layer 1. Coming soon:
- Layer 2 (May): 50+ workflow templates — how to orchestrate multi-step AI work
- Layer 3 (June): 22 industry ontology blueprints — domain intelligence for AI agents
- Layer 4 (July): MCP server — connect your AI assistant to project intelligence
Skills tell your AI what to do. The eval framework tells us whether it worked.
Built by the FAOS team. If you're building AI skills and want to talk about evaluation approaches, reach out.
