How We Evaluate 930+ AI Skills: The FAOS Skill Quality Framework

· 8 min read
Frank Luong
Founder & CEO, FAOSX | CIO 100 Asia 2025 | AI & Digital Transformation Leader

When we open-sourced 930+ AI skills for Claude Code, Codex, Gemini CLI, Copilot, and Perplexity, the first question people asked was: "How do you know these skills actually work?"

Fair question. Most skill and prompt libraries have zero quality assurance — someone wrote a prompt, it seemed to work once, and it got committed. That's not how we do it.

Here's how we evaluate, structure, and maintain quality across 930+ skills at FAOS.


The Problem With "Trust Me, It Works"

Most open-source prompt collections suffer from three quality issues:

  1. No structured metadata — just a title and a blob of text. No way to categorize, discover, or validate.
  2. No evaluation criteria — nobody defined what "correct output" looks like, so nobody can test it.
  3. No regression testing — skills rot silently as models change. What worked on GPT-4 may fail on Claude Opus.

We built a system that solves all three.


Layer 1: The Schema — Structure Before Content

Every FAOS skill starts with a strict frontmatter schema (v1.3.0). This isn't optional metadata — it's a validated contract that our build tooling enforces.

---
name: langgraph # Must match folder name (kebab-case)
description: "Production-grade LangGraph patterns for stateful agent
  workflows. Use when implementing multi-step agent graphs."
priority: high # critical | high | medium | low
domain: ai-ml # Must match parent directory
tags: [agents, workflows, state-management, llm]
provider: community # faos | anthropic | microsoft | community
license: Apache-2.0 # Required for non-FAOS skills
source: antigravity-awesome-skills
pattern: tool-wrapper # One of 6 content design patterns
---

What the schema enforces

Our build-skill-index.py validator runs 22 validation rules on every skill:

| Rule | What It Catches |
| --- | --- |
| name-matches-folder | Skill named langchain but in a langchain-v2/ folder |
| domain-matches-path | Skill says domain: backend but lives in security/ |
| description-trigger | Missing "Use when:" phrase — how does the AI know to load it? |
| license-required | External skill without license attribution |
| no-duplicate-names | Two skills claiming the same name |
| metrics-stale | Eval results older than 90 days — needs re-evaluation |

If validation fails, the skill doesn't make it into the index. No exceptions.
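The validator itself isn't published, but the structural rules are easy to picture. Here's a minimal sketch of three of them (name-matches-folder, domain-matches-path, description-trigger) — `validate_skill` and the kebab-case check are hypothetical names, not the real build-skill-index.py API:

```python
import re
from pathlib import Path

KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def validate_skill(frontmatter: dict, skill_path: Path) -> list[str]:
    """Run a subset of structural validation rules; return failure messages."""
    errors = []
    name = frontmatter.get("name", "")
    # name-matches-folder: the skill's name must equal its folder name
    if name != skill_path.parent.name:
        errors.append(f"name-matches-folder: '{name}' != '{skill_path.parent.name}'")
    if not KEBAB_CASE.match(name):
        errors.append(f"name is not kebab-case: '{name}'")
    # domain-matches-path: declared domain must match the parent directory
    domain = frontmatter.get("domain", "")
    if domain != skill_path.parent.parent.name:
        errors.append(f"domain-matches-path: '{domain}' != '{skill_path.parent.parent.name}'")
    # description-trigger: description must tell the agent when to load the skill
    if "use when" not in frontmatter.get("description", "").lower():
        errors.append("description-trigger: missing 'Use when' phrase")
    return errors
```

An empty list means the skill passes this subset; any non-empty result would keep it out of the index.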

Six content design patterns

We don't just categorize skills by topic — we categorize by how the skill structures its logic. Inspired by the emerging agent skills ecosystem, we defined six canonical patterns:

| Pattern | How It Works | Example |
| --- | --- | --- |
| Tool Wrapper | On-demand library/API context. Loads conventions when working with a technology. | langgraph, fastapi-templates, azure-cosmos-db-py |
| Reviewer | Separates WHAT to check from HOW. Stores a rubric, agent scores against it. | code-review, owasp-top10, api-security-patterns |
| Generator | Enforces consistent output format. Forces step-by-step execution. | deep-research, release-notes, adr-template |
| Inversion | Flips agent/user dynamic — agent interviews user before acting. | requirements-elicitation, discovery-prep, pricing-strategy |
| System | Background governance rules. Always loaded, never user-invoked. | naming-conventions, license-compliance, coding-standards |
| Composite | Combines multiple patterns. Documents which sub-patterns and why. | full-stack-review (reviewer + generator) |

Knowing the pattern matters because it determines how the skill should be tested.


Layer 2: The Eval Framework — Automated Quality Testing

Every skill can (and critical/high-priority skills should) have an evals/eval-spec.yaml that defines test cases. Here's a real example from our LangGraph skill:

version: "1.0"
skill: "langgraph"

evals:
  - id: "langgraph-eval-01"
    prompt: "Build a simple ReAct-style agent with LangGraph that has
      a tool-calling loop."
    success_criteria:
      - type: contains
        value: "StateGraph"
      - type: regex
        pattern: "(add_node|add_edge)"
      - type: regex
        pattern: "(tool_calls|ToolNode)"
      - type: llm_judge
        rubric: "Does the response build a LangGraph agent with
          StateGraph, define agent and tool nodes, add conditional
          edges for the tool loop, and compile the graph?"
        judge_model: "claude-sonnet-4-6"

Four types of success criteria

| Type | What It Does | When to Use |
| --- | --- | --- |
| contains | Checks if output contains an exact string | Must-have API names, function calls, patterns |
| not_contains | Checks that output does NOT contain a string | Catch "I don't know" or deprecated patterns |
| regex | Pattern matching with flags | Flexible structural checks |
| llm_judge | A separate LLM call evaluates the output against a rubric | Semantic quality — "Does this actually make sense?" |

The key insight: deterministic checks (contains, regex) catch structural issues. LLM judges catch semantic issues. You need both.
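To make the split concrete, here's a minimal sketch of the deterministic side, assuming criterion dicts shaped like the YAML above. `check_criterion` and `pass_rate` are hypothetical names, and the llm_judge branch is stubbed out because it needs a real model call:

```python
import re

def check_criterion(criterion: dict, output: str) -> bool:
    """Evaluate one success criterion against a model response."""
    kind = criterion["type"]
    if kind == "contains":
        return criterion["value"] in output
    if kind == "not_contains":
        return criterion["value"] not in output
    if kind == "regex":
        return re.search(criterion["pattern"], output) is not None
    if kind == "llm_judge":
        # Semantic check: would send the rubric plus output to a judge model.
        raise NotImplementedError("llm_judge requires a model call")
    raise ValueError(f"unknown criterion type: {kind}")

def pass_rate(criteria: list[dict], output: str) -> float:
    """Fraction of deterministic criteria the output satisfies."""
    results = [check_criterion(c, output) for c in criteria
               if c["type"] != "llm_judge"]
    return sum(results) / len(results) if results else 0.0
```

Running the three deterministic criteria from the LangGraph spec against a response that builds a StateGraph with add_node and ToolNode would yield a pass rate of 1.0.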

The A/B comparison mode

Our eval runner has a --compare flag that runs each eval case twice — once with the skill loaded as a system prompt, once without. This measures the comparison_delta: how much better does the AI perform with the skill vs. without it?

python scripts/run-skill-evals.py --skill fastapi-pro --compare

If a skill doesn't measurably improve output quality, it shouldn't exist. We've deprecated skills that scored negative deltas — they were actively making the AI worse by adding irrelevant context.
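The runner's exact math isn't published, but the delta reads naturally as a mean pass-rate difference across eval cases. A sketch, with `comparison_delta` as a hypothetical helper:

```python
def comparison_delta(with_skill: list[float], without_skill: list[float]) -> float:
    """Mean pass-rate improvement when the skill is loaded vs. a bare prompt.

    Positive means the skill helps; negative means it adds noise and the
    skill is a candidate for deprecation.
    """
    loaded = sum(with_skill) / len(with_skill)
    base = sum(without_skill) / len(without_skill)
    return loaded - base
```

For example, pass rates of [0.95, 0.85] with the skill against [0.80, 0.70] without give a delta of 0.15, matching the 15% figure shown in the metrics example.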

Metrics tracking in frontmatter

After eval runs, results get written back into the skill's frontmatter:

metrics:
  eval_count: 5
  last_eval_pass_rate: 0.95
  last_eval_date: "2026-03-08"
  last_eval_model: "claude-haiku-4-5"
  avg_tokens: 1200
  comparison_delta: 0.15  # 15% improvement with skill loaded

This creates a living quality record — you can see at a glance whether a skill has been tested, when, on which model, and whether it's actually useful.


Layer 3: Priority-Based Coverage — Test What Matters

With 930+ skills, we can't evaluate everything equally. Our coverage strategy is priority-driven:

| Priority | When to Assign | Eval Requirement |
| --- | --- | --- |
| Critical | Security, compliance, blocking dependencies | Mandatory eval spec. Must pass before export. |
| High | Core patterns used weekly by agents | Should have eval spec. Coverage gap flagged. |
| Medium | Useful but not essential | Optional eval spec. |
| Low | Reference/niche skills | No eval required. |

Our EVAL-COVERAGE.md is auto-generated and tracks this:

  • Critical skills with evals: 33%
  • High skills with evals: 3.9%
  • Total coverage: 2.3% (16 of 696 skills at last count)

We're honest about this: coverage is low. We prioritize critical and high-priority skills first, and we're building it out. The framework is mature — the coverage is what we're scaling.


Layer 4: Cross-Platform Export — One Source, Five Formats

Skills are authored once in our canonical format and exported to five platforms:

.faos/custom/skills/{domain}/{name}/SKILL.md (source of truth)

▼ export-skills.py

├── skills/cowork/ → Claude Code (SKILL.md)
├── skills/codex/ → OpenAI Codex (SKILL.md + openai.yaml)
├── skills/gemini/ → Gemini CLI (TOML commands)
├── skills/copilot/ → GitHub Copilot (.instructions.md)
└── skills/perplexity/ → Perplexity Computer (SKILL.md)

Each platform has format-specific requirements:

  • Gemini CLI needs TOML with specific command metadata
  • GitHub Copilot needs applyTo glob patterns in YAML frontmatter
  • OpenAI Codex needs a companion openai.yaml for UI metadata

The export pipeline handles all of this. Authors write one skill, the tooling handles the rest.
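As a sketch of what that dispatch might look like — the registry below is an assumption, not the real export-skills.py (which may lay out directories and name files differently, e.g. for Gemini's TOML commands):

```python
from pathlib import Path

# Hypothetical platform registry; filenames are illustrative only.
PLATFORMS = {
    "cowork":     {"dir": "skills/cowork",     "files": ["SKILL.md"]},
    "codex":      {"dir": "skills/codex",      "files": ["SKILL.md", "openai.yaml"]},
    "gemini":     {"dir": "skills/gemini",     "files": ["{name}.toml"]},
    "copilot":    {"dir": "skills/copilot",    "files": ["{name}.instructions.md"]},
    "perplexity": {"dir": "skills/perplexity", "files": ["SKILL.md"]},
}

def export_targets(name: str, domain: str) -> dict[str, list[Path]]:
    """Map one canonical skill to the files each platform export would produce."""
    return {
        platform: [Path(spec["dir"]) / domain / name / f.format(name=name)
                   for f in spec["files"]]
        for platform, spec in PLATFORMS.items()
    }
```

The point of the shape: adding a sixth platform is one registry entry plus a formatter, not a change to any skill.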


Layer 5: The Claude Code Bridge — Skills That Load Themselves

For skills that are actively used in our development workflow, we have a bridge that exports them as native Claude Code skills:

claude_code:
  enabled: true
  tier: slash-command  # User-invocable via /name
  allowed_tools: [Read, Grep, Glob, Bash]
  argument_hint: "[file-or-pr-url]"

Three tiers control how skills surface in Claude Code:

| Tier | Who Invokes | In / Menu | Use Case |
| --- | --- | --- | --- |
| slash-command | User types /name | Yes | Code review, deploy, story creation |
| auto-load | Claude discovers it | Yes | Patterns, conventions, principles |
| background | Claude only | No | Compliance rules, domain knowledge |

There's a token budget constraint: Claude Code allocates ~16K characters for skill descriptions. With 930+ skills, we can't load them all. We limit the bridge to ~50-60 skills, selected by priority.
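Priority-based selection under a character budget can be sketched as a greedy pick. `select_for_bridge` and the exact budget accounting are assumptions, not the bridge's actual code:

```python
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def select_for_bridge(skills: list[dict], char_budget: int = 16_000) -> list[dict]:
    """Greedily pick skills by priority until their descriptions fill the budget.

    `skills` is a list of frontmatter dicts with name, priority, description.
    """
    ranked = sorted(skills, key=lambda s: PRIORITY_RANK[s["priority"]])
    chosen, used = [], 0
    for skill in ranked:
        cost = len(skill["description"])
        if used + cost > char_budget:
            continue  # skip skills whose description would overflow the budget
        chosen.append(skill)
        used += cost
    return chosen
```

With a hard budget, every critical skill admitted displaces lower-priority ones, which is exactly the trade-off the ~50-60 skill cap encodes.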


What We Learned

Building an evaluation framework for 930+ skills taught us a few things:

1. Schema enforcement is non-negotiable. Without structured metadata, skills become an unsearchable pile of markdown. The 22 validation rules catch issues before they ship.

2. LLM judges are necessary but not sufficient. Deterministic checks (contains, regex) are fast, cheap, and repeatable. LLM judges catch semantic quality issues but cost money and can be non-deterministic. Use both.

3. Comparison mode is the real test. A skill that doesn't measurably improve output quality is dead weight. The comparison_delta metric is the most honest signal we have.

4. Coverage is a journey. We're at 2.3% eval coverage. That sounds bad, but it's 2.3% more than most skill libraries have. Critical skills are covered first, and we're building outward.

5. Content design patterns matter. A tool-wrapper skill needs different eval criteria than a reviewer skill. Knowing the pattern tells you how to test it.


Try It Yourself

The skills are open-source. The eval framework is part of our internal tooling, but the eval spec format is straightforward — you could build your own runner against it.

Get the skills: github.com/frank-luongt/faos-skills-marketplace

Want to contribute an eval? We'd love it. Open a PR with an evals/eval-spec.yaml for any skill. Start with the format above — contains, regex, and llm_judge criteria. We'll run it through our pipeline and update the metrics.


What's Next

This is part of our 4-layer open-source strategy. Skills are Layer 1. Coming soon:

  • Layer 2 (May): 50+ workflow templates — how to orchestrate multi-step AI work
  • Layer 3 (June): 22 industry ontology blueprints — domain intelligence for AI agents
  • Layer 4 (July): MCP server — connect your AI assistant to project intelligence

Skills tell your AI what. The eval framework tells us whether it worked.


Built by the FAOS team. If you're building AI skills and want to talk about evaluation approaches, reach out.