How We Evaluate 930+ AI Skills: The FAOS Skill Quality Framework
When we open-sourced 930+ AI skills for Claude Code, Codex, Gemini CLI, Copilot, and Perplexity, the first question people asked was: "How do you know these skills actually work?"
Fair question. Most skill and prompt libraries have zero quality assurance — someone wrote a prompt, it seemed to work once, and it got committed. That's not how we do it.
Here's how we evaluate, structure, and maintain quality across 930+ skills at FAOS.
The Problem With "Trust Me, It Works"
Most open-source prompt collections suffer from three quality issues:
- No structured metadata — just a title and a blob of text. No way to categorize, discover, or validate.
- No evaluation criteria — nobody defined what "correct output" looks like, so nobody can test it.
- No regression testing — skills rot silently as models change. What worked on GPT-4 may fail on Claude Opus.
We built a system that solves all three.
Layer 1: The Schema — Structure Before Content
Every FAOS skill starts with a strict frontmatter schema (v1.3.0). This isn't optional metadata — it's a validated contract that our build tooling enforces.
```yaml
---
name: langgraph          # Must match folder name (kebab-case)
description: "Production-grade LangGraph patterns for stateful agent
  workflows. Use when implementing multi-step agent graphs."
priority: high           # critical | high | medium | low
domain: ai-ml            # Must match parent directory
tags: [agents, workflows, state-management, llm]
provider: community      # faos | anthropic | microsoft | community
license: Apache-2.0      # Required for non-FAOS skills
source: antigravity-awesome-skills
pattern: tool-wrapper    # One of 6 content design patterns
---
```
What the schema enforces
Our build-skill-index.py validator runs 22 validation rules on every skill:
| Rule | What It Catches |
|---|---|
| `name-matches-folder` | Skill named `langchain` but in a `langchain-v2/` folder |
| `domain-matches-path` | Skill says `domain: backend` but lives in `security/` |
| `description-trigger` | Missing the "Use when:" trigger phrase, so the AI has no cue to load it |
| `license-required` | External skill without license attribution |
| `no-duplicate-names` | Two skills claiming the same name |
| `metrics-stale` | Eval results older than 90 days; needs re-evaluation |
If validation fails, the skill doesn't make it into the index. No exceptions.
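To make the idea concrete, a couple of these rules can be sketched in a few lines of Python. This is a hypothetical simplification, not the actual build-skill-index.py: the naive `load_frontmatter` parser stands in for a real YAML library, and only four of the 22 rules are shown.

```python
from pathlib import Path

def load_frontmatter(skill_md: Path) -> dict:
    """Naive 'key: value' parser for the demo; real tooling would use PyYAML."""
    _, fm, _ = skill_md.read_text(encoding="utf-8").split("---", 2)
    meta = {}
    for line in fm.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta

def validate(skill_md: Path) -> list[str]:
    """Return rule violations for a few of the checks (empty list = pass)."""
    fm = load_frontmatter(skill_md)
    errors = []
    # name-matches-folder: skill name must equal its folder name
    if fm.get("name") != skill_md.parent.name:
        errors.append("name-matches-folder")
    # domain-matches-path: declared domain must equal the parent directory
    if fm.get("domain") != skill_md.parent.parent.name:
        errors.append("domain-matches-path")
    # description-trigger: description must carry a "Use when" phrase
    if "Use when" not in fm.get("description", ""):
        errors.append("description-trigger")
    # license-required: non-FAOS skills need license attribution
    if fm.get("provider") != "faos" and not fm.get("license"):
        errors.append("license-required")
    return errors
```

A skill that passes returns an empty list; anything else names the violated rules, which is what lets the index build fail loudly instead of silently shipping a broken skill.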
Six content design patterns
We don't just categorize skills by topic — we categorize by how the skill structures its logic. Inspired by the emerging agent skills ecosystem, we defined six canonical patterns:
| Pattern | How It Works | Example |
|---|---|---|
| Tool Wrapper | On-demand library/API context. Loads conventions when working with a technology. | langgraph, fastapi-templates, azure-cosmos-db-py |
| Reviewer | Separates WHAT to check from HOW. Stores a rubric, agent scores against it. | code-review, owasp-top10, api-security-patterns |
| Generator | Enforces consistent output format. Forces step-by-step execution. | deep-research, release-notes, adr-template |
| Inversion | Flips agent/user dynamic — agent interviews user before acting. | requirements-elicitation, discovery-prep, pricing-strategy |
| System | Background governance rules. Always loaded, never user-invoked. | naming-conventions, license-compliance, coding-standards |
| Composite | Combines multiple patterns. Documents which sub-patterns and why. | full-stack-review (reviewer + generator) |
Knowing the pattern matters because it determines how the skill should be tested.
Layer 2: The Eval Framework — Automated Quality Testing
Every skill can (and critical/high-priority skills should) have an evals/eval-spec.yaml that defines test cases. Here's a real example from our LangGraph skill:
```yaml
version: "1.0"
skill: "langgraph"
evals:
  - id: "langgraph-eval-01"
    prompt: "Build a simple ReAct-style agent with LangGraph that has
      a tool-calling loop."
    success_criteria:
      - type: contains
        value: "StateGraph"
      - type: regex
        pattern: "(add_node|add_edge)"
      - type: regex
        pattern: "(tool_calls|ToolNode)"
      - type: llm_judge
        rubric: "Does the response build a LangGraph agent with
          StateGraph, define agent and tool nodes, add conditional
          edges for the tool loop, and compile the graph?"
        judge_model: "claude-sonnet-4-6"
```
Four types of success criteria
| Type | What It Does | When to Use |
|---|---|---|
| `contains` | Checks that output contains an exact string | Must-have API names, function calls, patterns |
| `not_contains` | Checks that output does NOT contain a string | Catch "I don't know" or deprecated patterns |
| `regex` | Pattern matching with flags | Flexible structural checks |
| `llm_judge` | A separate LLM call evaluates the output against a rubric | Semantic quality: "Does this actually make sense?" |
The key insight: deterministic checks (contains, regex) catch structural issues. LLM judges catch semantic issues. You need both.
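The deterministic criteria are simple enough to sketch in one function. This is a hypothetical `check` helper, not our actual runner; the `llm_judge` branch is stubbed out because it requires a model call.

```python
import re

def check(criterion: dict, output: str) -> bool:
    """Apply one success criterion from an eval spec to a model output."""
    kind = criterion["type"]
    if kind == "contains":
        return criterion["value"] in output
    if kind == "not_contains":
        return criterion["value"] not in output
    if kind == "regex":
        return re.search(criterion["pattern"], output) is not None
    if kind == "llm_judge":
        # Would send the rubric plus the output to the judge model.
        raise NotImplementedError("llm_judge requires a model call")
    raise ValueError(f"unknown criterion type: {kind}")
```

Running the LangGraph spec's deterministic checks against a candidate output is then just a loop over `success_criteria`, with `llm_judge` entries routed to a separate, slower path.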
The A/B comparison mode
Our eval runner has a --compare flag that runs each eval case twice — once with the skill loaded as a system prompt, once without. This measures the comparison_delta: how much better does the AI perform with the skill vs. without it?
```bash
python scripts/run-skill-evals.py --skill fastapi-pro --compare
```
If a skill doesn't measurably improve output quality, it shouldn't exist. We've deprecated skills that scored negative deltas — they were actively making the AI worse by adding irrelevant context.
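The delta itself is a simple difference of averages. A minimal sketch, assuming each eval case yields a pass rate in [0, 1] (the function name `comparison_delta` mirrors the metric, but this is an illustration, not our runner's code):

```python
def comparison_delta(with_skill: list[float], without_skill: list[float]) -> float:
    """Mean pass-rate improvement when the skill is loaded as a system prompt.

    Positive: the skill helps. Negative: it adds noise and should be deprecated.
    """
    mean = lambda xs: sum(xs) / len(xs)
    return round(mean(with_skill) - mean(without_skill), 4)
```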
Metrics tracking in frontmatter
After eval runs, results get written back into the skill's frontmatter:
```yaml
metrics:
  eval_count: 5
  last_eval_pass_rate: 0.95
  last_eval_date: "2026-03-08"
  last_eval_model: "claude-haiku-4-5"
  avg_tokens: 1200
  comparison_delta: 0.15   # 15% improvement with skill loaded
```
This creates a living quality record — you can see at a glance whether a skill has been tested, when, on which model, and whether it's actually useful.
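The last_eval_date field is also what drives the metrics-stale validation rule mentioned earlier: results older than 90 days flag the skill for re-evaluation. A sketch of that check (hypothetical `is_stale` helper, not the validator's actual code):

```python
from datetime import date, timedelta

def is_stale(last_eval_date: str, max_age_days: int = 90) -> bool:
    """Flag skills whose eval results are older than the freshness window."""
    last = date.fromisoformat(last_eval_date)  # e.g. "2026-03-08"
    return date.today() - last > timedelta(days=max_age_days)
```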
Layer 3: Priority-Based Coverage — Test What Matters
With 930+ skills, we can't evaluate everything equally. Our coverage strategy is priority-driven:
| Priority | When to Assign | Eval Requirement |
|---|---|---|
| Critical | Security, compliance, blocking dependencies | Mandatory eval spec. Must pass before export. |
| High | Core patterns used weekly by agents | Should have eval spec. Coverage gap flagged. |
| Medium | Useful but not essential | Optional eval spec. |
| Low | Reference/niche skills | No eval required. |
Our EVAL-COVERAGE.md is auto-generated and tracks this:
- Critical skills with evals: 33%
- High skills with evals: 3.9%
- Total coverage: 2.3% (16 of 696 skills at last count)
We're honest about this: coverage is low. We prioritize critical and high-priority skills first, and we're building it out. The framework is mature — the coverage is what we're scaling.
Layer 4: Cross-Platform Export — One Source, Five Formats
Skills are authored once in our canonical format and exported to five platforms:
```text
.faos/custom/skills/{domain}/{name}/SKILL.md   (source of truth)
                │
                ▼ export-skills.py
                │
                ├── skills/cowork/     → Claude Code (SKILL.md)
                ├── skills/codex/      → OpenAI Codex (SKILL.md + openai.yaml)
                ├── skills/gemini/     → Gemini CLI (TOML commands)
                ├── skills/copilot/    → GitHub Copilot (.instructions.md)
                └── skills/perplexity/ → Perplexity Computer (SKILL.md)
```
Each platform has format-specific requirements:
- Gemini CLI needs TOML with specific command metadata
- GitHub Copilot needs `applyTo` glob patterns in YAML frontmatter
- OpenAI Codex needs a companion `openai.yaml` for UI metadata
The export pipeline handles all of this. Authors write one skill, the tooling handles the rest.
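The fan-out step of such a pipeline can be sketched as a mapping from platform to output filename. The directory layout and per-platform filenames below are illustrative assumptions, not the exact output of export-skills.py:

```python
from pathlib import Path

# Hypothetical target map; "{name}" is substituted with the skill name.
EXPORT_TARGETS = {
    "cowork": "SKILL.md",                   # Claude Code
    "codex": "SKILL.md",                    # plus an openai.yaml companion
    "gemini": "{name}.toml",                # Gemini CLI TOML command
    "copilot": "{name}.instructions.md",    # GitHub Copilot instructions
    "perplexity": "SKILL.md",               # Perplexity Computer
}

def export_paths(name: str, out_root: Path = Path("skills")) -> list[Path]:
    """Compute where one canonical skill lands for each platform."""
    return [
        out_root / platform / name / fname.format(name=name)
        for platform, fname in EXPORT_TARGETS.items()
    ]
```

With this shape, adding a sixth platform is one dictionary entry plus a format converter, which is the point of authoring skills once in a canonical format.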
Layer 5: The Claude Code Bridge — Skills That Load Themselves
For skills that are actively used in our development workflow, we have a bridge that exports them as native Claude Code skills:
```yaml
claude_code:
  enabled: true
  tier: slash-command          # User-invocable via /name
  allowed_tools: [Read, Grep, Glob, Bash]
  argument_hint: "[file-or-pr-url]"
```
Three tiers control how skills surface in Claude Code:
| Tier | Who Invokes | In `/` Menu | Use Case |
|---|---|---|---|
| `slash-command` | User types `/name` | Yes | Code review, deploy, story creation |
| `auto-load` | Claude discovers it | Yes | Patterns, conventions, principles |
| `background` | Claude only | No | Compliance rules, domain knowledge |
There's a token budget constraint: Claude Code allocates ~16K characters for skill descriptions. With 930+ skills, we can't load them all. We limit the bridge to ~50-60 skills, selected by priority.
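Selecting which skills make the cut is a greedy fill against the character budget. A minimal sketch under assumed inputs (each skill dict carries `priority` and `description`; the function name and greedy strategy are illustrative, not the bridge's actual algorithm):

```python
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def select_for_bridge(skills: list[dict], char_budget: int = 16_000) -> list[dict]:
    """Greedily pick highest-priority skills until the description budget is spent."""
    chosen, used = [], 0
    for skill in sorted(skills, key=lambda s: PRIORITY_RANK[s["priority"]]):
        cost = len(skill["description"])
        if used + cost <= char_budget:
            chosen.append(skill)
            used += cost
    return chosen
```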
What We Learned
Building an evaluation framework for 930+ skills taught us a few things:
1. Schema enforcement is non-negotiable. Without structured metadata, skills become an unsearchable pile of markdown. The 22 validation rules catch issues before they ship.
2. LLM judges are necessary but not sufficient. Deterministic checks (contains, regex) are fast, cheap, and repeatable. LLM judges catch semantic quality issues but cost money and can be non-deterministic. Use both.
3. Comparison mode is the real test. A skill that doesn't measurably improve output quality is dead weight. The comparison_delta metric is the most honest signal we have.
4. Coverage is a journey. We're at 2.3% eval coverage. That sounds bad, but it's 2.3% more than most skill libraries have. Critical skills are covered first, and we're building outward.
5. Content design patterns matter. A tool-wrapper skill needs different eval criteria than a reviewer skill. Knowing the pattern tells you how to test it.
Try It Yourself
The skills are open-source. The eval framework is part of our internal tooling, but the eval spec format is straightforward — you could build your own runner against it.
Get the skills: github.com/frank-luongt/faos-skills-marketplace
Want to contribute an eval? We'd love it. Open a PR with an evals/eval-spec.yaml for any skill. Start with the format above — contains, regex, and llm_judge criteria. We'll run it through our pipeline and update the metrics.
What's Next
This is part of our 4-layer open-source strategy. Skills are Layer 1. Coming soon:
- Layer 2 (May): 50+ workflow templates — how to orchestrate multi-step AI work
- Layer 3 (June): 22 industry ontology blueprints — domain intelligence for AI agents
- Layer 4 (July): MCP server — connect your AI assistant to project intelligence
Skills tell your AI what to do. The eval framework tells us whether it worked.
Built by the FAOS team. If you're building AI skills and want to talk about evaluation approaches, reach out.
