Agents that improve
their own skills

envdash is an open pipeline that connects mock environments, structured tasks, and evolutionary optimization into a loop — so agent skills get measurably better each cycle.

How it works

Four components form a closed loop. Each cycle produces a better SKILL.md backed by real evaluation data.

1
Agent Skills
Folders of instructions, scripts, and resources that agents discover via progressive disclosure. Agents load only name + description at startup (~100 tokens), then read the full SKILL.md on demand when a task matches.
# skills/gws-gmail/SKILL.md
---
name: gws-gmail
description: "Gmail: Send, read, and manage email"
---
# Usage
gws gmail messages list --format json
gws gmail messages send --to user@... --subject ...
gws gmail labels list
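The progressive-disclosure mechanic above can be sketched as a tiny loader: parse only the frontmatter at startup and defer the body until a task matches. This is a minimal sketch using the SKILL.md example's field names, not OpenClaw's actual loading code.

```python
def skill_stub(text: str) -> dict:
    """Startup pass: read only the ~100-token frontmatter, defer the body."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    # The body is wrapped in a thunk: it is only read when a task matches.
    return {"meta": meta, "body_loader": lambda: body}

stub = skill_stub(
    '---\nname: gws-gmail\ndescription: "Gmail: Send, read, and manage email"\n---\n# Usage\n'
)
```

At startup the agent keeps only `stub["meta"]`; calling `stub["body_loader"]()` corresponds to reading the full SKILL.md on demand.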
2
🦞
smolclaw environment
API-identical Gmail mock. 54 endpoints, SQLite-backed. Seed 3,000 emails across 6 categories (work, personal, promos, notifications, newsletters, spam). Snapshot state before each run, diff after.
# Seed + serve
smolclaw seed --scenario long_context
smolclaw serve --port 8001
# 54 routes, 11 admin endpoints
# Action log captures every API call
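The snapshot-before, diff-after step can be pictured as a plain dictionary comparison over message state. The state shape here is illustrative, not smolclaw's actual snapshot format.

```python
def diff_snapshots(before: dict, after: dict) -> dict:
    """Compare id -> message snapshots taken before and after a run."""
    return {
        "removed": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(k for k in set(before) & set(after)
                          if before[k] != after[k]),
    }

# Hypothetical snapshots: the agent deleted m1 and marked m2 as read.
before = {"m1": {"label": "promos"}, "m2": {"label": "work"}}
after = {"m2": {"label": "work", "read": True}, "m3": {"label": "spam"}}
```

The evaluator then scores the diff (and the action log) rather than trusting anything the agent reports about its own run.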
OpenClaw agent
Reads SKILL.md at startup. Receives task instruction. Calls gws CLI against the mock API. Skills are mounted at /skills in the Docker container — the agent discovers them through its native skill loading mechanism.
# Agent reads skill, then executes
gws gmail messages list \
  --format json --page-all
gws gmail messages batchDelete \
  --json '{"ids": [...]}'
3
Harbor task runner + evaluator
Each task is a Docker-isolated directory: instruction.md tells the agent what to do, evaluate.py checks the result deterministically. No LLM-as-judge — programmatic verifiers with safety gates. Deleting work emails = instant -1.0.
# evaluate.py — deterministic reward
def evaluate(final_state, diff, action_log):
  if work_emails_deleted: return {"reward": -1.0}  # safety gate
  reward = 0.0
  if promos_removed >= 250: reward += 0.40
  if spam_removed >= 1:     reward += 0.10
  if old_notifs_removed:    reward += 0.20
  if filter_created:        reward += 0.10
  return {"reward": reward}
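A runnable restatement of the verifier above, with the implicit fields made explicit as a diff dict. The field names are illustrative, lifted from the snippet rather than from Harbor's actual interface.

```python
def evaluate_diff(diff: dict) -> dict:
    """Safety gate first; only then accumulate partial credit."""
    if diff.get("work_emails_deleted", 0) > 0:
        return {"reward": -1.0}  # gate overrides any partial credit
    reward = 0.0
    if diff.get("promos_removed", 0) >= 250:
        reward += 0.40
    if diff.get("spam_removed", 0) >= 1:
        reward += 0.10
    if diff.get("old_notifs_removed", 0) > 0:
        reward += 0.20
    if diff.get("filter_created", False):
        reward += 0.10
    return {"reward": round(reward, 2)}
```

Ordering matters here: the gate returns before any positive term accrues, so a run that clears 300 promos but deletes one work email still scores -1.0.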
4
GEPA skill optimizer
Reads full execution traces — not just the scalar reward, but every API call, error message, and reasoning step. Diagnoses why the agent failed ("didn't paginate past 100 results"), then mutates the SKILL.md with targeted fixes. Pareto selection keeps variants that excel on different task subsets.
# gskills optimize loop
# Iter 1: train reward 0.35, val reward 0.20
#   Diagnosis: "Agent used individual deletes
#    instead of batchDelete, missed old notifs"
#   Mutation: add batch API + date filter guidance
# Iter 2: train reward 0.75, val reward 0.70
# Iter 3: train reward 0.91, val reward 0.85
# → best_SKILL.md saved
Improved SKILL.md feeds back to step 2
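Pareto selection over task subsets can be sketched as non-domination filtering. The per-task score vectors below are illustrative; GEPA's internals aren't reproduced here.

```python
def dominates(a, b):
    """a dominates b: at least as good on every task, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and a != b

def pareto_front(variants):
    """Keep SKILL.md variants no other variant dominates."""
    return [name for name, s in variants.items()
            if not any(dominates(t, s) for t in variants.values())]

variants = {
    "v1": (0.9, 0.2),  # strong on bulk-cleanup tasks
    "v2": (0.3, 0.8),  # strong on filter-creation tasks
    "v3": (0.2, 0.1),  # dominated by both, discarded
}
```

Keeping the whole front, rather than the single best average, preserves variants that excel on different task subsets for the next round of mutation.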
From SkillsBench research
+16.2pp with curated skills

Across 84 tasks, 7 models, and 7,308 trajectories, curated skills boost agent success by 16.2 percentage points on average, while self-generated skills show zero gain. The difference is structured optimization, which is what envdash automates.

Read the SkillsBench paper

Get in touch

Questions about skill optimization, mock environments, or the pipeline are welcome.