envdash is an open pipeline that connects mock environments, structured tasks, and evolutionary optimization into a loop — so agent skills get measurably better each cycle.
Four components form a closed loop. Each cycle produces a better SKILL.md backed by real evaluation data.
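The loop above can be sketched in miniature. This is a hypothetical illustration, not the envdash API: `score`, `mutate`, and `optimize_skill` are invented names, and the toy scoring stands in for running tasks in the mock environment and averaging rewards.

```python
# Toy sketch of an evolutionary skill-optimization loop.
# All names here are illustrative assumptions, not envdash's real API.

def score(skill: str, tasks: list[str]) -> float:
    # Stand-in for running each task in the mock env; here a task "passes"
    # if the skill text already contains the needed hint.
    return sum(1.0 for t in tasks if t in skill) / len(tasks)

def mutate(skill: str) -> list[str]:
    # Toy mutation: propose one candidate per missing hint.
    hints = ["search before replying", "never delete emails"]
    return [skill + "\n- " + h for h in hints if h not in skill]

def optimize_skill(skill: str, tasks: list[str], generations: int = 3) -> str:
    """Evolve a SKILL.md: mutate, evaluate against tasks, keep the best."""
    best, best_score = skill, score(skill, tasks)
    for _ in range(generations):
        for candidate in mutate(best):
            s = score(candidate, tasks)
            if s > best_score:
                best, best_score = candidate, s
    return best

tasks = ["search before replying", "never delete emails"]
best = optimize_skill("## Email skill", tasks)
```

Each generation, candidates are scored against real task outcomes and only improvements survive, which is what distinguishes this from one-shot self-generated skills.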
The agent works through the gws CLI against the mock API. Skills are mounted at /skills in the Docker container; the agent discovers them through its native skill-loading mechanism.
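A minimal sketch of launching such a container. Only the /skills mount point comes from the text above; the image name (`envdash-task`) and the read-only flag are assumptions for illustration.

```python
# Build a `docker run` command that bind-mounts a local skills directory
# at /skills, where the agent's skill loader looks for SKILL.md files.
# The image name "envdash-task" is a hypothetical placeholder.
import subprocess

def docker_cmd(skills_dir: str, image: str = "envdash-task") -> list[str]:
    return [
        "docker", "run", "--rm",
        "-v", f"{skills_dir}:/skills:ro",  # mount skills read-only
        image,
    ]

cmd = docker_cmd("./skills")
# subprocess.run(cmd, check=True)  # uncomment to actually launch the container
```

Mounting read-only keeps the agent from editing its own skills mid-task; the optimizer rewrites SKILL.md between cycles, not during them.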
instruction.md tells the agent what to do; evaluate.py checks the result deterministically. No LLM-as-judge: programmatic verifiers with safety gates. Deleting work emails = instant -1.0.
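A verifier of this shape might look like the sketch below. The state format (a dict from the mock environment) and the specific recipient/subject checks are invented for illustration; the pattern that is from the text is the ordering: safety gate first, then a deterministic success check.

```python
# Hypothetical evaluate.py sketch. The state shape and the specific
# email fields checked are assumptions, not envdash's real schema.

def evaluate(state: dict) -> float:
    # Safety gate first: destroying work emails fails the task outright.
    if state["deleted_work_emails"]:
        return -1.0
    # Deterministic check: did the agent send the expected reply?
    for msg in state["sent"]:
        if msg["to"] == "boss@corp.example" and "Q3 report" in msg["subject"]:
            return 1.0
    return 0.0

print(evaluate({"deleted_work_emails": ["budget.eml"], "sent": []}))  # -1.0: safety gate fires
```

Because the check is pure Python over environment state, the same task scores identically on every run, so the optimizer can trust score deltas between skill variants.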
Across 84 tasks, 7 models, and 7,308 trajectories — curated skills give agents a substantial boost. But self-generated skills show zero gain. The difference is structured optimization, which is what envdash automates.
Read the SkillsBench paper. Questions about skill optimization, mock environments, or the pipeline are welcome.