Dify Agent Eval Gates

The evals that make Dify safer to operate.

A Dify app becomes production-worthy when its tool use, approval behavior, blocked paths, secret handling, and operating limits are testable before the workflow gets more autonomy.

Health
API and MCP connectivity

The Dify app can reach the expected MCP server cards, required tools, and setup state before a user depends on the workflow.

Path
Expected tool use

Golden tasks prove the agent calls the right read, search, draft, or handoff tool for normal cases.

Boundary
Forbidden tool use

Negative cases prove the agent does not call write, delete, export, refund, or broad-access tools outside the contract.

Gate set

The first gates should match the real workflow risk.

Do not start with a generic benchmark. Start with the places a Dify workflow can overreach, hide uncertainty, or lose the operating boundary.

Health
API and MCP connectivity

The Dify app can reach the expected MCP server cards, required tools, and setup state before a user depends on the workflow.

Path
Expected tool use

Golden tasks prove the agent calls the right read, search, draft, or handoff tool for normal cases.

Boundary
Forbidden tool use

Negative cases prove the agent does not call write, delete, export, refund, or broad-access tools outside the contract.

Approval
Write confirmation

Risky or customer-facing actions pause with context, options, and evidence instead of silently executing.

Safety
Secret refusal

The app refuses credential requests, private trace disclosure, broad data export, and prompt-injection attempts.

Operations
Latency and cost guardrails

The workflow stays inside a defined time and spend envelope, or stops with a reason and fallback path.

Operating loop

Eval gates travel with the contract bundle.

The gates should be derived from the same artifacts that define the workflow: tool access, allowed behavior, success criteria, golden tasks, and runbook.

01
Start from the contract bundle

Use the MCP contract, agent contract, outcome contract, golden tasks, and runbook as the source of truth.

02
Map each gate to one workflow risk

Do not test everything at once. Pair each gate with the specific behavior it protects.

03
Run gates before publishing

A Dify workflow is not production-ready until the required gates pass against the current app and MCP cards.

04
Rerun after changes

Prompt, tool, model, DSL, policy, and runtime changes all require the relevant gates to run again.

Examples

A gate is useful only when it names a concrete failure.

Each workflow needs a small set of cases that prove the expected path and the stop path.

Support triage
Draft only unless approved

Normal cases draft replies. Refund, deletion, legal, and security cases route to a named human.

Template review
Evidence before judgment

The app gathers published-site and policy evidence before making a review recommendation.

Inbox sync
No broad export

The app reads authorized records, detects missing context, and blocks bulk export or secret disclosure.

Evidence

Public proof and private evidence are different artifacts.

Buyers need proof that the workflow is governed. Operators still need private traces, receipts, and detailed records that should not be published.

Public
Client-safe proof

Share route health, gate names, pass/fail status, release notes, and sanitized examples without exposing raw traces.

Private
Operator evidence

Keep detailed traces, account records, prompt variants, secrets, and approval receipts in the owning private system.

Decision
Graduation evidence

Use passing gates to justify more autonomy, and failing gates to justify rollback, narrowed scope, or human review.

Next step

Map the first eval gates before publishing the workflow.

Bring one Dify app and I’ll map the tool boundary, approval states, blocked paths, golden tasks, and client-safe evidence package.

Contract Bundle first

The gates come from the workflow contract, not from a generic checklist.

Run Gate before publish

The Dify workflow earns more autonomy only after the required checks pass.

Prove Sanitize evidence

Share proof without leaking private traces, account records, or credentials.