The evals that make Dify safer to operate.
A Dify app becomes production-worthy when its tool use, approval behavior, blocked paths, secret handling, and operating limits are testable before the workflow gets more autonomy.
The first gates should match the real workflow risk.
Do not start with a generic benchmark. Start with the places where a Dify workflow can overreach, hide uncertainty, or drift outside its operating boundary.
The Dify app can reach the expected MCP server cards, required tools, and setup state before a user depends on the workflow.
Golden tasks prove the agent calls the right read, search, draft, or handoff tool for normal cases.
Negative cases prove the agent does not call write, delete, export, refund, or broad-access tools outside the contract.
Risky or customer-facing actions pause with context, options, and evidence instead of silently executing.
The app refuses credential requests, private trace disclosure, broad data export, and prompt-injection attempts.
The workflow stays inside a defined time and spend envelope, or stops with a reason and fallback path.
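The golden-task and negative-case gates above can be sketched as a check over the tool names observed in one run. This is a minimal illustration, not a Dify or MCP API: the tool names, the `Case` type, and `check_case` are all hypothetical, and it assumes the tool calls have already been extracted from a trace into a plain list.

```python
from dataclasses import dataclass

# Hypothetical tool names; a real contract would list the tools on the MCP cards.
ALLOWED_TOOLS = {"read_record", "search_docs", "draft_reply", "handoff_to_human"}
BLOCKED_TOOLS = {"write_record", "delete_record", "export_all", "issue_refund"}


@dataclass
class Case:
    name: str
    expected_tools: set[str]   # tools the agent must call in this case
    observed_tools: list[str]  # tool names taken from one run (assumed input)


def check_case(case: Case) -> list[str]:
    """Return concrete failures; an empty list means the case passed."""
    failures = []
    called = set(case.observed_tools)
    # Negative check: no blocked tool may appear in the run.
    for tool in sorted(called & BLOCKED_TOOLS):
        failures.append(f"{case.name}: called blocked tool {tool}")
    # Anything not named in the contract at all is also a failure.
    for tool in sorted(called - ALLOWED_TOOLS - BLOCKED_TOOLS):
        failures.append(f"{case.name}: called tool outside the contract {tool}")
    # Golden check: every expected tool must have been called.
    for tool in sorted(case.expected_tools - called):
        failures.append(f"{case.name}: missing expected tool {tool}")
    return failures


golden = Case("order-status question", {"read_record", "draft_reply"},
              ["read_record", "draft_reply"])
negative = Case("refund demand", {"handoff_to_human"},
                ["read_record", "issue_refund"])

for case in (golden, negative):
    print(case.name, check_case(case) or "pass")
```

Returning named failures rather than a bare boolean keeps each gate tied to a specific behavior that went wrong.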
Eval gates travel with the contract bundle.
The gates should be derived from the same artifacts that define the workflow: tool access, allowed behavior, success criteria, golden tasks, and runbook.
Use the MCP contract, agent contract, outcome contract, golden tasks, and runbook as the source of truth.
Do not test everything at once. Pair each gate with the specific behavior it protects.
A Dify workflow is not production-ready until the required gates pass against the current app and MCP cards.
Prompt, tool, model, DSL, policy, and runtime changes all require the relevant gates to run again.
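One way to make the re-run rule concrete is a lookup from change type to the gates it invalidates. The sketch below is an assumption for illustration: the gate names and the mapping itself are hypothetical, and a real mapping would come from the workflow's own contract bundle.

```python
# Hypothetical gate names and mapping; adjust both to the actual contract bundle.
GATES_BY_CHANGE = {
    "prompt": {"golden_tasks", "negative_cases", "refusals"},
    "tool": {"reachability", "golden_tasks", "negative_cases", "approvals"},
    "model": {"golden_tasks", "negative_cases", "approvals", "refusals", "budget"},
    "dsl": {"reachability", "golden_tasks", "approvals"},
    "policy": {"negative_cases", "approvals", "refusals"},
    "runtime": {"reachability", "budget"},
}


def gates_to_rerun(changes):
    """Return the sorted set of gates that must pass again for these changes."""
    unknown = set(changes) - GATES_BY_CHANGE.keys()
    if unknown:
        # Fail loudly: an unclassified change should not skip the gates.
        raise ValueError(f"unclassified change types: {sorted(unknown)}")
    gates = set()
    for change in changes:
        gates |= GATES_BY_CHANGE[change]
    return sorted(gates)


print(gates_to_rerun(["prompt", "runtime"]))
```

Raising on an unknown change type is a deliberate choice here: a change nobody classified is treated as a reason to stop, not a reason to skip testing.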
A gate is useful only when it names a concrete failure.
Each workflow needs a small set of cases that prove the expected path and the stop path.
Normal cases draft replies. Refund, deletion, legal, and security cases route to a named human.
The app gathers published-site and policy evidence before making a review recommendation.
The app reads authorized records, detects missing context, and blocks bulk export or secret disclosure.
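For the support example, the expected path and the stop path can be written as a small routing table plus the cases that exercise it. This is a sketch under stated assumptions: the category labels, owner names, and `route` function are hypothetical, and classifying a request into a category is assumed to happen elsewhere.

```python
# Hypothetical categories and owners; a real runbook would name actual people or queues.
NORMAL_CATEGORIES = {"order_status", "how_to"}
ESCALATION_OWNERS = {
    "refund": "billing-lead",
    "deletion": "privacy-officer",
    "legal": "legal-counsel",
    "security": "security-oncall",
}
DEFAULT_OWNER = "support-lead"


def route(category: str) -> dict:
    """Draft a reply only for known-normal cases; otherwise hand off to a named human."""
    if category in NORMAL_CATEGORIES:
        return {"action": "draft_reply", "owner": None}
    # Unknown categories fail closed: they go to a human, not to a draft.
    owner = ESCALATION_OWNERS.get(category, DEFAULT_OWNER)
    return {"action": "handoff", "owner": owner}


# Each case pairs an input with the path it is expected to take.
CASES = [
    ("order_status", "draft_reply", None),
    ("refund", "handoff", "billing-lead"),
    ("security", "handoff", "security-oncall"),
    ("something_new", "handoff", "support-lead"),
]

for category, action, owner in CASES:
    result = route(category)
    status = "pass" if result == {"action": action, "owner": owner} else "FAIL"
    print(f"{category}: {status}")
```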
Public proof and private evidence are different artifacts.
Buyers need proof that the workflow is governed. Operators still need private traces, receipts, and detailed records that should not be published.
Share route health, gate names, pass/fail status, release notes, and sanitized examples without exposing raw traces.
Keep detailed traces, account records, prompt variants, secrets, and approval receipts in the owning private system.
Use passing gates to justify more autonomy, and failing gates to justify rollback, narrowed scope, or human review.
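The split between public proof and private evidence can be enforced with an allowlist: only named fields leave the private system, so a new private field is withheld by default. The field names and record shape below are hypothetical and stand in for whatever the owning system actually stores.

```python
# Hypothetical field names; only these are ever copied into the shared summary.
PUBLIC_FIELDS = ("gate", "status", "release_note")


def public_summary(private_results):
    """Copy only allowlisted fields from each private gate result."""
    summary = []
    for result in private_results:
        missing = [field for field in PUBLIC_FIELDS if field not in result]
        if missing:
            raise KeyError(f"gate result is missing public fields: {missing}")
        summary.append({field: result[field] for field in PUBLIC_FIELDS})
    return summary


private_results = [
    {
        "gate": "negative_cases",
        "status": "pass",
        "release_note": "Refund tool stays blocked outside the contract.",
        "trace": "<raw trace, kept private>",   # never copied
        "account_id": "<account, kept private>",  # never copied
    },
]

print(public_summary(private_results))
```

An allowlist is used instead of a blocklist so that traces, account records, and any field added later stay private unless someone explicitly chooses to publish them.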
Map the first eval gates before publishing the workflow.
Bring one Dify app and I’ll map the tool boundary, approval states, blocked paths, golden tasks, and client-safe evidence package.
The gates come from the workflow contract, not from a generic checklist.
The Dify workflow earns more autonomy only after the required checks pass.
Share proof without leaking private traces, account records, or credentials.