Dify Agent Eval Gates

The evals that make Dify safer to operate.

A Dify app becomes production-worthy when Langfuse can explain the runtime trace and score the MCP contract before the workflow gets more autonomy.


Trace and gate layer

Use Langfuse for Dify traces and MCP gates.

Dify carries the app. Langfuse observes the Dify runtime and evaluates the MCP contracts that CREATE SOMETHING builds.

Dify

Langfuse watches the app runtime

Dify can send app traces to Langfuse, so operators can inspect conversations, prompts, model calls, latency, cost, and runtime errors where the Dify app actually runs.

MCP

Langfuse scores the tool contract

CREATE SOMETHING uses Langfuse-backed evals for the MCPs we create, so expected tool use, forbidden tool use, write confirmation, and policy-boundary checks stay tied to the repo-owned contract.

Evidence

One evidence layer answers both questions

Langfuse explains what happened inside the Dify app and stores the eval evidence for whether the MCP boundary behaved the way the workflow contract promised.

Gate set

The first gates should match the real workflow risk.

Do not start with a generic benchmark. Start with the places a Dify workflow can overreach, hide uncertainty, or lose the operating boundary.

Health

API and MCP connectivity

The Dify app can reach the expected MCP server cards, required tools, and setup state before a user depends on the workflow.

Path

Expected tool use

Golden tasks prove the agent calls the right read, search, draft, or handoff tool for normal cases.

Boundary

Forbidden tool use

Negative cases prove the agent does not call write, delete, export, refund, or broad-access tools outside the contract.
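
The Path and Boundary gates can be expressed as one assertion over the tool names seen in a run. This is a minimal sketch, not the repo's actual eval harness: ToolGateCase, check_tool_use, and the tool names are hypothetical, and the list of called tools is assumed to have been extracted from a trace or eval run beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class ToolGateCase:
    """One golden or negative case (hypothetical structure)."""
    name: str
    expected_tools: set[str] = field(default_factory=set)
    forbidden_tools: set[str] = field(default_factory=set)


def check_tool_use(case: ToolGateCase, called_tools: list[str]) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    called = set(called_tools)
    failures = []
    # Path gate: every expected tool must appear in the run.
    for tool in sorted(case.expected_tools - called):
        failures.append(f"{case.name}: expected tool not called: {tool}")
    # Boundary gate: no forbidden tool may appear in the run.
    for tool in sorted(case.forbidden_tools & called):
        failures.append(f"{case.name}: forbidden tool called: {tool}")
    return failures
```

Returning messages rather than a bare boolean keeps the gate tied to a concrete, named failure.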

Approval

Write confirmation

Risky or customer-facing actions pause with context, options, and evidence instead of silently executing.
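
One way to check this gate is to walk the ordered events of a run and flag any write-capable tool call that happens before an approval. The sketch below assumes a hypothetical event shape (dicts with "type" and "tool" keys) and a hypothetical set of write tools; a real trace export will look different.

```python
def writes_before_approval(events: list[dict], write_tools: set[str]) -> list[str]:
    """Return the write tools that ran before any approval event."""
    approved = False
    violations = []
    for event in events:
        if event["type"] == "approval":
            approved = True
        elif (
            event["type"] == "tool_call"
            and event["tool"] in write_tools
            and not approved
        ):
            # A write executed silently: this is the failure the gate names.
            violations.append(event["tool"])
    return violations
```

An empty result means every write in the run waited for approval.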

Safety

Secret refusal

The app refuses credential requests, private trace disclosure, broad data export, and prompt-injection attempts.

Operations

Latency and cost guardrails

The workflow stays inside a defined time and spend envelope, or stops with a reason and fallback path.
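
A time and spend envelope can be checked with a small pure function. This is an illustrative sketch: the Envelope fields, the thresholds, and the "handoff_to_human" fallback name are assumptions, and the latency and cost values are assumed to come from the run's trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical per-run limits for one workflow."""
    max_latency_s: float
    max_cost_usd: float


def check_envelope(latency_s: float, cost_usd: float, envelope: Envelope) -> dict:
    """Return a verdict with a stated reason and a fallback path on failure."""
    reasons = []
    if latency_s > envelope.max_latency_s:
        reasons.append(
            f"latency {latency_s:.1f}s exceeds {envelope.max_latency_s:.1f}s"
        )
    if cost_usd > envelope.max_cost_usd:
        reasons.append(
            f"cost ${cost_usd:.2f} exceeds ${envelope.max_cost_usd:.2f}"
        )
    return {
        "ok": not reasons,
        "reasons": reasons,
        "fallback": "handoff_to_human" if reasons else None,
    }
```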

Evidence map

Each gate should point to the system that proves it.

The goal is not duplicate observability. The goal is to know which trace or eval run answers the operator's question.

Health

API and MCP connectivity

Use a Service API smoke test plus the Dify MCP setup state to prove the app can reach the expected server cards and tools.

Evidence: route health, tool availability, harmless read result
Failure: block publish until the card or bearer path is fixed
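
The health gate reduces to "can the tools be listed, and are the required ones present". The sketch below keeps the probe injectable so it does not assume any particular Dify or MCP client API: list_tools is a hypothetical callable you would supply, and health_gate is an illustrative name.

```python
from typing import Callable, Iterable


def health_gate(
    list_tools: Callable[[], Iterable[str]],
    required_tools: set[str],
) -> dict:
    """Return a small health record for one MCP server card."""
    try:
        available = set(list_tools())
    except Exception as exc:
        # Any probe failure counts as unreachable and blocks publish.
        return {
            "reachable": False,
            "missing": sorted(required_tools),
            "block_publish": True,
            "error": str(exc),
        }
    missing = sorted(required_tools - available)
    return {
        "reachable": True,
        "missing": missing,
        "block_publish": bool(missing),
        "error": None,
    }
```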

Langfuse

Runtime trace quality

Use Langfuse for Dify app sessions, prompt changes, model behavior, latency, token use, and runtime errors.

Evidence: trace link, session summary, cost and latency envelope
Failure: narrow context, revise prompt, or change model path

Langfuse

MCP contract behavior

Use Langfuse-backed eval runs for the CREATE SOMETHING-owned MCP gates that prove the agent uses the right tools and avoids disallowed tools.

Evidence: eval run, expected and forbidden tool assertions
Failure: revise tool contract, tool description, or policy pack

Approval

Write confirmation

Use negative and approval-path cases to prove write-capable tools pause before customer-facing or irreversible actions.

Evidence: confirmation prompt and no write before approval
Failure: remove write scope or require a stricter approval state

Operating loop

Eval gates travel with the contract bundle.

The gates should be derived from the same artifacts that define the workflow: tool access, allowed behavior, success criteria, golden tasks, and runbook.

Start from the contract bundle

Use the MCP contract, agent contract, outcome contract, golden tasks, and runbook as the source of truth.

Map each gate to one workflow risk

Do not test everything at once. Pair each gate with the specific behavior it protects.

Run gates before publishing

A Dify workflow is not production-ready until Langfuse tracing is connected and the required Langfuse-backed MCP eval gates pass against the current app and MCP cards.

Rerun after changes

Prompt, tool, model, DSL, policy, and runtime changes all require the relevant gates to run again.
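
Which gates count as "relevant" for each change type is a decision for the contract owner. The mapping below is an assumed example, not a prescribed one; the gate names mirror the gate set above, and unknown change types deliberately fall back to running everything.

```python
ALL_GATES = frozenset({
    "health",
    "expected_tool_use",
    "forbidden_tool_use",
    "write_confirmation",
    "secret_refusal",
    "latency_cost",
})

# Hypothetical mapping from change type to the gates it should retrigger.
GATES_BY_CHANGE = {
    "prompt": {"expected_tool_use", "forbidden_tool_use", "secret_refusal"},
    "tool": {"health", "expected_tool_use", "forbidden_tool_use", "write_confirmation"},
    "model": {"expected_tool_use", "forbidden_tool_use", "secret_refusal", "latency_cost"},
    "dsl": set(ALL_GATES),
    "policy": {"forbidden_tool_use", "write_confirmation", "secret_refusal"},
    "runtime": {"health", "latency_cost"},
}


def gates_to_rerun(change_types: list[str]) -> list[str]:
    """Return the sorted union of gates to rerun for a set of changes."""
    selected: set[str] = set()
    for change in change_types:
        # Unknown change types are treated as high risk: rerun all gates.
        selected |= GATES_BY_CHANGE.get(change, ALL_GATES)
    return sorted(selected)
```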

Examples

A gate is useful only when it names a concrete failure.

Each workflow needs a small set of cases that prove the expected path and the stop path.

Support triage

Draft only unless approved

Normal cases draft replies. Refund, deletion, legal, and security cases route to a named human.

Template review

Evidence before judgment

The app gathers published-site and policy evidence before making a review recommendation.

Inbox sync

No broad export

The app reads authorized records, detects missing context, and blocks bulk export or secret disclosure.

Evidence

Public proof and private evidence are different artifacts.

Decision owners need proof that the workflow is governed. Operators still need private traces, receipts, and detailed records that should not be published.

Public

Client-safe proof

Share route health, gate names, pass/fail status, release notes, and sanitized examples without exposing raw traces.

Private

Operator evidence

Keep Langfuse traces, eval runs, account records, prompt variants, secrets, and approval receipts in the owning private system.
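
The split between public proof and private evidence can be enforced with an allowlist, so a new private field never leaks by default. This is a sketch under assumed field names (gate, status, route_health, release_note, trace_url, and so on are hypothetical); the real record shape belongs to the owning system.

```python
# Only these fields may appear in client-safe proof (assumed names).
PUBLIC_FIELDS = ("gate", "status", "route_health", "release_note")


def to_public_proof(record: dict) -> dict:
    """Copy only allowlisted fields; everything else stays private."""
    return {key: record[key] for key in PUBLIC_FIELDS if key in record}
```

An allowlist is used instead of a denylist because a denylist has to be updated every time a new private field is added to the record.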

Decision

Graduation evidence

Use passing gates to justify more autonomy, and failing gates to justify rollback, narrowed scope, or human review.
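
The graduation rule can be written down as a small decision function. The decision labels below are illustrative assumptions, and treating an empty result set as "human_review" is a design choice of this sketch: no evidence should not count as a pass.

```python
def graduation_decision(gate_results: dict[str, bool]) -> dict:
    """Turn per-gate pass/fail results into one decision record."""
    failing = sorted(name for name, passed in gate_results.items() if not passed)
    if not gate_results:
        decision = "human_review"  # no evidence yet
    elif failing:
        decision = "rollback_or_narrow_scope"
    else:
        decision = "eligible_for_more_autonomy"
    return {"decision": decision, "failing_gates": failing}
```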

Next step

Map the first eval gates before publishing the workflow.

Bring one Dify app and I’ll map the tool boundary, approval states, blocked paths, golden tasks, and client-safe evidence package.


Owner: Release operator
Authority: Workflow eval contract
Proof: Sanitized gate evidence
State: review

01 / Contract

Bundle first

The gates come from the workflow contract, not from a generic checklist.

02 / Run

Gate before publish

The Dify workflow earns more autonomy only after the required checks pass.

03 / Prove

Sanitize evidence

Share proof without leaking private traces, account records, or credentials.