What is an MCP server in AI applications?

An MCP server is a service that exposes application capabilities — such as product search, cart management, or knowledge retrieval — as tools that a large language model can discover and invoke through the Model Context Protocol. Rather than a traditional API that a developer calls directly, an MCP server is designed to be used by an LLM client: the model interprets user intent, selects the appropriate tool, sends structured arguments, and continues the conversation based on the result. This architecture enables agentic workflows where the model drives multi-step tasks autonomously.
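To make "discover and invoke" concrete: MCP messages are framed as JSON-RPC 2.0, with a tools/list method for discovery and a tools/call method for invocation. The sketch below only builds the two request envelopes a client would send. The tool name search_products and its arguments are hypothetical, and a real client library would also handle the initialization handshake and the transport.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # each request in a session gets its own id


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope, the framing MCP uses."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: the client asks the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invocation: the client forwards the tool name and the structured
# arguments the model chose. Tool name and arguments are hypothetical.
call_tool = jsonrpc_request(
    "tools/call",
    {
        "name": "search_products",
        "arguments": {"query": "running shoes", "min_rating": 4.5},
    },
)

print(json.dumps(call_tool, indent=2))
```

The point for testing is that the name and arguments in the second message are chosen by the model, not by the test author, which is why they need to be validated.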

What is DeepEval used for in MCP testing?

DeepEval is an open source LLM evaluation framework that lets you define conversational test cases, run them against an agent connected to your MCP server, and score the results using metrics such as Hallucination, Faithfulness, Contextual Relevance, and G-Eval criteria. It integrates natively with pytest, which means you can run evaluations using the standard pytest command and export results as JUnit XML with the --junitxml flag — making it compatible with both CI/CD pipelines and test management platforms like TestQuality.

Why are multi-turn tests important for MCP servers?

Multi-turn tests are important because most MCP failures are not visible in a single exchange. A tool may return a correct result in isolation while the model fails to carry that result forward into the next turn — selecting the wrong item, losing cart state, or re-querying when it should reference prior context. Multi-turn test cases simulate the actual user journey: search, select, configure, act, verify. That sequence exposes state continuity failures, tool selection drift, and memory loss that single-turn assertions will never catch.

Do you need an LLM client to test an MCP server?

Yes. An MCP server is designed to be invoked by an LLM through a client — not called directly like a REST endpoint. Testing the server without a configured client and a model capable of tool selection means you are only testing the underlying service, not the actual integration surface. The LLM decision layer — which tool to call, with what arguments, and when — is exactly what MCP testing must exercise. Skip the client and you cannot validate tool selection, argument quality, or session continuity.

What kinds of workflows are best for first-round MCP test coverage?

Start with the user journeys that are highest in business value and most likely to fail under state drift: search-to-cart flows, knowledge retrieval with follow-up questions, order status checks that reference prior tool outputs, and any multi-step workflow where a later turn depends on an earlier result. These expose the failures that matter most to users and are the hardest to catch with protocol-level or UI-level tests alone. Synthetic prompts with no business value make poor first targets because they do not validate the real integration surface.

How does LLM-as-a-judge reduce test code in MCP evaluations?

LLM-as-a-judge replaces handwritten assertions for response quality with a judge model that evaluates whether the output meets a metric — Hallucination, Faithfulness, Contextual Relevance, or a custom G-Eval criterion. Instead of asserting exact strings or rigid JSON structures for every possible response variation, you define what a correct outcome looks like conceptually and let the judge score it. This is practical for MCP testing because natural language responses legitimately vary in phrasing while remaining semantically correct. The tradeoff is that judge model stability must be controlled — changing the judge changes the scores.
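The shape of a judged metric can be sketched in a few lines: a criterion, a judge callable, and a threshold. This is not DeepEval's API. JudgedMetric and stub_judge are hypothetical names, and the stub is a deterministic stand-in so the example runs without a model; a real judge would send the criterion and the output to a pinned judge model and return its score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgedMetric:
    name: str
    criterion: str
    threshold: float
    judge: Callable[[str, str], float]  # (criterion, output) -> score in [0, 1]

    def passes(self, output: str) -> bool:
        # The test only states the criterion; the judge decides the score.
        return self.judge(self.criterion, output) >= self.threshold


def stub_judge(criterion: str, output: str) -> float:
    """Stand-in for a judge-model call (hypothetical).

    It ignores the criterion and scores the share of expected facts
    present in the output, purely so the sketch is runnable offline.
    """
    facts = ["trail runner", "size 42"]
    text = output.lower()
    return sum(fact in text for fact in facts) / len(facts)


cart_confirmation = JudgedMetric(
    name="cart confirmation",
    criterion="The reply confirms the selected shoe and size are in the cart.",
    threshold=1.0,
    judge=stub_judge,
)

# Two phrasings of the same correct outcome, and one incorrect outcome.
print(cart_confirmation.passes("I added the Trail Runner in size 42 to your cart."))
print(cart_confirmation.passes("Your cart now has one pair: Trail Runner, size 42, blue."))
print(cart_confirmation.passes("Your cart is empty."))
```

Because the judge is injected, swapping the judge model is an explicit change in the test setup rather than a hidden one.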

Can local LLMs be used for MCP evaluation?

Yes, but with meaningful resource implications. When you run a local model for both agent behavior and as the LLM-as-a-judge evaluator, compute and memory demand increases substantially — the agent is generating responses while a second model instance is scoring them. Local evaluation is viable for development-environment testing and for teams with data sovereignty requirements, but it is not a free alternative to hosted judge models. Budget for the infrastructure cost before committing to local evaluation at scale, and fix the local model version to keep evaluation scores comparable across runs.
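One way to keep scores comparable is to record a fingerprint of the evaluation setup with every run and only compare runs whose fingerprints match. The sketch below assumes the setup can be described as a plain dictionary; the keys and the model names are hypothetical.

```python
import hashlib
import json


def eval_fingerprint(config: dict) -> str:
    """Short, stable hash of the evaluation setup (judge, agent, settings)."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are comparable only if they used the same evaluation setup."""
    return run_a["fingerprint"] == run_b["fingerprint"]


# Hypothetical setups: only the judge version differs between them.
baseline = {"judge_model": "local-judge", "judge_version": "1.0", "temperature": 0}
upgraded = {"judge_model": "local-judge", "judge_version": "1.1", "temperature": 0}

run_monday = {"fingerprint": eval_fingerprint(baseline), "pass_rate": 0.92}
run_tuesday = {"fingerprint": eval_fingerprint(upgraded), "pass_rate": 0.85}

# False: the score drop may come from the judge change, not the product.
print(comparable(run_monday, run_tuesday))
```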

How should QA teams store and report MCP evaluation results?

MCP evaluation results should be treated as governed release assets, not local script outputs. The recommended workflow is to export JUnit XML from DeepEval via pytest, upload results to TestQuality using the testquality upload_test_run CLI command with the correct --project_name and --plan_name flags, and track pass/fail outcomes within named test runs and cycles. This gives teams historical trend data, cross-build comparison, and defect linkage to GitHub and Jira — the same operational visibility applied to manual and automated test results. Storing results in a test management platform prevents valuable evaluation work from getting stranded in local folders or CI logs.

Test MCP Servers End to End with DeepEval

How to Test MCP Servers with DeepEval

Three-layer Pipeline diagram showing MCP Server Testing with DeepEval, Pytest, and TestQuality CLI

Jose Amoros
June 9, 2026
2:22 pm

MCP server testing is the practice of validating that a Model Context Protocol server exposes the right tools, passes the right context, preserves session state across turns, and returns outputs an LLM can use correctly in real agentic workflows. For QA teams building AI products, this means testing not just API responses but complete tool-driven conversations — search queries, cart updates, order actions — where each turn depends on the one before it. Understanding the underlying MCP architecture helps frame what you are actually validating. A test management platform becomes essential once these evaluations multiply across prompts, tools, and release cycles. Done well, MCP testing catches failures UI tests never see.

At a Glance

A practical guide to evaluating MCP server behavior end to end

The hard part is not exposing a tool. It is proving an LLM can use it correctly over multiple turns.

Approach: Evaluate the full chain — client, LLM, MCP tools, and application state — rather than isolated API calls.

Core tool: DeepEval — an open source LLM evaluation framework that runs via pytest and exports standard JUnit XML.

Example workflow: Search products, select an item, add to cart, verify cart contents — one multi-turn conversational test.

Pipeline: DeepEval outputs JUnit XML via pytest → TestQuality CLI uploads results → TestQuality tracks runs, trends, and defect linkage.

QA angle: Treat conversational evaluations as governed release assets — not disposable scripts in a local folder.

MCP quality depends less on one perfect answer and more on whether the whole conversation remains coherent and actionable across turns.

What is an MCP server, and why is it harder to test than a normal API?

MCP servers are harder to test because they sit inside an LLM-driven interaction loop. You are not only checking whether a backend function works — you are checking whether a client can discover the tool, whether the model chooses it correctly, whether context persists, and whether the returned result supports the next conversational turn.

In a conventional API test, you call an endpoint, inspect a response, and assert a contract. That still matters here, but it is only one layer. An MCP server is typically exposed to an LLM client — a desktop assistant, a coding environment, or an agentic pipeline. The model interprets user intent, selects a tool, sends arguments, receives data, and then continues the conversation based on the result.

That makes MCP testing a system problem, not just a service problem.

A realistic setup often includes:

A frontend or user-facing application.
A backend with business logic.
Retrieval components such as embeddings, vector stores, or a RAG pipeline.
An MCP server that exposes application capabilities as tools.
An MCP client connected through stdio or SSE.
An LLM that decides when and how to invoke those tools.

If any one of those parts misbehaves, the conversation can fail even though the underlying product logic is correct. That is the integration surface QA needs to own.

What should you actually validate in an MCP server test?

You should validate more than tool availability. A solid MCP server test checks whether the right tool was chosen, whether the tool received the right inputs, whether session state carried forward, and whether the final response remained accurate and useful for the user's goal.

For practical QA, the most important checks are:

Tool discovery: Can the client see the MCP tools that should be exposed?
Tool selection: Does the model call the right tool for the user request?
Argument quality: Are the tool inputs complete and relevant?
State continuity: Does the conversation preserve history across turns?
Data correctness: Does the tool return the correct product, cart, or summary data?
Workflow completion: Can a multi-step task finish successfully without manual correction?
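Several of the checks above, tool selection and argument quality in particular, can be run against a recorded trace of tool calls. The sketch below assumes your agent or client can log each call with its turn number. ToolCall, Expectation, and check_trace are hypothetical names, and the sketch simplifies by assuming at most one tool call per turn.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    turn: int
    name: str
    arguments: dict


@dataclass
class Expectation:
    turn: int
    tool: str
    required_args: set = field(default_factory=set)


def check_trace(calls, expectations):
    """Return human-readable problems; an empty list means the trace passed."""
    problems = []
    by_turn = {call.turn: call for call in calls}  # assumes one call per turn
    for exp in expectations:
        call = by_turn.get(exp.turn)
        if call is None:
            problems.append(f"turn {exp.turn}: no tool call, expected {exp.tool}")
            continue
        if call.name != exp.tool:  # tool selection
            problems.append(f"turn {exp.turn}: called {call.name}, expected {exp.tool}")
        missing = exp.required_args - set(call.arguments)  # argument quality
        if missing:
            problems.append(f"turn {exp.turn}: missing arguments {sorted(missing)}")
    return problems


# Hypothetical trace: the agent fetched the cart when it should have added to it.
calls = [
    ToolCall(1, "search_products", {"query": "running shoes"}),
    ToolCall(3, "get_cart", {}),
]
expectations = [
    Expectation(1, "search_products", {"query"}),
    Expectation(3, "add_to_cart", {"product_id", "size", "color"}),
]
problems = check_trace(calls, expectations)
for problem in problems:
    print(problem)
```

A check like this complements a judged metric: the judge scores the outcome, while the trace check points at the turn where the tool chain went wrong.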

This is where LLM application testing differs from browser automation. In Selenium or Playwright, you usually assert visible UI states. In MCP testing, you often need to validate the conversational path that led to the result — not just whether the result looks plausible on the surface.

That is also why teams are moving from single-assertion tests to evaluation-based testing. The official Model Context Protocol overview from Anthropic is worth reading as background because it explains the protocol's role in connecting models to tools and data sources, which is exactly the integration surface this article covers.

How does DeepEval fit into MCP server testing?

DeepEval fits MCP testing by letting you evaluate end-to-end LLM interactions without writing brittle assertion logic for every response variation. Instead of checking each response manually, you define conversational test cases, run the agent against your MCP server, and use an LLM-as-a-judge approach to score the outcome against LLM evaluation metrics such as Hallucination, Faithfulness, Contextual Relevance, and G-Eval criteria.

The useful shift is from low-level scripting to evaluation logic.

Rather than writing dozens of assertions for every possible response variation, you model a conversation such as:

Ask for high-rated running shoes.
Select the first result.
Choose size and color.
Add the item to cart.
Retrieve cart contents.

DeepEval assesses whether the full interaction achieved the intended outcome — not whether the exact phrasing matched a template. This aligns with the broader direction in AI quality work. The OpenAI documentation on evaluations also emphasizes systematic evals over ad hoc spot checks when validating LLM behavior at scale.

In practice, the approach uses:

A configured MCP client.
An agent that invokes the MCP server.
Conversational test cases. DeepEval models a single exchange as an LLMTestCase and a multi-turn conversation as a ConversationalTestCase.
A multi-turn MCP-focused metric in DeepEval.
An LLM judge model to evaluate the result.

How do multi-turn MCP tests work in a realistic shopping flow?

Multi-turn MCP tests work by storing the conversation as a sequence of prompts and expected outcomes, then evaluating whether the agent uses the exposed tools correctly across that sequence. This matters when later turns depend on earlier ones — such as remembering a selected product before adding it to the cart.

A simple shopping example makes the problem concrete.

Imagine a product application where the backend supports semantic search through a RAG pipeline and exposes commerce actions through an MCP server. A realistic test flow looks like this:

Turn 1: Ask for the highest-rated running shoes.
Turn 2: Ask to add the first result to the cart.
Turn 3: Provide required details such as size and color.
Turn 4: Ask for the current cart contents.
Turn 5: Verify the item added matches the prior selection.

That sequence tests several things simultaneously:

The search tool returns relevant products.
The model understands what "the first shoe" refers to across turns.
The add-to-cart tool is called with valid, complete options.
The cart retrieval tool reflects the updated application state.
Conversation history remains intact throughout.

If the memory breaks, tool arguments drift, or the wrong item is selected, the flow fails — even if each individual tool works correctly in isolation. This is fundamentally a context engineering problem: the quality of what the model carries forward across turns determines whether the conversation succeeds or collapses.
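The failure mode described above can be reproduced with a toy model: tools that work correctly in isolation, plus a scripted agent that either carries the search results forward or drops them. Everything here is a hypothetical stand-in (FakeStore for the MCP tools, run_shopping_flow for the agent), intended only to show why the final turn must verify against the earlier selection.

```python
class FakeStore:
    """In-memory stand-in for the tools an MCP server would expose."""

    def __init__(self):
        self.catalog = [
            {"id": "shoe-1", "name": "Trail Runner", "rating": 4.8},
            {"id": "shoe-2", "name": "Road Racer", "rating": 4.6},
        ]
        self.cart = []

    def search_products(self, query):
        # Ignores the query text; returns the catalog ranked by rating.
        return sorted(self.catalog, key=lambda p: p["rating"], reverse=True)

    def add_to_cart(self, product_id, size, color):
        self.cart.append({"product_id": product_id, "size": size, "color": color})
        return {"ok": True}

    def get_cart(self):
        return list(self.cart)


def run_shopping_flow(store, remember_results=True):
    """Scripted agent. Returns True if the cart matches the first search result."""
    memory = {}
    results = store.search_products("highest-rated running shoes")  # turn 1
    if remember_results:
        memory["results"] = results  # context carried into later turns

    # Turns 2 and 3: "add the first one", then size and color.
    # An agent that lost the results guesses an item instead.
    chosen = memory["results"][0] if "results" in memory else store.catalog[-1]
    store.add_to_cart(chosen["id"], size="42", color="blue")

    cart = store.get_cart()  # turn 4
    return cart[0]["product_id"] == results[0]["id"]  # turn 5


print(run_shopping_flow(FakeStore()))                          # context kept
print(run_shopping_flow(FakeStore(), remember_results=False))  # context lost
```

In the second run every tool call succeeds and the cart is not empty, yet the flow fails, which is the case a single-turn check would report as passing.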

What infrastructure do you need before these tests can run?

Before you can run MCP evaluations, you need the server, a client connection, and an LLM agent that can invoke tools. The MCP server does not operate meaningfully by itself in this context: it must be exercised through a configured client and a model capable of deciding when to call the exposed tools.

That distinction is easy to miss, and skipping it is why many first attempts at MCP testing produce inconclusive results.

The essential building blocks are:

MCP server: The service exposing your application capabilities.
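Client transport: Typically stdio or SSE, depending on your implementation. Newer revisions of the MCP specification define Streamable HTTP as the successor to the HTTP+SSE transport.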
LLM agent: The model-driven layer that interprets prompts and invokes tools.
Application under test: The product backend and any retrieval or domain systems behind it.
Evaluation framework: DeepEval metrics and conversational test cases.

If you are missing the client or agent layer, you are not truly testing MCP behavior. You are only testing the underlying service. A protocol test alone cannot confirm that the LLM will pick the right tool or maintain task continuity across a real conversation.

How can you organize MCP test cases so they remain maintainable?

You can keep MCP tests maintainable by expressing them as reusable user journeys, separating conversation data from environment setup, and storing evaluation results like any other governed QA artifact. The goal is to prevent prompt-heavy test suites from turning into untraceable script collections nobody revisits.

A practical structure:

Group tests by workflow, not by tool. Good examples include search flow, cart flow, checkout flow, knowledge retrieval flow, and order summary flow. That reflects how failures are experienced in production, and makes regression reports readable to non-engineers.

Keep conversation turns explicit. Store each user prompt and expected conversational milestone in sequence. State-dependent failures are much easier to diagnose when you can see exactly where the context broke.

Track environment assumptions. If your test expects a local model, seeded catalog data, or a running vector store, document that in the test case metadata. Environment instability is a common source of false failures in LLM evaluation suites.

Preserve evaluation history. Once teams start running these evaluations regularly, a test management platform matters. One way to handle this in TestQuality is to store conversational MCP scenarios as test cases, execute them in runs, and associate pass or fail outcomes with release cycles. For setup and project organization basics, the TestQuality documentation covers how projects, test cases, runs, and reports are structured.

This becomes especially important when a single prompt change can affect dozens of evaluations at once.
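The structure above can be captured as plain data, kept separate from environment setup. The sketch below is one possible layout, not a DeepEval or TestQuality format: Scenario and missing_requirements are hypothetical names, and the environment keys are examples.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    workflow: str  # group by workflow, not by tool
    turns: list    # (user prompt, expected milestone) pairs, in order
    environment: dict = field(default_factory=dict)  # documented assumptions


def missing_requirements(scenario, available):
    """Return the environment assumptions the current run does not satisfy."""
    return sorted(
        key
        for key, expected in scenario.environment.items()
        if available.get(key) != expected
    )


cart_flow = Scenario(
    workflow="cart",
    turns=[
        ("Show me the highest-rated running shoes", "search tool returns ranked products"),
        ("Add the first one to my cart", "agent asks for size and color"),
        ("Size 42, blue", "add-to-cart tool called with the selected product"),
        ("What is in my cart?", "cart contains the selected product"),
    ],
    environment={"catalog_seed": "v1", "vector_store": "running"},
)

# A run against the wrong seed data should be skipped, not reported as a defect.
print(missing_requirements(cart_flow, {"catalog_seed": "v2"}))
```

Checking the assumptions before the run keeps environment mismatches out of the pass/fail history.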

What is the full pipeline from DeepEval to TestQuality?

The complete pipeline runs from test case definition in DeepEval through JUnit XML export via pytest, then into TestQuality via the CLI for governed run tracking. Each layer handles a distinct responsibility (evaluation logic, artifact generation, and operational visibility) and none of the three substitutes for the others.

Here is how the layers connect:

Layer 1 — Ideate and define (TestStory.ai + your IDE). Before writing a single assertion, you need well-formed test scenarios. If you are working in an agentic coding environment like Cursor, Claude Code, or VS Code with Copilot, TestStory.ai integrates directly as an MCP-compatible tool.

Feed it a user story, Jira issue, or process diagram describing the MCP workflow you want to cover, and it generates structured test cases that sync directly into TestQuality.

Figure: Test cases generated by TestStory.ai for a PR, with the TestStory.ai button that transfers them to TestQuality.

Those cases become your source of truth for what the evaluation suite should prove.

Layer 2: Quantify and validate (DeepEval via pytest). Package your scenarios into formal LLMTestCase objects using DeepEval metrics (Hallucination score, Faithfulness, Contextual Relevance, G-Eval criteria) and run them through your MCP server and agent. Because DeepEval integrates natively with pytest, exporting results requires a single flag:

# Run the evaluation suite and export JUnit XML
pytest test_mcp_agents.py --junitxml=deepeval_results.xml

Layer 3 — Log and audit (TestQuality CLI). Once the JUnit XML artifact exists, the TestQuality CLI pushes it into a named project and cycle:

testquality upload_test_run 'deepeval_results.xml' \
  --project_name="AI_Agent_Core" \
  --plan_name="MCP_Regression"

TestQuality parses the XML, maps results to existing test cases or creates new ones, and records execution metrics in your run history. Stakeholders get pass/fail trends and reliability metrics over time, without manual updates.
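It can help to inspect the JUnit XML yourself before uploading, for example to fail fast in CI or to confirm the artifact is not empty. The sketch below uses only the Python standard library. The SAMPLE document is hand-written for illustration, not real DeepEval output, and the test names in it are hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Count passed, failed, and skipped test cases in a JUnit XML document."""
    root = ET.fromstring(xml_text)
    summary = {"passed": 0, "failed": 0, "skipped": 0}
    failed_names = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            summary["failed"] += 1
            failed_names.append(case.get("name"))
        elif case.find("skipped") is not None:
            summary["skipped"] += 1
        else:
            summary["passed"] += 1
    return summary, failed_names


# Illustrative input only.
SAMPLE = """<testsuites>
  <testsuite name="pytest" tests="3">
    <testcase classname="test_mcp_agents" name="test_search_flow"/>
    <testcase classname="test_mcp_agents" name="test_cart_flow">
      <failure message="metric below threshold">details</failure>
    </testcase>
    <testcase classname="test_mcp_agents" name="test_order_status"><skipped/></testcase>
  </testsuite>
</testsuites>"""

summary, failed_names = summarize_junit(SAMPLE)
print(summary, failed_names)
```

In a pipeline you would read deepeval_results.xml from disk and pass its contents to summarize_junit instead of using the sample string.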

The full loop looks like this:

[ IDE + TestStory.ai ]
  (Define MCP user journeys → sync test cases to TestQuality)
        ↓
[ DeepEval Test Suite (pytest) ]
  (Evaluate Hallucination, Faithfulness, Contextual Relevance, G-Eval)
        ↓
[ JUnit XML artifact: deepeval_results.xml ]
        ↓
[ testquality upload_test_run CLI ]
        ↓
[ TestQuality Dashboard ]
  (Run history, trend analysis, defect linkage to GitHub and Jira)

The CLI is the connector that makes the integration operational. Describing TestQuality as something that "plugs into your pipeline" without naming the CLI step is how you end up with a broken assumption in your CI/CD configuration.

What are the most common mistakes that make MCP server tests unreliable?

The most common mistakes are testing tools in isolation, ignoring conversation history, and assuming a correct final answer proves a correct tool sequence. MCP tests become unreliable when they skip the LLM decision layer or treat multi-turn behavior like a stateless API exchange.

Watch for these specific problems:

Only verifying final text. A plausible answer can hide wrong tool use or stale context. The response looks fine; the tool chain was broken.
No multi-turn coverage. Many MCP failures appear only after a follow-up request. Single-turn tests miss the state continuity problem entirely.
Ignoring session state. Cart, order, and memory-dependent flows require continuity checks. If you reset context between assertions, you are not testing the real user path.
Overfitting to one phrasing. Natural language varies. Your eval should focus on outcome correctness, not lexical matching.
Mixing environment issues with product issues. Local model instability, memory pressure, or transport failures can create noisy results that look like product defects.
No result traceability. Without stored runs and comparable histories, regressions are harder to confirm and harder to explain to stakeholders.
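One of the problems above, mixing environment issues with product issues, can be reduced by classifying failures at the point where a scenario is executed. The sketch below assumes environment problems surface as built-in exceptions such as ConnectionError or TimeoutError and that product failures surface as AssertionError. Your client library may raise different exception types, so treat the tuple as a starting point.

```python
# Exception types treated as environment noise (an assumption; adjust to your stack).
ENVIRONMENT_ERRORS = (ConnectionError, TimeoutError, MemoryError)


def classify_run(scenario):
    """Run a scenario callable and label the outcome.

    Returns (label, detail) where label is "passed", "environment", or "product".
    """
    try:
        scenario()
    except ENVIRONMENT_ERRORS as exc:
        return "environment", f"{type(exc).__name__}: {exc}"
    except AssertionError as exc:
        return "product", str(exc)
    return "passed", ""


# Hypothetical scenarios that fail in two different ways.
def transport_down():
    raise ConnectionError("MCP client could not reach server")


def wrong_item():
    raise AssertionError("cart item does not match selection")


print(classify_run(transport_down))
print(classify_run(wrong_item))
print(classify_run(lambda: None))
```

Only the "product" label should count against the build; "environment" outcomes are candidates for a rerun.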

Resource planning matters too. Running a local model for both application behavior and evaluation increases compute and memory demand significantly. LLM-as-a-judge is not free just because you write fewer assertions.

How do you know whether an MCP evaluation result is actually trustworthy?

An MCP evaluation result is trustworthy when the scenario is grounded in a real workflow, the environment is stable, the metric reflects the user goal, and failures are inspectable at the turn and tool level. Trust comes from repeatability and diagnosis — not from a single aggregate score.

Use this checklist before treating any result as release-confidence data:

Does the scenario reflect a real task? A search-to-cart flow is better than a synthetic prompt with no business value.
Can you inspect intermediate outputs? You should be able to review what happened at each individual turn.
Is the model setup fixed? Changing judge models or local model versions can shift evaluation outcomes without any product change.
Can you rerun the same test? One success is not enough for release confidence. Consistent results across repeated runs are what qualify an evaluation as a regression asset.
Do metrics align with user outcomes? A completed order flow matters more than stylistic response quality.
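The rerun item in the checklist above can be turned into a number: run the same scenario several times and look at the pass rate. The sketch below assumes a scenario can be wrapped in a zero-argument callable that returns True or False. The helper names, the repeat count, and the minimum rate are hypothetical defaults, not recommendations from DeepEval.

```python
def pass_rate(run_once, repeats=5):
    """Run the same evaluation several times and return the share that passed."""
    results = [bool(run_once()) for _ in range(repeats)]
    return sum(results) / repeats


def is_regression_asset(run_once, repeats=5, min_rate=1.0):
    """Treat a scenario as release evidence only if it passes consistently."""
    return pass_rate(run_once, repeats) >= min_rate


def make_flaky(outcomes):
    """Build a fake scenario that replays a fixed list of outcomes (for the demo)."""
    iterator = iter(outcomes)
    return lambda: next(iterator)


stable_rate = pass_rate(lambda: True)
flaky_rate = pass_rate(make_flaky([True, False, True, True, True]))
print(stable_rate, flaky_rate)
```

A scenario that passes four times out of five is still useful for diagnosis, but it probably should not gate a release until the source of the variation is understood.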

The safest approach combines evaluation metrics with spot inspection of important runs, especially when onboarding a new server or changing prompt logic.

How can QA teams bring MCP evaluations into a normal release workflow?

QA teams can integrate MCP evaluations into a release workflow by treating them like other regression assets: define stable scenarios, run them repeatedly, record results, and compare changes across builds. The difference is that conversational evaluations need state visibility and prompt-aware analysis in addition to standard pass or fail reporting.

A straightforward operating model:

Define high-value MCP user journeys. Start with search, cart, and checkout flows — the paths that matter most to users and break most visibly.
Use TestStory.ai to convert those journeys into structured test cases. Feed a user story, Jira issue, or process diagram into TestStory.ai from your IDE or directly from app.teststory.ai. Cases sync automatically into TestQuality.
Package scenarios into DeepEval LLMTestCase objects. Apply Hallucination, Faithfulness, Contextual Relevance, or G-Eval metrics based on what matters for each flow.
Run evaluations via pytest and export JUnit XML.
Upload results to TestQuality via the CLI using testquality upload_test_run.
Review regressions before release. TestQuality's run history and trend data show whether AI model behavior has shifted across builds. Defects confirmed by a tester link back to GitHub issues or Jira tickets automatically through TestQuality's native integrations.
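The run and upload steps above can be assembled in a small script so that CI uses the same file name for both. The sketch below only builds the two command lines, using the commands and flags already shown in this article, and does not execute them. The function name is hypothetical.

```python
import shlex


def build_pipeline_commands(test_file, xml_path, project_name, plan_name):
    """Return the evaluation and upload commands as argument lists."""
    run_eval = ["pytest", test_file, f"--junitxml={xml_path}"]
    upload = [
        "testquality",
        "upload_test_run",
        xml_path,
        f"--project_name={project_name}",
        f"--plan_name={plan_name}",
    ]
    return [run_eval, upload]


commands = build_pipeline_commands(
    test_file="test_mcp_agents.py",
    xml_path="deepeval_results.xml",
    project_name="AI_Agent_Core",
    plan_name="MCP_Regression",
)
for command in commands:
    print(shlex.join(command))
```

If you execute these with subprocess, keep in mind that pytest exits with a non-zero status when tests fail, so the upload step should run regardless of the first command's exit code; otherwise failed runs would never reach TestQuality.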

Teams using TestQuality typically keep AI workflow test cases alongside other manual and automated assets, using runs and reports to track release readiness across the full stack — not just the deterministic automation layer.

For further detail on the CLI upload workflow and project structure, the TestQuality documentation covers the current setup.

The point is not to replace protocol tests or UI tests. It is to fill the gap between them — the conversational layer where most agentic product failures actually live.

Technical Deep Dive FAQ

Key Takeaways

What matters most when testing MCP servers

Good MCP testing follows the user journey — not just the protocol contract.

System view: An MCP server must be tested together with the LLM client and the model that invokes it — not as an isolated service.

Workflow focus: Multi-turn scenarios expose state continuity failures that single-turn assertions and isolated tool checks miss entirely.

DeepEval's role: Conversational test cases plus LLM-as-a-judge metrics reduce brittle assertion code and score outcomes against what actually matters — task completion.

Pipeline: DeepEval exports JUnit XML via pytest → testquality upload_test_run CLI pushes to TestQuality → run history, trends, and defect linkage are automatic.

Operational fit: Treat MCP evaluation scenarios as governed release assets — stored in TestQuality alongside manual and automated tests, not scattered across CI logs.

If the model can complete the task over several turns without losing context, the server is doing useful work. If not, the protocol contract alone will not save you.

About the Author

Jose Amoros is part of the TestQuality marketing team, focused on agentic QA, AI-powered test management, and the operational handoff between AI-generated test artifacts and governed execution workflows. He writes regularly about CI/CD integration, Gherkin/BDD practices, and shift-left testing.