What is Gemma 4 QAT?

Gemma 4 QAT is a quantization-aware version of Google's Gemma 4 model family intended to lower memory usage and storage cost while preserving practical performance for local inference. Its appeal for coding agents is that larger models become easier to run on local hardware, which is relevant for offline workflows, local RAG development, and teams trying to reduce cloud token costs. The main question is not whether it runs locally, but whether its outputs stay accurate enough for non-trivial engineering work.

What is Qwen 3.6 in this comparison?

Qwen 3.6 is the alternative local model used in the same VS Code coding-agent workflow for a direct comparison. The specific variant was a 35B mixture-of-experts model with 6-bit quantization, chosen because its active parameter profile was close enough to the Gemma 4 QAT variant to make the comparison meaningful. In practice, it produced stronger code for the given DeepEval task — particularly around conversational test logic and correct inclusion of the evaluation judge model.

Why did Gemma 4 QAT fail the coding-agent task?

The failure was not that it produced no code — it produced code that looked plausible but missed critical evaluation wiring. The metrics were initialized without being applied correctly, and the necessary judge model was not passed into the evaluation flow. For a DeepEval conversational RAG test, that makes the output structurally wrong. In QA terms, an artifact that looks finished while still being invalid is often more dangerous than an obviously incomplete draft because it can waste review time and create false confidence.

Did Gemma 4 QAT have a local performance advantage?

Yes — local efficiency was one of the clearer positives. In the observed run it consumed about 6.8 GB of memory while executing a 26B mixture-of-experts configuration, making it attractive for local development on capable laptops. The problem is that efficiency did not translate into better coding output for this task. If your workload is code generation for testing or agent workflows, output quality still has to be the first filter, with memory footprint as a secondary consideration.

Was token speed meaningfully different between the two models?

No. The observed token throughput was nearly identical — roughly 53 tokens per second for Gemma 4 QAT and 54 tokens per second for Qwen 3.6. Speed was not a meaningful differentiator in this setup. The more important distinction was whether the generated code was logically usable. For teams evaluating local coding agents, throughput numbers are secondary once interactivity is already acceptable.

Why is a multi-turn RAG test harder than a simple code-generation prompt?

A multi-turn RAG test forces the model to maintain conversational state, handle retrieval context, align with a framework's evaluation API, and produce executable test structure, all at once. That is much harder than generating standalone code from scratch. It is also more representative of production AI application testing, where the model must reason over existing implementation details rather than pattern-matching a common coding exercise. That combination exposes weaknesses that simpler benchmark-style tasks tend to hide.

How should a QA team validate AI-generated test code from a local model?

Start by reviewing whether the code matches the target framework's execution model, then run it inside the real project. Check fixtures, imports, library APIs, and whether evaluation objects are actually used rather than only declared. After that, move surviving artifacts into a governed workflow. TestQuality stores test cases, runs, and execution history in one place, so draft AI output becomes reviewable QA work — with defects linking back to GitHub or Jira automatically — rather than ad hoc editor output.

Is Gemma 4 12B still worth trying for local workflows?

It can be worth trying if your primary constraint is memory and you want a model that runs comfortably on a machine with around 16 GB of memory. The dense 12B variant was noted for strong inference speed, which is useful for interactive local work. The caveat is that speed does not automatically mean strong coding performance. If your workload centers on code generation for testing or agentic tasks, test it against your own real codebase before treating it as a default local coding model.

Why Gemma 4 QAT Struggles in Local Coding Agents


[Figure: Gemma 4 QAT and Qwen 3.6 compared on a local coding-agent task in VS Code, showing output quality differences for AI-generated test code]

Jose Amoros
June 8, 2026
11:05 pm


Gemma 4 QAT refers to Google's quantization-aware versions of Gemma 4, designed to reduce memory use and improve local inference speed on developer machines. In a head-to-head coding-agent task using VS Code and DeepEval, Gemma 4 QAT produced structurally incomplete test code: it initialized evaluation metrics without applying them correctly and omitted the required judge model. Qwen 3.6 35B delivered materially more usable output at nearly identical token throughput. This article explains how that comparison played out, why lower memory usage did not translate into better code quality, and where test management tools like TestQuality fit when you need to turn AI-generated test code into governed, repeatable test execution rather than one-off experiments inside an editor.

At a Glance

A practical look at local LLM coding performance

Lower memory use helped Gemma 4 QAT run locally, but it did not win on code correctness.

Focus: Real-world comparison of Gemma 4 QAT and Qwen 3.6 in a VS Code coding-agent workflow.

Task: Generate a multi-turn conversational RAG test in Python using Pytest and DeepEval.

Result: Qwen 3.6 produced materially stronger code structure and incorporated the evaluation model correctly.

Tradeoff: Token speed was similar — roughly 53 vs 54 tokens per second — but output quality was not.

QA angle: For AI-generated tests, generation speed matters less than executable, reviewable, traceable artifacts.

A local model that writes fast but writes wrong still creates rework — and rework is what QA teams usually feel first.

What is Gemma 4 QAT, and why does it matter for local coding agents?

Gemma 4 QAT is a compressed Gemma 4 variant built to use less memory and storage while improving local inference efficiency. That matters because coding agents often run inside constrained developer environments where VRAM, RAM, and token cost shape what is practical day to day.

Google's positioning for the newer Gemma 4 family centers on two improvements: a reduced memory footprint for local execution and faster on-device inference. For teams experimenting with local AI coding agents, those are not minor benefits. They determine whether a model can run at all on a laptop and whether it stays responsive enough to be useful.

In this comparison, the interesting part is not the marketing claim alone. It is what happens when a model that looks efficient on paper is asked to do non-trivial coding work in context. That means reading an existing codebase, understanding a RAG implementation, understanding how DeepEval expects conversational test cases to be written, and then producing a valid Pytest file.

That is a higher bar than toy prompts like building a simple game or making a one-page website. It tests whether a model can reason over existing structure, not just autocomplete isolated snippets.

How was the local model comparison set up?

The comparison used two locally run models inside a VS Code coding-agent workflow: Gemma 4 26B A4B QAT 8-bit and Qwen 3.6 35B A3B with 6-bit quantization. Both were selected because their active parameter profiles were close enough to make the comparison meaningful.

The environment used MLX-based local model execution on Apple Silicon, with the model connected into VS Code as the active language model for the coding agent. The core idea was simple: keep the task, prompt, and workflow consistent, and change only the model.

The task was intentionally practical. Rather than asking for synthetic benchmark-style output, the agent was instructed to inspect an actual backend RAG implementation, understand DeepEval's multi-turn conversational evaluation approach, and then create a Python Pytest file covering the required conversational metrics.

This kind of task better reflects how QA engineers and AI application teams actually use coding agents. They need code that aligns with real libraries, project-specific conventions, and evaluation logic. That is also consistent with a broader industry shift toward testing AI systems in production-like conditions rather than relying on narrow benchmark scores. For example, the Google Research discussion on LLM evaluation practices highlights how difficult it is to infer practical performance from simplified evaluations alone.

What was the coding task, exactly?

The task was to write a multi-turn conversational RAG test using DeepEval, grounded in the application's existing backend implementation and DeepEval's conversational evaluation patterns. The output needed to be a valid Pytest file, not a rough outline or pseudocode.

That matters because the assignment had several moving parts:

Read the existing RAG implementation in the application codebase.
Understand how DeepEval handles conversational and RAG-oriented metrics.
Create a multi-turn conversation test, not a single-turn retrieval test.
Use Python and Pytest.
Wire in the evaluation model correctly so the metrics can actually run.

In other words, the model was not being judged on style. It was being judged on whether it could assemble a functioning evaluation artifact. For QA teams, that distinction is crucial. Generated test code that looks polished but omits the judging model or leaves metrics unexecuted is worse than incomplete code because it creates false confidence.

Why is this kind of test a good benchmark for coding agents?

This task is a strong benchmark because it combines repository understanding, third-party library usage, multi-step reasoning, and test authoring. A model has to connect concepts across files and frameworks instead of guessing from a single isolated prompt.

Many coding model demos focus on greenfield generation. That is useful, but limited. Real engineering work is usually brownfield work. You are modifying an existing app, following existing patterns, and integrating with libraries that have opinions about object structure, fixtures, and execution flow.

Multi-turn RAG evaluation is especially revealing because it is easy for a model to produce code that appears plausible while quietly missing the pieces that make the evaluation meaningful. In this case, the model needed to go beyond declaring metrics and actually pass the right judge model and conversational state through the flow.

This is where QA and AI coding overlap. According to the official Pytest documentation, maintainable tests rely on clear fixture use, reusable structure, and executable assertions. AI-generated test code that ignores these patterns tends to create fragile suites that humans have to rewrite.

How did Gemma 4 QAT perform on the real coding task?

Gemma 4 QAT completed the task superficially but missed important implementation details that made the resulting test effectively unusable. The output looked like a conversational test, yet the core evaluation logic was incomplete and the metrics were not applied correctly.

On the surface, the output checked several boxes. It generated a Pytest-style file. It recognized that the task involved multi-turn interaction. It attempted to define conversational stages. That kind of first-pass plausibility is common with modern local models.

The issue was deeper. The generated code initialized the metrics but did not actually use them in a meaningful evaluation flow. It also omitted the necessary large language model judge input for those metrics. As a result, the code was not just imperfect. It was structurally wrong for the intended DeepEval use case.
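To make the failure mode concrete, here is a minimal sketch of the difference between declaring a judge-scored metric and actually applying it. Everything in it is a stand-in: `Turn`, `ConversationMetric`, and `stub_judge` are hypothetical classes and functions written for this illustration, not DeepEval's API, and the toy judge only counts answered user turns.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


# A judge maps a whole conversation to a score in [0, 1].
Judge = Callable[[list[Turn]], float]


class ConversationMetric:
    """Stand-in for a judge-scored conversational metric (hypothetical)."""

    def __init__(self, name: str, threshold: float, judge: Optional[Judge] = None):
        self.name = name
        self.threshold = threshold
        self.judge = judge

    def passes(self, turns: list[Turn]) -> bool:
        if self.judge is None:
            raise ValueError(f"{self.name}: no judge model configured")
        return self.judge(turns) >= self.threshold


def stub_judge(turns: list[Turn]) -> float:
    """Toy judge: fraction of user turns directly followed by an assistant reply."""
    user_idx = [i for i, t in enumerate(turns) if t.role == "user"]
    answered = [
        i for i in user_idx
        if i + 1 < len(turns) and turns[i + 1].role == "assistant"
    ]
    return len(answered) / len(user_idx) if user_idx else 0.0


conversation = [
    Turn("user", "What is the refund window?"),
    Turn("assistant", "Thirty days from delivery."),
    Turn("user", "Does that apply to sale items?"),
    Turn("assistant", "Yes, sale items follow the same window."),
]


def test_broken_pattern_never_evaluates():
    # Metric is declared without a judge and never applied: nothing is scored.
    ConversationMetric("completeness", threshold=0.8)
    assert len(conversation) == 4  # passes regardless of answer quality


def test_wired_pattern_evaluates():
    # Metric receives a judge and is actually applied to the conversation.
    metric = ConversationMetric("completeness", threshold=0.8, judge=stub_judge)
    assert metric.passes(conversation)
```

Both tests go green under pytest, which is the point: the first one passes without evaluating anything, so a green run alone does not show that the metrics ran.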

That distinction matters. A coding agent can be helpful if it gets you 80 percent of the way there and leaves obvious cleanup. It is much less helpful when it produces code that appears complete but contains conceptual gaps that a less experienced engineer might miss.

One upside did show up. Gemma 4 QAT ran with modest memory consumption for its class, reportedly using about 6.8 GB of memory in this setup while handling a 26B mixture-of-experts model with only a subset of parameters active per token. So the efficiency goal looked real. The coding result simply did not match it.

How did Qwen 3.6 perform differently?

Qwen 3.6 produced a materially stronger implementation by structuring retrieval context, handling conversational turns more coherently, and including the judge model needed for evaluation. It was not guaranteed to run cleanly on the first execution, but it was much closer to usable engineering output.

The stronger result showed up in several ways:

It built retrieval context before executing the conversation flow.
It organized the turns with more realistic scenario progression.
It included session handling and conversational continuity.
It passed a model for judging the metrics, which was the critical missing piece in the Gemma output.

There were still quirks. A session fixture appeared to be recreated instead of reused from existing project code. That suggests the output could still require human cleanup. But that is normal. The important point is that the result looked directionally correct and much easier to repair than the Gemma 4 QAT version.

For a coding agent, that is often the threshold that matters. Engineers do not need perfection on the first pass. They need a draft that is logically grounded enough to iterate on quickly.


Was inference speed the deciding factor?

No. In this comparison, token generation speed was nearly identical, so speed alone did not explain the outcome. The real difference was code quality, specifically whether the generated test logic matched the intended evaluation workflow.

Gemma 4 QAT reportedly ran at about 53 tokens per second, while Qwen 3.6 ran at about 54 tokens per second in the same general setup. That is close enough that most engineers would treat the speed as functionally equivalent for this task.
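A quick calculation shows how small that gap is in practice. The throughput figures are the ones reported above; the 1,500-token file size is an assumption chosen only for illustration.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to emit `tokens` at a steady decode rate."""
    return tokens / tokens_per_second


gemma_tps = 53.0    # reported for Gemma 4 QAT
qwen_tps = 54.0     # reported for Qwen 3.6
file_tokens = 1500  # assumed size of the generated test file

gemma_s = seconds_to_generate(file_tokens, gemma_tps)
qwen_s = seconds_to_generate(file_tokens, qwen_tps)
relative_gap = (qwen_tps - gemma_tps) / gemma_tps

print(f"Gemma 4 QAT: {gemma_s:.1f} s")
print(f"Qwen 3.6:    {qwen_s:.1f} s")
print(f"Throughput gap: {relative_gap:.1%}")
```

Under that assumption the two models finish about half a second apart, a throughput difference of roughly 1.9 percent.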

When throughput is similar, correctness becomes the deciding metric. That is especially true in testing work. If a model generates a wrong test faster, it has not saved time. It has only shifted debugging effort downstream.

This is a recurring lesson with AI tooling in QA. Raw speed is easy to advertise. Useful artifacts are harder to produce. Industry-wide, that pattern shows up in developer tooling discussions as well. The Stack Overflow Developer Survey 2024 shows widespread use of AI tools, but also persistent concerns around trust, accuracy, and the need for human verification.

Does that mean Gemma 4 is bad for local use?

No. It means Gemma 4 QAT underperformed on this specific coding-agent task. The local efficiency gains still looked promising, and the 12B dense variant was noted as having especially good inference speed, even if code quality did not stand out.

That is an important nuance. A model can be attractive for one workload and weak for another. The observations from this comparison suggest at least three separate judgments:

Gemma 4 QAT: efficient local execution, but disappointing coding output in this scenario.
Gemma 4 12B: fast inference, especially appealing on lower-memory laptops, but not compelling enough for coding quality.
Qwen 3.6: stronger choice for coding-agent tasks, at least in this head-to-head local workflow.

So the right takeaway is not "never use Gemma 4." It is "do not assume compression and local efficiency automatically translate into better agentic coding performance."

What should QA and AI teams learn from this comparison?

QA and AI teams should evaluate local models on real tasks that end in executable test artifacts, not benchmark scores or polished demos. The key measure is whether the model reduces review and repair time once the code lands in your actual workflow.

There are a few practical lessons here.

Use repository-aware tasks:

If a model cannot inspect your implementation and align with it, the result will often be generic. Generic test code is rarely enough for AI application testing.

Prefer correctness over superficial completion:

A finished-looking file is not the same as a working test. In AI evaluation work, small omissions like a missing judge model can invalidate the whole artifact.

Measure repair cost:

The real question is not "Did it generate code?" It is "How much human effort is required to make this trustworthy?"

Keep generated artifacts under test management:

Once you start producing AI-assisted test assets, you need a way to review, organize, execute, and report on them. One way to handle that in TestQuality is to store manual and automated test cases in the same governed workspace, connect them to runs, and keep the execution history visible over time. See the TestQuality docs for current workflow details.

How can you evaluate local coding models more fairly in your own environment?

A fair evaluation uses the same prompt, the same repository, the same editor integration, and a task with objectively checkable output. If you change multiple variables at once, you usually end up comparing setup differences rather than model capability.

A practical evaluation checklist looks like this:

Pick one real task. Use a task that touches your actual codebase.
Keep the prompt constant. Do not tune one model more aggressively than the other.
Use the same agent path. Same editor, same extension, same local runtime.
Inspect the output manually. Check for omitted dependencies, fake APIs, and missing execution logic.
Run the code. Even strong-looking output may break on import or fixture resolution.
Track revision time. The better model is often the one needing fewer correction loops.
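The checklist above can be captured in a small record-keeping script so that comparisons stay honest. This is a minimal sketch: `ModelRun`, `prompt_fingerprint`, and `compare` are hypothetical names, and the two sample runs use placeholder model names and made-up numbers rather than measurements from this article.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class ModelRun:
    model: str
    prompt: str
    tests_executed: bool    # did the generated file actually run?
    correction_loops: int   # follow-up prompts or manual fix rounds
    repair_minutes: float   # human time spent making it trustworthy


def prompt_fingerprint(prompt: str) -> str:
    """Short hash used to confirm every run received the same prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def compare(runs: list[ModelRun]) -> list[ModelRun]:
    """Rank runs by repair effort, refusing to compare different prompts."""
    fingerprints = {prompt_fingerprint(r.prompt) for r in runs}
    if len(fingerprints) > 1:
        raise ValueError("prompts differ between runs; comparison is not fair")
    # Executed output ranks first, then fewer correction loops, then less time.
    return sorted(
        runs,
        key=lambda r: (not r.tests_executed, r.correction_loops, r.repair_minutes),
    )


PROMPT = "Write a multi-turn conversational RAG test with Pytest and DeepEval."

# Illustrative values only, not measured results.
runs = [
    ModelRun("model-a", PROMPT, tests_executed=False, correction_loops=4, repair_minutes=45.0),
    ModelRun("model-b", PROMPT, tests_executed=True, correction_loops=1, repair_minutes=10.0),
]

ranked = compare(runs)
print([r.model for r in ranked])  # prints ['model-b', 'model-a']
```

Ranking on execution first and repair effort second mirrors the article's argument: token speed does not appear in the sort key at all.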

If your team formalizes this process, a test management platform becomes useful quickly. You can treat model-generated test cases as draft assets, route them through review, and then execute them in controlled runs instead of treating AI output as disposable snippets.

What common mistakes should you avoid when comparing local coding LLMs?

The biggest mistake is confusing plausible output with correct output. Other common errors include using toy prompts, changing setup variables between runs, and ignoring whether the generated artifact actually executes within your project.

Do not compare on toy examples only. A model that builds a simple game may still fail on your production test harness.
Do not overvalue token speed. Similar speed can hide large quality differences.
Do not skip library-specific validation. Frameworks like DeepEval, Pytest, and RAG tooling have structural expectations.
Do not assume quantization preserves coding quality. Compression can help fit and speed while still changing output behavior.
Do not leave AI-generated tests unmanaged. Untracked artifacts are difficult to review and easy to lose.
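The point about quantization changing behavior can be seen in a toy example. The sketch below applies plain symmetric round-to-nearest quantization to a handful of made-up weights and measures the round-trip error at different bit widths. It is not how QAT works internally: quantization-aware training exposes the model to this kind of rounding during training so it can adapt, whereas this snippet only shows the rounding itself. The function names and weight values are assumptions for illustration.

```python
def quantize_roundtrip(values: list[float], bits: int) -> list[float]:
    """Symmetric round-to-nearest quantization to `bits` bits, then dequantize."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 31 for 6-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def max_error(values: list[float], bits: int) -> float:
    """Largest absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.91, -0.42, 0.073, -0.0051, 0.33, -0.88]  # made-up values

for bits in (8, 6, 4):
    print(f"{bits}-bit max round-trip error: {max_error(weights, bits):.5f}")
```

Fewer bits mean a coarser grid and larger per-weight error. Whether that error matters for a given coding task is an empirical question, which is why running the model on your own repository is the only reliable check.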

Technical Deep Dive FAQ

Key Takeaways

What this local LLM comparison actually showed

Efficient local inference is useful, but usable code is the real outcome metric.

Gemma 4 QAT's strength: Reduced memory footprint and good local responsiveness — about 6.8 GB for a 26B MoE model.

Gemma 4 QAT's weakness: Generated code that looked valid but missed essential DeepEval evaluation logic and the required judge model.

Qwen 3.6 advantage: Better structured output, stronger conversational test flow, and correct inclusion of the judging model.

Evaluation lesson: Similar token speed — 53 vs 54 tokens per second — does not mean similar engineering usefulness.

QA implication: Judge local coding models by executable test artifacts and review effort — not demos or benchmark claims.

The best local coding model is usually the one that leaves your team with less cleanup — not the one with the prettiest first draft.

About the Author

Jose Amoros is part of the TestQuality marketing team, focused on agentic QA, AI-powered test management, and the operational handoff between AI-generated test artifacts and governed execution workflows. He writes regularly about CI/CD integration, Gherkin/BDD practices, and shift-left testing.