The noise problem
AI code review tools are proliferating. Most of them follow the same pattern: feed a diff to an LLM, ask it to find problems, post comments on the pull request. The result is typically a wall of low-value observations — style nitpicks, obvious comments, and false positives — that developers quickly learn to ignore.
The problem isn't that LLMs can't find real issues. They can. The problem is that without careful filtering and prioritization, the signal-to-noise ratio is too low, and developers stop reading.
An AI code review tool that generates 20 comments per PR, of which 2 are useful, is worse than one that generates 3 comments, all of which are useful. Developers will read 3 comments. They won't read 20.
What LLMs are actually good at in code review
Based on real-world deployments, LLMs add the most value in these areas:
Bug detection in business logic
LLMs are surprisingly good at catching logical errors that static analysis misses — things like incorrect boundary conditions, off-by-one errors in complex loops, and race conditions in concurrent code. These are the bugs that experienced reviewers catch through pattern recognition, and LLMs have enough training data to replicate some of that intuition.
Security vulnerability identification
Common vulnerability patterns (SQL injection, XSS, insecure deserialization) are well-represented in training data. LLMs catch these reliably — often more consistently than human reviewers who might be focused on functionality rather than security.
Documentation and naming quality
LLMs can identify functions with misleading names, missing docstrings on public APIs, and comments that contradict the code. This is high-value feedback that's easy for humans to overlook.
Cross-file impact analysis
Given enough context (the diff plus relevant surrounding files), LLMs can flag changes that might break callers in other files — a type of review that requires understanding relationships across the codebase.
Building a pipeline that works
Step 1 — Filter before sending to the LLM
Not every file in a PR needs AI review. Generated files, lock files, test fixtures, and large binary diffs should be excluded. Similarly, small changes to well-tested utility functions may not warrant the inference cost.
A simple filter based on file type, change size, and file path reduces both cost and noise.
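Such a filter can be sketched as a single predicate. The skip patterns and size thresholds below are illustrative assumptions, not prescriptive values — tune them to your repository:

```python
import fnmatch

# Hypothetical skip list — adjust to your codebase.
SKIP_PATTERNS = [
    "*.lock", "package-lock.json", "*.min.js",  # lock files, minified bundles
    "*_pb2.py", "*.generated.*",                # generated code
    "tests/fixtures/*", "*.snap",               # test fixtures and snapshots
]

MAX_DIFF_LINES = 1500  # huge diffs: review quality degrades and cost explodes
MIN_DIFF_LINES = 3     # tiny changes rarely justify the inference cost

def should_review(path: str, changed_lines: int) -> bool:
    """Decide whether a changed file is worth sending to the LLM."""
    if any(fnmatch.fnmatch(path, pat) for pat in SKIP_PATTERNS):
        return False
    return MIN_DIFF_LINES <= changed_lines <= MAX_DIFF_LINES
```

Note that `fnmatch`'s `*` also matches path separators, which is what makes the `tests/fixtures/*` pattern cover nested paths.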
Step 2 — Provide rich context
The quality of AI review scales directly with the context you provide. At minimum, include the diff plus the full files being modified. For best results, also include relevant test files, the function signatures of callers, and any relevant configuration or schema files.
Step 3 — Use structured prompts with severity levels
Instead of asking the model to "review this code," provide specific review dimensions and ask it to classify findings by severity. A well-structured prompt produces output like: severity (critical/major/minor), category (bug/security/performance/style), specific location (file + line), and a concrete explanation of the issue and suggested fix.
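A sketch of what such a prompt and its parser might look like. The field names and the JSON-array output convention are assumptions for illustration, not a fixed schema:

```python
import json

# Hypothetical prompt template — severity scale and fields are illustrative.
REVIEW_PROMPT = """\
Review the code change below. Report findings ONLY as a JSON array; each
finding must have these fields:
  severity: "critical" | "major" | "minor"
  category: "bug" | "security" | "performance" | "style"
  file: path of the affected file
  line: affected line number
  explanation: what is wrong and why it matters
  suggested_fix: a concrete change that resolves the issue
Do not report style issues a linter would catch. If there are no findings,
return [].

{context}
"""

def parse_findings(raw_response: str) -> list[dict]:
    """Extract the JSON array from the model's reply, tolerating extra prose."""
    start, end = raw_response.find("["), raw_response.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        return json.loads(raw_response[start:end + 1])
    except json.JSONDecodeError:
        return []
```

Forcing structured output is what makes the next step — programmatic filtering — possible at all.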
Step 4 — Filter the output aggressively
Post-generation filtering is where you turn noise into signal. Common filters include suppressing style comments that duplicate existing linter rules, removing findings with low confidence scores, deduplicating similar findings, and suppressing findings on lines that were not modified in the current diff.
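Those filters compose naturally into one pass over the model's findings. A sketch, assuming the finding dictionaries from the previous step; the `confidence` field is an assumption — drop that check if your model doesn't emit one:

```python
def filter_findings(findings: list[dict],
                    changed_lines_by_file: dict[str, set[int]],
                    min_confidence: float = 0.6) -> list[dict]:
    """Keep only high-signal findings.

    changed_lines_by_file maps file path -> line numbers modified in the diff.
    """
    kept, seen = [], set()
    for f in findings:
        # 1. Style comments duplicate the linter — suppress them.
        if f.get("category") == "style":
            continue
        # 2. Low-confidence findings are disproportionately false positives.
        if f.get("confidence", 1.0) < min_confidence:
            continue
        # 3. Only comment on lines actually modified in this diff.
        if f.get("line") not in changed_lines_by_file.get(f.get("file"), set()):
            continue
        # 4. Deduplicate near-identical findings (same file, line, category).
        key = (f.get("file"), f.get("line"), f.get("category"))
        if key in seen:
            continue
        seen.add(key)
        kept.append(f)
    return kept
```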
Step 5 — Track feedback and iterate
Let developers mark AI comments as helpful or unhelpful. Use this feedback to tune your prompts, adjust your filters, and identify patterns in false positives. Over time, the system gets better because you're calibrating it to your team's standards.
The trust equation
Developer trust is the scarcest resource in AI code review. Trust is built slowly and lost instantly. A few principles:
- High precision beats high recall. It's better to miss some issues than to flood PRs with false positives.
- Never block merges on AI review. AI comments should be advisory, not blocking. Blocking PRs on AI findings destroys developer goodwill.
- Be transparent about limitations. Show the model's confidence when available, and make it clear that the review is AI-generated.
Metrics to track
- Action rate — What percentage of AI comments lead to a code change? Target: >30%.
- Dismiss rate — What percentage are dismissed as unhelpful? Target: <40%.
- True positive rate on critical findings — When the AI flags a critical issue, how often is it real? Target: >80%.
- Developer satisfaction — Periodic surveys on whether the tool is helpful or annoying.
If your action rate is below 20%, the tool is generating too much noise and developers are ignoring it. If it's above 50%, the tool is probably too conservative and missing findings. The sweet spot is a tool that posts 2–4 comments per PR, most of which are genuinely useful.
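The first three metrics fall directly out of logged comment outcomes. A sketch, assuming each comment record carries `acted_on`, `dismissed`, `severity`, and (for critical findings) `confirmed` flags — these field names are illustrative:

```python
def review_metrics(comments: list[dict]) -> dict[str, float]:
    """Compute action rate, dismiss rate, and critical true-positive rate."""
    n = len(comments)
    criticals = [c for c in comments if c.get("severity") == "critical"]
    return {
        "action_rate": sum(c["acted_on"] for c in comments) / n,
        "dismiss_rate": sum(c["dismissed"] for c in comments) / n,
        "critical_tp_rate": (
            sum(c["confirmed"] for c in criticals) / len(criticals)
            if criticals else float("nan")  # no criticals yet: undefined
        ),
    }
```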