How to Build an AI Code Review Pipeline: Catch Bugs Before They Ship

Tutorials · By Ivern AI Team · 10 min read

TL;DR: Build a 3-agent code review pipeline with Ivern AI that runs Bug Hunter, Security Scanner, and Style Enforcer on every pull request. Each review completes in under 60 seconds and costs $0.03-0.10 per PR. This guide covers agent configuration, system prompts, a real Python PR walkthrough, and a cost comparison to GitHub Copilot Code Review and CodeRabbit.

The average pull request sits in review for 1-2 days. Reviewers miss roughly 40% of bugs because they are reviewing unfamiliar code, juggling multiple PRs, or rushing to unblock a release. Small teams often skip reviews entirely because there is nobody available to review.

An AI code review pipeline fixes this by running consistent, fast reviews on every PR -- no scheduling required. Instead of relying on a single AI tool that tries to do everything, a multi-agent pipeline assigns specialized agents to each category of review. The result: higher catch rates, fewer false positives, and reviews that finish before your coffee gets cold.

This guide walks through building one with Ivern AI.

Related: AI Agent Code Review Automation · AI Agent Bug Fixing Workflow · How to Coordinate Multiple AI Coding Agents · Claude Code vs Cursor Comparison · AI Coding Assistant Complete Guide · Compare All Tools

Why Multi-Agent Review Beats Single-Tool Approaches

A single AI model reviewing your code works okay for simple checks. But it struggles with depth. One model trying to catch bugs, security vulnerabilities, and style violations in a single pass produces shallow feedback across all three categories.

Multi-agent review solves this by giving each agent a narrow focus and the right model for the job:

| Approach | Catch Rate | False Positive Rate | Avg. Review Time | Cost/PR |
|---|---|---|---|---|
| No review | 0% | 0% | 0s | $0 |
| Single-agent AI | ~60% | 25-30% | 20s | $0.02-0.05 |
| Multi-agent pipeline | ~85-95% | 8-12% | 45-60s | $0.03-0.10 |
| Human review only | ~60-70% | 5-10% | 1-2 days | $50-200 |

The multi-agent pipeline catches 25-35% more issues than a single agent while holding false positives to roughly 8-12%. It does this by running agents in parallel and merging their findings into one consolidated report.
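The fan-out/merge pattern is easy to sketch. The snippet below is illustrative Python, not Ivern AI's actual API -- the stub agents stand in for real model calls:

```python
import asyncio

# Stub agents: in a real pipeline each would call a model API with
# its own system prompt. Here they return canned findings instantly.
async def bug_hunter(diff):
    return [{"agent": "bug-hunter", "severity": "MEDIUM", "msg": "missing error handling"}]

async def security_scanner(diff):
    return [{"agent": "security", "severity": "CRITICAL", "msg": "hardcoded secret"}]

async def style_enforcer(diff):
    return [{"agent": "style", "severity": "LOW", "msg": "missing docstring"}]

async def review(diff):
    # Run all three agents concurrently; wall-clock time is the slowest agent.
    results = await asyncio.gather(
        bug_hunter(diff), security_scanner(diff), style_enforcer(diff)
    )
    # Flatten the per-agent lists into one consolidated report.
    return [finding for agent_findings in results for finding in agent_findings]

report = asyncio.run(review("fake diff"))
```

Because the agents do not depend on each other's output, `asyncio.gather` (or any parallel job runner) is all the orchestration the fan-out step needs.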

The 3-Agent Review Pipeline

The pipeline uses three specialized agents. Each agent gets a different system prompt, a different model, and a different review focus. They run in parallel and Ivern AI merges the results.

Agent 1: Bug Hunter

Model: Claude Sonnet 4 (high reasoning accuracy)
Purpose: Detects logic errors, off-by-one bugs, null pointer risks, race conditions, and incorrect error handling.

System prompt:

You are a senior software engineer performing a bug-focused code review.
Analyze the diff for logic errors, incorrect control flow, unhandled
edge cases, null/undefined access, race conditions, and resource leaks.
For each finding, provide:
1. File and line number
2. Severity: CRITICAL, HIGH, MEDIUM, LOW
3. Description of the bug
4. Suggested fix (code snippet)

Ignore style issues and security vulnerabilities -- other agents handle those.
Focus only on correctness bugs.

Agent 2: Security Scanner

Model: Claude Sonnet 4 (strong at pattern-based vulnerability detection)
Purpose: Finds SQL injection, XSS, hardcoded secrets, insecure dependencies, auth bypasses, and data exposure risks.

System prompt:

You are an application security engineer reviewing a pull request.
Check for: SQL injection, XSS, CSRF, hardcoded secrets/credentials,
insecure deserialization, path traversal, auth bypass, and data
exposure. Also flag any new dependencies with known CVEs.
For each finding:
1. File and line number
2. Severity: CRITICAL, HIGH, MEDIUM, LOW
3. Vulnerability type (OWASP category)
4. Remediation steps

Do not comment on style or general bugs. Focus on security only.

Agent 3: Style Enforcer

Model: Claude Haiku 4 (fast and cheap, sufficient for style checks)
Purpose: Checks naming conventions, code organization, documentation, test coverage, and adherence to project style guides.

System prompt:

You are a code quality reviewer enforcing project style standards.
Check for: naming convention violations, missing docstrings on public
functions, inconsistent formatting, overly complex functions (high
cyclomatic complexity), missing tests for new code, and violations of
the project's linting rules.
For each finding:
1. File and line number
2. Severity: MEDIUM or LOW only
3. Rule violated
4. Suggested improvement

Do not report bugs or security issues. Focus on style and maintainability.

Why These Models

Claude Sonnet 4 handles the heavy reasoning for bug detection and security analysis. Claude Haiku 4 is fast and cheap for style checks, where the logic is simpler. With Ivern AI's BYOK (Bring Your Own Key) model, you plug in your own API keys and pay only for what you use -- no per-seat markup.

| Agent | Model | Input Tokens | Output Tokens | Cost/Review |
|---|---|---|---|---|
| Bug Hunter | Claude Sonnet 4 | ~8K | ~1.5K | ~$0.04 |
| Security Scanner | Claude Sonnet 4 | ~8K | ~1K | ~$0.03 |
| Style Enforcer | Claude Haiku 4 | ~8K | ~0.8K | ~$0.003 |
| Total per PR | | ~24K | ~3.3K | ~$0.073 |

At $0.073 per PR, reviewing 50 PRs per day costs $3.65. That is less than a single human review hour.
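The per-review arithmetic is simple to reproduce. The per-million-token prices below are placeholder assumptions for illustration -- check your provider's current pricing:

```python
def cost_per_review(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Estimate one agent's review cost from token counts and $/million-token prices."""
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# Example: 8K input + 1.5K output at an assumed $3/M input, $15/M output.
bug_hunter_cost = cost_per_review(8_000, 1_500, 3.0, 15.0)
```

Multiply each agent's per-review cost by your monthly PR volume to get the figures in the cost table.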

Setup Instructions

Step 1: Create an Ivern AI Account

Sign up at ivern.ai/signup. The free tier lets you set up 3 agents and run 100 reviews per month. No credit card required.

Step 2: Add Your API Keys (BYOK)

Ivern AI uses a Bring Your Own Key model. Navigate to Settings > API Keys and add your Anthropic key. You can also add OpenAI and Google keys if you want to mix models across agents.

Why BYOK matters: you pay the raw API price with no markup. A $100 Anthropic credit gives you $100 of compute. On platforms that charge per-seat, the same usage costs $200-500/month per developer.

Step 3: Create the Three Agents

In the Ivern AI dashboard, create three agents:

  1. Bug Hunter -- Select Claude Sonnet 4. Paste the system prompt above. Set the task type to "Code Review."
  2. Security Scanner -- Select Claude Sonnet 4. Paste the security system prompt. Set the task type to "Code Review."
  3. Style Enforcer -- Select Claude Haiku 4. Paste the style system prompt. Set the task type to "Code Review."

Step 4: Create a Pipeline

Go to Pipelines > New Pipeline. Name it "PR Code Review." Add all three agents and set them to run in parallel. Configure the output to merge all findings into a single sorted report (CRITICAL first, then HIGH, MEDIUM, LOW).
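The merge-and-sort step the pipeline performs can be sketched like this (illustrative only -- Ivern AI does this for you):

```python
# Lower rank sorts first, so CRITICAL findings lead the report.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def merge_reports(*agent_reports):
    """Combine per-agent finding lists into one report, most severe first."""
    merged = [finding for report in agent_reports for finding in report]
    return sorted(merged, key=lambda f: SEVERITY_RANK[f["severity"]])

report = merge_reports(
    [{"severity": "MEDIUM", "msg": "missing error handling"}],
    [{"severity": "CRITICAL", "msg": "plaintext password"},
     {"severity": "HIGH", "msg": "no email validation"}],
    [{"severity": "LOW", "msg": "missing docstring"}],
)
```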

Step 5: Connect Your Repository

Link your GitHub repository under Integrations. Ivern AI will listen for new pull requests and automatically trigger the pipeline. You can also configure it to run on push to specific branches or on a manual trigger via a slash command in PR comments.

Step 6: Test with a Sample PR

Open a test PR with a few intentional bugs: a null pointer risk, a hardcoded API key, and a missing docstring. The pipeline should catch all three within 60 seconds.
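A minimal test file that plants all three issues might look like this (the API key is obviously fake):

```python
# sample_bugs.py -- intentional issues to verify each agent fires
API_KEY = "sk-live-0000000000"  # intentional: hardcoded secret for the Security Scanner

def get_display_name(user):
    # intentional: no docstring (Style Enforcer) and no None check (Bug Hunter)
    return user["name"].strip().title()
```

If all three agents report their respective issue within a minute, the pipeline is wired up correctly.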

Real Workflow Example: Reviewing a Python PR

Consider a PR that adds a user registration endpoint to a Flask application. Here is the diff:

# Added to routes/user.py

@app.route("/register", methods=["POST"])
def register():
    data = request.json
    user = User(
        username=data["username"],
        email=data["email"],
        password=data["password"]  # plaintext
    )
    db.session.add(user)
    db.session.commit()

    send_welcome_email(
        to=data["email"],
        api_key="sk-live-a1b2c3d4e5f6"  # hardcoded
    )

    return jsonify({"id": user.id, "username": user.username})

Here is what each agent reports:

Bug Hunter findings:

[MEDIUM] routes/user.py:5 - Missing error handling for request.json.
If the request body is empty or not JSON, data["username"] will raise
a KeyError. Wrap in a try/except or validate with a schema.

Suggested fix:
    data = request.get_json()
    if not data or "username" not in data or "email" not in data:
        return jsonify({"error": "Missing required fields"}), 400

Security Scanner findings:

[CRITICAL] routes/user.py:6 - Password stored in plaintext.
Use a hashing function (bcrypt, argon2) before storing.

Suggested fix:
    from werkzeug.security import generate_password_hash
    password=generate_password_hash(data["password"])

[CRITICAL] routes/user.py:11 - Hardcoded API key exposed in source.
Move to an environment variable.

Suggested fix:
    api_key=os.environ.get("SENDGRID_API_KEY")

[HIGH] routes/user.py:3 - No input validation on email format.
Accepts any string, which could lead to injection or malformed data.

Style Enforcer findings:

[LOW] routes/user.py:3 - Missing docstring on public function.
Add a docstring describing the endpoint, parameters, and return format.

[LOW] routes/user.py:3 - Missing type hints.
Consider adding type annotations for better maintainability.

The merged report surfaces 2 CRITICAL issues (plaintext password and hardcoded secret) that must be fixed before merge, plus 1 HIGH and 1 MEDIUM item for follow-up. The 2 LOW style items are optional.

Total review time: 47 seconds. Total cost: $0.073.
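Pulling the suggested fixes together, the corrected logic looks roughly like this. To keep the sketch dependency-free, stdlib PBKDF2 stands in for werkzeug's generate_password_hash, and the validation mirrors the Bug Hunter suggestion:

```python
import hashlib
import os
import secrets

def validate_registration(data):
    """Return an error message for a bad payload, or None if it is valid."""
    required = ("username", "email", "password")
    if not data or any(field not in data for field in required):
        return "Missing required fields"
    if "@" not in data["email"]:  # minimal check; use a real validator in production
        return "Invalid email address"
    return None

def hash_password(password):
    """Salted PBKDF2 hash -- a stand-in for werkzeug's generate_password_hash."""
    salt = secrets.token_hex(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), 100_000)
    return f"pbkdf2:sha256:{salt}${digest.hex()}"

# The hardcoded key moves to the environment, per the Security Scanner fix.
SENDGRID_API_KEY = os.environ.get("SENDGRID_API_KEY")
```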

Cost Breakdown

Here is the monthly cost at different PR volumes:

| PRs/Month | Bug Hunter | Security Scanner | Style Enforcer | Total |
|---|---|---|---|---|
| 50 | $2.00 | $1.50 | $0.15 | $3.65 |
| 200 | $8.00 | $6.00 | $0.60 | $14.60 |
| 500 | $20.00 | $15.00 | $1.50 | $36.50 |
| 1,000 | $40.00 | $30.00 | $3.00 | $73.00 |

Compare this to hiring a reviewer at $80K/year. Even at 1,000 PRs per month, the AI pipeline costs $876 per year -- roughly 1% of a full-time salary.

Comparison to GitHub Copilot Code Review, CodeRabbit, and Others

| Feature | Ivern AI Pipeline | GitHub Copilot Review | CodeRabbit | SonarQube |
|---|---|---|---|---|
| Multi-agent review | Yes (3+ specialized agents) | No (single model) | No (single model) | No (rule-based) |
| Custom system prompts | Yes | No | Limited | Yes (custom rules) |
| Model choice | Any (BYOK) | OpenAI only | Fixed models | N/A (static analysis) |
| Cost per review | $0.03-0.10 | Included in $19-39/mo seat | $0 (free tier) / $12/mo | $150+/mo (team) |
| Bug detection | High (85-95%) | Medium (60-70%) | Medium-High (70-80%) | Medium (pattern-based) |
| Security scanning | Yes (OWASP) | Basic | Yes | Yes |
| Style enforcement | Yes (custom rules) | Basic | Yes | Yes (lint rules) |
| Self-hosted option | Yes (BYOK keys) | No | No | Yes |
| CI/CD integration | GitHub, GitLab, Bitbucket | GitHub only | GitHub, GitLab | GitHub, GitLab, Bitbucket |

Key differences:

  • GitHub Copilot Code Review uses a single model with a fixed prompt. You cannot customize what it checks or swap models. It is included in the Copilot subscription ($19-39/month per seat) but you cannot mix Claude and GPT models.
  • CodeRabbit offers a free tier and decent default reviews. But you cannot create specialized agents or control the review pipeline. It uses fixed models and fixed prompts.
  • SonarQube is rule-based static analysis. It catches style and pattern-based issues well but cannot reason about logic bugs the way an LLM can.
  • Ivern AI gives you full control over agents, models, and prompts. With BYOK, you pay raw API prices with no markup. You can add a fourth agent (performance reviewer, accessibility checker) without changing your existing setup.

Tips for Better AI Code Review Output

1. Send the diff, not the full file. Diffs are smaller, faster to process, and produce more focused reviews. Include 3-5 lines of context around each change for grounding.

2. Add a project context file. Create a .ivern/context.md file describing your tech stack, coding conventions, and known patterns. Agents use this as additional context during review.

3. Tune severity thresholds. Start with all severities reported, then filter out LOW items after a week if they create noise. You can configure the pipeline to post only CRITICAL and HIGH findings as PR comments.

4. Run style checks only on changed files. There is no value in flagging style issues in files your PR did not touch. Configure the Style Enforcer to only review lines in the diff.
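Restricting review to the diff means knowing which new-file lines actually changed. A unified-diff hunk header ("@@ -old,count +new,count @@") carries that information; a minimal parser:

```python
import re

def changed_lines(diff_text):
    """Return the set of new-file line numbers that were added in a unified diff."""
    changed = set()
    new_line = 0
    for line in diff_text.splitlines():
        header = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line)
        if header:
            new_line = int(header.group(1))  # hunk restarts the new-file counter
        elif line.startswith("+") and not line.startswith("+++"):
            changed.add(new_line)  # added line exists in the new file
            new_line += 1
        elif line.startswith("-") and not line.startswith("---"):
            pass  # deleted line: present only in the old file, no counter advance
        else:
            new_line += 1  # context line
    return changed

diff = """@@ -10,3 +10,4 @@
 unchanged
+added line
 unchanged
 unchanged"""
```

Feed the resulting line set to the Style Enforcer so it only comments on lines the PR touched.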

5. Add a "known false positives" list. If an agent repeatedly flags something that is intentional, add it to an ignore file (.ivern/ignore.json). This improves precision over time.
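The suppression step itself is a simple filter. The rule schema below is an assumption for illustration -- not Ivern AI's documented .ivern/ignore.json format:

```python
import fnmatch

# Assumed ignore-rule shape: {"file": glob pattern, "rule": rule identifier}
IGNORE_RULES = [
    {"file": "migrations/*.py", "rule": "missing-docstring"},
]

def filter_findings(findings, ignore_rules=IGNORE_RULES):
    """Drop findings matched by an ignore rule; keep everything else."""
    return [
        f for f in findings
        if not any(
            fnmatch.fnmatch(f["file"], rule["file"]) and f["rule"] == rule["rule"]
            for rule in ignore_rules
        )
    ]

findings = [
    {"file": "migrations/0001_init.py", "rule": "missing-docstring"},
    {"file": "routes/user.py", "rule": "missing-docstring"},
]
kept = filter_findings(findings)
```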

6. Use different models for different languages. Claude Sonnet 4 excels at Python and JavaScript review. For Rust or Go, consider GPT-4.1 which has strong performance on systems programming languages. Ivern AI's BYOK model lets you route agents to any model.

FAQ

How long does a full 3-agent review take?

Most PRs complete in 30-60 seconds. Large PRs (over 1,000 lines changed) may take 90-120 seconds. All three agents run in parallel, so the total time is determined by the slowest agent (usually Bug Hunter).

Can I add more than 3 agents?

Yes. Ivern AI supports unlimited agents per pipeline. Common additions include a Performance Reviewer (checks for N+1 queries, unnecessary re-renders), an Accessibility Checker (validates ARIA labels, semantic HTML), and a Dependency Auditor (flags outdated or vulnerable packages).

Does this work with languages other than Python?

Yes. The pipeline works with any language. The system prompts are language-agnostic. For language-specific checks (like Rust borrow checker issues or Go goroutine leaks), add language-specific instructions to the agent's system prompt.

What if I do not want to use Anthropic models?

Ivern AI supports BYOK with Anthropic, OpenAI, Google, and Mistral. You can run the entire pipeline on GPT-4.1 or Gemini 2.5 Flash if you prefer. Costs will differ based on the model's pricing.

How do I get started?

Sign up at ivern.ai/signup, add your API key, create the three agents, and connect your repository. The whole setup takes under 10 minutes. The free tier covers 100 reviews per month.


Ready to stop shipping bugs? Build your AI code review pipeline on Ivern AI -- free to start, BYOK pricing, and your first review runs in under 5 minutes.
