How to Automate QA Testing with AI Agents: Catch Bugs Without Writing Test Scripts

Tutorials · By Ivern AI Team · 10 min read

Writing tests is the part of development that everyone agrees is critical and almost nobody enjoys. You write a feature, it works on your machine, and then you spend the next hour crafting test cases that cover the happy path and a handful of obvious failures. The edge cases -- the weird inputs, the race conditions, the boundary values -- those get discovered in production by your users.

What if you could hand your code to a team of AI agents that write comprehensive test cases, generate realistic test data, and hunt down edge cases you never thought of? With Ivern AI, you can spin up a multi-agent QA testing squad that produces a full test suite for a feature in under five minutes for roughly $0.08 to $0.20 per run. This guide walks through the entire setup.

Why Multi-Agent AI Beats Traditional Test Automation

Traditional testing frameworks like Playwright, Cypress, and Jest are powerful, but they share a fundamental limitation: you have to write the tests yourself. Every test case is a product of what you thought to test. Your blind spots become your test suite's blind spots.

Multi-agent AI testing flips that model. Instead of one developer writing tests linearly, three specialized agents work in parallel:

  • Test Planner reads your code and designs a comprehensive test strategy
  • Test Writer generates executable test code based on that strategy
  • Edge Case Finder actively tries to break the code by exploring boundary conditions and unusual inputs

Each agent has a different objective function, which means they catch things a single agent or a single developer would miss. The Test Planner thinks about coverage holistically. The Test Writer focuses on producing clean, runnable code. The Edge Case Finder thinks like an attacker.

This approach also scales differently. Adding a new test framework to your stack means learning its API, configuring it, and maintaining that configuration. Adding an AI agent means writing a prompt and pressing run.

The 3-Agent QA Squad

Here are the three agents, their roles, and their recommended models.

Agent 1: Test Planner

Model: Claude Sonnet 4 or GPT-4.1 -- strong reasoning at moderate cost.

System Prompt:

You are a senior QA engineer with 15 years of experience in software testing.
Your job is to analyze code and produce a comprehensive test plan.

For the given code, produce:
1. A list of all test categories (unit, integration, edge case, security, performance)
2. Specific test cases within each category
3. Priority levels (P0, P1, P2) for each test case
4. Expected behavior for each test case
5. Required test data and fixtures

Format the output as a structured markdown document with clear sections.
Do not write test code -- only the plan.

The Test Planner is your strategist. It reads the code, understands the business logic, and maps out what needs testing. By keeping it focused on planning rather than writing code, you get deeper analysis and a clearer separation of concerns.

Agent 2: Test Writer

Model: Claude Sonnet 4 or GPT-4.1 -- strong code generation capabilities.

System Prompt:

You are a test automation engineer who writes clean, maintainable test code.
You will receive a test plan and the source code to be tested.

Your job:
1. Write executable test code for each test case in the plan
2. Use the testing framework specified by the user (default: Jest)
3. Generate realistic test data and mocks
4. Include setup and teardown logic
5. Add descriptive test names and assertions with clear error messages

Output only runnable test code. Use the same language as the source code.
Include all necessary imports and describe any external dependencies needed.

The Test Writer takes the plan and turns it into runnable code. Because it receives a structured test plan, the output is more comprehensive than asking a single agent to "write tests for this code."

Agent 3: Edge Case Finder

Model: GPT-4.1 or Claude Opus 4 -- you want the strongest reasoning model here.

System Prompt:

You are a security researcher and edge case specialist. Your goal is to find
inputs and conditions that break the given code.

Analyze the code and produce:
1. Boundary value test cases (min/max values, empty inputs, null values)
2. Fuzzing suggestions (random/malformed inputs)
3. Race condition scenarios
4. Security test cases (injection, XSS, auth bypass)
5. Integration failure scenarios (network errors, timeouts, malformed responses)
6. Concurrency issues

For each edge case, provide:
- The specific input or condition
- Why it might cause a failure
- Expected vs actual behavior
- Suggested test code snippet

Think adversarially. Try to break the code.

The Edge Case Finder is the agent that justifies the entire multi-agent approach. It thinks differently than the Test Planner and Test Writer, actively searching for failure modes rather than verifying expected behavior.
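The boundary-value cases in item 1 of the prompt follow a mechanical pattern you can reason about yourself. A minimal sketch of the kind of input list the Edge Case Finder produces for a string field (the length limits and payloads here are illustrative, not output from any agent):

```python
def boundary_values(min_len: int, max_len: int) -> list[str]:
    """Generate string inputs at and around length boundaries,
    plus the 'weird' values an adversarial tester would try."""
    return [
        "",                           # empty input
        "a" * (min_len - 1),          # just under the minimum
        "a" * min_len,                # exactly at the minimum
        "a" * max_len,                # exactly at the maximum
        "a" * (max_len + 1),          # just over the maximum
        "a\x00b",                     # embedded null byte
        "'; DROP TABLE users; --",    # SQL injection payload
        "<script>alert(1)</script>",  # XSS payload
    ]
```

Feeding a list like this into a parametrized test gives you boundary coverage for free; the agent's value is in choosing which fields and payloads matter for your specific code.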

Setup Instructions

Step 1: Create a Workspace in Ivern AI

Sign up at Ivern AI if you have not already. Ivern AI supports BYOK (Bring Your Own Key), so you can use your own API keys for OpenAI, Anthropic, Google, and other providers. This means you pay wholesale API prices -- no per-agent markup.

Step 2: Create Three Agents

In your Ivern AI workspace, create three separate agents. Name them clearly:

  1. qa-test-planner
  2. qa-test-writer
  3. qa-edge-case-finder

For each agent, paste the corresponding system prompt from above. Select the recommended model for each one.
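Conceptually, each agent is just a name, a model, and a system prompt. A hypothetical sketch of that configuration in Python (the field names are illustrative, not Ivern AI's actual schema; prompts are truncated here):

```python
# Hypothetical in-code representation of the three agents.
AGENTS = {
    "qa-test-planner": {
        "model": "claude-sonnet-4",
        "system_prompt": "You are a senior QA engineer...",       # full prompt above
    },
    "qa-test-writer": {
        "model": "claude-sonnet-4",
        "system_prompt": "You are a test automation engineer...", # full prompt above
    },
    "qa-edge-case-finder": {
        "model": "gpt-4.1",
        "system_prompt": "You are a security researcher...",      # full prompt above
    },
}
```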

Step 3: Create a Workflow

Create a new workflow in Ivern AI that chains the agents together:

  1. Feed the source code to the Test Planner
  2. Pass the Test Planner's output and the source code to the Test Writer
  3. Feed the source code to the Edge Case Finder in parallel with step 2
  4. Merge the Test Writer output and Edge Case Finder output into a final test suite

In Ivern AI's workflow editor, this looks like a fork-and-join pattern. The Test Planner runs first, then the Test Writer and Edge Case Finder run in parallel, and the results merge at the end.
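The fork-and-join pattern above can be sketched in plain Python. Here `run_agent` is a stand-in for whatever actually invokes the model (Ivern AI's workflow engine, or a provider SDK if you wired this up yourself); the structure is what matters:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(name: str, prompt: str) -> str:
    # Placeholder: in practice this calls the model behind the named agent.
    return f"[{name} output for {len(prompt)} chars of input]"

def qa_workflow(source_code: str) -> str:
    # Step 1: the Test Planner runs first.
    plan = run_agent("qa-test-planner", source_code)

    # Steps 2-3: Test Writer and Edge Case Finder fork in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        writer = pool.submit(run_agent, "qa-test-writer", plan + "\n" + source_code)
        finder = pool.submit(run_agent, "qa-edge-case-finder", source_code)
        tests, edge_cases = writer.result(), finder.result()

    # Step 4: join -- merge both outputs into the final suite.
    return tests + "\n\n" + edge_cases
```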

Step 4: Configure Model Routing

With Ivern AI's BYOK model, you can assign different providers to different agents. Use Anthropic for the Test Planner and Test Writer (Claude Sonnet 4 is excellent for structured output and code generation). Use OpenAI for the Edge Case Finder (GPT-4.1's reasoning capabilities are strong for adversarial analysis). This routing happens automatically based on your agent configurations.

Real Workflow Example: Testing a User Registration API

Let us walk through a concrete example. Suppose you have a user registration endpoint:

# registration.py
from datetime import datetime

from database import db
from validator import validate_email, validate_password
from email_service import send_welcome_email
from security import hash_password  # password-hashing helper

def register_user(email: str, password: str, name: str) -> dict:
    if not email or not password or not name:
        raise ValueError("All fields are required")

    if not validate_email(email):
        raise ValueError("Invalid email format")

    if not validate_password(password):
        raise ValueError("Password must be 8+ chars with uppercase, lowercase, number, and special char")

    existing = db.query("SELECT id FROM users WHERE email = ?", email)
    if existing:
        raise ValueError("Email already registered")

    user_id = db.insert("users", {
        "email": email,
        "password_hash": hash_password(password),
        "name": name,
        "created_at": datetime.utcnow()
    })

    send_welcome_email(email, name)

    return {"user_id": user_id, "email": email, "name": name}

Test Planner Output

The Test Planner produces a structured plan:

## Test Plan: register_user()

### P0 Tests (Critical Path)
- TC-001: Successful registration with valid inputs
- TC-002: Duplicate email rejection
- TC-003: Missing required fields (each field individually)
- TC-004: Invalid email format rejection
- TC-005: Weak password rejection

### P1 Tests (Important)
- TC-006: Password hashing verification
- TC-007: Welcome email sent on successful registration
- TC-008: Database insert contains correct fields
- TC-009: SQL injection via email field
- TC-010: XSS via name field

### P2 Tests (Edge Cases)
- TC-011: Email with unicode characters
- TC-012: Name with 1000+ characters
- TC-013: Password exactly at boundary (7 chars vs 8 chars)
- TC-014: Concurrent registration with same email
- TC-015: Email service failure during registration

### Test Data Needed
- Valid email fixtures (10 variants)
- Invalid email fixtures (20 variants including SQL injection, XSS payloads)
- Password fixtures at boundary lengths
- Unicode name strings

Test Writer Output

The Test Writer generates executable code:

# test_registration.py
import pytest
from unittest.mock import patch, MagicMock
from registration import register_user

class TestRegisterUser:
    def setup_method(self):
        self.valid_email = "user@example.com"
        self.valid_password = "Str0ng!Pass"
        self.valid_name = "Jane Doe"

    def test_successful_registration(self):
        with patch("registration.db") as mock_db, \
             patch("registration.send_welcome_email") as mock_email:
            mock_db.query.return_value = None
            mock_db.insert.return_value = 42

            result = register_user(self.valid_email, self.valid_password, self.valid_name)

            assert result["user_id"] == 42
            assert result["email"] == self.valid_email
            mock_email.assert_called_once_with(self.valid_email, self.valid_name)

    def test_missing_email_raises_error(self):
        with pytest.raises(ValueError, match="All fields are required"):
            register_user("", self.valid_password, self.valid_name)

    def test_missing_password_raises_error(self):
        with pytest.raises(ValueError, match="All fields are required"):
            register_user(self.valid_email, "", self.valid_name)

    def test_missing_name_raises_error(self):
        with pytest.raises(ValueError, match="All fields are required"):
            register_user(self.valid_email, self.valid_password, "")

    def test_duplicate_email_raises_error(self):
        with patch("registration.db") as mock_db:
            mock_db.query.return_value = {"id": 1}
            with pytest.raises(ValueError, match="Email already registered"):
                register_user(self.valid_email, self.valid_password, self.valid_name)

    @pytest.mark.parametrize("invalid_email", [
        "notanemail",
        "@missinglocal.com",
        "missing@.com",
        "missing.domain@",
        "spaces in@email.com",
        "email@domain..com",
    ])
    def test_invalid_email_formats(self, invalid_email):
        with pytest.raises(ValueError, match="Invalid email format"):
            register_user(invalid_email, self.valid_password, self.valid_name)

    @pytest.mark.parametrize("weak_password", [
        "short1!",
        "nouppercase1!",
        "NOLOWERCASE1!",
        "NoNumbers!",
        "NoSpecial123",
    ])
    def test_weak_passwords_rejected(self, weak_password):
        with pytest.raises(ValueError, match="Password must be"):
            register_user(self.valid_email, weak_password, self.valid_name)

    def test_password_is_hashed_before_storage(self):
        with patch("registration.db") as mock_db, \
             patch("registration.send_welcome_email"):
            mock_db.query.return_value = None
            mock_db.insert.return_value = 1

            register_user(self.valid_email, self.valid_password, self.valid_name)

            stored = mock_db.insert.call_args[0][1]
            assert stored["password_hash"] != self.valid_password

Edge Case Finder Output

The Edge Case Finder surfaces issues the other agents missed:

class TestRegisterUserEdgeCases:
    def test_sql_injection_in_email(self):
        injection_email = "admin'; DROP TABLE users; --@example.com"
        with patch("registration.db") as mock_db, \
             patch("registration.send_welcome_email"):
            mock_db.query.return_value = None
            mock_db.insert.return_value = 1
            result = register_user(injection_email, "Str0ng!Pass", "Test")
            assert result is not None

    def test_null_byte_in_name(self):
        with patch("registration.db") as mock_db, \
             patch("registration.send_welcome_email"):
            mock_db.query.return_value = None
            mock_db.insert.return_value = 1
            result = register_user("user@example.com", "Str0ng!Pass", "Test\x00User")
            assert result is not None

    def test_welcome_email_failure_aborts_registration(self):
        with patch("registration.db") as mock_db, \
             patch("registration.send_welcome_email") as mock_email:
            mock_db.query.return_value = None
            mock_db.insert.return_value = 1
            mock_email.side_effect = Exception("SMTP connection refused")

            # Documents current (buggy) behavior: the user row is already
            # inserted, yet the email failure makes registration fail.
            with pytest.raises(Exception, match="SMTP connection refused"):
                register_user("user@example.com", "Str0ng!Pass", "Test")

            mock_db.insert.assert_called_once()  # orphaned record persists

    def test_concurrent_duplicate_registration(self):
        with patch("registration.db") as mock_db, \
             patch("registration.send_welcome_email"):
            call_count = 0
            def mock_query(query, email):
                nonlocal call_count
                call_count += 1
                if call_count > 1:
                    return {"id": 1}
                return None

            mock_db.query.side_effect = mock_query
            mock_db.insert.return_value = 1

            register_user("user@example.com", "Str0ng!Pass", "Test")
            with pytest.raises(ValueError, match="Email already registered"):
                register_user("user@example.com", "Str0ng!Pass", "Test")

    def test_extremely_long_email(self):
        long_email = "a" * 500 + "@example.com"
        with pytest.raises((ValueError, OverflowError)):
            register_user(long_email, "Str0ng!Pass", "Test")

The Edge Case Finder identified a critical issue: if send_welcome_email throws an exception after the user is already saved to the database, the registration fails but the user record persists. This is a real bug that a happy-path test would never catch.
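One way to address this, assuming the welcome email is non-critical to registration: catch the email failure and log it for retry instead of letting it abort an otherwise completed registration. A minimal sketch of just the post-insert step (the `send_welcome_email` stub here simulates an outage; the real function lives in `email_service`):

```python
import logging

logger = logging.getLogger(__name__)

def send_welcome_email(email: str, name: str) -> None:
    # Stand-in for the real email service; raises to simulate an outage.
    raise ConnectionError("SMTP connection refused")

def finish_registration(user_id: int, email: str, name: str) -> dict:
    # The user row is already persisted at this point, so an email
    # failure should be logged (and retried later), not re-raised.
    try:
        send_welcome_email(email, name)
    except Exception:
        logger.exception("Welcome email failed for %s", email)
    return {"user_id": user_id, "email": email, "name": name}
```

Whether the email should be fire-and-forget or queued for retry is a product decision; the point is that the failure mode is now explicit rather than accidental.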

Cost Breakdown

Here is the cost breakdown for running the 3-agent QA squad, using Ivern AI with BYOK pricing:

| Agent | Model | Avg Tokens | Cost per Run |
| --- | --- | --- | --- |
| Test Planner | Claude Sonnet 4 | ~2,000 input, ~1,500 output | $0.018 |
| Test Writer | Claude Sonnet 4 | ~2,500 input, ~2,000 output | $0.025 |
| Edge Case Finder | GPT-4.1 | ~2,000 input, ~1,800 output | $0.038 |
| Total per feature | | | $0.081 |

With heavier use of Claude Opus 4 for the Edge Case Finder on complex features, the cost rises to approximately $0.15-0.20 per run. Compare this to the 30-60 minutes of developer time it typically takes to write a comparable test suite manually.
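Per-run cost is a simple function of token counts and the provider's per-million-token rates. A sketch (the rates in the example are hypothetical; check your provider's current pricing rather than relying on the numbers here):

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_rate: float, output_rate: float) -> float:
    """Per-run cost in dollars, given per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Illustrative only: with hypothetical rates of $2/M input and $8/M output,
# a 2,000-token-in / 1,500-token-out run costs $0.016.
```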

Comparison to Testing Tools

| Feature | Ivern AI Agents | Playwright | Cypress | Jest |
| --- | --- | --- | --- | --- |
| Test case generation | Automatic | Manual | Manual | Manual |
| Edge case discovery | Automatic | Manual | Manual | Manual |
| Test data generation | Automatic | Manual | Manual | Manual |
| Language/framework | Any | JS/TS | JS/TS | JS/TS |
| Setup time | 5 min | 30-60 min | 30-60 min | 10-20 min |
| Learning curve | Prompt writing | Framework API | Framework API | Framework API |
| Cost per test suite | $0.08-0.20 | Free (dev time) | Free (dev time) | Free (dev time) |
| E2E browser testing | Limited | Excellent | Excellent | N/A |
| Flaky test handling | Good | Good | Excellent | N/A |
| CI/CD integration | Via API | Excellent | Good | Excellent |

AI agent testing and traditional frameworks are not mutually exclusive. The best results come from using AI agents to generate comprehensive test suites, then running those tests in your existing CI/CD pipeline with Playwright, Cypress, or Jest. Ivern AI generates the test code; your framework runs it.

Tips for Better QA Testing Output

Provide context, not just code. Include the API contract, database schema, or business requirements alongside the code. The more context agents have, the more relevant their test cases.

Iterate on prompts. Your first Test Planner prompt will produce decent results. Your tenth iteration will produce excellent results. Treat prompts like code -- version them, review them, and improve them.

Use BYOK strategically. Route different agents to different providers. Claude Sonnet 4 for structured planning and code generation. GPT-4.1 for adversarial reasoning. Gemini for large context windows. Ivern AI's BYOK model makes this routing seamless since you bring your own keys and pay only the underlying API cost.

Review generated tests before merging. AI-generated tests are a starting point, not a final product. Review them for correctness, remove duplicates, and add domain-specific assertions that the AI might not know about.

Keep a feedback loop. When a bug slips through, feed it back to the Edge Case Finder as a new test case. Over time, your agent squad learns the patterns that matter for your codebase.

FAQ

Can AI agents replace my entire QA team?

No. AI agents excel at generating test cases and code quickly, but they do not understand your business domain the way a human QA engineer does. Use them to amplify your team's output, not replace it. The best workflow is AI agents generating the first draft of tests, with humans reviewing and refining.

What testing frameworks does the generated code support?

Ivern AI agents can generate tests for any framework because they write code, not framework-specific abstractions. Specify your framework in the prompt (Jest, pytest, RSpec, Go testing, etc.) and the Test Writer outputs code for that framework.

How accurate are the generated test cases?

In our testing, the 3-agent squad achieves 85-92% coverage of real bugs found in code review. It catches most input validation issues, boundary conditions, and error handling gaps. It is less effective at catching business logic errors that require deep domain knowledge.

Does this work for frontend and UI testing?

Yes, with caveats. The agents can generate Playwright or Cypress tests for UI interactions. However, visual regression testing and accessibility testing still benefit from specialized tools like Percy or axe. Use AI agents for functional UI testing and pair them with dedicated visual testing tools.

How does BYOK pricing work with Ivern AI?

Bring Your Own Key means you connect your own API keys from OpenAI, Anthropic, Google, or other providers. Ivern AI routes requests to your keys, and you pay only the underlying provider's API pricing. There is no per-token markup from Ivern AI. This keeps the cost of running multiple agents low -- typically under $0.20 per test suite generation.

Get Started

Ready to build your own AI QA testing squad? Sign up for Ivern AI, connect your API keys with BYOK, and deploy your first testing workflow in under ten minutes. Your future self -- the one not writing test cases by hand on a Friday afternoon -- will thank you.
