AI Agent Workflow for IT Operations: Incident Response and Runbook Automation

Tutorials · By Ivern AI Team · 12 min read

TL;DR: A three-agent IT ops squad -- Triage Agent (GPT-4.1-mini, ~$0.03), Docs Writer (Claude Sonnet 4, up to ~$0.10), Review Agent (GPT-4.1-mini, ~$0.01) -- automates incident triage, runbook generation, and post-mortem reports. Total cost: $0.07-$0.14 per run depending on task complexity. This guide covers three production workflows with exact prompts and cost breakdowns.

IT operations teams spend 40% of their time on documentation and communication tasks that are critical but repetitive. Incident response requires fast triage and clear communication. Runbooks need to be written, updated, and maintained. Post-mortems demand thorough analysis and blameless documentation. Each of these tasks follows a predictable structure -- exactly where AI agent squads deliver value.

The challenge with IT ops documentation is that it requires both technical accuracy and clear communication. A runbook that is technically correct but unreadable is useless at 3 AM during an outage. A post-mortem that is well-written but misses the root cause is dangerous. Multi-agent workflows handle both dimensions: one agent ensures technical completeness, another handles clarity and structure.

The IT Ops Agent Squad

Agent Configuration

| Agent | Model | Role | Cost per Run |
| --- | --- | --- | --- |
| Triage Agent | GPT-4.1-mini | Classify incidents, assess severity, suggest initial actions | ~$0.03 |
| Docs Writer | Claude Sonnet 4 | Draft runbooks, post-mortems, change management docs | ~$0.08-$0.10 |
| Review Agent | GPT-4.1-mini | Verify technical accuracy, check completeness, enforce templates | ~$0.01 |

Total per run: $0.07-$0.14 depending on task complexity and input size (see the per-task breakdown in the Cost Analysis section).

Why These Models?

  • GPT-4.1-mini for triage and review: fast inference at $0.40/M input tokens, strong at structured classification and checklist verification. Well suited when incidents need categorization within seconds rather than minutes.
  • Claude Sonnet 4 for documentation: produces clear, technically precise prose. Handles the nuanced requirements of blameless post-mortems and step-by-step runbook instructions.
  • GPT-4.1 as an upgrade option for complex incidents: when triage involves multi-system failure analysis, upgrade to GPT-4.1 at $2.00/M input tokens for deeper reasoning.
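To see how per-token prices translate into per-run costs, here is a minimal sketch of a cost estimator. The input prices for GPT-4.1-mini and GPT-4.1 are the ones quoted above; the output prices, the Claude Sonnet 4 prices, and the model identifier strings are assumptions, so check them against your providers' current pricing pages before relying on the numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Input prices for the GPT-4.1 family are quoted in this guide.
# Output prices and the Claude Sonnet 4 entry are assumptions.
PRICES = {
    "gpt-4.1-mini": ModelPrice(0.40, 1.60),
    "gpt-4.1": ModelPrice(2.00, 8.00),
    "claude-sonnet-4": ModelPrice(3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one model call."""
    price = PRICES[model]
    total = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    )
    return total / 1_000_000
```

Feeding in the token counts from your own task logs shows quickly whether upgrading the Triage Agent to GPT-4.1 is worth the roughly fivefold input price.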

Workflow 1: Incident Triage and Initial Response

The Problem

When an alert fires, the on-call engineer spends the first 10-15 minutes understanding the incident: What is it? How severe? Who is affected? What are the first steps? This triage phase delays resolution and increases MTTR (Mean Time to Resolve).

The Agent Pipeline

Task input (triggered by alert webhook):

"INCIDENT ALERT: Service 'payment-gateway' in us-east-1 is returning 503 errors at 12% rate. Started 3 minutes ago. Affected endpoints: /api/v2/charge, /api/v2/refund. Last deployment: payment-gateway v3.14.2, 47 minutes ago. CPU utilization: 89%. Error rate graph: [attached]. Recent changes: [change log data]."

Phase 1: Triage Agent (GPT-4.1-mini, ~$0.03)

The triage agent produces a structured assessment:

  • Severity: P1 (revenue-impacting, customer-facing)
  • Category: Deployment-related regression (correlation with recent deploy)
  • Affected systems: payment-gateway, potentially order-processing downstream
  • Estimated blast radius: 12% of payment attempts failing, approximately $4,200/hour in failed transactions based on normal volume
  • Recommended first actions:
    1. Verify rollback candidate: payment-gateway v3.14.1
    2. Check deployment diff for /api/v2/charge and /api/v2/refund handlers
    3. Page payment team lead (current on-call: [name])
    4. Prepare customer-facing status update draft
  • Similar incidents: INC-2025-0842 (same service, similar error pattern after deployment)
  • Communication template: Slack incident channel name, stakeholder notification list
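If you ask the Triage Agent to return its assessment as JSON, downstream agents can consume it reliably. The sketch below is one possible shape, not a format the platform prescribes: the field names are hypothetical, while the severity levels and categories are the ones listed in the triage prompt template in this guide.

```python
import json
from dataclasses import dataclass

SEVERITIES = {"P1", "P2", "P3", "P4"}
CATEGORIES = {"deployment", "infrastructure", "dependency", "configuration", "security"}


@dataclass
class TriageAssessment:
    # Field names are illustrative assumptions, not a fixed schema.
    severity: str
    category: str
    affected_systems: list[str]
    first_actions: list[str]


def parse_triage(raw: str) -> TriageAssessment:
    """Parse the agent's JSON output and reject values outside the allowed sets."""
    data = json.loads(raw)
    assessment = TriageAssessment(
        severity=data["severity"],
        category=data["category"],
        affected_systems=list(data["affected_systems"]),
        first_actions=list(data["first_actions"]),
    )
    if assessment.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {assessment.severity!r}")
    if assessment.category not in CATEGORIES:
        raise ValueError(f"unknown category: {assessment.category!r}")
    return assessment
```

Rejecting unknown severities at this boundary keeps a malformed model response from silently paging the wrong people.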

Phase 2: Docs Writer (Claude Sonnet 4, ~$0.08)

The writer produces:

  • Status page draft: Customer-facing update explaining the issue without exposing internal details
  • Slack incident channel kickoff message: Structured with severity, timeline, and assigned responders
  • Stakeholder email draft: For VP Engineering and Customer Success, with business impact quantification

Phase 3: Review Agent (GPT-4.1-mini, ~$0.01)

Checks severity classification against SLA definitions, verifies business impact calculations, confirms communication drafts follow incident response playbook templates.

Total cost: ~$0.12. Time: 90 seconds.

This replaces 10-15 minutes of manual triage with a structured assessment delivered before the on-call engineer finishes reading the alert.
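The business impact check in Phase 3 is simple arithmetic that can also be verified outside the model. In the example above, a 12% failure rate and roughly $4,200/hour in failed transactions imply a normal volume of about $35,000/hour; that volume figure is derived here, not stated in the alert. A minimal sketch, with hypothetical function names and an arbitrary 5% tolerance:

```python
def failed_revenue_per_hour(error_rate: float, normal_hourly_volume_usd: float) -> float:
    """Expected failed transaction value per hour for a given error rate."""
    return error_rate * normal_hourly_volume_usd


def impact_is_consistent(
    claimed_usd_per_hour: float,
    error_rate: float,
    normal_hourly_volume_usd: float,
    tolerance: float = 0.05,  # assumed: accept claims within 5% of the expected value
) -> bool:
    """Check that a claimed impact figure matches error rate times normal volume."""
    expected = failed_revenue_per_hour(error_rate, normal_hourly_volume_usd)
    return abs(claimed_usd_per_hour - expected) <= tolerance * expected
```

A deterministic check like this complements the Review Agent: the model judges wording and completeness, while the numbers are confirmed by code.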

Prompt Template for Incident Triage

Triage Agent:

"You are an incident triage assistant for an SRE team. Given alert data, produce: 1) Severity classification (P1-P4) based on: customer impact, revenue impact, blast radius, and SLA proximity, 2) Category (deployment, infrastructure, dependency, configuration, security), 3) Affected systems map, 4) Estimated business impact with reasoning, 5) Recommended first actions ordered by priority, 6) Links to similar past incidents if patterns match, 7) Stakeholder communication template. Be specific and actionable. Avoid speculation beyond what the data supports."

Docs Writer:

"You are an incident communication specialist. Given the triage assessment, produce: 1) Status page update (customer-facing, no internal details, under 100 words), 2) Slack incident channel kickoff (severity, systems, timeline, assigned responders), 3) Stakeholder email (business impact, ETA for update, current status). Tone: calm, factual, solution-oriented. Never speculate on root cause in customer-facing communications."
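The Triage to Writer to Review hand-off can be sketched as a small sequential pipeline. This is an illustration of the pattern only, not the platform's implementation: `Stage`, `run_pipeline`, and the `call_model` callback are hypothetical names, and the callback is where you would plug in whichever provider SDK or API you actually use.

```python
from dataclasses import dataclass
from typing import Callable

# (model, system_prompt, user_input) -> model text; supplied by the caller.
ModelCall = Callable[[str, str, str], str]


@dataclass(frozen=True)
class Stage:
    name: str
    model: str          # model identifier string; the exact value is an assumption
    system_prompt: str  # one of the prompt templates from this guide


def run_pipeline(
    stages: list[Stage], task_input: str, call_model: ModelCall
) -> dict[str, str]:
    """Run stages in order; each stage sees the task plus all earlier outputs."""
    outputs: dict[str, str] = {}
    context = f"TASK INPUT:\n{task_input}"
    for stage in stages:
        result = call_model(stage.model, stage.system_prompt, context)
        outputs[stage.name] = result
        # Append this stage's output so later stages can build on it.
        context += f"\n\n{stage.name.upper()} OUTPUT:\n{result}"
    return outputs
```

Passing the accumulated context forward lets the Review Agent compare the writer's drafts against the original triage assessment rather than seeing the drafts alone.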

Workflow 2: Runbook Documentation from Tickets

The Problem

Every IT team knows they need better runbooks. But writing them is time-consuming, and runbooks written during calm periods often miss the edge cases that matter during incidents. The best time to write a runbook is immediately after resolving an issue, but engineers rarely have time for documentation in the aftermath.

The Agent Pipeline

Task input:

"Create a runbook from the following resolved incident: INC-2026-0347. Service: auth-service. Issue: Token validation failing after certificate rotation. Resolution: Updated JWKS cache configuration to respect max-age header. Steps taken: 1) Identified cert mismatch in logs, 2) Checked JWKS endpoint response, 3) Found stale cache entries, 4) Updated cache config, 5) Deployed fix via hotfix pipeline. Time to resolve: 45 minutes. Root cause: Cache config did not account for automated cert rotation schedule."

Phase 1: Triage Agent (GPT-4.1-mini, ~$0.02)

Extracts structured data from the incident:

  • Service and component mapping
  • Symptom catalog: token validation errors, 401 responses, JWKS mismatch
  • Resolution steps in chronological order
  • Root cause category: configuration drift
  • Dependencies: certificate authority, JWKS endpoint, auth-service
  • Frequency risk: high (automated cert rotation happens quarterly)

Phase 2: Docs Writer (Claude Sonnet 4, ~$0.10)

Produces a complete runbook:

  • Title: Auth Service -- Token Validation Failure After Certificate Rotation
  • Symptoms: How to recognize this issue (specific error messages, log patterns, monitoring alerts)
  • Prerequisites: Access requirements, tools needed, approval requirements
  • Verification steps: How to confirm this is the issue (3 checks)
  • Resolution steps: Numbered, copy-paste-ready commands with expected output at each step
  • Rollback plan: What to do if the fix does not work
  • Prevention: Recommended config changes to prevent recurrence
  • Related runbooks: Links to certificate management and cache configuration docs

Phase 3: Review Agent (GPT-4.1-mini, ~$0.01)

Verifies: all resolution steps are present, commands are syntactically reasonable, prerequisites are listed, rollback plan exists, prevention section addresses root cause.

Total cost: ~$0.13. Time: 3-4 minutes.
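Part of the Review Agent's structural check can be backed by a deterministic test that every required section is present. The sketch below assumes each section heading appears on its own line, optionally followed by a colon; that layout is an assumption about how you ask the writer to format the runbook, and the function name is hypothetical.

```python
# Section names follow the runbook prompt template in this guide.
REQUIRED_SECTIONS = [
    "Symptoms",
    "Prerequisites",
    "Verification",
    "Resolution Steps",
    "Rollback Plan",
    "Prevention",
    "Related Runbooks",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return required section names that do not appear as a heading line."""
    headings = {
        line.strip().rstrip(":").lower() for line in runbook_text.splitlines()
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

An empty result means the skeleton is complete; whether the commands inside each section are correct still needs the Review Agent and a human reader.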

Prompt Template for Runbook Generation

Docs Writer (for runbooks):

"You are a senior SRE documenting a runbook. Given incident details, produce a runbook with these sections: 1) Title (service name -- specific issue), 2) Symptoms (how to recognize this issue: error messages, log patterns, alert names), 3) Prerequisites (access, tools, approvals needed), 4) Verification (2-3 checks to confirm this is the right issue), 5) Resolution Steps (numbered, with exact commands and expected output), 6) Rollback Plan (what to do if resolution fails), 7) Prevention (how to avoid recurrence), 8) Related Runbooks. Write for an engineer seeing this issue for the first time at 3 AM. Be explicit, not implicit."

Workflow 3: Post-Mortem Report Generation

The Problem

Post-mortems are the most valuable IT ops document -- and the most neglected. A thorough post-mortem takes 2-4 hours to write and requires input from multiple team members. Many teams skip them entirely or produce shallow summaries that miss systemic issues.

The Agent Pipeline

Task input:

"Generate a blameless post-mortem for INC-2026-0347. Incident data: [paste timeline, Slack logs, PagerDuty timeline, deployment records]. Participants: [names and roles]. Resolution: [summary]. Duration: 45 minutes. Impact: 12% of authentication requests failed for 23 minutes, approximately 8,400 users affected. Interview summaries from the responding engineer: [paste notes]."

Phase 1: Triage Agent (GPT-4.1-mini, ~$0.03)

Structures the raw incident data:

  • Timeline reconstruction from logs, Slack, and PagerDuty data
  • Impact quantification: duration, affected users, revenue impact, SLA breach calculation
  • Contributing factors: direct cause, enabling factors, systemic issues
  • Action items extracted from resolution steps and team discussion

Phase 2: Docs Writer (Claude Sonnet 4, ~$0.10)

Produces a complete post-mortem:

  • Executive summary: One paragraph with incident, impact, resolution, and key takeaway
  • Timeline: Minute-by-minute with timestamps, actions, and owners
  • Root cause analysis: Using 5 Whys methodology
  • Contributing factors: Direct, enabling, and systemic
  • Impact assessment: Quantified business and customer impact
  • What went well: Practices that helped resolve the incident quickly
  • What could be improved: Process gaps and delays
  • Action items: Owner, deadline, priority, and category (fix/process/tooling)
  • Lessons learned: Broader insights for the organization

Phase 3: Review Agent (GPT-4.1-mini, ~$0.01)

Checks: blameless language throughout, all participants mentioned fairly, timeline is complete, action items have owners and deadlines, no speculation presented as fact.

Total cost: ~$0.14. Time: 4-5 minutes.
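The requirement that action items carry an owner, a deadline, and a category is easy to enforce in code once the items are extracted into a structure. A minimal sketch, assuming a hypothetical `ActionItem` shape; the three categories are the ones named in the post-mortem outline above.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_CATEGORIES = {"fix", "process", "tooling"}


@dataclass
class ActionItem:
    # Hypothetical structure for illustration.
    description: str
    owner: Optional[str]
    deadline: Optional[str]  # e.g. an ISO date string
    category: str


def action_item_problems(items: list[ActionItem]) -> list[str]:
    """List every action item that lacks an owner, a deadline, or a valid category."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"{item.description}: missing owner")
        if not item.deadline:
            problems.append(f"{item.description}: missing deadline")
        if item.category not in ALLOWED_CATEGORIES:
            problems.append(f"{item.description}: unknown category {item.category!r}")
    return problems
```

Blameless language and fair treatment of participants are judgment calls better left to the Review Agent and a human editor.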

Prompt Template for Post-Mortems

Docs Writer (for post-mortems):

"You are writing a blameless post-mortem. Given incident data, timeline, and participant input, produce a post-mortem with: 1) Executive Summary (one paragraph), 2) Timeline (timestamps, actions, owners -- minute-by-minute), 3) Root Cause Analysis (5 Whys), 4) Contributing Factors (direct, enabling, systemic), 5) Impact Assessment (quantified: users, revenue, SLA), 6) What Went Well (specific practices), 7) What Could Be Improved (specific gaps), 8) Action Items (owner, deadline, priority, category), 9) Lessons Learned. CRITICAL: Use blameless language. Focus on systems and processes, never individuals. Present facts, not speculation."

Cost Analysis

Per-Task Cost

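| Task | Triage | Write | Review | Total | Time |
| --- | --- | --- | --- | --- | --- |
| Incident triage + comms | $0.03 | $0.08 | $0.01 | $0.12 | 90 sec |
| Runbook generation | $0.02 | $0.10 | $0.01 | $0.13 | 4 min |
| Post-mortem report | $0.03 | $0.10 | $0.01 | $0.14 | 5 min |
| Change management doc | $0.02 | $0.08 | $0.01 | $0.11 | 3 min |
| Incident handoff notes | $0.02 | $0.04 | $0.01 | $0.07 | 2 min |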

Monthly Cost by Incident Volume

| Incidents/Month | Documentation Cost | Engineer Time for Comparison |
| --- | --- | --- |
| 10 (small team) | ~$1.20 | 20-40 hours |
| 30 (mid team) | ~$3.60 | 60-120 hours |
| 60 (large team) | ~$7.20 | 120-240 hours |

At $75/hour average engineer cost, saving 20 hours/month of documentation time is worth $1,500. The agent squad costs $1-7/month.
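The monthly figures follow from two multiplications, sketched below so you can substitute your own numbers. The $0.12 per-incident cost is the value implied by the monthly cost table, and $75/hour is the engineer rate used above; the function names are illustrative.

```python
COST_PER_INCIDENT_USD = 0.12   # implied by the monthly cost table
ENGINEER_RATE_USD = 75.0       # average hourly engineer cost used in this guide


def monthly_agent_cost(incidents_per_month: int) -> float:
    """Agent spend per month, rounded to cents."""
    return round(incidents_per_month * COST_PER_INCIDENT_USD, 2)


def engineer_time_value(hours_saved: float) -> float:
    """Dollar value of documentation hours saved."""
    return hours_saved * ENGINEER_RATE_USD


for incidents in (10, 30, 60):
    print(incidents, monthly_agent_cost(incidents))
```

With these inputs, 20 saved hours are valued at $1,500 against about $1.20 of agent spend for a ten-incident month.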

Setup Guide

Step 1: Create Your Account

Go to ivern.ai/signup. Free to start, no credit card.

Step 2: Add API Keys

Add your OpenAI and Anthropic API keys in Settings. BYOK model -- you pay API prices directly with no markup. A $10 API budget covers approximately 70-80 IT ops documentation tasks.

Step 3: Create an IT Ops Squad

Create a squad named "IT Ops Docs." Add the three agents with the prompts above. Configure the pipeline: Triage to Writer to Review.

Step 4: Test with a Past Incident

Run a historical incident through the pipeline to validate output quality. Compare the agent-generated post-mortem against the one your team wrote manually. This gives you a baseline for prompt tuning.

Frequently Asked Questions

Can I connect this to PagerDuty or Opsgenie?

Yes. Use the incident data (alert payload, timeline, runbook links) as input to your triage task. You can paste the alert JSON directly into the task description. For automated pipelines, use the Ivern API to trigger tasks from webhook events.
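As a rough illustration of turning a webhook payload into a task description, the sketch below formats an alert dictionary into the same style as the triage input shown earlier. The payload field names are hypothetical and do not reflect the actual PagerDuty or Opsgenie schemas, and the step that submits the task to an API is deliberately left out.

```python
import json


def build_triage_task(alert: dict) -> str:
    """Format an alert payload as a triage task description.

    Field names (service, region, status_code, error_rate_pct, endpoints)
    are assumptions for illustration; map them from your alerting tool.
    """
    endpoints = ", ".join(alert.get("endpoints", [])) or "unknown"
    summary = (
        f"INCIDENT ALERT: Service '{alert['service']}' in {alert['region']} "
        f"is returning {alert['status_code']} errors at "
        f"{alert['error_rate_pct']}% rate. Affected endpoints: {endpoints}."
    )
    # Include the raw payload so the agent can use fields the summary omits.
    raw = json.dumps(alert, indent=2, sort_keys=True)
    return f"{summary}\n\nRaw alert payload:\n{raw}"
```

Keeping the raw JSON alongside the summary mirrors the advice above to paste the alert payload directly into the task description.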

How accurate is the severity classification?

The triage agent follows your SLA definitions and severity matrix. Include your specific P1-P4 criteria in the system prompt for best results. In testing, classification accuracy reaches 90%+ when the prompt includes explicit severity thresholds.
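One way to include explicit severity thresholds is to keep the matrix as data and render it into the system prompt. The criteria below are placeholders invented for illustration; replace them with your organization's own P1-P4 definitions.

```python
# Placeholder criteria; substitute your own severity matrix.
SEVERITY_MATRIX = {
    "P1": "Customer-facing outage or revenue-impacting failure",
    "P2": "Degraded customer-facing service with a workaround",
    "P3": "Internal tooling impaired, no customer impact",
    "P4": "Cosmetic or low-urgency issue",
}


def build_triage_system_prompt(base_prompt: str, matrix: dict[str, str]) -> str:
    """Append the severity criteria to the base triage prompt."""
    lines = [f"- {level}: {criteria}" for level, criteria in sorted(matrix.items())]
    return f"{base_prompt}\n\nSeverity criteria:\n" + "\n".join(lines)
```

Storing the matrix separately from the prompt text means a change to your SLA definitions needs only a data update.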

What about sensitive incident data?

Your incident data is processed through your own API keys and is not stored by Ivern beyond the task execution window. For organizations with strict data policies, put data processing agreements in place with your model providers, or configure agents to process only sanitized incident summaries.
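A sanitized summary can be produced with a redaction pass before the text reaches any agent. The sketch below masks email addresses and IPv4 addresses only; it is a starting point under that narrow assumption, not a complete data loss prevention solution, and the placeholder tokens are arbitrary.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize(text: str) -> str:
    """Mask email addresses and IPv4 addresses in incident text."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return IPV4_RE.sub("[REDACTED_IP]", text)
```

Three-part version strings such as v3.14.2 are left intact because the IPv4 pattern requires four numeric groups. Hostnames, tokens, and customer identifiers would need additional patterns.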

Can I customize the post-mortem template?

Absolutely. The Writer agent follows whatever template structure you define in the prompt. If your organization uses a specific post-mortem format (e.g., Google's SRE template, GitLab's format), include it in the system prompt and the agent will adhere to it.

How does this work for multi-service incidents?

For complex incidents affecting multiple services, feed the combined alert data and timeline into the triage agent. The agent maps dependencies between affected services and produces a unified triage assessment. For very complex incidents, consider upgrading the Triage Agent to GPT-4.1 for deeper analysis.


Sign up at ivern.ai/signup to build your IT ops agent squad. Your first tasks are free -- enough to generate 15 runbooks, post-mortems, or incident triages at no cost.
