AI & Machine Learning · Technical Course · LearnAspire Certified

AI Agents for Platform Engineers: Deploy a Production-Grade Alert-Triage Agent with OpenAI Function-Calling, Slack Bolt, and PagerDuty

“Build the agent that triages alerts while you sleep.”

Build and deploy a stateful single-agent workflow that classifies Kubernetes alerts, decides between auto-remediation and escalation, enforces human approval gates, and writes every decision to a DynamoDB audit log.

Intermediate · 11 h · 6 modules · 47 slides · 18 exercises · 24 quiz questions · Verified Mar 2026

Learning Outcomes

What you'll learn

→ You will be able to write a structured system prompt for an infrastructure-triage agent that constrains gpt-4o to classify alert severity and choose between exactly two actions (kubectl rollout restart or PagerDuty incident creation) without hallucinating service names outside a validated allowlist.
→ You will be able to define and register OpenAI function-calling tool schemas in Python that expose kubectl and PagerDuty operations to gpt-4o, validate every tool call argument against a Kubernetes namespace service registry before execution, and raise a ToolArgumentValidationError with the exact rejected payload when validation fails.
→ You will be able to implement a stateful LangGraph agent loop that receives a Slack Bolt event, runs a multi-step reason-then-act cycle, persists intermediate state to a Python dict checkpoint, and pauses the loop until a Slack interactive-button approval arrives, so that no destructive tool call executes without it.
→ You will be able to diagnose the three most common production failure modes (gpt-4o returning a hallucinated service name, a PagerDuty Events API v2 429 rate-limit response, and a Slack Bolt acknowledgement timeout), identify the exact line of Python that raises each error, and apply a mitigation pattern for each without disabling the audit trail.
→ You will be able to write every agent decision, including the raw gpt-4o response JSON, tool call arguments, human approval outcome, and final action result, to a DynamoDB table using boto3 1.34 with a schema that satisfies a SOC 2 Type II auditor's requirement for tamper-evident, timestamped decision records.
→ You will be able to deploy your complete triage agent to a GitHub Actions ubuntu-22.04 runner, pass a pytest 8.2 test suite that covers the approval gate, the hallucination guard, and the DynamoDB write, and produce an architecture decision record plus a one-page compliance summary explaining exactly what the agent cannot decide autonomously and why.
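As an illustration of the tool-schema and hallucination-guard outcomes above, here is a minimal sketch, not the course's actual code. The schema follows the shape of the OpenAI Chat Completions `tools` parameter. The registry contents, the tool name `kubectl_rollout_restart`, and the helper `validate_tool_call` are hypothetical names chosen for this example; only `ToolArgumentValidationError` is named in the outcomes, and its constructor here is an assumption.

```python
import json

# Hypothetical registry: namespace -> deployments the agent may restart.
SERVICE_REGISTRY = {
    "payments": {"checkout-api", "ledger-worker"},
    "platform": {"ingress-gateway"},
}

# Tool schema in the shape expected by the Chat Completions `tools` parameter.
ROLLOUT_RESTART_TOOL = {
    "type": "function",
    "function": {
        "name": "kubectl_rollout_restart",  # hypothetical tool name
        "description": "Restart a Deployment in an allowlisted namespace.",
        "parameters": {
            "type": "object",
            "properties": {
                "namespace": {"type": "string"},
                "deployment": {"type": "string"},
            },
            "required": ["namespace", "deployment"],
            "additionalProperties": False,
        },
    },
}


class ToolArgumentValidationError(ValueError):
    """Raised when a tool call falls outside the registry; keeps the payload."""

    def __init__(self, tool_name, payload, reason):
        self.tool_name = tool_name
        self.payload = payload
        rejected = json.dumps(payload, sort_keys=True)
        super().__init__(f"{tool_name} rejected {rejected}: {reason}")


def validate_tool_call(tool_name, raw_arguments):
    """Parse the model's JSON argument string and check it against the registry."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ToolArgumentValidationError(
            tool_name, {"raw": raw_arguments}, "arguments are not valid JSON"
        ) from exc
    if not isinstance(args, dict):
        raise ToolArgumentValidationError(
            tool_name, {"raw": raw_arguments}, "arguments must be a JSON object"
        )
    namespace = args.get("namespace")
    deployment = args.get("deployment")
    if not isinstance(namespace, str) or namespace not in SERVICE_REGISTRY:
        raise ToolArgumentValidationError(tool_name, args, "unknown namespace")
    if not isinstance(deployment, str) or deployment not in SERVICE_REGISTRY[namespace]:
        raise ToolArgumentValidationError(tool_name, args, "deployment not in allowlist")
    return args
```

The validator runs before any kubectl or PagerDuty call, so a hallucinated service name surfaces as an exception carrying the rejected payload rather than as a command against the cluster.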

The day after you finish

The day after completing this course, you will be able to open your deployed GitHub repository and point your Slack Bolt app at a live Meridian-style PagerDuty alert firing in your team's Slack channel. You will watch gpt-4o classify the alert severity and propose either a kubectl rollout restart or a PagerDuty incident, then click Approve in the Slack thread to authorise the action. Finally, you will open DynamoDB in the AWS console and show your Tech Lead the timestamped audit record of every decision the agent made, including the raw LLM response, without having written a single line of new code that morning.

Who this is for

  • Primary: Platform Engineers and SREs with 3–7 years of experience who write Python, manage Kubernetes clusters, and spend 10+ hours per week manually triaging alerts in Slack
  • Secondary: IT Ops Engineers who own incident response pipelines and want to automate first-response decisions without replacing PagerDuty or Slack
  • Tertiary: Engineering Managers and Tech Leads who need to evaluate whether an AI agent proposal from their team is production-safe and audit-compliant

Prerequisites

  • Can write and debug Python 3.x, including async functions and decorators
  • Has called at least one REST API from Python using requests or httpx, and understands status codes, headers, and JSON payloads
  • Has used Slack, PagerDuty, or GitHub Actions in a real on-call or deployment workflow
  • Knows what a Kubernetes pod is and has run kubectl get pods or kubectl describe pod
  • Has used ChatGPT or the OpenAI Playground, and knows what a prompt is and what a completion looks like

Curriculum

6 modules · full breakdown

Part of: AI Engineering Path

Step 1: Foundations
→ Step 2: Core Skills
→ Step 3: RAG
→ Step 4: LangGraph RAG
→ Step 5: Agent Systems
→ Step 6: Production
→ Step 7: MCP
→ Step 8: Enterprise

Capstone Project

Meridian Alert-Triage Agent: Deploy, Harden, and Document a Production-Ready Single-Agent Workflow

You will deploy the complete Meridian triage agent (a Python 3.12 application using OpenAI gpt-4o function-calling, Slack Bolt 1.18, PagerDuty Events API v2, LangGraph 0.1, and boto3 1.34) to a GitHub Actions ubuntu-22.04 runner. The agent must receive a realistic CrashLoopBackOff alert payload via Slack Bolt, classify it using a structured system prompt with a validated service allowlist, present an interactive Approve/Escalate button in the Slack thread, execute kubectl rollout restart or POST to PagerDuty only after human approval, and write the full decision record to DynamoDB. You will also write a pytest 8.2 test suite with at least five passing tests covering the hallucination guard, approval gate bypass prevention, DynamoDB write schema, PagerDuty 429 retry logic, and Slack acknowledgement timeout handling.
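The PagerDuty 429 retry logic mentioned above could look roughly like the following sketch, which uses exponential backoff. This is one possible mitigation pattern and not necessarily the one the course teaches. The function `post_with_backoff`, the exception `RateLimitExhaustedError`, and the injected `send` callable are all hypothetical; `send` stands in for whatever code performs the real HTTP POST, so the retry behaviour can be tested without network access.

```python
import time


class RateLimitExhaustedError(RuntimeError):
    """Raised when every attempt was answered with HTTP 429."""


def post_with_backoff(send, payload, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send(payload) until it stops returning 429 or attempts run out.

    `send` is a hypothetical callable that performs the POST and returns
    the HTTP status code. Returns (status, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        status = send(payload)
        if status != 429:
            return status, attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RateLimitExhaustedError(
        f"still rate-limited after {max_attempts} attempts"
    )
```

Injecting `sleep` as a parameter lets a pytest case record the delays instead of waiting for them, which keeps a retry test fast on a CI runner.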

What you'll deliver

A public GitHub repository containing: (1) working deployed agent code with all dependencies pinned in requirements.txt, (2) a pytest test suite with five passing tests and a CI badge, (3) an AWS IAM policy JSON granting least-privilege access to DynamoDB and the Kubernetes API, (4) an architecture decision record (ADR) in Markdown explaining the hallucination mitigation strategy and its known failure modes, and (5) a one-page plain-English compliance summary describing what the agent logs, what it cannot act on autonomously, and how the human approval gate satisfies the SOC 2 change-management control.
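One way to make the logged decision record tamper-evident is to chain each record to the previous one with a SHA-256 hash. The sketch below is an assumption about how that could be done, not the course's confirmed schema. The field names, `GENESIS_HASH`, `build_audit_record`, and `verify_chain` are all hypothetical. The DynamoDB write itself is left out so the example runs without AWS credentials; each returned dict contains only strings and could be passed as the `Item` to boto3's `Table.put_item`.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # hypothetical marker for the first record in a chain


def _digest(body):
    """Hash a canonical JSON encoding so key order cannot change the result."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_record(prev_hash, alert_id, llm_response_json, tool_call_json,
                       approval, result, now=None):
    """Return one decision record linked to its predecessor by prev_hash."""
    record = {
        "alert_id": alert_id,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "llm_response": llm_response_json,  # raw JSON string from the model
        "tool_call": tool_call_json,        # raw JSON string of the arguments
        "approval": approval,               # e.g. "approved" or "escalated"
        "result": result,
        "prev_hash": prev_hash,
    }
    record["record_hash"] = _digest(record)
    return record


def verify_chain(records):
    """Recompute every hash in order; any edited or reordered record fails."""
    prev = GENESIS_HASH
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        if rec["prev_hash"] != prev or _digest(body) != rec["record_hash"]:
            return False
        prev = rec["record_hash"]
    return True
```

A hash chain only shows that stored records were altered after the fact; it does not prevent alteration, so an auditor would likely still expect restricted write permissions on the table.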