AI Agents for Platform Engineers: Deploy a Production-Grade Alert-Triage Agent with OpenAI Function-Calling, Slack Bolt, and PagerDuty
"Build the agent that triages alerts while you sleep."
Build and deploy a stateful single-agent workflow that classifies Kubernetes alerts, decides between auto-remediation and escalation, enforces human approval gates, and writes every decision to a DynamoDB audit log.
One-time · Lifetime access · Certificate included
- 6 modules of content
- 47 concept slides
- 18 practical exercises
- 24 quiz questions
- Capstone project
- LearnAspire certificate
Learning Outcomes
What you'll learn
The day after you finish
The day after completing this course, you will be able to open your deployed GitHub repository and point your Slack Bolt app at a live Meridian-style PagerDuty alert firing in your team's Slack channel. You will watch gpt-4o classify the alert severity and propose either a kubectl rollout restart or a PagerDuty incident, then click Approve in the Slack thread to authorise the action. Finally, you will open DynamoDB in the AWS console and show your Tech Lead the timestamped audit record of every decision the agent made, including the raw LLM response, without having written a single line of new code that morning.
Who this is for
- Primary: Platform Engineers and SREs with 3 to 7 years of experience who write Python, manage Kubernetes clusters, and spend 10+ hours per week manually triaging alerts in Slack
- Secondary: IT Ops Engineers who own incident response pipelines and want to automate first-response decisions without replacing PagerDuty or Slack
- Tertiary: Engineering Managers and Tech Leads who need to evaluate whether an AI agent proposal from their team is production-safe and audit-compliant
Prerequisites
- Can write and debug Python 3.x including async functions and decorators in context
- Has called at least one REST API from Python using requests or httpx, and understands status codes, headers, and JSON payloads
- Has used Slack, PagerDuty, or GitHub Actions in a real on-call or deployment workflow
- Knows what a Kubernetes pod is and has run kubectl get pods or kubectl describe pod
- Has used ChatGPT or the OpenAI Playground, and knows what a prompt is and what a completion looks like
Curriculum
6 modules · full breakdown
Part of: AI Engineering Path
Capstone Project
Meridian Alert-Triage Agent: Deploy, Harden, and Document a Production-Ready Single-Agent Workflow
You will deploy the complete Meridian triage agent (a Python 3.12 application using OpenAI gpt-4o function-calling, Slack Bolt 1.18, PagerDuty Events API v2, LangGraph 0.1, and boto3 1.34) to a GitHub Actions ubuntu-22.04 runner. The agent must receive a realistic CrashLoopBackOff alert payload via Slack Bolt, classify it using a structured system prompt with a validated service allowlist, present interactive Approve and Escalate buttons in the Slack thread, execute kubectl rollout restart or POST to PagerDuty only after human approval, and write the full decision record to DynamoDB. You will also write a pytest 8.2 test suite with at least five passing tests covering the hallucination guard, approval gate bypass prevention, DynamoDB write schema, PagerDuty 429 retry logic, and Slack acknowledgement timeout handling.
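To make the hallucination guard and approval gate concrete, here is a minimal sketch of how the two checks might fit together. It is not the course's actual implementation: the allowlist contents, the `ProposedAction` shape, the `GuardError` class, and the function names are all hypothetical, and the tool-call arguments are assumed to arrive as an already-parsed dict.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical allowlist; a real deployment would load this from config.
SERVICE_ALLOWLIST = frozenset({"checkout-api", "payments-worker"})
ALLOWED_ACTIONS = frozenset({"rollout_restart", "escalate"})


class GuardError(ValueError):
    """Raised when a model-proposed action fails validation."""


@dataclass(frozen=True)
class ProposedAction:
    action: str
    service: str
    namespace: str


def validate_proposal(tool_args: dict) -> ProposedAction:
    """Hallucination guard: reject any action or service the model invented."""
    action = tool_args.get("action")
    service = tool_args.get("service")
    namespace = tool_args.get("namespace", "default")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise GuardError(f"unknown action: {action!r}")
    if not isinstance(service, str) or service not in SERVICE_ALLOWLIST:
        raise GuardError(f"service not in allowlist: {service!r}")
    if not isinstance(namespace, str) or not namespace:
        raise GuardError(f"invalid namespace: {namespace!r}")
    return ProposedAction(action, service, namespace)


def build_command(proposal: ProposedAction, approved_by: Optional[str]) -> List[str]:
    """Approval gate: only build the kubectl argv once a human has approved."""
    if not approved_by:
        raise PermissionError("human approval required before execution")
    if proposal.action != "rollout_restart":
        raise GuardError("only rollout_restart maps to a kubectl command")
    # Returned as an argv list so it can be passed to subprocess without a shell.
    return [
        "kubectl", "rollout", "restart",
        f"deployment/{proposal.service}",
        "-n", proposal.namespace,
    ]
```

Validating against a fixed allowlist before any approval button is shown means a service name the model made up never reaches a human, and returning an argv list rather than a shell string keeps model output out of shell interpolation.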
What you'll deliver
A public GitHub repository containing: (1) working deployed agent code with all dependencies pinned in requirements.txt, (2) a pytest test suite with five passing tests and a CI badge, (3) an AWS IAM policy JSON granting least-privilege access to DynamoDB and the Kubernetes API, (4) an architecture decision record (ADR) in Markdown explaining the hallucination mitigation strategy and its known failure modes, and (5) a one-page plain-English compliance summary describing what the agent logs, what it cannot act on autonomously, and how the human approval gate satisfies the SOC 2 change-management control.