The Journey of a Cortex XSOAR Playbook: Designing Your First Playbook
In this blog
- Introduction
- Why NGFW HA upgrades? (because they're harder than they look)
- Who benefits from automation and why
- Thinking in XSOAR: Mapping the manual process
- The high-level architecture
- The pre-flight and post-flight pattern
- Beyond the firewall: End-to-end application validation
- Navigating CAB and CURB: Designing for governance
- Designing for separation of concerns
- The time savings are real
- Panorama version compatibility: A hard gate you can't ignore
- Maintenance window scheduling and change freeze awareness
- Lab validation: the gate before production
- SLA Metrics: Measuring what actually matters
- Conclusion
- Download
Introduction
In the first installment of this series, we explored the theory and conceptual foundation of Cortex XSOAR playbook development. I talked about how every operation in XSOAR revolves around three core components: Incidents, Indicators, and Playbooks, and why real mastery of automation begins not with the playbook editor, but with a shift in how you think about operational workflows. This is the shift that separates engineers who build scripts for one-off use cases from engineers who build complete workflows.
Now it's time to put those principles into practice.
In this post, I'll take my favorite network operations task, upgrading a High Availability (HA) pair of Palo Alto Networks Next-Generation Firewalls, the task every network engineer loves to dread, and walk through the design process of turning that manual runbook into an automated XSOAR playbook. We're not going to jump straight into the playbook editor. Instead, we'll focus on the architectural decisions that separate a fragile, one-off script from a scalable, production-grade automation. I've learned these lessons the hard way, watching deployments in the field, and I'll share those insights here.
By the end of this post, you'll have a clear blueprint for your playbook: one that accounts for HA state awareness, pre-flight and post-flight validation, and the sub-playbook patterns that make your automation maintainable over time. You'll learn the patterns I use when building production playbooks for customers.
Why NGFW HA upgrades? (because they're harder than they look)
If you've spent any time in network operations, you know that firewall upgrades are one of those tasks that are simultaneously routine and terrifying. The process itself is well-documented: wake up at 2:00 AM, download the image, update the content package, check HA state, upgrade the passive node, validate, failover, upgrade the other node, validate again, and then hope no one calls you on your weekend about an outage. But the execution is where things get interesting. I've done enough of these by hand to know all the ways they can go sideways.
Consider what a typical HA NGFW upgrade looks like when done manually:
- Pre-flight checks: Verify the HA pair is healthy. Confirm which node is active and which is passive. Capture a baseline snapshot of the device's operational state: routing tables, BGP peers, interface status and session counts.
- Software staging: Download the target PAN-OS image to the device (and any intermediary versions if you're jumping across major releases).
- Passive node upgrade: Install the software on the passive node. Reboot it. Wait for it to come back online. Verify it rejoined the HA cluster.
- Failover: Trigger a controlled failover to make the freshly upgraded node active. Confirm traffic is flowing.
- Second node upgrade: Repeat the install and reboot on the now-passive original active node.
- Post-flight validation: Compare the operational state against the pre-flight baseline. Are all BGP peers up? Are session counts within expected ranges? Did anything break?
- HA topology restoration: Optionally fail back to the original active/passive arrangement if your environment requires a specific topology.
Each of those steps has decision points, error conditions, and dependencies on the previous step. A network engineer executing this manually might spend 45 minutes to an hour per HA pair, and that is when everything goes smoothly. Multiply that across a fleet of firewalls during a maintenance window, and you start to understand why this is a perfect candidate for XSOAR automation.
Who benefits from automation and why
- SOC Analyst / Network Engineer: You're the one doing these upgrades at 2 AM. This playbook takes the repetitive, error-prone steps off your plate so you can focus on the exceptions that need human judgment. Instead of manually comparing CLI outputs across two devices, you get structured validation that tells you exactly what changed.
- SOC Manager / Team Lead: Consistency becomes automatic. Every upgrade follows the same process, produces the same evidence, and catches the same failure modes regardless of which engineer runs it. That means fewer missed steps, fewer rollbacks, and a team that can handle more upgrades in the same maintenance window.
- CISO / CTO: Automated upgrades with built-in assurance testing reduce the risk of human error during maintenance windows, accelerate your patching cadence to close vulnerability gaps faster, and produce machine-generated evidence that satisfies your change management requirements without additional manual effort.
Thinking in XSOAR: Mapping the manual process
Before we open the playbook editor, let's map the manual upgrade workflow to XSOAR's native constructs. In my experience, this design phase makes the difference between a playbook that works and a playbook that works reliably. This is the shift in perspective we discussed in Theory and Concepts, and it's the most important step in the entire design process.
The incident: Your upgrade request
In XSOAR, an Incident is the trigger: the reason an investigation or workflow begins. For our upgrade scenario, the incident represents a specific upgrade request: "Upgrade firewall pair FW-DC1-A/FW-DC1-B to PAN-OS 12.1.3-h3."
This incident carries all the contextual data your playbook needs to execute:
- Target device: The firewall hostname or serial number
- Peer device: The HA partner (if any)
- Target version: The desired PAN-OS version
- Panorama instance: Which Panorama server manages these firewalls
Notice how these map directly to the playbook inputs we'll define later. The incident isn't just a ticket. It's the data contract between the operator and the automation.
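That data contract can be sketched as a small structure. The field names here are illustrative, not XSOAR's actual incident field names; in a real playbook these would be custom incident fields read by an automation (for example via `demisto.incident()`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class UpgradeRequest:
    """Illustrative data contract for an HA upgrade incident.

    In XSOAR these would be custom incident fields; the names here
    are hypothetical, chosen to mirror the inputs listed above.
    """
    target_device: str                  # hostname or serial number
    target_version: str                 # desired PAN-OS version, e.g. "12.1.3-h3"
    panorama_instance: str              # which Panorama integration instance to use
    peer_device: Optional[str] = None   # HA partner, if any

    def is_ha_pair(self) -> bool:
        return self.peer_device is not None
```

A single-device upgrade simply omits `peer_device`, and the playbook can branch on `is_ha_pair()` during initialization.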
The indicators: your device inventory
Indicators in XSOAR aren't just for threat intelligence. They're any enrichable, trackable entities in your environment. In a network operations context, each firewall becomes an indicator: a "Network Device" type with properties like hostname, serial number, current PAN-OS version, HA state, and management IP.
This is a subtle but powerful concept. By modeling your firewalls as indicators, you get:
- Automatic enrichment: XSOAR can pull the device's current state from Panorama whenever the indicator is referenced.
- Relationship mapping: Link an indicator to its HA peer, its Panorama server, or any CVEs affecting its current software version.
- Historical tracking: Every upgrade, every state change, every assurance test is recorded against the indicator's timeline.
The playbook: Your automated runbook
The Playbook is where the logic lives. But here's the key insight from Theory and Concepts: a good playbook isn't a single monolithic script. It is an orchestration layer that calls reusable sub-playbooks, each responsible for one well-defined task.
For our HA upgrade, the top-level playbook orchestrates the overall flow while delegating the actual work to specialized sub-playbooks. This is the architecture I use in production, and it's the pattern I'd recommend for anything you build to run at scale:
| Sub-Playbook | Responsibility | DRY Principle |
| --- | --- | --- |
| Get HA Pair Status | Query both devices, determine active vs. passive | Reusable any time HA state matters |
| Take Snapshot | Capture operational state (routes, peers, sessions) as JSON baseline | Same logic for pre-flight and post-flight |
| Download Software | Stage the PAN-OS image on the target device | Works for any version, any device |
| Install Software | Execute the install and reboot sequence | Handles single or HA scenarios |
| Upgrade Assurance | Run pre/post snapshot comparison and validation tests | Reusable across upgrade types |
| Post-Upgrade Validation | Compare snapshots and surface differences for review | Decoupled from upgrade itself |
This table should look familiar if you recall the DRY (Don't Repeat Yourself) principle from Theory and Concepts. Each sub-playbook is a reusable building block. The "Take Snapshot" sub-playbook, for example, doesn't care whether it is being called as a pre-flight check or a post-flight validation. It simply captures the device's operational state. I've seen teams ignore this principle and regret it when they need to maintain parallel versions of the same logic. The context of when and why it runs is determined by the parent playbook.
The high-level architecture
Let's sketch the overall flow of our HA upgrade playbook. Think of this as the architectural blueprint: the "10,000-foot view" before we start building individual components. When I design a production playbook, I always start here, with the full workflow mapped out, before touching the editor. It's far easier to redraw the flow of tasks and confirm your thinking in Visio, Lucid, or on a conference-room whiteboard than it is inside the editor.
Phase 1: Initialization
The playbook begins by establishing context. It takes the incident's input fields (target device, target version, Panorama instance) and uses them to set up the execution environment. This includes:
- Associating the device indicator to the incident (so the incident's War Room has full device context)
- Setting the Panorama integration instance (determining which Panorama server to communicate through)
- Linking to a parent incident if this upgrade is part of a larger batch operation
This initialization phase is deceptively important. In a production environment, you might be running dozens of upgrades simultaneously as part of a maintenance window. Each upgrade incident needs to know exactly which Panorama instance to talk to and whether it is part of a coordinated batch.
Phase 2: Software selection and path calculation
PAN-OS upgrades are a little easier with the new Skip Upgrade feature, but they still aren't always a straight line from version A to version B. If you're moving between major releases (say, 9.1 to 12.1), you may need to install intermediary versions along the way. The playbook calls the Filter and Select Available Software Images sub-playbook, which:
- Queries the device for available software images
- Filters them based on the target version and the current version
- Calculates the required upgrade path (including any intermediate versions)
- Sets the upgrade path as an incident field for downstream consumption
If no valid upgrade path can be determined (perhaps the target version is lower than the current version, or the images are not available) the playbook surfaces an error and stops.
Fail early, fail clearly.
I've seen too many midnight texts because a playbook silently skipped validation steps.
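A minimal sketch of that path calculation. The required base releases below are hypothetical placeholders; a real playbook would read the vendor-supported upgrade sequence from Panorama or an XSOAR list rather than hard-coding it:

```python
def parse_version(version: str) -> tuple:
    """Turn '10.1.4-h2' into a sortable tuple like (10, 1, 4)."""
    base = version.split("-")[0]          # drop any hotfix suffix
    return tuple(int(part) for part in base.split("."))

# Hypothetical stepping stones between major trains. A production
# playbook would pull the supported sequence from an XSOAR list or
# Panorama instead of hard-coding it here.
REQUIRED_BASE_RELEASES = ["10.1.0", "11.2.0"]

def upgrade_path(current: str, target: str) -> list:
    """Return the ordered hops from current to target, failing early
    and clearly when no valid path exists."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt <= cur:
        raise ValueError(f"Target {target} is not newer than current {current}")
    hops = [base for base in REQUIRED_BASE_RELEASES
            if cur < parse_version(base) < tgt]
    return hops + [target]
```

The raised `ValueError` is the "fail early, fail clearly" behavior: the playbook surfaces the error and stops before any software is staged.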
Phase 3: The upgrade loop
Here's where the design gets interesting, and where I see the biggest mistakes. Rather than hard-coding a single upgrade step, the playbook uses a Device Upgrade Loop sub-playbook that can be called iteratively. If the upgrade path requires multiple hops (e.g., 9.1.4 → 10.1.0 → 11.2.0 → 12.1.3-h3), the loop handles each step in sequence.
The looping pattern is a powerful XSOAR technique. By encapsulating a single upgrade step as a sub-playbook and looping over the upgrade path, you get a design that handles both simple (single hop) and complex (multi-hop) upgrades with the same code. This is DRY in action: one sub-playbook, many execution paths. When I built this pattern for the first time, it cut down on testing and maintenance dramatically.
Within each iteration of the loop, the sub-playbook:
- Determines HA state (which device is currently active/passive)
- Takes a pre-flight snapshot of both devices
- Downloads the software image to the target device
- Installs the software on the passive node first
- Waits for the passive node to rejoin the cluster
- Triggers a controlled failover
- Installs the software on the now-passive (previously active) node
- Takes a post-flight snapshot and runs comparison tests
- Optionally, restores the original HA topology
The critical design decision here is HA awareness. The loop doesn't just blindly upgrade. It understands the relationship between the two devices and ensures that traffic always has a path. This is the difference between automation that works in a lab and automation you trust in production.
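The loop's ordering guarantee can be sketched as follows. Here `execute` stands in for the sub-playbook tasks that call integration commands; only the sequencing is modeled, because the sequencing (passive node first, controlled failover, then the former active node) is what makes the loop HA-aware:

```python
def run_upgrade_loop(upgrade_path, active, passive, execute):
    """Drive the Device Upgrade Loop: one HA-aware pass per hop.

    `execute(step, device, version)` is a stand-in for the sub-playbook
    tasks and returns True on success. Only the ordering is modeled,
    because the ordering is the guarantee that traffic always has a path.
    """
    log = []
    for version in upgrade_path:
        for step, device in [
            ("pre_snapshot", active), ("pre_snapshot", passive),
            ("download_install", passive),   # passive node upgrades first
            ("wait_ha_rejoin", passive),
            ("failover", active),            # upgraded node takes traffic
            ("download_install", active),    # former active, now passive
            ("wait_ha_rejoin", active),
            ("post_snapshot", active), ("post_snapshot", passive),
        ]:
            if not execute(step, device, version):
                # Fail early, fail clearly: never start the next hop
                # with a half-upgraded pair.
                raise RuntimeError(f"{step} failed on {device} at {version}")
            log.append((version, step, device))
    return log
```

In XSOAR itself the iteration is configured on the looping sub-playbook task rather than written as a `for` loop, but the sequencing contract is the same.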
The pre-flight and post-flight pattern
If there's one pattern that separates amateur automation from production-grade workflows, it's the pre-flight/post-flight validation pattern. The concept is straightforward: capture the state of the world before you make a change, make the change, then capture the state again and compare.
What we capture
The operational snapshot is a JSON document that records the device's current state across multiple dimensions:
- Routing table: All active routes, including BGP learned routes
- BGP Peer Status: Peer state, prefix counts, uptime for each BGP neighbor
- Interface Status: Link state, IP assignments, and error counters
- Session Counts: Active session table statistics
- HA State: Current HA role, peer connectivity, synchronization status
- System Resource Utilization: CPU, memory, and disk usage
This snapshot becomes the baseline. After the upgrade, a second snapshot is taken using the exact same sub-playbook (there's that DRY principle again), and an assurance test compares the two.
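An illustrative shape for that snapshot document; the actual field names depend on the integration commands used to collect each dimension:

```python
import json

# Illustrative snapshot shape; real field names come from the
# integration commands used to collect each dimension.
snapshot = {
    "device": "FW-DC1-A",
    "taken_at": "2025-06-07T02:05:00Z",
    "routes": {"total": 1842, "bgp_learned": 1633},
    "bgp_peers": [
        {"peer": "10.0.0.2", "state": "Established",
         "prefixes_received": 812, "uptime_sec": 904311},
    ],
    "interfaces": [
        {"name": "ethernet1/1", "state": "up",
         "ip": "10.0.0.1/30", "errors": 0},
    ],
    "sessions": {"active": 48210, "tcp": 39004, "udp": 8731},
    "ha": {"role": "active", "peer_reachable": True, "synced": True},
    "resources": {"cpu_pct": 21, "mem_pct": 54, "disk_pct": 38},
}

# Serialized once and attached to the War Room as the baseline artifact.
baseline_json = json.dumps(snapshot, indent=2)
```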
The assurance test
The comparison isn't just a diff. The assurance sub-playbook runs structured tests:
- Are all BGP peers that were up before the upgrade still up? If a peer dropped, that is flagged immediately.
- Are route counts within the expected variance? A small change might be normal; a large drop is a red flag.
- Are all the interfaces that were up still up? A link going down post-upgrade could indicate a compatibility issue.
- LACP Status? If LACP is configured, are both links operational?
- Is the HA cluster healthy? Both nodes should be communicating and synchronized.
- User-ID Agent Status? If a User-ID agent is configured, confirm it is still connected so the user-based sections of the security policy continue to function.
If any of these tests fail, the playbook can pause for human review, generate an alert, or (in more mature implementations) trigger automated rollback procedures. The key is that the operator doesn't have to manually compare CLI outputs; the automation surfaces exactly what changed and whether that change is expected. This is where I've seen automation really earn its place in production operations.
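A sketch of those structured tests, assuming a snapshot shape covering the dimensions described above (BGP peers, route counts, interfaces, HA state); the variance threshold would normally be a playbook input rather than a hard-coded default:

```python
def run_assurance(pre: dict, post: dict, route_variance_pct: int = 5) -> list:
    """Compare pre/post snapshots and return a list of failure strings.

    An empty list means the upgrade passed; anything else should pause
    the playbook for human review. The snapshot shape is illustrative.
    """
    failures = []

    # Every BGP peer that was Established before must still be Established.
    pre_peers = {p["peer"] for p in pre["bgp_peers"] if p["state"] == "Established"}
    post_peers = {p["peer"] for p in post["bgp_peers"] if p["state"] == "Established"}
    for peer in sorted(pre_peers - post_peers):
        failures.append(f"BGP peer {peer} was up pre-upgrade but is down post-upgrade")

    # Route counts must stay within the allowed variance.
    r_pre, r_post = pre["routes"]["total"], post["routes"]["total"]
    if abs(r_post - r_pre) * 100 > route_variance_pct * r_pre:
        failures.append(f"Route count moved {r_pre} -> {r_post}, "
                        f"outside {route_variance_pct}% variance")

    # Every interface that was up must still be up.
    pre_up = {i["name"] for i in pre["interfaces"] if i["state"] == "up"}
    post_up = {i["name"] for i in post["interfaces"] if i["state"] == "up"}
    for name in sorted(pre_up - post_up):
        failures.append(f"Interface {name} is down post-upgrade")

    # The HA cluster must be communicating and synchronized.
    if not (post["ha"]["peer_reachable"] and post["ha"]["synced"]):
        failures.append("HA cluster is not healthy post-upgrade")

    return failures
```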
Beyond the firewall: End-to-end application validation
The firewall-level checks described above answer a critical question: Is the device healthy? But they don't answer the question that business stakeholders actually care about, the one I always hear at 2:00 AM on maintenance night: are the applications still working?
A production-grade upgrade playbook should include an external application validation layer that tests actual traffic flows through the firewall. This means designing a sub-playbook that can execute synthetic checks (HTTP requests to critical application endpoints, DNS resolution tests, or API calls to monitoring platforms) and capture the results as structured data alongside the firewall snapshots.
Consider what this looks like in practice. Before the upgrade, the playbook calls an "Application Health Check" sub-playbook that iterates over a list of critical application endpoints defined in the incident fields. For each endpoint, it records the HTTP response code, latency, and any TLS certificate details. After the upgrade, the same sub-playbook runs again, and the assurance test compares both sets of results. If an application that returned HTTP 200 in 45 milliseconds pre-upgrade is now returning 503 or timing out, that's an immediate red flag, even if every BGP peer and interface on the firewall looks healthy. I learned to build this layer after a "successful" firewall upgrade that broke a critical application for half an hour due to a functional change in how TLS Decryption worked.
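The post-upgrade comparison for that layer might look like the following sketch; the result shape, the 3x latency threshold, and the endpoint URL are all illustrative assumptions:

```python
def compare_app_health(pre: list, post: list, latency_factor: float = 3.0) -> list:
    """Flag application regressions between pre/post synthetic checks.

    Each result is a dict like {"endpoint", "status", "latency_ms"};
    the shape and the 3x latency threshold are illustrative choices.
    """
    failures = []
    post_by_endpoint = {r["endpoint"]: r for r in post}
    for before in pre:
        after = post_by_endpoint.get(before["endpoint"])
        if after is None:
            failures.append(f"{before['endpoint']}: no post-upgrade result")
        elif after["status"] != before["status"]:
            failures.append(f"{before['endpoint']}: "
                            f"HTTP {before['status']} -> {after['status']}")
        elif after["latency_ms"] > before["latency_ms"] * latency_factor:
            failures.append(f"{before['endpoint']}: latency "
                            f"{before['latency_ms']}ms -> {after['latency_ms']}ms")
    return failures
```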
This two-layer validation approach (firewall state plus application health) gives you a complete picture. It also produces the kind of evidence that operational stakeholders and change review boards need to see, which we'll discuss next.
Navigating CAB and CURB: Designing for governance
If you've worked in enterprise environments, you know that the technical upgrade process is only half the battle. The other half is organizational: getting the change approved through your Change Advisory Board (CAB) and, in many organizations, satisfying a Change and Upgrade Review Board (CURB) that governs infrastructure modifications.
These governance bodies exist for good reason: they prevent uncoordinated changes from causing outages. But they also introduce friction that automation must account for, not ignore. In my implementation work, I've learned that customers don't just want faster upgrades; they want upgrades that their governance teams trust. A well-designed XSOAR playbook doesn't just execute the technical steps; it produces the artifacts that satisfy governance requirements and builds an auditable evidence trail in the War Room.
The CAB challenge
CAB reviewers typically want to see three things before approving a firewall upgrade: a risk assessment that identifies what could go wrong, a rollback plan that explains how you'll recover if it does, and evidence that the change was tested or validated in a controlled manner. The challenge is that most teams produce these artifacts manually (Word documents, spreadsheets, and screenshots pasted into change tickets), which is time-consuming and inconsistent across engineers. I've watched customers spend days assembling change packages that a well-designed playbook should produce in seconds.
XSOAR changes this dynamic. When your upgrade playbook captures pre-flight snapshots, runs assurance tests, and logs every action to the War Room, you are automatically generating the evidence that CAB needs. The pre-flight snapshot becomes your baseline risk assessment. The assurance test results become your validation evidence. The War Room timeline becomes your audit trail. Instead of an engineer spending an hour assembling a change ticket, the playbook produces machine-generated, timestamped proof of every step. That's when governance stops being friction and starts being a feature of your playbook design.
The CURB challenge
CURB boards often layer additional requirements on top of CAB, particularly around version governance and fleet consistency. Common CURB concerns include ensuring that target PAN-OS versions have been vetted against known CVEs, that the upgrade path follows vendor-supported sequences, and that the organization maintains a consistent software baseline across device groups. These are exactly the kinds of checks that belong in the playbook's initialization phase: the software path calculation we discussed earlier can be extended to validate against an approved version matrix stored as an XSOAR list or integration lookup. Every time I've seen a playbook fail CURB review, it's because these checks weren't built in from the start.
From a CISO or CTO perspective, this is where automation pays for itself in ways that don't show up on a simple time-savings spreadsheet. Consistent, machine-generated change evidence means your team spends less time preparing for audits. Automated version governance means you can verify that every device in your fleet is running an approved software version without relying on manual inventory checks. And the structured assurance data gives you confidence that upgrades aren't introducing risk, which means you can approve a faster patching cadence and close vulnerability windows sooner.
Designing the playbook for governance
When you design your upgrade playbook with governance in mind from the start, several architectural decisions follow naturally. First, every sub-playbook should write structured results to the War Room: not just success or failure, but the actual data: snapshot JSON files, assurance comparison tables, and application health check results. These War Room entries become downloadable artifacts that can be attached to change tickets in ServiceNow, Jira, or whatever ITSM platform your organization uses. I make this a requirement in every production playbook I design.
Second, the manual confirmation gates we'll build in Parts 3 and 4 serve double duty: they're both operational safety controls and governance checkpoints. When an operator approves a failover, that approval is recorded with a timestamp and username in the incident context. That's audit evidence that a human reviewed and authorized the change at each critical stage. This dual purpose means you're not adding governance overhead; you're baking it into the operational workflow.
Third, consider building a "Change Evidence Package" sub-playbook that runs at the end of the upgrade workflow. This sub-playbook collects all War Room artifacts (pre-flight snapshots, post-flight comparisons, application validation results, operator approvals) and assembles them into a single summary that can be pushed to your ITSM integration. This turns the XSOAR incident into a self-documenting change record. In my field experience, this is where playbooks go from being useful automation to being transformational tools that change how operations teams work.
Designing for separation of concerns
Before we wrap up the design phase, let's talk about a principle that will save you significant headaches as your automation scales: separation of concerns.
In our HA upgrade playbook, notice how the responsibilities are cleanly divided. When I review customer playbooks, I see the problems that arise when you blur these lines:
- The top-level playbook knows what needs to happen and in what order. It doesn't know how to query HA state or take a snapshot.
- The sub-playbooks know how to do their specific task. The "Get HA Pair Status" sub-playbook knows how to call pan-os-platform-get-ha-state and interpret the results, but it doesn't care why it was called.
- The integrations (Panorama, PAN-OS Device Management) handle the API-level communication. The playbooks never make raw API calls. They use integration commands.
This three-layer architecture (orchestration, task execution, and integration) mirrors how well-designed software systems work. It means you can update the HA detection logic without touching the upgrade flow, or swap out the integration for a different management platform without rewriting your sub-playbooks. That separation of concerns is what separates a playbook you can hand off from a playbook you'll be debugging three years from now.
The time savings are real
Let's talk about the economics of this automation. A manual HA pair upgrade takes 45 to 60 minutes of focused engineer time per pair, assuming nothing goes wrong. The engineer is constantly comparing CLI outputs, watching for failures, and making sure each step completes before moving to the next. It's not intellectually demanding, but it's attention-intensive and error-prone when you're running it at 2:00 AM on your fourth upgrade of the night.
With XSOAR automation, the engineer kicks off the playbook and monitors progress. Hands-on time drops to roughly 10 to 15 minutes of oversight per pair. For a fleet of 20 HA pairs during a quarterly maintenance window, that is the difference between 20+ hours of manual work and 3 to 4 hours of supervised automation. That's time your engineers can spend on actual problem-solving instead of button-clicking.
But the real savings compound when you factor in the documentation. Manual upgrades require engineers to capture screenshots, paste CLI output into change tickets, and write up post-change summaries. The playbook generates all of this automatically as a byproduct of execution. The War Room snapshots, assurance test results, and approval timeline are all captured and timestamped without any additional effort. That's another 2 to 3 hours per maintenance window that you don't have to spend assembling evidence for your change management system.
For organizations that struggle to keep their firewall fleet on supported PAN-OS versions because upgrades are too labor-intensive, this automation changes the math entirely. When you can upgrade a fleet of 20 firewalls in a single maintenance window instead of requiring two or three, your security posture improves. You close vulnerability windows faster. Your compliance team stops asking when you're going to patch that critical CVE. And your network engineering team stops viewing upgrade windows as a grueling all-nighter and starts viewing them as routine operational tasks.
Panorama version compatibility: A hard gate you can't ignore
One of the most dangerous gaps in a naive upgrade playbook is the assumption that any PAN-OS target version is fair game as long as the image is available. It isn't. Panorama must always run a software version equal to or greater than the managed firewalls it supervises. Upgrading a managed firewall to a version newer than Panorama is an unsupported configuration that can break push operations, template synchronization, and log collection, sometimes silently.
This check belongs in the playbook's initialization phase, before any software path calculation begins. The playbook should query the Panorama version via the integration, compare it against the requested target version, and treat a version mismatch as a hard stop. Not a warning. If Panorama is running 11.0 and the operator requests an upgrade to 12.1, the playbook should surface a clear, actionable error: "Panorama must be upgraded to 12.1 before managed device upgrades can proceed." I've seen this exact scenario cause hours of confusion during customer maintenance windows because the relationship between Panorama and device software versions wasn't validated upfront.
For organizations managing multiple Panorama instances across regions, this check becomes a cross-instance validation. The playbook should verify which Panorama instance manages the target device and query that specific instance's version. An XSOAR list or integration lookup can maintain a compatibility matrix that maps acceptable device version ranges per Panorama version, giving you a single source of truth that the playbook consults at runtime rather than relying on tribal knowledge in an engineer's head.
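The gate itself is a simple version comparison. This sketch ignores hotfix suffixes for brevity and raises the kind of actionable error described above:

```python
def check_panorama_compatibility(panorama_version: str, target_version: str) -> None:
    """Hard gate: Panorama must run a version >= the device target.

    Raises ValueError with an actionable message; the playbook treats
    this as a hard stop, not a warning.
    """
    def key(version: str) -> tuple:
        # Ignore hotfix suffixes like "-h3" for the comparison.
        return tuple(int(part) for part in version.split("-")[0].split("."))

    if key(target_version) > key(panorama_version):
        base = target_version.split("-")[0]
        raise ValueError(
            f"Panorama is on {panorama_version}; it must be upgraded to "
            f"{base} or later before managed device upgrades to "
            f"{target_version} can proceed")
```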
Maintenance window scheduling and change freeze awareness
A technically perfect upgrade playbook can still cause an outage if it runs outside an approved maintenance window or during a change freeze. Enterprise organizations maintain formal schedules for when infrastructure changes are permitted, and those schedules exist precisely to protect business-critical operations during peak traffic periods, quarter-end processing, or regulatory reporting cycles. Automation that ignores these schedules is automation that loses its operator's trust the moment it fires at the wrong time.
The initialization phase is the right place to enforce this. The playbook should check a change freeze flag (stored as an XSOAR list entry or queried from your ITSM integration) before proceeding. If a freeze is active, the playbook halts immediately and notifies the operator with a clear reason. Similarly, if your organization enforces specific maintenance windows (say, Saturdays between 00:00 and 06:00 local time), the playbook can validate the current timestamp against the allowed window and refuse to proceed outside it.
For organizations using ServiceNow or Jira Service Management, the approved maintenance window is often already encoded in the change request itself. The XSOAR integration can read the approved start and end times directly from the change ticket and use those as the execution boundary. This closes the loop between your change management process and your automation: the playbook only runs when the change record says it should, and that constraint is enforced programmatically rather than relying on operators to remember the schedule at 1:00 AM.
There's a related concern worth building into your design: upgrade duration estimation. For a fleet upgrade, the parent playbook can calculate estimated completion time based on historical upgrade durations stored against the device indicator timeline. If the estimate exceeds the approved window, the playbook can alert before starting rather than discovering mid-fleet that the window will close before all devices are complete. That early warning gives operators the chance to scope the batch appropriately rather than leaving half the fleet in a mid-upgrade state when the window closes.
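The estimation itself can be as simple as a mean over historical per-pair durations with a safety buffer; the 20% buffer below is an arbitrary illustrative default:

```python
def fits_in_window(history_minutes, pairs_remaining, window_minutes,
                   buffer_pct=20):
    """Estimate whether the remaining HA pairs fit the approved window.

    Uses the mean of historical per-pair durations (e.g. read from the
    device indicator timeline) plus a safety buffer; the 20% default
    is an illustrative choice.
    """
    if not history_minutes:
        return True  # no history yet; leave the decision to the operator
    mean = sum(history_minutes) / len(history_minutes)
    estimate = mean * pairs_remaining * (1 + buffer_pct / 100)
    return estimate <= window_minutes
```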
Lab validation: the gate before production
CURB boards at mature organizations don't just want to know that an upgrade was completed successfully in production. They want to know that the target version was validated in a controlled environment first. This is standard practice in software development; no responsible team ships code directly from a developer's laptop to production, and it should be standard practice in network operations as well.
Design a lab validation gate into the initialization phase. Before a production upgrade incident can be created (or before the upgrade loop begins), the playbook checks an XSOAR list or custom field for a lab-validated flag against the target PAN-OS version. If the version hasn't been marked as lab-validated, the playbook halts and directs the operator to complete lab testing first. The lab validation record itself can be a separate XSOAR incident type, a lightweight workflow that runs the same upgrade playbook against lab devices, records the assurance results, and upon successful completion sets the version's validation status in the shared list.
This pattern serves multiple purposes. It enforces a disciplined promotion process from lab to production. It creates an auditable record that the target version was tested before the production change window. And it gives your CURB board exactly what they're asking for: documented evidence that the change was validated in a controlled environment before touching production infrastructure. When your playbook can point to a timestamped lab validation incident as part of the production change evidence package, the governance review becomes a formality rather than a gatekeeping exercise.
SLA Metrics: Measuring what actually matters
Time savings and risk reduction are compelling arguments for automation, but they're difficult to defend in a budget conversation without data to back them up. This is where SLA metrics become essential, not as an afterthought, but as a deliberate design element of the upgrade workflow. When you instrument your playbook correctly, every upgrade incident becomes a data point that builds a measurable operational track record over time.
The metrics worth capturing fall into three categories. The first is upgrade duration: total elapsed time from incident creation to completion, time spent in each phase (initialization, software staging, passive upgrade, failover, second upgrade, assurance), and how the actual duration compares to the approved maintenance window. These numbers let you refine window-sizing estimates and identify which phase is consuming the most time in your environment, providing insight that directs optimization effort to where it matters.
The second category is assurance outcomes: what percentage of upgrades pass the post-flight assurance test on the first attempt, how often the playbook pauses for human review versus completing without intervention, and which specific assurance checks fail most frequently. If BGP peer count variance failures show up disproportionately in a particular site or device group, that's a signal worth investigating long before it becomes a production incident.
The third category is fleet hygiene: the percentage of managed devices running an approved PAN-OS version at any given time, mean time from CVE publication to completed remediation across the fleet, and the distribution of software versions across device groups. This is the data your CISO needs to answer the board's question about patch currency, and it should come from your XSOAR dashboards automatically rather than from a manual audit that happens once a quarter.
From a design perspective, capturing these metrics is straightforward if you plan for it from the start. Set incident fields for phase start and end timestamps at each transition in the upgrade loop. Write assurance results as structured data rather than free text. Store upgrade outcomes against the device indicator so the timeline accumulates historically. The fleet dashboard we'll discuss in a later blog aggregates these incident-level metrics into organization-level views, but that dashboard is only as useful as the data the playbook feeds it.
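As a sketch, the per-phase duration calculation from those transition timestamps might look like:

```python
from datetime import datetime

def phase_durations(transitions):
    """Compute per-phase durations in seconds from an ordered list of
    (phase_name, ISO-8601 timestamp) transitions recorded as incident
    fields. The final entry marks completion and has no duration."""
    parsed = [(phase, datetime.fromisoformat(ts)) for phase, ts in transitions]
    return {
        phase: (parsed[i + 1][1] - parsed[i][1]).total_seconds()
        for i, (phase, _) in enumerate(parsed[:-1])
    }
```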
Design the measurement now, and the reporting comes for free later.
Conclusion
At this point, we've got a complete architectural blueprint for our HA upgrade playbook. We know the XSOAR constructs we'll use (incidents for upgrade requests, indicators for device inventory), the sub-playbook decomposition (HA state, snapshots, software staging, install, assurance), and the key design patterns (the upgrade loop, pre-flight/post-flight validation, separation of concerns). This is the foundation every production playbook should rest on.
In our next part, we'll roll up our sleeves and start building. We'll walk through constructing the HA detection sub-playbook step by step, wire up the snapshot logic that powers our pre-flight and post-flight checks, and see how XSOAR's context data and conditional branching bring our design to life in the playbook editor. This is where the design becomes real, and where you'll see why planning ahead saves you hours of rework.