The AI-Native Engineer Framework: Three Questions, Nine Scenarios
PREVIOUSLY IN THIS SERIES
Part 1 framed AI-native engineering as a role shift: orchestrating production rather than writing line by line. Part 2 showed how the six leading platforms have converged on capability; differentiation now comes from how teams extend them, not which one they pick. Part 3 made the case that extensibility (skills, rules, MCP and plugins) is where the productivity multiplier actually lives. Part 4 turns from the strategic to the tactical: when you have a task in front of you, how do you decide which tool to reach for?
Before any specific scenario, the right tool falls out of three questions about the task at hand. Once these become reflexive, most tool decisions take seconds, not meetings.
The three questions of tool selection
- Scope. Is this a single line, a single file, multiple files in one project or multiple repos? Modern tools span more of this range than they used to, but each still has a sweet spot. A single-line fix doesn't need Devin; a cross-repo migration doesn't belong in inline completion.
- Reasoning depth. Does the task need pattern matching, synthesis or architecture? Pattern matching is autocompleting boilerplate. Synthesis is multi-file refactoring. Architecture is designing a new system or making a hard trade-off. Tools handle the first two well; humans still own the third.
- Autonomy level. Are you supervising every keystroke, delegating with checkpoints or running unattended? Autonomy and trust go together. New problems and high-stakes systems demand supervision. Well-scoped, reversible work can run autonomously.
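One way to make the three questions reflexive is to write them down as data. The sketch below is illustrative only: the enum names, the `Task` type and the routing rules in `suggest_mode` are assumptions that compress the guidance above, not a rule set taken from any tool or vendor.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    SINGLE_LINE = 1
    SINGLE_FILE = 2
    SINGLE_PROJECT = 3
    MULTI_REPO = 4


class Reasoning(Enum):
    PATTERN_MATCHING = 1
    SYNTHESIS = 2
    ARCHITECTURE = 3


class Autonomy(Enum):
    SUPERVISED = 1
    DELEGATED = 2
    UNATTENDED = 3


@dataclass(frozen=True)
class Task:
    scope: Scope
    reasoning: Reasoning
    autonomy: Autonomy


def suggest_mode(task: Task) -> str:
    """Map the three answers to a way of working (hypothetical routing rules)."""
    # Architecture stays with the human regardless of scope or autonomy.
    if task.reasoning is Reasoning.ARCHITECTURE:
        return "human-led design, AI as thinking partner"
    # Anything not supervised keystroke by keystroke is delegated work.
    if task.autonomy is not Autonomy.SUPERVISED:
        return "delegated agent with checkpoints or a schedule"
    # Small, pattern-matching edits are the inline-completion sweet spot.
    small = task.scope in (Scope.SINGLE_LINE, Scope.SINGLE_FILE)
    if small and task.reasoning is Reasoning.PATTERN_MATCHING:
        return "inline completion in your IDE"
    return "supervised multi-file or chat mode"


quick_fix = Task(Scope.SINGLE_LINE, Reasoning.PATTERN_MATCHING, Autonomy.SUPERVISED)
print(suggest_mode(quick_fix))
```

The ordering of the checks is the design choice worth noticing: reasoning depth is tested first, so an architectural task never gets routed to an autonomous agent just because it happens to span many repos.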
The convergence of tools is most visible in the highest-autonomy work. Eighteen months ago, "multi-repo, autonomous" meant Devin. Today, Cursor Cloud Agents run in isolated VMs, Copilot's Coding Agent runs in GitHub Actions, Claude Code Dynamic Workflows orchestrate hundreds of parallel subagents, Google Jules opens PRs from a GitHub issue, and Devin Desktop runs the same Devin agent through its Agent Command Center. The choice is now about ecosystem fit, integration posture, cost predictability and how the work integrates with your existing flow, not about which tool can do it.
The scenario playbook
The nine scenarios below cover most of what an AI-native engineer does in a given week. The table is the fast lookup; the cards beneath it give the full tool list, the three-question read and the failure mode to watch for in each case.
| Scenario | Reach for | Why |
| --- | --- | --- |
| Familiar-code feature | Any IDE: Cursor, Copilot, Devin Desktop, Codex, Antigravity | Inline completion is commodity; ergonomics and rules decide |
| Multi-file refactor | Your IDE's multi-file mode | Coordinated multi-file edits are table-stakes across all of them |
| Multi-repo migration | Devin Desktop, Claude Code Dynamic Workflows, Cursor Cloud Agents, Copilot Coding Agent, Jules | Well-scoped, mechanical, parallelizable; delegate it |
| Unfamiliar code or domain | Claude Code, Codex CLI, Antigravity CLI; Codemaps / Code Wiki for navigation | On-demand expertise; visual maps for fast orientation |
| Routine maintenance | Claude Code /loop, Devin Desktop, Cursor Cloud Agents, Copilot Coding Agent, Jules | Declarative, scheduled, compounding ops win |
| Test creation | Cursor, Copilot or Devin Desktop in your daily flow | Closes the dev-to-test gap; unlocks weekly cadence |
| Heavy synthesis | Claude Code Dynamic Workflows, Codex CLI Goal Mode, Antigravity CLI | Long-context reasoning over a large corpus |
| Production debugging | Claude Code read-only + observability via MCP | High-skepticism reasoning grounded in real signals |
| Greenfield architecture | Your judgment; AI as thinking partner | AI is weakest here; the decision stays human |
Writing a new feature in familiar code
SCOPE: Single line / file · REASONING: Pattern matching · AUTONOMY: Supervised
Reach for: Whatever IDE you already use: Cursor, GitHub Copilot, Devin Desktop, Codex IDE or Antigravity.
Why: You know the codebase. Inline assistance is the bread-and-butter case here. The five leading IDE-embedded options now ship comparable completion quality at this level, and the differences are marginal. Pick the one whose ergonomics fit your team and invest your energy in rules and conventions that make it specifically good for your codebase.
Watch out for: Don't let inline acceptance rate become the metric; it measures typing saved, not cycle time. The leverage is in the rules that make completions match your conventions, not in the raw completion feature.
Refactoring across multiple files in a single project
SCOPE: Single project · REASONING: Synthesis · AUTONOMY: Supervised
Reach for: Whatever IDE you already use, in its multi-file mode: Cursor Composer 2.5, Copilot Edits, Devin Desktop with Devin Local, Claude Code or Antigravity's agent-first IDE.
Why: This used to require switching tools. It doesn't anymore. All of them handle coordinated multi-file edits well. Cursor and Devin Desktop give you a visual diff-and-approve workflow inside the editor. Claude Code gives you a delegation model where you describe the change and review the result. Antigravity defaults to agents planning and executing with checkpoints. Choose based on how you want to work, not on capability gaps that no longer exist.
Watch out for: Review the diff as a whole, not file by file. A multi-file agent can apply the wrong abstraction consistently across every file it touches, which makes the mistake harder to spot in any single diff.
Migrating frameworks or upgrading dependencies across many repos
SCOPE: Multi-repo · REASONING: Synthesis · AUTONOMY: Autonomous, delegated
Reach for: Devin Desktop with cloud handoff, Claude Code with Dynamic Workflows and a migration skill, Cursor Cloud Agents, Copilot's Coding Agent at scale, Google Jules for GitHub-native shops or OpenHands for self-hosted execution.
Why: This was the canonical Devin scenario. It's no longer Devin's exclusive territory, and Devin itself is now part of a tighter vertical stack as Devin Desktop, which remains the strongest single-vendor option when you want planning, IDE work and autonomous execution under one contract (especially with the FedRAMP/HIPAA/ITAR coverage Cognition ships). Claude Code with Dynamic Workflows running migration skills across hundreds of parallel subagents is the strongest path when you want cost predictability and direct control over the prompts. Cursor Cloud Agents, Copilot's Coding Agent and Google Jules all work the same territory from their respective ecosystems. OpenHands is the option when the agents need to stay inside your security boundary. The proof points are reproducible: 28-repository upgrades completed in four hours against a four-week estimate; 18-month ETL migrations compressed into weeks at organizations like Nubank.
Watch out for: Only delegate when the change pattern is well-understood and the surface area for unexpected breakage is small. Pilot on one repo, confirm the pattern holds, then fan out; don't point an autonomous agent at fifty repos on faith.
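The pilot-then-fan-out advice can be sketched as a small orchestration wrapper. This is a minimal illustration under stated assumptions: `migrate` and `verify` are hypothetical callables standing in for whatever agent invocation and CI check your platform provides, and nothing here reflects a specific vendor's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def _attempt(repo: str,
             migrate: Callable[[str], None],
             verify: Callable[[str], bool]) -> bool:
    """Run one migration and report whether it verified; errors count as failure."""
    try:
        migrate(repo)
        return verify(repo)
    except Exception:
        return False


def staged_rollout(repos: Sequence[str],
                   migrate: Callable[[str], None],
                   verify: Callable[[str], bool],
                   max_workers: int = 4) -> dict[str, bool]:
    """Migrate the first repo as a pilot; fan out only if the pilot verifies."""
    if not repos:
        return {}
    pilot, rest = repos[0], list(repos[1:])
    if not _attempt(pilot, migrate, verify):
        # The pattern did not hold: stop before touching any other repo.
        return {pilot: False}
    results = {pilot: True}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = pool.map(lambda repo: _attempt(repo, migrate, verify), rest)
        for repo, ok in zip(rest, outcomes):
            results[repo] = ok
    return results
```

The important property is the early return: a failed pilot leaves every other repository untouched, which keeps the blast radius at one repo while the change pattern is still unproven.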
Working in unfamiliar code or an unfamiliar domain
SCOPE: Single project to multi-repo · REASONING: Synthesis · AUTONOMY: Supervised
Reach for: Claude Code, Codex CLI or Antigravity CLI for terminal-driven exploration; Cursor, Devin Desktop or Antigravity in chat mode for in-editor exploration, supplemented by Devin Desktop's Codemaps or Google's Code Wiki for visual code navigation.
Why: This is the killer use case for AI-native engineering, and the one that most changes the talent equation. When you don't know the codebase or the domain, the AI provides on-demand expertise. Claude Code excels at "explain this system to me" and "what would happen if I changed X" reasoning, especially when paired with a skill encoding your codebase's patterns. Codex CLI with GPT-5.5 is now genuinely competitive on terminal-native exploration. Devin Desktop's Codemaps (AI-annotated visual representations of a codebase) are uniquely valuable for onboarding into large legacy monorepos. Google's Code Wiki (codewiki.google) auto-generates a Gemini-powered, always-current wiki for any public GitHub repository, with architecture, class and sequence diagrams plus an integrated chat agent, which is invaluable when evaluating an open-source dependency or learning a new framework. The empirical finding from many WWT engagements: a generalist with the right tooling and a well-built skills library delivers specialist-quality work, which is what makes 8x acceleration on framework migrations possible.
Watch out for: Treat the agent's explanation as a starting map, not ground truth. Verify its account of the system against the actual code before you make changes on the strength of it; confident explanations of unfamiliar code are exactly where hallucinations hide.
Routine maintenance and patching
SCOPE: Single project to multi-repo · REASONING: Pattern matching to synthesis · AUTONOMY: Autonomous, scheduled
Reach for: Claude Code with /loop scheduled jobs and maintenance skills, Devin Desktop with cloud handoff, Cursor Cloud Agents, Copilot's Coding Agent in GitHub Actions or Google Jules for ticket-to-PR flows.
Why: The autonomous platforms still earn their cost here, but they're no longer the only option. Claude Code's /loop turns the agent into a declarative background worker: cron-like scheduled tasks for PR reviews, dependency updates, certificate rotations and deployment monitoring. Cursor Cloud Agents, Copilot's Coding Agent and Google Jules give you the same "runs without me" pattern from inside their respective ecosystems. Devin Desktop's Agent Command Center remains the strongest end-to-end ticket-to-PR option, with cloud handoff baked into the default UX. The 99.6% efficiency gain on Oracle Exadata patching came from exactly this kind of automation.
Watch out for: Scheduled agents fail silently if nobody's watching. Wire their results into a channel a human actually reads, and keep write scope conservative; an unattended agent with broad permissions is the highest-blast-radius setup in this list.
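The "fail silently" risk is mostly a reporting problem, and a thin wrapper around each scheduled job addresses it. The sketch below assumes two hypothetical callables: `job`, which runs the scheduled agent task and returns a short summary, and `notify`, which posts a message to whatever channel your team actually reads. Neither corresponds to a real tool's API.

```python
from typing import Callable


def run_and_report(name: str,
                   job: Callable[[], str],
                   notify: Callable[[str], None]) -> bool:
    """Run a scheduled job and always report the outcome, success or failure."""
    try:
        summary = job()
    except Exception as exc:
        # A crash is reported rather than swallowed by the scheduler.
        notify(f"[FAILED] {name}: {exc!r}")
        return False
    notify(f"[OK] {name}: {summary}")
    return True
```

Reporting successes as well as failures is deliberate: when the channel goes quiet, the silence itself tells you the scheduler has stopped running.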
Test creation alongside development
SCOPE: Single file to project · REASONING: Synthesis · AUTONOMY: Supervised
Reach for: Cursor, GitHub Copilot or Devin Desktop, whichever IDE is in your daily flow.
Why: Test scaffolding is a sweet spot for IDE-embedded tools, especially with rules that encode your test conventions. The cadence-transformation story (releases shifted from monthly to weekly) usually traces back to closing the gap between development and test creation.
Watch out for: AI-generated tests often assert on the implementation rather than the behavior, which produces brittle suites that break on every refactor. Review what each test is actually proving, not just that the suite is green.
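The difference between asserting on implementation and asserting on behavior is easier to see side by side. The `Pricing` class and both tests below are invented for illustration; they use only the standard library's `unittest.mock`.

```python
from unittest import mock


class Pricing:
    def _round_cents(self, value: float) -> float:
        return round(value, 2)

    def apply_discount(self, price: float, percent: float) -> float:
        return self._round_cents(price * (1 - percent / 100))


def test_coupled_to_implementation() -> None:
    # Brittle: this only proves a private helper was called. It stays green
    # even though the patched helper returns a wrong value, and it breaks
    # as soon as a refactor inlines or renames _round_cents.
    with mock.patch.object(Pricing, "_round_cents", return_value=0.0) as helper:
        Pricing().apply_discount(100.0, 10.0)
    helper.assert_called_once()


def test_behavior() -> None:
    # Robust: this checks the observable result and survives any refactor
    # that keeps the pricing rule intact.
    pricing = Pricing()
    assert pricing.apply_discount(100.0, 10.0) == 90.0
    assert pricing.apply_discount(19.99, 0.0) == 19.99
```

Both tests pass today, which is the point: a green suite does not distinguish them. The review question is which one would still pass after a harmless refactor and still fail after a real pricing bug.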
Heavy synthesis from documents, transcripts or threads
SCOPE: Any, large context · REASONING: Synthesis · AUTONOMY: Delegated
Reach for: Claude Code with Dynamic Workflows, Codex CLI with GPT-5.5 in Goal Mode or Antigravity CLI with Gemini 3 Pro for very long single-pass inputs.
Why: When the task is "read 200 pages of partner documentation and produce a structured requirements list," the right tool is one with strong reasoning over long context and the ability to load supporting skills. Claude Code is the leading option, especially with Dynamic Workflows dividing the corpus across hundreds of coordinated subagents. Codex CLI in Goal Mode runs for hours autonomously. Gemini's million-token context window via Antigravity CLI is the option for very long single-pass inputs. A financial services engagement processed 200+ requirements in a single continuous session, producing structured outputs that flowed straight into the development backlog.
Watch out for: Long-context output reads as authoritative even when it quietly drops or invents a detail. Spot-check the synthesis against the source material before it flows downstream into a backlog or a decision.
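Part of that spot-check can be mechanized if you ask the agent to attach a verbatim supporting quote to each extracted item. The sketch below assumes that convention; the `Claim` type and `unsupported_claims` function are hypothetical names, not part of any tool.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str   # the synthesized requirement
    quote: str  # verbatim excerpt the agent says supports it


def _normalize(text: str) -> str:
    """Collapse whitespace and case so line wrapping does not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_claims(claims: list[Claim], source: str) -> list[Claim]:
    """Return claims whose quote is missing or does not appear in the source."""
    haystack = _normalize(source)
    flagged = []
    for claim in claims:
        quote = _normalize(claim.quote)
        if not quote or quote not in haystack:
            flagged.append(claim)
    return flagged
```

This only catches invented or missing citations. A quote that exists but has been misread will pass, so the flagged list narrows the human review rather than replacing it.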
Production debugging and incident response
SCOPE: Single project, cross-system signals · REASONING: Heavy reasoning · AUTONOMY: Supervised, read-only
Reach for: Claude Code in read-only mode, supplemented by your existing observability tools and MCP integrations.
Why: When systems are misbehaving, you want a tool with high reasoning quality and strong skepticism, one that asks for evidence and considers alternatives, not one that confidently produces plausible-sounding code. Claude Code with read-only MCP connections to your logs, metrics and traces is the strongest current setup; Opus 4.8 was specifically tuned to flag uncertainty and is roughly four times less likely than its predecessor to let flaws in its own code pass unremarked.
Watch out for: Treat AI-suggested fixes as hypotheses, not solutions, until you've verified them. Keep the agent read-only against production; the blast radius of a confident wrong fix shipped fast is larger than the time the agent saved you.
Greenfield architectural design
SCOPE: Any · REASONING: Architecture · AUTONOMY: Human
Reach for: Your judgment, with Claude Code or Cursor as a thinking partner.
Why: This is where AI tools are weakest and human judgment is most valuable. Use AI to explore options, sanity-check reasoning and write proof-of-concept code, but the architectural decisions themselves should remain yours. The pattern that works: you sketch the design, ask the AI to argue the strongest case against it, revise and only then start implementing. AI is excellent at the second step and weak at the first.
Watch out for: Don't outsource the trade-off. Use the AI to attack your design, not to make the decision; the moment you let it choose the architecture is the moment you've handed off the one part of the job it's least equipped to do.
What's next
Part 5 closes the series with the higher-level question: across these scenarios, how do you actually assemble a stack that holds up? It covers the stack patterns that work, the anti-patterns that quietly waste tool capability and the bottom line on what separates teams getting compound returns from teams just producing more code.