engineering
MCP for iOS: a primer for app developers
Maurice Carrier
If you’ve shipped iOS apps for any length of time, you’ve watched the test-automation surface get rebuilt three or four times. XCUITest. Appium. EarlGrey. Maestro. Each one solves a real problem and inherits a real cost: a DSL to maintain, selectors that drift, a runtime that fights you on every iOS major.
Model Context Protocol — MCP — is a different shape. It’s worth understanding before you decide whether it fits your team.
What MCP is
MCP is a small open protocol, originally drafted by Anthropic, for how an LLM application talks to outside tools. A server exposes a set of typed tools, each with a name, a description, an inputSchema, and optionally an outputSchema. The client (Claude Code, Cursor, Continue, Zed, whatever) discovers those tools at connect time and lets the model call them with structured arguments.
That’s the whole thing. It’s not a framework. It’s not a runtime. It’s a contract.
The interesting consequence is that any agent that speaks MCP can drive any tool that speaks MCP, with no app-specific glue. If you write an MCP server for your build system today, every MCP-aware coding agent gets access to it for free.
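On the wire, MCP messages are JSON-RPC 2.0, and a client discovers and invokes tools with the tools/list and tools/call methods. The sketch below shows roughly what those requests look like for the build-system case. The build_app tool, its arguments, and the helper names are hypothetical, invented for illustration; only the envelope and the two method names come from the protocol.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids should be unique within a session


def rpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request dict, the envelope MCP uses."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Hypothetical tool descriptor, shaped like one entry in a tools/list result.
BUILD_TOOL = {
    "name": "build_app",
    "description": "Build the app for a given scheme and configuration.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "scheme": {"type": "string"},
            "configuration": {"type": "string"},
        },
        "required": ["scheme"],
    },
}


def missing_required(tool, arguments):
    """Return the required argument names that are absent from `arguments`."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


# Discovery request, then a call with structured arguments.
list_req = rpc_request("tools/list")
call_req = rpc_request(
    "tools/call",
    {"name": BUILD_TOOL["name"], "arguments": {"scheme": "MyApp"}},
)

if __name__ == "__main__":
    print(json.dumps(list_req))
    print(json.dumps(call_req, indent=2))
```

Nothing in the request is specific to one agent or one server, which is why the same tool definition can serve every MCP-aware client.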
Why iOS automation has been hard
iOS automation traditionally has three loops in it that don’t compose well:
The first is the selector loop. XCUITest, Appium, Maestro — all of them depend on a tree of accessibility identifiers that match the UI. The tree changes. The selectors break. You spend a non-trivial slice of every sprint keeping the test suite green against your own UI changes.
The second is the authoring loop. Most tools want you to write tests in a DSL: Maestro YAML, Appium WebDriver scripts, the EarlGrey Swift DSL. The DSL is a language your team has to learn and your tooling has to support. New engineers ramp on it. CI plugins have to understand it.
The third is the agent loop, and this is the one that’s been broken until recently. The way an LLM agent wants to interact with an app is very different from the way a deterministic test runner does. The agent doesn’t want a YAML script; it wants to observe, decide, act, observe again. It wants a vision-first description of the screen. It wants short-lived, side-effecting commands. It wants the tools shaped to its loop, not retrofitted from a script DSL.
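The observe, decide, act cycle is easier to see in code than in prose. What follows is a minimal sketch with a scripted stand-in for the device and a toy policy in place of the model; FakeApp, decide, and run_loop are all hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass
class Step:
    observation: str
    action: str


class FakeApp:
    """Scripted stand-in for a device: each action advances one screen."""

    def __init__(self, screens):
        self.screens = list(screens)
        self.index = 0

    def observe(self):
        return self.screens[self.index]

    def act(self, instruction):
        # The toy app ignores the instruction and simply moves forward.
        if self.index < len(self.screens) - 1:
            self.index += 1


def decide(observation, goal):
    """Toy policy; a real agent would consult the model at this point."""
    if goal in observation:
        return None  # goal reached, nothing left to do
    return "tap Continue"


def run_loop(app, goal, max_steps=10):
    """Observe, decide, act, repeat, bounded by max_steps."""
    history = []
    for _ in range(max_steps):
        observation = app.observe()
        action = decide(observation, goal)
        if action is None:
            return True, history
        app.act(action)
        history.append(Step(observation, action))
    return False, history


if __name__ == "__main__":
    reached, steps = run_loop(FakeApp(["Welcome", "Form", "Confirmation"]), "Confirmation")
    print(reached, [s.action for s in steps])
```

The step bound matters: an agent that cannot reach its goal should stop and report, not tap forever.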
If you bolt agent-style automation onto Appium, you spend most of your engineering effort wrapping Appium’s session model with something agent-shaped. The seams show up everywhere.
What an MCP server for iOS exposes
SimDrive ships 32 tools in 1.0.0b1. The categories tell the story:
observe — ios_observe renders the current screen and returns a vision-model description plus a structured element list. ios_screenshot returns the raw frame. ios_device_logs pulls the recent device log buffer. These are the tools the agent reaches for when it needs to see what’s happening.
act — ios_act takes a natural-language instruction (“tap the Continue button”) plus the prior observation, resolves it to a coordinate, and dispatches the touch. ios_type, ios_swipe, ios_key_press are the explicit primitives if you don’t want to go through ios_act. This is the agent’s hand on the device.
record and replay — ios_record_start and ios_record_stop capture a journey. ios_replay_run re-executes it with SSIM-based parity gates. Recordings are JSON state contracts, not opaque binary blobs — you can read them, diff them, version them. The replay path doesn’t call the VLM, so re-running a 30-step journey in CI costs $0 in AI.
journey — multi-step recorded flows with checkpoint assertions. ios_journey_run, ios_journey_list, ios_journey_assert. The unit between a single tap and an entire test suite.
device — ios_device_select, ios_device_boot, ios_device_shutdown, ios_app_launch, ios_app_terminate. Targeting and lifecycle. Real devices are reached via WDA; simulators are reached directly through Apple’s CoreSimulator.
perf — ios_perf_capture records a perf trace, ios_perf_compare diffs it against a saved baseline. Useful when “the app feels slow” is the actual ticket.
doctor — ios_doctor verifies your environment and prints the exact commands to fix anything that’s broken.
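To make the SSIM-based parity gate in the replay path concrete, here is a simplified sketch of the idea: compare a recorded frame to a replayed frame and pass only above a similarity threshold. This is an assumption-laden illustration, not SimDrive’s implementation. It computes a single global SSIM window over flat grayscale pixel lists, where production SSIM implementations typically average over sliding windows, and the 0.95 threshold is an arbitrary placeholder.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def parity_gate(recorded, replayed, threshold=0.95):
    """Pass when the replayed frame is close enough to the recording.

    The threshold is a hypothetical default chosen for this sketch.
    """
    return global_ssim(recorded, replayed) >= threshold


if __name__ == "__main__":
    frame = [10, 50, 200, 120]
    inverted = [255 - p for p in frame]
    print(parity_gate(frame, frame), parity_gate(frame, inverted))
```

A gate like this needs no model call, which is consistent with replay being cheap to run in CI.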
The split between observe and act matters. A traditional test framework collapses them — tapButton("Continue") finds the button and taps it in one call. An agent-shaped framework keeps them separate, because the agent needs to reason between observation and action. The agent looks at the screen, decides what to do, names the target, and only then takes the action. That extra step is what makes the agent loop tolerate UI drift that breaks selector-based tests.
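A toy illustration of why resolving the target against a fresh observation tolerates drift: the instruction names the target, and the coordinate is computed from whatever the latest observation says. Everything here is hypothetical, including the Element shape and the substring matching; the source describes ios_act as resolving instructions with the help of a vision model, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical entry in an observation's structured element list."""

    label: str
    x: float
    y: float
    width: float
    height: float

    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def resolve_target(instruction, elements):
    """Return the center of the element named in the instruction, or None.

    Matching is a case-insensitive substring test, preferring the longest
    label so that "Sign Up" beats a shorter overlapping label.
    """
    text = instruction.lower()
    matches = [e for e in elements if e.label and e.label.lower() in text]
    if not matches:
        return None
    return max(matches, key=lambda e: len(e.label)).center()


if __name__ == "__main__":
    before = [Element("Continue", 40, 700, 300, 44)]
    after = [Element("Continue", 40, 500, 300, 44)]  # button moved up
    print(resolve_target("tap the Continue button", before))
    print(resolve_target("tap the Continue button", after))
```

The same instruction resolves to a new coordinate after the layout changes, with nothing to update by hand.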
A small worked example
Suppose you want to walk a sign-up flow from launch to confirmation. With SimDrive, the agent does roughly this:
    ios_device_select(model="iPhone 17", runtime="latest")
    ios_app_launch(bundle_id="com.example.app")
    ios_observe() → "Welcome screen with Sign Up and Sign In"
    ios_act("tap Sign Up")
    ios_observe() → "Sign-up form, email field focused"
    ios_type(text="user@example.com")
    ios_act("tap Continue")
    ios_observe() → "Password screen"
    ios_type(text="Hunter2!Hunter2!")
    ios_act("tap Continue")
    ios_observe() → "Email confirmation requested screen"
You don’t write that. The agent writes it, against your natural-language prompt. If you save the run with ios_record_start ahead of time, you get a JSON journey you can replay deterministically forever after.
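Because a saved journey is plain JSON, ordinary text tooling applies to it. The sketch below diffs two versions of a journey with Python’s standard library. The journey layout (a steps list with tool and instruction keys) is invented for illustration; the actual recording schema is not described here and may look quite different.

```python
import difflib
import json


def canonical(journey):
    """Serialize with sorted keys so diffs reflect content, not key order."""
    return json.dumps(journey, indent=2, sort_keys=True).splitlines()


def diff_journeys(old, new):
    """Return unified-diff lines between two journey dicts."""
    return list(
        difflib.unified_diff(
            canonical(old),
            canonical(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical journey structure, for illustration only.
    old = {"steps": [{"tool": "ios_act", "instruction": "tap Sign Up"}]}
    new = {"steps": [{"tool": "ios_act", "instruction": "tap Register"}]}
    print("\n".join(diff_journeys(old, new)))
```

Sorting keys before diffing keeps a code review focused on the step that changed.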
Where MCP isn’t the right shape
MCP isn’t a universal answer. If your team’s testing posture is “ten thousand human-authored deterministic tests gated on every PR,” a DSL-based framework probably suits you better, because those frameworks were designed for that workload. MCP shines when an agent is in the loop, whether it is authoring a test, reproducing a bug, or exploring an unfamiliar app.
The honest position: MCP is the lowest-friction integration for agent-driven workflows. It’s one of several reasonable choices for the broader test-automation space.
Why agent-native matters
SimDrive is one of the first MCP servers built natively for agents rather than retrofitted from an existing automation framework. That shows up in small ways everywhere: the tool descriptions are written for an LLM to read, the return shapes are described with JSON Schema, the observation tool returns rich vision-model output instead of a flat accessibility tree, and the act tool accepts natural-language targets rather than fragile selectors.
If you want to read the tool definitions directly, they’re discoverable at runtime from any MCP client, or you can read the source at github.com/SyncTek-LLC/simdrive.
To install:
    pip install simdrive
    simdrive trial start --email you@example.com
The trial is 14 days. Pricing at simdrive.dev/pricing.