UI-TARS resurfaces desktop control while Opendesk bets on accessibility-API workflows
UI-TARS resurfaced as an open-source desktop-control stack while Opendesk described using accessibility APIs and marked elements instead of raw pixel guesses. The approach makes computer-use workflows more repeatable, but it still depends on human-oriented interfaces.

TL;DR
- heyrimsha's demo thread resurfaced UI-TARS as an open-source desktop-control stack that can watch the screen, click, type, browse, and use local files through a full desktop session.
- In metalvendetta's Opendesk write-up, the core reliability trick is to query native accessibility APIs first, then send the model a screenshot with numbered marks on real UI elements instead of asking it to guess raw coordinates.
- The same Opendesk post also adds two workflow layers that most flashy computer-use demos skip: record-and-replay task learning and scheduled desktop runs.
- The tension is right in the open: bentossell asked why agents are still driving human websites through browsers, while marckohlbrugge replied that the bigger opportunity is building software for agents directly.
UI-TARS Desktop
You can browse the UI-TARS Desktop repo, jump to Opendesk on GitHub, and watch heyrimsha's video demo drive the desktop like a remote operator. That mix of an open-source control stack and visible end-to-end execution is catnip for anyone building agent workflows outside the browser.
According to heyrimsha's thread, UI-TARS can see desktop apps, click buttons, type into fields, browse the web, and work with local files. The clip matters because it frames computer use as a laptop-native workflow, not just a browser agent stuffed into a tab.
Accessibility marks
metalvendetta's Reddit breakdown describes a more structured control loop than the usual screenshot-to-pixels approach. Instead of handing the model an image and hoping it predicts the right click target, Opendesk pulls the platform's native accessibility tree first, including AppleScript on macOS, AT-SPI2 on Linux, and UI Automation on Windows.
That produces a list of interactive elements with labels and bounding boxes. Opendesk then draws numbered chips on those elements before the screenshot reaches the model, so the model reasons over marked targets and the system already knows where each target lives.
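As a rough illustration of the idea, and not Opendesk's actual code, a set-of-marks pass boils down to drawing numbered chips over the bounding boxes the accessibility API returned, then translating the model's chosen mark back into a click point. The `UIElement` shape, the Pillow drawing calls, and the function names below are assumptions made for the sketch:

```python
from dataclasses import dataclass

from PIL import Image, ImageDraw


@dataclass
class UIElement:
    label: str                       # accessibility label, e.g. "Send"
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


def draw_marks(screenshot: Image.Image, elements: list[UIElement]) -> Image.Image:
    """Overlay a numbered chip on each accessibility element so the model
    can answer with a mark id instead of guessing raw pixel coordinates."""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for i, el in enumerate(elements):
        left, top, right, bottom = el.bbox
        draw.rectangle(el.bbox, outline="red", width=2)
        draw.rectangle((left, max(top - 18, 0), left + 24, top), fill="red")
        draw.text((left + 4, max(top - 16, 0)), str(i), fill="white")
    return marked


def mark_to_point(elements: list[UIElement], mark_id: int) -> tuple[int, int]:
    """Map the model's chosen mark id back to a click point the system
    already knows, since the bounding box came from the accessibility tree."""
    left, top, right, bottom = elements[mark_id].bbox
    return ((left + right) // 2, (top + bottom) // 2)
```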
The post also spells out the failure modes this is meant to avoid:
- Retina scaling shifts pixel assumptions.
- Window moves break fixed coordinates.
- Dense layouts make nearby targets easy to confuse.
- Small UI changes can invalidate a screenshot-only plan.
Mouse coordinates still exist, but in the Opendesk explanation they are a fallback for unlabeled surfaces such as canvases, video players, and games.
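In pseudocode terms, that split could look like the dispatch below; the function name and the shape of the `action` dict are invented for illustration, not taken from the project:

```python
def resolve_click_target(
    action: dict,
    marked_boxes: dict[int, tuple[int, int, int, int]],
) -> tuple[int, int]:
    """Prefer a marked accessibility element; fall back to raw pixels only
    when the target is an unlabeled surface (canvas, video player, game)."""
    mark_id = action.get("mark_id")
    if mark_id is not None and mark_id in marked_boxes:
        left, top, right, bottom = marked_boxes[mark_id]
        return ((left + right) // 2, (top + bottom) // 2)
    # No marked element to anchor on, so trust the model's raw coordinates.
    return (action["x"], action["y"])
```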
Replay and scheduling
Opendesk is not just a live controller. In metalvendetta's post, the project adds two extra layers that push it closer to repeatable desktop automation:
- Learn and Replay: the agent can watch a user complete a task, store the trajectory as events plus screenshots, and replay it later.
- Scheduling: the same desktop task can run at a specified time, such as opening Gmail every morning and summarizing unread mail.
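Neither layer requires anything exotic. A minimal sketch of the data a learn-and-replay system would need to store, plus a crude daily scheduler loop, might look like this; the names are invented and none of it is Opendesk's API:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Step:
    action: dict          # e.g. {"type": "click", "mark_id": 3} or {"type": "type", "text": "..."}
    screenshot_path: str  # the screen as it looked right before the action


@dataclass
class Trajectory:
    name: str
    steps: list[Step] = field(default_factory=list)


def record_step(traj: Trajectory, action: dict, screenshot_path: str) -> None:
    """Store one observed user action together with the screen it was taken on."""
    traj.steps.append(Step(action=action, screenshot_path=screenshot_path))


def run_daily(task, hour: int = 8) -> None:
    """Crude scheduler loop: run the task once per day at the given hour."""
    last_run_day = None
    while True:
        now = time.localtime()
        if now.tm_hour == hour and now.tm_mday != last_run_day:
            task()
            last_run_day = now.tm_mday
        time.sleep(60)
```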
The replay detail is the interesting bit. The Reddit post says the system should not reuse old coordinates. It should re-execute the workflow against the current screen state with prior screenshots and actions as context, which makes the replay adaptive instead of brittle.
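Continuing the sketch above, an adaptive replay loop would re-plan each recorded step against a fresh observation rather than re-sending stored coordinates; `observe`, `plan`, and `execute` are placeholders for whatever the real controller exposes:

```python
def replay(traj: Trajectory, observe, plan, execute) -> None:
    """Re-run a learned task against the current screen, step by step."""
    for recorded in traj.steps:
        observation = observe()  # fresh screenshot plus marked accessibility elements
        # The recorded action and screenshot are context for the model,
        # not coordinates to replay blindly against a screen that may have changed.
        action = plan(recorded, observation)
        execute(action)
```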
Agentic experience
The most useful pushback in these threads is not about model quality. It is about interface design. bentossell asked why agents are opening browsers to use sites built for humans, while marckohlbrugge compared the moment to the mobile-web shift, where existing products were retrofitted first and new product categories appeared later.
That gives this small wave of desktop-control projects a clear split. One track makes current software legible to agents through accessibility trees, visual marks, and replayable actions. The other track imagines software built around agents from the start, which is a different bet than teaching a model to survive inside today's UI.