
hackernews_ai · May 2, 2026 · news

โ† Live feed ๐Ÿ“ฐ Daily recap ๐Ÿ—“๏ธ Weekly recap ๐Ÿ”” RSS

Show HN: Agent-desktop – Native desktop automation CLI for AI agents

I've been building computer-use tools for a while, and I quietly launched this about a month ago (122 stars on GitHub). I figured it was worth sharing here.

Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly like this:

1. Take a screenshot
2. Have the model predict pixel coordinates
3. Click x,y
4. Take another screenshot
5. Repeat

That works, but it's slow, expensive in tokens, and fragile. If the UI shifts a few pixels, things break. And the model still doesn't know what any element actually is.

But the OS already exposes structured UI information:

- macOS: Accessibility API
- Windows: UI Automation
- Linux: AT-SPI

Screen readers have used these APIs for years. On the web, Playwright beat screenshot scraping for the same reason: structured access is just a better abstraction than pixels.

So I built a desktop equivalent: agent-desktop. It's a cross-platform CLI for structured desktop automation through the accessibility tree. One Rust binary, about 15 MB, no runtime dependencies. It exposes 53 commands with JSON output, so an LLM can inspect and operate native apps without screenshots or vision models. Inspired by agent-browser by Vercel Labs.

A typical loop looks like this:

    agent-desktop snapshot --app Slack -i --compact
    agent-desktop click @e12
    agent-desktop type @e5 "ship it"
    agent-desktop press cmd+return

So the loop becomes:

1. Snapshot
2. Decide
3. Act
4. Snapshot again

The main desig
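The snapshot, decide, act loop described in the post could be driven from a script along the following lines. This is only a minimal sketch: the argument lists copy the four commands quoted above and nothing else about the CLI is assumed. The `decide` callback, the `agent_loop` and `run_cli` helpers, and the `max_steps` cap are hypothetical names introduced for illustration, and the snapshot output is passed through as raw text because the post does not show its JSON schema.

```python
import subprocess
from typing import Callable, List, Optional

Command = List[str]

# Argument lists mirror the four commands quoted in the post.
# Exit codes and the JSON schema of the output are NOT assumed here.
def snapshot_cmd(app: str) -> Command:
    return ["agent-desktop", "snapshot", "--app", app, "-i", "--compact"]

def click_cmd(ref: str) -> Command:
    return ["agent-desktop", "click", ref]

def type_cmd(ref: str, text: str) -> Command:
    return ["agent-desktop", "type", ref, text]

def press_cmd(keys: str) -> Command:
    return ["agent-desktop", "press", keys]

def run_cli(cmd: Command) -> str:
    """Run one command and return its stdout (assumes the binary is on PATH)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def agent_loop(
    app: str,
    decide: Callable[[str], Optional[Command]],  # hypothetical: wraps the LLM call
    run: Callable[[Command], str] = run_cli,
    max_steps: int = 10,                         # hypothetical safety cap
) -> int:
    """Snapshot, decide, act, repeat. Returns the number of actions taken."""
    actions = 0
    for _ in range(max_steps):
        snapshot = run(snapshot_cmd(app))  # 1. snapshot the accessibility tree
        action = decide(snapshot)          # 2. decide on the next command
        if action is None:                 # the decider signals it is done
            break
        run(action)                        # 3. act, then loop to snapshot again
        actions += 1
    return actions
```

Passing `run` as a parameter lets the loop be exercised with a stub instead of the real binary, for example `agent_loop("Slack", my_decider, run=fake_run)`, where `my_decider` and `fake_run` are placeholders you would supply.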

Read the original at github.com
