A curated collection of resources for building and deploying Computer Use Agents: vision-language models that see screens and take actions like humans do.
RL-trained with GRPO on synthetic environments. Excels at error recovery and multi-turn interactions. Achieves 55% on the OSWorld Chrome benchmark.
The API and runtime for Northstar CUA. Supports task-based automation, custom agent loops, and OpenAI-compatible endpoints.
Why SFT saturates, how positional encoding affects click accuracy (40% to 80% improvement), and why multi-turn RL enables robust error recovery.
| MODEL | PARAMS | HIGHLIGHTS | LINKS |
|---|---|---|---|
| NORTHSTAR CUA FAST (TZAFON AI) | 4B | RL-trained with GRPO on synthetic environments. Excels at error recovery and multi-turn interactions. 55% on OSWorld Chrome. | |
| UI-TARS | 7B/72B | Native GUI agent with System-2 reasoning. 24.6% on OSWorld. | |
| AGUVIS | 7B/72B | Unified pure vision GUI agent across platforms. | |
| COGAGENT | 18B | High-resolution cross-module attention for GUI understanding. | |
| SEECLICK | 7B | Visual GUI agent with element grounding capabilities. | |
| SHOWUI | 2B | Lightweight vision-language-action model for UI grounding. | |
| FERRET-UI | - | Apple's grounded mobile UI understanding with multimodal LLMs. | |
| OMNIPARSER | - | Microsoft's vision-based GUI agent parser. | |
The API and runtime for Northstar CUA. Supports task-based automation, custom agent loops, and OpenAI-compatible endpoints. Available for Python (`pip install tzafon`) and Node.js (`npm install @tzafon/lightcone`).
Anthropic's official computer use documentation and API reference.
High-level framework for building browser automation agents with LLMs.
Full desktop automation loop with sandbox execution. Apache 2.0 licensed.
Out-of-the-box computer use implementation.
Framework for multimodal AI computer control.
Natural language interface for computer control.
Lightweight computer use agent.
Foundational design principles in agentic systems.
Generalist computer agents with self-improvement.
Secure cloud sandboxes for running GUI agents.
Containerized environments for agent evaluation.
Toolkit for building general virtual agents.
Anthropic computer use adapted for Mac.
MIT-licensed voice-controlled AI agent for macOS using accessibility APIs.
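Several of the entries above mention custom agent loops or a full desktop automation loop. As a rough illustration only, the sketch below shows the general observe-act cycle such loops tend to follow: capture a screenshot, ask a model for the next action, execute it, and stop when the model signals completion or a step budget runs out. Every name here (`Action`, `run_agent_loop`, `take_screenshot`, `choose_action`, `execute`) is hypothetical and not taken from any of the listed libraries. The model and the screen are replaced by stubs so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; real frameworks define their own schema."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def run_agent_loop(
    task: str,
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe-act loop: screenshot -> model picks an action -> execute."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                     # observe the screen
        action = choose_action(task, screenshot, history)  # model decides
        history.append(action)
        if action.kind == "done":                          # model says task is finished
            break
        execute(action)                                    # act on the environment
    return history


def demo():
    """Run the loop with a scripted stand-in for the model (an assumption)."""
    scripted = iter([
        Action("click", x=100, y=200),
        Action("type", text="hello"),
        Action("done"),
    ])
    executed: List[Action] = []
    history = run_agent_loop(
        task="say hello",
        take_screenshot=lambda: b"",                  # stub: empty image bytes
        choose_action=lambda task, shot, hist: next(scripted),
        execute=executed.append,                      # stub: record instead of acting
    )
    return history, executed
```

In a real setup, `choose_action` would wrap a call to a vision-language model and `execute` would drive a sandboxed browser or desktop. The `max_steps` budget is one simple way to keep a confused agent from running indefinitely.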
Deep dive into why SFT saturates, how positional encoding affects click accuracy (40% to 80% improvement), and why multi-turn RL enables robust error recovery.
Comprehensive survey of the field.
Survey on multimodal agents for computing devices.
Self-improving CUA through synthetic data generation and iterative RL. 56.7% on OSWorld.
Native GUI agent with System-2 reasoning. State-of-the-art on 10+ benchmarks.
Self-adaptive agents in realistic environments.
Cognitive journey into digital world.
Pure vision approach without HTML parsing.
Experience-augmented hierarchical planning.
Advanced reasoning capabilities.
State-aware reasoning and re-planning.
Specialized generalist computer assistant.
Memory systems for agent workflows.
Learning environment dynamics for web navigation.
Simple but effective baseline.
Tree search methods for LLM agents.
Self-evolving curriculum reinforcement learning.
Enterprise-scale AI workflows.
Visual grounding for generalist web agents.
LMM-powered web navigation.
Actionable insights from trajectories.
High-resolution cross-module attention.
Coordinate-free visual grounding approach.
Leveraging pretrained MLLMs without fine-tuning.
Universal visual grounding for GUI agents.
Foundation model for GUI grounding.
Vision-based GUI parsing.
Cross-platform UI understanding.
GUI grounding pre-training.
Unleashing visual grounding in GPT-4V.
GUI element collection.
Multimodal web agents data.
Reverse task synthesis.
Web tutorials for trajectory synthesis.
Android agent training data.
Comprehensive GUI dataset.
Digital agents at scale.
The definitive benchmark: 369 tasks across Ubuntu, Windows, and macOS. Best model at publication: 12.24% success vs. 72% for humans.
Dynamic Android environment benchmark.
Windows-specific benchmark.
Realistic web environment with functional evaluation.
Multimodal web agent benchmark.
Large-scale web agent dataset.
GUI grounding benchmark across platforms.
Multimodal language model agents.
Efficient benchmark for mobile LLM agents.
Data science automation benchmark.
Multimodal agents in realistic scientific workflows.
Adversarial attacks through pop-up injection.
Mobile device control safety.
Guard agent architecture.
Privacy attacks on web agents.
Comprehensive attack analysis.
Official Anthropic tutorial
Operations automation demo
Task orchestration walkthrough
Comprehensive slide deck
Bill Gates on AI agents
Analysis of computer use implications
Frontier models and agency
Deep technical analysis of VLM training for computer use
Technical exploration
Implementation notes
Quick start guide
macOS-specific tutorial
DataCamp tutorial
Docker-based demo