[ SYSTEM ONLINE ]

AWESOME
COMPUTER
USE

A curated collection of resources for building and deploying Computer Use Agents: vision-language models that perceive the screen and act on it the way a human does, by clicking, typing, and scrolling through real interfaces.
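Nearly every model and tool indexed below implements the same basic control loop: capture a screenshot, ask a vision-language model for the next action, execute it, repeat until done. A minimal sketch of that loop, with `propose_action` as a stand-in for a real model call (all names here are illustrative, not any vendor's API):

```python
# Minimal computer-use agent loop (sketch).
# `propose_action` stands in for a real vision-language model call;
# everything here is illustrative, not any specific vendor's API.

from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                     # "click", "type", or "done"
    payload: tuple = field(default_factory=tuple)

def propose_action(screenshot: bytes, goal: str, step: int) -> Action:
    """Placeholder policy: a real agent would send the screenshot and
    goal to a VLM and parse its reply into an Action."""
    script = [
        Action("click", (220, 140)),
        Action("type", ("hello world",)),
        Action("done"),
    ]
    return script[min(step, len(script) - 1)]

def run_agent(goal: str, max_steps: int = 10) -> list[Action]:
    history: list[Action] = []
    for step in range(max_steps):
        screenshot = b"<pixels>"      # would come from the OS / browser
        action = propose_action(screenshot, goal, step)
        history.append(action)
        if action.kind == "done":
            break
        # here the action would be executed: move mouse, send keys, ...
    return history
```

The loop terminates either when the model emits a "done" action or when the step budget runs out; real agents add error handling and state tracking on top of this skeleton.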

LIVE RESOURCE DATABASE
// CC0-1.0 LICENSE
============================================
  COMPUTER USE AGENTS // RESOURCE TERMINAL
============================================
> STATUS: OPERATIONAL
> MODELS: LOADED
> TOOLS:  INDEXED
> PAPERS: CATALOGUED

[ TZAFON AI // NORTHSTAR CUA // LIGHTCONE ]

+---+     +---+     +---+
| M |---->| T |---->| A |
+---+     +---+     +---+
  |         |         |
  v         v         v
MODELS    TOOLS    ACTIONS
============================================
< 02 >

OPEN SOURCE MODELS

// 8 MODELS INDEXED
[ PRODUCTION-READY MODELS FOR GUI AUTOMATION ]
MODEL        PARAMS    HIGHLIGHTS
UI-TARS      7B/72B    Native GUI agent with System-2 reasoning. 24.6 on OSWorld.
AGUVIS       7B/72B    Unified pure vision GUI agent across platforms.
COGAGENT     18B       High-resolution cross-module attention for GUI understanding.
SEECLICK     7B        Visual GUI agent with element grounding capabilities.
SHOWUI       2B        Lightweight vision-language-action model for UI grounding.
FERRET-UI    -         Apple's grounded mobile UI understanding with multimodal LLMs.
OMNIPARSER   -         Microsoft's vision-based GUI agent parser.
< 03 >

DEVELOPER TOOLS

// SDKS, FRAMEWORKS, SANDBOXING
< 03.1 > SDKS & APIS

Anthropic's official computer use documentation and API reference.

ANTHROPIC

High-level framework for building browser automation agents with LLMs.

OPEN SOURCE
< 03.2 > AGENT FRAMEWORKS

Out-of-the-box computer use implementation.

SHOWLAB

Framework for multimodal AI computer control.

OTHERSIDEAI

Natural language interface for computer control.

OPEN SOURCE

Lightweight computer use agent.

OPEN SOURCE

Empowering foundation agents towards general computer control.

BAAI

Foundational design principles in agentic systems.

RESEARCH

Generalist computer agents with self-improvement.

RESEARCH
< 03.3 > SANDBOXING & EXECUTION

Secure cloud sandboxes for running GUI agents.

E2B
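Sandboxing matters because the model's proposed actions are untrusted output: a common pattern is to validate each action against an allowlist before executing it inside the isolated display. A minimal sketch of that gate, with made-up action names and a fixed 1920x1080 virtual display as assumptions (not any real SDK's API):

```python
# Sketch: gate untrusted agent actions before execution in a sandbox.
# Action names, the allowlist, and the display size are illustrative.

ALLOWED = {"click", "double_click", "type", "scroll", "screenshot"}
DISPLAY_W, DISPLAY_H = 1920, 1080

def validate(action: dict) -> dict:
    """Raise ValueError for any action outside the allowlist or
    any click that falls outside the virtual display bounds."""
    kind = action.get("kind")
    if kind not in ALLOWED:
        raise ValueError(f"blocked action: {kind!r}")
    if kind in ("click", "double_click"):
        x, y = action["x"], action["y"]
        if not (0 <= x < DISPLAY_W and 0 <= y < DISPLAY_H):
            raise ValueError("click outside the virtual display")
    return action
```

Running the executor inside a container or cloud sandbox on top of a gate like this limits the blast radius of a bad model decision.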

Containerized environments for agent evaluation.

XLANG-AI

Toolkit for building general virtual agents.

RESEARCH
< 03.4 > PLATFORM-SPECIFIC

Anthropic computer use adapted for Mac.

MACOS

MIT-licensed voice-controlled AI agent for macOS using accessibility APIs.

MACOS

Multimodal agents as smartphone users.

MOBILE
< 03.5 > OTHER PROJECTS

Open source process automation.

Natural language computer interface.

AI-powered automation bot.

Visual element marking for web agents.

CUA

Computer use agent framework.

UI interaction agent.

< 04 >

RESEARCH PAPERS

// CATEGORIZED ARCHIVE
> MODELING & ARCHITECTURE // 19 PAPERS
EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

Self-improving CUA through synthetic data generation and iterative RL. 56.7% on OSWorld.

FUDAN ET AL., 2025
UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Native GUI agent with System-2 reasoning. State-of-the-art on 10+ benchmarks.

BYTEDANCE, 2025
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents

Self-adaptive agents in realistic environments.

2025
PC Agent: While You Sleep, AI Works

A cognitive journey into the digital world.

2024
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Pure vision approach without HTML parsing.

2024
Agent S: An Open Agentic Framework that Uses Computers Like a Human

Experience-augmented hierarchical planning.

2024
OSCAR: Operating System Control via State-Aware Reasoning

State-aware reasoning and re-planning.

2024
AgentStore: Scalable Integration of Heterogeneous Agents

Specialized generalist computer assistant.

2024
Agent Workflow Memory

Memory systems for agent workflows.

2024
Web Agents with World Models

Learning environment dynamics for web navigation.

2024
Tree Search for Language Model Agents

Tree search methods for LLM agents.

2024
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum RL

Self-evolving curriculum reinforcement learning.

TSINGHUA, 2024
ECLAIR: Enterprise sCaLe AI for woRkflows

Enterprise-scale AI workflows.

STANFORD, 2024
SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded

Visual grounding for generalist web agents.

OSU, 2024
ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories

Actionable insights from trajectories.

NEURIPS 2024
CogAgent: A Visual Language Model for GUI Agents

High-resolution cross-module attention.

TSINGHUA, 2023
> GUI GROUNDING // 8 PAPERS
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Coordinate-free visual grounding approach.

2025
Attention-driven GUI Grounding

Leveraging pretrained MLLMs without fine-tuning.

AAAI 2025
Navigating the Digital World as Humans Do

Universal visual grounding for GUI agents.

2024
OS-ATLAS: Foundation Action Model for Generalist GUI Agents

Foundation model for GUI grounding.

ICLR 2025
OmniParser for Pure Vision Based GUI Agent

Vision-based GUI parsing.

MICROSOFT, 2024
Ferret-UI 2: Universal User Interface Understanding Across Platforms

Cross-platform UI understanding.

APPLE, 2024
Set-of-Mark (SoM) Prompting

Unleashing visual grounding in GPT-4V.

MICROSOFT, 2023
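Set-of-Mark prompting sidesteps pixel-level grounding by overlaying small numbered marks on detected UI elements, so the model can answer with a mark ID instead of raw coordinates. A minimal sketch of the bookkeeping involved (element detection is assumed to have already produced the boxes; names are illustrative):

```python
# Sketch of Set-of-Mark bookkeeping: number detected element boxes,
# then resolve the model's chosen mark back to a click point.
# Boxes are (left, top, right, bottom) in pixels; detection is assumed done.

def assign_marks(boxes: list[tuple[int, int, int, int]]) -> dict[int, tuple]:
    """Give each detected element a small integer mark ID, as drawn
    on the annotated screenshot sent to the model."""
    return {i + 1: box for i, box in enumerate(boxes)}

def mark_to_click(marks: dict[int, tuple], mark_id: int) -> tuple[int, int]:
    """Resolve a mark ID (as returned by the model) to the box center."""
    left, top, right, bottom = marks[mark_id]
    return ((left + right) // 2, (top + bottom) // 2)
```

For example, with boxes `[(0, 0, 100, 40), (120, 0, 220, 40)]`, a model reply of "click [2]" resolves to the center point `(170, 20)`.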
> AGENT DATA & TRAJECTORY SYNTHESIS // 7 PAPERS
> BENCHMARKS & EVALUATION // 11 PAPERS
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

The de facto standard benchmark: 369 real tasks across Ubuntu, Windows, and macOS. Best agent at release: 12.24%, versus 72% for humans.

NEURIPS 2024
AndroidWorld: A Dynamic Benchmarking Environment

Dynamic Android environment benchmark.

2024
WebArena

Realistic web environment with functional evaluation.

VisualWebArena

Multimodal web agent benchmark.

Mind2Web

Large-scale web agent dataset.

ScreenSpot

GUI grounding benchmark across platforms.

CRAB: Cross-environment Agent Benchmark

Multimodal language model agents.

2024
MobileAgentBench

Efficient benchmark for mobile LLM agents.

2024
Spider2-V: Automating Data Science Workflows

Data science automation benchmark.

NEURIPS 2024
ScienceBoard: Scientific Workflows Evaluation

Multimodal agents in realistic scientific workflows.

2025
> SAFETY & SECURITY // 5 PAPERS
Attacking Vision-Language Computer Agents via Pop-ups

Adversarial attacks through pop-up injection.

2024
EIA: Environmental Injection Attack for Privacy Leakage

Privacy attacks on web agents.

2024
Adversarial Attacks on Multimodal Agents

Comprehensive attack analysis.

2024
< 05 >

COMMERCIAL PLATFORMS

// 3 PLATFORMS
CLAUDE COMPUTER USE
ANTHROPIC

Computer use via Claude API
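With Claude, computer use is exposed as a beta tool the caller declares alongside the display dimensions; the model then replies with tool calls that the client must execute and screenshot back. The payload below follows the shape of Anthropic's published beta as an assumption: the tool type string, beta flag, and model ID have changed across versions, so check the current docs before using them.

```python
# Sketch of a computer-use request payload (no request is made here).
# "computer_20241022" and the model id match an earlier published beta
# and may differ in current API versions; treat them as assumptions.

computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
}

request = {
    "model": "claude-3-5-sonnet-20241022",   # illustrative model id
    "max_tokens": 1024,
    "tools": [computer_tool],
    "messages": [
        {"role": "user", "content": "Open the settings panel."}
    ],
}
```

A real call would send this payload, plus the matching beta header, through the official `anthropic` SDK, then loop: execute each returned tool call locally and feed a fresh screenshot back as the tool result.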

OPENAI OPERATOR
OPENAI

Browser automation agent from OpenAI

PROJECT MARINER
DEEPMIND

DeepMind's multimodal web agent

< 06 >

RESOURCES

// VIDEOS, BLOGS, TUTORIALS
VIDEOS & TALKS
Claude | Computer use for coding

Official Anthropic tutorial

YOUTUBE
Claude | Computer use for automating operations

Operations automation demo

YOUTUBE
Claude | Computer use for orchestrating tasks

Task orchestration walkthrough

YOUTUBE
LLMs as Computer Users: An Overview

Comprehensive slide deck

FIGMA
BLOGS & ARTICLES
// INDUSTRY PERSPECTIVES
AI is about to completely change how you use computers

Bill Gates on AI agents

GATESNOTES
When you give a Claude a mouse

Analysis of computer use implications

ETHAN MOLLICK
Claude's agentic future

Frontier models and agency

NATHAN LAMBERT
// TECHNICAL ANALYSIS
Initial explorations of Computer Use

Technical exploration

SIMON WILLISON
Notes on Anthropic's Computer Use Ability

Implementation notes

COMPOSIO
// TUTORIALS & GUIDES
Automating macOS using Claude Computer Use

macOS-specific tutorial

GLAMA.AI
Instant Claude Computer Use Demo

Docker-based demo

LABEX
< 07 >

COMMUNITY

// CONNECT & DISCUSS