We propose OS-Symphony, a robust cross-platform desktop CUA framework that organically integrates an Orchestrator, a Reflection-Memory Agent, and versatile Tool Agents.
OS-Symphony has achieved a score of 65.8 on the OSWorld Official Evaluation (using GPT-5 + UI-TARS-1.5-7B with 50 steps). As of 2026/01/04, this ranks 5th overall, 3rd among methods without multiple rollout, and 1st under the 50-steps constraint!
Note: The results reported in our paper and this webpage are lower due to limitations of the VM environment. While you are allowed to compare against the metrics in our paper, we highly encourage comparing against the official evaluation results.
We propose OS-Symphony, a robust desktop CUA framework that organically integrates an Orchestrator, a Reflection-Memory Agent, and specialized Tool Agents through a unified Reflection Message Protocol, minimizing error accumulation while maximizing overall capability.
OS-Symphony is built around the following components, forming a complete closed reasoning loop:
OS-Symphony is a cross-platform CUA supporting three mainstream OSs, achieving a omni-CUA solution. It achieves success rates of 65.8 on OSWorld
We will release all code and more details. We hope OS-Symphony can inspire and boost future research in building GUI agents.
Click to jump to each section.
OS-Symphony comprises three synergistic components: an Orchestrator, specialized Tool Agents, and a Reflection-Memory Agent (RMA). In the case shown above, the Orchestrator first interprets feedback from the RMA, which identifies execution stagnation caused by a Chrome version mismatch between the task environment and the VLM's pre-training knowledge, together with a Lack of Tutorial error. It then invokes the Searcher to retrieve a relevant tutorial, and following this guidance, successfully completes the task, achieving a closed-loop self-improvement cycle.
Current CUA frameworks suffer from intent drift and insufficient error awareness when executing long-horizon tasks, due to the lack of a concise yet effective memory mechanism. To address this, we introduce a Reflection-Memory Agent (RMA) that manages a milestone-driven long-term memory, which is primarily responsible for alleviating the contextual memory overhead of the Orchestrator.
We introduce a "Visual-Centric Search as a Tool" paradigm, where our Searcher employs a VLM-driven "SeeAct" strategy to directly interact with rendered pages and synthesize tutorials. This approach addresses two critical deficiencies in RAG for CUAs. First, it retains the vital visual context often lost in text-based parsing. Second, it resolves the misalignment between static retrieval and dynamic needs: unlike standard RAG, our method treats search as an on-demand capability, ensuring the agent actively seeks knowledge only when execution gaps arise.
An example is shown below.
We first evaluate OS-Symphony on OSWorld, which comprises 369 real-world tasks across five domains in a Linux environment. Following common practice, we exclude the 8 Google Drive tasks, resulting in a final set of 361 tasks. The results are shown in Table 1
| Model | Steps | Success Rate | |||||
|---|---|---|---|---|---|---|---|
| OS | Office | Daily | Professional | Workflow | Avg. | ||
| OpenCUA-72B | 15 | 50.00 | 38.45 | 46.10 | 71.43 | 13.91 | 39.03 |
| OpenCUA-72B | 50 | 45.83 | 42.73 | 53.79 | 75.51 | 22.31 | 44.52 |
| OpenCUA-72B | 100 | 61.13 | 44.73 | 49.95 | 72.58 | 22.16 | 44.91 |
| UI-TARS |
100 | 41.67 | 50.42 | 55.69 | 51.02 | 14.66 | 41.85 |
| UI-TARS-2 |
100 | 41.67 | 61.11 | 62.12 | 61.22 | 34.13 | 53.10 |
| DeepMiner-Mano-7B | 100 | 50.00 | 39.28 | 44.87 | 73.47 | 17.20 | 40.15 |
| DeepMiner-Mano-72B | 100 | 66.67 | 63.22 | 52.51 | 83.67 | 24.41 | 53.91 |
| Claude-Sonnet-4.5 | 50 | 70.83 | 62.41 | 57.64 | 63.27 | 46.99 | 58.10 |
| Claude-Sonnet-4.5 | 100 | 70.83 | 72.59 | 61.35 | 63.27 | 49.54 | 62.84 |
| UiPath w/ GPT-5 |
50 | 73.91 | 49.52 | 62.12 | 71.43 | 37.30 | 53.69 |
| CoACT-1 w/ GPT-5 |
50 | 70.83 | 60.65 | 54.09 | 69.39 | 42.37 | 56.39 |
| CoACT-1 w/ GPT-5 | 100 | 75.00 | 62.93 | 57.94 | 71.43 | 47.87 | 59.93 |
| CoACT-1 w/ GPT-5 | 150 | 75.00 | 62.93 | 61.78 | 71.43 | 47.87 | 60.76 |
| GTA1 w/ GPT-5 |
100 | 79.17 | 63.91 | 62.56 | 79.59 | 50.91 | 63.41 |
| Agent S3 w/ Qwen3-VL-32B-Instr. ♣ |
50 | 50.00 | 36.67 | 50.62 | 61.22 | 21.96 | 40.11 |
| Agent S3 w/ GPT-5-Mini ♣ | 50 | 62.50 | 54.62 | 46.67 | 44.90 | 37.04 | 47.58 |
| Agent S3 w/ GPT-5 | 100 | 77.50 | 66.46 | 61.23 | 69.80 | 51.37 | 62.63 |
| OS-SYMPHONY w/ Qwen3-VL-32B-Instr. | 50 | 58.33 | 40.94 | 53.54 | 75.10 | 31.24 | 46.86 |
| OS-Symphony w/ GPT5-Mini | 50 | 73.68 | 58.17 | 61.39 | 75.00 | 47.37 | 58.05 |
| OS-Symphony w/ GPT-5 | 50 | 75.00 | 64.85 | 61.19 | 69.23 | 54.86 | 63.61 |
| OS-Symphony w/ GPT-5 | 100 | 75.00 | 65.70 | 63.75 | 69.23 | 58.99 | 65.84 |
As shown in Table 1, when using GPT-5, OS-Symphony achieves new SOTA performance under maximum step limits of 50 and 100, scoring 63.6 and 65.8, respectively. This represents an absolute improvement of approximately 3% over the primary baseline, 100-step Agent S3 with GPT-5, demonstrating the effectiveness of our framework.
To evaluate generalization across widely used operating systems, we additionally evaluate OS-Symphony on WindowsAgentArena. The results are shown in Table 2
| Model | Steps | Success Rate | ||||||
|---|---|---|---|---|---|---|---|---|
| Office | Web | System | Code | Media | Utilization | Avg. | ||
| Qwen3-VL-32B-Instr. | 50 | 19.05 | 49.66 | 54.17 | 21.05 | 42.19 | 25.00 | 34.61 |
| UI-TARS-1.5-7B | 50 | - | - | - | - | - | - | 42.10 |
| UI-TARS-2 | 50 | - | - | - | - | - | - | 50.60 |
| Agent S3 w/ GPT-5 | 50 | - | - | - | - | - | - | 54.10 |
| Agent S3 w/ GPT-5 | 100 | - | - | - | - | - | - | 56.60 |
| OS-Symphony w/ Qwen3-VL-32B-Instr. | 50 | 26.19 | 46.33 | 75.00 | 47.37 | 27.90 | 41.67 | 43.12 |
| OS-Symphony w/ GPT-5-Mini | 50 | 42.86 | 73.00 | 79.17 | 68.42 | 48.66 | 66.67 | 61.50 |
| OS-Symphony w/ GPT-5 | 50 | 54.76 | 73.00 | 75.00 | 42.11 | 70.09 | 75.00 | 63.63 |
As shown in Table 2, OS-Symphony achieves a new SOTA score of 63.6 on WindowsAgentArena using GPT-5 with 50-step, outperforming the 50-step and 100-step Agent S3 baselines by 9.5% and 7.0%, respectively, with even the GPT-5-Mini variant demonstrating superior efficiency by exceeding the 100-step Agent S3 with GPT-5 by 4.9%.
Furthermore, we evaluate OS-Symphony on MacOSArena, a particularly demanding benchmark where both proprietary and specialist models struggle to achieve even modest success rates. Due to the cost constraint, we didn't use GPT-5 as the backbone. The results are shown in Table 3
| Model | Steps | Success Rate | ||
|---|---|---|---|---|
| Single-Apps | Multi-Apps | Avg. | ||
| GPT-4o | 50 | 3.57 | 0.00 | 1.59 |
| Claude-3.7-Sonnet | 50 | 14.29 | 2.86 | 7.94 |
| Aguvis-72B | 50 | 0.00 | 0.00 | 0.00 |
| UI-TARS-1.5-7B | 50 | 14.29 | 2.86 | 7.94 |
| UI-TARS-72B-DPO | 50 | 14.29 | 5.71 | 7.52 |
| Qwen2.5-VL-72B | 50 | 7.14 | 0.00 | 3.17 |
| Qwen3-VL-32B-Instr. | 50 | 17.86 | 0.00 | 7.94 |
| OS-Symphony w/ Qwen3-VL-32B-Instr. | 50 | 32.14 | 8.57 | 19.05 |
| OS-Symphony w/ GPT-5-Mini | 50 | 57.14 | 37.14 | 46.03 |
As shown in Table 3, OS-Symphony achieves a new SOTA score of 46.03 on MacOSArena, significantly outperforming the baseline models. Consequently, we can conclude that OS-Symphony delivers robust performance across all mainstream operating systems, reflecting its remarkable OS-level generalization capabilities.
We conduct the Pass@K experiment. To explore the performance limits, we incrementally increased the temperature of both the Orchestrator and RMA by 0.1 at each pass. All experiments are carried out with GPT-5 + UI-TARS-1.5-7B and 100-step limit.
As shown above, OS-Symphony surpasses human performance (72.4%) at Pass@2 (74.14%) and achieves a success rate approaching 80% at Pass@5 (79.4%) on OSWorld. his demonstrates, first, that our framework has a high performance ceiling and can attain excellent results through such test-time scaling. Second, consistent with our analysis, zero-shot agentic frameworks exhibit substantial stochasticity on GUI tasks, which is evident in both our experimental results and qualitative observations. Future work should therefore prioritize improving deployment-time stability.
We examine the impact of the maximum number of images within the history trajectory on performance. All experiments are carried out with Qwen3-VL-32B-Instruct and UI-TARS-1.5-7B
As shown above, the image count significantly influences the framework's effectiveness. An insufficient number of images fails to provide adequate context regarding operation history, whereas an excessive number leads to a surge in token consumption, thereby impairing the model's reasoning capabilities. The experimental results validate that our selected limit of 8 images is optimal, achieving the peak performance of 46.86.
To better demonstrate the strengths of our framework, we conduct a qualitative analysis of specific success cases observed during experiments.
As shown above, after navigating to the correct page, the primary baseline, Agent S3, suffered from a lack of domain knowledge. It clicked an incorrect button, which visually resembled a settings option but was functionally irrelevant, and subsequently became trapped in an erroneous loop until the maximum step limit was exhausted. In contrast, OS-Symphony invoked the Search Agent at the first step. Through Google Chrome and a series of GUI actions, the Search Agent navigated to the Superuser website and synthesized a relevant tutorial. Guided by this external knowledge, our framework correctly identified and clicked the target button at Step 4, successfully completing the task.
In this work, we presented OS-Symphony, a holistic Compuuter-Using Agent (CUA) framework containing an Orchestrator that synergistically coordinates specialized modules, to address the two challenges of long-horizon robustness and domain generalization. To mitigate the visual loss inherent in extended workflows, our Reflection-Memory Agent employs a milestone-driven strategy for granular memory curation, enabling retrospective auditing to rectify errors such as intent drift and loops. Besides, we employ the Versatile Tool Agents featuring a Multimodal Searcher that transcends text-based retrieval limitations by adopting an active "SeeAct" paradigm, synthesizing high-fidelity, visually aligned tutorials for unseen environments. Extensive experiments confirm that OS-Symphony not only achieves state-of-the-art performance across diverse operating systems but also proves that complex problems can be effectively solved using open-source VLMs.
We express our deepest gratitude to the following excellent projects for their contributions of GUI automation: OSWorld, WindowsAgentArena, MacOSArena, Agent S series, UI-TARS series, GTA1, ScaleCUA etc.
@misc{ossymphony2026,
title={OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent},
author={Bowen Yang and Kaiming Jin and Zhenyu Wu and Zhaoyang Liu and Qiushi Sun and Zehao Li and JingJing Xie and Zhoumianze Liu and Fangzhi Xu and Kanzhi Cheng and Qingyun Li and Yian Wang and Yu Qiao and Zun Wang and Zichen Ding},
year={2026},
eprint={2601.07779},
archivePrefix={arXiv},
primaryClass={cs.MA},
url={https://arxiv.org/abs/2601.07779},
}