Ferret-UI's proficiency rests on a carefully curated training dataset. Training samples were gathered across a broad spectrum of elementary UI tasks, such as icon recognition, text finding, and widget listing. To further strengthen the model's reasoning capabilities, an advanced task dataset was also compiled, covering detailed description, interaction dialogue, and function inference. These efforts culminate in a comprehensive benchmark for evaluating how well Ferret-UI understands and interacts with UI screens. This section describes the data curation and task formulation behind Ferret-UI.