OmniParser for Pure Vision Based GUI Agent

1 Aug 2024 · Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah

The recent success of large vision language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of #GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were utilized to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. #OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the #Mind2Web and AITW benchmarks, OmniParser with screenshot-only input #outperforms the GPT-4V baselines that require additional information outside of the screenshot.
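
To make the idea of "structured elements" concrete, here is a minimal sketch of how detected regions and their captions might be handed to a language model and how a chosen element could be grounded back to screen coordinates. The names `ParsedElement`, `elements_to_prompt`, and `ground_action` are hypothetical and are not taken from the OmniParser code; the boxes and captions are made-up sample data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """One interactable region: an id, a pixel box, and a functional caption."""
    element_id: int
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed layout
    caption: str


def elements_to_prompt(elements: List[ParsedElement]) -> str:
    """Serialize the parsed elements as numbered lines of text for the model."""
    return "\n".join(f"[{e.element_id}] {e.caption}" for e in elements)


def ground_action(elements: List[ParsedElement], chosen_id: int) -> Tuple[float, float]:
    """Map the id chosen by the model to the center of its bounding box."""
    for e in elements:
        if e.element_id == chosen_id:
            x_min, y_min, x_max, y_max = e.box
            return ((x_min + x_max) / 2, (y_min + y_max) / 2)
    raise KeyError(f"no element with id {chosen_id}")


# Made-up sample output of a detection model plus a caption model.
elements = [
    ParsedElement(0, (10, 20, 50, 60), "settings gear icon"),
    ParsedElement(1, (100, 20, 180, 60), "search button"),
]
prompt_text = elements_to_prompt(elements)
click_point = ground_action(elements, 1)
```

With this layout the model only has to name an element id, and the click location follows from the detected box rather than from coordinates predicted by the model itself.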

Paper: https://arxiv.org/pdf/2408.00203v1.pdf

Code: https://github.com/microsoft/omniparser

Dataset: ScreenSpot


@Machine_learn
Competitive Programming with Large Reasoning Models
OpenAI


link

@Machine_learn
The Pandas Workshop (2022).pdf
28.9 MB
The Pandas Workshop: A comprehensive guide to using Python for data analysis with real-world case studies

@Machine_learn
Enhance-A-Video: Better Generated Video for Free

11 Feb 2025 · Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You

DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
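
As a rough illustration of "non-diagonal temporal attention", the sketch below averages the off-diagonal entries of a frames-by-frames attention map and turns that average into a scaling factor. This is one plausible reading of the abstract and not the authors' implementation; the function names, the `temperature` parameter, and the clamp at 1.0 are assumptions made for the example.

```python
def cross_frame_intensity(attn):
    """Mean of the off-diagonal entries of a square temporal attention map.

    attn[i][j] is the attention frame i pays to frame j, so off-diagonal
    entries measure how much each frame attends to the other frames.
    """
    n = len(attn)
    off_diag = [attn[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(off_diag) / len(off_diag)


def enhancement_scale(attn, temperature):
    """Hypothetical scale factor: temperature times the intensity, never below 1."""
    return max(1.0, temperature * cross_frame_intensity(attn))


# Made-up 3-frame attention map; each row sums to 1.
attn = [
    [0.6, 0.2, 0.2],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
cfi = cross_frame_intensity(attn)
scale = enhancement_scale(attn, temperature=10.0)
```

Because the factor is computed from attention maps the model already produces, a scheme like this needs no retraining, which matches the training-free claim in the abstract.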

Paper: https://arxiv.org/pdf/2502.07508v1.pdf

Code: https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video



@Machine_learn
Forwarded from Papers
Greetings,
We need one person to help us with the topic below (first author)
🔸🔸🔸🔸🔸🔸🔸🔸🔸
Title: Chronic kidney disease classification: Deep ensemble approach
Target conference:

⭐️https://saiconference.com/IntelliSys

⚙️Abstract: Chronic kidney disease (CKD) is a progressive disease that may lead to kidney failure, so early diagnosis is crucial for proper management. This condition has a high mortality rate, especially in developing countries. CKD is often overlooked because there are no apparent symptoms in the early stages. Meanwhile, early diagnosis and timely clinical intervention are essential to reduce the progression of the disease. CKD diagnosis using deep learning (DL) and feature selection (FS) methods can be a useful application of artificial intelligence (AI) in healthcare. DL algorithms can provide cost-effective and efficient computer-aided diagnosis (CAD) to assist physicians. DL models are based on automatic feature selection.
In some cases, manual feature extraction can improve the results before the network learning process. This study aims to present an ensemble deep-learning model for CKD classification. The proposed method used Deep Embedded Clustering (DEC) as a similarity feature. Also, latent features obtained from the Gaussian Mixture Model (GMM) process were used. On UCI datasets, the proposed method achieved an accuracy of 1.0 using the Synthetic Minority Over-sampling Technique (SMOTE).
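
For readers unfamiliar with SMOTE, the sketch below shows its core step: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is a simplified pure-Python illustration, not the pipeline used in the study; `smote_sketch`, its parameters, and the sample points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up two-feature minority-class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

In practice oversampling is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.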


Contributors also cover part of the publication fee. The related work and introduction sections are also the contributor's responsibility.
@Raminmousa
Papers channel:
https://www.tg-me.com/+SP9l58Ta_zZmYmY0
2025/02/22 21:45:52