The 1st Workshop on AI-Ready Data for Science Discovery

Date: November 13, 2025 | Location: Washington DC, USA

About the Workshop

Scientific discovery is entering a new era, powered by artificial intelligence (AI) and machine learning (ML). These technologies are enabling breakthroughs across disciplines such as biology, physics, chemistry, and materials science. However, one major bottleneck remains: the lack of high-quality, domain-specific datasets that are truly AI-ready.

The 1st Workshop on AI-ready Data for Science Discovery (ADSD 2025) aims to address this critical challenge by building a vibrant, interdisciplinary community focused on the creation, curation, and benchmarking of scientific datasets. Hosted at ICDM 2025, ADSD will serve as a platform for researchers, practitioners, and data professionals to collaborate on shaping the future of scientific data mining.

Call for Papers

We welcome a wide array of submissions on AI-ready datasets for scientific discovery, spanning theories, algorithms, applications, systems, and tools. Topics include, but are not limited to:

Data Acquisition and Integration

  • Automated methods for constructing AI-ready data from experiments, simulations, and publications.
  • Methods for integrating multimodal datasets, including text, images, tables, and numerical data.
  • Retrieval-augmented generation (RAG) for extracting knowledge from scientific literature.

Data Curation, Quality Control, and Enrichment

  • Methods for dataset collection and annotation.
  • Methods for metadata and synthetic data generation.
  • Methods for data consistency and completeness.
  • Automated data refinement to improve reliability.
  • Automated outlier detection and correction.
  • Systems/tools for continuous dataset integrity.

Benchmarking and Evaluation Frameworks

  • Standardized benchmarks applicable across domains such as biomedicine, materials science, and environmental modeling.
  • Methods for developing standardized metrics (e.g., accuracy, robustness, scalability, and interpretability) tailored to domain-specific data characteristics.
  • Methods to evaluate data quality and AI-readiness.
  • Methods to evaluate data interpretability, robustness, and trustworthiness.
  • Open platforms for standardized benchmarking.

Applications in Scientific Research

  • Tools for publication summarization and trend analysis.
  • AI for scientific challenges such as drug discovery, climate modeling, and materials prediction.

Submission Details

We invite submissions of regular research papers (6-10 pages, including the bibliography and any appendices). Submissions must be in PDF format and follow the IEEE Conference Template. Submitted papers will be assessed on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility. All papers must be submitted via the ADSD Submission site. Following ICDM tradition, all accepted workshop papers will be published in the dedicated ICDMW proceedings by the IEEE Computer Society Press. For questions about the workshop or submissions, please email pfwang@cnic.cn.

Important Dates

* All deadlines are at 11:59 pm in the Anywhere on Earth timezone

Keynote Presentations

Keynote 1

Yiqun Xie, Assistant Professor

Center for Geospatial Information Science, Dept. of Geographical Sciences, University of Maryland

AI Interdisciplinary Institute at Maryland (AIM), University of Maryland

Large-scale Benchmark Datasets for Geo-scientific Discovery: AI Challenges & Opportunities

Advances in deep learning and foundation models have continued to set new expectations for general tasks and bring new potential to harness geospatial big data for Earth monitoring and geoscience discovery, benefiting broad sectors including agriculture, energy, water, disaster response, and urban planning. However, direct applications of deep learning often fall short due to challenges posed by geospatial data, including spatial variability that can significantly weaken the replicability of AI models over space, limited and expensive-to-collect training samples, and the misalignment between pretraining and downstream tasks. To accelerate new AI model development for addressing these challenges, large-scale AI-ready benchmark datasets are essential to enable model training, testing, and comparison under different scenarios of practical importance to domain experts. This talk will showcase several datasets and benchmarks we recently created and open-sourced to support AI model development, which: (1) represent large geographic scales, from national and continental to global; (2) cover different data modalities such as remote sensing imagery (e.g., satellites, UAV), in-situ field measurements, and physical simulations; (3) support a variety of machine learning tasks including forecasting, segmentation, and emulation; and (4) directly link to important societal applications. I will also discuss AI challenges embedded in these benchmark datasets for large-scale applications and potential opportunities in AI model development to tackle them.

Biography

Yiqun Xie is an Assistant Professor in the Center for Geospatial Information Science, Dept. of Geographical Sciences, and Affiliate Faculty at the AI Interdisciplinary Institute at Maryland (AIM), at the University of Maryland. He received his PhD in Computer Science from the University of Minnesota, and his research addresses challenges facing machine learning for spatio-temporal data and related scientific problems. His current work focuses on: (1) geo-aware learning for large-scale problems with cross-region variability, (2) knowledge-guided learning for data-sparse problems, and (3) foundation models for geoscientific discovery. His research is supported by NSF, NASA, and Google, and has received recognitions including the Best Paper Award from IEEE ICDM 2021, the Best Application Paper Award from SIAM Data Mining 2023, the Best Vision Paper Award (Blue Sky Ideas Award) from ACM SIGSPATIAL 2019, and highlights from the Great Innovative Ideas by CCC at CRA. He has also delivered invited panel talks on GeoAI at Committee Meetings of the National Academies of Sciences, Engineering, and Medicine (NASEM) and the National Geospatial Advisory Committee (NGAC).


Keynote 2

Chandan Reddy, Professor

Department of Computer Science

Virginia Tech

Agentic AI for Scientific Discovery: Benchmarks, Frameworks, and Applications

The emergence of agentic AI systems marks a pivotal shift in how scientific discovery can be automated, interpreted, and scaled. Traditional approaches to computational discovery have largely relied on data-driven models that excel at prediction but struggle to reason over scientific priors or to integrate structured feedback during the discovery process. In this talk, I will present a suite of agentic AI frameworks that leverage large language models (LLMs) to generate, refine, and evaluate scientific hypotheses, ranging from symbolic equations to geometric and physical laws, across multiple scientific domains. These frameworks collectively form a structured discovery pipeline built around two key phases: (1) hypothesis generation, where LLM agents autonomously propose structured and interpretable hypotheses by retrieving inspirations and composing new associations, and (2) feedback and refinement, where these hypotheses are iteratively improved using signals from data, symbolic decomposition, benchmarks, and reasoning consistency. Together, these approaches demonstrate how agentic systems can move beyond static text generation toward dynamic reasoning and iterative hypothesis formation. Beyond methodology, I will highlight new benchmarks that rigorously test reasoning beyond memorization, frameworks such as LLM-SR that integrate scientific priors with evolutionary search for adaptive refinement, and applications that showcase interpretable discovery in physics, biology, and materials science. Collectively, these contributions demonstrate the evolution of large language models from predictive tools to scientific agents capable of symbolic and compositional reasoning, autonomous exploration, and interpretable hypothesis generation, laying the foundation for reliable, reproducible, and human-aligned scientific discovery.

Biography

Chandan Reddy is a Professor in the Department of Computer Science at Virginia Tech. He received his Ph.D. from Cornell University and his M.S. from Michigan State University. His primary research interests are Generative AI and Trustworthy Machine Learning, aimed at creating robust, fair, and explainable models for Scientific Discovery and Real-World Impact. Dr. Reddy's research has been funded by organizations such as the NSF, NIH, DOE, DOT, and various industries. He has authored over 200 peer-reviewed articles in leading conferences and journals. He has received several awards for his research, including the Best Application Paper Award at the ACM SIGKDD conference in 2010, the Best Poster Award at the IEEE VAST conference in 2014, and the Best Student Paper Award at the IEEE ICDM conference in 2016. He was also a finalist in the INFORMS Franz Edelman Award Competition in 2011. Dr. Reddy serves (or has served) on the editorial boards of journals such as ACM TKDD, ACM TIST, NPJ AI, and IEEE Big Data. He is a Senior Member of the IEEE and a Distinguished Member of the ACM. More information about his work is available at https://creddy.net.


Keynote 3

Anuj Karpatne, Associate Professor

Department of Computer Science

Virginia Tech

Knowledge-guided Machine Learning: Advances in an Emerging Paradigm for Scientific Discovery By Harnessing Scientific Knowledge and AI-ready Data

This talk will introduce knowledge-guided machine learning (KGML), a rapidly growing field in AI for Science where scientific knowledge is deeply integrated in machine learning frameworks to produce scientifically grounded, explainable, and generalizable predictions even on out-of-distribution data. This talk will present a multi-dimensional view to organize prior research in this area and illustrate KGML concepts using a variety of case studies in ecology, biology, and public health including modeling the quality of water in lakes across the US and discovering novel biological traits of organisms linked with evolution from biodiversity images. The talk will conclude with a discussion of how KGML is leading a new paradigm in AI for Science while also advancing the Science of AI driven by the needs of problems in science and engineering.

Biography

Anuj Karpatne is an Associate Professor of Computer Science at Virginia Tech, where he leads the Knowledge-Guided Machine Learning (KGML) Lab and serves as a Faculty and Dean’s Fellow in the College of Engineering. His research focuses on the paradigm of knowledge-guided machine learning—integrating scientific principles, physical laws, and domain knowledge into data-driven AI systems to build models that are more interpretable, generalizable, and scientifically grounded. Working at the intersection of machine learning, physics, and environmental sciences, Karpatne develops methods for learning differential equations, constructing digital twins, and coupling physics-based and generative AI models for discovery in fields such as climate science, ecology, geophysics, and biology. He has played key leadership roles in several large interdisciplinary initiatives, including the NSF HDR Imageomics Institute, the NSF PIPP COMPASS Center, and USDA SAS projects, collectively advancing AI-for-Science research. Recognized for his pioneering contributions, he received the Virginia Tech College of Engineering Faculty Fellow Award for Excellence in Research in 2025, and continues to advocate for building trustworthy, knowledge-guided AI systems that serve as collaborative partners in scientific discovery.

Organizing Committee

Steering Co-Chairs

Hui Xiong

The Hong Kong University of Science and Technology (Guangzhou)

Xiansheng Hua

Tongji University

Program Co-Chairs

Pengfei Wang

Chinese Academy of Sciences

Yanjie Fu

Arizona State University

Pengyang Wang

University of Macau

Kunpeng Liu

Clemson University

Jiaxu Cui

Jilin University

Poster Co-Chairs

Ran Zhang

University of Chinese Academy of Sciences

Zhiyuan Ning

University of Chinese Academy of Sciences

Web Co-Chairs

Pengjiang Li

University of Chinese Academy of Sciences

Ping Xu

University of Chinese Academy of Sciences

Zaitian Wang

University of Chinese Academy of Sciences

Program

10:30–10:35  Welcoming and Introduction
10:35–11:20  Keynote 1: Prof. Yiqun Xie, Large-scale Benchmark Datasets for Geo-scientific Discovery: AI Challenges & Opportunities
11:20–11:30  Synthetic Gait Video Dataset Generation for Privacy-Preserving Gait Disorder Analysis
11:30–11:40  ChemPaperBench: A Multi-domain Benchmark for Literature-grounded Chemical Reasoning of LLM-based Multi-Agent Systems
11:40–11:50  Monte Carlo Synthetic Data Generation for Radiograph Denoising
11:50–12:00  scUnity-AI: An AI-Ready Single-Cell RNA-Seq Resource with Standardized Processing for Multi-Task Computational Analyses
12:00–12:10  LLM-Based Synthetic Tabular Data Generation and Curation for Privacy-Sensitive Scientific Applications
12:10–12:20  DrugBank-2025: An AI-Ready Next-Generation Dataset for Drug-Drug Interaction Prediction
12:20–12:30  Meta-features informed WGAN for tabular data

Lunch Break

14:00–14:45  Keynote 2: Prof. Chandan Reddy, Agentic AI for Scientific Discovery: Benchmarks, Frameworks, and Applications
14:45–15:30  Keynote 3: Prof. Anuj Karpatne, Knowledge-guided Machine Learning: Advances in an Emerging Paradigm for Scientific Discovery By Harnessing Scientific Knowledge and AI-ready Data
15:30–15:40  Transformer-Based Topic Mapping and GPT-Driven Hierarchical Taxonomy in Cardiovascular Research
15:40–15:50  Two-stage Hierarchical Medical Text Classification using Bio_ClinicalBERT
15:50–16:00  Privatar: A Privacy-Preserving Framework for LLM Agents

Coffee Break

16:30–16:40  Understanding the Constraints of RAG-Based Medical LVLMs - A Case Study in Ophthalmic Report Generation
16:40–16:50  Quantifying Influenza Strain Dominance: A Differential Population Growth Rate Analysis Across Regions and Seasons
16:50–17:00  Expert Algorithm for Reducing LDL Cholesterol Level
17:00–17:10  XCheck: A Consistency-Based Validation Framework for Englacial Layers Annotations
17:10–17:50  Poster Session
17:50–18:00  Concluding Remarks

Program Committee

Contact

For inquiries, please contact us at pfwang@cnic.cn.

Data Sharing Supporters