The 1st Workshop on AI-Ready Data for Science Discovery @ ICDM 2025

About the Workshop

Scientific discovery is entering a new era powered by artificial intelligence (AI) and machine learning (ML). These technologies are enabling breakthroughs across disciplines such as biology, physics, chemistry, and materials science. However, one major bottleneck remains: the lack of high-quality, domain-specific datasets that are truly AI-ready.

The 1st Workshop on AI-Ready Data for Science Discovery (ADSD 2025) aims to address this critical challenge by building a vibrant, interdisciplinary community focused on the creation, curation, and benchmarking of scientific datasets. Hosted at ICDM 2025, ADSD will serve as a platform for researchers, practitioners, and data professionals to collaborate on shaping the future of scientific data mining.

Call for Papers

We welcome a wide array of submissions on AI-ready datasets for science discovery, covering theories, algorithms, applications, systems, and tools. Topics include but are not limited to:

Data Acquisition and Integration

Automated methods for constructing AI-ready data from experiments, simulations, and publications.
Methods for integrating multimodal datasets, including text, images, tables, and numerical data.
Retrieval-augmented generation (RAG) for extracting knowledge from scientific literature.

Data Curation, Quality Control, and Enrichment

Methods for dataset collection and annotation.
Methods for metadata and synthetic data generation.
Methods for data consistency and completeness.
Automated data refinement to improve reliability.
Automated outlier detection and correction.
Systems/tools for continuous dataset integrity.

Benchmarking and Evaluation Frameworks

Standardized benchmarks applicable across domains such as biomedicine, materials science, and environmental modeling.
Methods for developing standardized metrics (e.g., accuracy, robustness, scalability, and interpretability) tailored to domain-specific data characteristics.
Methods to evaluate data quality and AI-readiness.
Methods to evaluate data interpretability, robustness, and trustworthiness.
Open platforms for standardized benchmarking.

Applications in Scientific Research

Tools for publication summarization and trend analysis.
AI for scientific challenges such as drug discovery, climate modeling, and material prediction.

Submission Details

We invite the submission of regular research papers (6-10 pages, including the bibliography and any appendices). Submissions must be in PDF format and formatted according to the IEEE Conference Template. Submitted papers will be assessed on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility. All papers must be submitted via the ADSD Submission. Following ICDM tradition, all accepted workshop papers will appear in the dedicated ICDMW proceedings published by the IEEE Computer Society Press. For questions about the workshop or submissions, please email pfwang@cnic.cn.

Important Dates

* All deadlines are at 11:59 pm Anywhere on Earth (AoE).

Paper Submission Deadline: August 28, 2025
Acceptance Notification: September 10, 2025 (extended to September 14, 2025)
Camera-ready Submission: September 17, 2025
Workshop Date: November 13, 2025

Keynote Presentations

Keynote 1

Yiqun Xie, Assistant Professor

Center for Geospatial Information Science, Dept. of Geographical Sciences, University of Maryland

AI Interdisciplinary Institute at Maryland (AIM), University of Maryland

Large-scale Benchmark Datasets for Geo-scientific Discovery: AI Challenges & Opportunities

Advances in deep learning and foundation models have continued to set new expectations for general tasks and bring new potential to harness geospatial big data for Earth monitoring and geoscience discovery, benefiting broad sectors including agriculture, energy, water, disaster response, and urban planning. However, direct applications of deep learning often fall short due to challenges posed by geospatial data, including spatial variability that can significantly weaken the replicability of AI models over space, limited and expensive-to-collect training samples, and the misalignment between pretraining and downstream tasks. To accelerate new AI model development for addressing these challenges, large-scale AI-ready benchmark datasets are essential to enable model training, testing, and comparison under different scenarios of practical importance to domain experts. This talk will showcase several datasets and benchmarks we recently created and open-sourced to support AI model development. These datasets: (1) span large geographic scales, from national and continental to global; (2) cover different data modalities such as remote sensing imagery (e.g., satellites, UAV), in-situ field measurements, and physical simulations; (3) support a variety of machine learning tasks including forecasting, segmentation, and emulation; and (4) directly link to important societal applications. I will also discuss AI challenges embedded in these benchmark datasets for large-scale applications and potential opportunities in AI model development to tackle them.

Biography

Yiqun Xie is an Assistant Professor in the Center for Geospatial Information Science, Dept. of Geographical Sciences, and Affiliate Faculty at the AI Interdisciplinary Institute at Maryland (AIM), at the University of Maryland. He received his PhD in Computer Science from the University of Minnesota, and his research addresses challenges facing machine learning for spatio-temporal data and related scientific problems. His current work focuses on: (1) geo-aware learning for large-scale problems with cross-region variability, (2) knowledge-guided learning for data-sparse problems, and (3) foundation models for geoscientific discovery. His research is supported by NSF, NASA, and Google, and has received recognition including the Best Paper Award from IEEE ICDM 2021, the Best Application Paper Award from SIAM Data Mining 2023, the Best Vision Paper Award (Blue Sky Ideas Award) from ACM SIGSPATIAL 2019, and highlights from the Great Innovative Ideas by CCC at CRA. He has also delivered invited panel talks on GeoAI at committee meetings of the National Academies of Sciences, Engineering, and Medicine (NASEM) and the National Geospatial Advisory Committee (NGAC).

Keynote 2

Chandan Reddy, Professor

Department of Computer Science

Virginia Tech

Agentic AI for Scientific Discovery: Benchmarks, Frameworks, and Applications

The emergence of agentic AI systems marks a pivotal shift in how scientific discovery can be automated, interpreted, and scaled. Traditional approaches to computational discovery have largely relied on data-driven models that excel at prediction but struggle to reason over scientific priors or to integrate structured feedback during the discovery process. In this talk, I will present a suite of agentic AI frameworks that leverage large language models (LLMs) to generate, refine, and evaluate scientific hypotheses, ranging from symbolic equations to geometric and physical laws, across multiple scientific domains. These frameworks collectively form a structured discovery pipeline built around two key phases: (1) hypothesis generation, where LLM agents autonomously propose structured and interpretable hypotheses by retrieving inspirations and composing new associations, and (2) feedback and refinement, where these hypotheses are iteratively improved using signals from data, symbolic decomposition, benchmarks, and reasoning consistency. Together, these approaches demonstrate how agentic systems can move beyond static text generation toward dynamic reasoning and iterative hypothesis formation. Beyond methodology, I will highlight new benchmarks that rigorously test reasoning beyond memorization, frameworks such as LLM-SR that integrate scientific priors with evolutionary search for adaptive refinement, and applications that showcase interpretable discovery in physics, biology, and materials science. Collectively, these contributions demonstrate the evolution of large language models from predictive tools to scientific agents capable of symbolic and compositional reasoning, autonomous exploration, and interpretable hypothesis generation, laying the foundation for reliable, reproducible, and human-aligned scientific discovery.

Biography

Chandan Reddy is a Professor in the Department of Computer Science at Virginia Tech. He received his Ph.D. from Cornell University and his M.S. from Michigan State University. His primary research interests are Generative AI and Trustworthy Machine Learning, aimed at creating robust, fair, and explainable models for Scientific Discovery and Real-World Impact. Dr. Reddy's research has been funded by organizations such as the NSF, NIH, DOE, DOT, and various industries. He has authored over 200 peer-reviewed articles in leading conferences and journals. He has received several awards for his research, including the Best Application Paper Award at the ACM SIGKDD conference in 2010, the Best Poster Award at the IEEE VAST conference in 2014, and the Best Student Paper Award at the IEEE ICDM conference in 2016. He was also a finalist in the INFORMS Franz Edelman Award Competition in 2011. Dr. Reddy serves (or has served) on the editorial boards of journals such as ACM TKDD, ACM TIST, NPJ AI, and IEEE Big Data. He is a Senior Member of the IEEE and a Distinguished Member of the ACM. More information about his work is available at https://creddy.net.

Keynote 3

Anuj Karpatne, Associate Professor

Department of Computer Science

Virginia Tech

Knowledge-guided Machine Learning: Advances in an Emerging Paradigm for Scientific Discovery By Harnessing Scientific Knowledge and AI-ready Data

This talk will introduce knowledge-guided machine learning (KGML), a rapidly growing field in AI for Science where scientific knowledge is deeply integrated into machine learning frameworks to produce scientifically grounded, explainable, and generalizable predictions even on out-of-distribution data. It will present a multi-dimensional view to organize prior research in this area and illustrate KGML concepts using a variety of case studies in ecology, biology, and public health, including modeling the quality of water in lakes across the US and discovering novel biological traits of organisms linked with evolution from biodiversity images. The talk will conclude with a discussion of how KGML is leading a new paradigm in AI for Science while also advancing the Science of AI driven by the needs of problems in science and engineering.

Biography

Anuj Karpatne is an Associate Professor of Computer Science at Virginia Tech, where he leads the Knowledge-Guided Machine Learning (KGML) Lab and serves as a Faculty and Dean’s Fellow in the College of Engineering. His research focuses on the paradigm of knowledge-guided machine learning—integrating scientific principles, physical laws, and domain knowledge into data-driven AI systems to build models that are more interpretable, generalizable, and scientifically grounded. Working at the intersection of machine learning, physics, and environmental sciences, Karpatne develops methods for learning differential equations, constructing digital twins, and coupling physics-based and generative AI models for discovery in fields such as climate science, ecology, geophysics, and biology. He has played key leadership roles in several large interdisciplinary initiatives, including the NSF HDR Imageomics Institute, the NSF PIPP COMPASS Center, and USDA SAS projects, collectively advancing AI-for-Science research. Recognized for his pioneering contributions, he received the Virginia Tech College of Engineering Faculty Fellow Award for Excellence in Research in 2025, and continues to advocate for building trustworthy, knowledge-guided AI systems that serve as collaborative partners in scientific discovery.

Organizing Committee

Steering Co-Chairs

Hui Xiong

The Hong Kong University of Science and Technology (Guangzhou)

Xiansheng Hua

Tongji University

Program Co-Chairs

Pengfei Wang

Chinese Academy of Sciences

Yanjie Fu

Arizona State University

Pengyang Wang

University of Macau

Kunpeng Liu

Clemson University

Jiaxu Cui

Jilin University