Exploring One Million Machine Learning Pipelines: A Benchmarking Study

Published: 03 Jun 2025, Last Modified: 03 Jun 2025. AutoML 2025 ABCD Track. License: CC BY 4.0.
Confirmation: our paper adheres to reproducibility best practices. In particular, we confirm that all important details required to reproduce the results are described in the paper, the authors agree to the paper being made available online through OpenReview under a CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/), and the authors have read and commit to adhering to the AutoML 2025 Code of Conduct (https://2025.automl.cc/code-of-conduct/).
Reproducibility: zip
TL;DR: This paper benchmarks ML pipelines built from combinations of feature preprocessing techniques and classification models, identifies well- and poorly performing combinations, and provides meta-knowledge datasets to support meta-learning research.
Abstract: Machine learning solutions are largely affected by the hyperparameter values of their algorithms. This has motivated a large number of recent research projects on hyperparameter tuning, leading to the proposal of several highly diverse tuning approaches. Rather than proposing a new approach or identifying the best hyperparameter tuning approach, this paper looks for good machine learning solutions by exploring machine learning pipelines. To this end, it benchmarks pipelines focusing on the interaction between feature preprocessing techniques and classification models. The study evaluates the effectiveness of pipeline combinations, identifying high-performing and underperforming combinations. Additionally, it provides meta-knowledge datasets without any optimization selection bias to foster research contributions in meta-learning, accelerating the development of meta-models. The findings provide insights into the most effective preprocessing and modeling combinations, guiding practitioners and researchers in their selection processes.
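To make the notion of a "pipeline combination" concrete, the sketch below is a minimal, hypothetical illustration using scikit-learn; it is not the paper's actual benchmarking code, and the specific preprocessors, classifiers, dataset, and cross-validation setup are assumptions chosen only to show how a feature preprocessing technique is chained with a classification model and scored.

```python
# Minimal sketch (assumed setup, not the authors' benchmark) of evaluating
# preprocessing x classifier pipeline combinations with cross-validation.
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Placeholder dataset; the study evaluates many datasets, this is illustrative only.
X, y = load_breast_cancer(return_X_y=True)

# Small candidate sets of preprocessing techniques and classifiers;
# the actual study covers a much larger grid of combinations.
preprocessors = {
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "pca": PCA(n_components=10),
}
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Benchmark every preprocessing x classifier combination.
for (p_name, prep), (c_name, clf) in product(preprocessors.items(), classifiers.items()):
    pipe = Pipeline([("preprocess", prep), ("model", clf)])
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{p_name} + {c_name}: mean CV accuracy = {score:.3f}")
```

Collecting such scores across many datasets and combinations is the kind of raw material from which the paper's meta-knowledge datasets are built.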
Submission Number: 8