causalAssembly: Generating Realistic Production Data for Benchmarking Causal Discovery

Konstantin Göbler, Tobias Windisch, Mathias Drton, Tim Pychynski, Steffen Sonntag, Martin Roth

Research output: Contribution to journal › Conference article › peer-review

3 Scopus citations

Abstract

Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real and complex data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To tackle these challenges, we introduce causalAssembly, a semisynthetic data generator designed to facilitate the benchmarking of causal discovery methods. The tool is built using a complex real-world dataset comprised of measurements collected along an assembly line in a manufacturing setting. For these measurements, we establish a partial set of ground truth causal relationships through a detailed study of the physics underlying the processes carried out in the assembly line. The partial ground truth is sufficiently informative to allow for estimation of a full causal graph by mere nonparametric regression. To overcome potential confounding and privacy concerns, we use distributional random forests to estimate and represent conditional distributions implied by the ground truth causal graph. These conditionals are combined into a joint distribution that strictly adheres to a causal model over the observed variables. Sampling from this distribution, causalAssembly generates data that are guaranteed to be Markovian with respect to the ground truth. Using our tool, we showcase how to benchmark several well-known causal discovery algorithms.
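The sampling step described in the abstract can be read as ancestral sampling: visit the variables in a topological order of the ground truth graph and draw each one from its conditional distribution given the parents already sampled. The following is a minimal sketch of that idea only, not the causalAssembly API. The graph, the variable names, and the `sample_node` helper are hypothetical, and a simple linear-Gaussian conditional stands in for the distributional random forests used by the authors.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical toy DAG: each node maps to the list of its parents.
PARENTS = {
    "press_force": [],
    "temperature": [],
    "gap_width": ["press_force", "temperature"],
    "leak_rate": ["gap_width"],
}


def sample_node(node, parent_values, rng):
    """Stand-in conditional sampler (assumption): sum of parents plus N(0, 1) noise.

    In the paper this role is played by a fitted distributional random forest.
    """
    mean = sum(parent_values.values())
    return mean + rng.gauss(0.0, 1.0)


def ancestral_sample(parents, n, seed=0):
    """Draw n joint samples by sampling every node after all of its parents."""
    rng = random.Random(seed)
    # TopologicalSorter expects node -> predecessors, which is what `parents` holds.
    order = list(TopologicalSorter(parents).static_order())
    rows = []
    for _ in range(n):
        row = {}
        for node in order:
            parent_values = {p: row[p] for p in parents[node]}
            row[node] = sample_node(node, parent_values, rng)
        rows.append(row)
    return order, rows


if __name__ == "__main__":
    order, rows = ancestral_sample(PARENTS, n=3, seed=42)
    print("sampling order:", order)
    for row in rows:
        print({name: round(value, 3) for name, value in row.items()})
```

Because each variable is generated only from its parents and independent noise, the resulting joint distribution factorizes along the graph, which is the sense in which the generated data are Markovian with respect to the ground truth.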

Original language: English
Pages (from-to): 609-642
Number of pages: 34
Journal: Proceedings of Machine Learning Research
Volume: 236
State: Published - 2024
Event: 3rd Conference on Causal Learning and Reasoning, CLeaR 2024 - Los Angeles, United States
Duration: 1 Apr 2024 to 3 Apr 2024

Keywords

  • Causal discovery
  • benchmarking
  • distributional random forest
  • production data
