The impact of averaging logits over probabilities on ensembles of neural networks

Cedrique Rovile Njieutcheu Tassi, Jakob Gawlikowski, Auliya Unnisa Fitri, Rudolph Triebel

Research output: Contribution to journal › Conference article › peer-review


Model averaging has become a standard for improving neural networks in terms of accuracy, calibration, and the ability to detect false predictions (FPs). However, recent findings show that model averaging does not necessarily lead to calibrated confidences, especially for underconfident networks. While existing methods for improving the calibration of combined networks focus on recalibrating, building, or sampling calibrated models, we focus on the combination process. Specifically, we evaluate the impact of averaging logits instead of probabilities on the quality of confidence (QoC). We compare combining the logits of the members (networks) against combining their probabilities for models such as ensembles, Monte Carlo Dropout (MCD), and Mixture of Monte Carlo Dropout (MMCD). The comparison is based on experimental results on three datasets with three different architectures. We show that averaging logits instead of probabilities increases the confidence, thereby improving the confidence calibration of underconfident models. For example, for MCD evaluated on CIFAR10, averaging logits instead of probabilities reduces the expected calibration error (ECE) from 12.03% to 5.44%. However, the increase in confidence can harm the confidence calibration of overconfident models as well as the separability between true predictions (TPs) and FPs. For example, for MMCD evaluated on MNIST, the average confidence on FPs caused by noisy data increases from 51.31% to 94.58% when averaging logits instead of probabilities. While averaging logits can be applied to underconfident models to improve calibration on test data, we suggest averaging probabilities for safety- and mission-critical applications where the separability of TPs and FPs is of paramount importance.
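The two combination rules compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the toy logits for two members are assumptions chosen only to show the mechanics. Probability averaging applies the softmax to each member and then averages; logit averaging averages the raw logits and applies the softmax once.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probabilities(member_logits):
    """Softmax each member first, then average the probabilities per class."""
    probs = [softmax(logits) for logits in member_logits]
    n = len(probs)
    return [sum(col) / n for col in zip(*probs)]


def average_logits(member_logits):
    """Average the logits per class first, then apply a single softmax."""
    n = len(member_logits)
    mean_logits = [sum(col) / n for col in zip(*member_logits)]
    return softmax(mean_logits)


# Hypothetical toy input: one confident member and one undecided member,
# three classes. These values are illustrative, not taken from the paper.
member_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

conf_prob = max(average_probabilities(member_logits))
conf_logit = max(average_logits(member_logits))

print(f"probability averaging confidence: {conf_prob:.3f}")
print(f"logit averaging confidence:       {conf_logit:.3f}")
```

On this toy input the combined confidence is about 0.65 with probability averaging and about 0.79 with logit averaging, which matches the direction of the effect described above. A single hand-picked example does not establish the effect in general; the abstract's claims rest on the reported experiments.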

Original language: English
Journal: CEUR Workshop Proceedings
State: Published - 2022
Externally published: Yes
Event: 2022 Workshop on Artificial Intelligence Safety, AISafety 2022 - Vienna, Austria
Duration: 24 Jul 2022 to 25 Jul 2022


  • Combination process
  • Confidence calibration
  • Ensemble
  • Logit averaging
  • Mixture of Monte Carlo Dropout (MMCD)
  • Model averaging
  • Monte Carlo Dropout (MCD)
  • Probability averaging
  • Quality of confidence (QoC)
  • Separating true predictions (TPs) and false predictions (FPs)

