Evaluating Models#

Evaluation is a first-class citizen of TorchUncertainty. Each routine wraps your model with a rich set of metrics that go far beyond plain accuracy or MSE: proper scores, calibration, selective prediction, ensemble diversity, OOD detection and shift robustness. These metrics are computed automatically at validation and test time and logged to your Lightning logger (and optionally to a CSV file via the save_to_csv argument).

This page summarises which metrics are computed by default for each supported task, and which additional metrics are enabled when you set eval_ood=True or eval_shift=True on the corresponding routine. All metrics live in torch_uncertainty.metrics and most of them are documented in the API reference. For a list of the papers they come from, see the references page.

Note

The metrics computed at validation time are a subset of those computed at test time — they are meant to monitor training and are kept lightweight. The breakdown below focuses on test-time evaluation, which is where uncertainty matters most.

Classification#

The ClassificationRoutine is the most feature-complete routine. It is suited for both binary and multi-class classification and supports single models, ensembles, post-hoc calibration and conformal prediction.

Default test metrics#

Whatever the model, the following metrics are always computed on the in-distribution test set:

Performance & proper scores

cls/Acc — top-1 Accuracy.
cls/Brier — BrierScore, a strictly proper score measuring the squared distance between the predicted probabilities and the one-hot target.
cls/NLL — CategoricalNLL, the negative log-likelihood of the categorical predictive distribution.
cls/Entropy — Entropy of the predictive distribution, a popular total uncertainty estimator.
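To make these definitions concrete, here is a minimal pure-Python sketch of the three scores on a toy batch. The function names are hypothetical and the code is not TorchUncertainty's implementation; in particular, it assumes the Brier score sums the squared differences over classes before averaging over samples.

```python
import math


def brier_score(probs, targets):
    """Mean over samples of the squared distance to the one-hot target."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(probs)


def categorical_nll(probs, targets):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, targets)) / len(probs)


def mean_entropy(probs):
    """Mean Shannon entropy (in nats) of the predictive distributions."""
    per_sample = [-sum(p_k * math.log(p_k) for p_k in p if p_k > 0) for p in probs]
    return sum(per_sample) / len(per_sample)


# Toy batch: three samples, three classes (illustrative values only).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
targets = [0, 1, 0]
print(brier_score(probs, targets), categorical_nll(probs, targets), mean_entropy(probs))
```

The third sample is both wrong and fairly uncertain, so it dominates the Brier score and the NLL while also contributing the highest entropy.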

Calibration

cal/ECE — CalibrationError (Expected Calibration Error) with equal-width bins.
cal/aECE — adaptive (equal-mass) variant of the ECE (AdaptiveCalibrationError).
cal/MCE — Maximum Calibration Error (same metric, norm="max").
cal/SmECE — SmoothCalibrationError, a kernel-smoothed, bin-free estimator of the calibration error.

The number of bins used by the binned estimators is set through the routine's num_bins_calibration_error argument (default 15).
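As an illustration of how the binned estimators work, the sketch below computes an equal-width calibration error from top-1 confidences and correctness flags. It is a simplified stand-in, not the library's code: the function name is hypothetical, bins are assumed to be half-open on the right, and the norm argument only mimics the "l1" (ECE) and "max" (MCE) options mentioned above.

```python
def calibration_error(confidences, correct, num_bins=15, norm="l1"):
    """Binned calibration error with equal-width bins over [0, 1]."""
    n = len(confidences)
    conf_sum = [0.0] * num_bins
    acc_sum = [0.0] * num_bins
    count = [0] * num_bins
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        conf_sum[idx] += conf
        acc_sum[idx] += float(hit)
        count[idx] += 1
    # One (weight, |accuracy - confidence|) pair per non-empty bin.
    gaps = [
        (m / n, abs(a / m - c / m))
        for c, a, m in zip(conf_sum, acc_sum, count)
        if m > 0
    ]
    if norm == "max":
        return max(gap for _, gap in gaps)  # MCE: worst bin
    return sum(weight * gap for weight, gap in gaps)  # ECE: weighted average


confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 0, 0]
print(calibration_error(confidences, correct, num_bins=10))
print(calibration_error(confidences, correct, num_bins=10, norm="max"))
```

The adaptive variant differs only in how the bin edges are chosen (equal number of samples per bin instead of equal width).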

Selective Classification

These metrics evaluate the ability of an uncertainty score to reject uncertain inputs while keeping accurate predictions.

sc/AURC — AURC, Area Under the Risk-Coverage curve.
sc/AUGRC — AUGRC, Area Under the Generalized Risk-Coverage curve.
sc/Cov_5Risk — CovAt5Risk, maximum coverage at which the selective risk stays below 5%.
sc/Risk_80Cov — RiskAt80Cov, selective risk at 80% coverage.
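The risk-coverage view behind these four metrics can be sketched as follows. This is a simplified pure-Python illustration with hypothetical function names; it averages the selective risk over the discrete coverage levels, whereas the library may integrate the curve differently, and it omits the generalized risk used by AUGRC.

```python
def risk_coverage_curve(confidences, errors):
    """Selective risk when keeping the k most confident predictions, for every k."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    coverages, risks = [], []
    cumulative_errors = 0
    for k, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        coverages.append(k / len(order))
        risks.append(cumulative_errors / k)
    return coverages, risks


def aurc(confidences, errors):
    """Area under the risk-coverage curve as the mean risk over coverage levels."""
    _, risks = risk_coverage_curve(confidences, errors)
    return sum(risks) / len(risks)


def risk_at_coverage(confidences, errors, coverage=0.8):
    """Selective risk at the first point reaching the requested coverage."""
    coverages, risks = risk_coverage_curve(confidences, errors)
    for c, r in zip(coverages, risks):
        if c >= coverage - 1e-12:
            return r
    return risks[-1]


def coverage_at_risk(confidences, errors, max_risk=0.05):
    """Largest coverage whose selective risk does not exceed max_risk."""
    coverages, risks = risk_coverage_curve(confidences, errors)
    admissible = [c for c, r in zip(coverages, risks) if r <= max_risk]
    return max(admissible) if admissible else 0.0


confidences = [0.9, 0.8, 0.7, 0.6, 0.5]
errors = [0, 0, 1, 0, 1]  # 1 marks a misclassified sample
print(aurc(confidences, errors), risk_at_coverage(confidences, errors), coverage_at_risk(confidences, errors))
```

A good uncertainty score pushes the errors towards the low-confidence end of the ranking, which lowers the risk at every coverage level and therefore the area under the curve.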

Complexity

test/cplx/flops and test/cplx/params are also logged for reference.

Binary classification only

When num_classes == 1, three additional metrics are logged: cls/AUROC, cls/AUPR and cls/FPR95 (FPR95).

Ensemble-specific metrics

When is_ensemble=True, the routine additionally computes per-sample epistemic uncertainty estimators that exploit the diversity between ensemble members:

test/ens_Disagreement — Disagreement,
test/ens_MI — MutualInformation between the prediction and the model parameters,
test/ens_Entropy — average entropy of the per-estimator predictions.

When is_ensemble=True, test/Entropy is the entropy of the final (averaged) predictive distribution, averaged over all test samples. It follows that test/Entropy = test/ens_MI + test/ens_Entropy.
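The decomposition of the total entropy into a mutual-information term and an average per-estimator entropy can be checked numerically. The sketch below does so for a single sample in pure Python; the function names are hypothetical, and the mutual information is computed independently as the mean KL divergence between each estimator and the averaged prediction, so that the identity is verified rather than assumed.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(p_k * math.log(p_k) for p_k in p if p_k > 0)


def kl_divergence(p, q):
    """KL(p || q); q is strictly positive wherever p is."""
    return sum(p_k * math.log(p_k / q_k) for p_k, q_k in zip(p, q) if p_k > 0)


def ensemble_uncertainties(member_probs):
    """Return (total entropy, mean member entropy, mutual information) for one sample."""
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    mean_probs = [sum(m[k] for m in member_probs) / num_members for k in range(num_classes)]
    total = entropy(mean_probs)
    expected = sum(entropy(m) for m in member_probs) / num_members
    mutual_information = sum(kl_divergence(m, mean_probs) for m in member_probs) / num_members
    return total, expected, mutual_information


# Two estimators that strongly disagree on a binary problem.
total, expected, mi = ensemble_uncertainties([[0.9, 0.1], [0.1, 0.9]])
print(total, expected, mi, abs(total - (expected + mi)))
```

When all estimators agree, the mutual information vanishes and the total entropy reduces to the shared per-estimator entropy.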

Post-processing & conformal prediction

When a post_processing method (e.g., temperature scaling, Laplace) is passed, the exact same set of test metrics is recomputed on the post-processed probabilities under the test/post/ prefix. For Conformal, the routine instead logs test/post/CoverageRate (CoverageRate) and test/post/SetSize (SetSize). See the post-hoc calibration and conformal prediction tutorials for end-to-end examples.
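For conformal prediction, the two logged quantities have simple definitions, sketched here in pure Python. The helper names are hypothetical and the prediction sets are written by hand; in practice they would come from the conformal post-processing method.

```python
def coverage_rate(prediction_sets, targets):
    """Fraction of samples whose true label belongs to the prediction set."""
    return sum(y in s for s, y in zip(prediction_sets, targets)) / len(targets)


def mean_set_size(prediction_sets):
    """Average number of labels per prediction set."""
    return sum(len(s) for s in prediction_sets) / len(prediction_sets)


# Hand-written prediction sets for four samples of a 3-class problem.
prediction_sets = [{0}, {1, 2}, {0, 1, 2}, {2}]
targets = [0, 2, 1, 0]
print(coverage_rate(prediction_sets, targets), mean_set_size(prediction_sets))
```

The two numbers trade off against each other: larger sets cover the true label more often but are less informative.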

Grouping Loss (optional)

Setting eval_grouping_loss=True enables GroupingLoss, a finer-grained calibration metric that quantifies how much information is lost by reducing predictions to their confidence: see Perez-Lebel et al., ICLR 2023.

OOD detection metrics#

When eval_ood=True and the datamodule exposes an OOD dataloader, the routine builds a binary ID-vs-OOD detection task from an OOD score derived from the model. The score is selected with the ood_criterion argument (see torch_uncertainty.ood_criteria for the available criteria: MSP, max-logit, energy, mutual information, post-processing-based, etc.). The following metrics are logged under the ood/ prefix:

ood/AUROC — Area Under the ROC Curve of the binary detector.
ood/AUPR — Area Under the Precision-Recall Curve.
ood/FPR95 — FPR95, false-positive rate at 95% true-positive rate, the standard OOD-detection threshold metric.
ood/SCOD_AURC — SCODAURC, SCOD Area Under the Risk-Coverage curve.
ood/SCOD_AUGRC — SCODAUGRC, SCOD Area Under the Generalized Risk-Coverage curve.
ood/SCOD_Cov_5Risk — SCODCovAt5Risk.
ood/SCOD_Risk_80Cov — SCODRiskAt80Cov.
ood/Entropy — average entropy of the predictive distribution over OOD samples.
For ensembles, the diversity metrics above are also recomputed under the ood/ens_ prefix.

Have a look at the OOD detection tutorial for a full example.
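As a rough illustration of the threshold-free and threshold-based detection metrics, the following pure-Python sketch computes an AUROC and an FPR at 95% TPR from two lists of scores. It assumes that a higher score means "more OOD" and that OOD is the positive class; both the convention and the function names are assumptions made for the example, not a description of the library's internals.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random ID sample."""
    wins = 0.0
    for ood in ood_scores:
        for ind in id_scores:
            if ood > ind:
                wins += 1.0
            elif ood == ind:
                wins += 0.5  # ties count for half
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False-positive rate at the highest threshold that detects at least `tpr` of the OOD samples."""
    # Number of OOD samples that must be flagged to reach the target TPR.
    needed = math.ceil(tpr * len(ood_scores) - 1e-9)
    threshold = sorted(ood_scores, reverse=True)[needed - 1]
    false_positives = sum(score >= threshold for score in id_scores)
    return false_positives / len(id_scores)


id_scores = [0.1, 0.2, 0.3, 0.6]
ood_scores = [0.5, 0.7, 0.8, 0.9]
print(auroc(id_scores, ood_scores), fpr_at_tpr(id_scores, ood_scores))
```

The pairwise AUROC is quadratic in the number of samples and is only meant to expose the definition; practical implementations sort the scores once instead.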

Distribution-shift metrics#

Setting eval_shift=True evaluates the model on a shifted version of the test set (e.g., CIFAR-10-C, ImageNet-C). The full classification metric collection is recomputed under the shift/ prefix (shift/cls/Acc, shift/cal/ECE, …), along with:

shift/Entropy — average predictive entropy under shift.
shift/severity — the corruption severity reported by the datamodule.
For ensembles, the diversity metrics are also recomputed under the shift/ens_ prefix.

See the distribution-shift tutorial for context.

Segmentation#

The SegmentationRoutine reuses much of the classification machinery, applied per pixel. Because dense per-pixel storage would be prohibitive for the calibration and selective-classification metrics, those are computed on a random subsample of pixels controlled by metric_subsampling_rate (default 1e-2).

Default test metrics#

Segmentation performance (computed on every pixel)

seg/mIoU — MeanIntersectionOverUnion.
seg/mAcc — macro-averaged pixel accuracy.
seg/pixAcc — overall pixel accuracy.
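These three scores can all be derived from a single confusion matrix, as in the pure-Python sketch below. The function name is hypothetical, the pixels are given as flat lists, and classes that never appear are simply skipped in the averages, which is an assumption of this example rather than a statement about the library's behaviour.

```python
def segmentation_scores(preds, targets, num_classes):
    """Return (mIoU, macro-averaged accuracy, overall pixel accuracy)."""
    # confusion[t][p] counts pixels of true class t predicted as class p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, targets):
        confusion[t][p] += 1

    ious, class_accs = [], []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        num_true = sum(confusion[c])
        num_pred = sum(confusion[r][c] for r in range(num_classes))
        union = num_true + num_pred - true_pos
        if union > 0:
            ious.append(true_pos / union)
        if num_true > 0:
            class_accs.append(true_pos / num_true)

    pixel_acc = sum(confusion[c][c] for c in range(num_classes)) / len(targets)
    return sum(ious) / len(ious), sum(class_accs) / len(class_accs), pixel_acc


# Six pixels, two classes (illustrative values only).
preds = [0, 0, 1, 1, 1, 0]
targets = [0, 0, 1, 1, 0, 1]
print(segmentation_scores(preds, targets, num_classes=2))
```

Because the confusion matrix is a fixed-size accumulator, these scores can be updated over every pixel without storing per-pixel predictions, unlike the calibration and selective-classification metrics.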

Proper scores, calibration, selective classification (on subsampled pixels)

The following metrics are the per-pixel analogues of their classification counterparts and are evaluated on a uniform random subsample of pixels:

seg/Brier (BrierScore), seg/NLL (CategoricalNLL)
cal/ECE, cal/aECE, cal/MCE (CalibrationError), cal/SmECE (SmoothCalibrationError)
sc/AURC (AURC), sc/AUGRC (AUGRC), sc/Cov_5Risk (CovAt5Risk), sc/Risk_80Cov (RiskAt80Cov).

OOD detection metrics#

When eval_ood=True, the routine treats pixels whose target label is greater than or equal to num_classes as OOD (this matches the convention used by the MUAD datamodule). Three dense, segmentation-specific binary metrics are then computed:

ood/AUROC — SegmentationBinaryAUROC.
ood/AUPR — SegmentationBinaryAveragePrecision.
ood/FPR95 — SegmentationFPR95.

The OOD score is again controlled by ood_criterion. See the MUAD segmentation tutorial for an end-to-end example.

Distribution shift#

Distribution-shift evaluation is not implemented yet for segmentation — passing eval_shift=True will raise a NotImplementedError. Contributions are welcome.

Regression#

The RegressionRoutine supports both point-wise regression and probabilistic regression. In the latter case, the model outputs the parameters of a PyTorch Distribution (e.g., a Normal, a Laplace, a NIG, …), enabling calibration evaluation on top of the usual error metrics.

Default test metrics#

Point-wise metrics (always computed)

reg/MAE — MeanAbsoluteError.
reg/MSE — MeanSquaredError.
reg/RMSE — root mean squared error.

When the model is probabilistic, these are computed from the dist_estimate of the predictive distribution ("mean" by default, "median" and "mode" also supported).

Probabilistic metrics (when dist_family is set)

reg/NLL — DistributionNLL, the negative log-likelihood of the predictive distribution.
cal/QCE — QuantileCalibrationError, the regression analogue of the ECE based on the empirical CDF of normalized residuals.

See the probabilistic regression tutorial and the Deep Evidential Regression tutorial for worked examples.
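For intuition on the probabilistic score, the sketch below evaluates the negative log-likelihood in the special case of a Normal predictive distribution. It is a pure-Python illustration with a hypothetical function name; the routine itself works with PyTorch Distribution objects and is not limited to the Normal family.

```python
import math


def gaussian_nll(means, stds, targets):
    """Mean negative log-likelihood of the targets under N(mean, std^2)."""
    total = 0.0
    for mu, sigma, y in zip(means, stds, targets):
        total += 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)
    return total / len(targets)


# Same point prediction and target, two different predicted uncertainties.
print(gaussian_nll([0.0], [1.0], [3.0]))  # under-estimated uncertainty
print(gaussian_nll([0.0], [3.0], [3.0]))  # uncertainty matching the error
```

Unlike MAE or MSE, the NLL depends on the predicted standard deviation: for the same error, an over-confident prediction is penalised more than one whose uncertainty matches the error.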

OOD detection & distribution shift#

OOD detection and shift evaluation are not implemented yet for regression — both will raise a NotImplementedError if requested. Please open an issue if you need them.

Pixelwise Regression#

The PixelRegressionRoutine is designed for dense regression tasks (monocular depth estimation in particular) and ships with the metrics commonly reported in the depth-estimation literature.

Default test metrics#

Point-wise metrics

reg/SILog — SILog, scale-invariant log error (Eigen et al., NeurIPS 2014).
reg/log10 — Log10 error.
reg/ARE — mean absolute relative error (MeanGTRelativeAbsoluteError).
reg/RSRE — root squared relative error (MeanGTRelativeSquaredError).
reg/RMSE and reg/RMSELog — root mean squared error in linear and log space (MeanSquaredLogError).
reg/iMAE and reg/iRMSE — inverse-depth errors (MeanAbsoluteErrorInverse, MeanSquaredErrorInverse).
reg/d1, reg/d2, reg/d3 — ThresholdAccuracy, the fraction of pixels whose prediction is within \(1.25^k\) of the ground truth (the standard \(\delta_k\) accuracies).
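The \(\delta_k\) accuracies can be sketched in a few lines of pure Python. The function name is hypothetical, depths are assumed to be strictly positive, and the comparison is assumed to be strict; the library's ThresholdAccuracy may differ in such details.

```python
def threshold_accuracy(preds, targets, power=1):
    """Fraction of pixels with max(pred / gt, gt / pred) below 1.25 ** power."""
    limit = 1.25**power
    hits = sum(max(p / t, t / p) < limit for p, t in zip(preds, targets))
    return hits / len(targets)


# Four pixels with strictly positive depths (illustrative values only).
preds = [1.0, 2.0, 4.0, 1.0]
targets = [1.0, 1.5, 2.0, 1.3]
for k in (1, 2, 3):
    print(k, threshold_accuracy(preds, targets, power=k))
```

Taking the maximum of the two ratios makes the criterion symmetric, so over- and under-estimating the depth by the same factor are treated alike.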

Probabilistic metrics (when dist_family is set)

reg/NLL — DistributionNLL of the per-pixel predictive distribution. For ensembles, predictions from the different estimators are combined into a MixtureSameFamily distribution before the NLL is computed.

OOD detection & distribution shift#

As for regression, OOD detection and shift evaluation are not implemented yet for pixel regression, and both will raise a NotImplementedError if requested.

Other metrics#

A few additional metrics are implemented but not plugged into a routine by default; you can use them directly on stored predictions, or wire them into a custom routine:

AUSE — Area Under the Sparsification Error curve, a task-agnostic metric measuring how well an uncertainty score ranks samples by their true error.
AdaptiveCalibrationError, ClasswiseCalibrationError, VariationRatio, CovAtxRisk, RiskAtxCov, FPRx — variants of the default metrics, several of which (CovAtxRisk, RiskAtxCov, FPRx) take a user-controlled threshold.

See the API reference for the full list and the references page for citations.