Evaluating Models#
Evaluation is a first-class citizen of TorchUncertainty. Each routine wraps your model with a
rich set of metrics that go far beyond plain accuracy or MSE: proper scores, calibration,
selective prediction, ensemble diversity, OOD detection and shift robustness are all
computed automatically at validation and test time, and logged to your Lightning logger
(and optionally to a CSV file via the save_to_csv argument).
This page summarises which metrics are computed by default for each supported task, and
which additional metrics are enabled when you set eval_ood=True or eval_shift=True
on the corresponding routine. All metrics live in
torch_uncertainty.metrics and most of them are documented in the
API reference. For a list of the papers they come from, see the
references page.
Note
The metrics computed at validation time are a subset of those computed at test time — they are meant to monitor training and are kept lightweight. The breakdown below focuses on test-time evaluation, which is where uncertainty matters most.
Classification#
The ClassificationRoutine is the most feature-complete
routine. It is suited for both binary and multi-class classification and supports
single models, ensembles, post-hoc calibration and conformal prediction.
Default test metrics#
Whatever the model, the following metrics are always computed on the in-distribution test set:
Performance & proper scores
- cls/Acc: top-1 Accuracy.
- cls/Brier: BrierScore, a strictly proper score measuring the squared distance between the predicted probabilities and the one-hot target.
- cls/NLL: CategoricalNLL, the negative log-likelihood of the categorical predictive distribution.
- cls/Entropy: Entropy of the predictive distribution, a popular total uncertainty estimator.
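To make these definitions concrete, here is a minimal pure-Python sketch of the three scores for a single prediction. It is not the library's implementation: the helper names are hypothetical, and the Brier score here sums over classes, whereas BrierScore may use a different normalization or reduction.

```python
import math

def brier_score(probs, target):
    """Squared distance between the probabilities and the one-hot target."""
    return sum((p - (1.0 if k == target else 0.0)) ** 2 for k, p in enumerate(probs))

def categorical_nll(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])

def entropy(probs):
    """Shannon entropy (in nats) of the predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# One 3-class prediction whose true class is 0.
probs = [0.7, 0.2, 0.1]
target = 0

print(brier_score(probs, target))      # approximately 0.14
print(categorical_nll(probs, target))  # approximately 0.357
print(entropy(probs))                  # approximately 0.802
```

Averaging these per-sample values over the test set gives dataset-level numbers comparable in spirit to the logged metrics.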
Calibration
- cal/ECE: CalibrationError (Expected Calibration Error) with equal-width bins.
- cal/aECE: adaptive (equal-mass) variant of the ECE (AdaptiveCalibrationError).
- cal/MCE: Maximum Calibration Error (same metric, norm="max").
- cal/SmECE: SmoothCalibrationError, a kernel-smoothed, bin-free estimator of the calibration error.
The number of bins is controlled by num_bins_calibration_error (default 15). It is
an argument of the routine itself.
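As an illustration of what the binned estimators measure, the sketch below computes an equal-width-bin calibration error from top-label confidences. The function name and its norm argument are hypothetical stand-ins that mirror the description above, not the CalibrationError API.

```python
def calibration_error(confidences, correct, num_bins=15, norm="l1"):
    """Binned calibration error with equal-width bins on [0, 1].

    norm="l1" gives the expected (bin-weighted) gap, norm="max" the largest gap.
    """
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))

    gaps, weights = [], []
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        gaps.append(abs(accuracy - avg_conf))
        weights.append(len(members) / len(confidences))

    if norm == "max":
        return max(gaps)
    return sum(w * g for w, g in zip(weights, gaps))

# Two confident, correct predictions and two uncertain ones (one wrong).
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = calibration_error(confidences, correct)
mce = calibration_error(confidences, correct, norm="max")
```

Both bins are off by 0.05 here, so the expected and maximum errors coincide. The adaptive variant would instead choose bin edges so that each bin holds the same number of samples.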
Selective Classification
These metrics evaluate the ability of an uncertainty score to reject uncertain inputs while keeping accurate predictions.
- sc/AURC: AURC, Area Under the Risk-Coverage curve.
- sc/AUGRC: AUGRC, Area Under the Generalized Risk-Coverage curve.
- sc/Cov@5Risk: CovAt5Risk, maximum coverage at which the selective risk stays below 5%.
- sc/Risk@80Cov: RiskAt80Cov, selective risk at 80% coverage.
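The risk-coverage quantities above can be sketched in a few lines. This is a simplified illustration with hypothetical helper names; the library's metrics may integrate the curve or handle ties differently.

```python
def risk_coverage_curve(confidences, errors):
    """Return (coverage, selective risk) pairs, keeping the most confident samples first."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, cumulative_errors = [], 0
    for kept, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        curve.append((kept / len(order), cumulative_errors / kept))
    return curve

def aurc(curve):
    """Area under the curve, approximated by the mean selective risk."""
    return sum(risk for _, risk in curve) / len(curve)

def risk_at_coverage(curve, coverage):
    """Selective risk at the first operating point reaching the requested coverage."""
    return next(risk for cov, risk in curve if cov >= coverage - 1e-12)

def coverage_at_risk(curve, max_risk):
    """Largest coverage whose selective risk does not exceed max_risk (0.0 if none)."""
    return max((cov for cov, risk in curve if risk <= max_risk), default=0.0)

# errors[i] is 1 when prediction i is wrong, 0 otherwise.
confidences = [0.9, 0.8, 0.7, 0.6, 0.5]
errors = [0, 0, 1, 0, 1]
curve = risk_coverage_curve(confidences, errors)

print(aurc(curve))                    # approximately 0.197
print(risk_at_coverage(curve, 0.8))   # 0.25
print(coverage_at_risk(curve, 0.05))  # 0.4
```

A good uncertainty score pushes the errors toward the low-confidence end, which lowers the risk at every coverage level and therefore the area under the curve.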
Complexity
test/cplx/flops and test/cplx/params are also logged for reference.
Binary classification only
When num_classes == 1, three additional metrics are logged: cls/AUROC,
cls/AUPR and cls/FPR95 (FPR95).
Ensemble-specific metrics
When is_ensemble=True the routine additionally computes per-sample epistemic
uncertainty estimators that exploit the diversity between estimators:
- test/ens_Disagreement: Disagreement.
- test/ens_MI: MutualInformation between the prediction and the model parameters.
- test/ens_Entropy: entropy of the per-estimator predictions.

When is_ensemble=True, test/Entropy is the entropy of the final (averaged) predictive distribution, averaged over all test samples. It follows that test/Entropy = test/ens_MI + test/ens_Entropy.
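The decomposition of the total entropy can be checked numerically on a single sample. The sketch below assumes that the ensemble entropy is the mean of the per-estimator entropies and that the mutual information is the mean KL divergence between each estimator's prediction and the averaged prediction; the variable names are illustrative only.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Predictions of three estimators for one 3-class sample.
member_probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
]
num_members = len(member_probs)

# Averaged (final) predictive distribution of the ensemble.
mean_probs = [sum(column) / num_members for column in zip(*member_probs)]

total_entropy = entropy(mean_probs)
expected_entropy = sum(entropy(p) for p in member_probs) / num_members

# Mutual information computed independently, as a mean KL divergence.
mutual_information = sum(
    p * math.log(p / q)
    for probs in member_probs
    for p, q in zip(probs, mean_probs)
    if p > 0.0
) / num_members

print(total_entropy, mutual_information + expected_entropy)  # equal up to rounding
```

The mutual information is zero when all estimators agree and grows with their disagreement, which is why it is read as an epistemic uncertainty estimate.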
Post-processing & conformal prediction
When a post_processing method (e.g., temperature scaling, Laplace) is passed, the
exact same set of test metrics is recomputed on the post-processed probabilities under
the test/post/ prefix. For Conformal,
the routine instead logs test/post/CoverageRate
(CoverageRate) and
test/post/SetSize (SetSize). See the
post-hoc calibration and
conformal prediction tutorials
for end-to-end examples.
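For intuition, the two conformal quantities can be sketched as follows, assuming the usual definitions: the coverage rate is the fraction of prediction sets that contain the true label, and the set size is the average number of labels per set. The helper names are hypothetical and are not the CoverageRate or SetSize classes.

```python
def coverage_rate(prediction_sets, targets):
    """Fraction of samples whose prediction set contains the true label."""
    hits = sum(target in labels for labels, target in zip(prediction_sets, targets))
    return hits / len(targets)

def average_set_size(prediction_sets):
    """Mean number of labels per prediction set."""
    return sum(len(labels) for labels in prediction_sets) / len(prediction_sets)

# Four samples with their conformal prediction sets and true labels.
prediction_sets = [{0}, {1, 2}, {0, 2}, {1}]
targets = [0, 2, 1, 1]

print(coverage_rate(prediction_sets, targets))  # 0.75
print(average_set_size(prediction_sets))        # 1.5
```

The two numbers trade off against each other: larger sets raise coverage, while small sets at the target coverage indicate an informative predictor.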
Grouping Loss (optional)
Setting eval_grouping_loss=True enables
GroupingLoss, a finer-grained calibration
metric that quantifies how much information is lost by reducing predictions to their
confidence: see Perez-Lebel et al., ICLR 2023.
OOD detection metrics#
When eval_ood=True and the datamodule exposes an OOD dataloader, a binary
ID-vs-OOD detection task is constructed from an OOD score derived from the model
(controlled by the ood_criterion argument, see
torch_uncertainty.ood_criteria — MSP, max-logit, energy, mutual information,
post-processing-based, etc.). The following metrics are logged under the ood/ prefix:
- ood/AUROC: Area Under the ROC Curve of the binary detector.
- ood/AUPR: Area Under the Precision-Recall Curve.
- ood/FPR95: FPR95, false-positive rate at 95% true-positive rate, the standard OOD-detection threshold metric.
- ood/Entropy: average entropy of the predictive distribution over OOD samples.
- For ensembles, the diversity metrics above are also recomputed under the ood/ens_ prefix.
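The binary detection metrics can be illustrated from raw scores. This sketch assumes that a higher score means "more OOD" and that OOD is the positive class; conventions differ between libraries, so the orientation used by the routine should be checked before comparing numbers. The function names are hypothetical.

```python
import math

def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random ID sample."""
    wins = sum((o > i) + 0.5 * (o == i) for o in ood_scores for i in id_scores)
    return wins / (len(ood_scores) * len(id_scores))

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False-positive rate at the threshold detecting at least `tpr` of the OOD samples."""
    ranked = sorted(ood_scores, reverse=True)
    # Small epsilon guards against floating-point noise in tpr * len(ranked).
    needed = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[needed - 1]
    return sum(score >= threshold for score in id_scores) / len(id_scores)

id_scores = [0.1, 0.2, 0.3, 0.4, 0.6]
ood_scores = [0.35, 0.5, 0.7, 0.8, 0.9]

print(auroc(id_scores, ood_scores))       # 0.88
print(fpr_at_tpr(id_scores, ood_scores))  # 0.4
```

With only five OOD samples, reaching a 95% true-positive rate requires detecting all of them, so the threshold drops to the lowest OOD score and two ID samples become false positives.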
Have a look at the OOD detection tutorial for a full example.
Distribution-shift metrics#
Setting eval_shift=True evaluates the model on a shifted version of the test set
(e.g., CIFAR-10-C, ImageNet-C). The full classification metric collection is recomputed
under the shift/ prefix (shift/cls/Acc, shift/cal/ECE, …), along with:
- shift/Entropy: average predictive entropy under shift.
- shift/severity: the corruption severity reported by the datamodule.
- For ensembles, diversity metrics under the shift/ens_ prefix.
See the distribution-shift tutorial for context.
Segmentation#
The SegmentationRoutine reuses much of the
classification machinery, applied per pixel. Because dense per-pixel storage would be
prohibitive for the calibration and selective-classification metrics, those are computed
on a random subsample of pixels controlled by metric_subsampling_rate (default
1e-2).
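The idea behind the subsampling can be sketched on flattened pixels as below. This is only an illustration under simple assumptions (uniform sampling without replacement over all pixels); the routine may sample per batch or per image, and the helper name is hypothetical.

```python
import random

def subsample_pixels(probs, targets, rate=1e-2, seed=0):
    """Keep a uniformly drawn fraction `rate` of the (flattened) pixels."""
    rng = random.Random(seed)
    num_pixels = len(targets)
    num_kept = max(1, round(num_pixels * rate))
    kept = rng.sample(range(num_pixels), num_kept)
    return [probs[i] for i in kept], [targets[i] for i in kept]

# 10,000 flattened pixels with 3-class probabilities and labels.
targets = [i % 3 for i in range(10_000)]
probs = [[1 / 3, 1 / 3, 1 / 3] for _ in targets]

kept_probs, kept_targets = subsample_pixels(probs, targets, rate=1e-2)
print(len(kept_targets))  # 100
```

Calibration and selective-classification metrics that need to store every prediction then only see the kept pixels, which keeps memory bounded at the cost of some estimation noise.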
Default test metrics#
Segmentation performance (computed on every pixel)
- seg/mIoU: MeanIntersectionOverUnion.
- seg/mAcc: macro-averaged pixel accuracy.
- seg/pixAcc: overall pixel accuracy.
Proper scores, calibration, selective classification (on subsampled pixels)
The following metrics are the per-pixel analogues of their classification counterparts and are evaluated on a uniformly subsampled subset of pixels:
- seg/Brier (BrierScore), seg/NLL (CategoricalNLL)
- cal/ECE, cal/aECE, cal/MCE (CalibrationError), cal/SmECE (SmoothCalibrationError)
- sc/AURC (AURC), sc/AUGRC (AUGRC), sc/Cov@5Risk (CovAt5Risk), sc/Risk@80Cov (RiskAt80Cov).
OOD detection metrics#
When eval_ood=True, the routine treats pixels whose target label is greater than or
equal to num_classes as OOD (this matches the convention used by the MUAD
datamodule). Three dense, segmentation-specific binary metrics are then computed:
- ood/AUROC: SegmentationBinaryAUROC.
- ood/AUPR: SegmentationBinaryAveragePrecision.
- ood/FPR95: SegmentationFPR95.
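The labeling convention described above (target label greater than or equal to num_classes means OOD) can be sketched on flattened pixels as follows. The helper name is hypothetical and the example ignores any other special label values a dataset might define.

```python
def ood_pixel_labels(targets, num_classes):
    """Binary OOD labels: 1 where the target label is >= num_classes, else 0."""
    return [int(label >= num_classes) for label in targets]

# Flattened per-pixel targets for a task with 3 in-distribution classes;
# labels 3 and above mark OOD objects.
num_classes = 3
targets = [0, 2, 3, 1, 4, 2]
ood_scores = [0.1, 0.2, 0.9, 0.3, 0.8, 0.1]

labels = ood_pixel_labels(targets, num_classes)
id_scores = [s for s, is_ood in zip(ood_scores, labels) if not is_ood]
ood_only_scores = [s for s, is_ood in zip(ood_scores, labels) if is_ood]

print(labels)  # [0, 0, 1, 0, 1, 0]
```

The resulting binary labels and per-pixel scores are what a binary AUROC, AUPR or FPR95 computation would consume.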
The OOD score is again controlled by ood_criterion. See the
MUAD segmentation tutorial for an
end-to-end example.
Distribution shift#
Distribution-shift evaluation is not implemented yet for segmentation — passing
eval_shift=True will raise a NotImplementedError. Contributions are welcome.
Regression#
The RegressionRoutine supports both point-wise
regression and probabilistic regression. In the latter case, the model outputs the
parameters of a PyTorch Distribution (e.g., a Normal, a
Laplace, a NIG, …), enabling calibration evaluation on top of the usual error metrics.
Default test metrics#
Point-wise metrics (always computed)
- reg/MAE: MeanAbsoluteError.
- reg/MSE: MeanSquaredError.
- reg/RMSE: root mean squared error.
When the model is probabilistic, these are computed from the
dist_estimate of the predictive
distribution ("mean" by default, "median" and "mode" also supported).
Probabilistic metrics (when dist_family is set)
- reg/NLL: DistributionNLL, the negative log-likelihood of the predictive distribution.
- cal/QCE: QuantileCalibrationError, the regression analogue of the ECE based on the empirical CDF of normalized residuals.
See the probabilistic regression tutorial and the Deep Evidential Regression tutorial for worked examples.
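As a small worked example of the probabilistic NLL, the sketch below evaluates the negative log-likelihood of a target under a Normal predictive distribution. It assumes a Gaussian family purely for illustration, and the function name is hypothetical; DistributionNLL works with whichever distribution the model outputs.

```python
import math

def normal_nll(mean, std, target):
    """Negative log-density of `target` under Normal(mean, std)."""
    return 0.5 * math.log(2 * math.pi * std**2) + (target - mean) ** 2 / (2 * std**2)

# The same prediction error of 1.0, with a well-sized and an overconfident std.
calibrated = normal_nll(mean=0.0, std=1.0, target=1.0)
overconfident = normal_nll(mean=0.0, std=0.1, target=1.0)

print(calibrated)     # approximately 1.42
print(overconfident)  # approximately 48.6
```

Both predictions have the same point error, so MAE and MSE cannot tell them apart, while the NLL heavily penalizes the overconfident variance.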
OOD detection & distribution shift#
OOD detection and shift evaluation are not implemented yet for regression — both will
raise a NotImplementedError if requested. Please open an issue if you need them.
Pixelwise Regression#
The PixelRegressionRoutine is designed for dense
regression tasks (monocular depth estimation in particular) and ships with the metrics
commonly reported in the depth-estimation literature.
Default test metrics#
Point-wise metrics
- reg/SILog: SILog, scale-invariant log error (Eigen et al., NeurIPS 2014).
- reg/log10: Log10 error.
- reg/ARE: mean absolute relative error (MeanGTRelativeAbsoluteError).
- reg/RSRE: root squared relative error (MeanGTRelativeSquaredError).
- reg/RMSE and reg/RMSELog: root mean squared error in linear and log space (MeanSquaredLogError).
- reg/iMAE and reg/iRMSE: inverse-depth errors (MeanAbsoluteErrorInverse, MeanSquaredErrorInverse).
- reg/d1, reg/d2, reg/d3: ThresholdAccuracy, the fraction of pixels whose prediction is within a factor of \(1.25^k\) of the ground truth (the standard \(\delta_k\) accuracies).
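The \(\delta_k\) accuracies can be sketched as below, using the common definition from the depth-estimation literature: a pixel counts as correct when max(pred / gt, gt / pred) is below \(1.25^k\). The function name is hypothetical, strictly positive depths are assumed, and ThresholdAccuracy may differ in details such as the strictness of the inequality.

```python
def threshold_accuracy(preds, targets, power=1):
    """Fraction of pixels with max(pred / gt, gt / pred) < 1.25 ** power."""
    threshold = 1.25**power
    hits = sum(max(p / t, t / p) < threshold for p, t in zip(preds, targets))
    return hits / len(targets)

# Predicted and ground-truth depths; the ratios are 1.1, 1.0, 1.5 and about 1.82.
preds = [1.0, 2.0, 4.0, 10.0]
targets = [1.1, 2.0, 6.0, 5.5]

d1 = threshold_accuracy(preds, targets, power=1)
d2 = threshold_accuracy(preds, targets, power=2)
d3 = threshold_accuracy(preds, targets, power=3)
print(d1, d2, d3)  # 0.5 0.75 1.0
```

Because the thresholds grow with k, the three values are always non-decreasing from d1 to d3.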
Probabilistic metrics (when dist_family is set)
- reg/NLL: DistributionNLL of the per-pixel predictive distribution. For ensembles, predictions from the different estimators are combined into a MixtureSameFamily distribution before the NLL is computed.
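To illustrate the ensemble case, the sketch below computes the NLL of one target under a mixture of Normal components, one per estimator. Equal mixture weights and a Gaussian family are assumptions made for the example; the source only states that a MixtureSameFamily distribution is used.

```python
import math

def normal_pdf(mean, std, x):
    """Density of Normal(mean, std) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mixture_nll(means, stds, target):
    """NLL of `target` under an equally weighted mixture of Normal components."""
    densities = [normal_pdf(m, s, target) for m, s in zip(means, stds)]
    return -math.log(sum(densities) / len(densities))

# Three estimators predicting the depth of one pixel whose true depth is 2.0.
means = [1.8, 2.1, 2.6]
stds = [0.3, 0.2, 0.4]

print(mixture_nll(means, stds, target=2.0))
```

Note that the mixture NLL is computed from the averaged density, which is not the same as averaging the NLLs of the individual estimators.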
OOD detection & distribution shift#
Like regression, OOD detection and shift evaluation are not implemented yet for pixel
regression and will raise NotImplementedError.
Other metrics#
A few additional metrics are implemented but not plugged into a routine by default; you can use them directly on stored predictions, or wire them into a custom routine:
- AUSE: Area Under the Sparsification Error curve, a task-agnostic metric measuring how well an uncertainty score ranks samples by their true error.
- AdaptiveCalibrationError, ClasswiseCalibrationError, VariationRatio, CovAtxRisk, RiskAtxCov, FPRx: variants of the default metrics with user-controlled thresholds.
See the API reference for the full list and the references page for citations.
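As a rough illustration of the sparsification idea behind AUSE, the sketch below removes samples in order of decreasing uncertainty, tracks the mean error of what remains, and compares that curve to an oracle that removes samples by their true error. The helper names, the unnormalized errors and the simple averaging of the gap are assumptions; the AUSE class may normalize or integrate differently.

```python
def sparsification_curve(errors, ranking_scores):
    """Mean error of the remaining samples as the highest-scored ones are removed."""
    order = sorted(range(len(errors)), key=lambda i: -ranking_scores[i])
    ranked_errors = [errors[i] for i in order]
    return [
        sum(ranked_errors[k:]) / len(ranked_errors[k:])
        for k in range(len(ranked_errors))
    ]

def ause(errors, uncertainties):
    """Mean gap between the uncertainty-based curve and the oracle curve."""
    by_uncertainty = sparsification_curve(errors, uncertainties)
    oracle = sparsification_curve(errors, errors)
    return sum(u - o for u, o in zip(by_uncertainty, oracle)) / len(errors)

errors = [0.1, 0.5, 0.9, 0.3]

# Uncertainties that rank the samples exactly like their errors, then the reverse.
perfect = ause(errors, uncertainties=[1.0, 3.0, 4.0, 2.0])
reversed_ranking = ause(errors, uncertainties=[4.0, 2.0, 1.0, 3.0])
print(perfect, reversed_ranking)  # 0.0 and a positive value
```

A value of zero means the uncertainty score orders the samples as well as the oracle, and larger values indicate a worse ranking.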