Conformal Prediction on CIFAR-10 with TorchUncertainty

We evaluate the model’s performance both before and after applying different conformal predictors (THR, APS, RAPS), and visualize how conformal prediction estimates the prediction sets.

We use the pretrained ResNet models we provide on Hugging Face.

import matplotlib.pyplot as plt
import numpy as np
import torch
from huggingface_hub import hf_hub_download

from torch_uncertainty import TUTrainer
from torch_uncertainty.datamodules import CIFAR10DataModule
from torch_uncertainty.models.classification.resnet import resnet
from torch_uncertainty.post_processing import ConformalClsAPS, ConformalClsRAPS, ConformalClsTHR
from torch_uncertainty.routines import ClassificationRoutine

1. Load pretrained model from Hugging Face repository

We use a ResNet18 model trained on CIFAR-10, provided by the TorchUncertainty team.

ckpt_path = hf_hub_download(repo_id="torch-uncertainty/resnet18_c10", filename="resnet18_c10.ckpt")
model = resnet(in_channels=3, num_classes=10, arch=18, conv_bias=False, style="cifar")
ckpt = torch.load(ckpt_path, weights_only=True)
model.load_state_dict(ckpt)
model = model.cuda().eval()

2. Load CIFAR-10 Dataset & Define Dataloaders

We set eval_ood to True to evaluate how well the conformal scores detect out-of-distribution samples. Because the model was trained on the full training set, we use the test set both as the calibration set for the conformal methods and for their evaluation. This is not a proper way to evaluate coverage: when the same samples are used for calibration and evaluation, the measured coverage matches the target largely by construction instead of reflecting held-out performance.
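For a stricter evaluation, one option is to split the held-out data into disjoint calibration and evaluation parts. The following is a minimal sketch in plain Python, not part of the tutorial's pipeline: the helper `split_indices`, the 50/50 ratio, and the seed are illustrative assumptions.

```python
import random


def split_indices(num_samples, calib_fraction=0.5, seed=0):
    """Shuffle sample indices and split them into disjoint calibration and evaluation lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    num_calib = int(num_samples * calib_fraction)
    return indices[:num_calib], indices[num_calib:]


# 10,000 is the size of the CIFAR-10 test set; the 50/50 ratio is an arbitrary choice.
calib_idx, eval_idx = split_indices(10_000)
print(len(calib_idx), len(eval_idx), len(set(calib_idx) & set(eval_idx)))  # 5000 5000 0
```

The two index lists could then be wrapped with `torch.utils.data.Subset` to build separate calibration and evaluation dataloaders.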

BATCH_SIZE = 128

datamodule = CIFAR10DataModule(
    root="./data",
    batch_size=BATCH_SIZE,
    num_workers=8,
    eval_ood=True,
    postprocess_set="test",
)
datamodule.prepare_data()
datamodule.setup()

3. Define the Lightning Trainer

trainer = TUTrainer(accelerator="gpu", devices=1, max_epochs=5, enable_progress_bar=False)

4. Function to Visualize the Prediction Sets

def visualize_prediction_sets(inputs, labels, confidence_scores, classes, num_examples=5) -> None:
    _, axs = plt.subplots(2, num_examples, figsize=(15, 5))
    for i in range(num_examples):
        ax = axs[0, i]
        img = np.clip(
            inputs[i].permute(1, 2, 0).cpu().numpy() * datamodule.std + datamodule.mean, 0, 1
        )
        ax.imshow(img)
        ax.set_title(f"True: {classes[labels[i]]}")
        ax.axis("off")
        ax = axs[1, i]
        for j in range(len(classes)):
            ax.barh(classes[j], confidence_scores[i, j], color="blue")
        ax.set_xlim(0, 1)
        ax.set_xlabel("Confidence Score")
    plt.tight_layout()
    plt.show()

5. Estimate Prediction Sets with ConformalClsTHR

With alpha=0.01, we target a 1% error rate, i.e. prediction sets that contain the true label for about 99% of samples.
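To make the role of alpha concrete, here is a minimal sketch of threshold-based split conformal calibration in plain Python. It illustrates the general technique and is not TorchUncertainty's implementation of ConformalClsTHR; the helper names `calibrate_thr` and `predict_set_thr` and the toy probabilities are assumptions made for illustration.

```python
import math


def calibrate_thr(calib_probs, calib_labels, alpha):
    """Return the conformal threshold for the score 1 - p(true class)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(calib_probs, calib_labels))
    n = len(scores)
    # Finite-sample corrected rank of the quantile: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: fall back to including every class.
        return 1.0
    return scores[rank - 1]


def predict_set_thr(probs, q_hat):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]


# Toy calibration data: 8 samples, 3 classes, true class always 0 (illustrative values).
true_probs = [0.9, 0.8, 0.7, 0.6, 0.95, 0.5, 0.85, 0.3]
calib_probs = [[p, (1 - p) / 2, (1 - p) / 2] for p in true_probs]
calib_labels = [0] * len(true_probs)

q_hat = calibrate_thr(calib_probs, calib_labels, alpha=0.25)
print(q_hat)  # 0.5
print(predict_set_thr([0.5, 0.5, 0.0], q_hat))  # [0, 1]
```

A smaller alpha pushes the threshold toward the largest calibration scores, which admits more classes per sample in exchange for a lower error rate.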

print("[Phase 2]: ConformalClsTHR calibration")
conformal_model = ConformalClsTHR(alpha=0.01, device="cuda")

routine_thr = ClassificationRoutine(
    num_classes=10,
    model=model,
    loss=None,  # No loss needed for evaluation
    eval_ood=True,
    post_processing=conformal_model,
    ood_criterion="post_processing",
)
perf_thr = trainer.test(routine_thr, datamodule=datamodule)

[Phase 2]: ConformalClsTHR calibration

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃      Classification       ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     Acc      │          93.380%          │
│    Brier     │          0.10812          │
│   Entropy    │          0.08849          │
│     NLL      │          0.26405          │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃        Calibration        ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     ECE      │          3.537%           │
│     aECE     │          3.499%           │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃       OOD Detection       ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     AUPR     │          86.587%          │
│    AUROC     │          79.260%          │
│   Entropy    │          0.08849          │
│    FPR95     │         100.000%          │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃ Selective Classification  ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│    AUGRC     │          0.779%           │
│     AURC     │          0.959%           │
│  Cov@5Risk   │          96.510%          │
│  Risk@80Cov  │          1.200%           │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃      Post-Processing      ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ CoverageRate │          0.99000          │
│   SetSize    │          1.52340          │
└──────────────┴───────────────────────────┘
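The two post-processing metrics can be read as follows, assuming CoverageRate is the fraction of samples whose prediction set contains the true label and SetSize is the average number of classes per set. This plain-Python sketch shows that reading on toy data; the helper `coverage_and_set_size` is hypothetical and is not the library's metric code.

```python
def coverage_and_set_size(prediction_sets, labels):
    """Return (empirical coverage, average set size) for a list of prediction sets."""
    covered = sum(label in pred for pred, label in zip(prediction_sets, labels))
    total_size = sum(len(pred) for pred in prediction_sets)
    n = len(labels)
    return covered / n, total_size / n


# Five toy prediction sets over CIFAR-10 class indices (illustrative values).
sets = [[3], [8, 0], [8], [0, 2, 4], [6]]
labels = [3, 8, 1, 0, 6]

coverage, avg_size = coverage_and_set_size(sets, labels)
print(coverage, avg_size)  # 0.8 1.6
```

Under this reading, a good conformal predictor keeps coverage near 1 - alpha while keeping the average set size as close to 1 as possible.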

6. Visualization of ConformalClsTHR prediction sets

inputs, labels = next(iter(datamodule.test_dataloader()[0]))

conformal_model.cuda()
confidence_scores = conformal_model.conformal(inputs.cuda())

classes = datamodule.test.classes

visualize_prediction_sets(inputs, labels, confidence_scores[:5].cpu(), classes)

[Figure: five test images with their true labels (cat, ship, ship, airplane, frog), each above a horizontal bar chart of per-class confidence scores.]

7. Estimate Prediction Sets with ConformalClsAPS

print("[Phase 3]: ConformalClsAPS calibration")
conformal_model = ConformalClsAPS(alpha=0.01, device="cuda", enable_ts=False)

routine_aps = ClassificationRoutine(
    num_classes=10,
    model=model,
    loss=None,  # No loss needed for evaluation
    eval_ood=True,
    post_processing=conformal_model,
    ood_criterion="post_processing",
)
perf_aps = trainer.test(routine_aps, datamodule=datamodule)
conformal_model.cuda()
confidence_scores = conformal_model.conformal(inputs.cuda())
visualize_prediction_sets(inputs, labels, confidence_scores[:5].cpu(), classes)

[Phase 3]: ConformalClsAPS calibration
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃      Classification       ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     Acc      │          93.380%          │
│    Brier     │          0.10812          │
│   Entropy    │          0.08849          │
│     NLL      │          0.26405          │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃        Calibration        ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     ECE      │          3.536%           │
│     aECE     │          3.499%           │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃       OOD Detection       ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     AUPR     │          84.769%          │
│    AUROC     │          77.035%          │
│   Entropy    │          0.08849          │
│    FPR95     │         100.000%          │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃ Selective Classification  ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│    AUGRC     │          0.779%           │
│     AURC     │          0.959%           │
│  Cov@5Risk   │          96.510%          │
│  Risk@80Cov  │          1.200%           │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃      Post-Processing      ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ CoverageRate │          0.99070          │
│   SetSize    │          1.81040          │
└──────────────┴───────────────────────────┘
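As a rough illustration of how Adaptive Prediction Sets differ from plain thresholding, the sketch below computes a non-randomized APS score in plain Python: each class is scored by the probability mass of all classes ranked at or above it. This is an assumption-laden sketch of the general technique, not the code behind ConformalClsAPS; `aps_scores`, `predict_set_aps`, and the threshold 0.8 are illustrative.

```python
def aps_scores(probs):
    """Per-class APS score: cumulative probability of all classes ranked at or above it."""
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    scores = [0.0] * len(probs)
    cumulative = 0.0
    for k in order:
        cumulative += probs[k]
        scores[k] = cumulative
    return scores


def predict_set_aps(probs, q_hat):
    """Keep every class whose cumulative-mass score does not exceed the threshold."""
    return [k for k, s in enumerate(aps_scores(probs)) if s <= q_hat]


probs = [0.5, 0.25, 0.125, 0.125]
print(aps_scores(probs))            # [0.5, 0.75, 0.875, 1.0]
print(predict_set_aps(probs, 0.8))  # [0, 1]
```

In practice the threshold would come from the same conformal quantile rule as for THR, applied to the APS scores of the true classes on the calibration set. Because the score accumulates mass across ranked classes, APS sets tend to grow on ambiguous inputs, which is consistent with the larger SetSize reported above compared with THR.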

8. Estimate Prediction Sets with ConformalClsRAPS

print("[Phase 4]: ConformalClsRAPS calibration")
conformal_model = ConformalClsRAPS(
    alpha=0.01, regularization_rank=3, penalty=0.002, model=model, device="cuda", enable_ts=False
)

routine_raps = ClassificationRoutine(
    num_classes=10,
    model=model,
    loss=None,  # No loss needed for evaluation
    eval_ood=True,
    post_processing=conformal_model,
    ood_criterion="post_processing",
)
perf_raps = trainer.test(routine_raps, datamodule=datamodule)
conformal_model.cuda()
confidence_scores = conformal_model.conformal(inputs.cuda())
visualize_prediction_sets(inputs, labels, confidence_scores[:5].cpu(), classes)

[Phase 4]: ConformalClsRAPS calibration
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃      Classification       ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     Acc      │          93.380%          │
│    Brier     │          0.10812          │
│   Entropy    │          0.08849          │
│     NLL      │          0.26405          │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃        Calibration        ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     ECE      │          3.537%           │
│     aECE     │          3.499%           │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃       OOD Detection       ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│     AUPR     │          85.633%          │
│    AUROC     │          77.626%          │
│   Entropy    │          0.08849          │
│    FPR95     │         100.000%          │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃ Selective Classification  ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│    AUGRC     │          0.779%           │
│     AURC     │          0.959%           │
│  Cov@5Risk   │          96.510%          │
│  Risk@80Cov  │          1.200%           │
└──────────────┴───────────────────────────┘
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric  ┃      Post-Processing      ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ CoverageRate │          0.99020          │
│   SetSize    │          1.66090          │
└──────────────┴───────────────────────────┘
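The regularization in RAPS can be sketched as a rank penalty added to the cumulative probability mass of each class: classes ranked below a cutoff pay an extra cost, which discourages long tails in the prediction set. The plain-Python sketch below assumes that regularization_rank and penalty play the roles of the rank cutoff and penalty weight in the standard RAPS formulation; it is not the library's implementation, and `raps_scores` is a hypothetical helper.

```python
def raps_scores(probs, regularization_rank=3, penalty=0.002):
    """Cumulative-mass score plus penalty * max(0, rank - regularization_rank)."""
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    scores = [0.0] * len(probs)
    cumulative = 0.0
    for rank, k in enumerate(order, start=1):  # rank 1 is the most probable class
        cumulative += probs[k]
        scores[k] = cumulative + penalty * max(0, rank - regularization_rank)
    return scores


probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
scores = raps_scores(probs)
# The top three classes are unpenalized; ranks 4 and 5 add 0.002 and 0.004.
print([round(s, 4) for s in scores])  # [0.5, 0.75, 0.875, 0.9395, 1.004]
```

With the same threshold, the penalized low-rank classes are less likely to enter the set, which is consistent with the RAPS SetSize above falling between the THR and APS values.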

Summary

In this tutorial, we applied conformal prediction to a pretrained ResNet18 on CIFAR-10. We evaluated three methods: Thresholding (THR), Adaptive Prediction Sets (APS), and Regularized APS (RAPS). For each, we calibrated on the test set (a shortcut that does not give a proper coverage estimate), evaluated OOD detection performance, and visualized the prediction sets. You can explore further by adjusting alpha, changing the model, or testing on other datasets.

Total running time of the script: (1 minutes 0.003 seconds)
