PackedTransformerDecoderLayer

class torch_uncertainty.layers.PackedTransformerDecoderLayer(d_model, nhead, alpha, num_estimators, gamma=1, dim_feedforward=2048, dropout=0.1, activation=<function relu>, layer_norm_eps=1e-05, batch_first=False, norm_first=False, first=False, last=False, bias=True, device=None, dtype=None)

Packed-Ensembles-style TransformerDecoderLayer (made up of self-attention, multi-head attention, and feedforward network).
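To give an intuition for what "packed" means here, the following is a minimal sketch of a grouped linear map in plain PyTorch, in which each estimator only sees its own slice of the features. It is an illustration under stated assumptions, not the library's implementation: the helper `grouped_linear` and all sizes are hypothetical, and the roles of `alpha` and `gamma` are not modelled.

```python
import torch


def grouped_linear(x, weight):
    """Apply one independent linear map per group (hypothetical helper).

    x:      (batch, groups * in_per_group)
    weight: (groups, in_per_group, out_per_group)
    returns (batch, groups * out_per_group)
    """
    groups, in_per_group, out_per_group = weight.shape
    batch = x.shape[0]
    # Split the feature axis into one chunk per group.
    x = x.reshape(batch, groups, in_per_group)
    # Each group is multiplied by its own weight matrix only.
    y = torch.einsum("bgi,gio->bgo", x, weight)
    return y.reshape(batch, groups * out_per_group)


torch.manual_seed(0)
num_estimators = 4  # illustrative value
in_per_group, out_per_group = 8, 8
weight = torch.randn(num_estimators, in_per_group, out_per_group)
x = torch.randn(2, num_estimators * in_per_group)
y = grouped_linear(x, weight)

# Perturb only the features that belong to estimator 0.
x_perturbed = x.clone()
x_perturbed[:, :in_per_group] += 1.0
y_perturbed = grouped_linear(x_perturbed, weight)
```

In this sketch only the first `out_per_group` output features differ between `y` and `y_perturbed`, which is the independence property that lets several estimators share one forward pass.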

Parameters:

d_model (int) – the number of expected features in the input.
nhead (int) – the number of heads in the multi-head attention modules.
alpha (float) – the width multiplier of the layer.
num_estimators (int) – the number of estimators packed in the layer.
gamma (int, optional) – the number of groups within each estimator. Defaults to 1.
dim_feedforward (int, optional) – the dimension of the feedforward network model. Defaults to 2048.
dropout (float, optional) – the dropout value. Defaults to 0.1.
activation (Callable[[Tensor], Tensor], optional) – the activation function of the intermediate layer, which must be a unary callable. Defaults to F.relu.
layer_norm_eps (float, optional) – the eps value in layer normalization components. Defaults to 1e-5.
bias (bool, optional) – If False, Linear and LayerNorm layers will not learn an additive bias. Defaults to True.
batch_first (bool, optional) – If True, the input and output tensors are provided as \((\text{batch}, \text{seq}, \text{d_model})\). Defaults to False, i.e. \((\text{seq}, \text{batch}, \text{d_model})\).
norm_first (bool, optional) – If True, the layer norm is done prior to attention and feedforward operations, respectively. Otherwise, it is done after. Defaults to False.
first (bool, optional) – Whether this is the first layer of the network. Defaults to False.
last (bool, optional) – Whether this is the last layer of the network. Defaults to False.
device (torch.device, optional) – The device to use for the layer’s parameters. Defaults to None.
dtype (torch.dtype, optional) – The dtype to use for the layer’s parameters. Defaults to None.
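The effect of norm_first can be sketched with a single residual sub-block in plain PyTorch. This is a simplified illustration of the ordering described above, not the layer's actual code: the helper `residual_block` is hypothetical, a plain `nn.Linear` stands in for the attention or feedforward sub-layer, and dropout is omitted.

```python
import torch
from torch import nn


def residual_block(x, sublayer, norm, norm_first):
    """One residual sub-block with pre-norm or post-norm ordering."""
    if norm_first:
        # Pre-norm: normalize the sub-layer input, leave the residual path untouched.
        return x + sublayer(norm(x))
    # Post-norm: add the residual first, then normalize the sum.
    return norm(x + sublayer(x))


torch.manual_seed(0)
d_model = 16
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or feedforward
x = torch.randn(2, 5, d_model)

pre = residual_block(x, sublayer, norm, norm_first=True)
post = residual_block(x, sublayer, norm, norm_first=False)
```

With the default (post-norm) ordering, the block output is itself normalized; with norm_first=True, the residual stream is passed through unnormalized.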

Reference:

Attention Is All You Need: Original Multihead Attention formulation.
Hierarchical Light Transformer Ensembles for Multimodal Trajectory Forecasting: Packed-Ensembles-style Multihead Attention formulation.

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, tgt_is_causal=False, memory_is_causal=False)

Pass the inputs (and optional masks) through the decoder layer.

Parameters:

tgt (Tensor) – The sequence to the decoder layer. Shape: \((B, L, E)\) or \((L, B, E)\).
memory (Tensor) – The sequence from the last layer of the encoder. Shape: \((B, S, E)\) or \((S, B, E)\).
tgt_mask (Tensor | None, optional) – The mask for the tgt sequence. Defaults to None.
memory_mask (Tensor | None, optional) – The mask for the memory sequence. Defaults to None.
tgt_key_padding_mask (Tensor | None, optional) – The mask for the tgt keys per batch. Defaults to None.
memory_key_padding_mask (Tensor | None, optional) – The mask for the memory keys per batch. Defaults to None.
tgt_is_causal (bool, optional) – If True, applies a causal mask as tgt_mask. Defaults to False. Warning: tgt_is_causal is only a hint that tgt_mask is a causal mask. Providing an incorrect hint can result in incorrect execution and may affect forward and backward compatibility.
memory_is_causal (bool, optional) – If True, applies a causal mask as memory_mask. Defaults to False. Warning: memory_is_causal is only a hint that memory_mask is a causal mask. Providing an incorrect hint can result in incorrect execution and may affect forward and backward compatibility.
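As an illustration of how the mask arguments are typically built, the sketch below uses `torch.nn.TransformerDecoderLayer` as a stand-in, since its forward() takes the same argument names. It is an assumption that the packed layer interprets masks the same way; the sizes and the helper `run` are purely illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, L, S, E = 2, 5, 7, 16  # batch, target length, memory length, features

# Stand-in with the same forward() arguments; NOT the packed layer itself.
layer = nn.TransformerDecoderLayer(
    d_model=E, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True
).eval()

tgt = torch.randn(B, L, E)
memory = torch.randn(B, S, E)

# Causal float mask of shape (L, L): -inf above the diagonal hides future positions.
tgt_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

# Boolean key padding mask of shape (B, S): True marks memory positions to ignore.
memory_key_padding_mask = torch.zeros(B, S, dtype=torch.bool)
memory_key_padding_mask[1, -2:] = True


def run(tgt_in, memory_in):
    """Forward pass with the masks defined above (illustrative helper)."""
    with torch.no_grad():
        return layer(
            tgt_in,
            memory_in,
            tgt_mask=tgt_mask,
            memory_key_padding_mask=memory_key_padding_mask,
        )


out = run(tgt, memory)

# Replacing the last target position must not affect earlier positions.
tgt_changed = tgt.clone()
tgt_changed[:, -1] = torch.randn(B, E)
out_tgt_changed = run(tgt_changed, memory)

# Changing padded memory positions must not affect the output at all.
memory_changed = memory.clone()
memory_changed[1, -2:] += 1.0
out_memory_changed = run(tgt, memory_changed)
```

The two perturbation checks show what each mask guarantees: the causal mask isolates earlier target positions from later ones, and the key padding mask removes the padded memory entries from attention.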

Returns:

The output of the decoder layer. Shape: \((B, L, E)\) or \((L, B, E)\).

Return type:

Tensor
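The two output layouts can be checked with a short sketch. It again uses `torch.nn.TransformerDecoderLayer` as a stand-in, on the assumption that the packed layer follows the same batch_first convention; the helper `decoder_output_shape` and the sizes are hypothetical.

```python
import torch
from torch import nn


def decoder_output_shape(batch_first, B=2, L=5, S=7, E=16):
    """Run a stand-in decoder layer and return its output shape as a tuple."""
    layer = nn.TransformerDecoderLayer(
        d_model=E, nhead=4, dim_feedforward=32, batch_first=batch_first
    )
    if batch_first:
        tgt, memory = torch.randn(B, L, E), torch.randn(B, S, E)
    else:
        tgt, memory = torch.randn(L, B, E), torch.randn(S, B, E)
    # The output keeps the layout and shape of tgt, not of memory.
    return tuple(layer(tgt, memory).shape)


print(decoder_output_shape(batch_first=True))   # (B, L, E) layout
print(decoder_output_shape(batch_first=False))  # (L, B, E) layout
```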
