PackedTransformerDecoderLayer
- class torch_uncertainty.layers.PackedTransformerDecoderLayer(d_model, nhead, alpha, num_estimators, gamma=1, dim_feedforward=2048, dropout=0.1, activation=<function relu>, layer_norm_eps=1e-05, batch_first=False, norm_first=False, first=False, last=False, bias=True, device=None, dtype=None)
Packed-Ensembles-style TransformerDecoderLayer (made up of self-attention, multi-head attention, and feedforward network).
- Parameters:
d_model (int) – the number of expected features in the input.
nhead (int) – the number of heads in the multi-head attention models.
alpha (float) – the width multiplier of the layer.
num_estimators (int) – the number of estimators packed in the layer.
gamma (int) – Defaults to 1.
dim_feedforward (int) – the dimension of the feedforward network model. Defaults to 2048.
dropout (float) – the dropout value. Defaults to 0.1.
activation (Callable[[Tensor], Tensor]) – the activation function of the intermediate layer, which must be a unary callable. Defaults to F.relu.
layer_norm_eps (float) – the eps value in layer normalization components. Defaults to 1e-5.
bias (bool) – If False, Linear and LayerNorm layers will not learn an additive bias. Defaults to True.
batch_first (bool) – If True, then the input and output tensors are provided as \((\text{batch}, \text{seq}, \text{d_model})\). Defaults to False, i.e. \((\text{seq}, \text{batch}, \text{d_model})\).
norm_first (bool) – If True, the layer norm is done prior to attention and feedforward operations, respectively. Otherwise, it is done after. Defaults to False.
first (bool) – Whether this is the first layer of the network. Defaults to False.
last (bool) – Whether this is the last layer of the network. Defaults to False.
device – The device to use for the layer’s parameters. Defaults to None.
dtype – The dtype to use for the layer’s parameters. Defaults to None.
- Reference:
Attention Is All You Need: Original Multihead Attention formulation.
Hierarchical Light Transformer Ensembles for Multimodal Trajectory Forecasting: Packed-Ensembles-style Multihead Attention formulation.
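To illustrate the two layouts selected by batch_first, here is a minimal sketch. It deliberately uses PyTorch's standard torch.nn.TransformerDecoderLayer as a stand-in, since that layer shares the d_model, nhead, dim_feedforward, dropout and batch_first arguments. The packed layer additionally requires alpha and num_estimators, and the feature widths it expects may differ from this stand-in, so treat the sizes below as illustrative assumptions only.

```python
import torch
from torch import nn

# Illustrative sizes (assumptions, not taken from the documentation above).
d_model, nhead = 32, 4
batch, tgt_len, mem_len = 2, 5, 7

# Stand-in: the standard (non-packed) PyTorch decoder layer.
layer_bf = nn.TransformerDecoderLayer(
    d_model, nhead, dim_feedforward=64, dropout=0.0, batch_first=True
)
layer_sf = nn.TransformerDecoderLayer(
    d_model, nhead, dim_feedforward=64, dropout=0.0, batch_first=False
)

# batch_first=True: tensors are laid out as (batch, seq, d_model).
tgt_bf = torch.randn(batch, tgt_len, d_model)
mem_bf = torch.randn(batch, mem_len, d_model)
out_bf = layer_bf(tgt_bf, mem_bf)

# batch_first=False (the default): tensors are laid out as (seq, batch, d_model).
tgt_sf = tgt_bf.transpose(0, 1)
mem_sf = mem_bf.transpose(0, 1)
out_sf = layer_sf(tgt_sf, mem_sf)

print(out_bf.shape)
print(out_sf.shape)
```

In both cases the output keeps the layout and sequence length of tgt, while memory may have a different sequence length.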
- forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, tgt_is_causal=False, memory_is_causal=False)
Pass the input (and mask) through the decoder layer.
- Parameters:
tgt (Tensor) – The sequence to the decoder layer. Shape: \((B, L, E)\) or \((L, B, E)\).
memory (Tensor) – The sequence from the last layer of the encoder. Shape: \((B, S, E)\) or \((S, B, E)\).
tgt_mask (Tensor | None) – The mask for the tgt sequence. Defaults to None.
memory_mask (Tensor | None) – The mask for the memory sequence. Defaults to None.
tgt_key_padding_mask (Tensor | None) – The mask for the tgt keys per batch. Defaults to None.
memory_key_padding_mask (Tensor | None) – The mask for the memory keys per batch. Defaults to None.
tgt_is_causal (bool) – If specified, applies a causal mask as tgt_mask. Defaults to False. Warning: tgt_is_causal provides a hint that tgt_mask is a causal mask. Providing incorrect hints can result in incorrect execution, including forward and backward compatibility.
memory_is_causal (bool) – If specified, applies a causal mask as memory_mask. Defaults to False. Warning: memory_is_causal provides a hint that memory_mask is a causal mask. Providing incorrect hints can result in incorrect execution, including forward and backward compatibility.
- Returns:
The output of the decoder layer. Shape: \((B, L, E)\) or \((L, B, E)\).
- Return type:
Tensor
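The causal mask that tgt_is_causal refers to can be pictured with a small plain-Python sketch. The helper name causal_mask is hypothetical and is not part of the library. It follows the additive float-mask convention used by PyTorch attention layers, where 0.0 lets a position be attended to and negative infinity blocks it.

```python
def causal_mask(size):
    """Build a size x size additive causal mask as nested lists.

    Hypothetical helper for illustration only: row i is the query position,
    column j is the key position. Position i may attend to j only if j <= i.
    """
    neg_inf = float("-inf")
    return [
        [0.0 if j <= i else neg_inf for j in range(size)]
        for i in range(size)
    ]


mask = causal_mask(3)
for row in mask:
    print(row)
```

In PyTorch, a tensor with the same pattern can be obtained from torch.nn.Transformer.generate_square_subsequent_mask and passed as tgt_mask. Setting tgt_is_causal=True is then only a hint that the supplied mask has this form.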