PackedTransformerEncoderLayer¶
- class torch_uncertainty.layers.PackedTransformerEncoderLayer(d_model, nhead, alpha, num_estimators, gamma=1, dim_feedforward=2048, dropout=0.1, activation=<function relu>, layer_norm_eps=1e-05, bias=True, batch_first=False, norm_first=False, first=False, last=False, device=None, dtype=None)[source]¶
Packed-Ensembles-style TransformerEncoderLayer (made up of self-attention followed by a feedforward network).
- Parameters:
d_model (int) – the number of expected features in the input.
nhead (int) – the number of heads in the multiheadattention models.
alpha (float) – the width multiplier of the layer.
num_estimators (int) – the number of estimators packed in the layer.
gamma (int, optional) – Defaults to 1.
dim_feedforward (int, optional) – the dimension of the feedforward network model. Defaults to 2048.
dropout (float, optional) – the dropout value. Defaults to 0.1.
activation (Callable[[Tensor], Tensor], optional) – the activation function of the intermediate layer, that is a unary callable. Defaults to F.relu.
layer_norm_eps (float, optional) – the eps value in layer normalization components. Defaults to 1e-5.
bias (bool, optional) – If False, Linear and LayerNorm layers will not learn an additive bias. Defaults to True.
batch_first (bool, optional) – If True, the input and output tensors are provided as \((\text{batch}, \text{seq}, \text{d_model})\); otherwise they are provided as \((\text{seq}, \text{batch}, \text{d_model})\). Defaults to False.
norm_first (bool, optional) – If True, layer norm is done prior to the attention and feedforward operations, respectively. Otherwise, it is done afterwards. Defaults to False.
first (bool, optional) – Whether this is the first layer of the network. Defaults to False.
last (bool, optional) – Whether this is the last layer of the network. Defaults to False.
device (torch.device, optional) – The device to use for the layer’s parameters. Defaults to None.
dtype (torch.dtype, optional) – The dtype to use for the layer’s parameters. Defaults to None.
- Reference:
Attention Is All You Need: Original Multihead Attention formulation.
Hierarchical Light Transformer Ensembles for Multimodal Trajectory Forecasting: Packed-Ensembles-style Multihead Attention formulation.
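A minimal construction sketch is shown below. The argument values (d_model=64, nhead=4, alpha=2, num_estimators=4) are illustrative choices rather than defaults, and setting both first and last to True is an assumption made so that a single layer can be used standalone.

```python
import torch
from torch_uncertainty.layers import PackedTransformerEncoderLayer

# Illustrative sketch: one packed encoder layer used standalone, so it is
# marked as both the first and the last layer of the (single-layer) network.
layer = PackedTransformerEncoderLayer(
    d_model=64,         # number of expected features in the input
    nhead=4,            # number of attention heads
    alpha=2,            # width multiplier of the layer
    num_estimators=4,   # number of estimators packed in the layer
    batch_first=True,   # inputs provided as (batch, seq, d_model)
    first=True,         # assumption: this layer receives the raw embeddings
    last=True,          # assumption: this layer produces the final features
)
```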
- forward(src, src_mask=None, src_key_padding_mask=None, is_causal=False)[source]¶
Pass the input through the encoder layer.
- Parameters:
src (Tensor) – The sequence to the encoder layer. Shape: \((B, L, E)\) or \((L, B, E)\).
src_mask (Tensor | None, optional) – The mask for the src sequence. Defaults to None.
src_key_padding_mask (Tensor | None, optional) – The mask for the src keys per batch. Defaults to None.
is_causal (bool, optional) – If specified, applies a causal mask as src_mask. Defaults to False. Warning: is_causal provides a hint that src_mask is a causal mask. Providing incorrect hints can result in incorrect execution, including forward and backward compatibility.
- Returns:
The output of the encoder layer. Shape: \((B, L, E)\) or \((L, B, E)\).
- Return type:
Tensor
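As a hedged sketch of a forward pass, reusing the layer and import from the construction example above: the batch size and sequence length are illustrative, and the use of torch.nn.Transformer.generate_square_subsequent_mask as a causal src_mask assumes the packed attention accepts the same mask format as the standard torch.nn.TransformerEncoderLayer.

```python
# The layer above was built with batch_first=True, so src follows the
# (batch, seq, d_model) convention, i.e. (B, L, E).
src = torch.rand(8, 16, 64)   # B=8, L=16, E=d_model=64
out = layer(src)              # same (B, L, E) layout, as documented above

# Optional: request causal self-attention by passing a causal mask as
# src_mask together with the is_causal hint (assumed mask format).
causal_mask = torch.nn.Transformer.generate_square_subsequent_mask(16)
out_causal = layer(src, src_mask=causal_mask, is_causal=True)
```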