torchcvnn.models¶
Vision Transformers¶
We provide some predefined ViT models. Their configurations are listed below.
| Model name | Layers | Heads | Hidden dimension | MLP dimension | Dropout | Attention dropout | Norm layer |
|---|---|---|---|---|---|---|---|
| vit_t | 12 | 3 | 192 | 768 | 0.0 | 0.0 | RMSNorm |
| vit_s | 12 | 6 | 384 | 1536 | 0.0 | 0.0 | RMSNorm |
| vit_b | 12 | 12 | 768 | 3072 | 0.0 | 0.0 | RMSNorm |
| vit_l | 24 | 16 | 1024 | 4096 | 0.0 | 0.0 | RMSNorm |
| vit_h | 32 | 16 | 1280 | 5120 | 0.0 | 0.0 | RMSNorm |
- torchcvnn.models.vit_t(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) Module [source]¶
Builds a ViT tiny model.
- Parameters:
patch_embedder – PatchEmbedder instance.
device – Device to use.
dtype – Data type to use.
The patch_embedder computes the patch embeddings and, if required, adds the positional encoding.
It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\), where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the hidden dimension expected by the transformer (see the table above). A minimal usage sketch follows.
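As a rough illustration, the sketch below builds the tiny model with a hand-rolled convolutional patch embedder. The ConvPatchEmbedder class is a hypothetical stand-in written for this example (if torchcvnn ships its own PatchEmbedder, prefer that); it assumes a PyTorch version whose nn.Conv2d accepts complex64 weights and inputs, omits positional encoding for brevity, and assumes the returned model applies the embedder internally to a raw \((B, C, H, W)\) input.

```python
import torch
import torch.nn as nn

import torchcvnn.models as models


class ConvPatchEmbedder(nn.Module):
    """Hypothetical patch embedder: a strided complex-valued convolution that
    maps (B, C, H, W) to (B, hidden_dim, N_h, N_w). No positional encoding."""

    def __init__(self, in_channels: int, hidden_dim: int, patch_size: int):
        super().__init__()
        self.proj = nn.Conv2d(
            in_channels,
            hidden_dim,
            kernel_size=patch_size,
            stride=patch_size,
            dtype=torch.complex64,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


# vit_t expects hidden_dim = 192 (see the table above).
embedder = ConvPatchEmbedder(in_channels=3, hidden_dim=192, patch_size=16)
model = models.vit_t(embedder)

x = torch.randn(1, 3, 224, 224, dtype=torch.complex64)
out = model(x)  # output shape depends on the model's head/pooling
```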
- torchcvnn.models.vit_s(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) Module [source]¶
Builds a ViT small model.
- Parameters:
patch_embedder – PatchEmbedder instance.
device – Device to use.
dtype – Data type to use.
The patch_embedder computes the patch embeddings and, if required, adds the positional encoding.
It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\), where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the hidden dimension expected by the transformer (see the table above).
- torchcvnn.models.vit_b(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) Module [source]¶
Builds a ViT base model.
- Parameters:
patch_embedder – PatchEmbedder instance.
device – Device to use.
dtype – Data type to use.
The patch_embedder computes the patch embeddings and, if required, adds the positional encoding.
It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\), where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the hidden dimension expected by the transformer (see the table above).
- torchcvnn.models.vit_l(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) Module [source]¶
Builds a ViT large model.
- Parameters:
patch_embedder – PatchEmbedder instance.
device – Device to use.
dtype – Data type to use.
The patch_embedder computes the patch embeddings and, if required, adds the positional encoding.
It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\), where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the hidden dimension expected by the transformer (see the table above).
- torchcvnn.models.vit_h(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) Module [source]¶
Builds a ViT huge model.
- Parameters:
patch_embedder – PatchEmbedder instance.
device – Device to use.
dtype – Data type to use.
The patch_embedder computes the patch embeddings and, if required, adds the positional encoding.
It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\), where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the hidden dimension expected by the transformer (see the table above).
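All five builders share the same signature, so switching model sizes only requires matching the embedder's output channels to the hidden dimension listed in the table. As a hedged sketch (assuming the builders accept any Module realizing the \((B, C, H, W) \to (B, hidden\_dim, N_h, N_w)\) mapping), a strided complex Conv2d can even serve directly as a positional-encoding-free patch embedder:

```python
import torch
import torch.nn as nn

import torchcvnn.models as models

# vit_b expects hidden_dim = 768 (see the table above).
# Assumption: a bare strided Conv2d is accepted as the patch embedder.
embedder = nn.Conv2d(3, 768, kernel_size=16, stride=16, dtype=torch.complex64)
model = models.vit_b(embedder, device=torch.device("cpu"), dtype=torch.complex64)
```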