torchcvnn.models

Vision Transformers

We provide some predefined ViT models. Their configurations are listed below.

ViT model configurations

Model name | Layers | Heads | Hidden dimension | MLP dimension | Dropout | Attention dropout | Norm layer
---------- | ------ | ----- | ---------------- | ------------- | ------- | ----------------- | ----------
vit_t      | 12     | 3     | 192              | 768           | 0.0     | 0.0               | RMSNorm
vit_s      | 12     | 6     | 384              | 1536          | 0.0     | 0.0               | RMSNorm
vit_b      | 12     | 12    | 768              | 3072          | 0.0     | 0.0               | RMSNorm
vit_l      | 24     | 16    | 1024             | 4096          | 0.0     | 0.0               | RMSNorm
vit_h      | 32     | 16    | 1280             | 5120          | 0.0     | 0.0               | RMSNorm

torchcvnn.models.vit_t(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT tiny model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.
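A minimal usage sketch follows. Only the vit_t builder and the \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) contract come from this documentation; the LinearPatchEmbedder class, its in_channels/patch_size arguments, and the 224x224 complex input are illustrative assumptions, not part of torchcvnn.

    import torch
    import torch.nn as nn

    import torchcvnn.models as models


    class LinearPatchEmbedder(nn.Module):
        # Hypothetical embedder (not provided by torchcvnn): splits the image
        # into non-overlapping patches and projects each patch with a
        # complex-valued linear layer, mapping (B, C, H, W) to
        # (B, hidden_dim, N_h, N_w) as required above.
        def __init__(self, in_channels, patch_size, hidden_dim, dtype=torch.complex64):
            super().__init__()
            self.patch_size = patch_size
            self.proj = nn.Linear(in_channels * patch_size**2, hidden_dim, dtype=dtype)

        def forward(self, x):
            B, C, H, W = x.shape
            p = self.patch_size
            n_h, n_w = H // p, W // p
            # (B, C, H, W) -> (B, N_h, N_w, C * p * p)
            x = x.reshape(B, C, n_h, p, n_w, p).permute(0, 2, 4, 1, 3, 5)
            x = x.reshape(B, n_h, n_w, C * p * p)
            x = self.proj(x)              # (B, N_h, N_w, hidden_dim)
            return x.permute(0, 3, 1, 2)  # (B, hidden_dim, N_h, N_w)


    # The embedding dimension must match vit_t's hidden dimension (192, see the table above).
    embedder = LinearPatchEmbedder(in_channels=3, patch_size=16, hidden_dim=192)
    model = models.vit_t(embedder)

    x = torch.randn(2, 3, 224, 224, dtype=torch.complex64)
    features = model(x)  # output shape/head depends on the torchcvnn implementation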

torchcvnn.models.vit_s(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT small model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.

torchcvnn.models.vit_b(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT base model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.

torchcvnn.models.vit_l(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT large model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.

torchcvnn.models.vit_h(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT huge model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.
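The same pattern applies to every builder above; only the embedding dimension produced by the patch embedder changes. The snippet below is a sketch that reuses the hypothetical LinearPatchEmbedder from the vit_t example and takes the hidden dimensions from the configuration table; it is not part of the library API.

    import torchcvnn.models as models

    # Hidden dimensions from the configuration table above. LinearPatchEmbedder
    # is the hypothetical class defined in the vit_t example.
    configs = {
        models.vit_s: 384,
        models.vit_b: 768,
        models.vit_l: 1024,
        models.vit_h: 1280,
    }

    for builder, hidden_dim in configs.items():
        embedder = LinearPatchEmbedder(in_channels=3, patch_size=16, hidden_dim=hidden_dim)
        model = builder(embedder)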