torchcvnn.models

Vision Transformers

We provide some predefined ViT models. Their configurations are listed below.

ViT model configurations

Model name | Layers | Heads | Hidden dimension | MLP dimension | Dropout | Attention dropout | Norm layer
---------- | ------ | ----- | ---------------- | ------------- | ------- | ----------------- | ----------
vit_t      | 12     | 3     | 192              | 768           | 0.0     | 0.0               | RMSNorm
vit_s      | 12     | 6     | 384              | 1536          | 0.0     | 0.0               | RMSNorm
vit_b      | 12     | 12    | 768              | 3072          | 0.0     | 0.0               | RMSNorm
vit_l      | 24     | 16    | 1024             | 4096          | 0.0     | 0.0               | RMSNorm
vit_h      | 32     | 16    | 1280             | 5120          | 0.0     | 0.0               | RMSNorm

torchcvnn.models.vit_t(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT tiny model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.
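A minimal usage sketch follows. Only the vit_t builder and the \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) contract come from this documentation; the LinearPatchEmbedder class, its in_channels/patch_size arguments, and the 224x224 complex input are illustrative assumptions, not part of torchcvnn.

    import torch
    import torch.nn as nn

    import torchcvnn.models as models


    class LinearPatchEmbedder(nn.Module):
        # Hypothetical embedder (not provided by torchcvnn): splits the image
        # into non-overlapping patches and projects each patch with a
        # complex-valued linear layer, mapping (B, C, H, W) to
        # (B, hidden_dim, N_h, N_w) as required above.
        def __init__(self, in_channels, patch_size, hidden_dim, dtype=torch.complex64):
            super().__init__()
            self.patch_size = patch_size
            self.proj = nn.Linear(in_channels * patch_size**2, hidden_dim, dtype=dtype)

        def forward(self, x):
            B, C, H, W = x.shape
            p = self.patch_size
            n_h, n_w = H // p, W // p
            # (B, C, H, W) -> (B, N_h, N_w, C * p * p)
            x = x.reshape(B, C, n_h, p, n_w, p).permute(0, 2, 4, 1, 3, 5)
            x = x.reshape(B, n_h, n_w, C * p * p)
            x = self.proj(x)              # (B, N_h, N_w, hidden_dim)
            return x.permute(0, 3, 1, 2)  # (B, hidden_dim, N_h, N_w)


    # The embedding dimension must match vit_t's hidden dimension (192, see the table above).
    embedder = LinearPatchEmbedder(in_channels=3, patch_size=16, hidden_dim=192)
    model = models.vit_t(embedder)

    x = torch.randn(2, 3, 224, 224, dtype=torch.complex64)
    features = model(x)  # output shape/head depends on the torchcvnn implementation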

torchcvnn.models.vit_s(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT small model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.

torchcvnn.models.vit_b(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT base model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.

torchcvnn.models.vit_l(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT large model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.

torchcvnn.models.vit_h(patch_embedder: Module, device: device | None = None, dtype: dtype = torch.complex64) → Module

Builds a ViT huge model.

Parameters:
  • patch_embedder – PatchEmbedder instance.

  • device – Device to use.

  • dtype – Data type to use.

The patch_embedder is responsible for computing the embedding of the patch as well as adding the positional encoding if required.

It maps from \((B, C, H, W)\) to \((B, hidden\_dim, N_h, N_w)\) where \(N_h \times N_w\) is the number of patches in the image. The embedding dimension must match the expected hidden dimension of the transformer.
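The same pattern applies to every builder above; only the embedding dimension produced by the patch embedder changes. The snippet below is a sketch that reuses the hypothetical LinearPatchEmbedder from the vit_t example and takes the hidden dimensions from the configuration table; it is not part of the library API.

    import torchcvnn.models as models

    # Hidden dimensions from the configuration table above. LinearPatchEmbedder
    # is the hypothetical class defined in the vit_t example.
    configs = {
        models.vit_s: 384,
        models.vit_b: 768,
        models.vit_l: 1024,
        models.vit_h: 1280,
    }

    for builder, hidden_dim in configs.items():
        embedder = LinearPatchEmbedder(in_channels=3, patch_size=16, hidden_dim=hidden_dim)
        model = builder(embedder)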