A Transformer model for image-like data from CompVis that is based on the Vision Transformer introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.
When the input is **continuous**:

1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
2. Apply the Transformer blocks in the standard way.
3. Reshape the output back into an image.
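A minimal sketch of the continuous path, assuming a recent version of diffusers; the tiny hyperparameters below are illustrative only and not taken from any pretrained checkpoint:

```py
import torch
from diffusers import Transformer2DModel

# Continuous inputs: setting `in_channels` makes the model project the latent image,
# run the Transformer blocks, and reshape the result back to an image.
# Note: `norm_num_groups` must evenly divide `in_channels`.
model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=32,
    in_channels=4,
    num_layers=1,
    norm_num_groups=4,
)

latents = torch.randn(1, 4, 16, 16)  # (batch_size, channels, height, width)
sample = model(latents).sample       # same shape as the input: (1, 4, 16, 16)
```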
When the input is **discrete**:

> [!TIP]
> It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.

1. Convert the input (classes of latent pixels) to embeddings and apply positional embeddings.
2. Apply the Transformer blocks in the standard way.
3. Predict classes of the unnoised image.
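A rough sketch of the discrete (vectorized) path, assuming a diffusers version where this input mode is still available. All hyperparameters are illustrative; `num_embeds_ada_norm` and `norm_type="ada_norm"` are assumptions made here so the blocks can condition on the diffusion timestep, in the style of VQ Diffusion:

```py
import torch
from diffusers import Transformer2DModel

# Discrete inputs: setting `num_vector_embeds` and `sample_size` makes the model embed
# latent-pixel classes and predict class scores for the unnoised image. One class is
# reserved for the masked latent pixel, so the output has `num_classes - 1` entries.
num_classes = 10  # 9 "real" classes + 1 masked-pixel class (assumption for this sketch)
model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=32,
    num_vector_embeds=num_classes,
    sample_size=8,             # the latent grid is 8x8 pixels
    num_layers=1,
    num_embeds_ada_norm=100,   # number of diffusion steps AdaLayerNorm can condition on
    norm_type="ada_norm",
)

classes = torch.randint(0, num_classes, (1, 8 * 8))  # (batch_size, num_latent_pixels)
timestep = torch.tensor(3)                           # scalar diffusion step for AdaLayerNorm
scores = model(classes, timestep=timestep).sample    # (batch_size, num_classes - 1, num_latent_pixels)
```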
[[autodoc]] Transformer2DModel
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput