Auto Classes

In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the from_pretrained() method. Auto classes do this job for you: given the name or path of the pretrained weights, configuration, or vocabulary, they automatically retrieve the relevant class.

Instantiating one of [AutoConfig], [AutoModel], or [AutoTokenizer] directly creates a class of the relevant architecture. For instance,

```python
model = AutoModel.from_pretrained("google-bert/bert-base-cased")
```

will create a model that is an instance of [BertModel].

There is one class of AutoModel for each task, and for each backend (PyTorch, TensorFlow, or Flax).
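
For example, here is a minimal sketch of loading the same BERT checkpoint with a task-specific class and with a different backend. Note that a task head that is not present in the checkpoint is randomly initialized:

```python
from transformers import AutoModelForSequenceClassification, TFAutoModel

# Task-specific auto class: returns a BertForSequenceClassification
# (the classification head is randomly initialized for this checkpoint)
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased")

# Backend-specific auto class: returns the TensorFlow implementation (TFBertModel)
tf_model = TFAutoModel.from_pretrained("google-bert/bert-base-cased")
```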

Extending the Auto Classes

Each of the auto classes has a method to be extended with your custom classes. For instance, if you have defined a custom model class NewModel, make sure you have a matching NewModelConfig, and then you can add them to the auto classes like this:

```python
from transformers import AutoConfig, AutoModel

AutoConfig.register("new-model", NewModelConfig)
AutoModel.register(NewModelConfig, NewModel)
```

You will then be able to use the auto classes as you usually would!

If your NewModelConfig is a subclass of [~transformers.PretrainedConfig], make sure its model_type attribute is set to the same key you use when registering the config (here "new-model").

Likewise, if your NewModel is a subclass of [PreTrainedModel], make sure its config_class attribute is set to the same class you use when registering the model (here NewModelConfig).
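
Putting the two requirements above together, here is a minimal sketch of what such a custom pair might look like. NewModelConfig and NewModel are the placeholder names from the example above, and a real model would implement more of the [PreTrainedModel] API:

```python
import torch.nn as nn

from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel


class NewModelConfig(PretrainedConfig):
    # model_type must match the key used when registering the config
    model_type = "new-model"

    def __init__(self, hidden_size=64, **kwargs):
        self.hidden_size = hidden_size
        super().__init__(**kwargs)


class NewModel(PreTrainedModel):
    # config_class must match the config class used when registering the model
    config_class = NewModelConfig

    def __init__(self, config):
        super().__init__(config)
        self.linear = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states):
        return self.linear(hidden_states)


AutoConfig.register("new-model", NewModelConfig)
AutoModel.register(NewModelConfig, NewModel)

# The auto classes now resolve the custom classes like any built-in ones
model = AutoModel.from_config(NewModelConfig())
```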

AutoConfig

autodoc AutoConfig

AutoTokenizer

autodoc AutoTokenizer

AutoFeatureExtractor

autodoc AutoFeatureExtractor

AutoImageProcessor

autodoc AutoImageProcessor

AutoVideoProcessor

autodoc AutoVideoProcessor

AutoProcessor

autodoc AutoProcessor

Generic model classes

The following auto classes are available for instantiating a base model class without a specific head.
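
For example, [AutoModel] returns the bare model (here a [BertModel]) that outputs hidden states without any task head; the checkpoint is the same one used above:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModel.from_pretrained("google-bert/bert-base-cased")

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```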

AutoModel

autodoc AutoModel

TFAutoModel

autodoc TFAutoModel

FlaxAutoModel

autodoc FlaxAutoModel

Generic pretraining classes

The following auto classes are available for instantiating a model with a pretraining head.
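
For example, for a BERT checkpoint this returns a BertForPreTraining with its masked language modeling and next sentence prediction heads attached:

```python
from transformers import AutoModelForPreTraining

# Returns BertForPreTraining (MLM + NSP heads) for this checkpoint
model = AutoModelForPreTraining.from_pretrained("google-bert/bert-base-cased")
```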

AutoModelForPreTraining

autodoc AutoModelForPreTraining

TFAutoModelForPreTraining

autodoc TFAutoModelForPreTraining

FlaxAutoModelForPreTraining

autodoc FlaxAutoModelForPreTraining

Natural Language Processing

The following auto classes are available for the following natural language processing tasks.
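
For instance, a causal language model can be loaded and used for generation as sketched below; the "openai-community/gpt2" checkpoint is only an illustrative choice, not something specific to the auto classes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

inputs = tokenizer("The auto classes make it easy to", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```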

AutoModelForCausalLM

autodoc AutoModelForCausalLM

TFAutoModelForCausalLM

autodoc TFAutoModelForCausalLM

FlaxAutoModelForCausalLM

autodoc FlaxAutoModelForCausalLM

AutoModelForMaskedLM

autodoc AutoModelForMaskedLM

TFAutoModelForMaskedLM

autodoc TFAutoModelForMaskedLM

FlaxAutoModelForMaskedLM

autodoc FlaxAutoModelForMaskedLM

AutoModelForMaskGeneration

autodoc AutoModelForMaskGeneration

TFAutoModelForMaskGeneration

autodoc TFAutoModelForMaskGeneration

AutoModelForSeq2SeqLM

autodoc AutoModelForSeq2SeqLM

TFAutoModelForSeq2SeqLM

autodoc TFAutoModelForSeq2SeqLM

FlaxAutoModelForSeq2SeqLM

autodoc FlaxAutoModelForSeq2SeqLM

AutoModelForSequenceClassification

autodoc AutoModelForSequenceClassification

TFAutoModelForSequenceClassification

autodoc TFAutoModelForSequenceClassification

FlaxAutoModelForSequenceClassification

autodoc FlaxAutoModelForSequenceClassification

AutoModelForMultipleChoice

autodoc AutoModelForMultipleChoice

TFAutoModelForMultipleChoice

autodoc TFAutoModelForMultipleChoice

FlaxAutoModelForMultipleChoice

autodoc FlaxAutoModelForMultipleChoice

AutoModelForNextSentencePrediction

autodoc AutoModelForNextSentencePrediction

TFAutoModelForNextSentencePrediction

autodoc TFAutoModelForNextSentencePrediction

FlaxAutoModelForNextSentencePrediction

autodoc FlaxAutoModelForNextSentencePrediction

AutoModelForTokenClassification

autodoc AutoModelForTokenClassification

TFAutoModelForTokenClassification

autodoc TFAutoModelForTokenClassification

FlaxAutoModelForTokenClassification

autodoc FlaxAutoModelForTokenClassification

AutoModelForQuestionAnswering

autodoc AutoModelForQuestionAnswering

TFAutoModelForQuestionAnswering

autodoc TFAutoModelForQuestionAnswering

FlaxAutoModelForQuestionAnswering

autodoc FlaxAutoModelForQuestionAnswering

AutoModelForTextEncoding

autodoc AutoModelForTextEncoding

TFAutoModelForTextEncoding

autodoc TFAutoModelForTextEncoding

Computer vision

The following auto classes are available for the following computer vision tasks.
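
For instance, an image classifier can be loaded together with its image processor; the "google/vit-base-patch16-224" checkpoint and the dummy image below are only illustrative:

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

# A blank RGB image stands in for real data
image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```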

AutoModelForDepthEstimation

autodoc AutoModelForDepthEstimation

AutoModelForImageClassification

autodoc AutoModelForImageClassification

TFAutoModelForImageClassification

autodoc TFAutoModelForImageClassification

FlaxAutoModelForImageClassification

autodoc FlaxAutoModelForImageClassification

AutoModelForVideoClassification

autodoc AutoModelForVideoClassification

AutoModelForKeypointDetection

autodoc AutoModelForKeypointDetection

AutoModelForMaskedImageModeling

autodoc AutoModelForMaskedImageModeling

TFAutoModelForMaskedImageModeling

autodoc TFAutoModelForMaskedImageModeling

AutoModelForObjectDetection

autodoc AutoModelForObjectDetection

AutoModelForImageSegmentation

autodoc AutoModelForImageSegmentation

AutoModelForImageToImage

autodoc AutoModelForImageToImage

AutoModelForSemanticSegmentation

autodoc AutoModelForSemanticSegmentation

TFAutoModelForSemanticSegmentation

autodoc TFAutoModelForSemanticSegmentation

AutoModelForInstanceSegmentation

autodoc AutoModelForInstanceSegmentation

AutoModelForUniversalSegmentation

autodoc AutoModelForUniversalSegmentation

AutoModelForZeroShotImageClassification

autodoc AutoModelForZeroShotImageClassification

TFAutoModelForZeroShotImageClassification

autodoc TFAutoModelForZeroShotImageClassification

AutoModelForZeroShotObjectDetection

autodoc AutoModelForZeroShotObjectDetection

Audio

The following auto classes are available for the following audio tasks.
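
For instance, a speech recognition model can be loaded together with its feature extractor; the "facebook/wav2vec2-base-960h" checkpoint and the silent waveform below are only illustrative:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForCTC

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# One second of silence at 16 kHz stands in for real audio
waveform = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch_size, time_frames, vocab_size)
```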

AutoModelForAudioClassification

autodoc AutoModelForAudioClassification

AutoModelForAudioFrameClassification

autodoc AutoModelForAudioFrameClassification

TFAutoModelForAudioClassification

autodoc TFAutoModelForAudioClassification

AutoModelForCTC

autodoc AutoModelForCTC

AutoModelForSpeechSeq2Seq

autodoc AutoModelForSpeechSeq2Seq

TFAutoModelForSpeechSeq2Seq

autodoc TFAutoModelForSpeechSeq2Seq

FlaxAutoModelForSpeechSeq2Seq

autodoc FlaxAutoModelForSpeechSeq2Seq

AutoModelForAudioXVector

autodoc AutoModelForAudioXVector

AutoModelForTextToSpectrogram

autodoc AutoModelForTextToSpectrogram

AutoModelForTextToWaveform

autodoc AutoModelForTextToWaveform

AutoModelForAudioTokenization

autodoc AutoModelForAudioTokenization

Multimodal

The following auto classes are available for the following multimodal tasks.
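
For instance, an image-captioning model can be loaded together with its processor; the "Salesforce/blip-image-captioning-base" checkpoint and the dummy image below are only illustrative:

```python
import numpy as np
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = AutoModelForVision2Seq.from_pretrained("Salesforce/blip-image-captioning-base")

# A blank RGB image stands in for real data
image = Image.fromarray(np.zeros((384, 384, 3), dtype=np.uint8))

inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```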

AutoModelForTableQuestionAnswering

autodoc AutoModelForTableQuestionAnswering

TFAutoModelForTableQuestionAnswering

autodoc TFAutoModelForTableQuestionAnswering

AutoModelForDocumentQuestionAnswering

autodoc AutoModelForDocumentQuestionAnswering

TFAutoModelForDocumentQuestionAnswering

autodoc TFAutoModelForDocumentQuestionAnswering

AutoModelForVisualQuestionAnswering

autodoc AutoModelForVisualQuestionAnswering

AutoModelForVision2Seq

autodoc AutoModelForVision2Seq

TFAutoModelForVision2Seq

autodoc TFAutoModelForVision2Seq

FlaxAutoModelForVision2Seq

autodoc FlaxAutoModelForVision2Seq

AutoModelForImageTextToText

autodoc AutoModelForImageTextToText

Time Series

The following auto classes are available for the following time series tasks.

AutoModelForTimeSeriesPrediction

autodoc AutoModelForTimeSeriesPrediction