mirror of https://github.com/huggingface/transformers.git synced 2025-07-05 22:00:09 +06:00

Speedup model init on CPU (by 10x+ for llama-3-8B as one example) (#31771)

Co-authored-by: Marc Sun <marc@huggingface.co>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

2024-07-16 09:32:01 -04:00


Models

The base classes [PreTrainedModel], [TFPreTrainedModel], and [FlaxPreTrainedModel] implement the common methods for loading and saving a model, either from a local file or directory or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS S3 repository).

[PreTrainedModel] and [TFPreTrainedModel] also implement a few methods which are common among all the models to:

- resize the input token embeddings when new tokens are added to the vocabulary
- prune the attention heads of the model
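
To make the first item concrete, here is a minimal, library-free sketch of what resizing an embedding table involves: existing rows are kept and new rows are appended for the added tokens. The helper name resize_embeddings and the choice of a small Gaussian initialization for the new rows are assumptions made for illustration; in the library itself this is done by calling resize_token_embeddings on the model, and its internals may differ.

```python
import random


def resize_embeddings(old_rows, new_num_tokens, seed=0):
    """Hypothetical helper: return an embedding table with `new_num_tokens` rows.

    Rows that already exist are copied unchanged. Extra rows (for newly added
    tokens) are filled with small random values; this init scheme is an
    assumption, not the library's. Shrinking simply truncates the table.
    """
    if new_num_tokens < 0:
        raise ValueError("new_num_tokens must be non-negative")
    dim = len(old_rows[0]) if old_rows else 0
    rng = random.Random(seed)
    # Keep as many of the existing rows as still fit.
    kept = [list(row) for row in old_rows[:new_num_tokens]]
    # Append freshly initialized rows for any new vocabulary entries.
    extra = [
        [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for _ in range(new_num_tokens - len(kept))
    ]
    return kept + extra


table = [[1.0, 2.0], [3.0, 4.0]]       # vocabulary of 2 tokens, dimension 2
grown = resize_embeddings(table, 4)    # two tokens were added
shrunk = resize_embeddings(table, 1)   # vocabulary reduced to one token
```

The point of the sketch is that the rows for existing tokens survive a resize untouched, so only the newly added tokens start from untrained values.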

The other methods common to every model are defined in [~modeling_utils.ModuleUtilsMixin] (for the PyTorch models) and [~modeling_tf_utils.TFModelUtilsMixin] (for the TensorFlow models). For text generation, they are defined in [~generation.GenerationMixin] (for the PyTorch models), [~generation.TFGenerationMixin] (for the TensorFlow models), and [~generation.FlaxGenerationMixin] (for the Flax/JAX models).

PreTrainedModel

autodoc PreTrainedModel - push_to_hub - all

Custom models should also define a _supports_assign_param_buffer attribute, which determines whether superfast init can be applied to that particular model. A sign that your model needs it is test_save_and_load_from_pretrained failing; if it does, set the attribute to False.
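
As a rough illustration of the opt-out pattern described above, the sketch below uses plain Python classes. BaseModel stands in for [PreTrainedModel], and can_use_fast_init is a hypothetical helper; neither is the library's code. The attribute name is copied from the paragraph above, and the default of True is an assumption, so check both against the modeling_utils.py of your installed version.

```python
class BaseModel:
    """Stand-in for the real base class (illustrative only)."""

    # Attribute name as given in the text above; default assumed to be True.
    _supports_assign_param_buffer = True


class CustomModel(BaseModel):
    """A custom model that opts out of the fast init path."""

    # Set to False when test_save_and_load_from_pretrained fails for the model.
    _supports_assign_param_buffer = False


def can_use_fast_init(model):
    """Hypothetical check a loader might perform before taking the fast path."""
    return bool(getattr(model, "_supports_assign_param_buffer", True))


default_ok = can_use_fast_init(BaseModel())    # True: flag left at its default
custom_ok = can_use_fast_init(CustomModel())   # False: model opted out
```

Declaring the flag at class level keeps the decision with the model definition rather than with each caller.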

ModuleUtilsMixin

autodoc modeling_utils.ModuleUtilsMixin

TFPreTrainedModel

autodoc TFPreTrainedModel - push_to_hub - all

TFModelUtilsMixin

autodoc modeling_tf_utils.TFModelUtilsMixin

FlaxPreTrainedModel

autodoc FlaxPreTrainedModel - push_to_hub - all

Pushing to the Hub

autodoc utils.PushToHubMixin

Sharded checkpoints

autodoc modeling_utils.load_sharded_checkpoint
