Commit Graph

158 Commits

Author SHA1 Message Date
Hamza Benchekroun
797860c68c
feat: add flexible Liger Kernel configuration to TrainingArguments (#38911)
Some checks are pending
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Waiting to run
Build documentation / build (push) Waiting to run
New model PR merged notification / Notify new model (push) Waiting to run
Slow tests on important models (on Push - A10) / Get all modified files (push) Waiting to run
Slow tests on important models (on Push - A10) / Slow & FA2 tests (push) Blocked by required conditions
Self-hosted runner (push-caller) / Check if setup was changed (push) Waiting to run
Self-hosted runner (push-caller) / build-docker-containers (push) Blocked by required conditions
Self-hosted runner (push-caller) / Trigger Push CI (push) Blocked by required conditions
Secret Leaks / trufflehog (push) Waiting to run
Update Transformers metadata / build_and_package (push) Waiting to run
* feat: add flexible Liger Kernel configuration to TrainingArguments

Add support for granular Liger Kernel configuration through a new
`liger_kernel_config` parameter in TrainingArguments. This allows users
to selectively enable/disable specific kernels (rope, swiglu, cross_entropy,
etc.) instead of the current approach that rely on default configuration.

Features:
- Add `liger_kernel_config` dict parameter to TrainingArguments
- Support selective kernel application for all supported models
- Maintain full backward compatibility with existing `use_liger_kernel` flag

Example usage:
```python
TrainingArguments(
    use_liger_kernel=True,
    liger_kernel_config={
        "rope": True,
        "swiglu": True,
        "cross_entropy": False,
        "fused_linear_cross_entropy": True
    }
)
Closes #38905

* Address comments and update Liger section in Trainer docs
2025-06-19 15:54:08 +00:00
Yao Matrix
3526e25d3d
enable misc test cases on XPU (#38852)
Some checks are pending
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Waiting to run
Build documentation / build (push) Waiting to run
Slow tests on important models (on Push - A10) / Get all modified files (push) Waiting to run
Slow tests on important models (on Push - A10) / Slow & FA2 tests (push) Blocked by required conditions
Self-hosted runner (push-caller) / Check if setup was changed (push) Waiting to run
Self-hosted runner (push-caller) / build-docker-containers (push) Blocked by required conditions
Self-hosted runner (push-caller) / Trigger Push CI (push) Blocked by required conditions
Secret Leaks / trufflehog (push) Waiting to run
Update Transformers metadata / build_and_package (push) Waiting to run
* enable misc test cases on XPU

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* tweak bamba ground truth on XPU

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* remove print

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* one more

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-06-18 09:20:49 +02:00
Yao Matrix
c8c1e525ed
from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim (#38689)
* since 1.11.0, torchao.prototype.low_bit_optim is promoted to
torchao.optim

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* fix review comments

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-06-11 12:16:25 +00:00
Yao Matrix
dc76eff12b
remove ipex_optimize_model usage (#38632)
* remove ipex_optimize_model usage

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* update Dockerfile

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>
Co-authored-by: root <root@a4bf01945cfe.jf.intel.com>
2025-06-06 20:04:44 +02:00
Yao Matrix
a5a0c7b888
switch to device agnostic device calling for test cases (#38247)
* use device agnostic APIs in test cases

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* add one more

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* xpu now supports integer device id, aligning to CUDA behaviors

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* update to use device_properties

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* update comment

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix comments

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

---------

Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-26 10:18:53 +02:00
Yao Matrix
3bd1c20149
enable misc cases on XPU & use device agnostic APIs for cases in tests (#38192)
* use device agnostic APIs in tests

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* more

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* add reset_peak_memory_stats API

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* update

---------

Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-20 10:09:01 +02:00
Yao Matrix
7caa57e85e
enable trainer test cases on xpu (#38138)
* enable trainer test cases on xpu

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

---------

Signed-off-by: Matrix Yao <matrix.yao@intel.com>
2025-05-15 12:17:44 +00:00
ivarflakstad
1b00966395
Fix auto batch size finder test (#38125)
Ensure --auto_find_batch_size is the last test arg so indexing is correct
2025-05-14 12:12:04 +00:00
ivarflakstad
e27d230ddd
Disable report callbacks for certain training tests (#38088)
* Disable report callbacks for certain training tests

* Disable report callbacks for test_auto_batch_size_finder
2025-05-13 14:49:55 +02:00
efsotr
e387821a96
Fix tot update in trainer (#37923)
* fix total updates in epoch

* add test; fix max_steps

* replace with multi-gpu decorator
2025-05-12 17:45:24 +02:00
Yih-Dar
f2909e024c
Skip test_push_to_hub_with_saves_each_epoch for now (#38022)
* update

* trigger CI

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-08 16:26:24 +02:00
Yao Matrix
5534b80b7f
enable xpu in test_trainer (#37774)
* enable xpu in test_trainer

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* enhance _device_agnostic_dispatch to cover value

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* add default values for torch not available case

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
2025-05-06 17:13:35 +02:00
Cyril Vallez
0cfbf9c95b
Force torch>=2.6 with torch.load to avoid vulnerability issue (#37785)
* fix all main files

* fix test files

* oups forgot modular

* add link

* update message
2025-04-25 16:57:09 +02:00
Yao Matrix
19e9079dc1
enable 4 test_trainer cases on XPU (#37645)
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-04-23 21:29:42 +02:00
youngrok cha
31ea547b7a
[fix] make legacy bnb code work (#37331)
* [fix] make legacy bnb code work

* [fix] use get with default instead of getter

* add test for bnb 8bit optim skip embed

* [fix] style

* add require annotation of bnb

---------

Co-authored-by: jaycha <jaycha@ncsoft.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-04-22 11:17:29 +02:00
cyyever
371c44d0ef
Remove old code for PyTorch, Accelerator and tokenizers (#37234)
* Remove unneeded library version checks

Signed-off-by: cyy <cyyever@outlook.com>

* Remove PyTorch condition

Signed-off-by: cyy <cyyever@outlook.com>

* Remove PyTorch condition

Signed-off-by: cyy <cyyever@outlook.com>

* Fix ROCm get_device_capability

Signed-off-by: cyy <cyyever@outlook.com>

* Revert "Fix ROCm get_device_capability"

This reverts commit 0e756434bd.

* Remove unnecessary check

Signed-off-by: cyy <cyyever@outlook.com>

* Revert changes

Signed-off-by: cyy <cyyever@outlook.com>

---------

Signed-off-by: cyy <cyyever@outlook.com>
2025-04-10 20:54:21 +02:00
cyyever
1e6b546ea6
Use Python 3.9 syntax in tests (#37343)
Signed-off-by: cyy <cyyever@outlook.com>
2025-04-08 14:12:08 +02:00
cyyever
41a0e58e5b
Set weights_only in torch.load (#36991) 2025-03-27 14:55:50 +00:00
cyyever
2b550c47b2
Remove deprecated training arguments (#36946)
* Remove deprecated training arguments

* More fixes

* More fixes

* More fixes
2025-03-26 16:44:48 +00:00
Yih-Dar
c6814b4ee8
Update ruff to 0.11.2 (#36962)
* update

* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-03-25 16:00:11 +01:00
Afanti
26c83490d2
chore: fix typos in the tests directory (#36813)
* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* chore: fix typos in the tests

* fix: format codes

* chore: fix copy mismatch issue

* fix: format codes

* chore: fix copy mismatch issue

* chore: fix copy mismatch issue

* chore: fix copy mismatch issue

* chore: restore previous words

* chore: revert unexpected changes
2025-03-21 10:20:05 +01:00
yutong_liu
8b479e39bb
Saving Trainer.collator.tokenizer in when Trainer.processing_class is None (#36552)
* feat: Saving tokenizer in collator when processing_class is None

* chore: Style issue

* chore: Typo

* dbg: Check why test failed

* dbg: Remove logics and another test failed which successed before, so should be the stablibility issue

* test: Init unit-test

* chore: Style

* chore: Add err log

* fix: Case

* Update tests/trainer/test_trainer.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* chore: Try to use get_regression_trainer

* fix: Impl and style

* fix: Style

* fix: Case

* fix: Import err

* fix: Missed import

* fix: Import block un-sorted problem

* fix: Try another tokenizer

* fix: Test logic

* chore: Light updates

* chore: Reformat

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-03-20 11:27:47 +01:00
Matt
9be4728af8
Just import torch AdamW instead (#36177)
* Just import torch AdamW instead

* Update docs too

* Make AdamW undocumented

* make fixup

* Add a basic wrapper class

* Add it back to the docs

* Just remove AdamW entirely

* Remove some AdamW references

* Drop AdamW from the public init

* make fix-copies

* Cleanup some references

* make fixup

* Delete lots of transformers.AdamW references

* Remove extra references to adamw_hf
2025-03-19 18:29:40 +00:00
Afanti
7f5077e536
fix typos in the tests directory (#36717) 2025-03-17 17:45:57 +00:00
Ilyas Moutawwakil
6f3e0b68e0
Fix grad accum arbitrary value (#36691) 2025-03-14 22:03:01 +01:00
Sean (Seok-Won) Yi
691d1b52c3
Fix/best model checkpoint fix (#35885)
* Set best_model_checkpoint only when ckpt exists.

Rather than set it explicitly without checking if the checkpoint directory even exists as before, now we moved the setting logic inside of _save_checkpoint and are only setting it if it exists.

* Added best_global_step to TrainerState.

* Added tests for best_model_checkpoint.

* Fixed hard-coded values in test to prevent fail.

* Added helper func and removed hard-coded best_step.

* Added side effect patch generator for _eval.

* Added evaluate side effect func.

* Removed erroneous patching.

* Fixed minor bug.

* Applied Ruff.

* Fixed Ruff problem in make style.

* Used Trainer.set_initial_training_values.
2025-03-14 14:24:53 +01:00
bd793fcb
87b30c3589
fix wandb hp search unable to resume from sweep_id (#35883)
* fix wandb hp search unable to resume from sweep_id

* format styles

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-03-13 12:32:26 +01:00
Matt
c7eb95581a
Don't accidentally mutate the base_model_tp_plan (#36677)
* Don't accidentally mutate the base_model_tp_plan

* Co-authored by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Trigger tests

* Marking grad accum test as slow

* Add a flaky decorator

* Add a flaky decorator

* Use cyril's codeblock

* Don't copy() when it's None

* Use cyril's new codeblock

* make fixup
2025-03-12 18:59:13 +00:00
Ilyas Moutawwakil
89f6956015
HPU support (#36424)
* test

* fix

* fix

* skip some and run some first

* test fsdp

* fix

* patches for generate

* test distributed

* copy

* don't test distributed loss for hpu

* require fp16 and run first

* changes from marc's PR fixing zero3

* better alternative

* return True when fp16 support on gaudi without creating bridge

* fix

* fix tested dtype in deepspeed inference test

* test

* fix

* test

* fix

* skip

* require fp16

* run first fsdp

* Apply suggestions from code review

* address comments

* address comments and refactor test

* reduce precison

* avoid doing gaudi1 specific stuff in the genreation loop

* document test_gradient_accumulation_loss_alignment_with_model_loss test a bit more
2025-03-12 09:08:12 +01:00
Manny Cortes
082834dd79
fix: prevent model access error during Optuna hyperparameter tuning (#36395)
* fix: prevent model access error during Optuna hyperparameter tuning

The `transformers.integrations.integration_utils.run_hp_search_optuna` function releases model memory and sets trainer.model to None after each trial. This causes an AttributeError when  subsequent Trainer.train calls attempt to access the model before reinitialization. This is only an issue when `fp16_full_eval` or `bf16_full_eval` flags are enabled.

* Update src/transformers/trainer.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-02-26 17:06:48 +01:00
Thomas Bauwens
8f137b2427
Move DataCollatorForMultipleChoice from the docs to the package (#34763)
* Add implementation for DataCollatorForMultipleChoice based on docs.

* Add DataCollatorForMultipleChoice to import structure.

* Remove custom DataCollatorForMultipleChoice implementations from example scripts.

* Remove custom implementations of DataCollatorForMultipleChoice from docs in English, Spanish, Japanese and Korean.

* Refactor torch version of DataCollatorForMultipleChoice to be more easily understandable.

* Apply suggested changes and run make fixup.

* fix copies, style and fixup

* add missing documentation

* nits

* fix docstring

* style

* nits

* isort

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2025-02-13 12:01:28 +01:00
Zach Mueller
1fae54c721
Add more rigerous non-slow grad accum tests (#35668)
* Add more rigerous non-slow grad accum tests

* Further nits

* Re-add space

* Readbility

* Use tinystories instead

* Revert transformer diff

* tweak threshs
2025-02-12 10:26:21 -05:00
hsilva664
281c0c8b5b
adding option to save/reload scaler (#34932)
* Adding option to save/reload scaler

* Removing duplicate variable

* Adding save/reload test

* Small fixes on deterministic algorithm call

* Moving LLM test to another file to isolate its environment

* Moving back to old file and using subprocess to run test isolated

* Reverting back accidental change

* Reverting back accidental change
2025-02-12 15:48:16 +01:00
zhuHQ
08c4959a23
Optim: APOLLO optimizer integration (#36062)
* Added APOLLO optimizer integration

* fix comment

* Remove redundancy: Modularize low-rank optimizer construction

* Remove redundancy: Remove useless comment

* Fix comment: Add typing

* Fix comment: Rewrite apollo desc
2025-02-12 15:33:43 +01:00
nhamanasu
377d8e2b9c
add RAdamScheduleFree optimizer (#35313)
* add RAdamScheduleFree optimizer

* revert schedulefree version to the minimum requirement

* refine is_schedulefree_available so that it can take min_version

* refine documents

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-02-12 11:31:51 +01:00
Arthur
b912f5ee43
use torch.testing.assertclose instead to get more details about error in cis (#35659)
* use torch.testing.assertclose instead to get more details about error in cis

* fix

* style

* test_all

* revert for I bert

* fixes and updates

* more image processing fixes

* more image processors

* fix mamba and co

* style

* less strick

* ok I won't be strict

* skip and be done

* up
2025-01-24 16:55:28 +01:00
fzyzcjy
dc10f7906a
Support adamw_torch_8bit (#34993)
* var

* more

* test
2025-01-21 14:17:49 +01:00
kang sheng
2cbcc5877d
Fix condition when GA loss bug fix is not performed (#35651)
* fix condition when GA loss bug fix is not performed

* max loss diff is 2.29

* fix typo

* add an extra validation that loss should not vary too much
2025-01-16 13:59:53 +01:00
Fanli Lin
2fa876d2d8
[tests] make cuda-only tests device-agnostic (#35607)
* intial commit

* remove unrelated files

* further remove

* Update test_trainer.py

* fix style
2025-01-13 14:48:39 +01:00
Zach Mueller
b02828e4af
Let EarlyStoppingCallback not require load_best_model_at_end (#35101)
* Bookmark

* Add warning
2025-01-10 10:25:32 -05:00
Zach Mueller
1211e616a4
Use inherit tempdir makers for tests + fix failing DS tests (#35600)
* Use existing APIs to make tempdir folders

* Fixup deepspeed too

* output_dir -> tmp_dir
2025-01-10 10:01:58 -05:00
nhamanasu
b32938aeee
Fix all output_dir in test_trainer.py to use tmp_dir (#35266)
* update codecarbon

* replace directly-specified-test-dirs with tmp_dir

* pass tmp_dir to all get_regression_trainer

* test_trainer.py: Use tmp_dir consistently for all output_dir arguments

* fix some with...as tmp_dir blocks

* reflect the comments to improve test_trainer.py

* refresh .gitignore
2025-01-08 19:44:39 +01:00
Sean (Seok-Won) Yi
88e18b3c63
Update doc for metric_for_best_model when save_strategy="best". (#35389)
* Updated docstring for _determine_best_metric.

* Updated docstring for metric_for_best_model.

* Added test case for save strategy.

* Updated incorrect test case.

* Changed eval_strategy to match save_strategy.

* Separated test cases for metric.

* Allow load_best_model when save_strategy == "best".

* Updated docstring for metric_for_best_model.
2025-01-08 16:32:35 +01:00
kang sheng
1ccca8f48c
Fix GA loss bugs and add unit test (#35121)
* fix GA bugs and add unit test

* narrow down model loss unit test diff gap

* format code to make ruff happy

* send num_items_in_batch argument to decoder

* fix GA loss bug in BertLMHeadModel

* use TinyStories-33M to narrow down diff gap

* fotmat code

* missing .config

* avoid add extra args

---------

Co-authored-by: kangsheng <kangsheng@meituan.com>
2024-12-09 09:57:41 +01:00
Yih-Dar
b0a51e5cff
Fix flaky Hub CI (test_trainer.py) (#35062)
* fix

* Update src/transformers/testing_utils.py

Co-authored-by: Lucain <lucainp@gmail.com>

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* check

* check

* check

* check

* check

* check

* Update src/transformers/testing_utils.py

Co-authored-by: Lucain <lucainp@gmail.com>

* Update src/transformers/testing_utils.py

Co-authored-by: Lucain <lucainp@gmail.com>

* check

* check

* check

* Final space

* Final adjustment

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Lucain <lucainp@gmail.com>
2024-12-05 17:02:27 +01:00
Zach Mueller
ef976a7e18
Update trainer for easier handling of accumulate, compile fixes, and proper reporting (#34511)
* Update trainer for easier handling of accumulate + proper reporting

* test

* Fixup tests

* Full fix

* Fix style

* rm comment

* Fix tests

* Minimize test + remove py 311 check

* Unused import

* Forward contrib credits from discussions

* Fix reported metrics

* Refactor, good as it's going to get

* rm pad tok id check

* object detection and audio are being annoying

* Fin

* Fin x2

---------

Co-authored-by: Gyanateet Dutta <Ryukijano@users.noreply.github.com>
2024-11-04 07:47:34 -05:00
Sean (Seok-Won) Yi
c1753436db
New option called "best" for args.save_strategy. (#31817)
* Add _determine_best_metric and new saving logic.

1. Logic to determine the best logic was separated out from
`_save_checkpoint`.
2. In `_maybe_log_save_evaluate`, whether or not a new best metric was
achieved is determined after each evaluation, and if the save strategy
is "best' then the TrainerControl is updated accordingly.

* Added SaveStrategy.

Same as IntervalStrategy, but with a new attribute called BEST.

* IntervalStrategy -> SaveStrategy

* IntervalStratgy -> SaveStrategy for save_strat.

* Interval -> Save in docstring.

* Updated docstring for save_strategy.

* Added SaveStrategy and made according changes.

`save_strategy` previously followed `IntervalStrategy` but now follows
`SaveStrategy`.

Changes were made accordingly to the code and the docstring.

* Changes from `make fixup`.

* Removed redundant metrics argument.

* Added new test_save_best_checkpoint test.

1. Checks for both cases where `metric_for_best_model` is explicitly
provided and when it's not provided.
2. The first case should have two checkpoints saved, whereas the second
should have three saved.

* Changed should_training_end saving logic.

The Trainer saves a checkpoints at the end of training by default as
long as `save_strategy != SaveStrategy.NO`. This condition was modified
to include `SaveStrategy.BEST` because it would be counterintuitive that
we'd only want the best checkpoint to be saved but the last one is as
well.

* `args.metric_for_best_model` default to loss.

* Undo metric_for_best_model update.

* Remove checking metric_for_best_model.

* Added test cases for loss and no metric.

* Added error for metric and changed default best_metric.

* Removed unused import.

* `new_best_metric` -> `is_new_best_metric`

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Applied `is_new_best_metric` to all.

Changes were made for consistency and also to fix a potential bug.

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
2024-10-28 16:02:22 +01:00
Zach Mueller
6ba31a8a94
Enable users to use their own loss functions + deal with prefetching for grad accum (#34198)
* bookmark

* Bookmark

* Bookmark

* Actually implement

* Pass in kwarg explicitly

* Adjust for if we do or don't have labels

* Bookmark fix for od

* bookmark

* Fin

* closer

* Negate accelerate grad accum div

* Fixup not training long enough

* Add in compute_loss to take full model output

* Document

* compute_loss -> compute_loss_fn

* Add a test

* Refactor

* Refactor

* Uncomment tests

* Update tests/trainer/test_trainer.py

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2024-10-17 17:01:56 -04:00
Marc Sun
3f06f95ebe
Revert "Fix FSDP resume Initialization issue" (#34193)
Revert "Fix FSDP resume Initialization issue (#34032)"

This reverts commit 4de1bdbf63.
2024-10-16 15:25:18 -04:00
Shikhar Mishra
4de1bdbf63
Fix FSDP resume Initialization issue (#34032)
* Fix FSDP Initialization for resume training

* Added init_fsdp function to work with dummy values

* Fix FSDP initialization for resuming training

* Added CUDA decorator for tests

* Added torch_gpu decorator to FSDP tests

* Fixup for failing code quality tests
2024-10-15 13:48:10 +02:00