Allow training to resume even if RNG states are not properly loaded (#14994)

* Allow training to resume even if RNG states are not properly loaded

* Proper f-string
This commit is contained in:
Sylvain Gugger 2021-12-30 17:03:20 -05:00 committed by GitHub
parent 08cb5718ec
commit e68c3756fe
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -1553,7 +1553,13 @@ class Trainer:
if self.args.local_rank != -1:
torch.cuda.random.set_rng_state(checkpoint_rng_state["cuda"])
else:
torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
try:
torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
except Exception as e:
logger.info(
f"Didn't manage to set back the RNG states of the GPU because of the following error:\n {e}"
"\nThis won't yield the same results as if the training had not been interrupted."
)
if is_torch_tpu_available():
xm.set_rng_state(checkpoint_rng_state["xla"])