* average loss over batches and accumulated steps for tracking
* fix LayerNorm weight decay
* use AdamW from PyTorch instead of Transformers
* add shuffling of sequences inside the batches
* add logging dir and reformat code
* fix lr tracking
* remove Mistral scaling
* keep Mistral scaling
* reformat code
* fix error
* use shuffling function from PyTorch
* remove argument for shuffling batch sequences as it isn't optional
* update package versions and install accelerate from source
* remove unused package
* update loss average over accumulated steps
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* use one shuffle buffer argument
* compute avg_loss in one line
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>