transformers/docs/source/en/model_doc/biogpt.md

BioGPT

PyTorch FlashAttention SDPA

Overview

The BioGPT model was proposed in BioGPT: generative pre-trained transformer for biomedical text generation and mining by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.

The abstract from the paper is the following:

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.

This model was contributed by kamalkraj. The original code can be found here.

Usage tips

  • BioGPT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather than the left.
  • BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text, as can be observed in the run_generation.py example script.
  • The model can take past_key_values as input, which holds the previously computed key/value attention pairs. Passing it during text generation prevents the model from recomputing values it has already computed. See the past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.
  • The head_mask argument is ignored when using any attention implementation other than "eager". If you have a head_mask and want it to take effect, load the model with the eager implementation, e.g. BioGptModel.from_pretrained(model_id, attn_implementation="eager").
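The caching idea behind past_key_values can be illustrated without loading BioGPT at all. The following is a minimal single-head sketch in plain PyTorch, not the library's implementation: the helper names `causal_attention` and `decode_step` and the `(keys, values)` tuple used as a cache are assumptions made for illustration. It shows that attending from one new token over cached keys and values gives the same result as recomputing causal attention over the whole sequence.

```python
import math
import torch

torch.manual_seed(0)

def causal_attention(q, k, v):
    """Eager single-head causal attention over a full sequence of shape (seq_len, dim)."""
    seq_len, dim = q.shape
    scores = q @ k.T / math.sqrt(dim)
    # Mask future positions so token i only attends to tokens <= i.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def decode_step(q_new, k_new, v_new, cache):
    """Attend from one new token, reusing keys/values cached by earlier steps.

    `cache` is a hypothetical (keys, values) tuple, or None on the first step.
    """
    k_all = k_new if cache is None else torch.cat([cache[0], k_new], dim=0)
    v_all = v_new if cache is None else torch.cat([cache[1], v_new], dim=0)
    scores = q_new @ k_all.T / math.sqrt(q_new.shape[-1])
    out = torch.softmax(scores, dim=-1) @ v_all
    return out, (k_all, v_all)

seq_len, dim = 6, 8
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))

# Reference: recompute attention for every position at once.
full = causal_attention(q, k, v)

# Incremental decoding: one token per step, growing the cache.
cache = None
steps = []
for t in range(seq_len):
    out, cache = decode_step(q[t:t + 1], k[t:t + 1], v[t:t + 1], cache)
    steps.append(out)
incremental = torch.cat(steps, dim=0)

max_diff = (full - incremental).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

Each step only projects and scores the newest token, which is why passing the cache avoids repeating work during generation.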

Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.

import torch
from transformers import BioGptForCausalLM

# Explicitly request SDPA and load the weights in half precision.
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt", attn_implementation="sdpa", torch_dtype=torch.float16)
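To make concrete what the SDPA operator computes, here is a small sketch that compares a hand-written ("eager"-style) causal attention with torch.nn.functional.scaled_dot_product_attention on random tensors. It is an illustration under assumed shapes, not BioGPT's attention code; the helper `eager_causal_attention` and the tensor sizes are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 4, 16, 32
q, k, v = (torch.randn(batch, heads, seq_len, head_dim) for _ in range(3))

def eager_causal_attention(q, k, v):
    """Explicit softmax(QK^T / sqrt(d)) V with a causal mask."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # True above the diagonal marks future positions that must be hidden.
    future = torch.triu(
        torch.ones(q.size(-2), k.size(-2), dtype=torch.bool), diagonal=1
    )
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

eager_out = eager_causal_attention(q, k, v)
# The fused operator picks a backend based on inputs and hardware.
sdpa_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

outputs_match = torch.allclose(eager_out, sdpa_out, atol=1e-4)
print("outputs match:", outputs_match)
```

The two paths are numerically equivalent up to floating-point tolerance; the difference lies in how the computation is fused and scheduled, which is where the speed and memory gains come from.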

On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.3.1, Ubuntu 20.04) using float16 and the microsoft/biogpt model with a CausalLM head, we saw the following speedups during training.

For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or torch.bfloat16).

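| num_training_steps | batch_size | seq_len | is cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | sdpa peak mem (MB) | Mem saving (%) |
|--------------------|------------|---------|---------|----------------------------|---------------------------|-------------|---------------------|--------------------|----------------|
| 100 | 1 | 128 | False | 0.038 | 0.031 | 21.301 | 1601.862 | 1601.497 | 0.023 |
| 100 | 1 | 256 | False | 0.039 | 0.034 | 15.084 | 1624.944 | 1625.296 | -0.022 |
| 100 | 2 | 128 | False | 0.039 | 0.033 | 16.820 | 1624.567 | 1625.296 | -0.045 |
| 100 | 2 | 256 | False | 0.065 | 0.059 | 10.255 | 1672.164 | 1672.164 | 0.000 |
| 100 | 4 | 128 | False | 0.062 | 0.058 | 6.998 | 1671.435 | 1672.164 | -0.044 |
| 100 | 4 | 256 | False | 0.113 | 0.100 | 13.316 | 2350.179 | 1848.435 | 27.144 |
| 100 | 8 | 128 | False | 0.107 | 0.098 | 9.883 | 2098.521 | 1848.435 | 13.530 |
| 100 | 8 | 256 | False | 0.222 | 0.196 | 13.413 | 3989.980 | 2986.492 | 33.601 |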
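The percentage columns are not defined in the text. The memory figures in the table are consistent with a relative gain of the form (eager / sdpa - 1) * 100, so the sketch below assumes that formula; the helper name `percent_gain` is hypothetical. Applying the same formula to the time columns only approximates the Speedup column, because the times shown are rounded to three decimals.

```python
def percent_gain(eager: float, sdpa: float) -> float:
    """Relative gain of SDPA over eager in percent, assumed to be (eager / sdpa - 1) * 100."""
    return (eager / sdpa - 1.0) * 100.0

# Peak memory (MB) from the batch_size=4, seq_len=256 training row.
mem_saving = percent_gain(2350.179, 1848.435)

# Time per batch (s) from the same row; these inputs are already rounded,
# so the result differs slightly from the table's Speedup (%) value.
approx_speedup = percent_gain(0.113, 0.100)

print(f"memory saving: {mem_saving:.3f}%")
print(f"speedup from rounded times: {approx_speedup:.1f}%")
```

With the memory values of that row, the formula reproduces the 27.144 listed under Mem saving (%). A negative value, as in some of the smaller configurations, means SDPA used marginally more peak memory than eager.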

On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.3.1, Ubuntu 20.04) using float16 and the microsoft/biogpt model loaded with AutoModel, we saw the following speedups during inference.

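| num_batches | batch_size | seq_len | is cuda | is half | use mask | Per token latency eager (ms) | Per token latency SDPA (ms) | Speedup (%) | Mem eager (MB) | Mem SDPA (MB) | Mem saved (%) |
|-------------|------------|---------|---------|---------|----------|------------------------------|-----------------------------|-------------|----------------|---------------|---------------|
| 50 | 1 | 64 | True | True | True | 0.115 | 0.098 | 17.392 | 716.998 | 716.998 | 0.000 |
| 50 | 1 | 128 | True | True | True | 0.115 | 0.093 | 24.640 | 730.916 | 730.916 | 0.000 |
| 50 | 2 | 64 | True | True | True | 0.114 | 0.096 | 19.204 | 730.900 | 730.900 | 0.000 |
| 50 | 2 | 128 | True | True | True | 0.117 | 0.095 | 23.529 | 759.262 | 759.262 | 0.000 |
| 50 | 4 | 64 | True | True | True | 0.113 | 0.096 | 18.325 | 759.229 | 759.229 | 0.000 |
| 50 | 4 | 128 | True | True | True | 0.186 | 0.178 | 4.289 | 816.478 | 816.478 | 0.000 |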

Resources

BioGptConfig

autodoc BioGptConfig

BioGptTokenizer

autodoc BioGptTokenizer - save_vocabulary

BioGptModel

autodoc BioGptModel - forward

BioGptForCausalLM

autodoc BioGptForCausalLM - forward

BioGptForTokenClassification

autodoc BioGptForTokenClassification - forward

BioGptForSequenceClassification

autodoc BioGptForSequenceClassification - forward