PyTorch FlashAttention SDPA Tensor parallelism
# Phi

[Phi](https://huggingface.co/papers/2306.11644) is a 1.3B parameter transformer model optimized for Python code generation. It focuses on "textbook-quality" training data of code examples, exercises, and synthetic Python problems rather than scaling the model size or compute.

You can find all the original Phi checkpoints under the [Phi-1](https://huggingface.co/collections/microsoft/phi-1-6626e29134744e94e222d572) collection.

> [!TIP]
> Click on the Phi models in the right sidebar for more examples of how to apply Phi to different language tasks.

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="microsoft/phi-1.5", device=0, torch_dtype=torch.bfloat16)
pipeline('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''')
```

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

input_ids = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt").to("cuda")

output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

```bash
echo -e "'''def print_prime(n): \"\"\" Print all primes between 1 and n \"\"\"'''" | transformers run --task text-generation --model microsoft/phi-1.5 --device 0
```

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes) to only quantize the weights to 4-bits.

```py
import torch
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa",
    quantization_config=bnb_config
)

input_ids = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt").to("cuda")

output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
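The page header also lists FlashAttention support alongside SDPA. The snippet below is a minimal sketch of opting into it, assuming the `flash-attn` package is installed and your GPU supports FlashAttention-2; it only swaps the `attn_implementation` argument from the example above.

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
# assumes flash-attn is installed and the GPU supports FlashAttention-2 (fp16/bf16 required)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

input_ids = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt").to("cuda")

output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```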
## Notes

- If you're using Transformers < 4.37.0.dev, set `trust_remote_code=True` in [`~AutoModel.from_pretrained`]. Otherwise, make sure you update Transformers to the latest stable version.

  ```py
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM

  tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
  model = AutoModelForCausalLM.from_pretrained(
      "microsoft/phi-1",
      torch_dtype=torch.float16,
      device_map="auto",
      trust_remote_code=True,
      attn_implementation="sdpa"
  )

  input_ids = tokenizer('''def print_prime(n):
     """
     Print all primes between 1 and n
     """''', return_tensors="pt").to("cuda")

  output = model.generate(**input_ids, cache_implementation="static")
  print(tokenizer.decode(output[0], skip_special_tokens=True))
  ```

## PhiConfig

[[autodoc]] PhiConfig

## PhiModel

[[autodoc]] PhiModel
    - forward

## PhiForCausalLM

[[autodoc]] PhiForCausalLM
    - forward
    - generate

## PhiForSequenceClassification

[[autodoc]] PhiForSequenceClassification
    - forward

## PhiForTokenClassification

[[autodoc]] PhiForTokenClassification
    - forward
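The snippet below is a brief, hypothetical sketch of how the token classification head is used. `microsoft/phi-1` does not ship a fine-tuned token classification head, so the head here is randomly initialized; substitute a checkpoint fine-tuned for the task to get meaningful predictions.

```py
import torch
from transformers import AutoTokenizer, PhiForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
# the classification head is randomly initialized for "microsoft/phi-1";
# load a checkpoint fine-tuned for token classification to get useful labels
model = PhiForTokenClassification.from_pretrained(
    "microsoft/phi-1",
    num_labels=2,
    torch_dtype=torch.float16,
    device_map="auto"
)

inputs = tokenizer("def print_prime(n):", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch_size, sequence_length, num_labels)
print(logits.argmax(dim=-1))
```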