<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ESM

## Overview

This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental
AI Research Team, including the state-of-the-art ESM-2 and the previously released ESM-1b and ESM-1v. Transformer
protein language models were introduced in the paper [Biological structure and function emerge from scaling
unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott,
C. Lawrence Zitnick, Jerry Ma, and Rob Fergus.
The first version of this paper was [preprinted in 2019](https://www.biorxiv.org/content/10.1101/622803v1?versioned=true).

ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks,
and enables atomic resolution structure prediction.
It was released with the paper [Language models of protein sequences at the scale of evolution enable accurate
structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie,
Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido and Alexander Rives.

The abstract from "Biological structure and function emerge from scaling unsupervised learning to 250 million protein
sequences" is:

*In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised
learning has led to major advances in representation learning and statistical generation. In the life sciences, the
anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling
at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To
this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250
million protein sequences spanning evolutionary diversity. The resulting model contains information about biological
properties in its representations. The representations are learned from sequence data alone. The learned representation
space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to
remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and
can be identified by linear projections. Representation learning produces features that generalize across a range of
applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and
improving state-of-the-art features for long-range contact prediction.*

The abstract from "Language models of protein sequences at the scale of evolution enable accurate structure
prediction" is:

*Large language models have recently been shown to develop emergent capabilities with scale, going beyond
simple pattern matching to perform higher level reasoning and generate lifelike images and text. While
language models trained on protein sequences have been studied at a smaller scale, little is known about
what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters,
the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn
information enabling the prediction of the three-dimensional structure of a protein at the resolution of
individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly
from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for
sequences with low perplexity that are well understood by the language model. ESMFold inference is an
order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic
proteins in practical timescales.*

Tips:

- ESM models are trained with a masked language modeling (MLM) objective.
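
As a quick illustration of the MLM objective, the sketch below masks a single residue of a protein sequence and asks
the model to predict it. It uses `facebook/esm2_t6_8M_UR50D`, a small public ESM-2 checkpoint, as an example; any ESM
checkpoint with a language modeling head should work the same way.

```python
import torch
from transformers import EsmForMaskedLM, EsmTokenizer

checkpoint = "facebook/esm2_t6_8M_UR50D"  # example checkpoint, swap in any ESM model
tokenizer = EsmTokenizer.from_pretrained(checkpoint)
model = EsmForMaskedLM.from_pretrained(checkpoint)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# replace the sixth residue with the mask token ("<mask>" in the ESM vocabulary)
masked = sequence[:5] + tokenizer.mask_token + sequence[6:]

inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# locate the masked position and decode the highest-scoring amino acid
mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

Note that the ESM vocabulary has no dedicated separator token, so the tokenizer uses the EOS token as the separator
when a pair of sequences is encoded together.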

The original code can be found [here](https://github.com/facebookresearch/esm) and was developed by the Fundamental
AI Research team at Meta AI.
This model was contributed to Hugging Face by [jasonliu](https://huggingface.co/jasonliu)
and [Matt](https://huggingface.co/Rocketknight1).
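
The TensorFlow classes documented below mirror the PyTorch API, so the same masked prediction can be sketched in TF.
This again assumes the `facebook/esm2_t6_8M_UR50D` checkpoint; `from_pt=True` converts the PyTorch weights on the fly
in case no native TF weights are hosted for the checkpoint you pick.

```python
import tensorflow as tf
from transformers import EsmTokenizer, TFEsmForMaskedLM

checkpoint = "facebook/esm2_t6_8M_UR50D"  # example checkpoint
tokenizer = EsmTokenizer.from_pretrained(checkpoint)
# from_pt=True falls back to converting the PyTorch weights if no TF weights are available
model = TFEsmForMaskedLM.from_pretrained(checkpoint, from_pt=True)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked = sequence[:5] + tokenizer.mask_token + sequence[6:]

inputs = tokenizer(masked, return_tensors="tf")
logits = model(**inputs).logits

# locate the masked position and decode the highest-scoring amino acid
mask_index = int(tf.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0, 0])
predicted_id = int(tf.argmax(logits[0, mask_index]))
print(tokenizer.decode([predicted_id]))
```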

## EsmConfig

[[autodoc]] EsmConfig
    - all

## EsmTokenizer

[[autodoc]] EsmTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## EsmModel

[[autodoc]] EsmModel
    - forward

## EsmForMaskedLM

[[autodoc]] EsmForMaskedLM
    - forward

## EsmForSequenceClassification

[[autodoc]] EsmForSequenceClassification
    - forward

## EsmForTokenClassification

[[autodoc]] EsmForTokenClassification
    - forward

## TFEsmModel

[[autodoc]] TFEsmModel
    - call

## TFEsmForMaskedLM

[[autodoc]] TFEsmForMaskedLM
    - call

## TFEsmForSequenceClassification

[[autodoc]] TFEsmForSequenceClassification
    - call

## TFEsmForTokenClassification

[[autodoc]] TFEsmForTokenClassification
    - call