"""
|
|
Simple check list from AllenNLP repo: https://github.com/allenai/allennlp/blob/master/setup.py
|
|
|
|
To create the package for pypi.
|
|
|
|
1. Change the version in __init__.py, setup.py as well as docs/source/conf.py.
|
|
|
|
2. Commit these changes with the message: "Release: VERSION"
|
|
|
|
3. Add a tag in git to mark the release: "git tag VERSION -m'Adds tag VERSION for pypi' "
|
|
Push the tag to git: git push --tags origin master
|
|
|
|
4. Build both the sources and the wheel. Do not change anything in setup.py between
|
|
creating the wheel and the source distribution (obviously).
|
|
|
|
For the wheel, run: "python setup.py bdist_wheel" in the top level directory.
|
|
(this will build a wheel for the python version you use to build it).
|
|
|
|
For the sources, run: "python setup.py sdist"
|
|
You should now have a /dist directory with both .whl and .tar.gz source versions.
|
|
|
|
5. Check that everything looks correct by uploading the package to the pypi test server:
|
|
|
|
twine upload dist/* -r pypitest
|
|
(pypi suggest using twine as other methods upload files via plaintext.)
|
|
You may have to specify the repository url, use the following command then:
|
|
twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/
|
|
|
|
Check that you can install it in a virtualenv by running:
|
|
pip install -i https://testpypi.python.org/pypi transformers
|
|
|
|
6. Upload the final version to actual pypi:
|
|
twine upload dist/* -r pypi
|
|
|
|
7. Copy the release notes from RELEASE.md to the tag in github once everything is looking hunky-dory.
|
|
|
|
8. Update the documentation commit in .circleci/deploy.sh for the accurate documentation to be displayed
|
|
|
|
9. Update README.md to redirect to correct documentation.
|
|
"""

import shutil
from pathlib import Path

from setuptools import find_packages, setup
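

# Step 1 of the checklist above asks for the version to match across
# __init__.py, setup.py, and docs/source/conf.py. The helper below is a
# minimal, illustrative sketch of such a check: it is defined but never
# invoked by the build, and it assumes the version lives in
# src/transformers/__init__.py as `__version__` and in conf.py as `release`.
import re


def _versions_in_sync(expected="2.8.0"):
    """Return True when the step-1 files all carry ``expected``.

    Hypothetical helper: the paths and regexes are assumptions about the
    repository layout, not part of the build itself.
    """
    here = Path(__file__).parent
    checks = {
        here / "src" / "transformers" / "__init__.py": r'__version__ = "([^"]+)"',
        here / "docs" / "source" / "conf.py": r"release = u?['\"]([^'\"]+)['\"]",
    }
    for path, pattern in checks.items():
        if not path.exists():  # tolerate layouts that differ from this sketch
            continue
        match = re.search(pattern, path.read_text(encoding="utf-8"))
        if match is None or match.group(1) != expected:
            return False
    return True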


# Remove stale transformers.egg-info directory to avoid https://github.com/pypa/pip/issues/5466
stale_egg_info = Path(__file__).parent / "transformers.egg-info"
if stale_egg_info.exists():
    print(
        (
            "Warning: {} exists.\n\n"
            "If you recently updated transformers to 3.0 or later, this is expected,\n"
            "but it may prevent transformers from installing in editable mode.\n\n"
            "This directory is automatically generated by Python's packaging tools.\n"
            "I will remove it now.\n\n"
            "See https://github.com/pypa/pip/issues/5466 for details.\n"
        ).format(stale_egg_info)
    )
    shutil.rmtree(stale_egg_info)
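
# The guard above matters mostly for editable installs ("pip install -e ."):
# a stale egg-info directory left over from an earlier package layout can
# otherwise shadow the freshly installed package (see the pip issue linked above).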

extras = {}

extras["mecab"] = ["mecab-python3"]
extras["sklearn"] = ["scikit-learn"]
extras["tf"] = ["tensorflow"]
extras["tf-cpu"] = ["tensorflow-cpu"]
extras["torch"] = ["torch"]

extras["serving"] = ["pydantic", "uvicorn", "fastapi", "starlette"]
extras["all"] = extras["serving"] + ["tensorflow", "torch"]

extras["testing"] = ["pytest", "pytest-xdist"]
extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme"]
extras["quality"] = [
    "black",
    "isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
    "flake8",
]
extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]
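
# The extras above map onto pip's bracket syntax, for example:
#
#   pip install "transformers[torch]"   # runtime dependencies plus one backend
#   pip install -e ".[dev]"             # everything needed for development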

setup(
    name="transformers",
    version="2.8.0",
    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
    long_description=open("README.md", "r", encoding="utf-8").read(),
    long_description_content_type="text/markdown",
    keywords="NLP deep learning transformer pytorch tensorflow BERT GPT GPT-2 google openai CMU",
    license="Apache",
    url="https://github.com/huggingface/transformers",
    package_dir={"": "src"},
    packages=find_packages("src"),
    install_requires=[
        "numpy",
        "tokenizers == 0.7.0rc3",
        # dataclasses for Python versions that don't have it
        "dataclasses;python_version<'3.7'",
        # accessing files from S3 directly
        "boto3",
        # filesystem locks, e.g. to prevent parallel downloads
        "filelock",
        # for downloading models over HTTPS
        "requests",
        # progress bars in model download and training scripts
        "tqdm >= 4.27",
        # for OpenAI GPT
        "regex != 2019.12.17",
        # for XLNet
        "sentencepiece",
        # for XLM
        "sacremoses",
    ],
    extras_require=extras,
    scripts=["transformers-cli"],
    python_requires=">=3.6.0",
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "Intended Audience :: Education",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
    ],
)
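
# Before uploading to the test server (step 5 of the checklist), a quick
# sanity check of the built artifacts is possible with twine (assuming it
# is installed):
#
#   twine check dist/*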