mirror of
https://github.com/huggingface/transformers.git
synced 2025-07-14 18:18:24 +06:00

* add dia model * add tokenizer files * cleanup some stuff * brut copy paste code * rough cleanup of the modeling code * nuke some stuff * more nuking * more cleanups * updates * add mulitLayerEmbedding vectorization * nits * more modeling simplifications * updates * update rope * update rope * just fixup * update configuration files * more cleanup! * default config values * update * forgotten comma * another comma! * update, more cleanups * just more nits * more config cleanups * time for the encoder * fix * sa=mall nit * nits * n * refacto a bit * cleanup * update cv scipt * fix last issues * fix last nits * styling * small fixes * just run 1 generation * fixes * nits * fix conversion * fix * more fixes * full generate * ouf! * fixes! * updates * fix * fix cvrt * fixup * nits * delete wrong test * update * update * test tokenization * let's start changing things bit by bit - fix encoder step * removing custom generation, moving to GenerationMixin * add encoder decoder attention masks for generation * mask changes, correctness checked against ad29837 in dia repo * refactor a bit already --> next cache * too important not to push :) * minimal cleanup + more todos * make main overwrite modeling utils * add cfg filter & eos filter * add eos countdown & delay pattern * update eos countdown * add max step eos countdown * fix tests * fix some things * fix generation with testing * move cfg & eos stuff to logits processor * make RepetitionPenaltyLogitsProcessor flexible - can accept 3D scores like (batch_size, channel, vocab) * fix input_ids concatenation dimension in GenerationMixin for flexibility * Add DiaHangoverLogitsProcessor and DiaExponentialDecayLengthPenalty classes; refactor logits processing in DiaForConditionalGeneration to utilize new configurations and improve flexibility. * Add stopping criteria * refactor * move delay pattern from processor to modeling like musicgen. 
- add docs - change eos countdown to eos delay pattern * fix processor & fix tests * refactor types * refactor imports * format code * fix docstring to pass ci * add docstring to DiaConfig & add DiaModel to test * fix docstring * add docstring * fix some bugs * check * porting / merging results from other branch - IMPORTANT: it very likely breaks generation, the goal is to have a proper forward path first * experimental testing of left padding for first channel * whoops * Fix merge to make generation work * fix cfg filter * add position ids * add todos, break things * revert changes to generation --> we will force 2d but go 3d on custom stuff * refactor a lot, change prepare decoder ids to work with left padding (needs testing), add todos * some first fixes to get to 10. in generation * some more generation fixes / adjustment * style + rope fixes * move cfg out, simplify a few things, more todos * nit * start working on custom logit processors * nit * quick fixes * cfg top k * more refactor of logits processing, needs a decision if gen config gets the new attributes or if we move it to config or similar * lets keep changes to core code minimal, only eos scaling is questionable atm * simpler eos delay logits processor * that was for debugging :D * proof of concept rope * small fix on device mismatch * cfg fixes + delay logits max len * transformers rope * modular dia * more cleanup * keep modeling consistently 3D, generate handles 2D internally * decoder starts with bos if nothing * post processing prototype * style * lol * force sample / greedy + fixes on padding * style * fixup tokenization * nits * revert * start working on dia tests * fix a lot of tests * more test fixes * nit * more test fixes + some features to simplify code more * more cleanup * forgot that one * autodocs * small consistency fixes * fix regression * small fixes * dia feature extraction * docs * wip processor * fix processor order * processing goes brrr * transpose before * small fix * fix 
major bug but needs now a closer look into the custom processors esp cfg * small thing on logits * nits * simplify indices and shifts * add simpler version of padding tests back (temporarily) * add logit processor tests * starting tests on processor * fix mask application during generation * some fixes on the weights conversion * style + fixup logits order * simplify conversion * nit * remove padding tests * nits on modeling * hmm * fix tests * trigger * probably gonna be reverted, just a quick design around audio tokenizer * fixup typing * post merge + more typing * initial design for audio tokenizer * more design changes * nit * more processor tests and style related things * add to init * protect import * not sure why tbh * add another protect * more fixes * wow * it aint stopping :D * another missed type issue * ... * change design around audio tokenizer to prioritize init and go for auto - in regards to the review * change to new causal mask function + docstrings * change ternary * docs * remove todo, i dont think its essential tbh * remove pipeline as current pipelines do not fit in the current scheme, same as csm * closer to wrapping up the processor * text to audio, just for demo purposes (will likely be reverted) * check if it's this * save audio function * ensure no grad * fixes on prefixed audio, hop length is used via preprocess dac, device fixes * integration tests (tested locally on a100) + some processor utils / fixes * style * nits * another round of smaller things * docs + some fixes (generate one might be big) * msytery solved * small fix on conversion * add abstract audio tokenizer, change init check to abstract class * nits * update docs + fix some processing :D * change inheritance scheme for audio tokenizer * delete dead / unnecessary code in copied generate loop * last nits on new pipeline behavior (+ todo on tests) + style * trigger --------- Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com> Co-authored-by: Arthur 
<48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Vasqu <antonprogamer@gmail.com>
507 lines
24 KiB
Python
# Copyright 2021 the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
import sys
import tempfile
import unittest
from pathlib import Path
from shutil import copyfile

from huggingface_hub import HfFolder, Repository

import transformers
from transformers import (
    CONFIG_MAPPING,
    FEATURE_EXTRACTOR_MAPPING,
    MODEL_FOR_AUDIO_TOKENIZATION_MAPPING,
    PROCESSOR_MAPPING,
    TOKENIZER_MAPPING,
    AutoConfig,
    AutoFeatureExtractor,
    AutoProcessor,
    AutoTokenizer,
    BertTokenizer,
    ProcessorMixin,
    Wav2Vec2Config,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
)
from transformers.testing_utils import TOKEN, TemporaryHubRepo, get_tests_dir, is_staging_test
from transformers.tokenization_utils import TOKENIZER_CONFIG_FILE
from transformers.utils import (
    FEATURE_EXTRACTOR_NAME,
    PROCESSOR_NAME,
    is_tokenizers_available,
)


sys.path.append(str(Path(__file__).parent.parent.parent.parent / "utils"))

from test_module.custom_configuration import CustomConfig  # noqa E402
from test_module.custom_feature_extraction import CustomFeatureExtractor  # noqa E402
from test_module.custom_processing import CustomProcessor  # noqa E402
from test_module.custom_tokenization import CustomTokenizer  # noqa E402


SAMPLE_PROCESSOR_CONFIG = get_tests_dir("fixtures/dummy_feature_extractor_config.json")
SAMPLE_VOCAB = get_tests_dir("fixtures/vocab.json")
SAMPLE_PROCESSOR_CONFIG_DIR = get_tests_dir("fixtures")


class AutoFeatureExtractorTest(unittest.TestCase):
    vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "bla", "blou"]

    def setUp(self):
        transformers.dynamic_module_utils.TIME_OUT_REMOTE_CODE = 0

    def test_processor_from_model_shortcut(self):
        processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
        self.assertIsInstance(processor, Wav2Vec2Processor)

    def test_processor_from_local_directory_from_repo(self):
        with tempfile.TemporaryDirectory() as tmpdirname:
            model_config = Wav2Vec2Config()
            processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

            # save in new folder
            model_config.save_pretrained(tmpdirname)
            processor.save_pretrained(tmpdirname)

            processor = AutoProcessor.from_pretrained(tmpdirname)

        self.assertIsInstance(processor, Wav2Vec2Processor)

    def test_processor_from_local_directory_from_extractor_config(self):
        with tempfile.TemporaryDirectory() as tmpdirname:
            # copy relevant files
            copyfile(SAMPLE_PROCESSOR_CONFIG, os.path.join(tmpdirname, FEATURE_EXTRACTOR_NAME))
            copyfile(SAMPLE_VOCAB, os.path.join(tmpdirname, "vocab.json"))

            processor = AutoProcessor.from_pretrained(tmpdirname)

        self.assertIsInstance(processor, Wav2Vec2Processor)

    def test_processor_from_processor_class(self):
        with tempfile.TemporaryDirectory() as tmpdirname:
            feature_extractor = Wav2Vec2FeatureExtractor()
            tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h")

            processor = Wav2Vec2Processor(feature_extractor, tokenizer)

            # save in new folder
            processor.save_pretrained(tmpdirname)

            if not os.path.isfile(os.path.join(tmpdirname, PROCESSOR_NAME)):
                # create one manually in order to perform this test's objective
                config_dict = {"processor_class": "Wav2Vec2Processor"}
                with open(os.path.join(tmpdirname, PROCESSOR_NAME), "w") as fp:
                    json.dump(config_dict, fp)

            # drop `processor_class` in tokenizer config
            with open(os.path.join(tmpdirname, TOKENIZER_CONFIG_FILE)) as f:
                config_dict = json.load(f)
                config_dict.pop("processor_class")

            with open(os.path.join(tmpdirname, TOKENIZER_CONFIG_FILE), "w") as f:
                f.write(json.dumps(config_dict))

            processor = AutoProcessor.from_pretrained(tmpdirname)

        self.assertIsInstance(processor, Wav2Vec2Processor)

    def test_processor_from_feat_extr_processor_class(self):
        with tempfile.TemporaryDirectory() as tmpdirname:
            feature_extractor = Wav2Vec2FeatureExtractor()
            tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h")

            processor = Wav2Vec2Processor(feature_extractor, tokenizer)

            # save in new folder
            processor.save_pretrained(tmpdirname)

            if os.path.isfile(os.path.join(tmpdirname, PROCESSOR_NAME)):
                # drop `processor_class` in processor
                with open(os.path.join(tmpdirname, PROCESSOR_NAME)) as f:
                    config_dict = json.load(f)
                    config_dict.pop("processor_class")

                with open(os.path.join(tmpdirname, PROCESSOR_NAME), "w") as f:
                    f.write(json.dumps(config_dict))

            # drop `processor_class` in tokenizer
            with open(os.path.join(tmpdirname, TOKENIZER_CONFIG_FILE)) as f:
                config_dict = json.load(f)
                config_dict.pop("processor_class")

            with open(os.path.join(tmpdirname, TOKENIZER_CONFIG_FILE), "w") as f:
                f.write(json.dumps(config_dict))

            processor = AutoProcessor.from_pretrained(tmpdirname)

        self.assertIsInstance(processor, Wav2Vec2Processor)

    def test_processor_from_tokenizer_processor_class(self):
        with tempfile.TemporaryDirectory() as tmpdirname:
            feature_extractor = Wav2Vec2FeatureExtractor()
            tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h")

            processor = Wav2Vec2Processor(feature_extractor, tokenizer)

            # save in new folder
            processor.save_pretrained(tmpdirname)

            if os.path.isfile(os.path.join(tmpdirname, PROCESSOR_NAME)):
                # drop `processor_class` in processor
                with open(os.path.join(tmpdirname, PROCESSOR_NAME)) as f:
                    config_dict = json.load(f)
                    config_dict.pop("processor_class")

                with open(os.path.join(tmpdirname, PROCESSOR_NAME), "w") as f:
                    f.write(json.dumps(config_dict))

            # drop `processor_class` in feature extractor
            with open(os.path.join(tmpdirname, FEATURE_EXTRACTOR_NAME)) as f:
                config_dict = json.load(f)
                config_dict.pop("processor_class")

            with open(os.path.join(tmpdirname, FEATURE_EXTRACTOR_NAME), "w") as f:
                f.write(json.dumps(config_dict))

            processor = AutoProcessor.from_pretrained(tmpdirname)

        self.assertIsInstance(processor, Wav2Vec2Processor)

    def test_processor_from_local_directory_from_model_config(self):
        with tempfile.TemporaryDirectory() as tmpdirname:
            model_config = Wav2Vec2Config(processor_class="Wav2Vec2Processor")
            model_config.save_pretrained(tmpdirname)
            # copy relevant files
            copyfile(SAMPLE_VOCAB, os.path.join(tmpdirname, "vocab.json"))
            # create empty sample processor
            with open(os.path.join(tmpdirname, FEATURE_EXTRACTOR_NAME), "w") as f:
                f.write("{}")

            processor = AutoProcessor.from_pretrained(tmpdirname)

        self.assertIsInstance(processor, Wav2Vec2Processor)

    def test_from_pretrained_dynamic_processor(self):
        # If remote code is not set, we will time out when asking whether to load the model.
        with self.assertRaises(ValueError):
            processor = AutoProcessor.from_pretrained("hf-internal-testing/test_dynamic_processor")
        # If remote code is disabled, we can't load this config.
        with self.assertRaises(ValueError):
            processor = AutoProcessor.from_pretrained(
                "hf-internal-testing/test_dynamic_processor", trust_remote_code=False
            )

        processor = AutoProcessor.from_pretrained("hf-internal-testing/test_dynamic_processor", trust_remote_code=True)
        self.assertTrue(processor.special_attribute_present)
        self.assertEqual(processor.__class__.__name__, "NewProcessor")

        feature_extractor = processor.feature_extractor
        self.assertTrue(feature_extractor.special_attribute_present)
        self.assertEqual(feature_extractor.__class__.__name__, "NewFeatureExtractor")

        tokenizer = processor.tokenizer
        self.assertTrue(tokenizer.special_attribute_present)
        if is_tokenizers_available():
            self.assertEqual(tokenizer.__class__.__name__, "NewTokenizerFast")

            # Test we can also load the slow version
            new_processor = AutoProcessor.from_pretrained(
                "hf-internal-testing/test_dynamic_processor", trust_remote_code=True, use_fast=False
            )
            new_tokenizer = new_processor.tokenizer
            self.assertTrue(new_tokenizer.special_attribute_present)
            self.assertEqual(new_tokenizer.__class__.__name__, "NewTokenizer")
        else:
            self.assertEqual(tokenizer.__class__.__name__, "NewTokenizer")

    def test_new_processor_registration(self):
        try:
            AutoConfig.register("custom", CustomConfig)
            AutoFeatureExtractor.register(CustomConfig, CustomFeatureExtractor)
            AutoTokenizer.register(CustomConfig, slow_tokenizer_class=CustomTokenizer)
            AutoProcessor.register(CustomConfig, CustomProcessor)
            # Trying to register something existing in the Transformers library will raise an error
            with self.assertRaises(ValueError):
                AutoProcessor.register(Wav2Vec2Config, Wav2Vec2Processor)

            # Now that the config is registered, it can be used as any other config with the auto-API
            feature_extractor = CustomFeatureExtractor.from_pretrained(SAMPLE_PROCESSOR_CONFIG_DIR)

            with tempfile.TemporaryDirectory() as tmp_dir:
                vocab_file = os.path.join(tmp_dir, "vocab.txt")
                with open(vocab_file, "w", encoding="utf-8") as vocab_writer:
                    vocab_writer.write("".join([x + "\n" for x in self.vocab_tokens]))
                tokenizer = CustomTokenizer(vocab_file)

            processor = CustomProcessor(feature_extractor, tokenizer)

            with tempfile.TemporaryDirectory() as tmp_dir:
                processor.save_pretrained(tmp_dir)
                new_processor = AutoProcessor.from_pretrained(tmp_dir)
                self.assertIsInstance(new_processor, CustomProcessor)

        finally:
            if "custom" in CONFIG_MAPPING._extra_content:
                del CONFIG_MAPPING._extra_content["custom"]
            if CustomConfig in FEATURE_EXTRACTOR_MAPPING._extra_content:
                del FEATURE_EXTRACTOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in TOKENIZER_MAPPING._extra_content:
                del TOKENIZER_MAPPING._extra_content[CustomConfig]
            if CustomConfig in PROCESSOR_MAPPING._extra_content:
                del PROCESSOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content:
                del MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content[CustomConfig]

    def test_from_pretrained_dynamic_processor_conflict(self):
        class NewFeatureExtractor(Wav2Vec2FeatureExtractor):
            special_attribute_present = False

        class NewTokenizer(BertTokenizer):
            special_attribute_present = False

        class NewProcessor(ProcessorMixin):
            feature_extractor_class = "AutoFeatureExtractor"
            tokenizer_class = "AutoTokenizer"
            special_attribute_present = False

        try:
            AutoConfig.register("custom", CustomConfig)
            AutoFeatureExtractor.register(CustomConfig, NewFeatureExtractor)
            AutoTokenizer.register(CustomConfig, slow_tokenizer_class=NewTokenizer)
            AutoProcessor.register(CustomConfig, NewProcessor)
            # If remote code is not set, the default is to use local classes.
            processor = AutoProcessor.from_pretrained("hf-internal-testing/test_dynamic_processor")
            self.assertEqual(processor.__class__.__name__, "NewProcessor")
            self.assertFalse(processor.special_attribute_present)
            self.assertFalse(processor.feature_extractor.special_attribute_present)
            self.assertFalse(processor.tokenizer.special_attribute_present)

            # If remote code is disabled, we load the local ones.
            processor = AutoProcessor.from_pretrained(
                "hf-internal-testing/test_dynamic_processor", trust_remote_code=False
            )
            self.assertEqual(processor.__class__.__name__, "NewProcessor")
            self.assertFalse(processor.special_attribute_present)
            self.assertFalse(processor.feature_extractor.special_attribute_present)
            self.assertFalse(processor.tokenizer.special_attribute_present)

            # If remote is enabled, we load from the Hub.
            processor = AutoProcessor.from_pretrained(
                "hf-internal-testing/test_dynamic_processor", trust_remote_code=True
            )
            self.assertEqual(processor.__class__.__name__, "NewProcessor")
            self.assertTrue(processor.special_attribute_present)
            self.assertTrue(processor.feature_extractor.special_attribute_present)
            self.assertTrue(processor.tokenizer.special_attribute_present)

        finally:
            if "custom" in CONFIG_MAPPING._extra_content:
                del CONFIG_MAPPING._extra_content["custom"]
            if CustomConfig in FEATURE_EXTRACTOR_MAPPING._extra_content:
                del FEATURE_EXTRACTOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in TOKENIZER_MAPPING._extra_content:
                del TOKENIZER_MAPPING._extra_content[CustomConfig]
            if CustomConfig in PROCESSOR_MAPPING._extra_content:
                del PROCESSOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content:
                del MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content[CustomConfig]

    def test_from_pretrained_dynamic_processor_with_extra_attributes(self):
        class NewFeatureExtractor(Wav2Vec2FeatureExtractor):
            pass

        class NewTokenizer(BertTokenizer):
            pass

        class NewProcessor(ProcessorMixin):
            feature_extractor_class = "AutoFeatureExtractor"
            tokenizer_class = "AutoTokenizer"

            def __init__(self, feature_extractor, tokenizer, processor_attr_1=1, processor_attr_2=True):
                super().__init__(feature_extractor, tokenizer)

                self.processor_attr_1 = processor_attr_1
                self.processor_attr_2 = processor_attr_2

        try:
            AutoConfig.register("custom", CustomConfig)
            AutoFeatureExtractor.register(CustomConfig, NewFeatureExtractor)
            AutoTokenizer.register(CustomConfig, slow_tokenizer_class=NewTokenizer)
            AutoProcessor.register(CustomConfig, NewProcessor)
            # If remote code is not set, the default is to use local classes.
            processor = AutoProcessor.from_pretrained(
                "hf-internal-testing/test_dynamic_processor", processor_attr_2=False
            )
            self.assertEqual(processor.__class__.__name__, "NewProcessor")
            self.assertEqual(processor.processor_attr_1, 1)
            self.assertEqual(processor.processor_attr_2, False)
        finally:
            if "custom" in CONFIG_MAPPING._extra_content:
                del CONFIG_MAPPING._extra_content["custom"]
            if CustomConfig in FEATURE_EXTRACTOR_MAPPING._extra_content:
                del FEATURE_EXTRACTOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in TOKENIZER_MAPPING._extra_content:
                del TOKENIZER_MAPPING._extra_content[CustomConfig]
            if CustomConfig in PROCESSOR_MAPPING._extra_content:
                del PROCESSOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content:
                del MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content[CustomConfig]

    def test_dynamic_processor_with_specific_dynamic_subcomponents(self):
        class NewFeatureExtractor(Wav2Vec2FeatureExtractor):
            pass

        class NewTokenizer(BertTokenizer):
            pass

        class NewProcessor(ProcessorMixin):
            feature_extractor_class = "NewFeatureExtractor"
            tokenizer_class = "NewTokenizer"

            def __init__(self, feature_extractor, tokenizer):
                super().__init__(feature_extractor, tokenizer)

        try:
            AutoConfig.register("custom", CustomConfig)
            AutoFeatureExtractor.register(CustomConfig, NewFeatureExtractor)
            AutoTokenizer.register(CustomConfig, slow_tokenizer_class=NewTokenizer)
            AutoProcessor.register(CustomConfig, NewProcessor)
            # If remote code is not set, the default is to use local classes.
            processor = AutoProcessor.from_pretrained(
                "hf-internal-testing/test_dynamic_processor",
            )
            self.assertEqual(processor.__class__.__name__, "NewProcessor")
        finally:
            if "custom" in CONFIG_MAPPING._extra_content:
                del CONFIG_MAPPING._extra_content["custom"]
            if CustomConfig in FEATURE_EXTRACTOR_MAPPING._extra_content:
                del FEATURE_EXTRACTOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in TOKENIZER_MAPPING._extra_content:
                del TOKENIZER_MAPPING._extra_content[CustomConfig]
            if CustomConfig in PROCESSOR_MAPPING._extra_content:
                del PROCESSOR_MAPPING._extra_content[CustomConfig]
            if CustomConfig in MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content:
                del MODEL_FOR_AUDIO_TOKENIZATION_MAPPING._extra_content[CustomConfig]

    def test_auto_processor_creates_tokenizer(self):
        processor = AutoProcessor.from_pretrained("hf-internal-testing/tiny-random-bert")
        self.assertEqual(processor.__class__.__name__, "BertTokenizerFast")

    def test_auto_processor_creates_image_processor(self):
        processor = AutoProcessor.from_pretrained("hf-internal-testing/tiny-random-convnext")
        self.assertEqual(processor.__class__.__name__, "ConvNextImageProcessor")

    def test_auto_processor_save_load(self):
        processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
        with tempfile.TemporaryDirectory() as tmp_dir:
            processor.save_pretrained(tmp_dir)
            second_processor = AutoProcessor.from_pretrained(tmp_dir)
            self.assertEqual(second_processor.__class__.__name__, processor.__class__.__name__)


@is_staging_test
class ProcessorPushToHubTester(unittest.TestCase):
    vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "bla", "blou"]

    @classmethod
    def setUpClass(cls):
        cls._token = TOKEN
        HfFolder.save_token(TOKEN)

    def test_push_to_hub_via_save_pretrained(self):
        with TemporaryHubRepo(token=self._token) as tmp_repo:
            processor = Wav2Vec2Processor.from_pretrained(SAMPLE_PROCESSOR_CONFIG_DIR)
            # Push to hub via save_pretrained
            with tempfile.TemporaryDirectory() as tmp_dir:
                processor.save_pretrained(tmp_dir, repo_id=tmp_repo.repo_id, push_to_hub=True, token=self._token)

            new_processor = Wav2Vec2Processor.from_pretrained(tmp_repo.repo_id)
            for k, v in processor.feature_extractor.__dict__.items():
                self.assertEqual(v, getattr(new_processor.feature_extractor, k))
            self.assertDictEqual(new_processor.tokenizer.get_vocab(), processor.tokenizer.get_vocab())

    def test_push_to_hub_in_organization_via_save_pretrained(self):
        with TemporaryHubRepo(namespace="valid_org", token=self._token) as tmp_repo:
            processor = Wav2Vec2Processor.from_pretrained(SAMPLE_PROCESSOR_CONFIG_DIR)
            # Push to hub via save_pretrained
            with tempfile.TemporaryDirectory() as tmp_dir:
                processor.save_pretrained(
                    tmp_dir,
                    repo_id=tmp_repo.repo_id,
                    push_to_hub=True,
                    token=self._token,
                )

            new_processor = Wav2Vec2Processor.from_pretrained(tmp_repo.repo_id)
            for k, v in processor.feature_extractor.__dict__.items():
                self.assertEqual(v, getattr(new_processor.feature_extractor, k))
            self.assertDictEqual(new_processor.tokenizer.get_vocab(), processor.tokenizer.get_vocab())

    def test_push_to_hub_dynamic_processor(self):
        with TemporaryHubRepo(token=self._token) as tmp_repo:
            CustomFeatureExtractor.register_for_auto_class()
            CustomTokenizer.register_for_auto_class()
            CustomProcessor.register_for_auto_class()

            feature_extractor = CustomFeatureExtractor.from_pretrained(SAMPLE_PROCESSOR_CONFIG_DIR)

            with tempfile.TemporaryDirectory() as tmp_dir:
                vocab_file = os.path.join(tmp_dir, "vocab.txt")
                with open(vocab_file, "w", encoding="utf-8") as vocab_writer:
                    vocab_writer.write("".join([x + "\n" for x in self.vocab_tokens]))
                tokenizer = CustomTokenizer(vocab_file)

            processor = CustomProcessor(feature_extractor, tokenizer)

            with tempfile.TemporaryDirectory() as tmp_dir:
                repo = Repository(tmp_dir, clone_from=tmp_repo, token=self._token)
                processor.save_pretrained(tmp_dir)

                # This has added the proper auto_map field to the feature extractor config
                self.assertDictEqual(
                    processor.feature_extractor.auto_map,
                    {
                        "AutoFeatureExtractor": "custom_feature_extraction.CustomFeatureExtractor",
                        "AutoProcessor": "custom_processing.CustomProcessor",
                    },
                )

                # This has added the proper auto_map field to the tokenizer config
                with open(os.path.join(tmp_dir, "tokenizer_config.json")) as f:
                    tokenizer_config = json.load(f)
                self.assertDictEqual(
                    tokenizer_config["auto_map"],
                    {
                        "AutoTokenizer": ["custom_tokenization.CustomTokenizer", None],
                        "AutoProcessor": "custom_processing.CustomProcessor",
                    },
                )

                # The code has been copied from fixtures
                self.assertTrue(os.path.isfile(os.path.join(tmp_dir, "custom_feature_extraction.py")))
                self.assertTrue(os.path.isfile(os.path.join(tmp_dir, "custom_tokenization.py")))
                self.assertTrue(os.path.isfile(os.path.join(tmp_dir, "custom_processing.py")))

                repo.push_to_hub()

            new_processor = AutoProcessor.from_pretrained(tmp_repo.repo_id, trust_remote_code=True)
            # Can't make an isinstance check because the new_processor is from the CustomProcessor class of a dynamic module
            self.assertEqual(new_processor.__class__.__name__, "CustomProcessor")