Re-enable doctests for the quicktour (#15828)

* Re-enable doctests for the quicktour

* Re-enable doctests for task_summary (#15830)

* Remove &
This commit is contained in:
Sylvain Gugger 2022-02-25 17:46:38 +01:00 committed by GitHub
parent fd5b05eb81
commit 0118c4f6a8
5 changed files with 98 additions and 37 deletions

View File

@@ -15,6 +15,7 @@
# tests directory-specific settings - this file is run automatically
# by pytest before any tests are run
import doctest
import sys
import warnings
from os.path import abspath, dirname, join
@@ -59,3 +60,17 @@ def pytest_sessionfinish(session, exitstatus):
# If no tests are collected, pytest exits with code 5, which makes the CI fail.
if exitstatus == 5:
session.exitstatus = 0
# Doctest custom flag to ignore output.
IGNORE_RESULT = doctest.register_optionflag('IGNORE_RESULT')
OutputChecker = doctest.OutputChecker
class CustomOutputChecker(OutputChecker):
def check_output(self, want, got, optionflags):
if IGNORE_RESULT & optionflags:
return True
return OutputChecker.check_output(self, want, got, optionflags)
doctest.OutputChecker = CustomOutputChecker
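As a rough sketch of the effect (illustrative only, not part of this commit): once the custom checker is installed, a doctest line carrying the new directive passes no matter what it prints. The deliberately wrong expected output below is accepted:
```py
import doctest

# Build a tiny doctest whose expected output is wrong on purpose; the
# IGNORE_RESULT directive makes the custom checker accept it anyway.
test = doctest.DocTestParser().get_doctest(
    ">>> 2 + 2  # doctest: +IGNORE_RESULT\n5\n", {}, "demo", "demo.py", 0
)
results = doctest.DocTestRunner().run(test)
print(results)  # TestResults(failed=0, attempted=1)
```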

View File

@@ -340,14 +340,47 @@ seen [here](https://github.com/huggingface/transformers/actions/workflows/doctes
To include your example in the daily doctests, you need to add the filename that
contains the example docstring to the [documentation_tests.txt](../utils/documentation_tests.txt).
You can test the example locally as follows:
- For Python files ending with *.py*:
### For Python files
You can run all the tests in the docstrings of a given file with the following command; for instance, here is how we test the modeling file of Wav2Vec2:
```bash
pytest --doctest-modules src/transformers/models/wav2vec2/modeling_wav2vec2.py -sv --doctest-continue-on-failure
```
If you want to isolate a specific docstring, just add `::` after the file name, then type the whole path of the function/class/method whose docstring you want to test. For instance, here is how to test just the forward method of `Wav2Vec2ForCTC`:
```bash
pytest --doctest-modules src/transformers/models/wav2vec2/modeling_wav2vec2.py::transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForCTC.forward -sv --doctest-continue-on-failure
```
- For Markdown files ending with *.mdx*:
### For Markdown files
You will first need to run the following command (from the root of the repository) to prepare the doc files (doc-testing needs to add additional lines that we don't include in the doc source files):
```bash
python utils/prepare_for_doc_test.py src docs
```
Then you can test a given file locally with this command (here, testing the quicktour):
```bash
pytest --doctest-modules docs/source/quicktour.mdx -sv --doctest-continue-on-failure --doctest-glob="*.mdx"
```
Once you're done, you can run the following command (still from the root of the repository) to undo the changes made by the first command before committing:
```bash
python utils/prepare_for_doc_test.py src docs --remove_new_line
```
### Writing doctests
Here are a few tips to help you debug the doctests and make them pass:
- The outputs of the code need to match the expected output **exactly**, so make sure you have the same outputs. In particular, doctest will see a difference between single quotes and double quotes, or a missing parenthesis. The only exceptions to that rule are:
  * whitespace: one whitespace character (space, tab, new line) is equivalent to any number of whitespace characters, so you can add new lines where there are spaces to make your output more readable.
  * numerical values: you should never put more than 4 or 5 digits in expected results, as different setups or library versions might give you slightly different results. `doctest` is configured to ignore any difference smaller than the precision to which you wrote (so 1e-4 if you write 4 digits).
- Don't leave a block of code that takes very long to execute. If you can't make it fast, you can either not use the doctest syntax on it (so that it's ignored), or, if you still want to use the doctest syntax to show the results, add a `# doctest: +SKIP` comment at the end of the lines of code that are too slow to execute.
- Each line of code that produces a result needs to have that result written below it. You can ignore an output if you don't want to show it in your code example by adding a `# doctest: +IGNORE_RESULT` comment at the end of the line of code producing it. The short example after this list illustrates both directives.
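Putting those tips together, here is a short hypothetical doctest (not taken from the library) that keeps numerical outputs to 4 digits, ignores an uninteresting return value, and skips a call that would be too slow to run (`very_long_training_run` is a made-up name):
```py
>>> import torch

>>> torch.manual_seed(0)  # doctest: +IGNORE_RESULT
>>> probabilities = torch.softmax(torch.tensor([[1.0, 2.0, 3.0]]), dim=-1)
>>> print(probabilities)
tensor([[0.0900, 0.2447, 0.6652]])
>>> very_long_training_run()  # doctest: +SKIP
```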

View File

@@ -80,7 +80,7 @@ The pipeline downloads and caches a default [pretrained model](https://huggingfa
```py
>>> classifier("We are very happy to show you the 🤗 Transformers library.")
[{"label": "POSITIVE", "score": 0.9998}]
[{'label': 'POSITIVE', 'score': 0.9998}]
```
For more than one sentence, pass a list of sentences to the [`pipeline`] which returns a list of dictionaries:
@@ -112,20 +112,22 @@ Next, load a dataset (see the 🤗 Datasets [Quick Start](https://huggingface.co
```py
>>> import datasets
>>> dataset = datasets.load_dataset("superb", name="asr", split="test")
>>> dataset = datasets.load_dataset("superb", name="asr", split="test") # doctest: +IGNORE_RESULT
```
Now you can iterate over the dataset with the pipeline. `KeyDataset` retrieves the item in the dictionary returned by the dataset:
You can pass a whole dataset to the pipeline:
```py
>>> from transformers.pipelines.pt_utils import KeyDataset
>>> from tqdm.auto import tqdm
>>> for out in tqdm(speech_recognizer(KeyDataset(dataset, "file"))):
... print(out)
{"text": "HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE"}
>>> files = dataset["file"]
>>> speech_recognizer(files[:4])
[{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'},
{'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'},
{'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'},
{'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}]
```
For a larger dataset where the inputs are big (as in speech or vision), you will want to pass a generator instead of a list so that all the inputs are not loaded in memory at once. See the [pipeline documentation](main_classes/pipeline) for more information.
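As a rough sketch of that idea (illustrative only, not part of the quicktour text; it assumes the `speech_recognizer` pipeline and `dataset` from above, and that the pipeline accepts a generator of inputs), you could stream the file paths lazily:
```py
>>> def data():
...     # Yield one file path at a time so the full dataset never sits in memory.
...     for example in dataset:
...         yield example["file"]

>>> for prediction in speech_recognizer(data()):  # doctest: +SKIP
...     print(prediction["text"])
```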
### Use another model and tokenizer in the pipeline
The [`pipeline`] can accommodate any model from the [Model Hub](https://huggingface.co/models), making it easy to adapt the [`pipeline`] for other use-cases. For example, if you'd like a model capable of handling French text, use the tags on the Model Hub to filter for an appropriate model. The top filtered result returns a multilingual [BERT model](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) fine-tuned for sentiment analysis. Great, let's use this model!
@@ -141,7 +143,7 @@ Use the [`AutoModelForSequenceClassification`] and [`AutoTokenizer`] to load the
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
@@ -153,7 +155,7 @@ Then you can specify the model and tokenizer in the [`pipeline`], and apply the
```py
>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")
[{"label": "5 stars", "score": 0.7272651791572571}]
[{'label': '5 stars', 'score': 0.7273}]
```
If you can't find a model for your use-case, you will need to fine-tune a pretrained model on your data. Take a look at our [fine-tuning tutorial](./training) to learn how. Finally, after you've fine-tuned your pretrained model, please consider sharing it (see tutorial [here](./model_sharing)) with the community on the Model Hub to democratize NLP for everyone! 🤗
@@ -186,8 +188,9 @@ Pass your text to the tokenizer:
```py
>>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
>>> print(encoding)
{"input_ids": [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
"attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
The tokenizer will return a dictionary containing:
@@ -205,7 +208,7 @@ Just like the [`pipeline`], the tokenizer will accept a list of inputs. In addit
... max_length=512,
... return_tensors="pt",
... )
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
@@ -226,7 +229,7 @@ Read the [preprocessing](./preprocessing) tutorial for more details about tokeni
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSequenceClassification
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
@@ -243,7 +246,7 @@ Now you can pass your preprocessed batch of inputs directly to the model. If you
```py
>>> pt_outputs = pt_model(**pt_batch)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_outputs = tf_model(tf_batch)
```
@@ -254,16 +257,17 @@ The model outputs the final activations in the `logits` attribute. Apply the sof
>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
>>> print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
[5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
===PT-TF-SPLIT===
tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
[0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
>>> # ===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
>>> print(tf_predictions)
tf.Tensor(
[[2.2043e-04 9.9978e-01]
[5.3086e-01 4.6914e-01]], shape=(2, 2), dtype=float32)
[[0.00206 0.00177 0.01155 0.21209 0.77253]
[0.20842 0.18262 0.19693 0.1755 0.23652]], shape=(2, 5), dtype=float32)
```
<Tip>
@@ -288,11 +292,11 @@ Once your model is fine-tuned, you can save it with its tokenizer using [`PreTra
```py
>>> pt_save_directory = "./pt_save_pretrained"
>>> tokenizer.save_pretrained(pt_save_directory)
>>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT
>>> pt_model.save_pretrained(pt_save_directory)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_save_directory = "./tf_save_pretrained"
>>> tokenizer.save_pretrained(tf_save_directory)
>>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT
>>> tf_model.save_pretrained(tf_save_directory)
```
@@ -300,7 +304,7 @@ When you are ready to use the model again, reload it with [`PreTrainedModel.from
```py
>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")
```
@@ -311,7 +315,7 @@ One particularly cool 🤗 Transformers feature is the ability to save a model a
>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)

View File

@@ -122,7 +122,8 @@ is paraphrase: 90%
... print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> import tensorflow as tf
@@ -258,7 +259,8 @@ Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
>>> import tensorflow as tf
@@ -407,7 +409,8 @@ Distilled models are smaller than the models they mimic. Using them instead of t
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForMaskedLM, AutoTokenizer
>>> import tensorflow as tf
@@ -481,7 +484,8 @@ of tokens.
>>> resulting_string = tokenizer.decode(generated.tolist()[0])
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
>>> import tensorflow as tf
@@ -565,7 +569,8 @@ Below is an example of text generation using `XLNet` and its tokenizer, which in
>>> print(generated)
Today the weather is really nice and I am planning ...
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer
>>> model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
@@ -687,7 +692,7 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> outputs = model(**inputs).logits
>>> predictions = torch.argmax(outputs, dim=2)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForTokenClassification, AutoTokenizer
>>> import tensorflow as tf
@@ -827,7 +832,8 @@ CNN / Daily Mail), it yields very good results.
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
@@ -890,7 +896,8 @@ Here is an example of doing translation using a model and a tokenizer. The proce
>>> print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")

View File

@@ -5,3 +5,5 @@ src/transformers/models/unispeech/modeling_unispeech.py
src/transformers/models/unispeech_sat/modeling_unispeech_sat.py
src/transformers/models/sew/modeling_sew.py
src/transformers/models/sew_d/modeling_sew_d.py
docs/source/quicktour.mdx
docs/source/task_summary.mdx