diff --git a/docs/source/en/tasks/zero_shot_object_detection.md b/docs/source/en/tasks/zero_shot_object_detection.md
index 03e849a6c79..5ac4706bffe 100644
--- a/docs/source/en/tasks/zero_shot_object_detection.md
+++ b/docs/source/en/tasks/zero_shot_object_detection.md
@@ -26,8 +26,8 @@ is an open-vocabulary object detector. It means that it can detect objects in im
 the need to fine-tune the model on labeled datasets.
 
 OWL-ViT leverages multi-modal representations to perform open-vocabulary detection. It combines [CLIP](../model_doc/clip) with
-lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads.
-associate images and their corresponding textual descriptions, and ViT processes image patches as inputs. The authors
+lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads,
+which associate images with their corresponding textual descriptions, while ViT processes image patches as inputs. The authors
 of OWL-ViT first trained CLIP from scratch and then fine-tuned OWL-ViT end to end on standard object detection datasets
 using a bipartite matching loss.
 
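For reference, the corrected paragraph describes detection driven by free-text queries. A minimal sketch of that workflow with the `zero-shot-object-detection` pipeline is shown below; the `google/owlvit-base-patch32` checkpoint, the image URL, and the candidate labels are illustrative assumptions, not part of this change.

```py
from transformers import pipeline
from PIL import Image
import requests

# Zero-shot object detection pipeline backed by an OWL-ViT checkpoint (assumed here)
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

# Placeholder image; any PIL image or URL-loaded image works
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Free-text queries are embedded by CLIP's text encoder and matched against image patches
predictions = detector(image, candidate_labels=["cat", "remote control", "couch"])
for pred in predictions:
    print(pred["label"], round(pred["score"], 3), pred["box"])
```

Each prediction is a dict with a detected `label` (one of the candidate labels), a confidence `score`, and a `box` with pixel coordinates, which is what the classification and localization heads described in the edited paragraph produce.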