Fix incomplete sentence in Zero-shot object detection documentation (#33430)

Rephrase sentence in zero-shot object detection docs
Sergio Paniego Blanco 2024-09-12 11:25:44 +02:00 committed by GitHub
parent e0ff4321d1
commit 516ee6adc2

@@ -26,8 +26,8 @@ is an open-vocabulary object detector. It means that it can detect objects in im
 the need to fine-tune the model on labeled datasets.
 OWL-ViT leverages multi-modal representations to perform open-vocabulary detection. It combines [CLIP](../model_doc/clip) with
-lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads.
-associate images and their corresponding textual descriptions, and ViT processes image patches as inputs. The authors
+lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads,
+which associate images with their corresponding textual descriptions, while ViT processes image patches as inputs. The authors
 of OWL-ViT first trained CLIP from scratch and then fine-tuned OWL-ViT end to end on standard object detection datasets using
 a bipartite matching loss.
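
The mechanism described in the rephrased paragraph (text queries embedded by a text encoder, then compared against per-patch image representations by a classification head) can be loosely illustrated with a toy sketch. This is not the OWL-ViT implementation: the function names `cosine` and `classify_patches`, the hand-written 3-dimensional vectors, and the 0.5 threshold are all hypothetical stand-ins for learned embeddings and the model's real scoring, and the localization head and bipartite matching loss are left out entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_patches(patch_embeddings, query_embeddings, threshold=0.5):
    """Assign each image patch the best-matching text query, if it scores above the threshold.

    patch_embeddings: list of vectors, one per image patch (toy stand-in for ViT outputs).
    query_embeddings: dict mapping query text to a vector (toy stand-in for text-encoder outputs).
    Returns a list of (patch_index, query_text, score) tuples.
    """
    results = []
    for index, patch in enumerate(patch_embeddings):
        scores = {label: cosine(patch, emb) for label, emb in query_embeddings.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            results.append((index, best, scores[best]))
    return results


# Hypothetical embeddings: real ones are learned and high-dimensional.
queries = {"cat": [1.0, 0.0, 0.0], "remote control": [0.0, 1.0, 0.0]}
patches = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

detections = classify_patches(patches, queries)
for patch_index, label, score in detections:
    print(f"patch {patch_index}: {label} ({score:.2f})")
```

Because the queries are plain strings embedded at inference time, new categories can be added to `queries` without retraining, which is the property the documentation calls open-vocabulary detection. The third patch matches neither query and is dropped by the threshold.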