From 80e05453766dc52a3dba230c8a7f0e04234e3054 Mon Sep 17 00:00:00 2001
From: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com>
Date: Thu, 29 Aug 2024 14:20:29 +0900
Subject: [PATCH] Update docs/source/en/model_doc/vitpose.md

---
 docs/source/en/model_doc/vitpose.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/vitpose.md b/docs/source/en/model_doc/vitpose.md
index 0e8855ae7a1..fcecf107370 100644
--- a/docs/source/en/model_doc/vitpose.md
+++ b/docs/source/en/model_doc/vitpose.md
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
 
 ## Overview
 
-The ViTPose model was proposed in [ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation](https://arxiv.org/abs/2204.12484) by Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao. ViTPose employs a standard, non-hierarchical [Vision Transformer](vit) as backbone for the task of keypoint estimation. A simple decoder head is added on top to predict the heatmaps from a given image. Despite its simplicity, the model gets state-of-the-art results on the challenging MS COCO Keypoint Detection benchmark.
+The ViTPose model was proposed in [ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation](https://arxiv.org/abs/2204.12484) by Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao. ViTPose employs a standard, non-hierarchical [Vision Transformer](https://arxiv.org/pdf/2010.11929v2) as backbone for the task of keypoint estimation. A simple decoder head is added on top to predict the heatmaps from a given image. Despite its simplicity, the model gets state-of-the-art results on the challenging MS COCO Keypoint Detection benchmark.
 
 The abstract from the paper is the following:
 