<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# BORT

## Overview

The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by
Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for BERT, which the
authors refer to as "Bort".

The abstract from the paper is the following:

*We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which
is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large
(Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same
hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the
architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*

Tips:

- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
  model's API as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples; see the sketch after this list for how the model and tokenizer are combined when loading BORT.
- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology),
  which sadly has not been open-sourced yet. It would be very useful for the community if someone implemented the
  algorithm to make BORT fine-tuning work.
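
As a minimal sketch of how the first two tips combine in practice, the snippet below loads the BORT weights into a BERT model class while tokenizing the input with the RoBERTa tokenizer. The checkpoint name `"amazon/bort"` is an assumption here; substitute the BORT checkpoint you actually use.

```python
from transformers import BertModel, RobertaTokenizer

# BORT uses BERT's architecture but RoBERTa's tokenizer, so the two pieces
# come from different model families ("amazon/bort" is an assumed Hub checkpoint name).
tokenizer = RobertaTokenizer.from_pretrained("amazon/bort")
model = BertModel.from_pretrained("amazon/bort")

inputs = tokenizer("BORT is a compressed variant of BERT.", return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)
```

The same pattern applies to the BERT task heads (for example `BertForSequenceClassification`), keeping in mind that reproducing the paper's fine-tuning results would additionally require the Agora algorithm mentioned above.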

This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).