Add two model_cards: ethanyt/guwenbert-base and ethanyt/guwenbert-large (#8041)

This commit is contained in:
Ethan 2020-10-29 20:21:54 +08:00 committed by GitHub
parent ba2ad3a98a
commit b215090eed
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 148 additions and 0 deletions

View File

@ -0,0 +1,74 @@
---
language:
- "zh"
thumbnail: "https://user-images.githubusercontent.com/9592150/97142000-cad08e00-179a-11eb-88df-aff9221482d8.png"
tags:
- "chinese"
- "classical chinese"
- "literary chinese"
- "ancient chinese"
- "bert"
- "pytorch"
license: "apache-2.0"
pipeline_tag: "fill-mask"
widget:
- text: "[MASK]太元中,武陵人捕鱼为业。"
- text: "山不在[MASK],有仙则名。"
- text: "浔阳江头夜送客,枫叶[MASK]花秋瑟瑟。"
---
# GuwenBERT
## Model description
![GuwenBERT](https://user-images.githubusercontent.com/9592150/97142000-cad08e00-179a-11eb-88df-aff9221482d8.png)
This is a RoBERTa model pre-trained on Classical Chinese. You can fine-tune GuwenBERT for downstream tasks, such as sentence breaking, punctuation, named entity recognition, and so on.
For more information about RoBERTa, take a look at the RoBERTa's offical repo.
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModel.from_pretrained("ethanyt/guwenbert-base")
```
## Training data
The training data is daizhige dataset (殆知阁古代文献) which is contains of 15,694 books in Classical Chinese, covering Buddhism, Confucianism, Medicine, History, Zi, Yi, Yizang, Shizang, Taoism, and Jizang.
76% of them are punctuated.
The total number of characters is 1.7B (1,743,337,673).
All traditional Characters are converted to simplified characters.
The vocabulary is constructed from this data set and the size is 23,292.
## Training procedure
The models are initialized with `hfl/chinese-roberta-wwm-ext` and then pre-trained with a 2-step strategy.
In the first step, the model learns MLM with only word embeddings updated during training, until convergence. In the second step, all parameters are updated during training.
The models are trained on 4 V100 GPUs for 120K steps (20K for step#1, 100K for step#2) with a batch size of 2,048 and a sequence length of 512. The optimizer used is Adam with a learning rate of 2e-4, adam-betas of (0.9,0.98), adam-eps of 1e-6, a weight decay of 0.01, learning rate warmup for 5K steps, and linear decay of learning rate after.
## Eval results
### "Gulian Cup" Ancient Books Named Entity Recognition Evaluation
Second place in the competition. Detailed test results:
| NE Type | Precision | Recall | F1 |
|:----------:|:-----------:|:------:|:-----:|
| Book Name | 77.50 | 73.73 | 75.57 |
| Other Name | 85.85 | 89.32 | 87.55 |
| Micro Avg. | 83.88 | 85.39 | 84.63 |
## About Us
We are from [Datahammer](https://datahammer.net), Beijing Institute of Technology.
For more cooperation, please contact email: ethanyt [at] qq.com
> Created with ❤️ by Tan Yan [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/Ethan-yt) and Zewen Chi [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/CZWin32768)

View File

@ -0,0 +1,74 @@
---
language:
- "zh"
thumbnail: "https://user-images.githubusercontent.com/9592150/97142000-cad08e00-179a-11eb-88df-aff9221482d8.png"
tags:
- "chinese"
- "classical chinese"
- "literary chinese"
- "ancient chinese"
- "bert"
- "pytorch"
license: "apache-2.0"
pipeline_tag: "fill-mask"
widget:
- text: "[MASK]太元中,武陵人捕鱼为业。"
- text: "山不在[MASK],有仙则名。"
- text: "浔阳江头夜送客,枫叶[MASK]花秋瑟瑟。"
---
# GuwenBERT
## Model description
![GuwenBERT](https://user-images.githubusercontent.com/9592150/97142000-cad08e00-179a-11eb-88df-aff9221482d8.png)
This is a RoBERTa model pre-trained on Classical Chinese. You can fine-tune GuwenBERT for downstream tasks, such as sentence breaking, punctuation, named entity recognition, and so on.
For more information about RoBERTa, take a look at the RoBERTa's offical repo.
## How to use
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
```
## Training data
The training data is daizhige dataset (殆知阁古代文献) which is contains of 15,694 books in Classical Chinese, covering Buddhism, Confucianism, Medicine, History, Zi, Yi, Yizang, Shizang, Taoism, and Jizang.
76% of them are punctuated.
The total number of characters is 1.7B (1,743,337,673).
All traditional Characters are converted to simplified characters.
The vocabulary is constructed from this data set and the size is 23,292.
## Training procedure
The models are initialized with `hfl/chinese-roberta-wwm-ext-large` and then pre-trained with a 2-step strategy.
In the first step, the model learns MLM with only word embeddings updated during training, until convergence. In the second step, all parameters are updated during training.
The models are trained on 4 V100 GPUs for 120K steps (20K for step#1, 100K for step#2) with a batch size of 2,048 and a sequence length of 512. The optimizer used is Adam with a learning rate of 1e-4, adam-betas of (0.9,0.98), adam-eps of 1e-6, a weight decay of 0.01, learning rate warmup for 5K steps, and linear decay of learning rate after.
## Eval results
### "Gulian Cup" Ancient Books Named Entity Recognition Evaluation
Second place in the competition. Detailed test results:
| NE Type | Precision | Recall | F1 |
|:----------:|:-----------:|:------:|:-----:|
| Book Name | 77.50 | 73.73 | 75.57 |
| Other Name | 85.85 | 89.32 | 87.55 |
| Micro Avg. | 83.88 | 85.39 | 84.63 |
## About Us
We are from [Datahammer](https://datahammer.net), Beijing Institute of Technology.
For more cooperation, please contact email: ethanyt [at] qq.com
> Created with ❤️ by Tan Yan [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/Ethan-yt) and Zewen Chi [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/CZWin32768)