Source: arXiv, Editors: Xiaoqin, Zhang Jia
[Xin Zhiyuan Introduction] BERT tops the GLUE leaderboard again! Today, Facebook publishes a
King BERT is back!
Not long ago, XLNet, the pre-training model proposed by CMU and Google Brain, outperformed BERT on 20 tasks.
However, XLNet did not hold the throne for long. Today, Facebook announced RoBERTa, an enhanced pre-training model built on BERT.
The latest GLUE leaderboard
RoBERTa stands for "Robustly optimized BERT approach". The method for strengthening BERT is simple and blunt.
The study was carried out by a team from Facebook AI and the University of Washington. The first author was Yinhan Liu, a Chinese researcher, along with Jingfei Du and Danqi Chen.
Veselin Stoyanov, one of the authors, published the results on Twitter.
Yann LeCun, head of Facebook AI, recommends:
How did RoBERTa come to top three benchmark leaderboards? In short: use more data, adopt more careful training techniques, and train for longer.
In the paper, the authors write:
The authors point out that the choice of hyperparameters has a significant impact on the final results.
They released the models and code: https://github.com/pytorch/fairseq
Next, Xin Zhiyuan brings a detailed interpretation of this paper:
With good enough training, BERT can match or exceed the performance of all subsequent methods.
Self-training methods such as ELMo, GPT, BERT, XLM and XLNet have brought significant performance gains, but it is challenging to determine which aspects of these methods contribute the most. Training is computationally expensive, which limits how much tuning can be done, and it is often carried out on private training data of varying sizes, which limits our ability to measure the effect of modeling advances.
We conducted a replication study of BERT pre-training (Devlin et al., 2019), including a careful evaluation of the effects of hyperparameter tuning and training set size. We find that BERT was significantly under-trained, and we propose an improved recipe for training BERT models, which we call RoBERTa, that can match or exceed the performance of all post-BERT methods.
The changes we have made are simple, including:
(1) Training the model longer, with larger batches, over more data;
(2) Removing the next sentence prediction objective;
(3) Training on longer sequences;
(4) Dynamically changing the masking pattern applied to the training data.
We also collected a new dataset (CC-NEWS), comparable in size to other privately used datasets, to better control for the effect of training set size.
When controlling for training data, our improved training procedure improves upon the published BERT results on both GLUE and SQuAD.
When trained for longer over additional data, our model scores 88.5 on the public GLUE leaderboard, matching the 88.4 reported by Yang et al. (2019). Our model sets a new state of the art on four of the nine GLUE tasks: MNLI, QNLI, RTE and STS-B. In addition, we achieve state-of-the-art results on SQuAD and RACE.
In conclusion, the contributions of this paper are as follows:
(1) We present a set of important BERT design choices and training strategies, and introduce alternatives that lead to better downstream task performance;
(2) We use a new dataset, CC-NEWS, and confirm that using more data for pre-training further improves performance on downstream tasks;
(3) Our training improvements show that masked language model pre-training, under the right design choices, is competitive with all other recently published methods. We release our model, together with pre-training and fine-tuning code implemented in PyTorch.
Model architecture: Transformer
BERT uses the now popular transformer architecture, which we will not discuss in detail here. We use a transformer architecture with L layers, where each block uses A self-attention heads and hidden dimension H.
During pre-training, BERT uses two objectives: masked language modeling and next sentence prediction.
Masked Language Model (MLM): a random sample of the tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM objective is a cross-entropy loss on predicting the masked tokens. BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token.
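The 15% / 80-10-10 rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name mask_tokens and the toy vocabulary are hypothetical, and each token is selected independently with probability 0.15, which approximates rather than reproduces the selection used in the real BERT code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # token not selected for prediction
        labels[i] = tok                   # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK              # 80%: replace with [MASK]
        elif roll < 0.9:
            pass                          # 10%: leave the token unchanged
        else:
            masked[i] = rng.choice(vocab) # 10%: random vocabulary token
    return masked, labels

# Toy usage with a hypothetical vocabulary
rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
masked_sentence, targets = mask_tokens(sentence, vocab, rng)
```

The cross-entropy loss would then be computed only at positions where the label is not None.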
In the original implementation, random masking and replacement is performed once at the beginning and saved for the duration of training, although in practice the data is duplicated, so the mask is not always the same for every training sentence.
Next sentence prediction (NSP) is a binary classification loss for predicting whether two segments follow each other in the original text. Positive examples are created by taking consecutive sentences from the text corpus. Negative examples are created by pairing segments from different documents. Positive and negative examples are sampled with equal probability.
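A minimal sketch of how such NSP training pairs could be built is shown below. The function name make_nsp_example and the toy corpus are hypothetical, and the sketch assumes at least two documents with at least two sentences each; it is not the original BERT data pipeline.

```python
import random

def make_nsp_example(docs, rng):
    """Build one NSP example from docs (a list of documents, each a list of sentences).

    Returns (segment_a, segment_b, is_next).
    """
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    i = rng.randrange(len(doc) - 1)       # leave room for a following sentence
    segment_a = doc[i]
    if rng.random() < 0.5:
        # Positive example: the sentence that actually follows segment_a
        return segment_a, doc[i + 1], True
    # Negative example: a sentence drawn from a different document
    other_idx = rng.choice([j for j in range(len(docs)) if j != doc_idx])
    return segment_a, rng.choice(docs[other_idx]), False

# Toy corpus: sentence ids encode document (letter) and position (digit)
corpus = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
rng = random.Random(0)
examples = [make_nsp_example(corpus, rng) for _ in range(200)]
```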
The NSP objective was designed to improve performance on downstream tasks, such as natural language inference, that require reasoning about the relationship between pairs of sentences.
In this section, we describe the experimental setup for the BERT replication study.
We re-implemented BERT in FAIRSEQ. We mainly follow the original BERT optimization hyperparameters given in Section 2, except for the peak learning rate and the number of warmup steps, which are tuned separately for each setting.
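To make the two tuned quantities concrete, here is a small sketch of a learning-rate schedule controlled by a peak learning rate and a number of warmup steps. It assumes linear warmup followed by linear decay to zero; the function name and all numeric values are hypothetical and are not the settings used by the authors.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Hypothetical setting: peak 1e-4, 10 warmup steps, 100 steps in total
schedule = [learning_rate(s, 1e-4, 10, 100) for s in range(101)]
```

Raising the peak or changing the warmup length changes the whole curve, which is one reason these two values would be retuned whenever the batch size changes.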
In addition, we find that training is very sensitive to the Adam epsilon term, and in some cases tuning it gives better performance or improved stability. Similarly, we found that the settings
We train with mixed-precision floating point arithmetic on DGX-1 machines, each with 8 × 32GB Nvidia V100 GPUs interconnected via Infiniband.
Which choices are critical for successful training of BERT models
This section explores and quantifies which choices are critical for successfully training BERT models. We keep the model architecture fixed. Specifically, we start by training BERT models with the same configuration as BERTBASE (L = 12, H = 768, A = 12, 110M parameters).
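As a rough sanity check on the 110M figure, the parameter count of a BERT-style encoder can be estimated from L and H. The sketch below assumes a WordPiece vocabulary of 30,522 tokens, 512 position embeddings, 2 segment types, a feed-forward size of 4H and a pooler layer; none of these values is stated in this article, and the function name is hypothetical. The number of heads A does not change the count, because the heads split the same H × H projections.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (no task heads)."""
    ffn = 4 * hidden                      # assumed feed-forward width
    layer_norm = 2 * hidden               # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler

base_params = bert_param_count(num_layers=12, hidden=768)
print(f"BERTBASE estimate: {base_params / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands just under 110M, in line with the configuration quoted above.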
1. Static masking vs. dynamic masking
As discussed earlier, BERT relies on randomly masking tokens and predicting them. The original BERT implementation uses a static mask, applied once during data preprocessing. To avoid using the same mask for each training instance in every epoch, the training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
We compare this strategy with dynamic masking, in which the masking pattern is generated every time a sequence is fed to the model. This becomes crucial when pre-training for more steps or with larger datasets.
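The difference between the two strategies can be sketched as follows. The masking function is deliberately simplified (it only substitutes [MASK], without the 80/10/10 split), and all function names are hypothetical.

```python
import random
from collections import Counter

def random_mask(tokens, rng, prob=0.15):
    """Simplified masking: replace each token with [MASK] with probability prob."""
    return ["[MASK]" if rng.random() < prob else t for t in tokens]

def static_masks(tokens, num_copies, rng):
    """Static masking: precompute num_copies masked versions during preprocessing."""
    return [random_mask(tokens, rng) for _ in range(num_copies)]

def static_view(copies, epoch):
    """Serve the precomputed copies in rotation, one per epoch."""
    return copies[epoch % len(copies)]

def dynamic_view(tokens, rng):
    """Dynamic masking: draw a fresh pattern every time the sequence is served."""
    return random_mask(tokens, rng)

rng = random.Random(0)
tokens = [f"tok{i}" for i in range(50)]
copies = static_masks(tokens, num_copies=10, rng=rng)

# Over 40 epochs, each of the 10 precomputed patterns is reused 40 / 10 = 4 times
uses = Counter(epoch % len(copies) for epoch in range(40))
fresh = dynamic_view(tokens, rng)
```

With dynamic masking nothing is stored ahead of time, so the number of distinct patterns grows with the number of training steps instead of being fixed at preprocessing time.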
Table 1: Comparison of static and dynamic masking for BERTBASE. We report F1 for SQuAD and accuracy for MNLI-m and SST-2. Reported results are medians over five random initializations. Reference results are from Yang et al. (2019).
Results: Table 1 compares the BERTBASE results published by Devlin et al. (2019) with our re-implementation using either static or dynamic masking. We find that our re-implementation with static masking performs similarly to the original BERT model, while dynamic masking is comparable to or slightly better than static masking.
Given these results and the additional efficiency benefits of dynamic masking, we use dynamic masking in the remaining experiments.
2. Model Input Format and Next Sentence Prediction
In the original BERT pre-training procedure, the model observes two concatenated document segments, which are either sampled contiguously from the same document (with p = 0.5) or from different documents. In addition to the masked language modeling objective, the model is trained to predict whether the observed segments come from the same or different documents via an auxiliary next sentence prediction (NSP) loss.
The NSP loss was considered an important factor in training the original BERT model. Devlin et al. (2019) observed that removing NSP hurts performance, with significant degradation on QNLI, MNLI and SQuAD. However, some recent work has questioned the necessity of the NSP loss.
To better understand this difference, we compared several alternative training formats:
Table 2: Development set results for base models pre-trained on BOOKCORPUS and WIKIPEDIA.
Table 2 shows the results for the four different settings. We find that using individual sentences hurts performance on downstream tasks. We hypothesize that this is because the model is unable to learn long-range dependencies.
Next, we compare training without the NSP loss against training with blocks of text from a single document (doc-sentences). We find that this setting outperforms the originally published BERTBASE results and that, in contrast to Devlin et al. (2019), removing the NSP loss matches or slightly improves downstream task performance.
Finally, we find that restricting sequences to a single document (doc-sentences) performs slightly better than packing sequences from multiple documents (full-sentences). However, because the doc-sentences format results in variable batch sizes, we use full-sentences in the remainder of our experiments for easier comparison with related work.
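The two packing formats compared above can be illustrated with a small greedy packer. This is a sketch under assumptions: the function name, the [SEP] boundary token and the length limit are illustrative choices, not the paper's exact preprocessing.

```python
def pack_sentences(docs, max_len, cross_documents, sep="[SEP]"):
    """Greedily pack whole sentences into sequences of at most max_len tokens.

    docs: list of documents; each document is a list of sentences;
          each sentence is a list of tokens.
    cross_documents=True  -> sequences may span documents (full-sentences style)
    cross_documents=False -> sequences never span documents (doc-sentences style)
    """
    sequences, current = [], []
    for doc in docs:
        for sent in doc:
            sent = sent[:max_len]         # truncate over-long sentences so they fit
            if current and len(current) + len(sent) > max_len:
                sequences.append(current) # sentence does not fit: start a new sequence
                current = []
            current = current + sent
        if not cross_documents:
            if current:
                sequences.append(current) # close the sequence at the document end
            current = []
        elif current and current[-1] != sep and len(current) < max_len:
            current = current + [sep]     # mark the document boundary
    if current and current[-1] == sep:
        current = current[:-1]            # drop a trailing boundary marker
    if current:
        sequences.append(current)
    return sequences

docs = [[["a"] * 3, ["b"] * 3], [["c"] * 3]]
full = pack_sentences(docs, max_len=10, cross_documents=True)
per_doc = pack_sentences(docs, max_len=10, cross_documents=False)
```

In this toy run the full-sentences style fills one sequence to the limit, while the doc-sentences style yields two shorter sequences of different lengths, which is why the number of tokens per batch varies in that format.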
3. Training with large batches
Previous work in neural machine translation has shown that training with very large mini-batches can improve both optimization speed and final task performance when the learning rate is increased appropriately. Recent work has shown that BERT is also amenable to large-batch training.
Devlin et al. (2019) originally trained BERTBASE for 1 million steps with a batch size of 256 sequences.
In Table 3, we compare the perplexity and final task performance of BERTBASE as the batch size increases, while controlling for the number of passes through the training data. We observe that training with large batches improves both the perplexity of the masked language modeling objective and final task accuracy. Large batches are also easier to parallelize with distributed data-parallel training. In subsequent experiments, we train with batches of 8K sequences.
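Controlling for the number of passes through the data means keeping steps × batch size constant. The arithmetic can be checked with a short sketch; it assumes that "8K" means 8,192 sequences and also shows a 2,048-sequence batch for comparison, both of which are illustrative readings rather than values stated in this article.

```python
def equivalent_steps(base_steps, base_batch, new_batch):
    """Steps needed at new_batch to process the same number of sequences."""
    return base_steps * base_batch / new_batch

BASE_STEPS, BASE_BATCH = 1_000_000, 256           # original BERTBASE setup
total_sequences = BASE_STEPS * BASE_BATCH         # sequences processed in total

steps_2k = equivalent_steps(BASE_STEPS, BASE_BATCH, 2048)   # 125000.0
steps_8k = equivalent_steps(BASE_STEPS, BASE_BATCH, 8192)   # 31250.0
```

So a batch 32 times larger needs 32 times fewer optimization steps to see the same amount of data.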
Table 3: Perplexity on held-out training data (ppl) and development set accuracy for base models trained with different batch sizes.
RoBERTa: Three Benchmark Data Sets
In the previous section, we proposed modifications to the BERT pre-training procedure that improve final task performance. We now aggregate these improvements and evaluate their combined impact. We call this configuration RoBERTa, that is, Robustly optimized BERT approach.
Specifically, RoBERTa is trained with dynamic masking, full sentences without the NSP loss, large mini-batches and a larger byte-level BPE vocabulary.
In addition, we investigate two other important factors that have been under-emphasized in previous work: (1) the data used for pre-training and (2) the number of training passes through the data. For example, the recently proposed XLNet architecture is pre-trained on nearly 10 times more data than the original BERT. It is also trained with a batch size eight times larger for half as many optimization steps, so it sees four times as many sequences during pre-training as BERT.
To separate the importance of these factors from other modeling choices (e.g., the pre-training objective), we first train RoBERTa following the BERTLARGE architecture (L = 24, H = 1024, A = 16, 355M parameters). As in Devlin et al. (2019), we pre-train for 100K steps on the BOOKCORPUS and WIKIPEDIA datasets. We pre-train our model on 1024 V100 GPUs for approximately one day.
As shown in Table 4, when controlling for training data, we observe that RoBERTa provides a large improvement over the originally reported BERTLARGE results, reaffirming the importance of the design choices we explored in Section 4.
Table 4: Results for RoBERTa as we pre-train on more data (16GB to 160GB of text) and pre-train for longer (100K to 300K to 500K steps).
Next, we combine this data with the three additional datasets described in Section 3.2. We train RoBERTa on the combined data with the same number of training steps as before (100K). In total, we pre-train on over 160GB of text. We observe further improvements in performance across all downstream tasks, which validates the importance of data size and diversity in pre-training.
Finally, we pre-train RoBERTa for significantly longer, increasing the number of pre-training steps from 100K to 300K, and then further to 500K. We again observe significant gains in downstream task performance, and the 300K and 500K step models outperform XLNetLARGE on most tasks. We note that even our longest-trained model does not appear to overfit our data and would likely benefit from additional training.
In the rest of this paper, we evaluate our best RoBERTa model on three different benchmarks: GLUE, SQuAD and RACE. Specifically, we consider RoBERTa trained for 500K steps on all five datasets introduced in Section 3.2.
Table 5: Results on GLUE. All results are based on a 24-layer architecture. BERTLARGE and XLNetLARGE results are from Devlin et al. and Yang et al., respectively. RoBERTa results on the development set are the median of five runs. RoBERTa results on the test set are ensembles of single-task models. For RTE, STS and MRPC, we fine-tune starting from the MNLI model rather than the baseline pre-trained model. Averages are obtained from the GLUE leaderboard.
Table 6: SQuAD results. † indicates results that depend on additional external training data. RoBERTa uses only the provided SQuAD data in both the development and test settings. BERTLARGE and XLNetLARGE results are from Devlin et al. and Yang et al., respectively.
Table 7: Results on the RACE test set. BERTLARGE and XLNetLARGE results are from Yang et al.