More Efficient NLP Model Pre-Training with ELECTRA
This article was originally published on the Google AI Blog, and is translated and shared by InfoQ with the original author's authorization.
Recently, great progress has been made in natural language processing thanks to advances in language pre-training: BERT, RoBERTa, XLNet, ALBERT, T5, and other state-of-the-art models. Although these methods differ in design, they share a common idea: before fine-tuning on a specific NLP task (such as sentiment analysis or question answering), use a large amount of unlabeled text to build a general model of language understanding.
Existing pre-training methods generally fall into two categories: language models (LMs), such as GPT, which process input text from left to right and predict the next word given the previous context; and masked language models (MLMs), such as BERT, RoBERTa, and ALBERT, which predict a small subset of words that have been masked out of the input. MLMs have the advantage of being bidirectional rather than unidirectional, because they can "see" the text on both sides of the token being predicted. However, they predict only the small masked subset of the input (about 15% of tokens), which reduces the amount learned from each sentence.
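As a concrete illustration of the MLM masking scheme, here is a minimal sketch (toy tokenization by whitespace, with an assumed fixed random seed for reproducibility) that hides 15% of the input tokens; only those hidden positions would contribute to an MLM's training loss:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style masking sketch: hide roughly 15% of the tokens.

    Only the masked positions produce a training signal for an MLM,
    which is the inefficiency that replaced token detection avoids.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"  # the model must reconstruct these tokens
    return masked, positions

tokens = "the chef cooked the meal".split()
masked, positions = mask_tokens(tokens)
```

In a real MLM the loss is then computed only at `positions`; the other ~85% of tokens are read but never predicted.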
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (ELECTRA stands for Efficiently Learning an Encoder that Classifies Token Replacements Accurately) introduces a new pre-training method that outperforms existing techniques given the same compute budget. For example, ELECTRA matches the performance of RoBERTa and XLNet on the GLUE natural language understanding benchmark while using less than 25% of their compute, and achieves state-of-the-art results on the SQuAD question-answering benchmark. ELECTRA's excellent efficiency means it works well even at small scale. We are releasing ELECTRA as an open-source model, including a number of ready-to-use pre-trained language representation models.
ELECTRA uses a new pre-training task called replaced token detection (RTD), which trains a bidirectional model (like an MLM) while learning from all input positions (like a LM). Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish "real" input tokens from plausible but synthetically generated replacements.
For example, in the figure below, the model (the discriminator) determines which tokens in the original input have been replaced and which remain unchanged. Crucially, this binary classification task is applied to every input token, rather than to only the small subset of masked tokens (15% in BERT-style models), which makes RTD more efficient than MLM.
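The per-token targets the discriminator is trained on can be sketched in a few lines (the tokens here are illustrative toy inputs, not from the actual figure data):

```python
def rtd_labels(original, corrupted):
    """Replaced-token-detection labels: 1 = replaced, 0 = original.

    Unlike MLM, every input position yields a training signal,
    not just the ~15% that were masked.
    """
    assert len(original) == len(corrupted)
    return [int(o != c) for o, c in zip(original, corrupted)]

original  = "the chef cooked the meal".split()
corrupted = "the chef ate the meal".split()   # "cooked" replaced by "ate"
labels = rtd_labels(original, corrupted)      # -> [0, 0, 1, 0, 0]
```

The discriminator's binary cross-entropy loss is then summed over all five positions, whereas an MLM loss on this sentence would cover only the single masked position.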
The replacement tokens come from another neural network called the generator. While the generator can be any model that produces an output distribution over tokens, we use a small masked language model (i.e., a BERT model with a small hidden size that is trained jointly with the discriminator). Although the generator-feeding-into-discriminator structure resembles a GAN, applying GANs to text is difficult, so we instead train the generator with maximum likelihood to predict the masked words, rather than adversarially. The generator and discriminator share the same input word embeddings. After pre-training, the generator is discarded and the discriminator (the ELECTRA model) is fine-tuned on downstream tasks. All of our models use the Transformer neural architecture.
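The joint training setup above can be sketched as a single combined objective: the generator's maximum-likelihood MLM loss over the masked positions, plus a weighted discriminator loss over all positions. This is an illustrative sketch, not the official implementation; the ELECTRA paper reports weighting the discriminator loss by λ = 50, but treat the specific numbers here as assumptions:

```python
import math

def electra_loss(gen_probs, disc_probs, disc_labels, lam=50.0):
    """Illustrative sketch of ELECTRA's joint pre-training objective.

    gen_probs:   generator's probability assigned to the correct token
                 at each masked position (MLM cross-entropy term)
    disc_probs:  discriminator's predicted P(token was replaced),
                 one value per input position
    disc_labels: 1 if the token was replaced, 0 if original
    lam:         weight on the discriminator loss (50 in the paper)
    """
    # Generator: average negative log-likelihood over masked positions only.
    gen_loss = -sum(math.log(p) for p in gen_probs) / len(gen_probs)
    # Discriminator: binary cross-entropy averaged over *all* positions.
    disc_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(disc_probs, disc_labels)
    ) / len(disc_probs)
    return gen_loss + lam * disc_loss

toy_loss = electra_loss(
    gen_probs=[0.8],                         # one masked position
    disc_probs=[0.1, 0.2, 0.9, 0.1, 0.05],   # P(replaced) per token
    disc_labels=[0, 0, 1, 0, 0],             # only token 2 was replaced
)
```

Note that, unlike a GAN, the generator's gradient comes from this likelihood term rather than from fooling the discriminator, which is what makes the setup stable on text.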
ELECTRA results
We compared ELECTRA against other state-of-the-art NLP models and found that, given the same compute budget, it substantially improves over previous methods, performing comparably to RoBERTa and XLNet while using less than 25% of their compute.
The x-axis shows the amount of compute used to train the model (in FLOPs), and the y-axis shows the dev-set GLUE score. ELECTRA learns much more efficiently than existing pre-trained NLP models. Note that the current best models on GLUE, such as T5 (11B), do not fit on this plot because they use far more compute than the others (around 10 times more than RoBERTa).
To further improve efficiency, we tested a small ELECTRA model that can be trained to good accuracy on a single GPU in 4 days. Although it does not reach the accuracy of large models that require many TPUs to train, ELECTRA-Small still performs remarkably well, even outperforming GPT while requiring only 1/30th of the compute. Finally, to see whether these encouraging results hold at scale, we trained a large ELECTRA model with more compute (roughly the same amount as RoBERTa, and about 10% of T5's). This model reaches a new state of the art on the SQuAD 2.0 question-answering dataset (see the table below) and outperforms RoBERTa, XLNet, and ALBERT on the GLUE leaderboard. While the large T5-11B model scores higher on GLUE, ELECTRA is 1/30th the size of T5-11B and uses only 10% of the compute to train.
We have released the code for pre-training ELECTRA and fine-tuning it on downstream tasks. The currently supported tasks include text classification, question answering, and sequence tagging. The code supports quickly training a small ELECTRA model on a single GPU. We have also released pre-trained weights for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small. The ELECTRA models are currently English-only, but we hope to release models pre-trained on many other languages in the future.