The origin of prompts can be traced back to studies such as GPT-2, T5, and GPT-3, which found that adding a task-related prefix before the input sample can prompt the model about what to output next. For example, at prediction time GPT-3 only needs the prefix "Translate English to French:" placed before the input sample to be prompted to carry out translation. The prediction relies entirely on the knowledge the model learned during pre-training, so the model can be used directly on downstream tasks without task-specific supervised data. On the one hand, this reduces the computation and storage cost of fine-tuning; on the other hand, it is good news for zero-shot and few-shot settings, where labeled samples are extremely scarce.
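As a minimal sketch of the prefix idea, the snippet below only builds the text that would be sent to a language model; no model is called. The helper name `build_prompt` and the sample text are illustrative assumptions.

```python
def build_prompt(task_prefix: str, sample: str) -> str:
    """Prepend a task description to the raw input sample."""
    return f"{task_prefix} {sample}"


# The model itself is unchanged; only the input text carries the task.
prompt = build_prompt("Translate English to French:", "cheese =>")
print(prompt)
```

Switching tasks then amounts to swapping the prefix string, with no change to model weights.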
This approach, which relies on prompts to unlock the model's latent ability and mine the knowledge it learned during large-scale pre-training, ushered in what is often called the fourth paradigm of NLP. Researchers began to ask how to use the large number of parameters in a pre-trained language model more efficiently, and how to unify various downstream tasks under a common framework in which the model performs different tasks according to different prompts, so that a separate model no longer needs to be trained for each downstream task.
This article briefly introduces the core innovations of several important papers in the rapid development of prompt-based methods. It does not go deep into model details; reading the original papers is recommended for the full picture.
Paper: Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference (2020), known as PET.
This paper standardizes the prompt paradigm and introduces the concept of the Pattern-Verbalizer Pair (PVP):
For example, for a 5-class classification task, given an input sample a, the corresponding template function P and label mapping function v can be:
Note that both the prompt template functions and the answer mapping functions are designed by hand.
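A hand-designed template function and label mapping function for a 5-class task might look like the sketch below. The template text, the label words, and the helper names (`pattern`, `verbalize`, `predict`) are illustrative assumptions, and `mask_scores` stands in for the scores a masked language model would assign to candidate words at the mask position.

```python
from typing import Dict

MASK = "[MASK]"

# Label mapping function v: class label -> a single word in the vocabulary.
VERBALIZER: Dict[int, str] = {
    1: "terrible", 2: "bad", 3: "okay", 4: "good", 5: "great",
}


def pattern(a: str) -> str:
    """Template function P: turn input a into a cloze question."""
    return f"It was {MASK}. {a}"


def verbalize(label: int) -> str:
    """Return the word that represents a label at the mask position."""
    return VERBALIZER[label]


def predict(mask_scores: Dict[str, float]) -> int:
    """Pick the label whose word scores highest at the mask position."""
    return max(
        VERBALIZER,
        key=lambda label: mask_scores.get(VERBALIZER[label], float("-inf")),
    )


print(pattern("Best pizza ever!"))
```

Classification becomes a fill-in-the-blank problem: the model scores each label word at the mask, and the best-scoring word is mapped back to its label.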
The pre-trained model is then fine-tuned on the newly constructed P(x) and v(l); other details are not covered here. Experimental results show that this method performs well on few-shot tasks.
Paper: It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners (by the original PET team)
GPT-3 showed striking few-shot learning ability when it appeared, but its enormous parameter count is prohibitive. The authors of this paper argue that small models can also perform well in few-shot learning, positioning their work directly against the giant GPT-3. This established the paradigm proposed by PET as a leading approach and drew wide attention from the research community.
The paper demonstrates the validity of the PET paradigm. The authors also found that the choice of prompt template and label mapping function (verbalizer) strongly affects model performance, which led to a wave of work on improving how templates and verbalizers are constructed.
Paper: Making Pre-trained Language Models Better Few-shot Learners.
Instead of constructing prompt templates and label mapping functions by hand as in PET, this paper searches for the template and the label mapping automatically. In addition, borrowing from in-context learning in GPT-3, demonstrations are added as context to help the model better understand what it is supposed to do.
Experiments show that prompt-based fine-tuning in the few-shot setting is clearly better than standard fine-tuning, and that adding demonstrations to the input does bring gains.
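The idea of adding demonstrations as context can be sketched as plain string construction: the query is rendered with a mask, and each demonstration is rendered with the same template but with its label word filled in. The template, label words, and example sentences below are illustrative assumptions, not the ones found by the paper's automatic search.

```python
from typing import List, Tuple

MASK = "[MASK]"
TEMPLATE = "{text} It was {word}."
LABEL_WORDS = {"positive": "great", "negative": "terrible"}


def fill(text: str, word: str) -> str:
    """Render one sentence with the template."""
    return TEMPLATE.format(text=text, word=word)


def build_input(query: str, demonstrations: List[Tuple[str, str]]) -> str:
    """Query with a mask, followed by labeled demonstrations."""
    parts = [fill(query, MASK)]
    for text, label in demonstrations:
        parts.append(fill(text, LABEL_WORDS[label]))
    return " ".join(parts)


demos = [("A fun ride.", "positive"), ("The drama discloses nothing.", "negative")]
print(build_input("No reason to watch.", demos))
```

The model still predicts only the word at the mask position; the demonstrations serve as worked examples in the same format.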
A prompt does not have to be built from discrete tokens that humans can understand; it can also be a continuous vector that the model can accept.
4.1 Paper: Prefix-Tuning: Optimizing Continuous Prompts for Generation
This paper proposes a continuous prompt method for NLG tasks. A prefix matrix is added to each layer of the pre-trained model, the parameters of the pre-trained model are frozen, and only the parameters of the prefix matrix are trained.
The experimental results show that this prompt-based tuning achieves results comparable to standard fine-tuning, and in the few-shot setting it exceeds standard fine-tuning.
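To give a feel for how small the trained part is when only per-layer prefixes are learned, here is a back-of-the-envelope count. It assumes the prefix contributes one key vector and one value vector per prefix position in every layer and ignores any reparameterization network used during training; the layer count, hidden size, and prefix length are hypothetical values, not figures from the paper.

```python
def prefix_param_count(prefix_len: int, num_layers: int, hidden_size: int) -> int:
    """Trainable parameters when only the per-layer prefix is learned."""
    # Assumption: one key vector and one value vector per prefix position.
    per_layer = prefix_len * 2 * hidden_size
    return num_layers * per_layer


# Hypothetical configuration, chosen only for illustration.
count = prefix_param_count(prefix_len=10, num_layers=24, hidden_size=1024)
print(count)  # 491520
```

Everything else in the network stays frozen, so one copy of the pre-trained weights can be shared across tasks, with only a small prefix stored per task.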
4.2 Paper: GPT Understands, Too (P-tuning)
This paper also constructs continuous prompts, this time for NLU tasks. Unlike prefix-tuning, the prompt only needs to be added at the input layer rather than at every layer of the network, and this is enough for it to work well.
A BiLSTM encodes the prompt; the encoded prompt embeddings and the sample x are then fed into the pre-trained language model (PLM), and both the prompt embeddings and the pre-trained model are fine-tuned.
Optimizing continuous prompt vectors directly runs into two problems:
Therefore, the authors propose using a BiLSTM as the prompt encoder to encode the prompt vectors.
The specific prompt template is designed as follows:
The experimental results show that this prompt-based tuning can match or even exceed standard fine-tuning.
Paper: The Power of Scale for Parameter-Efficient Prompt Tuning.
This paper proposes designing a separate prompt for each downstream task, concatenating it with the input sample, then freezing the model weights completely and training only the parameters of the prompt. It finds that as model size increases, prompt tuning gradually catches up with standard fine-tuning.
"Model tuning" here refers to standard fine-tuning, that is, updating the parameters of the pre-trained model on the downstream task.
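The contrast between model tuning and prompt tuning can be illustrated with a toy example in plain Python: a fixed weight vector stands in for the frozen pre-trained model, and gradient descent updates only the prompt vectors concatenated in front of the input. The linear "model", the squared-error loss, and all numbers are assumptions made for illustration; a real setup would use a Transformer and an autograd library.

```python
from typing import List

Vector = List[float]

# Stand-in for frozen pre-trained weights; never updated below.
FROZEN_W: Vector = [0.5, -0.25]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def model(sequence: List[Vector]) -> float:
    """Toy frozen model: sum of W . e over every position in the sequence."""
    return sum(dot(FROZEN_W, vec) for vec in sequence)


def train_prompt(prompt: List[Vector], x_embeds: List[Vector],
                 target: float, lr: float = 0.1, steps: int = 50) -> List[Vector]:
    """Gradient descent on squared error, updating only the prompt vectors."""
    prompt = [list(p) for p in prompt]  # copy so the caller's list is untouched
    for _ in range(steps):
        err = model(prompt + x_embeds) - target
        for p in prompt:
            for i, w in enumerate(FROZEN_W):
                # d(err^2)/d(p[i]) = 2 * err * W[i]
                p[i] -= lr * 2 * err * w
    return prompt


x = [[1.0, 2.0]]                       # embedded input sample
init_prompt = [[0.0, 0.0], [0.0, 0.0]]  # two trainable prompt vectors
tuned = train_prompt(init_prompt, x, target=1.0)
print(abs(model(init_prompt + x) - 1.0), abs(model(tuned + x) - 1.0))
```

Only `tuned` changes during training; `FROZEN_W` stays identical, which is why a single copy of the model can serve many tasks, each with its own small prompt.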
Finally, the general patterns in the experimental results across these papers are summarized. The tuning strategies used in the papers fall mainly into the following three types: