Syntactic analysis can be divided into syntactic structure analysis and dependency analysis. Analysis whose goal is the syntactic structure of the whole sentence is called complete syntactic analysis; analysis whose goal is only certain local components is called partial (shallow) analysis; and dependency relation analysis is called dependency parsing for short.
Generally speaking, syntactic analysis has three tasks:
Determine whether the input string belongs to the language.
Eliminate lexical and structural ambiguities in input sentences.
Analyze the internal structure of the input sentence, such as its constituent composition and contextual relations.
The second and third tasks are generally the main tasks of syntactic analysis.
Generally speaking, constructing a parser requires two parts. The first is the formal representation of the grammar and the description of lexical-entry information: the formalized grammar rules make up the rule base, the entry information is supplied by dictionaries or thesauri, and together the rule base and the dictionaries or thesauri constitute the parser's knowledge base. The second part is the analysis algorithm that runs on this knowledge base.
Grammar formalization belongs to the research field of syntactic theory. At present, context-free grammar (CFG) and constraint-based grammars are the most widely used in natural language processing; the latter are also called unification grammars.
Simply put, syntactic structure analysis methods can be divided into two categories: rule-based analysis methods and statistical analysis methods.
The basic idea of the rule-based approach to syntactic structure analysis is to compile grammar rules by hand, build a grammar knowledge base, and eliminate syntactic structural ambiguity through condition constraints and checking.
According to the direction in which the parse tree is formed, these methods are usually divided into three types: top-down analysis, bottom-up analysis, and their combination. A top-down algorithm implements rule derivation: the parse tree grows from the root node until the leaves spell out the analyzed sentence. A bottom-up algorithm implements the reverse idea: starting from the sentence's symbol string, it performs successive reductions until the root node is formed.
Using hand-written rules, rule-based syntactic structure analysis can enumerate all possible syntactic structures of an input sentence; for specific domains and purposes, targeted rules can handle certain ambiguities and some extra-grammatical phenomena in sentences quite well.
However, for an input sentence of even moderate length, analyzing all possible structures with a broad-coverage grammar is very difficult, and even when the analysis succeeds, it is hard to disambiguate effectively and select the most plausible result. Hand-written rules are subjective, and their coverage is hard to guarantee, so accuracy suffers in complex contexts. Writing rules by hand is also laborious, and the rules are closely tied to a particular domain, which hinders porting the parser to other fields.
Rule-based parsing algorithms handle the compilation of programming languages successfully, yet they struggle to escape their predicament with natural language, for two reasons. First, programming languages are defined using restricted subclasses of context-free grammar, whereas the formal descriptions a natural language processing system needs far exceed the expressive power of context-free grammar. Second, when people use a programming language, every expression must obey the machine's requirements: it is a process of humans obeying machines, a mapping from an infinite set of possible utterances onto a finite set. Natural language processing works the other way around: the machine tracks and adapts to human language, generalizing from a finite set of observed language to an infinite one.
Complete syntactic analysis
Basic analysis method based on PCFG
The phrase structure analysis method based on probabilistic context-free grammar (PCFG) is arguably the most successful grammar-driven statistical parsing method to date; it can be viewed as a combination of the rule-based and statistical approaches.
PCFG is an extension of CFG in which every production rule is annotated with a probability, for example:
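A toy grammar in the spirit of the book's example might look as follows (the specific rules and probabilities are my own illustrative assumptions, chosen so that the example sentence analyzed below has exactly two parses):

```
S  -> NP VP     [1.0]
VP -> V NP      [0.6]
VP -> VP PP     [0.4]
PP -> P NP      [1.0]
NP -> NP PP     [0.2]
NP -> "He"      [0.3]
NP -> "Jenny"   [0.3]
NP -> "flowers" [0.2]
V  -> "met"     [1.0]
P  -> "with"    [1.0]
```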
Naturally, the probabilities of all productions sharing the same left-hand-side symbol sum to 1. NP denotes a noun phrase, VP a verb phrase, and PP a prepositional phrase.
The syntactic analysis model based on PCFG meets the following three conditions:
Position invariance: the probability of a subtree does not depend on where the words spanned by the subtree occur in the sentence.
Context-freeness: the probability of a subtree does not depend on words outside the subtree's span.
Ancestor-freeness: the probability of a subtree does not depend on the subtree's ancestor nodes.
According to the above grammar, "He met Jenny with flowers" has two possible grammatical structures:
We can multiply the probabilities of all the rules used in each tree to obtain each tree's overall probability, and choose the tree with the higher probability as the best structure.
Like HMMs, PCFG raises three basic problems:
Given a sentence $W = w_1 w_2 \cdots w_n$ and a grammar $G$, how do we quickly compute the probability $P(W \mid G)$?
Given a sentence $W = w_1 w_2 \cdots w_n$ and a grammar $G$, how do we choose the best structure for the sentence? That is, we select the parse tree $t^{*} = \arg\max_t P(t \mid W, G)$ with the greatest probability.
Given a PCFG $G$ and a sentence $W = w_1 w_2 \cdots w_n$, how do we adjust the probability parameters of $G$ to maximize the probability of the sentence?
Start with the first problem. In HMMs we use the forward and backward algorithms to compute the probability of an observation sequence $O$; similarly, here we use the inside and outside algorithms to compute $P(W \mid G)$.
First we define the inside variable $\alpha_{ij}(A)$, which is analogous to, but not the same as, the forward variable: $\alpha_{ij}(A)$ is the probability that the non-terminal $A$ derives the substring $w_i w_{i+1} \cdots w_j$ of $W$. Then $P(W \mid G)$ is simply $\alpha_{1n}(S)$, where $S$ is the start symbol: the probability that the start symbol $S$ derives the whole sentence $W = w_1 w_2 \cdots w_n$.
So as long as we have a recurrence for $\alpha_{ij}(A)$, $P(W \mid G)$ can be computed. By definition, $\alpha_{ii}(A)$ is simply the probability that the symbol $A$ outputs $w_i$. For $i < j$, the idea behind computing $\alpha_{ij}(A)$ is that the substring $w_i w_{i+1} \cdots w_j$ can be split into two parts: the first part $w_i \cdots w_k$ is generated by a non-terminal $B$, and the second part $w_{k+1} \cdots w_j$ is generated by a non-terminal $C$, where $A \to BC$. The recurrence is therefore:

$$\alpha_{ii}(A) = P(A \to w_i)$$

$$\alpha_{ij}(A) = \sum_{B,\,C}\ \sum_{k=i}^{j-1} P(A \to B\,C)\,\alpha_{ik}(B)\,\alpha_{(k+1)j}(C)$$
The following is a schematic diagram of how the inside variables are computed:
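As a concrete sketch of the inside algorithm (assuming a grammar in Chomsky normal form stored as two dictionaries of rule probabilities; this data layout is my own convention, not the book's):

```python
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """Inside algorithm: returns P(W|G) for a PCFG in Chomsky normal form.

    lexical: dict mapping (A, word) -> P(A -> word)
    binary:  dict mapping (A, B, C) -> P(A -> B C)
    """
    n = len(words)
    # alpha[(i, j)][A] = P(A =>* w_i ... w_j), 0-indexed, inclusive span
    alpha = defaultdict(lambda: defaultdict(float))

    # Initialization: alpha_ii(A) = P(A -> w_i)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                alpha[(i, i)][A] += p

    # Recursion over increasing span lengths
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, B, C), p in binary.items():
                for k in range(i, j):  # split: B spans i..k, C spans k+1..j
                    alpha[(i, j)][A] += p * alpha[(i, k)][B] * alpha[(k + 1, j)][C]

    return alpha[(0, n - 1)][start]
```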
This problem can also be solved with the outside algorithm.
First, define the outside variable: $\beta_{ij}(A)$ is the probability that, in deriving the sentence $W = w_1 w_2 \cdots w_n$, the start symbol $S$ generates the string $w_1 w_2 \cdots w_{i-1}\,A\,w_{j+1} \cdots w_n$ (the book adds: implying that $A$ will generate $w_i w_{i+1} \cdots w_j$), that is,

$$\beta_{ij}(A) = P(S \Rightarrow^{*} w_1 \cdots w_{i-1}\,A\,w_{j+1} \cdots w_n \mid G)$$
The book Statistical Natural Language Processing (Second Edition) gets this wrong; here I give my own understanding. The algorithm steps given in the book are as follows:
There is an obvious error: the initialization step already initializes the final result, so the algorithm as written computes nothing, and the answer would simply equal 1.
This comes from the book author's vague handling of the definition of the outside variable. The definition above contains the parenthetical sentence "implying that $A$ will generate $w_i w_{i+1} \cdots w_j$". The question is: is "$A$ generates $w_i \cdots w_j$" a condition, or an inference?
Look at the meaning of the algorithm's initialization: $\beta_{1n}(A)$ is 1 when $A = S$ and 0 otherwise. What does that mean? It means the parenthetical sentence is a condition: $\beta_{1n}(S)$ already presupposes that $S$ generates $W = w_1 w_2 \cdots w_n$, and the so-called $w_1 \cdots w_{i-1}\,A\,w_{j+1} \cdots w_n$ is just $S$ itself, so the probability is naturally 1.
But how does the book's author read it in the third step? There, the parenthetical sentence is treated as an inference: in $\beta_{1n}(S)$, "$S$ generates $W = w_1 w_2 \cdots w_n$" is taken as a conclusion, which happens to be exactly the quantity we want, the probability that $S$ generates $W$.
So what is my understanding? The $\beta_{1n}(S)$ computed by the formula is indeed correct, and its meaning does treat "implying that $A$ will generate $w_i w_{i+1} \cdots w_j$" as an inference; but the $\beta_{1n}(S)$ that keeps reappearing recursively on the right-hand side of the formula is the initialized one, in which that sentence is a condition.
To mark the difference in meaning, I would add an asterisk to the $\beta_{1n}(S)$ in the third step, writing it as $\beta^{*}_{1n}(S)$.
The book also gives a schematic diagram of how the outside variables are computed, which I find hard to understand:
It says $\beta_{ij}(A)$ is the sum of the probabilities of these two situations. We know $j$ is larger than $i$, yet the $k$ in this figure is both smaller than $i$ and larger than $j$; isn't that absurd? One can only conclude that the two $C$'s in the figure are not the same $C$, and the two $k$'s are not the same $k$.
Then why did I read them as the same? Apart from the identical letters, the book earlier speaks of "rules of the form $B \to AC$ or $B \to CA$" and then of "using the two rules $B \to AC$ or $B \to CA$"; evidently the order of the two cases got swapped somewhere.
In addition, the inside variables are used inconsistently, so the book's explanation of the outside algorithm can fairly be called a failure. Moreover, the outside algorithm's computation still depends on the inside algorithm's recursion; one might as well use the inside algorithm directly, since the outside algorithm also requires defining extra variables.
Next is the second problem: choosing the best structure for a sentence. Given a sentence $W = w_1 w_2 \cdots w_n$ and a grammar $G$, we select the parse tree with the highest probability. This problem is analogous to the decoding problem in HMMs, and is again solved with dynamic programming: the maximum-probability parse tree is produced by the CYK algorithm.
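A minimal sketch of the CYK decoder, under the same assumed grammar representation as the inside-algorithm sketch above: the recursion mirrors the inside algorithm, but takes a max over split points instead of a sum and keeps back-pointers to rebuild the best tree.

```python
from collections import defaultdict

def viterbi_parse(words, lexical, binary, start="S"):
    """CYK decoding: returns (best probability, best tree) for a CNF PCFG."""
    n = len(words)
    best = defaultdict(dict)   # best[(i, j)][A] = max prob of A spanning i..j
    back = defaultdict(dict)   # back-pointers for tree reconstruction

    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w and p > best[(i, i)].get(A, 0.0):
                best[(i, i)][A] = p
                back[(i, i)][A] = w

    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, B, C), p in binary.items():
                for k in range(i, j):
                    score = p * best[(i, k)].get(B, 0.0) * best[(k + 1, j)].get(C, 0.0)
                    if score > best[(i, j)].get(A, 0.0):
                        best[(i, j)][A] = score
                        back[(i, j)][A] = (k, B, C)

    def build(i, j, A):
        entry = back[(i, j)][A]
        if isinstance(entry, str):          # lexical rule: leaf of the tree
            return (A, entry)
        k, B, C = entry
        return (A, build(i, k, B), build(k + 1, j, C))

    if start not in best[(0, n - 1)]:
        return 0.0, None
    return best[(0, n - 1)][start], build(0, n - 1, start)
```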
The third problem is, given a PCFG $G$ and a sentence $W = w_1 w_2 \cdots w_n$, how to adjust the probability parameters of $G$ to maximize the sentence's probability. Analogous to the HMM case, the algorithm PCFG uses here is called the inside-outside algorithm. Like the forward-backward algorithm, it is an EM algorithm. Its basic idea: randomly assign a probability to each production of $G$ (satisfying the normalization condition), yielding grammar $G_0$; then, using $G_0$ and the training data, compute the expected number of times each rule is used, apply maximum likelihood estimation with these expected counts to obtain new grammar parameters, denoted $G_1$; and repeat in a loop.
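In symbols, each iteration of this EM procedure re-estimates every production from its expected usage count (a standard statement of the update; the notation is mine):

$$P_{i+1}(A \to \mu) = \frac{E_{G_i}[\mathrm{count}(A \to \mu)]}{\sum_{\mu'} E_{G_i}[\mathrm{count}(A \to \mu')]}$$

where the expectations are computed over the training corpus using the inside and outside variables.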
PCFG is just a special kind of context-free grammar model. Given a PCFG and a sentence, parsing the sentence and generating its syntactic structure tree rely on the CYK algorithm, which determines whether any given string $w$ can be generated by a context-free grammar.
There are many problems with the PCFG-based parsing model; for example, because it does not model the lexicon, PCFG is insensitive to lexical information. Lexicalized phrase structure parsers were therefore proposed, which effectively improve on the capability of PCFG-based parsers.
Moreover, the three independence assumptions of PCFG mentioned above also cause a lack of structural dependency between rules (just as the three assumptions of HMM are not entirely reasonable), whereas in natural language the probability of each non-terminal is often related to its surrounding structure. This motivated refining the non-terminals, annotating each non-terminal with the syntactic tag of its parent node; for example, an NP under S (typically a subject) and an NP under VP (typically an object) become the distinct symbols NP^S and NP^VP.
D. Klein proposed PCFG with latent annotations (PCFG-LA), which makes the refinement of non-terminals automatic. To avoid getting stuck in local optima, the EM algorithm was improved with a hierarchical "split-merge" strategy, yielding an accurate and compact PCFG-LA model. The Berkeley Parser, based on PCFG-LA, is representative of non-lexicalized parsers and leads the open-source phrase structure parsers in both accuracy and running speed. Its syntax tree looks like this:
A comparison of an ordinary syntax tree and a PCFG-LA syntax tree
Here X is a latent annotation; the range of values of $x_i$ is usually set by hand, typically an integer from 1 to 16. PCFG-LA is also analogous to the HMM: the original non-terminal corresponds to the HMM's observed output, and the latent annotation corresponds to the HMM's hidden state.
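As an illustration (the symbols below are my own rendering of the scheme, not taken from the book): each original non-terminal such as NP is split into annotated variants, and rule probabilities are learned over the annotated symbols:

$$\mathrm{NP} \;\longrightarrow\; \mathrm{NP}[1], \mathrm{NP}[2], \ldots, \mathrm{NP}[16], \qquad P\big(\mathrm{NP}[x] \to \mathrm{DT}[y]\ \mathrm{NN}[z]\big), \quad x, y, z \in \{1, \ldots, 16\}$$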
Shallow syntactic analysis (partial parsing)
In complete syntactic analysis, determining all the syntactic information in a sentence and the relations among its components is a very difficult task. So far, parsers have struggled to reach a satisfactory level on all fronts. To reduce the complexity of the problem while still obtaining some syntactic structure information, shallow syntactic analysis emerged.
Shallow parsing only requires identifying certain structures in a sentence, such as non-recursive noun phrases and verb phrases; these are usually called chunks.
Shallow parsing decomposes syntactic analysis into two main subtasks: the identification and analysis of chunks, and the analysis of the dependencies among chunks. Of these, chunk identification and analysis is the main task. Shallow parsing simplifies the parsing task to some extent, and it also helps parsing systems find rapid application in large-scale real-text processing.
Base noun phrases (base NPs) are an important category of chunk: simple, non-nested noun phrases that contain no other sub-phrases and are structurally independent. Examples are as follows:
Base NP recognition means identifying all base NPs in a sentence. Under this view, the components of a sentence fall into just two classes, base NP and non-base NP, so base NP recognition becomes a classification problem.
Base NPs have two representations: bracketing and IOB tagging. Bracketing marks base NP boundaries with square brackets: what is inside the brackets is a base NP, and what is outside does not belong to one. In IOB tagging, the letter B marks the beginning of a base NP, I marks a word inside a base NP, and O marks a word outside any base NP.
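For example, for a short English sentence (my own illustrative example), the two representations look like this:

```
Bracketing:  [He] saw [the big dog] .
IOB tags:    He/B-NP  saw/O  the/B-NP  big/I-NP  dog/I-NP  ./O
```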
Base noun phrase recognition based on SVM
Since base NP recognition is a multi-class classification problem while the basic SVM algorithm solves binary classification, strategies such as pairwise (one-vs-one) comparison or one-vs-rest are generally adopted.
In general, SVM needs features drawn from the words, parts of speech, and base NP tags in the context to make its decision. A word window of length 5 (the current word plus the two words before and after it) usually gives the best recognition results.
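A sketch of such a window-5 feature extractor, paired with scikit-learn's SVC (which handles the multi-class case with the pairwise one-vs-one strategy). The feature template and toy data are my own assumptions patterned on the description, and the dynamic features over previously assigned chunk tags are omitted for simplicity:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

def chunk_features(words, tags, i):
    """Features for position i: the words and POS tags in a +/-2 window."""
    feats = {}
    for off in range(-2, 3):
        j = i + off
        if 0 <= j < len(words):
            feats[f"w[{off}]"] = words[j]
            feats[f"t[{off}]"] = tags[j]
    return feats

# Hypothetical toy training data: one POS-tagged sentence with IOB chunk labels.
words = ["He", "saw", "the", "big", "dog", "."]
tags  = ["PRP", "VBD", "DT", "JJ", "NN", "."]
iob   = ["B-NP", "O", "B-NP", "I-NP", "I-NP", "O"]

X = [chunk_features(words, tags, i) for i in range(len(words))]
vec = DictVectorizer()
clf = SVC(kernel="poly", degree=2)  # polynomial kernel; multi-class via one-vs-one
clf.fit(vec.fit_transform(X), iob)

print(clf.predict(vec.transform([chunk_features(words, tags, 4)])))
```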
Base noun phrase recognition based on WINNOW
WINNOW is an error-driven machine learning method for binary classification that can learn quickly in the presence of a large number of irrelevant features.
The Sparse Network of Winnows (SNoW) learning architecture is a multi-class classifier designed specifically for large-scale learning tasks over very large feature spaces. The WINNOW algorithm can handle high-dimensional, sparse feature spaces, and feature vectors in natural language processing have exactly this character, so WINNOW is also commonly used in part-of-speech tagging, spelling error checking, and text classification.
The basic idea of plain WINNOW: given a feature vector, a weight vector, and a real threshold θ, initialize all weights to 1. For each training sample, compute the inner product of the feature vector and the weight vector and compare it with θ: if greater than θ, predict a positive example; if less, predict a negative example. Compare the prediction with the correct answer and update the weights according to the outcome.
If a positive example is misjudged as negative, the weights of the features whose value is 1 are promoted (scaled up); if a negative example is misjudged as positive, the weights of the features whose value is 1 are demoted. Then re-evaluate and update the weights until training is complete.
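A minimal sketch of this basic WINNOW update for binary 0/1 features; the promotion/demotion factor alpha and the threshold choice are my own assumptions:

```python
def winnow_train(samples, n_features, theta=None, alpha=2.0, epochs=10):
    """Train a basic WINNOW classifier.

    samples: list of (x, y) with x a list of 0/1 features and y in {0, 1}.
    """
    if theta is None:
        theta = n_features / 2          # a common choice of threshold
    w = [1.0] * n_features              # all weights start at 1

    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1 if score > theta else 0
            if pred == 0 and y == 1:    # false negative: promote active weights
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            elif pred == 1 and y == 0:  # false positive: demote active weights
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, theta
```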
This actually reminds me of the LR (logistic regression) algorithm, which also takes the inner product of a feature vector and a parameter vector, feeds it into the Sigmoid function, and judges by the output: greater than 0.5 is a positive example, less than 0.5 a negative one. In fact, the input that makes the Sigmoid output exactly 0.5 plays the role of WINNOW's real threshold θ. The difference is that WINNOW only compares magnitudes and gives no probability, while LR's Sigmoid yields one; LR uses that probability to adjust its parameters by maximizing the likelihood of the training set, whereas WINNOW directly and naively promotes or demotes the relevant weights according to the error. Intuitively, LR converges faster because it uses gradient descent, while WINNOW's advantage is its ability to handle a huge number of features.
Base noun phrase recognition based on conditional random fields
Base NP recognition with conditional random fields achieves almost the same accuracy as the SVM method and outperforms the WINNOW-based, MEMM-based, and perceptron-based methods; in running speed, the CRF-based method has a clear advantage over the others.
Dependency grammar theory
In natural language processing, we sometimes do not need, or do not only need, the phrase structure tree of the whole sentence; we also need to know the dependency relations among its words. The framework that describes language structure through dependency relations between words is called dependency grammar, also known as subordination grammar. Parsing with dependency grammar is likewise one of the important means of natural language understanding.
Some hold that all structural syntactic phenomena can be reduced to three cores: association, junction, and translation. Syntactic association establishes subordinate relations between words, each consisting of a governing word and a dependent word. The verb in the predicate is the center of the sentence: it governs the other components and is itself governed by none.
In essence, dependency grammar is a structural grammar. It mainly studies how, and under what conditions, deep semantic structure is reflected in surface syntactic structure during sentence construction, together with the co-occurrence relations between predicates and nominal words, classifying predicates accordingly.
There are three commonly used dependency structure diagrams:
The computational linguist J. Robinson put forward four axioms of dependency grammar:
A sentence has only one independent component.
All other components of a sentence are subordinate to one component.
No component can depend on two or more components.
If component A is directly subordinate to component B, and component C lies between A and B in the sentence, then C is subordinate to A, to B, or to some component between A and B.
These four axioms amount to formal constraints on the dependency graph and dependency tree: single headedness (a single parent node), connectedness, acyclicity, and projectivity. Together they guarantee that the result of dependency analysis is a rooted tree structure.
Projectivity deserves a note: if the dependency arcs between words can be drawn without any crossing, the structure is projective (compare the two directed graphs above).
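This definition is easy to check mechanically: two arcs cross exactly when their spans properly overlap. A small sketch (the head-array encoding, with 0 for the root, is my own convention):

```python
def is_projective(heads):
    """heads[i] is the head of word i+1 (1-based positions, 0 = root).

    A dependency tree is projective iff no two arcs cross, i.e. no two
    arc spans properly overlap.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:  # spans properly overlap: the arcs cross
                return False
    return True

print(is_projective([2, 0, 2, 2, 4]))  # "He met Jenny with flowers": True
print(is_projective([2, 0, 4, 2, 3]))  # arcs (2,4) and (3,5) cross: False
```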
In order to facilitate understanding, Chinese scholars put forward five conditions that dependency structure trees should meet:
Simple node condition: the tree contains only terminal nodes (words); there are no non-terminal nodes.
Single parent condition: every node except the root has exactly one parent.
Single root condition: a dependency tree has exactly one root node, which dominates all other nodes.
Non-crossing condition: branches of the dependency tree must not cross one another.
Mutual exclusion condition: the top-down dominance relation and the left-to-right precedence relation are mutually exclusive; if two nodes stand in a dominance relation, they cannot stand in a precedence relation.
These five conditions partly overlap one another, but they are stated entirely in terms of the spatial structure of the dependency representation, which makes them more intuitive and practical than the four axioms.
Gaifman (1965) gave a formal representation of dependency grammar and proved that it is not essentially different from context-free grammar.
Formalisms of this kind, like context-free grammar, restrict the analyzed language to be projective and have difficulty directly handling free-word-order languages that exhibit non-projective phenomena. Constraint grammars based on constraint satisfaction, and the corresponding dependency analysis methods, were developed in the 1990s and can handle such non-projective language problems.
The analysis method based on constraint satisfaction rests on constraint dependency grammar and treats dependency parsing as a finite construction problem describable as a constraint satisfaction problem.
Constraint dependency grammar uses a series of formal, descriptive constraints to prune away the dependency analyses that violate them, until a legal dependency tree remains.
Generative dependency parsing, discriminative dependency parsing, and deterministic dependency parsing are the three representative methods of data-driven statistical dependency parsing.
Generative dependency parsing method
The generative dependency parsing method uses a joint probability model to generate a series of dependency trees with probability scores, then searches for the highest-scoring analysis as the final output.
Generative dependency models are convenient to use: training the parameters amounts only to counting the relevant events in the training set and computing prior probabilities. However, the generative approach uses a joint probability model, which requires approximating assumptions and estimates when factoring the probability product; and because the global search algorithm is complex, efficiency is low, although this class of algorithms holds a certain advantage in accuracy. Furthermore, inference methods similar to the CYK algorithm make it difficult for such models to handle non-projective structures.
Discriminative dependency parsing method
The discriminative dependency parsing method adopts a conditional probability model, which avoids the independence assumptions required by a joint probability model (think of how the discriminative CRF abandons the independence assumptions of the generative HMM); training seeks the parameters θ that maximize the objective function, the conditional likelihood of the training samples (similar to logistic regression and CRF).
The discriminative method performs exhaustive search during inference, and its training algorithm enjoys global optimality; but the parsing process must be repeated over the training samples to iterate the parameters, so training itself is an inference process, and training and analysis have the same time complexity.
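Written as a formula (a standard statement of discriminative training; the notation is mine), with $x_i$ the training sentences and $y_i$ their gold dependency trees:

$$\theta^{*} = \arg\max_{\theta} \sum_{i} \log P_{\theta}(y_i \mid x_i)$$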
Deterministic dependency parsing method
The deterministic dependency parsing method consumes the words to be analyzed one at a time in a fixed direction, producing a single analysis for each input word until the last word of the sequence.
At each step, the algorithm must make a decision based on the current analysis state (for example, judging whether the current word depends on the preceding word), so this method is also called decision-based analysis.
The basic idea of deterministic parsing is to arrive at a unique syntactic representation, the dependency graph, through a fixed sequence of parsing actions (sometimes with backtracking and repair).
The relationship between phrase structure and dependency structure
A phrase structure tree can be converted into a dependency tree deterministically, but the reverse does not hold, because one dependency tree can correspond to multiple phrase structure trees.
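A minimal sketch of the forward direction, phrase structure tree to dependency tree, via head percolation; the tiny head table here is a naive placeholder assumption, not a real head-rule set such as Collins':

```python
# Toy head table: which child label supplies the head (a simplified assumption).
HEAD_TABLE = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNP", "PRP"], "PP": ["IN"]}

def to_dependencies(tree, deps):
    """tree: (label, child, ...); leaves are (POS, word). Returns the head word."""
    if isinstance(tree[1], str):                 # leaf node: (POS, word)
        return tree[1]
    children = tree[1:]
    heads = [to_dependencies(c, deps) for c in children]
    # Pick the head child via the head table, defaulting to the first child.
    head_idx = 0
    for i, c in enumerate(children):
        if c[0] in HEAD_TABLE.get(tree[0], []):
            head_idx = i
            break
    for i, h in enumerate(heads):
        if i != head_idx:
            deps.append((heads[head_idx], h))    # arc: head word -> dependent word
    return heads[head_idx]

deps = []
tree = ("S", ("NP", ("PRP", "He")),
             ("VP", ("VBD", "met"), ("NP", ("NNP", "Jenny"))))
root = to_dependencies(tree, deps)
print(root, deps)    # met [('met', 'Jenny'), ('met', 'He')]
```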