The basic problem of sequence alignment is to compare the similarity or dissimilarity of two or more symbol sequences. From a biological point of view, this problem covers the following tasks: reconstructing the complete DNA sequence from overlapping sequence fragments; determining physical and genetic maps from probe data obtained under various experimental conditions; traversing and comparing the DNA sequences in a database; comparing the similarity of two or more sequences; searching a database for related sequences and subsequences; finding patterns in which nucleotides occur consecutively; identifying the informative components of protein and DNA sequences; and comparing the biological features of DNA sequences, such as local insertions and deletions (together abbreviated as indels) and substitutions. The objective function of an alignment is either the minimum weighted sum of distances or the maximum sum of similarities over the set of variations between the sequences. Alignment methods include global alignment, local alignment, and gap penalty schemes. Dynamic programming is often used to align two sequences. It works well for short sequences but not for massive genomic sequences (human DNA, for example, runs to about 10^9 bp), where even an algorithm of linear complexity is hard to apply. Heuristic methods are therefore necessary.
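The dynamic programming approach mentioned above can be illustrated with a minimal sketch of global alignment scoring in the style of Needleman-Wunsch. The function name and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for illustration, not values taken from the text.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty).

    The scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b (deletion)
                score[i][j - 1] + gap,       # gap in a (insertion)
            )
    return score[-1][-1]
```

The table has (len(a) + 1) x (len(b) + 1) cells, so time and memory grow with the product of the two lengths, which is why this approach suits short sequences rather than whole genomes.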
2. Comparison and prediction of protein structure.
The basic problem is to compare the similarity or dissimilarity of the spatial structures of two or more protein molecules. The structure and function of a protein are closely related, and proteins with similar functions are generally believed to have similar structures. A protein is a long chain of amino acids, ranging in length from about 50 to 1,000 or even 3,000 amino acids (aa). Proteins have many functions: they act as enzymes, store and transport substances, transmit signals, serve as antibodies, and so on. The amino acid sequence inherently determines the three-dimensional structure of the protein. Protein structure is generally described at four levels. The reasons for studying and predicting protein structure are, in medicine, to understand biological function and to find docking targets for drugs; in agriculture, to improve the genetic engineering of crops; and in industry, to make use of enzyme synthesis. The reason for comparing protein structures directly is that, in evolution, the three-dimensional structure of a protein is more conserved than its primary structure and carries more information than the amino acid sequence. Research on three-dimensional structure rests on the premise that each amino acid sequence corresponds to exactly one three-dimensional structure (which is not necessarily true); physically, this can be explained by energy minimization. The structure of an unknown protein is predicted by observing and summarizing the structural regularities of proteins whose structures are known. Homology modeling and threading both fall into this category. Homology modeling is used for proteins that are highly similar to a known structure (more than 30% identical amino acids), while threading is used to compare different protein structures within an evolutionary family. However, the current state of protein structure prediction is still far from meeting practical needs.
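As a small illustration of the 30% identity criterion, the sketch below computes percent identity between two sequences that have already been aligned. The function names are hypothetical, and counting only the columns where neither sequence has a gap is one of several conventions in use, assumed here for simplicity.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free aligned columns.

    Assumption: columns containing a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns with a gap
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def is_homology_modeling_candidate(aligned_a, aligned_b, threshold=30.0):
    """True if identity exceeds the threshold (30% as stated in the text)."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

For example, `percent_identity("MKV-LA", "MKVQLG")` compares five gap-free columns, four of which are identical.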
3. Gene identification and non-coding region analysis.
The basic problem of gene identification is to correctly determine the extent and exact position of genes in a given genome sequence. Non-coding regions include introns, which are usually discarded and do not appear in the final protein. Experiments nevertheless indicate that if the non-coding regions are removed, gene replication cannot be completed. Evidently the genetic language of DNA is carried not only by the coding regions but also by the non-coding sequences. At present there is no general guiding method for analyzing the DNA sequences of non-coding regions. In the human genome, not all sequence is coding, that is, serves as a template for some protein; the coding part accounts for only 3-5% of the total sequence. Searching such a large sequence by hand is clearly inconceivable. Methods for detecting coding regions include measuring codon frequencies in coding regions, first-order and second-order Markov chains, ORFs (open reading frames), promoter recognition, HMMs (hidden Markov models), GENSCAN, and spliced alignment.
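Of the methods listed, ORF detection is the simplest to sketch. The example below scans the three forward reading frames for a start codon (ATG) followed in frame by a stop codon. It is a minimal illustration under stated assumptions: the function name and the `min_codons` parameter are hypothetical, the reverse strand is ignored, and the standard stop codons TAA, TAG, and TGA are used.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the index of the ATG; end is the index just past the stop codon.
    min_codons counts coding codons (ATG included, stop codon excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

An ORF found this way is only a candidate; real gene finders such as GENSCAN combine this kind of signal with statistical models.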
4. Molecular Evolution and Comparative Genomics
Molecular evolution uses the similarities and differences of the same gene sequence in different species to study the evolution of organisms and to build evolutionary trees. It can draw on DNA sequences, on the amino acid sequences they encode, or even on structural comparison of the related proteins, on the premise that similar species are genetically similar. Through comparison we can find what different species have in common and where they differ. Early research usually relied on external traits, such as body size, skin color, and number of limbs, as the basis for evolutionary study. In recent years, with the genomes of many model organisms sequenced, molecular evolution can be studied at the level of the whole genome. When matching genes across species, three cases generally have to be handled: orthologs, genes in different species with the same function; paralogs, homologous genes with different functions; and xenologs, genes transferred between organisms by other means, such as genes injected by a virus. The common approach in this field is to construct a phylogenetic tree, using character-based methods (characters being specific positions of bases in DNA sequences or of amino acids in proteins), distance-based methods (using alignment scores), and some traditional clustering methods (such as UPGMA).
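To make the clustering step concrete, here is a minimal sketch of UPGMA that builds a tree topology from a distance matrix. The function name, the nested-tuple tree representation, and the omission of branch lengths are simplifications assumed for illustration.

```python
def upgma(labels, dist):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix.

    labels: list of taxon names; dist: list of lists, dist[i][j] >= 0.
    Branch lengths are not computed in this sketch.
    """
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    d = {}
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            d[(i, j)] = dist[i][j]
    next_id = len(labels)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted mean equals the average over all leaf pairs.
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        # Drop distances that involve the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree
```

UPGMA assumes a constant rate of evolution across lineages, so the resulting tree should be read with that assumption in mind.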
5. Sequence contig assembly
With current sequencing technology, each reaction can read only about 500 or slightly more base pairs. When the shotgun method is used to sequence the human genome, for example, a large number of short reads must be combined into overlapping groups (contigs). The process of splicing them step by step into longer contigs until a complete sequence is obtained is called contig assembly. From an algorithmic point of view, contig assembly is an NP-complete problem.
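Because the exact problem is intractable, practical assemblers rely on heuristics. The sketch below shows one simple heuristic, greedy merging of the pair of reads with the longest suffix-prefix overlap. It is an illustration only: the function names and the `min_len` parameter are hypothetical, reads are assumed to be error-free and from one strand, and the greedy result is not guaranteed to be optimal.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if k > 0 and a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge reads by longest overlap; return the remaining contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: contigs stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

For instance, the three reads `ATGGCGT`, `GCGTACC`, and `TACCTGA` overlap pairwise by four bases and merge into a single contig.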
6. The origin of the genetic code
The usual view in studies of the genetic code is that the relationship between codons and amino acids arose from an accidental event in the history of biological evolution and has been fixed since the common ancestor of modern organisms. In contrast to this "frozen accident" theory, three other theories have been proposed to explain the genetic code: selective optimization, chemical, and historical. The completion of genome sequencing for many organisms provides new material for studying the origin of the genetic code and for testing these theories.
7. Structure-based drug design
One of the purposes of human genetic engineering is to understand the structure, function, and interactions of the roughly 100,000 kinds of proteins in the human body, together with their relationship to human diseases, and to seek treatments and preventive measures, including drug therapy. Drug design based on the structures of biological macromolecules and small molecules is an extremely important research field in bioinformatics. To inhibit the activity of certain enzymes or proteins, inhibitor molecules can be designed on the computer as candidate drugs, starting from the known tertiary structure of the protein and using molecular alignment algorithms. The goal of this field is to discover new gene-based drugs, which promise great economic benefits.
8. Modeling and simulation of biological systems
With the development of large-scale experimental techniques and the accumulation of data, studying and analyzing biological systems at a global, systems level to reveal the laws that govern them has become another research hotspot of the post-genome era: systems biology. Its current research topics include simulation of biological systems (Curr Opin Rheumatol, 2007, 463-70), system stability analysis (Nonlinear Dynamics Psychol Life Sci, 2007, 413-33), and system robustness analysis (Ernst Schering Res Found Workshop, 2007, 69-83). Modeling languages, represented by SBML (Bioinformatics, 2007, 1297-8), have developed rapidly. Many models have been built with Boolean networks (PLoS Comput Biol, 2007, e163), differential equations (Mol Biol Cell, 2004, 3841-; 2007, 3262-92), and discrete dynamic event systems (Bioinformatics, 2007, 336-43), borrowing modeling methods from physical systems such as circuits, and many studies try to tackle system complexity through macroscopic ideas such as information flow, entropy, and energy flow (Anal Quant Cytol Histol, 2007, 296-308). Building theoretical models of biological systems will of course take a long time. Although experimental observations are growing rapidly, the data needed to identify biological system models far exceeds what can currently be produced. For time-series microarray data, for example, the number of sampling points is too small for traditional time-series modeling methods, and the enormous cost of experiments is at present the main obstacle to system modeling. Methods of system description and modeling also need groundbreaking development.
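Of the modeling approaches listed, a Boolean network is the easiest to show in a few lines. The sketch below simulates a synchronous Boolean network until a state repeats and reports the attractor it reaches. The three-gene network, its update rules, and all function names are hypothetical and chosen only to illustrate the technique; they do not describe any real regulatory system.

```python
def step(state, rules):
    """Synchronous update: every gene is recomputed from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(state, rules, max_steps=50):
    """Iterate until a state recurs; return (trajectory, attractor).

    The attractor is the list of states forming the cycle, or an empty
    list if no state recurred within max_steps.
    """
    trajectory = [dict(state)]
    seen = {tuple(sorted(state.items())): 0}
    for t in range(1, max_steps + 1):
        state = step(state, rules)
        key = tuple(sorted(state.items()))
        if key in seen:
            return trajectory, trajectory[seen[key]:]
        seen[key] = t
        trajectory.append(state)
    return trajectory, []


# Hypothetical network: C represses A, A activates B, A and B together activate C.
RULES = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] and s["B"],
}

trajectory, attractor = simulate({"A": True, "B": False, "C": False}, RULES)
```

With n genes the state space has 2^n states, so exhaustive exploration of all starting states is feasible only for small networks.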
9. Research on Bioinformatics Technology and Methods
Bioinformatics is neither a mere compilation of biological knowledge nor a simple application of mathematics, physics, information science, and other disciplines. Massive data and complex backgrounds are driving the rapid development of machine learning, statistical data analysis, and system description in the bioinformatics setting. The huge computational load, complex noise patterns, and massive time-varying data pose great difficulties for traditional statistical analysis and call for more flexible techniques, such as nonparametric statistics (BMC Bioinformatics, 2007, 339) and cluster analysis (Qual Life Res, 2007, 1655-63). Analyzing high-dimensional data requires techniques that compress the feature space, such as partial least squares (PLS). When developing algorithms, their time and space complexity must be fully considered, and technologies such as parallel computing and grid computing should be used to extend what is computationally feasible.
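As one example of a nonparametric statistic, the sketch below computes the Mann-Whitney U statistics for two samples from their ranks, which avoids any assumption about the shape of the noise distribution. The choice of this particular test and the function names are illustrative assumptions, not something named in the text, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for samples x and y."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y
```

Here U_x counts the pairs in which a value from x exceeds a value from y, with ties counted as one half, so a value of U_x near zero indicates that x tends to lie below y.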
10. Biological images
Why do people who are not related by blood sometimes look so alike?
Appearance is made up of individual points (features). The more of these points coincide, the more alike two people look. Why would the points of two unrelated people coincide?
What is the biological basis? Are their genes similar? I do not know, and hope that experts can answer.
11. Others
Other areas, such as gene expression profile analysis, metabolic network analysis, gene chip design, and proteomics data analysis, have gradually become important new research fields in bioinformatics. In terms of disciplines, those derived from bioinformatics, including structural genomics, functional genomics, comparative genomics, proteomics, pharmacogenomics, traditional Chinese medicine genomics, oncology, molecular epidemiology, and environmental genomics, have become important research approaches in systems biology. Current developments make it clear that genetic engineering has entered the post-genome era. We also need a clear understanding of how to deal with the misleading results that can arise in the machine learning and mathematics closely related to bioinformatics.