All the scripts I use can be downloaded from +Q 840842906.
Just briefly introduce the direction, maybe we can help each other in the future ~
Please state your purpose. Your real name? School+Name
-2. Two groups of samples, the purpose is to find the differences between the two groups and the reasons for the differences through metagenomics.
-1. principle
leave out
0. Super Strike/Spades Combination
leave out
1. Gene prediction
1. 1 software installation -meta gene mark/ prodigal son
leave out
Reference:/blog-2379401-1085480.html? -/hyattpd/ prodigal son
1.2
Genemark reference: /a/207036304_99908466
For me in {B2A, B3A......}? (Omitted below, $i indicates the contents in brackets. )
do
The prodigal son-a $ i/$ i.pro.fa-d $ i/$ i.gene.fa-f gff-g11-i $ i/$ i.fasta-p meta-s $ i/$ i.stat _ potential.
finished
-i is the input file, that is, the assembled overlapping group. Different articles here will screen out contig with different lengths and CDS with different length prediction results, and you can consult the literature yourself.
Let me cite two papers as examples: (Of course, I didn't mix them up. I assemble each sample once, predict it once, and then put the prediction results of all samples together.
1.3 Then we need to rename the predicted gene and filter the length.
( 1)sed '/& gt; /s/gene/B2A _ gene/' $ I/$ I . gene . fa | sed '/& gt; /s/| gene mark . hmm//' | awk ' { print $ 1 } ' & gt; $i/$i.gene.fa2
(2)awk-F " | " ' { print $ 1 } ' $ I/$ I . gene . fa2 & gt; $i/$i.gene.fa3
(3)bioawk -c fastx '{print $name,length($ seq)} ' $ I/$ I . gene . fa3 & gt; $i/$i/$ i. Length
(4)awk“{ if($2 & gt; 99)print $ 1 } ' $ I/$ I . length & gt; $i/$i.keep
(5)Sam tools faidx $ I/$ I . gene . fa3-r $ I/$ I . keep & gt; $i/$i.filter.gene.fa
(6)RM $ I/$ I . gene . fa2; RM $ I/$ I . gene . fa3; RM $ I/$ I . gene . fa3 . fai; RM $ I/$ I . keep; Rm $i/$i/$ i. length
(1) and (2) are being renamed, and each gene is named B2A _ Gene _ 1 and B2A _ Gene _2. ......
(3) Obtain the length of each gene; (4) Obtain the gene name with the length greater than 100; (5) Extract the gene name with the previous length greater than 100 with samtools as a clear intermediate file.
Ugly, forgive me.
While filtering the prodigal son, I suddenly found that bbmap has a ready-made tool:
(Actually, I found it by reading the command line of atlas when I was running atlas. Atlas is the snakemake process of metagenome analysis: /metagenome-atlas/atlas).
java -ea -Xmx 1g -cp。 /opt/bbmap-37.99/current/ jgi。 RenameReads in=。 /prodigal son/$ i/$ i.gene.faout = $ i.rename.gene.faow = tprefix = $ i.genemins caf =100.
Prefix=$i.gene means renaming the gene to B2A.gene. 1? B2A. Genes. . . . .
Minscaf= 100, that is, the filtration length is less than 100? In= i.e. input file? Out= output file.
1.4? Rename/Length All Samples, Filter Gene. fa? Yiqimao
2.CD-Hit redundancy deletion
CD-hit-I input . fa-o out . fa-c 0.95-d 0-aL 0.9-uL 0.05-aS 0.9-T 80-M 15000
I used -T 80, and the super-calculation to go to school was completed in one afternoon; ; Previously used -T 14, clustering takes a week.
Parameter interpretation /a/207036304_99908466, personal sensory parameters can be copied from the above.
3. Calculation of relative abundance and length standardization
3. 1 ref:/p/a28c 1729 168 e? ivk_sa= 1024320u? SOAP is used to calculate the relative abundance of this link, but I haven't found any information related to SOAP, so bwa mem including Salmon and bowtie2+bbmap can also do the same thing.
Bwa mem -p result sample after redundancy removal 1. 1.fq.gz sample 1.2.fq.gz > sample1.sam.
Get the alignment of each gene on the sample 1.
3.2
grep-v ^@ $ I . AlN . Sam | cut-f 3 | sort | uniq-c | awk ' begin { fs = " "; OFS= "," }{print $2,$ 1}' | awk 'BEGIN{ FS= ","; OFS= "," } { if($2 & gt; 2) Print $ 1, $2; Otherwise, print $ 1, "0"}' & gt$i.count.csv # to get the number of readings matched by each gene, directly mark the number of readings less than or equal to 2 as 0, and then standardize the length.
3.3 ref:/p/a28c 1729 168 e? ivk_sa= 1024320u? The calculation of relative abundance of this link is based on SOAP, but I didn't find any information related to SOAP, so I used bwa mem.
Then you need to get the length of each gene in the non-redundant gene set (that is, the result of CDHIT):
Bioawk-c fastx' {print $ name, length ($ seq)}' Redundant result | tr' \ t'',' >; Gene. Length. csv? The first column is the gene name/sequence name, and the second column is the sequence length.
Then get the length of each gene in gene.count.csv
Generally speaking, the first column of gene.count.csv matches the first column of gene. Length and csv. If there is a match, each column of gene.count.csv (gene name, count) and the second column of gene. Length and csv (gene length) output.
If it does not match, the first column of gene.count.csv is output, with the gene count of 0 and the length of 0.
python jiyinfengdu . py $ I . count . CSV $ I & gt; $i.tmp
Now it is very simple to calculate according to the standardized formula above, and many methods can be realized.
3.4: Then combine the data of each sample.
Paste b2a.tmpb3a.tmpb4a.tmp > end.tsv
Visualization, function annotation, etc. It can be executed later.
to be continued
Scripts used
2022.4 update
Some other methods have been found to calculate the relative abundance of genes (there is no good authoritative method for the quantification of metagenome)
The method used in this paper is 12 years old (although only a few articles are still in use).
1.MOCAT2? (Published in Bioinformatics 16)
The advantage is that it is much more convenient, and the assembly and abundance prediction are all in one package.
2. Wen, Zheng, Zhao, Shao Tietao, Liu, Li, Xie, ...&Li, H. (20 17). Quantitative metagenomics reveals the unique intestinal microbial biomarkers of ankylosing spondylitis. Genome biology
This article is better and gives the method.