Chapter 35 : Characterization of Bacterial Genome Sequences by Similarity Searching

Abstract:

With more than 100 fully sequenced microbial genomes publicly available from many of the deep divisions of bacteria and archaea, protein and DNA sequence similarity searching is one of the most widely used methods for analyzing bacterial genes and genomes. This chapter focuses on more effective methods for similarity searching with protein and DNA sequences, with the goal of identifying distantly related genes in bacteria and bacterial populations. Strategies for similarity searching are often presented as a process of selecting a sequence comparison program, and possibly the scoring matrix and gap penalties; the chapter takes a different perspective. It summarizes the statistical basis for inferring homology, and from this statistical perspective several strategies emerge for improving the effectiveness of similarity searches. As of 2003, the author notes, two sets of programs were widely used for searching protein and DNA sequence databases: BLAST and FASTA. All similarity searching programs are most effective when comparing sequences at the protein or translated-protein level, and modern versions of the BLAST and FASTA programs provide efficient algorithms for translated searches with sequences that may include frameshift errors and termination codons. To test the ability of translated-DNA similarity searching to identify anonymous DNA sequences, the author selected either ten 10,000-nucleotide sequences or one hundred 1,000-nucleotide sequences at random from the completed , O157:H7, or genome. These DNA sequences were used to search the nonredundant (NR) database, excluding all sequences from either the phylum or superkingdom.
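
The random-sampling step described at the end of the abstract can be illustrated with a short Python sketch. It is not the author's procedure: the function name `sample_fragments`, the uniform choice of start positions, and the fact that fragments may overlap are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple


def sample_fragments(genome: str, length: int, count: int,
                     seed: Optional[int] = None) -> List[Tuple[int, str]]:
    """Draw `count` fragments of `length` nucleotides at random start positions.

    Assumption (not stated in the source): starts are uniform over all valid
    positions and fragments are allowed to overlap.
    """
    if length > len(genome):
        raise ValueError("fragment length exceeds genome length")
    rng = random.Random(seed)          # seeded so a sample can be reproduced
    last_start = len(genome) - length  # last position where a full fragment fits
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [(s, genome[s:s + length]) for s in starts]
```

With a real genome string, `sample_fragments(genome, 1000, 100)` would correspond to the one hundred 1,000-nucleotide queries and `sample_fragments(genome, 10000, 10)` to the ten 10,000-nucleotide queries.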

Citation: Pearson W. 2007. Characterization of Bacterial Genome Sequences by Similarity Searching, p 842-855. In Reddy C, Beveridge T, Breznak J, Marzluf G, Schmidt T, Snyder L (ed), Methods for General and Molecular Microbiology, Third Edition. ASM Press, Washington, DC. doi: 10.1128/9781555817497.ch35

Figures

FIGURE 1

Low-complexity regions identified in a putative secreted protein from (NP_663818) by the SEG program ( ). Low-complexity regions are shown to the left of the residue numbers.
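
To give a feel for what a low-complexity filter does, the sketch below flags windows whose residue composition has low Shannon entropy and shows them in lowercase. This is a deliberately simplified stand-in and not the SEG algorithm itself; the function names, the window length of 12, and the 2.2-bit threshold are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(window: str) -> float:
    """Entropy (bits per residue) of the residue composition of `window`."""
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Lowercase every residue covered by a window with entropy <= threshold.

    Simplified illustration only: SEG uses its own complexity measure and a
    two-stage trigger/extension procedure that is not reproduced here.
    """
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for k in range(start, start + window):
                masked[k] = True
    return "".join(r.lower() if m else r for r, m in zip(seq, masked))
```

A run of a single repeated residue has zero entropy and is always masked, whereas a window of twelve distinct residues has an entropy of about 3.6 bits and is left untouched.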

FIGURE 2

Low-complexity domains increase similarity scores. Distribution of scores obtained searching the NR database excluding sequences with a putative secreted protein from (NP_663818), a 398-amino-acid protein that is about 30% low complexity. The filled symbols show the distribution of scores obtained when low-complexity regions are excluded from the alignment scores; the open diamonds show the scores obtained when low-complexity domains are included. The inset (upper right) shows that about 1,000 additional sequences have high scores if low-complexity domains are included; three of these sequences have an E() of < 10, and six more have an E() of < 0.01. There are no statistically significant matches when low-complexity domains are excluded.


References

1. Altschul, S. F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555-565.
2. Altschul, S. F., M. S. Boguski, W. Gish, and J. C. Wootton. 1994. Issues in searching molecular sequence databases. Nat. Genet. 6:119-129.
3. Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
4. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
5. Boeckmann, B., A. Bairoch, R. Apweiler, M. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O’Donovan, I. Phan, S. Pilbout, and M. Schneider. 2003. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31:365-370.
6. Brenner, S. E., C. Chothia, and T. J. Hubbard. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95:6073-6078.
7. Dayhoff, M., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins, p. 345-352. In M. Dayhoff (ed.), Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. National Biomedical Research Foundation, Silver Spring, MD.
8. Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-10919.
9. Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275-282.
10. Pearson, W. R. 1998. Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276:71-84.
11. Pearson, W. R. 1999. Flexible similarity searching with the fasta3 program package, p. 185-219. In S. Misener and S. A. Krawetz (ed.), Bioinformatics Methods and Protocols. Humana Press, Totowa, NJ.
12. Pearson, W. R., T. Wood, Z. Zhang, and W. Miller. 1997. Comparison of DNA sequences with protein sequences. Genomics 46:24-36.
13. Pearson, W. R., and T. C. Wood. 2001. Statistical significance in biological sequence comparison, p. 39-65. In D. J. Balding, M. Bishop, and C. Cannings (ed.), Handbook of Statistical Genetics. Wiley, London, United Kingdom.
14. Reeck, G. R., C. de Haen, D. C. Teller, R. F. Doolittle, W. M. Fitch, R. E. Dickerson, P. Chambon, A. D. McLachlan, E. Margoliash, T. H. Jukes, and E. Zuckerkandl. 1987. “Homology” in proteins and nucleic acids: a terminology muddle and a way out of it. Cell 50:667.
15. Smith, T. F., and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
16. Sonnhammer, E. L., S. R. Eddy, and R. Durbin. 1997. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28:405-420.
17. States, D. J., W. Gish, and S. F. Altschul. 1991. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods Companion Methods Enzymol. 3:66-70.
18. Tatusov, R. L., E. V. Koonin, and D. J. Lipman. 1997. A genomic perspective on protein families. Science 278:631-637.
19. Wootton, J. C., and S. Federhen. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17:149-163.
20. Zmasek, C. M., and S. R. Eddy. 2002. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 3:14.

Tables

TABLE 1

Sample search with query sequences from

“Typical” sequence similarity searching results for comparison of MJ0050, SwissProt Y050_METJA, or MJ1633, SwissProt YG33_METJA, with the proteome. Comparisons used the Smith-Waterman algorithm ( ) as implemented by the SSEARCH program from the FASTA package ( ). Results include the description of the library sequence, the length of the library sequence, and four measures of sequence similarity: the raw Smith-Waterman alignment (S-W) score using the BLOSUM50 matrix and a penalty of −10 for a gap open and −2 for each residue in the gap; the bit score (bits) as defined in equation 5; E(4,311), the expectation, as defined in equation 6; the percent identity (% ID); and alignment length (AL). aa, amino acids.
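
Equations 5 and 6 are not reproduced on this page, so the sketch below falls back on the standard Karlin-Altschul forms for a bit score and an expectation value; the chapter's own definitions may differ in detail. The function names and the parameters `lam`, `k`, and `db_entries` are illustrative assumptions, and no parameter values are taken from the chapter.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to bits: (lambda * S - ln K) / ln 2.

    `lam` and `k` are the statistical parameters of the scoring system;
    they must be supplied by the caller (assumed, not from the chapter).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expectation(bits: float, query_len: int, library_len: int,
                db_entries: int = 1) -> float:
    """Expected number of chance matches scoring at least `bits`.

    Assumed form: pairwise probability p = 1 - exp(-m * n * 2**-bits),
    multiplied by the number of library sequences searched.
    """
    x = query_len * library_len * 2.0 ** (-bits)
    p = -math.expm1(-x)  # 1 - exp(-x), accurate for very small x
    return db_entries * p
```

Under these assumptions, a call such as `expectation(bits, m, n, db_entries=4311)` corresponds to an expectation for a library of 4,311 sequences, and each additional bit roughly halves the expectation for highly significant scores.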

TABLE 2

BLAST and FASTA similarity search programs

TABLE 3

DNA and translated-DNA similarity searches

Bit scores from BLASTX and BLASTN searches presented using the BLAST taxonomy summary option. The DNA sequence (M84025) encoding glutamate decarboxylase was used to search the bacterial division of GenBank or Genpept. Species that contain a homolog with a bit score of ≥45 [E() < 10 for BLASTX] are shown. The numbers under the BLASTX and BLASTN columns indicate the highest bit score obtained for that taxonomic group. gram, gram positive; CFB, Cytophaga-Flavobacterium-Bacteroides; GNS, green nonsulfur.
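
BLASTX compares a DNA query with protein sequences by way of its six-frame conceptual translation. The sketch below shows only that translation step, using the standard genetic code; it does not reproduce the alignment, the frameshift handling of the FASTA translated programs, or any BLAST internals. The function and variable names are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Return the translations of frames +1..+3 and -1..-3 ('*' marks a stop)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Termination codons are kept as `*` rather than ending the translation, which reflects the point in the abstract that translated searches must tolerate stop codons inside the query.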

TABLE 4

Changing scoring parameters

Searches were performed as for Table 1 using different scoring matrices and gap penalties. Three different combinations of BLOSUM scoring matrices and gap penalties are shown: BLOSUM50, with a gap open penalty of −10 and a gap extension penalty of −2 (the default for FASTA); BLOSUM62, with a gap open penalty of −11 and a gap extension penalty of −1 (the default for BLASTP); and BLOSUM62, with a gap open penalty of −7 and a gap extension penalty of −2 (gap penalties that are more effective than the BLASTP values when used with FASTA). For abbreviations, see Table 1.
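
How gap penalties change a local alignment score can be sketched with the affine-gap Smith-Waterman recurrences. This is a minimal illustration, not the SSEARCH implementation: it uses a toy match/mismatch scoring function instead of a BLOSUM matrix, and it assumes the convention from the Table 1 caption that a gap of k residues costs the gap-open penalty plus k times the per-residue penalty. All names are hypothetical.

```python
def toy_score(x: str, y: str) -> int:
    """Hypothetical stand-in for a substitution matrix: +5 match, -4 mismatch."""
    return 5 if x == y else -4


def local_alignment_score(a: str, b: str, gap_open: int, gap_extend: int,
                          score=toy_score) -> int:
    """Best local alignment score with affine gaps.

    Penalties are given as positive numbers; a gap of k residues costs
    gap_open + k * gap_extend (assumed convention).
    """
    neg_inf = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)        # best score of alignments ending at (i-1, j)
    f_prev = [neg_inf] * (n + 1)  # same, but ending in a gap that skips a[i-1]
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * (n + 1)
        f = [neg_inf] * (n + 1)
        e = neg_inf               # ending in a gap that skips b[j-1]
        for j in range(1, n + 1):
            e = max(e - gap_extend, h[j - 1] - gap_open - gap_extend)
            f[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            h[j] = max(0, h_prev[j - 1] + score(a[i - 1], b[j - 1]), e, f[j])
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best
```

With the toy scores, aligning `ACDEFGHIKL` against `ACDEFWWGHIKL` bridges the two-residue insertion when the penalties are 10 and 2, but a sufficiently large gap-open penalty makes the ungapped five-residue match the better local alignment, which is the kind of trade-off the parameter combinations in the table explore.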

TABLE 5

Identifying orthologs

High-scoring sequences found in searches of the , and genomes using the KefB protein sequence (NP_417809). Results with three different scoring matrices are shown: BLOSUM50 (the FASTA default), PAM120 (appropriate for sequences that are about 40% identical), and PAM20 (appropriate for sequences that are about 80% identical). For abbreviations, see Table 1.

TABLE 6

Identification of anonymous DNA sequences at different evolutionary distances
