Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd ed., A. D. Baxevanis and B. F. F. Ouellette (John Wiley & Sons, 2001)

METHODS OF BIOCHEMICAL ANALYSIS, Volume 43

BIOINFORMATICS
A Practical Guide to the Analysis of Genes and Proteins
SECOND EDITION

Andreas D. Baxevanis
Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA

B. F. Francis Ouellette
Centre for Molecular Medicine and Therapeutics, Children’s and Women’s Health Centre of British Columbia, University of British Columbia, Vancouver, British Columbia, Canada

A JOHN WILEY & SONS, INC., PUBLICATION
New York • Chichester • Weinheim • Brisbane • Singapore • Toronto

Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Copyright © 2001 by John Wiley & Sons, Inc. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic or mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold with the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional person should be sought.

This title is also available in
print as ISBN 0-471-38390-2 (cloth) and ISBN 0-471-38391-0 (paper). For more information about Wiley products, visit our website at www.Wiley.com.

ADB dedicates this book to his Goddaughter, Anne Terzian, for her constant kindness, good humor, and love, and for always making me smile.

BFFO dedicates this book to his daughter, Maya. Her sheer joy and delight in the simplest of things lights up my world every day.

CONTENTS

Foreword
Preface
Contributors

1  BIOINFORMATICS AND THE INTERNET
Andreas D. Baxevanis
  Internet Basics
  Connecting to the Internet
  Electronic Mail
  File Transfer Protocol
  The World Wide Web
  Internet Resources for Topics Presented in Chapter 1
  References

2  THE NCBI DATA MODEL
James M. Ostell, Sarah J. Wheelan, and Jonathan A. Kans
  Introduction
  PUBs: Publications or Perish
  SEQ-Ids: What’s in a Name?
  BIOSEQs: Sequences
  BIOSEQ-SETs: Collections of Sequences
  SEQ-ANNOT: Annotating the Sequence
  SEQ-DESCR: Describing the Sequence
  Using the Model
  Conclusions
  References

3  THE GENBANK SEQUENCE DATABASE
Ilene Karsch-Mizrachi and B. F. Francis Ouellette
  Introduction
  Primary and Secondary Databases
  Format vs. Content: Computers vs. Humans
  The Database
  The GenBank Flatfile: A Dissection
  Concluding Remarks
  Internet Resources for Topics Presented in Chapter 3
  References
  Appendices
    Appendix 3.1 Example of GenBank Flatfile Format
    Appendix 3.2 Example of EMBL Flatfile Format
    Appendix 3.3 Example of a Record in CON Division

4  SUBMITTING DNA SEQUENCES TO THE DATABASES
Jonathan A. Kans and B. F. Francis Ouellette
  Introduction
  Why, Where, and What to Submit?
  DNA/RNA
  Population, Phylogenetic, and Mutation Studies
  Protein-Only Submissions
  How to Submit on the World Wide Web
  How to Submit with Sequin
  Updates
  Consequences of the Data Model
  EST/STS/GSS/HTG/SNP and Genome Centers
  Concluding Remarks
  Contact Points for Submission of Sequence Data to DDBJ/EMBL/GenBank
  Internet Resources for Topics Presented in Chapter 4
  References

5  STRUCTURE DATABASES
Christopher W. V. Hogue
  Introduction to Structures
  PDB: Protein Data Bank at the Research Collaboratory for Structural Bioinformatics (RCSB)
  MMDB: Molecular Modeling Database at NCBI
  Structure File Formats
  Visualizing Structural Information
  Database Structure Viewers
  Advanced Structure Modeling
  Structure Similarity Searching
  Internet Resources for Topics Presented in Chapter 5
  Problem Set
  References

6  GENOMIC MAPPING AND MAPPING DATABASES
Peter S. White and Tara C. Matise
  Interplay of Mapping and Sequencing
  Genomic Map Elements
  Types of Maps
  Complexities and Pitfalls of Mapping
  Data Repositories
  Mapping Projects and Associated Resources
  Practical Uses of Mapping Resources
  Internet Resources for Topics Presented in Chapter 6
  Problem Set
  References

7  INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
Andreas D. Baxevanis
  Integrated Information Retrieval: The Entrez System
  LocusLink
  Sequence Databases Beyond NCBI
  Medical Databases
  Internet Resources for Topics Presented in Chapter 7
  Problem Set
  References

8  SEQUENCE ALIGNMENT AND DATABASE SEARCHING
Gregory D. Schuler
  Introduction
  The Evolutionary Basis of Sequence Alignment
  The Modular Nature of Proteins
  Optimal Alignment Methods
  Substitution Scores and Gap Penalties
  Statistical Significance of Alignments
  Database Similarity Searching
  FASTA
  BLAST
  Database Searching Artifacts
  Position-Specific Scoring Matrices
  Spliced Alignments
  Conclusions
  Internet Resources for Topics Presented in
Chapter 8
  References

9  CREATION AND ANALYSIS OF PROTEIN MULTIPLE SEQUENCE ALIGNMENTS
Geoffrey J. Barton
  Introduction
  What is a Multiple Alignment, and Why Do It?
  Structural Alignment or Evolutionary Alignment?
  How to Multiply Align Sequences

COLOR PLATES

Figure 5.5. A constellation of viewing alternatives using RasMol with a portion of the barnase structure 1BN1 (Buckle et al., 1993). 1BN1 has three barnase molecules in the asymmetric unit. For this figure, the author edited the PDB file to remove two extra barnase molecules to make the images. Like most crystal structures, 1BN1 has no hydrogen locations. (a) Barnase in CPK coloring (element-based coloring) in a wire-frame representation. (b) Barnase in a space-filling representation. (c) Barnase in an α-carbon backbone representation, colored by residue type. The command line was used to select all the tryptophan residues, render them with "sticks," color them purple, and show a dot surface representation. (d) Barnase in a cartoon format showing secondary structure, with α-helices in red and β-strands in yellow. Note that in all cases the default atom or residue coloring schemes used are at the discretion of the author of the software.

Figure 5.6. A comparison of three-dimensional structure data obtained by crystallography (left) and NMR methods (right), as seen in Cn3D. (a) The crystal structure 1BRN (Buckle and Fersht, 1994) has two barnase molecules in the asymmetric unit, although these are not dimers in solution. The image is rendered with an α-carbon backbone trace colored by secondary structure (green helices and yellow sheets), and the amino acid residues are shown with a wire-frame rendering, colored by residue type. (b) The NMR structure 1BNR (Bycroft et al., 1991) showing barnase in solution. Here, there are 20 different models in the ensemble of structures. The coloring and rendering are
exactly as for the crystal structure to its left. (c) The crystal structure 109D (Quintana et al., 1991) showing a complex between a minor-groove binding bis-benzimidazole drug and a DNA fragment. Note the phosphate ion in the lower left corner. (d) The NMR structure 107D showing four models of a complex between a different minor-groove binding compound (Duocarmycin A) and a different DNA fragment. It appears that the three-dimensional superposition of these ensembles is incorrectly shifted along the axis of the DNA, an error in PDB’s processing of this particular file.

Figure 5.7. An example of crystallographic correlated disorder encoded in PDB files. This is chain C of the HIV protease structure 5HVP (Fitzgerald et al., 1990). This chain is in an asymmetric binding site and can orient itself in two different directions. Therefore, it has a single chemical graph, but each atom can be in one of two different locations. (a) The correct bonding is shown with an MMDB-generated Kinemage file; magenta and red are the correlated disorder ensembles as originally recorded by the depositor, with bonding calculated using standard-residue dictionary matching. (b) Bonding of the same chain in RasMol, wherein the disorder ensemble information is ignored, all coordinates are displayed, and all possible bonds are bonded together.

Figure 5.8. SwissPDB Viewer 3.51 with OpenGL, showing the calmodulin structure 2CLN. The binding of the inhibitor TFP is shown in yellow. The side panel allows great control over the rendering of the structure image, and menus provide a wealth of options and tools for structure superposition and modeling, including mutagenesis and loop modeling, making it a complete structure modeling and analysis package.

Figure 5.9. VAST structure neighbors of barnase. On the left is the query window obtained by clicking on the Structure Neighbors link from Figure 5.4. Structures to superimpose are selected with the check boxes on the left, and Cn3D is launched from the top of the Web page. At the bottom left, controls that change the query are shown from the bottom of the VAST results page. The results shown here are selected as examples from a nonredundant set based on a BLAST probability of 10^-7, for the most concise display of hits that are not closely related to one another by sequence. The list may be sorted by a number of parameters, including RMSD from the query structure, number of identical residues, and the raw VAST score. More values can be displayed in the list as well. Cn3D is shown on the right, launched from the Web page window with the structures 1RGE and 1B2S. Menu options show how Cn3D can highlight residues in the superposition (top right) and in the alignment (bottom right). The Cn3D drawing settings are shown in the top middle, where one can toggle structures on or off in the superposition.

Figure 10.2. XGRAIL output using the human BAC clone RG364P16 from 7q31 as the query. The upper window shows the results of the prediction, with the histogram representing the probability that a given stretch of DNA is an exon. The various bars in the center represent features of the DNA (e.g., arrows represent repetitive DNA, and vertical bars represent repeat sequences). Exon and gene models, protein translations, and the results of a genQuest search using the protein translation are shown.

Figure 10.9. Annotated output from GeneMachine showing the results of multiple gene prediction program runs. NCBI Sequin is used as the viewer. The top of the output shows the results from various BLAST runs (BLASTN vs. dbEST, BLASTN vs. nr, and BLASTX vs. SWISSPROT). Toward the bottom of the window are shown the results from the predictive methods (FGENES, GENSCAN, MZEF, and GRAIL 2). Annotations indicating the strength of the prediction are preserved and shown wherever possible within the viewer. Putative regions of high interest would be areas where hits from the
BLAST runs line up with exon predictions from the gene prediction programs.

  TARGET QRRQ RTHFTSQQLQ QLEATFQRNR YPDMSTREEI AVW TNLTEAR
  11FJL  KQRRS RTTFSASQLD ELERAFERTQ YPDIYTREEL AQRTNLTEAR
  21FJL  QRRS RTTFSASQLD ELERAFERTQ YPDIYTREEL AQRTNLTEAR
  11B72  ARTFDWMKVL RTNFTTRQLT ELEKEFHFNK YLSRARRVEI AA TLELNETQ
  22HDD  KRP RTAFSSEQLA RLKREFNENR YLTERRRQQL SSELGLNEAQ
  12HOA  MRKRG RQTYTRYQTL ELEKEFHFNR YLTRRRRIEI AHALSLTERQ
         * * * * * * * *

  TARGET
  11FJL  hhhhhh hhhhhhhhh hhhhhhhh hhhhh hhhh
  21FJL  hhhhhh hhhhhhhhh hhhhhhhh hhhhh hhhh
  11B72  hhhhhh hhhhhhhhh hhhhhhhh hhhhh hhhh
  22HDD  hhhhhh hhhhhhhhh hhhhhhhh hhhhh hhhh
  12HOA  hhhhhh hhhhhhhhh hhhhhhhh hhhh hhhh

  ATOM   1  H1      GLN   9.226  107.177  13.966  1.00  99.00
  ATOM   2  H2      GLN  10.769  107.671  13.751  1.00  99.00
  ATOM   3  N       GLN   9.824  107.785  13.444  1.00  25.00
  ATOM   4  H3      GLN   9.549  108.738  13.592  1.00  99.00
  ATOM   5  CA      GLN   9.728  107.473  11.999  1.00  25.00
  ATOM   6  CB      GLN   8.265  107.520  11.538  1.00  25.00
  ATOM   7  CG      GLN   7.468  106.270  11.932  1.00  25.00
  ATOM   8  CD      GLN   8.001  104.970  11.312  1.00  25.00
  ATOM   9  OE1     GLN   8.748  104.928  10.343  1.00  25.00
  ATOM  10  NE2     GLN   7.629  103.853  11.899  1.00  25.00
  ATOM  11  HE21GLN       7.979  103.008  11.502  1.00  99.00
  ATOM  12  HE22GLN       7.015  103.860  12.683  1.00  99.00

Figure 11.7. Molecular modeling using SWISS-MODEL. The input sequence for the structure prediction is the homeodomain region of human PITX2 protein. The output from SWISS-MODEL contains a text file containing a multiple sequence alignment, showing the alignment of the query against selected template structures from the Protein Data Bank (top). Also provided as part of the output is an atomic coordinate file for the target structure (center). In this example, the atomic coordinates of the target structure have been used to build a surface representation of the derived model using GRASP (lower left) and a ribbon representation of the derived model using RASMOL (lower right).

Figure 13.2. A screen dump from the program phrapview, showing a graphical display of the
state of the data immediately after a run of the phrap assembly engine. See text for details.

Figure 13.3. A screen dump of the gap4 Contig Selector, which gives an overview of the state of a sequencing project and provides a method for users to select contigs for processing. See text for details.

Figure 13.4. A screen dump of the gap4 Contig Comparator. This transformed version of the Contig Selector is used to display the results of analytical methods that give information about the relationships between contigs. For example, it can show sequence matches between contigs and the positions of read pairs that span contigs. See text for details.

Figure 13.5. A screen dump of the gap4 Template Display, which shows the positions of DNA templates and the extent of readings derived from them. Color coding is used to distinguish between forward and reverse readings and to show consistent and inconsistent read pairs. See text for details.

Figure 13.6. A screen dump of the gap4 Consistency Display. Here, it is being used to plot a histogram of the number of readings from each strand covering each position along a contig. Below that it is showing the segments with no data from one strand or the other. See text for details.

Figure 13.7. A screen dump of the gap4 Contig Editor and Trace display. See text for details.

Figure 13.8. A screen dump of the gap4 Join Editor, which is used to align, edit, display traces, and join contigs. See text for details.

[Figure 16.1 diagram labels: Generate SAGE tags; Concatemerize tags; Sequence, count tags, and identify genes; Expression Level; Gene (a, b, c, d).]

Figure 16.1. Serial Analysis of Gene Expression (SAGE) depends on the generation of a tag from the 3′ end of an mRNA. Tags are concatemerized and sequenced. These data are compared with a database of tags linked to individual transcripts to generate the frequency of each tag in the library, a measure
of the expression level for that gene.

[Figure 16.2 diagram labels: Prepare cDNA probe; Test; Reference; Print microarray; cDNA library; Microarray slides; Label with fluorescent dyes; Combine cDNAs; Hybridize probe to microarray; Scan.]

Figure 16.2. The process of microarray hybridization using printed DNA probes. A robotic printer deposits DNA in a regular array on a series of glass slides. After they are processed, the slides are hybridized to a mixture of two cDNA pools derived from test and reference samples that have been labeled with spectrally distinct fluorochromes. After stringency washes, the microarray is scanned in a laser-scanning device, and the image is processed to generate numerical data.

Figure 16.9. In ArrayDB, a query for outliers returns an image of the microarray, with outlying genes highlighted in the image and listed below the image, along with intensity data and clone identifiers.

Figure 16.10. Display of microarray results retrieved from FileMaker Pro. This example illustrates the results of a query for genes upregulated in a series of cancer cell lines, with ratios coded by a red to green color map (Khan et al., 1998).

... Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland
David Landsman, Computational Biology Branch, National Center for Biotechnology Information, National Library... Institutes of Health, Bethesda, Maryland
James K. Bonfield, Medical Research Council, Laboratory of Molecular Biology, Cambridge, United Kingdom
Fiona S. L. Brinkman, Department of Microbiology and Immunology,... Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland and Department of Molecular Biology and Genetics, The Johns Hopkins School of Medicine, Baltimore, Maryland

Table of Contents

Front Cover

CONTENTS

BIOINFORMATICS AND THE INTERNET

THE NCBI DATA MODEL

THE GENBANK SEQUENCE DATABASE

SUBMITTING DNA SEQUENCES TO THE DATABASES

STRUCTURE DATABASES

GENOMIC MAPPING AND MAPPING DATABASES

INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES

SEQUENCE ALIGNMENT AND DATABASE SEARCHING

CREATION AND ANALYSIS OF PROTEIN MULTIPLE SEQUENCE ALIGNMENTS

PREDICTIVE METHODS USING DNA SEQUENCES

PREDICTIVE METHODS USING PROTEIN SEQUENCES

EXPRESSED SEQUENCE TAGS (ESTs)

SEQUENCE ASSEMBLY AND FINISHING METHODS

PHYLOGENETIC ANALYSIS

COMPARATIVE GENOME ANALYSIS

LARGE-SCALE GENOME ANALYSIS

USING PERL TO FACILITATE BIOLOGICAL ANALYSIS

FOREWORD

PREFACE
