The optimization of protein secondary structure determination with infrared and circular dichroism spectra

Keith A. Oberg, Jean-Marie Ruysschaert and Erik Goormaghtigh

Center for Structural Biology and Bioinformatics, Laboratory for the Structure and Function of Biological Membranes, Free University of Brussels (ULB), Belgium

Eur. J. Biochem. 271, 2937–2948 (2004) © FEBS 2004 doi:10.1111/j.1432-1033.2004.04220.x
(Received 16 March 2004, revised 10 May 2004, accepted 19 May 2004)

We have used the circular dichroism and infrared spectra of a specially designed 50 protein database [Oberg, K.A., Ruysschaert, J.M. & Goormaghtigh, E. (2003) Protein Sci. 12, 2015–2031] in order to optimize the accuracy of spectroscopic protein secondary structure determination using multivariate statistical analysis methods. The results demonstrate that when the proteins are carefully selected for the diversity in their structure, no smaller subset of the database contains the necessary information to describe the entire set. One conclusion of the paper is therefore that large protein databases, observing stringent selection criteria, are necessary for predicting the secondary structure of unknown proteins. A second important conclusion is that only the comparison of analyses run on circular dichroism and infrared spectra independently is able to identify failed solutions in the absence of known structure. Interestingly, it was also found in the course of this study that the amide II band has high information content and could be used alone for secondary structure prediction in place of amide I.

Keywords: circular dichroism; FTIR; PLS; protein secondary structure.

Correspondence to E. Goormaghtigh, Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, CP 206/2, Free University of Brussels (ULB), Bld du Triomphe, Accès 2, B-1050 Brussels, Belgium. Fax: +32 2650 53 82, Tel.: +32 2650 53 86, E-mail: egoor@ulb.ac.be

Abbreviations: FC, fractional composition (percentage) of a secondary structure type in a protein; FTIR, Fourier transform infrared spectroscopy; PCA-MR, principal component analysis with multiple-regression algorithm; PLS, partial least-squares algorithm; PLS-1, weighted partial least-squares algorithm; R, correlation coefficient for a regression; RaSP, rationally selected proteins; RMSE, root-mean-squared error; σ_CS, rms deviation of FC values from a set of crystal structures; H, α-helix; E, β-sheet; T, turn; G, 3₁₀-helix; I, π-helix; S, bend; B, an isolated residue with extended φ/ψ angles; C, residues that have no secondary structure assignment (irregular structure).

Multivariate statistical analysis methods have proved to be powerful tools for the analysis of component concentrations in chemical mixtures. Because of their effectiveness in systems where there are strongly overlapping bands, these chemometric methods have also proved effective in the analysis of protein spectra [1–12].

In contrast to most typical applications of statistical analysis methods, the reported accuracy in protein studies, especially those that treat infrared (IR) spectra, varies widely. It is often assumed that differences result primarily from the analytical methods applied, and so new analytical methods have appeared continuously over the last two decades (see references above, and [5,6,13–18]). In addition to the reported margins of error, there are a number of discrepancies to be found in IR studies, including the optimal data regions [8,19,20] and spectral preprocessing methods [20]. It has not been resolved whether these differences arise primarily from the analysis methods or the protein basis sets that have been used. Thus there is a need for a systematic evaluation of the steps involved in protein secondary structure analysis where the dependence of the results on the protein basis set is minimized.

The key to the effectiveness of statistical methods is concentration-dependent changes in spectra that are directly related to the concentrations of the chemical species being determined. In simple chemical quantification systems, this is typically an increase in signal intensity at certain positions in a spectrum that depends linearly on the analyte concentration. In some cases, interactions between the components of a mixture may result in additional bands or changes in the signal. In these cases the concentration dependence becomes more complex. However, statistical analysis algorithms can usually model these complexities with linear systems, and thus provide accurate analysis results.

The situation encountered in the analysis of protein spectra is less straightforward, because protein spectra vary to a certain extent independently of the structure content. Such behavior arises in part from the way secondary structure is assigned from crystal structures. Assignment algorithms necessarily involve the simplification of crystal structure data in the form of combining residues with somewhat different conformations into a single structure assignment.

There are several possible ways to handle structure content-independent variations in protein spectra. The primary example of such an analysis is that given in [21], in which protein amide I bands were analyzed by fitting with a series of Gaussian curves. The success reported in the original paper was spectacular: the rms errors for α-helix and β-sheet were in the order of ±2.5%. The curve-fitting method compensates for band position variation by assigning all component bands found in given regions of the spectrum to a particular structure. This method can be highly effective when applied by one experienced in its use; however, curve-fitting requires a series of subjective decisions that can dramatically affect both the results and the interpretation [22–25]. Furthermore, curve-fitting has a tendency to overestimate the β-sheet content of primarily helical proteins, and routinely finds 15–20% β-sheet for proteins that actually have none [21,26–30].

Statistical analyses are generally accepted as being the best way to analyze protein CD spectra, but curve-fitting is still widely used for determining protein structure from infrared spectra. From reviewing the literature and considering just the results for α-helix determinations with IR data, it can be seen that the reported accuracy (rms determination error) ranges from 3.9 to 10.3%, but the 3.9% value was obtained from a 17 protein set, and two sets with nine and 21 proteins obtained rms errors of around 10%. The different algorithms used in these studies may indeed be responsible for the differences, but as statistical analyses depend both on the algorithm and reference spectra used, the source of the discrepancies cannot be unambiguously identified. It remains possible that these differences reflect the internal consistency of the spectra in each respective basis set rather than the expected general accuracy of the methods.

In a recent paper, Sreerama & Woody [6] investigated the effect of the number of reference proteins (29–48) on the accuracy of the prediction obtained by three publicly available CD analysis software programs. In this paper we propose to extend this analysis to IR and combined CD/IR spectra. We took great care to use a protein database that presents the largest possible structure variety.
We constructed a protein database that covers, as far as possible, the α/β space, the fold space as described by CATH (class, architecture, topology and homology classification of proteins [31]) as well as other structural features such as helix length, and the number of chains in a sheet. We identified 50 commercially available proteins that can be obtained with sufficient purity and for which we assessed the quality of the crystal-derived structure. We call this set of 50 rationally selected reference proteins RaSP50 and details of this database have been published recently [32].

We report in this study the application of the RaSP50 set to spectroscopic protein structure determination. We have attempted here to establish an optimal approach to using existing methods. This was achieved by focusing on the input for statistical analysis algorithms, such as data types, spectral preprocessing, and secondary structure assignments. Because IR and CD are the most widely used spectroscopic tools for the determination of protein secondary structure, both were applied in this study. They were tested alone and in combination in an effort to evaluate their respective strengths, weaknesses and complementarities. It was found that the quality of the reference protein database rather than the algorithms used determines the efficiency of the secondary structure prediction. Clear complementarities between IR and CD spectra allow a further enhancement of the secondary structure accuracy.

Experimental procedures

Input data for analysis

The set of reference proteins used for this study is an 'optimal' basis set, and has been described elsewhere [32]. It represents a wide range of helix and sheet FC values as well as 60 different protein domain folds. The final set of 50 proteins fully spans several different 'conformational spaces', and has distributions of structures that reflect the natural abundances found in the PDB.
The spectra of the 50 proteins are available on request from the authors.

Protein secondary structure tabulation from DSSP output

The secondary structures of the RaSP proteins were determined with the DSSP program [33]. There are eight assignments made by DSSP. Six are familiar to protein chemists: α-helix (denoted by H), 3₁₀-helix (G), π-helix (I), β-sheet (E), turn (T), and unassigned structure (indicated by a blank space in the DSSP program output, but which we denote with C). Unassigned structure has been referred to by many names, such as irregular, other, disordered or coil. The fractional compositions (FCs) of the secondary structures were tabulated from the DSSP output. The α-helix assignment can be tabulated as a simple count of the residues assigned H by the DSSP program, or it can be divided into 'ordered' (denoted by O) and 'disordered' (do) helix by giving the disordered helix assignment to the two residues at each helix end (ends) [10] or to helical residues with less than one or two hydrogen bonds within the helix (denoted by < 1 or < 2), and giving all other α-helical residues the ordered assignment. We also separated parallel and antiparallel β-sheet.

Spectroscopic data collection and processing

All protein preparations were desalted by dialysis or size-exclusion chromatography. CD spectra were collected on a JASCO J-710 CD spectrometer using filtered protein solutions in 2 mM Hepes pH 7.2 with an absorbance of ≈0.5–0.8 at 192 nm (≈0.1 mg·mL⁻¹) in a 0.1 cm cell. Each CD spectrum was the accumulation of eight scans at 50 nm·min⁻¹ with a 1 nm slit width and a time constant of 0.5 s for a nominal resolution of 1.7 nm. Data were collected from 185 to 260 nm. CD spectra were background corrected and scaled to mean residue ellipticity based on the absorbance at 205 nm. The extinction coefficient used was ε₂₀₅ = 5167 per peptide bond; this was determined using a combination of data from Scopes [34] and Hennessey and Johnson [35].
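The scaling of CD data to mean residue ellipticity described above can be written out as a short worked relation. This is a sketch under stated assumptions rather than the authors' exact procedure: it assumes that ε₂₀₅ is expressed in M⁻¹·cm⁻¹ per peptide bond, that the absorbance A₂₀₅ and the observed ellipticity θ_obs (in millidegrees) are measured in the same cell of path length l (in cm), and that the number of peptide bonds is approximated by the number of residues. The symbols c, l and θ_obs are introduced here for illustration only.

```latex
% Assumed units: epsilon_205 in M^-1 cm^-1, l in cm, theta_obs in mdeg
c = \frac{A_{205}}{\varepsilon_{205}\, l},
\qquad
[\theta] = \frac{\theta_{\mathrm{obs}}}{10\, l\, c}
         = \frac{\theta_{\mathrm{obs}}\, \varepsilon_{205}}{10\, A_{205}}
\quad \left(\mathrm{deg\,cm^{2}\,dmol^{-1}}\right)
```

Under these assumptions the path length cancels, so the scaling depends only on the ratio of the observed ellipticity to the absorbance at 205 nm.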
Infrared measurements were made on a dry-air purged Bruker IFS-55 FTIR spectrometer with an MCT detector. Data were collected at 2 cm⁻¹ resolution; 512 scans were accumulated for each spectrum. Transmission IR spectra were collected from ≈3% (w/v) solutions sandwiched between CaF₂ windows with a 5 µm Teflon spacer in a demountable cell. The protein signal was extracted from IR spectra by subtracting a buffer spectrum with a scaling factor determined by the method of Powell et al. [36]. The contribution of water vapor was subtracted from the infrared spectra using a scaling factor determined from the integrated absorbance of the 1717 or 1772 cm⁻¹ bands.

Preprocessing of spectra for analysis

When indicated, the contributions from amino acid side chains were subtracted for some analyses using data from Venyaminov [37]. For these subtractions, a synthetic side chain spectrum was generated using the amino acid composition of the protein listed in the SWISS-PROT database [38,39]. The side chain spectrum was then subtracted from the protein spectrum using a scaling factor determined by Fourier self-deconvolving the spectra, determining the tyrosine band area ratios at 1518 cm⁻¹, and then using this as the scaling factor for subtracting the original spectra [40]. Due to low tyrosine signal, the proteins BTE, FTN, MTH, PAB, SOD and TRO could not be processed in this manner, so the relative extinction coefficients were used to estimate the subtraction scaling factor [40]. Spectral scaling, baseline corrections and normalizations were carried out by custom routines added to the PLSPLUS analysis software (discussed below). In the following discussion, intensity or point normalization refers to the scaling of spectral regions to a constant maximum intensity of 1, and area normalization refers to adjusting the intensity so that all spectra had the same integrated area in a chosen region (0.1 absorbance units·cm⁻¹).
Before normalization, IR spectral regions were either baseline-corrected to bring both endpoints to zero, or if baseline-correction was not used, the first value in the band was subtracted from all other data points in order to bring the minimum to zero.

Combination of CD and IR spectra

To analyze combined CD and IR data, hybrid spectra were made by placing CD and IR data in a single array. In these spectra, one unit on the x axis corresponds to one of the native units for each data type (nm for CD and cm⁻¹ for IR), and the data point spacing is 1 per x unit. It was necessary to scale the CD spectra to be consistent with the intensities of the IR spectra (each spectrum was multiplied by 0.0015), but the exact type of unit in each region is less important than the fact that both data types were of similar intensity in the hybrid spectrum. This ensured that they had similar contributions in the model building process. The limits of the data regions used were 1720–1600 cm⁻¹ for the IR amide I band, 1590–1500 cm⁻¹ for the amide II band, and 185–260 nm for far-UV CD data.

Analysis methods

The bulk of the analyses performed in this report were made with PLSPLUS version 2.1 (Galactic Industries, Salem, NH). PLSPLUS is integrated into GRAMS (Galactic Industries Corporation, Woburn, MA, USA), which uses the Array Basic™ programming language. It was therefore possible to create all other necessary software in a single environment with a common data format.

Although the PLS-1 algorithm used here extracts factors in order of their relevance to the structure being quantified, the problem of selecting the total number of factors to use for each structure type remains. It was found that the first two-to-five factors for α-helix or β-sheet typically accounted for 95% of the variation in the RaSP50 spectra.
For turn and other structures, up to 10 factors were required to reach the 95% level, but the first six usually accounted for 90% or more of the total variation. The factors themselves typically began to show significant contributions from noise at the 6th or 7th factor. For automatic selection of the optimal number of factors for each model, the maximum was set at 10. However, if there was a smaller set of factors that provided similar accuracy (< 1.05 times the minimum rms), the algorithm would select this as the optimum. For each analysis, cross validation was performed and the rms deviation between the experimental and calculated set was determined for all possible numbers of factors.

Unless otherwise indicated, all analysis results reported here are from cross validations of the full RaSP50 set. Cross validation is performed by removing each spectrum, in turn, from the reference set. The remaining spectra (49 in this case) are then used to generate a statistical model, which is used to determine the structure of the eliminated protein. Finally the calculated FC for each protein is compared with its actual structure, and the determination error is evaluated and stored. After the cross validation is complete, the rms error of all analyses is determined with 'n − 1' degrees of freedom in PLS-1, and 'n − (s − 1)' degrees of freedom for simultaneous methods, where n is the number of spectra and s is the number of structures being determined simultaneously.

Results and Discussion

Methods for evaluating analysis performance

Cross validation, as explained in the Experimental procedures, treats each protein as an unknown and evaluates its structure using the remaining proteins as a training set. For the RaSP50 proteins, great care was taken to eliminate protein preparations with impurities and to use only high-quality spectra.
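The leave-one-out cross validation and the factor-selection rule described in the Experimental procedures can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the PLSPLUS implementation: the PLS-1 fit is a textbook NIPALS variant, and the function names (`pls1_fit`, `loo_rms_by_factors`, `select_n_factors`) are hypothetical. `X` is assumed to hold one preprocessed spectrum per row and `y` the corresponding FC values for a single structure type.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Textbook NIPALS PLS-1 fit; returns (coefficients, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # mean-centred residuals
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr                        # weight vector: covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                     # nothing left to model
            break
        w = w / norm
        t = Xr @ w                           # scores for this factor
        tt = t @ t
        p = Xr.T @ t / tt                    # spectral loading
        qk = (yr @ t) / tt                   # regression of y on the scores
        Xr = Xr - np.outer(t, p)             # deflate spectra and FC values
        yr = yr - qk * t
        W.append(w)
        P.append(p)
        q.append(qk)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def loo_rms_by_factors(X, y, max_factors=10):
    """Leave-one-out rms error (n - 1 degrees of freedom) per factor count."""
    n = len(y)
    errors = np.zeros((max_factors, n))
    for i in range(n):
        keep = np.arange(n) != i             # remove spectrum i from the set
        for k in range(1, max_factors + 1):
            coef, x_mean, y_mean = pls1_fit(X[keep], y[keep], k)
            determined = (X[i] - x_mean) @ coef + y_mean
            errors[k - 1, i] = determined - y[i]
    return np.sqrt((errors ** 2).sum(axis=1) / (n - 1))

def select_n_factors(rms, tolerance=1.05):
    """Smallest factor count whose rms is within tolerance * minimum rms."""
    acceptable = np.flatnonzero(rms <= tolerance * rms.min())
    return int(acceptable[0]) + 1
```

With a reference set of 50 spectra, `loo_rms_by_factors(X, y)` would build each model from the remaining 49 spectra, and `select_n_factors` would then return the smallest number of factors whose rms lies within 5% of the minimum.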
Given this care, it can be assumed that a high cross validation error for any of the RaSP50 proteins arises not from a problem with the sample, but from an inconsistency between its secondary structure assignment and the actual variations of structure that give rise to the protein spectrum. Because every sample in the basis set is analyzed as an unknown during cross validation, this procedure provides data that can be used to estimate the expected accuracy for analyses of true unknowns: the cross validation errors (determined FC − actual FC) are calculated for all basis samples, and the rms of these errors is then taken. The rms is potentially a good estimate of the overall performance of an analysis method because it is a summary of many unknown analyses (50 in this case). The rms represents the error bounds for ≈2/3 of the training samples in cross validation. Thus, it can be expected that there is a 67% chance that FC determination results for an unknown protein will be within ± rms of the 'true' value.

The rms, if presented alone, can also be uninformative. Assume that a hypothetical cross validation of the RaSP50 basis set was performed and an rms for FC_T of ±5.0% T was reported. It would be natural to conclude that the turn determination accuracy for this method was quite good, but this is not the case. If we look at the crystal structures of the RaSP50 proteins and examine the distribution of their actual FC_T values, we find a mean of 12.5% T and a standard deviation (σ_CS) of 4.3% T. An inelegant way of estimating the turn content for an unknown could be just to guess that its FC_T was 12.5% (the mean for the RaSP50 set). In fact, this value would probably be a more accurate estimate (±4.3% T) than that provided by the hypothetical statistical analysis (±5.0% T). This problem has been recognized previously [41].
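The comparison made above can be checked numerically. The sketch below divides the spread of the crystal-structure FC values by the cross validation rms; a ratio below one means that guessing the reference-set mean would have been at least as accurate as the analysis. The function names are hypothetical, and the use of n − 1 in both the standard deviation and the rms is an assumption made for consistency with the PLS-1 convention given in the Experimental procedures.

```python
import numpy as np

def spread_to_rms_ratio(sigma_cs, rms):
    """Ratio of the reference-set FC spread (sigma_CS) to the cross validation rms."""
    return sigma_cs / rms

def ratio_from_data(actual_fc, determined_fc):
    """Same ratio, computed from actual and cross-validated FC values."""
    actual = np.asarray(actual_fc, dtype=float)
    errors = np.asarray(determined_fc, dtype=float) - actual
    sigma_cs = np.std(actual, ddof=1)                        # spread of crystal FCs
    rms = np.sqrt(np.sum(errors ** 2) / (len(errors) - 1))   # n - 1 degrees of freedom
    return sigma_cs / rms

# Hypothetical turn example from the text: sigma_CS = 4.3% T, rms = 5.0% T.
turn_ratio = spread_to_rms_ratio(4.3, 5.0)
print(f"{turn_ratio:.2f}")  # prints 0.86, i.e. worse than guessing the mean
```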
To evaluate the information content present in the spectra, the rms for each analysis will be reported together with a 'determination enhancement' score, ζ, which we define as σ_CS divided by rms. These scores compare the relative widths of the distributions for the protein crystal structure FCs (σ_CS) and the cross validation errors. They can also be used to compare the analysis accuracy for different structures because they are corrected for the actual distribution of each structure type in the RaSP50 protein set. A ζ score of one indicates that these distributions have the same width, i.e. that the analysis of the spectrum is just as accurate as guessing the average value.

Preprocessing of protein spectra for statistical analysis

CD spectra of proteins were scaled based on the protein (amino acid residue) concentration of the sample used to collect a spectrum, as explained in the Experimental procedures. For IR spectra the possible processing steps include normalization, baseline correction, the subtraction of side chain contributions and artificial band narrowing (using Fourier self-deconvolution or differentiation). There is consensus in the literature that band narrowing does not improve statistical analysis results [20]; therefore such procedures were not re-evaluated here. The subtraction of side-chain spectra before analysis has been used in only one study [10]. Normalization of IR spectra is required before analysis. Typical normalization methods that have been used for IR spectra include scaling to the same maximum intensity [19,20] or area [8,19,20]. In essence, these normalizations are intended to increase the correlation of IR band shape with protein secondary structure, and thus improve analysis accuracy.

The effect of spectral normalization on analysis results. For the infrared spectra (Fig. 1), various normalization procedures were followed: normalization of the band maximum to a constant intensity; normalization to a constant area; normalization of the combined amide I and II bands, with separate analyses of each band afterwards; and separate normalization of the amide I and II bands. All these normalizations were tested with and without baseline correction. For the CD spectra (Fig. 1), the mean residue ellipticity was used. The errors of cross validation (rms) of each spectroscopic 'data region' used in this study (IR amide I, IR amide II, and CD) were obtained (not shown). It was immediately apparent that most of the normalizations do not radically change the analysis accuracy. Subtracting a baseline and normalizing separately on amide I and II was close to the best solution for every structure.

Side chain signal subtraction. IR and CD protein spectra can contain significant contributions from amino acid side chain bands. In CD spectra, these contributions arise from aromatic groups or disulfides and are dictated by the local environment of each side chain [42,43]. It is therefore impossible to determine the exact nature of their contribution to a given protein spectrum a priori. For IR spectra, side chain contributions are consistent enough for the generation of synthetic spectra based on data from model compound studies [23,37,44,45]. There is much debate over the usefulness of subtracting side chain spectra, as they are typically broad, relatively featureless and may be affected to a small extent by the local environment of each residue. In fact, simple baseline correction often has an effect similar to subtraction [23]. Our data (not shown) indicate that side chain subtractions only moderately improved the rms values for some, but not all, secondary structures. Side chain subtraction was therefore not used further in this study.
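The preprocessing that performed close to best above (baseline subtraction followed by separate normalization of the amide I and amide II bands) can be sketched as follows. This is an illustration, not the authors' custom PLSPLUS routines: a straight-line baseline through the two endpoints is assumed, the function names are hypothetical, and the region limits and the 0.1 area target are the values given in the Experimental procedures.

```python
import numpy as np

AMIDE_I = (1720.0, 1600.0)    # region limits in cm^-1
AMIDE_II = (1590.0, 1500.0)

def cut_region(wavenumber, absorbance, limits):
    """Keep only the points inside the given wavenumber limits."""
    high, low = max(limits), min(limits)
    mask = (wavenumber <= high) & (wavenumber >= low)
    return wavenumber[mask], absorbance[mask]

def subtract_endpoint_baseline(x, y):
    """Subtract a straight line through the first and last points (assumed form)."""
    fraction = (x - x[0]) / (x[-1] - x[0])
    return y - (y[0] + (y[-1] - y[0]) * fraction)

def normalize_point(y):
    """Point normalization: scale to a maximum intensity of 1."""
    return y / np.max(y)

def normalize_area(x, y, target_area=0.1):
    """Area normalization: scale to a fixed integrated area (trapezoid rule)."""
    area = np.sum(0.5 * (y[1:] + y[:-1]) * np.abs(np.diff(x)))
    return y * (target_area / area)

def preprocess_band(wavenumber, absorbance, limits, mode="point"):
    """Cut one band, baseline-correct it, then normalize it on its own."""
    x, y = cut_region(wavenumber, absorbance, limits)
    y = subtract_endpoint_baseline(x, y)
    if mode == "point":
        return x, normalize_point(y)
    return x, normalize_area(x, y)
```

Calling `preprocess_band` once with `AMIDE_I` and once with `AMIDE_II` normalizes the two bands separately, which corresponds to the last of the normalization procedures listed in the text.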
The sensitivity of different spectroscopic methods to protein secondary structure

The relative sensitivities of different spectroscopic methods to protein secondary structure. It is usually accepted that CD measurements provide more accurate estimations of protein α-helix content whereas IR is thought to be more sensitive to β-sheets. The data in Table 1 support this, and provide some quantitative information about the extent to which it is true. In comparing the optimal rms values for the IR amide I and CD data types it was found that determinations of the α-helix content are (relatively) 18% better for CD than IR, and IR is (relatively) 30% more accurate than CD in β-sheet determination. There is also a difference in sensitivity to the other structural classifications listed in the table. CD determination of the C + B + G + S assignment proved to be about (relatively) 10% more accurate than IR, but the IR rms for FC_T analysis is lower than the rms for CD.

Also noteworthy are the results from the amide II band. It has long been recognized that the amide II band is conformationally sensitive. However, the dependence of amide II band shape on secondary structure is complex, so it has not been considered systematically for qualitative analyses. The results in Table 1 indicate that the PLS-1 algorithm was able to extract information from the isolated amide II band. In fact, amide II cross validation results for α-helix and turn were more accurate than analysis using only the amide I region. This finding suggests, for the first time, that the amide II band could be used alone for protein FC_H determination.

Combining data regions for analysis. While the spectroscopic signals in the amide I and II regions arise from ν(C=O), δ(N−H) and ν(C−N), the CD data arise from electronic transitions. It is therefore probable that they contain different independent structural information.
Consequently, if they are combined into a single hybrid spectrum (Fig. 1), an analysis algorithm should be able to extract complementary information from each region and thus provide more accurate results. Such a combination has been tested before [8,19,20], including for vibrational and electronic CD spectra [41] or vibrational CD and IR [1], but the conclusions reached in these studies are contradictory. This is presumably because different basis sets and/or different mathematical methods were used. To resolve this conflict, cross validations with different data region combinations were performed with the RaSP50 protein spectra. The results presented in Table 1 are given with the optimal normalization strategy for each structure type. It was found that for the IR data, combining the amide I and II bands substantially improved determinations of the α-helix and C + B + G + S structure assignments. Combined IR and CD data were more accurate than either of these methods alone. The relative improvements in determination accuracy using all three data regions compared to the IR amide I band alone (Table 1) are 48% for α-helix, 5% for β-sheet (39% compared to CD alone), 12% for turn and 9% for the sum of the remaining assignments.

Fig. 1. Concatenated CD (186–260 nm) and IR (1720–1500 cm⁻¹) spectra of the 50 proteins described in the Experimental procedures. Spectra have been rescaled and offset along the y-axis for better readability. Proteins are sorted according to their α-helix content.

The correspondence between secondary structure definitions and spectral features

Now that optimum spectral processing methods have been established, the other major input for statistical analysis, structure assignments, can be explored.
Bond angles, H-bonds, tertiary structure, resonance, exciton coupling and side chain signals are all factors that contribute to protein spectral band shapes. Reducing this natural complexity to a small set of secondary structure assignments involves a simplification that obviously cannot accurately describe all aspects of IR or CD spectra. The DSSP program [33] is currently the most widely used method for assigning secondary structure types to individual residues. DSSP makes eight structure assignments, each identified by a single letter code, including three types of helix (α-helix, H; 3₁₀-helix, G; π-helix, I), β-sheet (E), turns (T), and what we will refer to as irregular structures (C). While the first four structures are periodic, the DSSP program assigns two aperiodic structures in addition to T and C; these are typically found within stretches of C. They are isolated–extended (B), a residue with φ/ψ angles in the β-sheet range that does not participate in a β-sheet, and bend (S), a sharp turn in the protein chain that does not meet all the criteria for a T assignment. From the descriptions of these assignments, the question naturally arises: where would the signals from these structures be expected to appear? For example, should the B structure give rise to a band characteristic of β-sheet, irregular structure, or will it have a unique signal? Similar questions apply to other assignments as well. Optimal determination accuracy can only be achieved by placing such residues in an appropriate assignment group. A few structure assignment combinations have appeared in the literature [9,35].

To tackle these questions in a systematic manner, the performance of PLS-1 analysis for various DSSP assignment combinations and subclassifications was evaluated. Note that, due to their natural abundances, the FC variation (σ_CS) of some structure types (G, S, B and I) in the RaSP50 set is small.
Accordingly, statistical analysis cannot be accurate for these structures unless their FC/signal correlation is strong. Most proteins in the RaSP50 set have 12–13% turn, and the 3₁₀-helix content is below 10% for all proteins except lysozyme and α-lactalbumin, with the majority having 3–6%. The π-helix (I) was essentially nonexistent in the RaSP50 proteins, and was therefore not considered a quantifiable structure.

Individual secondary structure assignments and their combinations. First, let us consider the performance of the individual assignments made by DSSP. The ζ scores given in Table 2 indicate how much information the analysis algorithm could extract from the reference spectra. This analysis was performed for all the structure types, alone or combined. A summary of the results appears in Table 2. The results suggest that the DSSP program overclassifies some secondary structures, at least as far as IR and CD spectra are concerned. That is, there are several structure types that apparently do not give rise to unique, detectable spectral characteristics (for example, ζ ≈ 1 for the G and B assignments). Note, however, that the FC distributions (σ_CS) for these structures are also narrow in the RaSP50 set. The only individual secondary structure assignments made by DSSP that are determined by the PLS algorithm with ζ ≥ ≈2 are α-helix (H) and β-sheet (E); none of the ζ scores for other individual assigned structures is higher than 1.25 (not shown).

Residues with different secondary structure assignments can also be combined into a single assignment class. Such grouping has a possible advantage in secondary structure determination. Grouping residues with different assignments that may have similar spectroscopic signals may also increase the sensitivity of analysis.
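The tabulation and grouping of DSSP assignments discussed above can be illustrated with a short sketch. It assumes that the per-residue DSSP codes have already been extracted into a single string (parsing of DSSP output files is not shown) and maps the blank code to C, as done in this paper. The function names and the 20-residue assignment string are invented purely for illustration.

```python
from collections import Counter

DSSP_CLASSES = "HGIETSBC"   # the eight assignments, with blank rewritten as C

def fractional_composition(assignments):
    """Percentage of residues in each DSSP class for one protein."""
    codes = assignments.replace(" ", "C")    # blank = unassigned structure
    if not codes:
        raise ValueError("empty assignment string")
    unknown = set(codes) - set(DSSP_CLASSES)
    if unknown:
        raise ValueError(f"unexpected DSSP codes: {sorted(unknown)}")
    counts = Counter(codes)
    return {code: 100.0 * counts[code] / len(codes) for code in DSSP_CLASSES}

def grouped_fc(fc, group):
    """Sum the FCs of a combined assignment class such as 'C+S+B'."""
    return sum(fc[code] for code in group.split("+"))

# Invented example: 8 H, 2 T, 5 E, 1 S, 3 G and 1 unassigned residue.
example = "HHHHHHHH" + "TT" + "EEEEE" + "S" + "GGG" + " "
fc = fractional_composition(example)
others = grouped_fc(fc, "C+T+G+S+B")   # everything except H, E and I
```

For this invented string, the helix and sheet FCs are 40% and 25%, so the combined C + T + G + S + B class accounts for the remaining 35% (there are no π-helix residues).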
The results show that some f scores could be improved by combining structu res, such as C + S + B (compared to each of these structures alone), but none of the combinations tested had f scores comparable to f H or f E . In addition, combining other structures with a-helix (H + G or H + C) or b-sheet (E + B) did not improve the f scores compared to H and E determinations alone. The structu re combination that was found to have the strongest FC/signal correspondence is C + T + G + S + B. It can be hypothesized that grouping all the structures but the a-helix and b-sheet must yield a prediction correlated with the a-helix and b-sheet prediction, as the value for C + T + G + S + B is simply 100 À (H À E). Table 1. Optimal rms values and preprocessing techniques f or IR and CD spectroscopic data-region combinations. These rms values are from the optimal normalization for each secondary structure type. Data a-Helix (H) b-Sheet (E) Turn (T) Others (C+B+G+S) RMS f a RMS f RMS f RMS f IR Amide I 8.97 2.46 6.91 2.60 4.14 1.05 9.66 1.25 IR Amide II 7.53 2.93 10.83 1.66 3.91 1.11 10.43 1.15 CD 7.61 2.90 9.12 1.97 4.46 0.97 8.95 1.34 Amide I + II 6.67 3.31 7.09 2.53 3.88 1.12 9.09 1.32 Amide I + CD 6.30 3.51 6.58 2.73 4.05 1.07 8.84 1.36 Amide II + CD 6.97 3.17 8.72 2.06 3.72 1.17 9.06 1.33 Amide I + II + CD 6.06 3.65 6.58 2.73 3.71 1.17 8.85 1.36 a The information content score (f ¼ r CS /rms) described in the text. 2942 K. A. Oberg et al.(Eur. J. Biochem. 271) Ó FEBS 2004 Subclassification of helix and sheet assignments. Several authors h ave attempted to improve the accuracy of structure determination by dividing helix into ordered, H(O), or disordered, H(do), subclassifications or by dis- criminating between antiparallel, E(AP), and parallel, E(par), b-sheet [6,8,19]. This is motivated by the idea that the differences between the geometries, H-bonding, etc. 
of these structures should be sufficient to produce differing band shapes that could be modeled by statistical analysis algorithms. Typically, this practice results in lower rms values for the segregated structures (Table 2). A reduction of rms values for segregated structures was also observed in this study, but subdividing the α-helix and β-sheet classifications also reduces the corresponding σ_CS values. The consequence is usually lower ζ scores for both substructures. Therefore, segregation generally degrades analysis performance. The only exceptions were found to be the removal of kinks from α-helices, H(O, <1), which increased the determination accuracy for one of the subassignments (ζ_H,IR = 3.29 → ζ_H(O <1),IR = 3.49; ζ_H,CD = 2.90 → ζ_H(O <1),CD = 2.98), and β-sheet segregation, which improved the accuracy of IR determinations for antiparallel β-sheet (ζ_E,IR = 2.44 → ζ_E(AP),IR ≈ 2.8). The ζ scores for the counterparts of these structures, ζ_H(do <1) and ζ_E(par), were close to one.

As for the remaining DSSP structures, a ζ score larger than 1.4 is obtained only when C and T are combined for analysis (C + T + G + S + B, and other combinations that include C + T). If more detailed structural information is desired, grouping sharp turns in the amide backbone (T + S or T + G + S) may provide some information if FC_T+S or FC_T+G+S is unusually high in the protein. Similarly, the C + B + S + G or the C + B (for IR) structure combinations also show moderate correlation to protein spectra band shapes (ζ ≈ 1.3).

The complementarities of information in CD and IR spectra. By comparing the ζ scores in the amide I + II and CD columns of Table 2, it can be seen that their information contents are comparable for all structure types. However, the changes in the ζ scores between cross validations from separate CD or IR spectra and the hybrid spectra (IR + CD) reveal that there can be complementarities between the information contained in each data region.
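As a worked check on how the information content score is obtained (the standard deviation of the crystal-structure FCs, σ_CS, divided by the rms error of the determination), the sketch below recomputes the α-helix scores from the σ_CS and SECV values listed for H in Table 2. The function and variable names are illustrative assumptions and are not part of the original analysis software.

```python
def information_score(sigma_cs, rms):
    """Information content score: sigma_CS divided by the rms error.

    A score near 1 means the determination error is about as large as the
    spread of the fractional contents (FCs) in the reference set.
    """
    if rms <= 0:
        raise ValueError("rms must be positive")
    return sigma_cs / rms


# Alpha-helix (H) row of Table 2: sigma_CS and the SECV for each data type.
helix_sigma_cs = 22.09
helix_secv = {
    "CD": 7.61,
    "Amide I + II + CD": 6.06,
    "Amide I + II": 6.72,
}

# Scores rounded to two decimals, as they are reported in the table.
helix_scores = {
    data: round(information_score(helix_sigma_cs, secv), 2)
    for data, secv in helix_secv.items()
}
```

With these inputs the rounded scores reproduce the values printed in the H row of Table 2 (2.9, 3.65 and 3.29).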
This point is investigated further in the next section of the paper.

Estimating accuracy in the secondary structure determination of an unknown

The rms, ζ scores and correlation coefficients presented in the tables above are all summaries of the overall performance characteristics of protein structure statistical analyses. While these values allow different methods to be compared, the question asked during the analysis of an unknown is typically not 'How accurate is the method in general?', but rather 'How accurate was the analysis for this protein?'. In theory, there are several quantities that can be derived from a statistical analysis that can assist in answering this question. An accuracy evaluation procedure would be most useful if it could define an expected margin of error for each unknown analysis. Presumably, it should be possible to derive additional information from the way that a statistical model reconstructs the spectrum of the unknown protein.

Quality of the fit

The match between the original and reconstructed spectra can be evaluated by taking the difference between the two. The sum of the absolute values of the differences between the original and fit spectra at each point is used here to characterize the residuals of the fit.

Table 2. Comparison of determination accuracies for secondary structure assignment combinations and segregations. The secondary structures were determined by the DSSP program and combined or subdivided. H, α-helix; E, β-sheet; T, β-turns; C, unassigned structures; G, 3₁₀-helix; B, isolated residues with β-sheet φ/ψ angles; S, bend: residues around which the polypeptide chain makes a sharp turn but which do not meet all criteria for the turn assignment. Subdivisions of the secondary structure are designated as follows: H, α-helix; O, ordered α-helix; do, disordered α-helix. These were determined by giving all residues in an α-helix the ordered, H(O), assignment and then reassigning to the disordered structure, H(do) < 1, those residues with no backbone H-bonds within the helix. β-Sheet subclasses are: AP, antiparallel; par, parallel; long, β-strands more than four residues long; short, β-strands less than five residues long. The RaSP50 columns give values summarizing the characteristics of the crystal structures chosen for the full RaSP50 set. See [32] for the PDB identification codes. Mean, arithmetic mean of all RaSP50 FCs for each listed structure; σ_CS, standard deviation of the FCs; values in the Range column indicate the full dynamic range of FC values found in the RaSP50 crystal structure secondary structure assignments (maximum FC minus minimum); rms, the best rms obtained for this structure type. This value was the optimal rms for full-band normalization strategies. ζ, the information content score (ζ = σ_CS/rms) described in the text. The highest (best) ζ scores for each group of structure types are in bold.

                  RaSP50 crystal structures   CD                    Amide I + II + CD     Amide I + II
Structure         σ_CS    Range   Mean        Fctrs  SECV   ζ       Fctrs  SECV   ζ       Fctrs  SECV   ζ
H                 22.09   74.56   27.5        2      7.61   2.9     1      6.06   3.65    4      6.72   3.29
H(O) < 1          21.82   74.06   26.48       2      7.33   2.98    1      5.98   3.65    4      6.25   3.49
H(do) < 1         0.97    3.57    1.02        2      0.96   1.01    4      0.91   1.06    3      0.87   1.11
E                 17.97   61.58   22.35       4      9.22   1.95    4      6.68   2.69    1      7.25   2.48
E(AP) (global)    17.59   56.58   19.1        5      10.53  1.67    3      6.95   2.53    5      6.34   2.78
E(par) (global)   4.58    18.88   3.26        5      4.52   1.01    7      4.3    1.07    1      4.68   0.98
E(long)           15.32   53.68   14.49       4      10.22  1.5     5      8.43   1.82    4      8.36   1.83
E(short)          6.42    22.95   7.86        1      5.45   1.18    3      5.11   1.26    4      4.89   1.31
C+T+G+S+B         13.2    74.56   50.13       5      8.94   1.48    5      8.21   1.61    3      9.43   1.4

FTIR-CD spectroscopy of proteins (Eur. J. Biochem. 271) © FEBS 2004
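The fit residual defined above, the sum of the absolute differences between the measured and reconstructed spectra, can be sketched in a few lines. The spectra used here are made-up five-point intensity lists chosen only to show the arithmetic; they are not data from the RaSP50 set, and the function name is an assumption.

```python
def fit_residual(original, reconstructed):
    """Sum of absolute point-by-point differences between two spectra.

    Both spectra must be sampled at the same points (same length).
    """
    if len(original) != len(reconstructed):
        raise ValueError("spectra must have the same number of points")
    return sum(abs(o - r) for o, r in zip(original, reconstructed))


# Hypothetical measured spectrum and its reconstruction from model factors.
measured = [0.10, 0.40, 0.90, 0.50, 0.20]
fitted = [0.12, 0.38, 0.85, 0.55, 0.20]

residual = fit_residual(measured, fitted)  # 0.02 + 0.02 + 0.05 + 0.05 + 0.00
```

In practice the residual of an unknown would be compared with the residuals obtained for the reference spectra during cross validation, as discussed in the text.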
By comparing the residuals of an unknown with the residuals of the reference spectra obtained during cross validation, an estimation of reliability could be made. A small residual for a given analysis is usually considered a good indication of an accurate, reliable analysis. Such statistical properties of chemometric analyses are meaningful in systems where spectra vary in a purely concentration-dependent manner, but because protein spectra do not necessarily follow this rule, the quality of the fit may not be related to the accuracy of the analysis. In fact, the fit residuals provide what is perhaps the most convincing indication that statistical analysis of protein spectra is fundamentally different from simpler quantitation systems. The situation is illustrated in Fig. 2A, in which the error for FC_H determination in cross validation of the RaSP50 set is plotted against the spectral residual for the fit to each protein. Similar plots were obtained for all other structure assignments (not shown). The correlation coefficient (a linear regression R²) for the data plotted here is 0.024, indicating that the ability of the algorithm to reconstruct the protein spectrum has essentially no relationship to the accuracy of the analysis.

The Mahalanobis distance of the factors

Another accuracy validation criterion, based on values derived in the model-construction step, has appeared in protein structure analysis method reports. When using the Mahalanobis distance as a reliability evaluation criterion, the set of factor scores for each spectrum is treated as a vector in a coordinate system defined by the factors in a statistical model (f-space). That is, each axis in f-space is one of the factors, and the coordinates of a spectrum in f-space are its factor scores. Typically the score vectors for the reference spectra form an ellipsoid in f-space.
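The distance of a score vector from the center of such an ellipsoid is conventionally expressed as a Mahalanobis distance, which scales each direction by the spread of the reference scores. The following is a minimal sketch restricted to two factors so that the covariance matrix can be inverted in closed form; the reference scores are invented for illustration, and a real model would generally use more factors and a general matrix inverse.

```python
import math


def mahalanobis_2d(score, reference_scores):
    """Mahalanobis distance of a 2-factor score vector from the reference set."""
    n = len(reference_scores)
    if n < 3:
        raise ValueError("need at least three reference score vectors")
    mean_x = sum(p[0] for p in reference_scores) / n
    mean_y = sum(p[1] for p in reference_scores) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the reference scores.
    sxx = sum((p[0] - mean_x) ** 2 for p in reference_scores) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in reference_scores) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in reference_scores) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = score[0] - mean_x, score[1] - mean_y
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2 x 2.
    d_squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d_squared)


# Hypothetical reference scores: twice as much spread along factor 2.
reference = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
d_factor1 = mahalanobis_2d((1.0, 0.0), reference)
d_factor2 = mahalanobis_2d((0.0, 2.0), reference)
```

In this toy set the points (1, 0) and (0, 2) lie at different Euclidean distances from the center but at the same Mahalanobis distance, because the second factor has twice the spread of the first.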
The Mahalanobis distance is a measure of how far from the center of this ellipsoid the score vector for a given spectrum lies. If the score vector for an unknown falls significantly outside the ellipsoid formed by the reference spectra score vectors, the scores for the unknown follow a different pattern than those of the basis set. It can be suggested that a large Mahalanobis distance for an unknown indicates that the statistical analysis algorithm was unable to properly evaluate the structure of the unknown protein. The Mahalanobis distances vs. structure determination error for β-sheet for the RaSP50 set cross validation are shown in Fig. 2B. It is clear that the Mahalanobis distance is also not a useful validation method for protein spectra, at least within the proteins of the RaSP50 basis set. The finding that the Mahalanobis distance does not correlate with analysis accuracy for the RaSP50 set has important consequences for structure determination in that it shows that a novel pattern of factors needed to reconstruct the spectrum of an unknown protein is not a reliable indication of a failed analysis. Conversely, the results are not necessarily reliable for unknowns whose score vectors fall within the same region of f-space as the reference-spectra score vectors.

Fig. 2. Test of potential predictors for the structure prediction accuracy. (A) Relationship between the FC_H,det error (FC determination error for α-helix) and the spectral residual from reconstructing unknowns with the factors from a PLS-1 model. The residual is characterized by the sum of the absolute values of the difference between the actual and reconstructed spectra at each point. These data were obtained from a cross validation of hybrid RaSP50 IR + CD spectra. Spectral preprocessing parameters were optimal. (B) Mahalanobis distances for the factor scores (significant factors only) in the determination of FC_E. These data were obtained from a cross validation of hybrid RaSP50 IR + CD spectra. Spectral preprocessing parameters were optimal for FC_E determination. (C) Comparison of the sum of all determined structures for an unknown (residual for ΣFC_det) with FC determination error. ●, FC_H,det errors of individual proteins; ○, FC_E,det errors. These data were obtained from a cross validation of hybrid RaSP50 IR + CD spectra.

Do the structure fractional contents total 100%?

Another potential measure of statistical analysis error for an unknown is the results themselves. Because the FC values should account for all the residues in a protein, they should total 100%. If the total is not close to 100%, then it is reasonable to question the analysis results. The variable selection method [11], as well as others [35,41], use this as a criterion for evaluating the quality of analysis results. In particular, this was found to be very useful in building the SelCon method [5]. As for the residuals and the scores, the determination errors for individual proteins are compared with this accuracy measure in Fig. 2C. Again, there is no apparent relationship between these quantities. Therefore the sum of FC_det values cannot be used to diagnose analysis accuracy or failure. This finding is important: it indicates that the determination accuracy for each secondary structure type is independent of the other structures that have been analyzed. Therefore, it is not appropriate to disregard the analysis results in their entirety if a single determined FC is questionable.

IR/CD comparison

We propose that a more reliable method of evaluating analysis results is the consistency between analysis results from different structure-sensitive techniques, such as IR and CD.
Because infrared and CD spectra depend on different phenomena, particular structural distortions are likely to have a different effect on each of these spectral types, and these differences can be used to evaluate analysis accuracy. In the simplest case, it is necessarily true that when the FCs obtained from separate analyses of IR and CD spectra of a protein are very different, then at least one of the determined structures must be 'incorrect'. For convenience, we will refer to this difference, specifically FC_IR − FC_CD, as ΔIRCD_det. To illustrate the type of information that is available from ΔIRCD_det, the FC_H determination errors for individual proteins from cross validation of IR-only and CD-only data are plotted against ΔIRCD_det in Fig. 3. An intuitive relationship is revealed by this figure: when the FC_H determined with CD alone is lower than the FC_H from IR alone (ΔIRCD_det is positive), then the CD analysis result tends to strongly underestimate the actual FC_H. A similar relationship holds when the FC_H from IR is the lower value. The linear regression correlation coefficients (R²) for the data plotted in the two panels of Fig. 3 are 0.345 and 0.434, respectively, which indicate a definite relationship. It appears that ΔIRCD_det is the only quantity examined in this study that has any significant correlation with analysis error.

In an attempt to evaluate the potential of the test provided by the ΔIRCD_det measure, the ΔIRCD_det value for each protein was used to divide the RaSP50 members into two subsets with different analysis characteristics. The first subset was defined as the proteins with |ΔIRCD_det| (absolute value of ΔIRCD_det) smaller than 6%. The rms FC_H determination errors calculated for this subset of proteins were rmsE_H,IR = ±4.82% H and rmsE_H,CD = ±4.46% H.
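The subset selection just described can be sketched as follows. The protein names and FC values are hypothetical placeholders used only to show the bookkeeping; the 6% threshold is the one quoted in the text, and the function names are assumptions.

```python
def delta_ircd(fc_ir, fc_cd):
    """Difference between the IR-only and CD-only determinations (FC_IR - FC_CD)."""
    return fc_ir - fc_cd


def split_by_consistency(results, threshold=6.0):
    """Split proteins by whether |FC_IR - FC_CD| is below the threshold (in %).

    `results` maps a protein name to a (FC_IR, FC_CD) pair for one structure type.
    """
    consistent, divergent = [], []
    for name, (fc_ir, fc_cd) in results.items():
        if abs(delta_ircd(fc_ir, fc_cd)) < threshold:
            consistent.append(name)
        else:
            divergent.append(name)
    return consistent, divergent


# Hypothetical helix contents (%), determined separately from IR and CD spectra.
hypothetical_fc_h = {
    "protein_A": (42.0, 39.5),  # difference +2.5
    "protein_B": (30.0, 18.0),  # difference +12.0
    "protein_C": (12.0, 20.5),  # difference -8.5
    "protein_D": (55.0, 57.0),  # difference -2.0
}

consistent, divergent = split_by_consistency(hypothetical_fc_h)
```

Proteins in the second list are those whose IR and CD results disagree by 6% or more, so at least one of the two determinations should be treated with caution.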
Combining these results with the α-helix σ_CS,subset for these 27 proteins in the ζ score equation gives ζ_subset values of 4.24 and 4.58, respectively (compare with data in Table 1). If the hybrid IR + CD spectra analysis results for these same proteins are considered, the rms error is ±4.46% H. These results show that the margin of error for FC_H determination is reduced when the results from separate IR and CD analyses are similar (|ΔIRCD_det| < 6%). However, this accounts for just over half of the proteins in the RaSP50 set. If we consider the remaining proteins, overall the larger determined FC_H was more accurate for 74% of the proteins in this second RaSP50 subset. For β-sheet, a similar observation was made for the ΔIRCD_det > 6% E proteins, but IR was also more accurate for eight out of 11 proteins. Therefore, the more accurate result is likely to come from IR analysis. In conclusion, ΔIRCD_det can be used to identify proteins with anomalous spectra (|ΔIRCD_det| > 6%), and therefore assist in the identification of failed analyses.

Fig. 3. Comparison of FC_det error and the difference between separate CD and IR analysis results for individual proteins (ΔIRCD_det). The abscissa represents the difference between separate analyses of CD and IR spectra, ΔIRCD_det (higher IR values are to the right), from cross validation. The ordinates represent the FC_IR error (FC_IR − FC_CS, top panel) or FC_CD errors (bottom panel) obtained in cross validation. Individual proteins are identified by their RaSP codes.

Comparison of different statistical analysis algorithms

Thus far, the discussion has focused on the optimization of input data for protein structure analysis methods. We will now briefly address the role that the algorithms themselves play in analysis accuracy.
Recently, Sreerama and Woody [6] demonstrated on a large set of CD spectra that the algorithm used (CONTIN, SELCON or CDSSTR) has little effect on the rms. We have used the RaSP50 set to compare different methods on the IR, CD and combined IR/CD spectra over the broad range of structures represented in RaSP50. It was found (Table 3) that the choice of analysis method has only a small effect on analysis accuracy.

By examining selected literature data, it can be observed that there is a relationship between the number of protein folds represented in a basis set and the rms. Contrary to what would be expected, the general trend is for the error to increase with the number of proteins used (e.g. [41]). For the CD analyses the relationships for FC_H and FC_E are well represented by straight lines (not shown). Combining this observation with the frequency of anomalous spectra just described suggests that those authors who have introduced more spectra into their reference sets have increased the number of proteins with anomalous spectra. Through this, they have degraded the quality of the spectra–structure relationship in their statistical models. We suggest that for small protein basis sets we obtain primarily a measure of their internal consistency (lack of anomalous spectra) rather than their expected performance in general.

In order to test this hypothesis, a RaSP50 subset was assembled using 16 of the most common proteins in the IR studies, with attention given to maintaining a broad FC distribution. This set, RaSP16, was tested both in cross validation and on the spectra of the full RaSP50 set. We found that the rms values for the RaSP50 set are generally lower than for the RaSP16 set. However, when the RaSP16 statistical model was used to determine the structures of all the proteins in the RaSP50 set, its accuracy was ≈ 28% (relatively) worse than when predictions were made with the RaSP50 model.
This shows that there is information in the RaSP50 statistical model that is lacking from the RaSP16 model. In a second step, we randomly generated hundreds of other 16-protein databases. Even though the secondary structure prediction evaluated in cross validation generally yielded rms values better than those of the RaSP50 set, none of them was able to satisfactorily describe the RaSP50 proteins left out when building the 16-protein subset.

It is impossible that the proteins in the RaSP50 set are representative of all possible structural distortions, so one can ask how much information may be lacking from the RaSP50 model. Of course, a definitive answer to this question cannot be given, but it is possible to estimate how much information describing anomalous signals is contained in the RaSP50 statistical model. This can be done by comparing the standard errors from cross validation and self validation of the set. Consider that when an anomalous spectrum is removed from the set during cross validation, if the information needed to model that spectrum is not contained in the remaining spectra, then the FC determination error for that protein will be high. This in turn will increase the rms. However, in self validation, all the information contained in the full basis set can be used to model each spectrum. Therefore, the difference between the standard errors of self and cross validations gives an indication of the completeness of the information contained in the basis spectra. For example, the standard errors of self validation for the RaSP16 set were 3.45% H and 2.58% E, which are essentially half as large as the RaSP50 rms values.

The latter analyses demonstrate that when the proteins are carefully selected for the diversity in their structure, no small subset of the database contains the necessary information to describe the entire set.
One conclusion of the paper is therefore that large protein databases, observing stringent selection criteria, are necessary for the prediction

Table 3. Performance comparison of different analysis algorithms with the RaSP50 set. PCA-MR, principal component analysis followed by multiple regression, constrained to a 100% total; PLS, simultaneous partial least-squares analyses of all structure classes, constrained to a 100% total; PLS-1, separate partial least-squares analyses of each structure type with the use of weighting during the spectral decomposition step. SelCon has been described in detail in [5].

                                α-Helix (H)            β-Sheet (E)            Turn (T)               Other (C + G + B + S)
Data               Algorithm    RMS    ζ     R (a)     RMS    ζ     R (a)     RMS    ζ     R (a)     RMS    ζ     R (a)
IR (Amide I + II)  SelCon3      8.5    2.6   0.92      7.73   2.34  0.90      3.52   1.23  0.38      11.58  1.04  0.26
                   PCA-MR       6.91   3.2   0.95      7.64   2.35  0.91      4.38   0.99  0.22      9.27   1.3   0.64
                   PLS          7.29   3.03  0.94      7.58   2.37  0.91      4.36   1     0.21      9.48   1.27  0.62
                   PLS-1        7.16   3.09  0.95      7.36   2.44  0.91      4.31   1.01  0.13      9.49   1.27  0.62
IR + CD            SelCon3      7.57   2.91  0.939     7.97   2.27  0.90      3.97   1.09  0.36      10.30  1.17  0.47
                   PCA-MR       6.83   3.24  0.95      7.23   2.48  0.92      4.23   1.02  0.29      9.26   1.3   0.64
                   PLS          6.8    3.25  0.95      6.97   2.58  0.92      4.3    1.01  0.27      9.06   1.33  0.66
                   PLS-1        6.73   3.28  0.95      6.68   2.69  0.93      4.45   0.97  0.03      9.16   1.31  0.66
CD                 SelCon3      8.15   2.71  0.91      10.43  1.73  0.82      4.74   0.91  0.00      9.70   1.24  0.55
                   PCA-MR       7.97   2.77  0.93      9.37   1.92  0.85      4.55   0.95  0.14      8.93   1.35  0.68
                   PLS          7.72   2.86  0.94      9.47   1.9   0.85      4.47   0.97  0.14      8.96   1.34  0.67
                   PLS-1        7.7    2.87  0.94      9.22   1.95  0.89      4.47   0.97  0.00      9.03   1.33  0.67

(a) The correlation coefficient (R) between the determined and actual FCs for the full RaSP50 set.

[...]...
membrane proteins that there is no systematic difference in the CD spectra of soluble and membrane proteins. Yet, they reported that increasing the number of proteins in the CD spectrum database from soluble proteins is an important factor to improve the prediction. Similarly, the additional inclusion of the CD spectra of membrane proteins brings a small but significant additional improvement. In the field of infrared ...

circular dichroism spectra of proteins applied in a calibration of protein secondary structure. Anal. Biochem. 223, 26–34.
5. Sreerama, N. & Woody, R.W. (1993) A self-consistent method for the analysis of protein secondary structure from circular dichroism. Anal. Biochem. 209, 32–44.
6. Sreerama, N. & Woody, R.W. (2000) Estimation of protein secondary structure from circular dichroism spectra: Comparison of CONTIN, ...

that the transmembrane and peripheral helices could be distinguished on the basis of their deconvolved CD spectrum [46,47]. Wallace investigated the performance of soluble protein sets of CD spectra in analyzing membrane protein CD spectra. The conclusion was that the soluble protein reference set of CD spectra yields inaccurate results for membrane protein CD spectra [48,49]. Conversely, Sreerama and ...

(1996) Determination of Protein Secondary Structure. In Circular Dichroism and the Conformational Analysis of Biomolecules (Fasman, G.D., ed.), pp. 69–107. Plenum Press, New York.
16. Venyaminov, S.Y. & Vassilenko, K.S. (1994) Determination of protein tertiary structure class from circular dichroism spectra. Anal. Biochem. 222, 176–184.
17. Andrade, M.A., Chacon, P., Merelo, J.J. & Moran, F. (1993) Evaluation of secondary ...
structure from Fourier transform infrared and/or circular dichroism spectra. Anal. Biochem. 214, 366–378.
20. Lee, D.C., Haris, P.I., Chapman, D. & Mitchell, R.C. (1990) Determination of protein secondary structure using factor analysis of infrared spectra. Biochemistry 29, 9185–9193.
21. Byler, D.M. & Susi, H. (1986) Examination of the secondary structure of proteins by deconvolved FTIR spectra. Biopolymers 25, 469–487 ...

(H2O) solutions. III. Estimation of the protein secondary structure. Biopolymers 30, 1273–1280.
11. Manavalan, P. & Johnson, W.C. (1987) Variable selection method improves the prediction of protein secondary structure from circular dichroism spectra. Anal. Biochem. 167, 76–85.
12. Provencher, S.W. & Glockner, J. (1981) Estimation of globular protein secondary structure from circular dichroism. Biochemistry 20, 33–37 ...

Director at the Belgian National Fund For Scientific Research, Belgium.

References

1. Baumruk, V., Pancoska, P. & Keiderling, T.A. (1996) Predictions of secondary structure using statistical analyses of electronic and vibrational circular dichroism and Fourier transform infrared spectra of proteins in H2O. J. Mol. Biol. 259, 774–791.
2. Rahmelow, K. & Hübner, W. (1996) Secondary structure determination of proteins ...

of unknown proteins. A second important conclusion of the paper is that only the comparison of analyses run on CD and IR spectra independently is able to identify failed solutions in the absence of known structure. As far as the specific case of membrane protein is concerned, the issue has been raised a number of times but it is now definitively ...
SELCON, and CDSSTR methods with an expanded reference set. Anal. Biochem. 287, 252–260.
7. Perczel, A., Hollosi, M., Tusnady, G. & Fasman, G.D. (1991) Convex constraint analysis: a natural deconvolution of circular dichroism curves of proteins. Protein Eng. 4, 669–679.
8. Dousseau, F. & Pézolet, M. (1990) Determination of the secondary structure content of proteins in aqueous solutions from their amide I and amide ...

reflection Fourier transform infrared spectroscopy: significance of the pH. Appl. Spectrosc. 50, 1519–1527.
41. Pancoska, P., Bitto, E., Janota, V., Urbanova, M., Gupta, V.P. & Keiderling, T.A. (1995) Comparison of and limits of accuracy for statistical analyses of vibrational and electronic circular dichroism spectra in terms of correlations to and predictions of protein secondary structure. Protein Sci. 4, 1384–1401.

transmembrane helices and peripheral helices by the deconvolution of circular dichroism spectra of membrane proteins. In Circular Dichroism and the Conformation of Biomolecules

Date posted: 16/03/2014, 18:20
