Data based system design and network analysis tools for chemical and biological processes

DATA BASED SYSTEM DESIGN AND NETWORK ANALYSIS TOOLS FOR CHEMICAL AND BIOLOGICAL PROCESSES

RAO RAGHURAJ
(M.Tech., I.I.T. Bombay, India)
(B.Engg., K.R.E.C., Surathkal, India)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF CHEMICAL AND BIOMOLECULAR ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2008

with infinite gratitude, respect and affection
DEDICATED TO
BHAGOJI TEACHER
(my high school teacher, mentor and godmother)
for your relentless, lifelong, philanthropic effort in grooming hundreds of novices like me

Acknowledgement

‘Thanks’ will be a mere word to express my immense gratitude to all those who have helped me in my research progress and, more so, in shaping my PhD into an enriching and memorable experience. I wish to sincerely acknowledge here all the encouragement and support I received, directly or indirectly, from different persons at different times.

The research guidance that I received from Dr. Lakshminarayanan Samavedham at NUS was much more than what I wished for before coming to NUS. I express my sincere gratitude and countless thanks to him for being a splendid supervisor. Without his immense support, timely inputs, precise guidance and encouragement, my progress would have been impossible. I fall short of words to explain his influence on me and my research. In him, I have found a guide, a mentor, a very good motivator, a good friend for life and, above all, a complete human being I can look up to. I express my feelings and infinite respect to this complete teacher using a divine saying “Guru saakshaat parabhrahma tasmai shree Guruvennamaha ”. Thank you Sir, thanks a lot for everything.

I express my sincere thanks to Dr. Pawan Dhar (principal scientist, synthetic biology lab, RIKEN research institute, Yokohama, Japan) and members of his team for providing me a valuable opportunity to carry out an internship at RIKEN. I surely learnt a lot about systems biology from you all.
Special thanks to Kyaw for all the help and support during my stay in Japan. It was indeed a great experience working with biologists involved in interesting investigations.

I must thank Prof. Sanjay and the other group members of Small Molecular Biology Lab, Department of Biological Sciences, NUS (especially Gauri and Sheela) for associating me in their work and utilizing some of the data analysis tools developed in this research for their project. Similarly, I wish to thank my friend Umid Joshi and his supervisor Prof. Rajasekar Balasubramanian (Div. of Environmental Science and Engineering) for involving me as data analyst for their project.

I extend my thanks to Prof. M.S. Chiu and Prof. Sanjay Swarup for their kind acceptance to be on the panel of examiners and for valuable suggestions for planning this research during the qualifying exam. I also thank the final reviewers for spending time on evaluating this thesis. I wish to admire and thank the unknown reviewers of our publications, who gave constructive feedback on all our manuscripts and helped us to bring out the best of this research to the community.

I wondered many times, what would have happened if there were no publications, no public databases? Yes, I must take this opportunity to appreciate and thank all those dedicated researchers who have made their findings publicly available in the true spirit of knowledge sharing. Their contributions in the form of literature, notes on their websites, email correspondence and freely available ready-to-use online datasets have indirectly strengthened this research work. I also express my sincere gratitude to all the professors at ChBE/NUS whose valuable lectures/seminars have put some intriguing thoughts in me, contributing good ideas to this research.

My experience as part-time research assistant to the ChemBioSys group was truly enriching. I specially thank Prof. Rangaiah, Prof. Karimi, Prof. Chiu and Dr. Laksh for involving me in the new projects and providing me a good chance to learn more. It was indeed a privilege to work with you all and a great value addition. Special thanks to my department (ChBE) for giving me an opportunity to teach undergraduate students (which truly added color to my experiences at NUS) and also for financially supporting my conference visits and internship at RIKEN. I also thank my supervisor, my department and NUS for recognizing my performance and for awarding me the prestigious President's Graduate Fellowship. It is surely an honor that I will cherish long.

My affectionate thanks to all my labmates and other friends at NUS for their fantastic company and useful interactions. I additionally thank Balaji for being a motivating flagship PhD student of our group. Friends, thanks a lot for making the IPC lab a great place to work and enjoy research. You all have been part of my wonderful times in NUS.

At this moment when I am going for my highest qualification, I remember and thank all my professors, students and precious friends who trained, tuned and inspired me to be what I am today. The eventful journey, so far, would not have been wonderful without all your contributions.

My family members are always the source of inspiration. They continue to motivate me to do better in whatever I am doing. Their blessings and faith in me were the main driving force during the course of my PhD. I am ever grateful and indebted to you all.

My dear wife Gaana. Well, I know she is beside me in all the above acknowledgments, yet my heart longs for a special note for her. Love you for your invaluable support in all my endeavors. You have indeed been a special gift in my life.

Rao Raghuraj

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
NOMENCLATURE
ABBREVIATIONS
SUMMARY
1 Introduction
    1.1 Information revolution and its impact
    1.2 ChemBioSys - a new paradigm of systems research
    1.3 Analysis techniques in the data rich IT era
    1.4 Motivation for current research
    1.5 Scope of the present work
    1.6 Organization of the thesis
2 System Design and Characterization - An Overview
    2.1 Process Systems and Analysis
        2.1.1 Challenges of modern process systems analysis
    2.2 Biological Systems and Analysis
        2.2.1 Challenges for analyzing biological systems
        2.2.2 Computational biology
        2.2.3 Systems biology
    2.3 Complex systems and network analysis
        2.3.1 Challenges in analyzing complex networks
        2.3.2 Networks in biological systems
    2.4 Chemical Engineering in Biology
        2.4.1 Possible PSE contributions in systems biology
    2.5 Systems Analysis Approaches
        2.5.1 System modeling approaches
        2.5.2 Data analysis tools and techniques
3 Variable Selection Tools for Data Analysis
    3.1 Variable selection problem - overview
    3.2 Variable Interaction Network based variable selection - new concept
        3.2.1 Concept of partial correlations
        3.2.2 Partial correlation based VIN synthesis
        3.2.3 VIN based graph theoretic variable importance measure
        3.2.4 VIN based variable selection algorithm
    3.3 VIN based variable selection for Classification
        3.3.1 Implementation of VIN algorithm for classification problems
        3.3.2 Classifiers used for analysis
        3.3.3 Variable selection methods used for comparison
        3.3.4 Case studies
        3.3.5 Results and Analysis
    3.4 VIN based variable selection for Multi-Variate Calibration
        3.4.1 Multi-Variate Calibration - important chemometric tool
        3.4.2 Implementation of VIN algorithm for MVC problems
        3.4.3 Methods used for calibration and comparison
        3.4.4 Illustration - VIN approach for multivariate calibration
        3.4.5 Case studies
        3.4.6 Results and Analysis
4 Classification Tools for Discriminant Analysis
    4.1 Data Classification - overview
        4.1.1 Existing classification techniques
        4.1.2 Motivation and Objectives for designing a new classifier
        4.1.3 Variable Dependency Structure based classification approach
    4.2 Discriminant Partial Correlation Coefficient Metric - DPCCM classifier
        4.2.1 PCCM for classification - DPCCM approach
        4.2.2 DPCCM Algorithm and Implementation
        4.2.3 DPCCM illustration with Iris data
        4.2.4 Analysis of product quality - DPCCM case study
    4.3 Variable Predictive Model based Class Discrimination - VPMCD classifier
        4.3.1 Concept of Variable Predictive Models
        4.3.2 VPMCD approach
        4.3.3 Geometric Interpretation of VPMCD approach
        4.3.4 VPMCD implementation
        4.3.5 VPMCD illustration with Iris Data
        4.3.6 Illustration of effect of variable associations on classifier
        4.3.7 Protein structure prediction - VPMCD case study
    4.4 Genetic Programming Model based Class Discrimination - GPMCD classifier
        4.4.1 Genetic Programming - overview
        4.4.2 Genetic Programming Models - alternate VPM concept
        4.4.3 GPMCD approach
        4.4.4 Important ChemBioSys classification problems - GPMCD case studies
    4.5 Conclusions
5 Design Tools for Network Synthesis
    5.1 Network Design - important system biology problem
        5.1.1 Protein-Protein Interaction Network: overview
    5.2 Aminoacid Residue Association based PPI prediction: VIN-NS technique
        5.2.1 Establishing residue-residue correlations for protein pairs
        5.2.2 Aminoacid Residue Association (ARA) models for PPI prediction
    5.3 PPI prediction case studies
        5.3.1 Collection and preparation of PPI datasets
        5.3.2 PPI prediction performance measures
        5.3.3 Results and Discussion
    5.4 Observations and Conclusions
6 Complex Network Analysis Techniques
    6.1 Complex Networks - overview
        6.1.1 Network terminology and properties
        6.1.2 Network complexity measures
        6.1.3 Classes of complex networks
        6.1.4 Stability analysis of networks
        6.1.5 Motivation for new complexity measures
    6.2 Complexity measures based on cyclical network motifs
        6.2.1 Definition of new complexity indices
        6.2.2 Cycle complexity based network analysis
    6.3 Results and Discussion
        6.3.1 Complexity analysis of simulated networks
        6.3.2 Complexity analysis of real world networks
        6.3.3 Robustness in biological networks - CyC analysis
    6.4 Conclusions
7 Contributions and Recommendations
    7.1 Summary of research contributions
    7.2 Contributions to other collaborative projects
    7.3 Recommendations for future work
LIST OF REFERENCES
A Public Domain Datasets and ChemBioSys relevant Online Literature
B Computational Resources available Online
PUBLICATIONS
VITA

LIST OF TABLES

1.1 Information revolution and its impact - important changes in last three decades
3.1 Sample correlation coefficient matrix R_VIN for variable ranking - Wine classification data
3.2 VIN based variable selection algorithm results for re-substitution test
3.3 Comparison of VIN method with other variable selection algorithms - Cross validation test results
3.4 Details of MVC datasets used and corresponding VIN-PLS tuning results
3.5 Prediction test results (RMSEP) for VIN-PLS analysis for different case studies
4.1 DPCCM performance analysis for WINE data vis-à-vis other classifiers
4.2 DPCCM performance analysis for CHEESE data vis-à-vis other classifiers
4.3 List and model details for various possible VPMs used to construct VIN for VPMCD classifier
4.4 Group wise VPM design and VPMCD analysis for Iris Data
4.5 Resubstitution test results using different classifiers for protein datasets
4.6 Jackknife (LOOCV) test results using different classifiers for protein datasets
4.7 Effect of model order r on VPMCD (with QI model type) performance for SCOP277 dataset
4.8 Effect of model types on VPMCD (r = 3) performance for SCOP277 dataset
4.9 VPMCD performance for low homology data compared with best results reported by [276]
4.10 GPMCD case studies: classification problems and dataset details
4.11 Sample GPM_ik generated during GPMCD analysis for different case studies
4.12 GPMCD performance analysis in comparison with existing classifiers
5.1 Positively interacting protein datasets used for PPI prediction
5.2 Performance of ARA model based PPI prediction for different organisms
6.1 Complex networks used for analysis
6.2 Complexity analysis using different measures on selected networks

LIST OF FIGURES

1.1 Scope of the present work - research depth, breadth and width
2.1 Modeling Approaches: different strategies for systems representation
3.1 Hypothetical VIN representing different schemes of variable association a) Undirected VIN b) directed VIN with all nodes influencing Xi c) Xi influencing all the nodes
3.2 Two variable scatter plots for Fisher Iris data. a) SW vs. SL b) PL vs. PW.
3.3 VIN variable selection approach as implemented for data classification
3.4 Variable Interaction Network for WINE data. Generated using the matrix in Table 3.1
3.5 Effect of partial correlation order r on the VIN-LDA analysis for Wine data
3.6 Variable selection analysis result for Iris data (case study I). a) RI distribution for variables b) LDA re-substitution test results using different algorithms
3.7 Variable selection analysis result for FDD data (case study II) a) RI distribution b) LDA re-substitution performance using different algorithms
3.8 Variable selection analysis for Cancer tumor classification using full set (Case study III)
3.9 Variable selection analysis for Cancer tumor classification (Case study IIIA) using PCA dimensions
3.10 Variable selection analysis for Cancer tumor classification (Case study IIIB) using cluster average gene expression a) RI distribution b) LDA re-substitution performance
3.11 Variable selection analysis for Wine data (case study IV) a) RI distribution b) LDA re-substitution test performance using different algorithms
3.12 VIN analysis using CART classifier on Wine dataset
3.13 VIN analysis using ANN classifier on Wine dataset
3.14 Centroid analysis for Iris data. Profile of variable averages for the three classes
3.15 Generalized flow chart describing steps involved in VIN based variable ranking method for multivariate calibration
3.16 Sample profiles for simulated multivariate calibration dataset X [100 × 11]
3.17 Variable interaction network details for simulated MVC problem a) VIN using r = b) VIN for r =
3.18 Variable selection analysis for simulated MVC data using PLS calibration
3.19 Relative Importance distribution for variables in ANALYTE data
Lewis, Outliers in statistical data. Chichester (West Sussex), New York: Wiley, 3rd ed., 1994. [174] J. P. C. Kleijnen and J. C. Helton, “Statistical analysis of scatterplots to identify important factors in large scale simulations : Review and comparison of techniques,” Reliability Engineering and Systems Safet, vol. 65, pp. 147 – 185, 1999. [175] R. R. Sokal and F. J. Rohlf, Biometry : The principles and practice of statistics in biological research. NewYork: W.H.Freeman and co., 1995. [176] P. Iglesias, H. Jorquera, and W. Palma, “Data analysis using regression models with missing observations and long-memory: an application study,” Computational Statistics and Data Analysis, vol. 50, no. 8, pp. 2028–2043, 2006. [177] J. Robert, Elementary Statistics. Boston: PWS-KENT, 5th ed., 1988. [178] Y. Kharin and E. Zhuk, “Filtering of multivariate samples containing outliers for clustering,” Pattern Recognition Letters, vol. 19, pp. 1077 – 1085, 1998. [179] R. G. Brown and P. Y. C. Hwang, Introduction to random signals and applied Kalman filtering : with MATLAB exercises and solutions. New York: Wiley, 3rd ed., 1997. [180] T. Beibarth, K. Fellenberg, B. Brors, R. Arribas Prat, J. M. Boer, N. C. Hauser, M. Scheiderler, and M. Vingron, “Processing and quality control of DNA array hybridization data,” Bioinformatics, vol. 16, no. 11, pp. 1014 –1022, 2000. [181] D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining. MA: MIT Press, 2001. [182] G. J. McLachlan, Discriminant analysis and statistical pattern recognition. New York: Wiley Interscience, 1992. [183] I. S. Dhillon, D. S. Modha, and W. S. Spangler, “Class visualization of high-dimensional data with applications,” Computational Statistics and Data Analysis, vol. 41, pp. 59–90, 2002. [184] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2nd edition ed., 2000. [185] E. Narayanan, J. Bino, M. Nebojsa, F. Andras, A. I. Valentin, P. Ursula, C. S. Ashley, A. M. Marc, M. S. 
Madhusudhan, Y. Bozidar, and S. Andrej, “Tools for comparative protein structure modeling and analysis,” Nucleic Acids Research, vol. 31, pp. 3375–3380, 2003. [186] E. Martz, “Protein Explorer: easy yet powerful macromolecular visualization,” Trends in Biochemical Sciences, vol. 27, pp. 107–109, 2002. [187] G. N. Ramachandran, C. Ramakrishnan, and D. Prashant, “Conformation of polypeptide and proteins,” Journal of Molecular Biology, vol. 7, pp. 95–99, 1963. [188] S. S. Sheik, P. Sundararajan, A. S. Z. Hussain, and K. Sekar, “Ramachandran plot on the web,” Bioinformatics, vol. 18, pp. 1548–1549, 2002. [189] E. Oja and S. Kaski, Kohonen maps. New York: Elsevier, 1999. [190] M. J. Greenacre and J. Blasius, Correspondence analysis in the social sciences : recent developments and applications. London: Academic Press, 1994. 255 [191] L. H. Chiang, E. L. Russell, and R. D. Braatz, “Fault diagnosis in chemical processes using Fisher Discriminant Analysis, Discriminant Partial Least Squares, and Principal Component Analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 50, pp. 243 – 252, 2000. [192] Y. Tominaga, “Comparative study of class data analysis with PCA-LDA, SIMCA, PLS, ANN and k-NN,” Chemometrics and Intelligent Laboratory Systems, vol. 49, pp. 105 – 115, 1999. [193] H. Schmidli, “Multivariate prediction for QSAR,” Chemometrics and Intelligent Laboratory Systems, vol. 37, pp. 125–134, 1997. [194] Y. Bard, Nonlinear parameter estimation. New York: Academic Press, 1974. [195] R. L. Iman and W. J. Conover, “A measure of top-down correlation,” Technometrics, vol. 29, no. 3, pp. 351–357, 1987. [196] J. Pearl, Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press, 2000. [197] R. Steuer, J. Kurths, O. Fiehn, and W. Weckwerth, “Interpreting correlations in metabolomic networks.,” Bioch. Soc. Trans, vol. 31, pp. 1476–1478, 2003. [198] M. Li, Y. Huang, and Y. 
Xiao, “Nonlinear correlations of protein sequences and symmetries of their structures,” Chinese Physics Letters, vol. 22, no. 4, pp. 1006–1009, 2005. [199] R. Steuer, J. Kurths, O. Fiehn, and W. Weckwerth, “Observing and interpreting correlations in metabolomic networks,” Bioinformatics, vol. 19, pp. 1019–1026, 2003. [200] B. Huang and S. L. Shah, Performance assessment of control loops : theory and applications. New York: Springer, 1999. [201] T. J. Harris, C. T. Seppala, and L. D. Desborough, “A review of performance monitoring and assessment techniques for univariate and multivariate control systems,” Journal of Process Control, vol. 9, pp. 1–17, 1999. [202] M. Jain and S. Lakshminarayanan, “Estimating performance enhancement with alternate control strategies for multiloop control systems,” Chemical Engineering Science, vol. 62, pp. 4644–4658, 2007. [203] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gassenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, D. D. Bloomfield, and E. S. Lander, “Molecular classification of Cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 15, pp. 531–537, 1999. [204] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher, “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Mol. Biol. Cell, vol. 9, no. 12, pp. 3273 – 3297, 1998. [205] S. Dudoit, J. P. Shaffer, and J. C. Boldrick, “Multiple hypothesis testing in microarray experiments,” Statistical Science, vol. 18, pp. 71–103, 2003. [206] A. Lindlof and B. Olsson, “Could correlation-based methods be used to derive genetic association networks?,” Information Sciences, vol. 146, pp. 103–113, 2002. [207] L. Zhang and A. K. Nandi, “Fault classification using genetic programming,” Mechanical Systems and Signal Processing, vol. 21, pp. 1273 – 1284, 2007. [208] J. K. 
Kishore, L. M. Patnaik, V. Mani, and V. K. Agrawal, “Application of genetic programming for multicategory pattern classification,” IEEE transactions on evolutionary computation, vol. 4, no. 3, pp. 242 – 258, 2000. 256 [209] A. Jagota, Data Analysis and Classification for Bioinformatics. London: Bay Press, 2000. [210] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. California: Wadsworth International Group, 1984. [211] J. David and N. Marchette, Random Graphs for Statistical Pattern Recognition. Willey Inter Science, 2004. [212] G. P. McCabe, “Principal variables,” Technometrics, vol. 26, pp. 137–144, 1984. [213] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003. [214] I. Chong and H. J. Chi, “Performance of some variable selection methods when multicollinearity is present,” Chemometrics and Intelligent Laboratory Systems, vol. 78, pp. 103– 112, 2005. [215] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search. Cambride, USA: MIT Press, ed., 2000. [216] Y. Saeys, I. Inza, and P. Larranaga, “A review of feature selection techniques in bioinformatics,” Bioinformatics Advance Access, 2007. [217] V. Vapnik, Statistical Learning Theory. New York: Wiley-Interscience, 1998. [218] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, pp. 179–188, 1936. [219] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, UCI Repository of machine learning databases. CA: University of California, Department of Information and Computer Science, 1998. [220] W. Wu, Q. Guo, D. J. Rimbaud, and D. L. Massart, “Using contrasts as data pretreatment method in pattern recognition of multivariate data,” Chemometrics and Intelligent Laboratory Systems, vol. 45, pp. 39–53, 1999. [221] W. Wu, Q. Guo, B. C. Massart, D.L., and S. 
Jong, “Structure preserving feature selection in PARAFAC using a genetic algorithm and Procrustes analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 65, pp. 83–95, 2003. [222] M. H. Zhang, Q. S. Xu, F. Daeyaert, P. J. Lewi, and D. L. Massart, “Application of boosting to classification problems in chemometrics,” Analytica Chimica Acta, vol. 544, pp. 167–176, 2005. [223] F. Questier, R. Put, D. Coomans, B. Walczak, and Y. V. Heydena, “The use of CART and multivariate regression trees for supervised and unsupervised feature selection,” Chemometrics and Intelligent Laboratory Systems, vol. 76, pp. 45–54, 2005. [224] J. Hua, Z. Xiong, J. Lowey, E. Suh, and E. R. Dougherty, “Optimal number of features as a function of sample size for various classification rules,” Bioinformatics, vol. 21, pp. 1509– 1515, 2005. [225] I. E. Frank and R. Todeschini, The Data Analysis Handbook. New York: Elsevier, 1994. [226] R. Leardi and A. L. Gonzalez, “Genetic algorithms applied to feature selection in PLS regression: how and when to use them,” Chemometrics and Intelligent Laboratory Systems, vol. 41, pp. 195–207, 1998. [227] H. Yoshida, R. Leardi, K. Funatsu, and K. Varmuza, “Feature selection by genetic algorithms for mass spectral classifiers,” Analytica Chimica Acta, vol. 446, pp. 483–492, 2001. 257 [228] M. Dettling and P. Bulhmann, “Boosting for tumor classification with gene expression data,” Bioinformatics, vol. 19, no. 9, pp. 1061–1069, 2003. [229] C. K. Yoo and P. A. Vanroleghem, “Interpreting patterns and analysis of acute leukemia gene expression data by multivariate fuzzy statistical analysis,” Computers and Chemical Engineering, vol. 29, pp. 1345–1356, 2005. [230] H. Martens and T. Naes, Multivariate Calibration. Chichester: John Wiley and Sons, 1989. [231] T. Naes and H. Martens, “Multivariate calibration II. Chemometric methods,” Trends in Analytical Chemistry, vol. 3, pp. 266–271, 1984. [232] A. Bello, F. Bianchi, M. Careri, M. Giannetto, G. Mori, and M. 
Musci, “Multivariate calibration on NIR data: Development of a model for the rapid evaluation of ethanol content in bakery products,” Analytica Chimica Acta, vol. 603, pp. 8–12, 2007. [233] S. Armenta, S. Garrigues, and M. d. l. Guardia, “Determination of edible oil parameters by near infrared spectrometry,” Analytica Chimica Acta, vol. 596, pp. 330–337, 2007. [234] M. M. Reis, H. H. Araujo, C. Sayer, and R. Giudici, “Spectroscopic on-line monitoring of reactions in dispersed medium: Chemometric challenges,” Analytica Chimica Acta, vol. 595, pp. 257–265, 2007. [235] S. A. Dodds and W. P. Heath, “Construction of an online reduced-spectrum NIR calibration model from full spectrum data,” Chemometrics and Intelligent Laboratory Systems, vol. 76, pp. 37–43, 2005. [236] E. Zamprogna, M. Barolo, and D. E. Seborg, “Estimating product composition profiles in batch distillation via partial least squares regression,” Control Engineering Practice, vol. 12, pp. 917–929, 2004. [237] K. Tang and T. Li, “Comparison of different partial least-squares methods in quantitative structureactivity relationships,” Analytica Chimica Acta, vol. 476, pp. 85–92, 2003. [238] Q. Shen, J. H. Jiang, and Y. R. Q. Shen, G. L., “Variable selection by an evolution algorithm using modified Cp based on MLR and PLS modeling: QSAR studies of carcinogenicity of aromatic amines,” Analytical and Bioanalytical Chemistry, vol. 375, pp. 248–254, 2003. [239] B. K. Alsberg, A. M. Woodward, M. K. Winsona, J. J. Rowland, and D. B. Kell, “Variable selection in wavelet regression models,” Analytica Chimica Acta, vol. 368, pp. 29–44, 1998. [240] F. Bieterle, S. Busche, and G. Gauglitz, “Growing neural networks for a multivariate calibration and variable selection of time-resolved measurements,” Analytica Chimica Acta, vol. 490, pp. 71–83, 2003. [241] A. Olivieri, H. Goicoechea, and F. 
Inon, “MVC1: an integrated MatLab toolbox for firstorder multivariate calibration,” Chemometrics and Intelligent Laboratory Systems, vol. 73, pp. 189–197, 2004. [242] C. Abrahamsson, J. Johansson, A. Sparen, and F. Lindgren, “Comparison of different variable selection methods conducted on NIR transmission measurements on intact tablets,” Chemometrics and Intelligent Laboratory Systems, vol. 69, pp. 3–12, 2003. [243] M. Forina, S. Lanteri, M. C. Oliveros, and C. P. Millan, “Selection of useful predictors in multivariate calibration,” Analytical and Bioanalytical Chemistry, vol. 380, pp. 397–418, 2004. [244] A. D. Walmsley, “Improved variable selection procedure for multivariate linera regression,” Analytica Chimica Acta, vol. 354, pp. 225–232, 1997. 258 [245] R. Leardi, M. B. Seasholtz, and R. J. Pell, “Variable selection for multivariate calibration using a genetic algorithm: prediction of additive concentrations in polymer films from Fourier transform-infrared spectral data,” Analytica Chimica Acta, vol. 461, pp. 189–200, 2002. [246] I. Esteban-Diez, J. Gonzalez-Saiz, C. Pizarro, and M. Forina, “GA-ACE: Alternating conditional expectations regression with selection of significant predictors by genetic algorithms,” Analytica Chimica Acta, vol. 555, pp. 96–106, 2006. [247] D. Broadhurst, R. Goodacre, A. Jones, J. J. Rowland, and D. B. Kell, “Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry,” Analytica Chimica Acta, vol. 348, pp. 71–86, 1997. [248] Matlab, MATLAB 7.0.4. MA, Natick, USA: The MathWorks Inc., 2005. [249] R. Kramer, Chemometric Techniques for Quantitative Analysis. New York: Marcel Dekker, 1998. [250] P. Geladi and B. R. Kowalski, “Partial Least Squares regression: a tutorial,” Analytica Chimica Acta, vol. 185, pp. 1–17, 1986. [251] S. Lakshminarayanan, S. L. Shah, and K. 
Nandakumar, “Modeling and control of multivariable processes: dynamic PLS approach,” A.I.Ch.E. Journal, vol. 43, pp. 2307–2322, 1997.
[252] E. Russell, L. H. Chiang, and R. D. Braatz, Data-driven Methods for Fault Detection and Diagnosis in Chemical Processes. London: Springer, 2000.
[253] H. Kubinyi, “Evolutionary variable selection in regression and PLS analyses,” Journal of Chemometrics, vol. 10, pp. 119–133, 1996.
[254] H. C. Goicoechea and A. C. Olivieri, “Wavelength selection for multivariate calibration using a genetic algorithm: a novel initialization strategy,” Journal of Chemical Information and Computer Sciences, vol. 42, pp. 1146–1153, 2002.
[255] J. P. Gauchi and P. Chagnon, “Comparison of selection methods of explanatory variables in PLS regression with application to manufacturing process data,” Chemometrics and Intelligent Laboratory Systems, vol. 58, pp. 171–193, 2001.
[256] M. Forina, G. Drava, C. Armanino, R. Boggia, S. Lanteri, R. Leardi, P. Corti, P. Conti, R. Giangiacomo, C. Galliena, R. Bigoni, I. Quartari, C. Serra, D. Ferri, O. Leoni, and L. Lazzeri, “Transfer of calibration function in near-infrared spectroscopy,” Chemometrics and Intelligent Laboratory Systems, vol. 27, pp. 189–203, 1995.
[257] J. V. Andres, D. L. Massart, C. Menardo, and C. Sterna, “Correction of non-linearities in spectroscopic multivariate calibration by using transformed original variables. Part II. Application to principal component regression,” Analytica Chimica Acta, vol. 389, pp. 115–130, 1999.
[258] G. L. Cueto, M. Ostra, and C. Ubide, “Linear and non-linear multivariate calibration methods for multicomponent kinetic determinations in cases of severe non-linearity,” Analytica Chimica Acta, vol. 405, pp. 285–295, 2000.
[259] L. Zhang, Y. Z. Liang, J. H. Jiang, R. Q. Yu, and K. T. Fang, “Uniform design applied to nonlinear multivariate calibration by ANN,” Analytica Chimica Acta, vol. 370, pp. 65–77, 1998.
[260] T. M. Mitchell, Machine Learning.
New York: McGraw-Hill, 1997.
[261] A. K. Jain, R. P. W. Duin, and J. Mao, “Statistical pattern recognition: a review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 4–37, 2000.
[262] D. J. Hand and K. Yu, “Idiot’s Bayes - not so stupid after all,” International Statistical Review, vol. 69, pp. 385–399, 2001.
[263] S. Kotsiantis and P. Pintelas, “Increasing the classification accuracy of simple Bayesian classifier,” Lecture Notes in Artificial Intelligence, vol. 3192, pp. 198–207, 2004.
[264] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, pp. 131–163, 1997.
[265] N. Friedman, M. Goldszmidt, and T. Lee, “Bayesian network classification with continuous attributes: getting the best of both discretization and parametric fitting,” Proceedings of the Fifteenth International Conference on Machine Learning, pp. 179–187, 1998.
[266] J. H. Friedman, “Regularized discriminant analysis,” Journal of the American Statistical Association, vol. 84, pp. 165–175, 1989.
[267] J. R. Quinlan, C4.5: programs for machine learning. San Francisco: Morgan Kaufmann, 1993.
[268] J. H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine; technical report on Treenet. USA: http://www.salfordsystems.com/doc/GreedyFuncApproxSS.pdf, 1999.
[269] R. G. Brereton, Chemometrics: data analysis for the laboratory and chemical plant. Chichester, England: John Wiley and Sons Ltd., 2002.
[270] P. M. Granitto, F. Gasperi, F. Biasioli, E. Trainotti, and C. Furlanello, “Modern data mining tools in descriptive sensory analysis: A case study with a Random forest approach,” Food Quality and Preference, vol. 18, pp. 681–689, 2007.
[271] S. Canu, Y. Grandvalet, V. Guigue, and A. Rakotomamonjy, SVM and Kernel methods Matlab Toolbox. INSA de Rouen - France: Perception Systemes et Information, 2005.
[272] R. K. Rao and S.
Lakshminarayanan, “Partial correlation based variable selection approach for multivariate data classification methods,” Chemometrics and Intelligent Laboratory Systems, vol. 86, pp. 68–81, 2007.
[273] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, and H. Weissig, “The Protein Data Bank,” Nucleic Acids Research, vol. 28, pp. 235–242, 2000.
[274] A. Murzin, S. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, pp. 536–540, http://scop.mrc-lmb.cam.ac.uk/scop/, 1995.
[275] NCBI, National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/: U.S. National Library of Medicine, USA, 1988.
[276] L. A. Kurgan and L. Homaeian, “Prediction of structural classes for protein sequences and domains - Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy,” Pattern Recognition, vol. 39, pp. 2323–2343, 2006.
[277] K. C. Chou, “Prediction of protein structural classes and sub cellular locations,” Current Protein and Peptide Science, vol. 1, pp. 171–208, 2000.
[278] G. P. Zhou, “An Intriguing Controversy over Protein Structural Class Prediction,” Journal of Protein Chemistry, vol. 17, pp. 729–738, 1998.
[279] K. Nagano, “Triplet information in helix prediction applied to the analysis of supersecondary structures,” Journal of Molecular Biology, vol. 109, pp. 251–274, 1977.
[280] A. Lupas, M. V. Dyke, and J. Stock, “Predicting coiled coils from protein sequences,” Science, vol. 252, pp. 1162–1164, 1991.
[281] K. C. Chou, “A key driving force in determination of protein structural classes,” Biochemical and Biophysical Research Communications, vol. 264, pp. 216–224, 1999.
[282] N. Qian and T. Sejnowski, “Predicting the secondary structure of globular proteins using neural network models,” Journal of Molecular Biology, vol. 202, no. 4, pp. 865–884, 1988.
[283] Y. D. Cai and G. P.
Zhou, “Prediction of protein structural classes by neural network,” Biochimie, vol. 82, pp. 783–785, 2000.
[284] Y. D. Cai, K. R. Feng, W. C. Lu, and K. C. Chou, “Using logitboost classifier to predict protein structural classes,” Journal of Theoretical Biology, vol. 232, pp. 1–5, 2005.
[285] S. Muggleton, R. D. King, and M. J. E. Sternberg, “Protein secondary structure prediction using logic-based machine learning,” Protein Engineering, vol. 5, pp. 647–657, 1992.
[286] Y. Cao, S. Liu, L. Zhang, J. Qin, J. Wang, and K. Tang, “Prediction of protein structural class with Rough Sets,” BMC Bioinformatics, vol. 7, p. 20, 2006.
[287] J. F. Gibrat, J. Garnier, and B. Robson, “Further developments of protein secondary structure prediction using information theory,” Journal of Molecular Biology, vol. 198, pp. 425–443, 1987.
[288] K. D. Kedarisetti, L. Kurgan, and S. Dick, “A comment on ‘Prediction of protein structural classes by a new measure of information discrepancy’,” Computational Biology and Chemistry, vol. 30, pp. 393–394, 2006.
[289] Y. D. Cai, X. J. Liu, X. B. Xu, and G. P. Zhou, “Support Vector Machines for predicting protein structural class,” BMC Bioinformatics, vol. 2, p. 3, 2001.
[290] P. Y. Chou, “Prediction of protein structural classes from amino acid composition,” in Prediction of Protein Structure and the Principles of Protein Conformation. New York: Plenum Press, 1989.
[291] H. Nakashima, K. Nishikawa, and T. Ooi, “The folding type of protein is relevant to the amino acid composition,” Journal of Biochemistry, vol. 99, pp. 152–162, 1989.
[292] L. Edler, J. Grassmann, and S. Suhai, “Role and results of statistical methods in protein fold class prediction,” Mathematical and Computer Modeling, vol. 33, pp. 1401–1417, 2001.
[293] E. A. Weathers, M. E. Paulaitis, T. B. Woolf, and H. Hog, “Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein,” FEBS Letters, vol. 576, pp. 348–352, 2004.
[294] C. H. Q. Ding and I. Dubchak, “Multi-class protein fold recognition using support vector machines and neural networks,” Bioinformatics, vol. 17, pp. 349–358, 2001.
[295] G. J. Gray, D. J. Murray-Smith, Y. Li, and K. C. Sharman, “Nonlinear model structure identification using genetic programming and a block diagram oriented simulation tool,” Electronics Letters, vol. 32, pp. 1422–1424, 1996.
[296] M. P. Hinchliffe and M. J. Willis, “Dynamic systems modeling using genetic programming,” Computers and Chemical Engineering, vol. 27, pp. 1841–1854, 2003.
[297] K. Tun and S. Lakshminarayanan, “Nonlinear system identification using genetic programming,” in RSCE 2002 and SOMChE 2002, Kuala Lumpur, Malaysia, October 2002.
[298] Y. Zhang and S. Bhattacharyya, “Genetic programming in classifying large-scale data: an ensemble method,” Information Sciences, vol. 163, pp. 85–101, 2004.
[299] L. M. Fua and C. S. Fu-Liu, “Multi-class cancer subtype classification based on gene expression signatures with reliability analysis,” FEBS Letters, vol. 561, pp. 186–190, 2004.
[300] P. S. Shelokar, V. K. Jayaraman, and B. D. Kulkarni, “An ant colony classifier system: application to some process engineering problems,” Computers and Chemical Engineering, vol. 28, pp. 1577–1584, 2004.
[301] B. Alberts, D. Bray, J. Lewis, M. Raff, and J. D. Watson, Molecular Biology of the Cell. New York: Garland, ed., 1989.
[302] A. Dove, “Proteomics: translating genes into products?,” Nature Biotechnology, vol. 17, pp. 233–236, 1999.
[303] S. Fields and O. K. Song, “A novel genetic system to detect protein-protein interactions,” Nature, vol. 340, pp. 245–246, 1989.
[304] H. Zhu, M. Bilgin, R. Bangham, and D. Hall, “Global Analysis of Protein Activities Using Proteome Chips,” Science, vol. 293, pp. 2101–2105, 2001.
[305] A. Szilagyi, V. Grimm, A. K. Arakaki, and J. Skolnick, “Prediction of physical protein-protein interactions,” Physical Biology, vol. 2, pp. S1–S16, 2005.
[306] J. Yu and F.
Fotouhi, “Computational approaches for predicting Protein-Protein Interactions: a survey,” Journal of Medical Systems, vol. 30, pp. 39–44, 2006.
[307] H. Yu, N. H. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J. Han, N. Bertin, S. Chung, M. Vidal, and M. Gerstein, “Annotation Transfer Between Genomes: Protein-Protein Interologs and Protein-DNA Regulogs,” Genome Research, vol. 14, pp. 1107–1118, 2004.
[308] T. Dandekar, B. Snel, M. Huynen, and P. Bork, “Conservation of gene order: a fingerprint of proteins that physically interact,” Trends in Biochemical Sciences, vol. 23, pp. 324–328, 1998.
[309] D. Barker and M. Pagel, “Predicting functional gene links from phylogenetic-statistical analyses of whole genomes,” PLoS Computational Biology, vol. 1, pp. 24–31, 2005.
[310] A. J. Enright, I. Iliopoulos, N. C. Kyrpides, and C. A. Ouzounis, “Protein interaction maps for complete genomes based on gene fusion events,” Nature, vol. 402, pp. 86–90, 1999.
[311] S. A. Benner, S. G. Chamberlin, D. A. Liberles, S. Govindarajan, and L. Knecht, “Functional inferences from reconstructed evolutionary biology involving rectified databases - an evolutionarily grounded approach to functional genomics,” Research in Microbiology, vol. 151, pp. 97–106, 2000.
[312] Y. Ofran and B. Rost, “Analysing six types of protein-protein interfaces,” Journal of Molecular Biology, vol. 325, pp. 377–387, 2003.
[313] S. Jones and J. M. Thornton, “Analysis of protein-protein interaction sites using surface patches,” Journal of Molecular Biology, vol. 272, pp. 121–132, 1997.
[314] P. Aloy and R. B. Russell, “Interrogating protein interaction networks through structural biology,” PNAS-USA, vol. 99, pp. 5896–5901, 2002.
[315] L. Han, J. Cui, H. Lin, Z. Ji, Z. Cao, Y. Li, and Y. Chen, “Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity,” Proteomics, vol. 6, pp. 4023–4037, 2006.
[316] I. Xenarios, D. W. Rice, L. Salwinski, M. K.
Baron, E. M. Marcotte, and D. Eisenberg, “DIP: the database of interacting proteins,” Nucleic Acids Research, vol. 28, pp. 289–291, http://dip.doe-mbi.ucla.edu, 2000.
[317] H. W. Mewes, D. Frishman, U. Guldener, G. Mannhaupt, K. Mayer, et al., “MIPS: A database for genomes and protein sequences,” Nucleic Acids Research, vol. 30, pp. 31–34, 2002.
[318] C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, “BioGRID: a general repository for interaction datasets,” Nucleic Acids Research, vol. 34, pp. D535–D539, 2006.
[319] S. B. Needleman and C. D. Wunsch, “General method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, pp. 443–453, 1970.
[320] J. R. Bock and D. A. Gough, “Predicting protein-protein interactions from primary structure,” Bioinformatics, vol. 17, pp. 455–460, 2001.
[321] S. L. Lo, C. A. Cai, Y. Z. Chen, and C. M. Chung, “Effect of training datasets on support vector machine prediction of protein-protein interactions,” Proteomics, vol. 5, pp. 876–884, 2006.
[322] L. Zhang, S. Wong, O. D. King, and F. Roth, “Predicting co-complexed protein pairs using genomic and proteomic data integration,” BMC Bioinformatics, vol. 5, p. 38, 2004.
[323] Y. Qi, Z. B. Joseph, and J. K. Seetharaman, “Evaluation of different biological data and computational classification methods for use in protein interaction prediction,” Proteins: Structure, Function, and Bioinformatics, vol. 63, pp. 490–500, 2006.
[324] G. D. Bader, D. Betel, and C. W. Hogue, “BIND: the Biomolecular Interaction Network Database,” Nucleic Acids Research, vol. 31, pp. 248–250, http://bind.ca, 2003.
[325] B. Bollobas, Random Graphs. London: Academic Press, 1985.
[326] A. Barabasi and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, pp. 509–512, 1999.
[327] K. Shivakumar and S.
Narasimhan, “A Robust and Efficient NLP Formulation Using Graph Theoretic Principles for Synthesis of Heat Exchanger Networks,” Computers and Chemical Engineering, vol. 26, pp. 1517–1532, 2002.
[328] H. M. Ortmanns, “Functional complexity measure for networks,” Physica A, vol. 337, pp. 679–690, 2004.
[329] S. Jain and S. Krishna, “Emergence and growth of complex networks in adaptive systems,” Computer Physics Communications, vol. 121-122, pp. 116–121, 1999.
[330] J. G. White, E. Southgate, J. N. Thompson, and S. Brenner, “The structure of the nervous system of the nematode C. elegans,” Phil. Trans. of the Royal Society of London, vol. 314, pp. 1–340, 1986.
[331] S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, vol. 31, pp. 64–68, 2002.

APPENDIX

A. PUBLIC DOMAIN DATASETS AND CHEMBIOSYS RELEVANT ONLINE LITERATURE

• Machine Learning datasets
(UCI) - Major pattern recognition datasets : http://archive.ics.uci.edu/ml/
(PR Archive) - Very good resource for datasets and pattern recognition tools. Link: http://www.qi.tnw.tudelft.nl/PRInfo/prarchives.html
• ‘omic’ datasets
(PDB) - RCSB Protein Data Bank : http://www.rcsb.org/pdb/home/home.do
(NCBI) - Comprehensive Genomics/Proteomics/Taxonomy data and software. Link: http://www.ncbi.nlm.nih.gov/
(BIND/BOND) - Biological molecular interaction database. Link: http://bond.unleashedinformatics.com/Action?
(BioGRID) - General repository for interaction datasets : http://www.thebiogrid.org/
(DIP) - Database of Interacting Proteins : http://dip.doe-mbi.ucla.edu/
(KEGG) - Kyoto Encyclopedia of Genes and Genomes : http://www.genome.ad.jp/kegg
(EcoCyc) - For everything on Escherichia coli K-12 : http://ecocyc.pangeasystems.com/ecocyc
(ENZYME) - Major proteins and Enzyme nomenclature : http://www.expasy.ch/enzyme/
• For Micro Array data analysis : http://www.stat.wisc.edu/~yandell/statgen/reference/
• Comprehensive listing of Gene Regulatory Network related data and literature : http://www.stat.wisc.edu/~yandell/statgen/reference/array.html#intro
• A Glossary for Systems Biology : http://sysbio.ist.uni-stuttgart.de/projects/glossary/
• Glossary of MultiVariate Statistics (MVS) terminology : http://www.okstate.edu/artsci/botany/ordinate/glossary.htm
• For statistical methods : http://www2.chass.ncsu.edu/garson/pa765/statnote.htm
• Good resources for data mining : http://www.kdnuggets.com/

WEBSITES valid as of 25th February, 2008

B. COMPUTATIONAL RESOURCES AVAILABLE ONLINE

• Weka 3: Data Mining Software in Java: Weka is a collection of machine learning algorithms for data mining tasks. It is user-friendly software, freely downloadable for academic use. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Link: http://www.cs.waikato.ac.nz/ml/weka/
• MVSP 3.1 (Kovach Computing Services - UK): MVSP is a MultiVariate Statistical Package. Options for PCA/CA/CCA. It can also perform cluster analysis, using 20 different distance or similarity measures and seven clustering strategies. Diversity indices may be calculated on ecological data; these include Simpson’s, Shannon’s, and Brillouin’s indices.
Thirty-day free trial version : http://www.kovcomp.com/
• Cellware 3.0.1 - (BII - Singapore): Cellware, a grid-based modeling and simulation tool, is being developed by the systems biology group at the BioInformatics Institute (BII), Singapore. Link: http://www.bii.a-star.edu.sg/sbg/cellware
• CellDesigner 2.5 - (The Systems Biology Institute, Tokyo, Japan): CellDesigner is a structured diagram editor for drawing gene-regulatory and biochemical networks. Networks are drawn based on the process diagram, with the graphical notation system proposed by Kitano, and are stored using the Systems Biology Markup Language (SBML). Link: http://systems-biology.org/002/001.html
• MetaFluxNet 1.8 - (Dept. of Chemical & Biomolecular Engineering, KAIST, South Korea): MetaFluxNet is a program package for managing information on the metabolic reaction network and for quantitatively analyzing metabolic fluxes in an interactive and customized way. Quantitative in silico simulations of metabolic pathways can be carried out to understand the metabolic status and to design metabolic engineering strategies. Freeware to be used online. Link: http://mbel.kaist.ac.kr/index_en.html
• eXPatGen : Online gene micro array data simulator - (Univ. of Delaware - DE): Simulates gene expression patterns, modeled after microarray experiments, in order to evaluate different analysis methods, such as clustering, principal component analysis (PCA), and self-organizing maps (SOMs). Takes simple inputs to provide a predefined structure for a GRN, which can be used for performance verification of analytical methods. Freeware to be used online : http://www.che.udel.edu/eXPatGen/
• KINSolver: A simulator for computing large ensembles of biochemical and gene regulatory networks. Design and analysis. Supports SBML.
Free download : http://lsdis.cs.uga.edu/~aleman/kinsolver/
• Gepasi 3.30 - (Bio)Kinetics Simulation Software: Gepasi simulates the steady-state and time-course behaviour of reactions in several compartments of different volumes. The program builds the differential equations that govern the behaviour of the system and solves them. Gepasi can also use various nonlinear optimisation algorithms. Free download link : http://www.gepasi.org
• Pajek 1.23 (Large Network Analysis and Visualization tool): Very useful tool for quickly plotting and determining different network properties. Freely downloadable at : http://pajek.imfm.si/doku.php

PUBLICATIONS

Journal Publications:
• R. Rao and S. Lakshminarayanan, “VPMCD: variable interaction modeling approach for class discrimination in biological systems”, FEBS Letters, vol. 581(5), pp 826-830, 2007.
• R. K. Rao and S. Lakshminarayanan, “Partial correlation based variable selection approach for multivariate data classification methods”, Chemometrics & Intelligent Lab. Systems, vol. 86(1), pp 68-81, 2007.
• R. Rao and S. Lakshminarayanan, “Variable interaction network based variable selection for multivariate calibration”, Analytica Chimica Acta, vol. 599(1), pp 24-35, 2007.
• R. Rao and S. Lakshminarayanan, “Variable predictive model based classification algorithm for effective separation of protein structural classes”, Computational Biology and Chemistry, vol. 32(4), pp 302-306, 2008.
• R. Rao and S. Lakshminarayanan, “Variable Predictive Models - a new multivariate classification approach for pattern recognition applications”, Pattern Recognition, vol. 42(1), pp 7-16, 2009.
• M. A. Setiawan, R. Rao and S. Lakshminarayanan, “Partial correlation metric based classifier for food products”, Journal of Food Engineering, vol. 90(2), pp 146-152, 2009.
• R. Rao, K. Tun and S.
Lakshminarayanan, “Genetic Programming based Variable Interaction Models for Classification of Process and Biological Systems”, Industrial and Engineering Chemistry Research, (submitted), 2008.
• R. Rao, K. Tun, S. Lakshminarayanan, Y. Makita and P. K. Dhar, “Amino Acid association patterns - potential key to protein-protein interactions”, Bioinformatics, (submitted), 2008.

Peer reviewed conference papers:
• R. Rao, S. Lakshminarayanan and K. Tun, “Genetic Programming Models for Classification of Data from Biological Systems”, Proceedings of the IEEE Conference on Evolutionary Computation, Singapore, September 25-28, 2007.
• R. Rao and S. Lakshminarayanan, “Alternate complexity measures and stability analysis of process and biological networks”, APCCHE 2006 proceedings, Malaysia, August 27-30, 2006.

Conference presentations:
• “Variable Predictive Modeling for Class Discrimination - systems applications”, PSE-ASIA 2007, Xi’an, China, August 15-18, 2007. (Key Note Speech)
• “Modeling based discriminant algorithm to solve classification problems in Proteomics”, Joint 3rd AOHUPO and 4th Structural Biology and Functional Genomics Conference, Singapore, December 2006. (Best Poster Award)
• “Model Predictive Discrimination Approach For Classification Of Process And Biological Data”, A.I.Ch.E. Annual Meeting - 2006, San Francisco, USA, November 13-17, 2006.
• “Protein Structure and Fold Recognition Using Amino Acid Interaction Models”, A.I.Ch.E. Annual Meeting - 2006, San Francisco, USA, November 13-17, 2006.
• “Partial Correlation Coefficient Matrix Approach for Protein Classification”, INFORMS 2006, Hong Kong, June 25-28, 2006.

VITA

RAO RAGHURAJ KESHAVA.
E5-03-30, Engineering Drive 4, Dept.
of Chemical and Biomolecular Engineering, National University of Singapore, Singapore - 117576
Email: raghuraj@nus.edu.sg, Tel: +65-94520454 (HP), +65-65165802 (Lab)
Website : http://cheed.nus.edu.sg/~chels/DACSPHOTOS/RAGHUHP.html

Present Employment (2008 onwards)
Research Fellow, Singapore Delft Water Alliance, NUS, Singapore
E1-08-25, Engineering Drive 2, Dept. of Civil Engineering, National University of Singapore, Singapore - 117576
Email: cverr@nus.edu.sg, Tel: +65-65168304 (Office)
Website : www.sdwa.nus.edu.sg

Research Interests
Data mining, multivariate statistics, machine learning, systems modeling, computational and systems biology.

Education
M.Tech. (Chemical Engineering) - Indian Institute of Technology, Bombay (India) - 1998.
B. Engg. (Chemical) - Karnataka Regional Engineering College, Surathkal (India) - 1996.

Professional Experience
• Visiting scientist (3 months, 2007) - RIKEN genomic sciences center, Japan: sequence and data analysis for metagenomics project, developed new algorithms to predict protein-protein interaction networks.
• Research Assistant (part time: months 2006/07) - ChemBioSys group, NUS: Systems Biomedical Engineering, research contribution leading to new ideas in hemodialysis and transdermal drug delivery, prepared two A*STAR grant proposals.
• Research collaborator (2006/2007) - Dept. of Biological Sciences, NUS: collaboration for two independent projects in small molecular lab. Design of experiments and data analysis support for Metabolomics investigation. Experience in protocol optimization, LC/MS data mining, biomarker detection.
• Teaching Assistant (two semesters: 2005/06) - Dept. of Chemical & Biomolecular Engg., NUS: Tutor for two undergraduate modules (Chemical engineering principles & Process dynamics and control). Adjudged best tutor for PDC based on student feedback.
• Lecturer (Oct. 1999 - July 2004): Dept. of Chemical Engineering, K.L.E. Society’s College of Engg. & Tech.
- Belgaum (India) : Floated new electives, set up new labs, supervised 22 students for research projects. Major R & D consultancy projects - “PROSIM” - process simulator & flow sheet building software, design of experiments & scale up studies for “BIO-DIESEL” manufacturing, retrofitting of “gas burners”, neural network for cement kiln modeling and control. Nominated for best teacher award (2003).
• Process Engineer (April 1998 - Sept. 1999): Cement plant projects, Larsen & Toubro Ltd. - Mumbai. Cement plant commissioning (on site), design and implementation of interlocks for startup and shutdown, cause and effect analysis of plant operations for intelligent alarm management system.

Journal Publications
• Rao Raghuraj and S. Lakshminarayanan, “Variable Predictive Models - a new multivariate classification approach for pattern recognition applications”, Pattern Recognition, vol. 42(1), pp 7-16, 2009.
• Setiawan, M.A., Rao R. and Lakshminarayanan, S., “Partial correlation metric based classifier for food products”, Journal of Food Engineering, vol. 90(2), pp 146-152, 2009.
• Rao Raghuraj and S. Lakshminarayanan, “Variable predictive model based classification algorithm for effective separation of protein structural classes”, Computational Biology and Chemistry, vol. 32(4), pp 302-306, 2008.
• Rao Raghuraj and S. Lakshminarayanan, “VPMCD: Variable interaction modeling approach for class discrimination in biological systems”, FEBS Letters, vol. 581(5), pp 826-830, 2007.
• R. Rao and S. Lakshminarayanan, “Variable interaction network based variable selection for multivariate calibration”, Analytica Chimica Acta, vol. 599(1), pp 24-35, 2007.
• Rao Raghuraj and S. Lakshminarayanan, “Partial correlation based variable selection approach for multivariate data classification methods”, Chemometrics and Intelligent Lab. Systems, vol. 86(1), pp 68-81, 2007.
• Raghuraj Rao, Manibhushan T.
and Raghunathan R., “Sensor locations in complex chemical plants based on fault diagnostic observability criteria”, A.I.Ch.E. Journal, vol. 45(2), pp 310, 1999.
• R. Rao, K. Tun and S. Lakshminarayanan, “Genetic Programming based Variable Interaction Models for Classification of Process and Biological Systems”, Industrial and Engineering Chemistry Research, (submitted), 2008.
• R. Rao, K. Tun, S. Lakshminarayanan, Y. Makita and P. K. Dhar, “Amino Acid association patterns - potential key to protein-protein interactions”, Bioinformatics, (submitted), 2008.

Peer reviewed Proceedings Publications
• Rao Raghuraj, S. Lakshminarayanan and Kyaw Tun, “Genetic Programming Models for Classification of Data from Biological Systems”, Proceedings of the IEEE Conference on Evolutionary Computation, Singapore, September 25-28, 2007.
• Rao Raghuraj and S. Lakshminarayanan, “Alternate complexity measures and stability analysis of process and biological networks”, APCCHE 2006 proceedings, Malaysia, August 27-30, 2006. (Key Note Lecture)
• S. Lakshminarayanan, Rao Raghuraj and S. Balaji, “CONSIM - MS EXCEL based student friendly simulator for teaching process control theory”, APCCHE 2006 proceedings, Kuala Lumpur, Malaysia, August 27-30, 2006.
• Raghuraj Rao and R. Raghunathan, “An Algorithm for Sensor Location based on fault diagnosability criteria”, Proceedings of I.I.Ch.E. Golden Jubilee congress - Vol. II, 1076-1086, CHEMCON 1997, New Delhi, India, December 1997.

Conference presentations
• Modeling of hemodialysis system - WACBE 2007, Bangkok.
• System engineering approach for body heat regulation - WACBE 2007, Bangkok.
• Classification in Proteomics - Structural Biology Conference 2006, Singapore. (best poster)
• Model Predictive Discrimination Approach - A.I.Ch.E. Annual Meeting 2006, San Francisco.
• Protein Structure and Fold Recognition - A.I.Ch.E. Annual Meeting 2006, San Francisco.
• Partial Correlation analysis for Protein Classification - INFORMS 2006, Hong Kong.

Awards & Achievements
• NUS President’s Graduate Fellowship (2007): given to top 1% graduate students.
• Best Tutor Award (2006): 128 students voted performance with average feedback 4.6/5.
• Best poster award: top posters out of 238 posters on seven interdisciplinary subjects.
• Invited Key note speaker: at APCChE 2006 for session “chemical engineering fundamentals”.
• Invited as session chair: for APRU - Doctoral Students Conference 2006, Singapore.
• Govt. of India MHRD GATE scholarship (M.Tech.) + National Merit Scholarship (B.Engg.)

Computational Skills
• Bioinformatics: sequence analysis, BLAST search/formatting, phylogenetic (tree), genomics (micro-array) and proteomics (LC-MS) data analysis. Metagenomics studies.
• Systems Biology: mathematical modeling of cellular mechanisms, systems physiology, PDPK studies, drug delivery systems, survival analysis using clinical diagnostic data.
• Software Packages : MATLAB, R, SIMULINK
• Programming with C/C++, applications using Visual Basic, VB-Access and VB-Excel

Additional Training
• DIA workshop on clinical data management, Tokyo, Japan, January 29-30, 2007.
• Communication skills workshop, NUS, October 18-25, 2006.
• Microteaching and Tutoring skills, October, 2004.
• Numerical techniques in process engineering, I.I.Sc., Bangalore, India, July, 2003.

REFERENCES
• Dr. S. Lakshminarayanan (Ph.D. Supervisor at NUS)
Asst. Prof., Dept. of Chemical & Biomolecular Engg., NUS, Singapore.
Tel: +65-65168484, Email : chels@nus.edu.sg
• Dr. Pawan Dhar (Advisor during my Internship at RIKEN, Japan)
Senior Scientist and Lab Head, Synthetic Genomics, RIKEN Genomic Science Centre, Japan.
Tel: +81-45-503-9111, Email: pkdhar@gsc.riken.jp
• Dr. R. Raghunathan (M.Tech. Advisor at I.I.T. Bombay, India)
Assoc. Prof., Dept. of Chemical Engineering, Clarkson University, New York.
Email: raghu@clarkson.edu
