Effective use of data mining technologies on biological and clinical data


EFFECTIVE USE OF DATA MINING TECHNOLOGIES ON BIOLOGICAL AND CLINICAL DATA

LIU HUIQING
(M.Science, National University of Singapore, Singapore)
(M.Engineering, Xidian University, PRC)
(B.Economics, Huazhong University of Science and Technology, PRC)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
INSTITUTE FOR INFOCOMM RESEARCH
NATIONAL UNIVERSITY OF SINGAPORE
2004

In memory of my mother, and to my father

Acknowledgements

First and foremost, I would like to acknowledge my supervisor, Professor Wong Limsoon, for his patient, tireless and prompt help. Limsoon has always given me complete freedom to explore and work on the research topics that interest me. Although it was difficult for me to make quick progress at the beginning, I came to appreciate the wisdom of his supervision once I started to think for myself and became relatively independent. On the other hand, he never delays in answering my questions. I am fortunate to study and work under his guidance. I also thank my Ph.D. advisory committee members, Dr Li Jinyan and Dr Wynne Hsu, for many valuable discussions.

During the past three years, my colleagues in the Knowledge Discovery Department of the Institute for Infocomm Research (I²R) have given me much appreciated help in my daily work. I would like to thank all of them for their generous assistance, valuable suggestions and friendship. Special acknowledgements go to Mr Han Hao for his collaboration on problems involving sequence data, and to my department head, Dr Brusic Vladimir, for his encouragement of my study.

I thank the staff of the Graduate Office, School of Computing, National University of Singapore, and the graduate studies support staff in the Human Resource Department of I²R. They always responded very quickly when I encountered problems during my past four years of study.

I could not have finished my thesis work without the strong support of my family. In the middle of my study, I lost my dearest mother, who was the person closest to me in this world and who died of lung cancer in 2002. So that I could concentrate on my study, she tried her best to take care of the whole family even though she was very weak herself. Even during her last days, she still cared about my research progress. I owe her so much. Besides my mother, I have a great father as well. He has provided, and is still providing, unconditional support and encouragement for my research work. Without his love and help, I might have given up my study when my mother passed away.

Special thanks must go to my two lovely daughters, Yugege and Yungege, who are my angels and the source of my happiness. Together with them is my hubby, Hongming, who is always there to support me through both the highs and the lows of my time.

Contents

Acknowledgements
List of Tables
List of Figures
Summary

1 Introduction
  1.1 Motivation
  1.2 Work and Contribution
  1.3 Structure

2 Classification: Supervised Learning
  2.1 Data Representation
  2.2 Results Evaluation
  2.3 Algorithms
    2.3.1 k-nearest neighbour
    2.3.2 Support vector machines
    2.3.3 Decision trees
    2.3.4 Ensemble of decision trees
  2.4 Chapter Summary

3 Feature Selection for Data Mining
  3.1 Categorization of Feature Selection Techniques
  3.2 Feature Selection Algorithms
    3.2.1 t-test, signal-to-noise and Fisher criterion statistical measures
    3.2.2 Wilcoxon rank sum test
    3.2.3 χ² statistical measure
    3.2.4 Entropy based feature selection algorithms
    3.2.5 Principal components analysis
    3.2.6 Correlation-based feature selection
    3.2.7 Feature type transformation
  3.3 ERCOF: Entropy-based Rank sum test and COrrelation Filtering
  3.4 Use of Feature Selection in Bioinformatics
  3.5 Chapter Summary

4 Literature Review on Microarray Gene Expression Data Analysis
  4.1 Preprocessing of Expression Data
    4.1.1 Scale transformation and normalization
    4.1.2 Missing value management
    4.1.3 A web-based preprocessing tool
  4.2 Gene Identification and Supervised Learning
    4.2.1 Gene identification
    4.2.2 Supervised learning to classify samples
    4.2.3 Combining two procedures: wrapper approach
  4.3 Applying Clustering Techniques to Analyse Data
  4.4 Patient Survival Analysis
  4.5 Chapter Summary

5 Experiments on Microarray Data: Phenotype Classification
  5.1 Experimental Design
    5.1.1 Classifiers and their parameter settings
    5.1.2 Entropy-based feature selection
    5.1.3 Performance evaluation
  5.2 Experimental Results
    5.2.1 Colon tumor
    5.2.2 Prostate cancer
    5.2.3 Lung cancer
    5.2.4 Ovarian cancer
    5.2.5 Diffuse large B-cell lymphoma
    5.2.6 ALL-AML leukemia
    5.2.7 Subtypes of pediatric acute lymphoblastic leukemia
  5.3 Comparisons and Discussions
    5.3.1 Classification algorithms
    5.3.2 Feature selection methods
    5.3.3 Classifiers versus feature selection
  5.4 Chapter Summary

6 Experiments on Microarray Data: Patient Survival Prediction
  6.1 Methods
    6.1.1 Selection of informative training samples
    6.1.2 Construction of an SVM scoring function
    6.1.3 Kaplan-Meier analysis
  6.2 Experiments and Results
    6.2.1 Lymphoma
    6.2.2 Lung adenocarcinoma
  6.3 Discussions
  6.4 Chapter Summary

7 Recognition of Functional Sites in Biological Sequences
  7.1 Method Description
    7.1.1 Feature generation
    7.1.2 Feature selection and integration
  7.2 Translation Initiation Site Prediction
    7.2.1 Background
    7.2.2 Data
    7.2.3 Feature generation and sequence transformation
    7.2.4 Experiments and results
    7.2.5 Discussion
  7.3 Polyadenylation Signal Prediction
    7.3.1 Background
    7.3.2 Data
    7.3.3 Experiments and Results
  7.4 Chapter Summary

8 Conclusions
  8.1 Summary
  8.2 Conclusions
  8.3 Future Work

References

A Lists of Genes Identified in Chapter

B Some Resources
  B.1 Kent Ridge Biomedical Data Set Repository
  B.2 DNAFSMiner

List of Tables

2.1 An example of gene expression data
5.1 Colon tumor data set results (22 normal versus 40 tumor) on LOOCV and 10-fold cross validation
5.2 common genes selected by each fold of ERCOF in 10-fold cross validation test for colon tumor data set
5.3 Prostate cancer data set results (52 tumor versus 50 normal) on 10-fold cross validation
5.4 Classification errors on the validation set of lung cancer data
5.5 16 genes with zero entropy measure in the training set of lung cancer data
5.6 GenBank accession number and name of 16 genes with zero entropy measure in the training set of lung cancer data
5.7 10-fold cross validation results on whole lung cancer data set, consisting of 31 MPM and 150 ADCA samples
5.8 10-fold cross validation results on “6-19-02” ovarian proteomic data set, consisting of 162 ovarian cancer versus 91 control samples
5.9 10-fold cross validation results on DLBCL data set, consisting of 24 germinal center B-like DLBCL versus 23 activated B-like DLBCL
5.10 common genes selected by each fold of ERCOF in 10-fold cross validation test on DLBCL data set
5.11 ALL-AML leukemia data set results (ALL versus AML) on testing samples, as well as 10-fold cross validation and LOOCV on the entire set
5.12 ALL-AML leukemia data set results (ALL versus AML) on testing samples by using top genes ranked by SAM score
5.13 Number of samples in each of subtypes in pediatric acute lymphoblastic leukemia data set
5.14 Pediatric ALL data set results (T-ALL versus OTHERS) on 112 testing samples, as well as 10-fold cross validation on the entire 327 cases
5.15 Top 20 genes selected by entropy measure from the training data set of T-ALL versus OTHERS in subtypes of pediatric ALL study

Table A.1: 54 common genes selected by each fold of ERCOF in 10-fold cross validation test for prostate cancer data set
Probe name 40435
40419 31444 37720 32634 34608 33137 40436 34784 at at s at at s at at at g at at Accession number J03592 X85116 M62895 M22382 U38260 M24194 Y13622 J03592 Z83844 1676 s at 36587 at 33614 at 38814 at 33668 at 40024 at 39756 g at M55409 Z11692 X80822 AF038954 AF037643 D86640 Z93930 34853 at 33820 g at 40856 at 31538 at 36601 at 33134 at 32076 at 31545 at 33328 at 39416 at 40607 at 769 s at 32412 at 37819 at 1521 at AB007865 X13794 U29953 M17885 M33308 AB011083 D83407 AL031228 W28612 U90913 U97105 D00017 M13934 AF007130 X17620 1513 at 39939 at 35776 at 31527 at 33408 at 34840 at 39315 at 35119 at 575 s at D21337 AF064243 X17206 AB023151 AI700633 D13628 X56932 M93036 262 at 37639 at 32243 g at 36864 at 38044 at 38098 at 39366 at M21154 X07732 AL038340 AJ001625 AF035283 D80010 N36638 Gene name Human ADP/ATP translocase mRNA, 3’end, clone pHAT8 H.sapiens epb72 gene exon Human lipocortin (LIP) pseudogene mRNA, complete cds-like region Human mitochondrial matrix protein P1 (nuclear encoded) mRNA, complete cds Human islet cell autoantigen ICAp69 mRNA, complete cds Human MHC protein homologous to chicken B complex protein mRNA, complete cds Homo sapiens mRNA for latent transforming growth factor-beta binding protein-4 Human ADP/ATP translocase mRNA, 3’end, clone pHAT8 Human DNA sequence from clone 37E16 on chromosome 22 Contains a novel gene, a gene similar to SH3-binding protein, LGALS1 (14 kDa beta-galactoside-binding lectin) gene, part of a gene similar to mouse p116Rip, ESTs, STSs, GSSs and two CpG islands Homo sapiens pancreatic tumor-related protein mRNA, partial cds H.sapiens mRNA for elongation factor H.sapiens mRNA for ORF Homo sapiens vacuolar H(+)-ATPase subunit mRNA, complete cds Homo sapiens 60S ribosomal protein L12 (RPL12) pseudogene, partial sequence Homo sapiens mRNA for stac, complete cds Human DNA sequence from clone 292E10 on chromosome 22q11-12 Contains the XBP1 gene for X-box binding protein (TREB5), ESTs, STSs, GSSs and a putative CpG island Homo 
sapiens KIAA0405 mRNA, complete cds H.sapiens lactate dehydrogenase B gene exon and Human pigment epithelium-derived factor gene, complete cds Human acidic ribosomal phosphoprotein P0 mRNA, complete cds Human vinculin mRNA, complete cds Homo sapiens mRNA for KIAA0511 protein, partial cds ZAKI-4 mRNA in human skin fibroblast, complete cds dJ1033B10.4 (40S ribosomal protein S18 (RPS18, KE-3)) 49b3 Homo sapiens cDNA Human clone 23665 mRNA sequence Homo sapiens N2A3 mRNA, complete cds Homo sapiens mRNA for lipocortin II, complete cds Human ribosomal protein S14 gene, complete cds Homo sapiens clone 23750 unknown mRNA, partial cds Human mRNA for Nm23 protein, involved in developmental regulation (homolog to Drosophila Awd protein) Antigen, Prostate Specific, Alt Splice Form Human mRNA for collagen Homo sapiens intersectin short form mRNA, complete cds Human mRNA for LLRep3 Homo sapiens mRNA for KIAA0934 protein, partial cds we38g03.x1 Homo sapiens cDNA, 3’end Human mRNA for KIAA0003 gene, complete cds H.sapiens mRNA for 23 kD highly basic protein Human (clone 21726) carcinoma-associated antigen GA733-2 (GA733-2) mRNA, exon and complete cds Human S-adenosylmethionine decarboxylase mRNA, complete cds Human hepatoma mRNA for serine protease hepsin DKFZp566K192 s1 Homo sapiens cDNA, 3’end Homo sapiens mRNA for Pex3 protein Homo sapiens clone 23916 mRNA sequence Human mRNA for KIAA0188 gene, partial cds yx88f05.r1 Homo sapiens cDNA, 5’end 168 Table A.2: 54 common genes selected by each fold of ERCOF in 10-fold cross validation test for prostate cancer data set (continued 1) Probe name 32206 at 39550 at 34304 s at 37730 at 41288 at 31583 at 172 at Accession number AB007920 AB011156 AL050290 U22055 AL036744 X67247 U57650 Gene name Homo sapiens mRNA for KIAA0451 protein, complete Homo sapiens mRNA for KIAA0584 protein, partial Homo sapiens mRNA; cDNA DKFZp586G1923 (from clone DKFZp586G1923) Human 100 kDa coactivator mRNA, complete cds DKFZp564I1663 r1 Homo sapiens cDNA, 5’end 
H.sapiens rpS8 gene for ribosomal protein S8 Human SH2-containing inositol 5-phosphatase (hSHIP) mRNA, complete cds

Table A.3: 39 common m/z identities among top 50 entropy measure selected features in 10-fold cross validation on ovarian cancer proteomic profiling. Their corresponding Wilcoxon test p-values are derived from paper [118].

M/Z identity  Wilcoxon p-value   Entropy measure
244.95245     1.16115E-30        0.13998
245.8296      7.59262E-30        0.16299
245.24466     1.59454E-30        0.17846
244.66041     1.30324E-30        0.18037
245.53704     2.25194E-30        0.18209
435.46452     5.16697E-30        0.23104
246.41524     3.70287E-29        0.23574
246.12233     1.70497E-29        0.23743
247.00158     1.00124E-28        0.23968
417.73207     1.03527E-27        0.25183
434.68588     1.7291E-29         0.25791
435.07512     3.1774E-30         0.25839
435.85411     1.65702E-29        0.26475
246.70832     6.49125E-29        0.27451
261.88643     6.58307E-29        0.28096
418.11364     6.48304E-27        0.28419
247.295       1.45824E-28        0.30174
247.88239     1.30577E-27        0.31365
434.29682     9.27807E-28        0.31648
262.18857     2.34772E-27        0.32680
261.58446     1.5817E-27         0.33865
247.58861     2.33737E-28        0.34268
244.36855     2.11132E-26        0.34343
436.24386     5.43042E-28        0.34656
464.76404     5.64673E-26        0.35072
464.36174     2.34956E-26        0.36228
222.69673     7.50798E-26        0.37045
417.35068     1.30456E-26        0.37647
463.95962     1.13655E-25        0.38043
465.16651     5.44957E-25        0.38914
222.41828     4.20921E-27        0.39731
222.14001     3.27501E-25        0.40447
418.49538     9.26396E-25        0.40599
262.49088     2.6516E-23         0.41769
436.63379     2.16083E-25        0.42559
25.589892     1.80877E-24        0.43315
463.55767     4.42152E-23        0.44623
4003.6449     5.09873E-22        0.45153
220.75125     3.25692E-24        0.46876

Table A.4: 280 genes identified by ERCOF from training samples on ALL-AML leukaemia data set. Probes with bold font were also reported in [41].
Probe X95735 at M55150 at M31166 at M27891 at U46499 at L09209 s at X70297 at M77142 at J03930 at M92287 at U22376 cds2 s at M27783 s at D14874 at M16038 at U50136 rna1 at M98399 s at M21551 rna1 at Y12670 at M83652 s at M23197 at U46751 at D88422 at M54995 at U02020 at M31523 at X04085 rna1 at M81933
at U12471 cds1 at M91432 at X59417 at M12959 s at X74262 at L27584 s at HG4316-HT4586 at J05243 at M31303 rna1 at X62654 rna1 at X90858 at M84526 at J04615 at D26308 at L08177 at X14008 rna1 f at X87613 at M80254 at M96326 rna1 at J04990 at U62136 at D10495 at X52142 at U73737 at X74801 at U32944 at X15949 at Gene name Zyxin FAH Fumarylacetoacetate PTX3 Pentaxin-related gene, rapidly induced by IL-1 beta CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) GLUTATHIONE S-TRANSFERASE, MICROSOMAL APLP2 Amyloid beta (A4) precursor-like protein CHRNA7 Cholinergic receptor, nicotinic, alpha polypeptide NUCLEOLYSIN TIA-1 ALKALINE PHOSPHATASE, INTESTINAL PRECURSOR CCND3 Cyclin D3 C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds ELA2 Elastatse 2, neutrophil ADM Adrenomedullin LYN V-yes-1 Yamaguchi sarcoma viral related oncogene homolog Leukotriene C4 synthase (LTC4S) gene CD36 CD36 antigen (collagen type I receptor, thrombospondin receptor) Neuromedin B mRNA LEPR Leptin receptor PFC Properdin P factor, complement CD33 CD33 antigen (differentiation antigen) Phosphotyrosine independent ligand p62 for the Lck SH2 domain mRNA CYSTATIN A PPBP Connective tissue activation peptide III Pre-B cell enhancing factor (PBEF) mRNA TCF3 Transcription factor (E2A immunoglobulin enhancer binding factors E12/E47) Catalase (EC 1.11.1.6) 5’flank and exon mapping to chromosome 11, band p13 (and joined CDS) CDC25A Cell division cycle 25A Thrombospondin-p50 gene extracted from Human thrombospondin-1 gene, partial cds ACADM Acyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain PROTEASOME IOTA CHAIN TCRA T cell receptor alpha-chain RETINOBLASTOMA BINDING PROTEIN P48 CAB3b mRNA for calcium channel beta3 subunit Transketolase-Like Protein SPTAN1 Spectrin, alpha, non-erythrocytic (alpha-fodrin) Oncoprotein 18 (Op18) gene ME491 gene extracted from H.sapiens gene for Me491/CD63 antigen Uridine phosphorylase DF D component of 
complement (adipsin) SNRPN Small nuclear ribonucleoprotein polypeptide N NADPH-flavin reductase CMKBR7 Chemokine (C-C) receptor Lysozyme gene (EC 3.2.1.17) Skeletal muscle abundant protein PEPTIDYL-PROLYL CIS-TRANS ISOMERASE, MITOCHONDRIAL PRECURSOR Azurocidin gene CATHEPSIN G PRECURSOR Putative enterocyte differentiation promoting factor mRNA, partial cds PRKCD Protein kinase C, delta CTPS CTP synthetase GTBP DNA GT mismatch-binding protein T-COMPLEX PROTEIN 1, GAMMA SUBUNIT Cytoplasmic dynein light chain (hdlc1) mRNA IRF2 Interferon regulatory factor 171 Table A.5: 280 genes identified by ERCOF from training samples on ALL-AML leukaemia data set Probes with bold font were also reported in [41] (continued 1) Probe M31158 at M15780 at X62320 at D49950 at U37055 rna1 s at D88378 at X61587 at X07743 at AFFX-HUMTFRR/M11507 at L42572 at Z69881 at M63138 at M28170 at L41870 at D26156 s at M11722 at U09087 s at M29540 at L47738 at D38073 at HG4321-HT4591 at U41813 at X85116 rna1 s at X58431 rna2 s at M28130 rna1 s at Y00787 s at U82759 at U16954 at Z48501 s at M62762 at M22960 at M28209 at U85767 at M13792 at L05148 at L08246 at M19045 f at M20203 s at U67963 at J03801 f at X51521 at M13452 s at D87076 at L07648 at HG2810-HT2921 at L38608 at L28821 at U73960 at M94633 at S50223 at Z15115 at U84487 at U65928 at U53468 at U72936 s at Gene name PRKAR2B Protein kinase, cAMP-dependent, regulatory, type II, beta GB DEF = DNA/endogenous human papillomavirus type 16 (HPV) DNA, right flank and viral host junction GRN Granulin Liver mRNA for interferon-gamma inducing factor(IGIF) Hepatocyte growth factor-like protein gene Proteasome inhibitor hPI31 subunit ARHG Ras homolog gene family, member G (rho G) PLECKSTRIN AFFX-HUMTFRR/M11507 at (endogenous control) Motor protein Adenosine triphosphatase, calcium CTSD Cathepsin D (lysosomal aspartyl protease) CD19 CD19 antigen RB1 Retinoblastoma (including osteosarcoma) Transcriptional activator hSNF2b Terminal transferase mRNA Thymopoietin 
beta mRNA CARCINOEMBRYONIC ANTIGEN PRECURSOR Inducible protein mRNA MCM3 Minichromosome maintenance deficient (S cerevisiae) Ahnak-Related Sequence HOXA9 Homeo box A9 Epb72 gene exon HOX 2.2 gene extracted from Human Hox2.2 gene for a homeobox protein Interleukin (IL8) gene INTERLEUKIN-8 PRECURSOR GB DEF = Homeodomain protein HoxA9 mRNA (AF1q) mRNA GB DEF = Polyadenylate binding protein II ATP6C Vacuolar H+ ATPase proton channel subunit PPGB Protective protein for beta-galactosidase (galactosialidosis) RAS-RELATED PROTEIN RAB-1A Myeloid progenitor inhibitory factor-1 MPIF-1 mRNA ADA Adenosine deaminase Protein tyrosine kinase related mRNA sequence INDUCED MYELOID LEUKEMIA CELL DIFFERENTIATION PROTEIN MCL1 LYZ Lysozyme GB DEF = Neutrophil elastase gene, exon Lysophospholipase homolog (HU-K5) mRNA LYZ Lysozyme VIL2 Villin (ezrin) LMNA Lamin A KIAA0239 gene, partial cds MXI1 mRNA Homeotic Protein Pl2 ALCAM Activated leucocyte cell adhesion molecule MANA2 Alpha mannosidase II isozyme ADP-ribosylation factor-like protein mRNA GB DEF = Recombination acitivating protein (RAG2) gene, last exon HKR-T1 TOP2B Topoisomerase (DNA) II beta (180kD) CX3C chemokine precursor, mRNA, alternatively spliced JUN V-jun avian sarcoma virus 17 oncogene homolog NADH:ubiquinone oxidoreductase subunit B13 (B13) mRNA X-LINKED HELICASE II 172 Table A.6: 280 genes identified by ERCOF from training samples on ALL-AML leukaemia data set Probes with bold font were also reported in [41] (continued 2) Probe X66401 cds1 at X66533 at AF009426 at U90546 at U28833 at M63488 at U02493 at D86479 at M31211 s at U26266 s at U05259 rna1 at M58297 at D63880 at U38846 at M81695 s at D14664 at X16546 at HG627-HT5097 s at M22324 at HG2981-HT3127 s at Z49194 at HG1612-HT1612 at X77533 at U20998 at X17042 at HG2788-HT2896 at HG2855-HT2995 at U29175 at J03589 at U41767 s at X06182 s at M57731 s at M24400 at M69043 at D43950 at M19507 at M59820 at D83785 at U50733 at D80001 at M29696 at U72621 at M63438 s at X62535 
at M84371 rna1 s at L13278 at X14850 at J03473 at U79274 at D86983 at X63469 at D88270 at X59350 at U35451 at X61970 at Gene name LMP2 gene extracted from H.sapiens genes TAP1, TAP2, LMP2, LMP7 and DOB GUANYLATE CYCLASE SOLUBLE, BETA-1 CHAIN Clone 22 mRNA, alternative splice variant alpha-1 Butyrophilin (BTF4) mRNA Down syndrome critical region protein (DSCR1) mRNA RPA1 Replication protein A1 (70kD) 54 kDa protein mRNA Non-lens beta gamma-crystallin like protein (AIM1) mRNA, partial cds MYL1 Myosin light chain (alkali) DHPS Deoxyhypusine synthase MB-1 gene ZNF42 Zinc finger protein 42 (myeloid-specific retinoic acid-responsive) KIAA0159 gene Stimulator of TAR RNA binding (SRB) mRNA ITGAX Integrin, alpha X (antigen CD11C (p150), alpha polypeptide) KIAA0022 gene RNS2 Ribonuclease (eosinophil-derived neurotoxin; EDN) Rhesus (Rh) Blood Group System Ce-Antigen, Alt Splice 2, Rhvi ANPEP Alanyl (membrane) aminopeptidase (aminopeptidase N, aminopeptidase M, microsomal aminopeptidase, CD13) Epican, Alt Splice 11 OBF-1 mRNA for octamer binding factor Macmarcks Activin type II receptor SRP9 Signal recognition particle kD protein PRG1 Proteoglycan 1, secretory granule Calcyclin Heat Shock Protein, 70 Kda (Gb:Y00371) Transcriptional activator hSNF2b UBIQUITIN-LIKE PROTEIN GDX Metargidin precursor mRNA KIT V-kit Hardy-Zuckerman feline sarcoma viral oncogene homolog GRO2 GRO2 oncogene CTRB1 Chymotrypsinogen B1 MAJOR HISTOCOMPATIBILITY COMPLEX ENHANCER-BINDING PROTEIN MAD3 T-COMPLEX PROTEIN 1, EPSILON SUBUNIT MPO Myeloperoxidase CSF3R Colony stimulating factor receptor (granulocyte) KIAA0200 gene Dynamitin mRNA KIAA0179 gene, partial cds IL7R Interleukin receptor LOT1 mRNA GLUL Glutamate-ammonia ligase (glutamine synthase) DAGK1 Diacylglycerol kinase, alpha (80kD) CD19 gene CRYZ Crystallin zeta (quinone reductase) HISTONE H2A.X ADPRT ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase) Clone 23733 mRNA KIAA0230 gene, partial cds GTF2E2 General transcription factor TFIIE 
beta subunit, 34 kD GB DEF = (lambda) DNA for immunoglobin light chain CD22 CD22 antigen Heterochromatin protein p25 mRNA PROTEASOME ZETA CHAIN 173 Table A.7: 280 genes identified by ERCOF from training samples on ALL-AML leukaemia data set Probes with bold font were also reported in [41] (continued 3) Probe U66838 at U94836 at X54326 at D55654 at U31556 at X83490 s at M83667 rna1 s at D38522 at Z68747 at X64072 s at M65214 s at M29194 at M86406 at U16307 at U26173 s at L11669 at X15573 at X56411 rna1 at X96752 at U90552 at HG4582-HT4987 at AF005043 at U47077 at M83233 at X16832 at D00763 at U27460 at X63753 at Z21507 at U57721 at S68134 s at U81556 at X97335 at D86967 at X66899 at M37435 at J03798 at U30521 at U50939 at U83410 at X59543 at S71043 rna1 s at L49229 f at M95678 at U49020 cds2 s at U00802 s at M93056 at M95178 at L25931 s at M32304 s at D38128 at Gene name Cyclin A1 mRNA ERPROT 213-21 mRNA MULTIFUNCTIONAL AMINOACYL-TRNA SYNTHETASE MDH1 Malate dehydrogenase 1, NAD (soluble) E2F5 E2F transcription factor 5, p130-binding GB DEF = Fas/Apo-1 (clone pCRTM11-Fasdelta(3,4)) NF-IL6-beta protein mRNA KIAA0080 gene, partial cds GB DEF = Imogen 38 SELL Leukocyte adhesion protein beta subunit TCF3 Transcription factor (E2A immunoglobulin enhancer binding factors E12/E47) LIPC Lipase, hepatic ACTN2 Actinin alpha Glioma pathogenesis-related protein (GliPR) mRNA BZIP protein NF-IL3A (IL3BP1) mRNA Tetracycline transporter-like protein mRNA PFKL Phosphofructokinase (liver type) ADH4 gene for class II alcohol dehydrogenase (pi subunit), exon L-3-hydroxyacyl-CoA dehydrogenase Butyrophilin (BTF5) mRNA Glucocorticoid Receptor, Beta Poly(ADP-ribose) glycohydrolase (hPARG) mRNA DNA-dependent protein kinase catalytic subunit (DNA-PKcs) mRNA TCF12 Transcription factor 12 (HTF4, helix-loop-helix transcription factors 4) CTSH Cathepsin H GAPD Glyceraldehyde-3-phosphate dehydrogenase Uridine diphosphoglucose pyrophosphorylase mRNA SON SON DNA binding protein EEF1D Eukaryotic 
translation elongation factor delta (guanine nucleotide exchange protein) L-kynurenine hydrolase mRNA GB DEF = CREM=cyclic AMP-responsive element modulator beta isoform [human, mRNA, 1030 nt] Hypothetical protein A4 mRNA Kinase A anchor protein KIAA0212 gene EWSR1 Ewing sarcoma breakpoint region CSF1 Colony-stimulating factor (M-CSF) SMALL NUCLEAR RIBONUCLEOPROTEIN SM D1 FRAP FK506 binding protein 12-rapamycin associated protein Amyloid precursor protein-binding protein mRNA CUL-2 (cul-2) mRNA RIBONUCLEOSIDE-DIPHOSPHATE REDUCTASE M1 CHAIN Ig alpha 2=immunoglobulin A heavy chain allotype constant region, germ line [human, peripheral blood neutrophils, Genomic, 1799 nt] GB DEF = Retinoblastoma susceptibility protein (RB1) gene, with a bp deletion in exon 22 (L11910 bases 161855-162161) PLCB2 Phospholipase C, beta MEF2A gene (myocyte-specific enhancer factor 2A, C9 form) extracted from Human myocyte-specific enhancer factor 2A (MEF2A) gene, first coding Drebrin E LEUKOCYTE ELASTASE INHIBITOR ALPHA-ACTININ 1, CYTOSKELETAL ISOFORM LBR Lamin B receptor TIMP2 Tissue inhibitor of metalloproteinase PTGIR Prostaglandin I2 (prostacyclin) receptor (IP) 174 Table A.8: 280 genes identified by ERCOF from training samples on ALL-AML leukaemia data set Probes with bold font were also reported in [41] (continued 4) Probe D87742 at M63379 at X80907 at AF012024 s at J04621 at M80899 at U97105 at M30703 s at U43292 at U05572 s at D31887 at X97748 s at Y00339 s at X52056 at M92357 at AFFX-HUMTFRR/M11507 M at X66610 at U07139 at HG4535-HT4940 s at X64364 at HG3162-HT3339 at X51420 at D50918 at AJ000480 at J04027 at S76638 at U28042 at M11147 at HG4755-HT5203 s at X65644 at D26579 at U88964 at U07132 at L20941 at M83221 at L09235 at Z32765 at M57710 at L22075 at K03195 at M21119 s at U61836 at U77396 at L41067 at L33930 s at M22898 at M92439 at M61853 at X66171 at AF015913 at Gene name KIAA0268 gene, partial cds CLU Clusterin (complement lysis inhibitor; testosterone-repressed prostate 
message 2; apolipoprotein J) GB DEF = P85 beta subunit of phosphatidyl-inositol-3-kinase Integrin cytoplasmic domain associated protein (Icap-1a) mRNA SDC2 Syndecan (heparan sulfate proteoglycan 1, cell surface-associated, fibroglycan) AHNAK AHNAK nucleoprotein (desmoyokin) Dihydropyrimidinase related protein-2 Amphiregulin (AR) gene MDS1B (MDS1) mRNA MANB Mannosidase alpha-B (lysosomal) KIAA0062 gene, partial cds GB DEF = PTX3 gene promotor region CA2 Carbonic anhydrase II SPI1 Spleen focus forming virus (SFFV) proviral integration oncogene spi1 B94 PROTEIN AFFX-HUMTFRR/M11507 M at (endogenous control) ALPHA ENOLASE, LUNG SPECIFIC CAB3b mRNA for calcium channel beta3 subunit Dematin BSG Basigin Transcription Factor Iia TYRP1 Tyrosinase-related protein KIAA0128 gene, partial cds GB DEF = C8FW phosphoprotein Adenosine triphosphatase mRNA NFKB2 Nuclear factor of kappa light polypeptide gene enhancer in B-cells (p49/p100) DEAD box RNA helicase-like protein mRNA FTL Ferritin, light polypeptide Spinal Muscular Atrophy IMMUNODEFICIENCY VIRUS TYPE I ENHANCER-BINDING PROTEIN Transmembrane protein HEM45 mRNA Orphan receptor mRNA, partial cds FTH1 Ferritin heavy chain TRANSCRIPTION FACTOR RELB ATP6A1 ATPase, H+ transporting, lysosomal (vacuolar proton pump), alpha polypeptide, 70kD, isoform GB DEF = CD36 gene exon 15 LGALS3 Lectin, galactoside-binding, soluble, (galectin 3) Guanine nucleotide regulatory protein (G13) mRNA (HepG2) glucose transporter gene mRNA LYZ Lysozyme Putative cyclin G1 interacting protein mRNA, partial sequence No cluster in current Unigene and no Genbank entry for U77396 (qualifier U77396 at) Transcription factor NFATx mRNA CD24 signal transducer mRNA and 3’ region TP53 Tumor protein p53 (Li-Fraumeni syndrome) 130 KD LEUCINE-RICH PROTEIN CYP2C18 Cytochrome P450, subfamily IIC (mephenytoin 4-hydroxylase), polypeptide 18 CMRF35 mRNA GB DEF = SKB1Hs mRNA 175 Table A.9: 280 genes identified by ERCOF from training samples on ALL-AML leukaemia data set 
Probes with bold font were also reported in [41] (continued 5)

Probe           Gene name
U50928_at       PKD2 Autosomal dominant polycystic kidney disease type II
D63874_at       HMG1 High-mobility group (nonhistone chromosomal) protein
X82240_rna1_at  TCL1 gene (T cell leukemia) extracted from H.sapiens mRNA for Tcell leukemia/lymphoma
U79285_at       GLYCYLPEPTIDE N-TETRADECANOYLTRANSFERASE
U21858_at       HISTONE H3.3
L76702_at       Protein phosphatase 2A 74 kDa regulatory subunit (delta or B”” subunit)
M19888_at       SPRR1B Small proline-rich protein 1B (cornifin)
U31814_at       Transcriptional regulator homolog RPD3 mRNA
X77307_at       5-HYDROXYTRYPTAMINE 2B RECEPTOR
U49844_at       Protein kinase ATR mRNA
U65410_at       Mitotic feedback control protein Madp2 homolog mRNA
D14658_at       KIAA0102 gene
Y07604_at       Nucleoside-diphosphate kinase
M60527_at       DCK Deoxycytidine kinase
X58072_at       GATA3 GATA-binding protein

Table A.10: Thirty-seven genes selected by ERCOF on training samples and reported in [140] to separate TEL-AML1 from other subtypes of ALL cases in pediatric ALL study. All these genes are relatively highly expressed (above the mean value across all the samples) in TEL-AML1 samples.

Probe           Accession No  Description
34481_at        AF030227      vav proto-oncogene, exon 27
36239_at        Z49194        H.sapiens mRNA for oct-binding factor
37470_at        AF013249      Homo sapiens leukocyte-associated Ig-like receptor-1 (LAIR-1) mRNA
38203_at        U69883        Human calcium-activated potassium channel hSK1 (SK) mRNA
38570_at        X03066        Human mRNA for HLA-D class II antigen DO beta chain
38578_at        M63928        Homo sapiens T cell activation antigen (CD27) mRNA
38906_at        M61877        Human erythroid alpha-spectrin (SPTA1) mRNA
40745_at        L13939        Homo sapiens beta adaptin (BAM22) mRNA
41381_at        AB002306      Human mRNA for KIAA0308 gene
41442_at        AB010419      Homo sapiens mRNA for MTG8-related protein MTG16a
31898_at        D86967        Human mRNA for KIAA0212 gene
32660_at        AB002340      Human mRNA for KIAA0342 gene
34194_at        AL049313      Homo sapiens mRNA; cDNA DKFZp564B076 (from clone DKFZp564B076)
35614_at        AB012124      Homo sapiens TCFL5 mRNA for transcription factor-like
35665_at        Z46973        H.sapiens mRNA for phosphatidylinositol 3-kinase
36524_at        AB029035      Homo sapiens mRNA for KIAA1112 protein
36537_at        AB011093      Homo sapiens mRNA for KIAA0521 protein
37280_at        U59912        Human chromosome Mad homolog Smad1 mRNA
41200_at        Z22555        H.sapiens encoding CLA-1 mRNA
32224_at        AB018312      Homo sapiens mRNA for KIAA0769 protein
36985_at        X17025        Human homolog of yeast IPP isomerase
38124_at        X55110        Human mRNA for neurite outgrowth-promoting protein
40570_at        AF032885      Homo sapiens forkhead protein (FKHR) mRNA
41498_at        AB020718      Homo sapiens mRNA for KIAA0911 protein
41814_at        M29877        Human alpha-L-fucosidase
32579_at        U29175        Human transcriptional activator (BRG1) mRNA
33162_at        X02160        Human mRNA for insulin receptor precursor
1779_s_at       M16750        Human pim-1 oncogene mRNA
1488_at         L77886        Human protein tyrosine phosphatase mRNA
1336_s_at       X06318        Human mRNA for protein kinase C (PKC) type beta I
1299_at         X93512        H.sapiens mRNA for telomeric DNA binding protein (orf2)
1217_g_at       X07109        Human mRNA for protein kinase C (PKC) type beta II
932_i_at        L11672        Human Kruppel related zinc finger protein (HTF10) mRNA
880_at          M34539        Human FK506-binding protein (FKBP) mRNA
755_at          D26070        Human mRNA for type inositol 1,4,5-trisphosphate receptor
577_at          M94250        Human retinoic acid inducible factor (MK) gene exons 1-5
160029_at       X07109        protein kinase C beta

Table A.11: Top 20 genes selected by entropy measure on training samples to separate MLL from other subtypes of ALL cases in pediatric ALL study. The last column indicates the sample class in which the gene is highly expressed (above the mean value across all the samples).

Probe           Accession No  Description                                                     HighlyExp
34306_at        AB007888      Homo sapiens KIAA0428 mRNA                                      MLL
36777_at        AJ001687      Homo sapiens NKG2D gene, exons 2-5 and joined mRNA and CDS      MLL
33412_at        AI535946      vicpro2.D07.r Homo sapiens cDNA, 5’ end                         MLL
657_at          L11373        Human protocadherin 43 mRNA, complete cds for abbreviated PC43  MLL
32207_at        M64925        Human palmitoylated erythrocyte membrane protein (MPP1) mRNA    OTHERS
33847_s_at      AI304854      Homo sapiens cDNA, 3’ end                                       MLL
34337_s_at      AJ010014      Homo sapiens mRNA for M96A protein                              OTHERS
1389_at         J03779        Human common acute lymphoblastic leukemia antigen (CALLA) mRNA  OTHERS
34861_at        D63997        Homo sapiens mRNA for GCP170                                    OTHERS
40518_at        Y00062        Human mRNA for T200 leukocyte common antigen (CD45, LC-A)       MLL
40913_at        W28589        Homo sapiens cDNA                                               OTHERS
31898_at        D86967        Human mRNA for KIAA0212 gene                                    OTHERS
38413_at        D15057        Human mRNA for DAD-1                                            MLL
2062_at         L19182        Human MAC25 mRNA                                                MLL
794_at          X62055        H.sapiens PTP1C mRNA for protein-tyrosine phosphatase 1C        MLL
40519_at        Y00638        Human mRNA for leukocyte common antigen (T200)                  MLL
41747_s_at      U49020        Human myocyte-specific enhancer factor 2A (MEF2A) gene          MLL
38160_at        AF011333      Homo sapiens DEC-205 mRNA                                       MLL
37692_at        AI557240      Homo sapiens cDNA, 3’ end                                       MLL
40797_at        AF009615      Homo sapiens ADAM10 (ADAM10) mRNA                               MLL

Table A.12: Twenty-four genes selected by ERCOF on training samples and reported in [140] to separate MLL from other subtypes of ALL cases in pediatric ALL study. All these genes are relatively highly expressed (above the mean value across all the samples) in MLL samples except U70321 (accession number). Genes with bold font are among top 20 features selected by entropy measure and can be found in Table A.11 as well.

Probe           Accession No  Description
36777_at        AJ001687      Homo sapiens NKG2D gene, exons 2-5 and joined mRNA and CDS
39424_at        U70321        Human herpesvirus entry mediator mRNA
40076_at        AF004430      Homo sapiens hD54+ins2 isoform (hD54) mRNA
40493_at        L05424        Human hyaluronate receptor (CD44) gene
40506_s_at      U75686        Homo sapiens polyadenylate binding protein mRNA
40763_at        U85707        Human leukemogenic homolog protein (MEIS1) mRNA
40797_at        AF009615      Homo sapiens ADAM10 (ADAM10) mRNA
40798_s_at      Z48579        H.sapiens mRNA for disintegrin-metalloprotease (partial)
41747_s_at      U49020        Human myocyte-specific enhancer factor 2A (MEF2A) gene, first coding
32193_at        AF030339      Homo sapiens receptor for viral semaphorin protein (VESPR) mRNA
32215_i_at      AB020685      Homo sapiens mRNA for KIAA0878 protein
33412_at        AI535946      Homo sapiens cDNA, 5’ end
34306_at        AB007888      Homo sapiens KIAA0428 mRNA
34785_at        AB028948      Homo sapiens mRNA for KIAA1025 protein
35298_at        U54558        Homo sapiens translation initiation factor eIF3 p66 subunit mRNA
37675_at        X60036        H.sapiens mRNA for mitochondrial phosphate carrier protein
38391_at        M94345        Homo sapiens macrophage capping protein mRNA
38413_at        D15057        Human mRNA for DAD-1
2062_at         L19182        Human MAC25 mRNA
2036_s_at       M59040        Human cell adhesion molecule (CD44) mRNA
1914_at         U66838        Human cyclin A1 mRNA
1126_s_at       L05424        Human cell surface glycoprotein CD44 (CD44) gene, 3’ end of long tailed isoform
1102_s_at       M10901        Human glucocorticoid receptor alpha mRNA
657_at          L11373        Human protocadherin 43 mRNA, complete cds for abbreviated PC43

Table A.13: Nineteen genes selected by ERCOF on training samples and reported in [140] to separate Hyperdip>50 from other subtypes of ALL cases in pediatric ALL study. All these genes are relatively highly expressed (above the mean value across all the samples) in Hyperdip>50 samples.
Probe 38518 39628 31863 37543 38968 39039 39329 39389 32207 32236 32251 36620 36937 37350 38738 39168 40903 at at at at at s at at at at at at at s at at at at at 32572 at 306 s at Accession No Y18004 AI671547 D80001 D25304 AB005047 AI557497 X15804 M38690 M64925 AF032456 AA149307 X02317 U90878 AL031177 X99584 AB018328 AL049929 X98296 J02621 Description Homo sapiens mRNA for SCML2 protein Homo sapiens cDNA, 3’ end Human mRNA for KIAA0179 gene Human mRNA for KIAA0006 gene Homo sapiens mRNA for SH3 binding protein Homo sapiens cDNA, 3’ end Human mRNA for alpha-actinin Human CD9 antigen mRNA Human palmitoylated erythrocyte membrane protein (MPP1) mRNA Homo sapiens ubiquitin conjugating enzyme G2 (UBE2G2) mRNA Homo sapiens cDNA, 3’ end Human mRNA for Cu/Zn superoxide dismutase (SOD) Homo sapiens carboxyl terminal
LIM domain protein (CLIM1) mRNA 26S Proteasome subunit p28 (Ankyrin repeat protein)) (isoform 1) H.sapiens mRNA for SMT3A protein Homo sapiens mRNA for KIAA0785 protein Homo sapiens mRNA; cDNA DKFZp547O0510 (from clone DKFZp547O0510) H.sapiens mRNA for ubiquitin hydrolase Human non-histone chromosomal protein HMG-14 mRNA 180 Appendix B Some Resources B.1 Kent Ridge Biomedical Data Set Repository All the gene expression profiles and proteomic data described in Chapter 5, and some DNA sequences used in Chapter can be found in the Kent Ridge Biomedical Data Set Repository at http://sdmc.i2r.a-star.edu.sg/rp/ In this data repository, we have collected gene expression data, protein profiling data and genomic sequence data that are related to classification and are published recently in Science, Nature and other prestigious journals As the file formats of these original raw data are different from common ones used in most of machine learning softwares, we have transformed these data sets into the standard data and names format and stored them in this repository Besides, we also provide data in arff format which is used by Weka, a machine learning software package developed at University of Waikato in New Zealand Detailed information of Weka can be found at http://www.cs.waikato.ac.nz/˜ml/ weka/ B.2 DNAFSMiner The DNAFSMiner (DNA Functional Site Miner) is a web-based toolbox for recognition of functional sites in DNA sequences It was built on the technologies presented in Chapter and written in Java and Perl languages It can be accessed via http://sdmc.i2r.a-star.edu.sg/ DNAFSMiner/ Currently, it can be used to identify translation initiation site (TISMiner) in ver- 181 tebrate mRNA, cDNA, and DNA sequences and polyadenylation signal (Poly(A) Signal Miner) in human sequences 182 ... 
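To give a sense of the arff format mentioned in Section B.1, the following is a minimal sketch of how a small expression matrix could be serialized for Weka. It is not the conversion code used to build the repository. The function name `to_arff`, the relation name and the expression values are hypothetical and chosen only for illustration; the two probe IDs are taken from Table A.12.

```python
def to_arff(relation, probe_ids, samples, labels):
    """Serialize an expression matrix as ARFF text (illustrative sketch).

    samples: one row per sample, each a list of numeric expression values
    labels:  one class label per sample
    """
    if len(samples) != len(labels):
        raise ValueError("need exactly one class label per sample")
    classes = sorted(set(labels))

    # Header: relation name, one numeric attribute per probe, nominal class.
    lines = [f"@relation {relation}", ""]
    for probe in probe_ids:
        # Quote attribute names so IDs starting with a digit are safe.
        lines.append(f"@attribute '{probe}' numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")

    # Data section: comma-separated values followed by the class label.
    lines += ["", "@data"]
    for values, label in zip(samples, labels):
        if len(values) != len(probe_ids):
            raise ValueError("row length does not match number of probes")
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical two-sample example; the expression values are made up.
example = to_arff(
    "mll_vs_others",
    ["36777_at", "657_at"],
    [[812.0, 95.5], [130.0, 40.0]],
    ["MLL", "OTHERS"],
)
print(example)
```

The class attribute is declared as a nominal attribute listing the allowed labels, so a classifier in Weka can treat the last column as the prediction target.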

Date posted: 16/09/2015, 17:12
