IT training temporal data mining mitsa 2010 03 10

Temporal Data Mining

© 2010 by Taylor and Francis Group, LLC C9765_C000.indd 2/4/10 9:46:30 AM

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota, Department of Computer Science and Engineering, Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions, David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION, Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: Advances in Algorithms, Theory, and Applications, Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT, David Skillicorn
MULTIMEDIA DATA MINING: A Systematic Introduction to Concepts and Theory, Zhongfei Zhang and Ruofei Zhang
NEXT GENERATION OF DATA MINING, Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
DATA MINING FOR DESIGN AND MARKETING, Yukio Ohsawa and Katsutoshi Yada
THE TOP TEN ALGORITHMS IN DATA MINING, Xindong Wu and Vipin Kumar
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, Second Edition, Harvey J. Miller and Jiawei Han
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS, Ashok N. Srivastava and Mehran Sahami
BIOLOGICAL DATA MINING, Jake Y. Chen and Stefano Lonardi
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS, Vagelis Hristidis
TEMPORAL DATA MINING, Theophano Mitsa

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

Temporal Data Mining
Theophano Mitsa

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
International Standard Book Number: 978-1-4200-8976-9 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Mitsa, Theophano.
Temporal data mining / Theophano Mitsa.
p. cm. (Chapman & Hall/CRC data mining and knowledge discovery series)
Includes bibliographical references and index.
ISBN 978-1-4200-8976-9 (hardcover : alk. paper)
Data mining. Temporal databases. I. Title. II. Series.
QA76.9.D343M593 2010
005.75’3 dc22 2009048856

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

To my parents, who taught me to spend every moment wisely, and to the Eternal One, who taught me that every moment is infinitely important

Table of Contents

Preface

Chapter 1 ▪ Temporal Databases and Mediators
1.1 Time in Databases
    1.1.1 Database Concepts
    1.1.2 Temporal Databases
    1.1.3 Time Representation in SQL
    1.1.4 Time in Data Warehouses
    1.1.5 Temporal Constraints and Temporal Relations
    1.1.6 Requirements for a Temporal Knowledge-Based Management System
    1.1.7 Using XML for Temporal Data
    1.1.8 Temporal Entity Relationship Models
1.2 Database Mediators
    1.2.1 Temporal Relation Discovery
    1.2.2 Semantic Queries on Temporal Data
1.3 Additional Bibliography
    1.3.1 Additional Bibliography on Temporal Primitives
    1.3.2 Additional Bibliography on Temporal Constraints and Logic
    1.3.3 Additional Bibliography on Temporal Languages and Frameworks
References

Chapter 2 ▪ Temporal Data Similarity Computation, Representation, and Summarization
2.1 Temporal Data Types and Preprocessing
    2.1.1 Temporal Data Types
    2.1.2 Temporal Data Preprocessing
        2.1.2.1 Data Cleaning
        2.1.2.2 Data Normalization
2.2 Time Series Similarity Measures
    2.2.1 Distance-Based Similarity
        2.2.1.1 Euclidean Distance
        2.2.1.2 Absolute Difference
        2.2.1.3 Maximum Distance Metric
    2.2.2 Dynamic Time Warping
    2.2.3 The Longest Common Subsequence
    2.2.4 Other Time Series Similarity Metrics
2.3 Time Series Representation
    2.3.1 Nonadaptive Representation Methods
        2.3.1.1 Discrete Fourier Transform
        2.3.1.2 Discrete Wavelet Transform
        2.3.1.3 Piecewise Aggregate Composition
    2.3.2 Data-Adaptive Representation Methods
        2.3.2.1 Singular Value Decomposition of Time Sequences
        2.3.2.2 Shape Definition Language and CAPSUL
        2.3.2.3 Landmark-Based Representation
        2.3.2.4 Symbolic Aggregate Approximation (SAX) and iSAX
        2.3.2.5 Adaptive Piecewise Constant Approximation (APCA)
        2.3.2.6 Piecewise Linear Representation (PLA)
    2.3.3 Model-Based Representation Methods
        2.3.3.1 Markov Models for Representation and Analysis of Time Series
    2.3.4 Data Dictated Representation Methods
        2.3.4.1 Clipping
    2.3.5 Comparison of Representation Schemes and Distance Measures
    2.3.6 Need for Time Series Data Mining Benchmarks
2.4 Time Series Summarization Methods
    2.4.1 Statistics-Based Summarization
        2.4.1.1 Mean
        2.4.1.2 Median
        2.4.1.3 Mode
        2.4.1.4 Variance
    2.4.2 Fractal Dimension–Based Summarization
    2.4.3 Run-Length–Based Signature
        2.4.3.1 Short Run-Length Emphasis
        2.4.3.2 Long Run-Length Emphasis
    2.4.4 Histogram-Based Signature and Statistical Measures
    2.4.5 Local Trend-Based Summarization
2.5 Temporal Event Representation
    2.5.1 Event Representation Using Markov Models
    2.5.2 A Formalism for Temporal Objects and Repetitions
2.6 Similarity Computation of Semantic Temporal Objects
2.7 Temporal Knowledge Representation in Case-Based Reasoning Systems
2.8 Additional Bibliography
    2.8.1 Similarity Measures
    2.8.2 Dimensionality Reduction
    2.8.3 Representation and Summarization Techniques
    2.8.4 Similarity and Query of Data Streams
References

Chapter 3 ▪ Temporal Data Classification and Clustering
3.1 Classification Techniques
    3.1.1 Distance-Based Classifiers
        3.1.1.1 K–Nearest Neighbors
        3.1.1.2 Exemplar-Based Nearest Neighbor
    3.1.2 Bayes Classifier
    3.1.3 Decision Trees
    3.1.4 Support Vector Machines in Classification
    3.1.5 Neural Networks in Classification
    3.1.6 Classification Issues
        3.1.6.1 Classification Error Types
        3.1.6.2 Classifier Success Measures
        3.1.6.3 Generation of the Testing and Training Sets
        3.1.6.4 Comparison of Classification Approaches
        3.1.6.5 Feature Processing
        3.1.6.6 Feature Selection
3.2 Clustering
    3.2.1 Clustering via Partitioning
        3.2.1.1 K-Means Clustering
        3.2.1.2 K-Medoids Clustering
    3.2.2 Hierarchical Clustering
        3.2.2.1 The COBWEB Algorithm
        3.2.2.2 The BIRCH Algorithm
        3.2.2.3 The CURE Algorithm
    3.2.3 Density-Based Clustering

Appendix A

In A.1, we discuss interpretation of data mining results from two different aspects: (1) how the derived results represent the entire population and (2) how data mining fits in the overall goals of an
organization. The first is important because data mining is based on a sample of data from a certain population (customers, patients); however, in most cases we are interested in deriving conclusions about the entire population. In A.2, Internet sites that contain time series data sets are referenced, for readers interested in performing temporal data mining research.

A.1 Interpretation of Data Mining Results

Having examined different temporal data representation schemes, let us now consider what the data mining results themselves represent in the larger scheme of an organization’s plan. We will examine two aspects of the interpretation of data mining results.

A.1.1 How Representative Are the Mined Results of the Actual Targeted Population?

Data mining is performed on database data that usually represent samples from a population (e.g., a population of patients, a population of customers). Whatever the data mining operation might be (clustering, classification, etc.), it is based on these samples and not on the entire population. For example, an Internet shopping site has collected data on the age of its shoppers since its launch two months ago. It found that the average age is 29 years. A question that often arises is how different the sample mean is from the population mean. The error in estimating the population mean through the sample mean is known as the standard error (SE) and is given by [Alb06], [Gla05]:

$$SE = \frac{s}{\sqrt{n}}$$

where s is the standard deviation of the sample and n is the size of the sample. Another frequent question is related to the confidence interval for the population mean, which is given by

$$\text{Sample Mean} \pm \text{multiple} \times SE$$

Specifically, the 95% confidence interval for the mean is approximately

$$\text{Sample Mean} \pm \frac{2s}{\sqrt{n}}$$

Another issue that often comes into focus, when one uses samples to make estimates about an underlying population, is the kind of assumptions one makes about the population. A common assumption is that of normality, i.e., that the underlying data distribution is normal. Consider the following example: A Web site owner has devised a classification scheme, according to which users of the site are placed in two classes: (1) Class A, the user is likely to buy an upgraded membership for the site; (2) Class B, the user is not likely to buy the upgrade. Each user in each class is represented with a feature vector. The owner of the Web site then wants to know whether the mean feature vectors of the two classes are different in a statistically significant way. For this purpose, he or she performs analysis of variance [Gla05], for which it is assumed that the underlying populations are normally distributed. There are several ways one can test for normality:

• Chi-square test for normality (histogram-based) [Alb06], Kolmogorov–Smirnov test (cumulative distribution function-based) [Pre02], Shapiro–Wilk test [Fie09]. The main idea of these tests is to compare the sample data against data of a normal distribution with the same mean and standard deviation.
• Visual inspection of the data and computation of measures, such as skewness and kurtosis (discussed in Chapter 2), which show deviation in the distribution’s shape from a normal distribution [Fie09].

Finally, if one wants to generalize the results of regression to an entire population, a number of assumptions must hold. Besides the assumptions discussed in Chapter 4, such as normally distributed errors, other assumptions must be checked, including that the independent variables are uncorrelated with external variables and that the relationship is linear. For further discussion on this topic, see [Fie09].

A.1.2 What Is the Goal of Temporal Data Mining?
The goal of temporal data mining is knowledge discovery. Two common reasons for beginning a knowledge discovery process are (1) prediction and (2) hypothesis testing [Leh08], [Fie09], [Gla05], [Alb06]. Prediction refers to the ability to forecast the behavior of an entity, such as a company stock, and it is examined in detail in the chapter on prediction. Let us now focus on hypothesis testing and see some examples:

1. An Internet company just upgraded its Web site with the goal of attracting more customers. The company believes that if the site averages more than 500 hits/day, then the upgrade was successful.
2. A drug company wants to confirm the hypothesis that patients on a new drug that treats heart arrhythmia have a normal ECG on average after being on the drug for two months.

Let us look at the first example and assume that the data collected during the first 15 days of the new site are as follows:

Site visits/day: 490, 550, 400, 600, 632, 400, 765, 578, 467, 623, 534, 577, 645, 456, 589

The hypothesis that the company wants to confirm is that the mean of the site visits/day is greater than 500. We will call this the alternative hypothesis. The null hypothesis (or status quo) is that the mean of the site visits/day is ≤ 500. The mean and standard deviation of the site visits/day above are as follows:

Mean = 553.7, s.d. = 99

The mean is indeed greater than 500. However, we do not know whether this result is statistically significant. In other words, the company wants to know whether this mean is an accurate estimate of the mean of site visits/day for the entire time (until the next site upgrade). To express this in statistical terms, the company wants to have 95% confidence in the confirmation of the alternative hypothesis. In other words, the significance level, α, is 0.05 (α = 1 − 0.95). Having defined the confidence level, we can run a statistical test known as Student’s t-test [Kac86], [Alb06], which gives us the following results:

t-test value = 2.1
p-value = 0.0271
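As a hedged illustration, the mean, standard deviation, and t-test value quoted for the site-visit data can be recomputed with a short Java sketch. The class and method names (OneSampleTStatistic, tStatistic, and so on) are illustrative and do not come from the book. The sketch uses the sample standard deviation (dividing by n − 1) and stops at the t statistic: the p-value requires the cumulative t distribution, which the Java standard library does not provide.

```java
public class OneSampleTStatistic {
    // Site visits/day for the first 15 days, copied from the example in A.1.2.
    public static final double[] VISITS = {
        490, 550, 400, 600, 632, 400, 765, 578, 467, 623, 534, 577, 645, 456, 589
    };

    // Arithmetic mean of the sample.
    public static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Sample standard deviation (divides by n - 1).
    public static double sampleStdDev(double[] x) {
        double m = mean(x);
        double sumSq = 0.0;
        for (double v : x) {
            sumSq += (v - m) * (v - m);
        }
        return Math.sqrt(sumSq / (x.length - 1));
    }

    // Standard error of the mean: s / sqrt(n).
    public static double standardError(double[] x) {
        return sampleStdDev(x) / Math.sqrt(x.length);
    }

    // One-sample t statistic against the hypothesized mean mu0.
    public static double tStatistic(double[] x, double mu0) {
        return (mean(x) - mu0) / standardError(x);
    }

    public static void main(String[] args) {
        System.out.printf("mean = %.1f, s.d. = %.1f, SE = %.2f, t = %.2f%n",
                mean(VISITS), sampleStdDev(VISITS),
                standardError(VISITS), tStatistic(VISITS, 500.0));
    }
}
```

Running main prints values that agree with the rounded figures in the text (a mean of about 553.7, a standard deviation of about 99, and a t value of about 2.1).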
The t-test value indicates how many standard errors the sample mean is from the hypothesized population mean (here, 500). The p-value indicates the probability of obtaining a t-test statistic at least this large by chance if the null hypothesis were true. The smaller this probability, the more unlikely the null hypothesis and the more likely the alternative hypothesis (the one we want to prove). Specifically, if the p-value is less than the significance level, the null hypothesis is rejected in favor of the alternative hypothesis. In our case the p-value is indeed less than the significance level (0.05), which means that the company can conclude, at the 95% confidence level, that the mean of site visits/day is > 500. However, if we had said that we wanted to be 99% confident, then the significance level would have been 0.01 and the p-value would have been greater than the significance level, so the null hypothesis could not have been rejected.

Let us now look at the second example. Let us assume that each ECG is represented by the following features: wavelet coefficients and fractal dimension. Here the hypothesis that the drug company wants to confirm is that the ECG of a patient on the new drug has all the characteristics of a normal ECG. To do this, the company performs an initial clinical trial with 20 patients, for whom it collects ECG data and measures the wavelet coefficients and fractal dimension. Then the features are normalized using the min-max normalization method. Finally, the Euclidean distance between each ECG’s features and the corresponding features of a guideline ECG is computed. The doctor in charge of the trial has decided that if the Euclidean distance between the patient ECG’s features and the guideline ECG’s features is less than 0.5, then the patient ECG can be considered normal. Therefore, the alternative hypothesis (the one the doctor wants to confirm) is that the mean Euclidean distance between the patients’ ECGs and the guideline ECG is less than 0.5. Below are the Euclidean distance data for the 20 patients:

Euclidean distance data: 0.2, 0.3, 0.4, 0.5, 0.6, 0.3, 0.5, 0.3, 0.3, 0.4, 0.2, 0.5, 0.4, 0.6, 0.3, 0.4, 0.5, 0.2, 0.3, 0.4

The mean of these data is 0.38 and the standard deviation is 0.12. The mean is indeed less than 0.5. However, to confirm that this result is representative of the entire patient population, we must perform a t-test. The doctor chooses a significance level of 0.01. The results of this test are as follows:

t-test value = –4.328
p-value = 0.0002

Because the p-value is less than 0.01, we can conclude, at the 99% confidence level, that the patient population using this new drug will have a normal-looking ECG after two months on the drug.

A.2 Internet Sites with Time Series Data

A.2.1 Time Series Data for Classification/Clustering
http://www.cs.ucr.edu/~eamonn/time_series_data/
This site contains a diverse set of time series data appropriate for classification/clustering purposes. The number of classes in each time series is given.

A.2.2 Diverse Time Series Data
http://kdd.ics.uci.edu/summary.data.type.html
This site contains eight data sets of diverse nature.

A.2.3 Physiological Data
http://www.physionet.org/physiobank/database/
This site contains a variety of physiological signals, such as ECG signals and gait signals.

A.2.4 List of Data Set Sites
http://www.kdnuggets.com/datasets/
This site contains references to many sites that contain data sets for data mining, including temporal data mining.

References
[Alb06] Albright, S.C., W.L. Winston, and C. Zappe, Data Analysis & Decision Making, Thomson Higher Education, 2006.
[Fie09] Field, A., Discovering Statistics Using SPSS, 3rd edition, Sage Publishing, 2009.
[Gla05] Glantz, S., Primer of Biostatistics, 6th edition, McGraw-Hill Medical, 2005.
[Kac86] Kachigan, S.K., Statistical Analysis: An Interdisciplinary Introduction to Univariate and Multivariate Methods, Radius Press, 1986.
[Leh08] Lehman, E.L. and J.P. Romano, Testing Statistical Hypotheses, Springer, 2008.
[Pre02] Press, W.H., S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical
Recipes in C, 2nd edition, Cambridge University Press, 2002.

Appendix B

To the best of the author’s knowledge, the programs work (after the appropriate database driver information is entered). However, runtime or compilation errors cannot be excluded.

Chapter Programs

Program for the implementation of the before temporal relationship
Note: It uses an Oracle driver

    import java.sql.*;
    import java.io.*;
    import java.text.*;
    import java.net.*;
    // import your driver here
    // The program checks whether patient Jones was released
    // before Smith
    import oracle.jdbc.driver.*;

    // Author: Theophano Mitsa
    public class DateComp {
        static String url = "Your database's url here";
        Connection connection;
        Statement statement;

        DateComp() {
            connection = null;
            statement = null;
        }

        public void initialize() {
            try {
                DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
                connection = DriverManager.getConnection(url);
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        public void query() {
            try {
                Statement statement = connection.createStatement();
                String sqlString = "SELECT RELEASE_DATE FROM PATIENTS WHERE FIRSTNAME='Ed' AND LASTNAME = 'Jones' ";
                ResultSet rs = statement.executeQuery(sqlString);
                rs.next(); // move the cursor to the first row before reading it
                Date date1 = rs.getDate("RELEASE_DATE");
                String sqlString2 = "SELECT RELEASE_DATE FROM PATIENTS WHERE FIRSTNAME = 'John' AND LASTNAME= 'Smith' ";
                ResultSet rs2 = statement.executeQuery(sqlString2);
                rs2.next(); // move the cursor to the first row before reading it
                Date date2 = rs2.getDate("RELEASE_DATE");
                if (date1.before(date2)) {
                    System.out.println("Ed Jones was released before John Smith");
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        public void close() {
            try {
                connection.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        public static void main(String arg[]) {
            DateComp t1 = new DateComp();
            t1.initialize();
            t1.query();
            t1.close();
        }
    }

Program for the implementation of a conversion of anchored data to an interval
Note: It uses an Oracle driver

    import java.sql.*;
    import java.io.*;
    import java.text.*;
    import java.net.*;
    import java.util.*;
    // import your driver here
    import oracle.jdbc.driver.*;

    // Author: Theophano Mitsa
    // The program checks whether two patients stayed in the
    // hospital the same number of days. It assumes that
    // the patients were hospitalized the same
    // year
    public class TempConv {
        static String url = "Your database's url here";
        Connection connection;
        Statement statement;

        TempConv() {
            connection = null;
            statement = null;
        }

        public void initialize() {
            try {
                DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
                connection = DriverManager.getConnection(url);
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        public void query() {
            try {
                Statement statement = connection.createStatement();
                String sqlString1 = "SELECT RELEASE_DATE FROM PATIENTS WHERE FIRSTNAME='Ed' AND LASTNAME = 'Jones' ";
                ResultSet rs1 = statement.executeQuery(sqlString1);
                rs1.next(); // move the cursor to the first row before reading it
                java.sql.Date releaseDate1 = rs1.getDate("RELEASE_DATE");
                String sqlString2 = "SELECT ADMISSION_DATE FROM PATIENTS WHERE FIRSTNAME='Ed' AND LASTNAME = 'Jones' ";
                ResultSet rs2 = statement.executeQuery(sqlString2);
                rs2.next();
                java.sql.Date admissionDate1 = rs2.getDate("ADMISSION_DATE");
                String sqlString3 = "SELECT RELEASE_DATE FROM PATIENTS WHERE FIRSTNAME = 'John' AND LASTNAME= 'Smith' ";
                ResultSet rs3 = statement.executeQuery(sqlString3);
                rs3.next();
                java.sql.Date releaseDate2 = rs3.getDate("RELEASE_DATE");
                String sqlString4 = "SELECT ADMISSION_DATE FROM PATIENTS WHERE FIRSTNAME = 'John' AND LASTNAME= 'Smith' ";
                ResultSet rs4 = statement.executeQuery(sqlString4);
                rs4.next();
                java.sql.Date admissionDate2 = rs4.getDate("ADMISSION_DATE");
                // Convert to unanchored data
                Calendar c1 = Calendar.getInstance();
                c1.setTime(releaseDate1);
                Calendar c2 = Calendar.getInstance();
                c2.setTime(admissionDate1);
                Calendar c3 = Calendar.getInstance();
                c3.setTime(releaseDate2);
                Calendar c4 = Calendar.getInstance();
                c4.setTime(admissionDate2);
                int noOfDay1 = Math.abs(c1.get(Calendar.DAY_OF_YEAR) - c2.get(Calendar.DAY_OF_YEAR));
                int noOfDay2 = Math.abs(c3.get(Calendar.DAY_OF_YEAR) - c4.get(Calendar.DAY_OF_YEAR));
                if (noOfDay1 == noOfDay2) {
                    System.out.println("The patients stayed in the hospital an equal number of days");
                } else {
                    System.out.println("The patients stayed in the hospital an unequal number of days");
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        public void close() {
            try {
                connection.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        public static void main(String arg[]) {
            TempConv t1 = new TempConv();
            t1.initialize();
            t1.query();
            t1.close();
        }
    }

XML file that contains the ontological description of the geologic eras

    Era        Period       BeginDate   EndDate
    Cenozoic   Quaternary   1.8         0.0
    Cenozoic   Neogene      24.0        1.8
    Cenozoic   Paleogene    65.0        24.0
    Mesozoic   Cretaceous   146.0       65.0
    Mesozoic   Jurassic     208.0       146.0
    Mesozoic   Triassic     245.0       208.0

Program for the parsing of the XML ontology file and extraction of temporal information. The program prints out the begin and end dates for the Mesozoic era and the Jurassic period.

    // Author: Theophano Mitsa
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class Parser {
        Document dom;

        public Parser() {
        }

        public void doParsing() {
            // parse the xml file and get the dom object
            parseXmlDoc();
            // get the elements out of the dom object
            obtainElements();
        }

        private void parseXmlDoc() {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            try {
                DocumentBuilder db = dbf.newDocumentBuilder();
                // get DOM representation of the XML file
                dom = db.parse("ontology.xml");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        private void obtainElements() {
            // get the root element
            Element docEle = dom.getDocumentElement();
            float maxBeginDate = 0.0f, minEndDate = 100.0f, Ju_BeginDate = 0, Ju_EndDate = 0;
            // get the Period elements
            NodeList nl = docEle.getElementsByTagName("Period");
            if (nl != null && nl.getLength() > 0) {
                for (int i = 0; i < nl.getLength(); i++) {
                    Element el = (Element) nl.item(i);
                    // Find the date range for the Jurassic period
                    String name = getText(el, "Name");
                    if (name.equals("Jurassic")) {
                        Ju_BeginDate = getFloatValue(el, "BeginDate");
                        Ju_EndDate = getFloatValue(el, "EndDate");
                    }
                    // Find the date range for the Mesozoic era
                    String type = el.getAttribute("parent");
                    if (type.equals("Mesozoic")) {
                        if (maxBeginDate < getFloatValue(el, "BeginDate")) {
                            maxBeginDate = getFloatValue(el, "BeginDate");
                        }
                        if (minEndDate > getFloatValue(el, "EndDate")) {
                            minEndDate = getFloatValue(el, "EndDate");
                        }
                    }
                }
                System.out.println("For the Jurassic period, the BeginDate is " + Ju_BeginDate + " and the EndDate is " + Ju_EndDate);
                System.out.println("For the Mesozoic era, the BeginDate is " + maxBeginDate + " and the EndDate is " + minEndDate);
            }
        }

        private String getText(Element ele, String tagName) {
            String text = null;
            NodeList nl = ele.getElementsByTagName(tagName);
            if (nl != null && nl.getLength() > 0) {
                Element el = (Element) nl.item(0);
                text = el.getFirstChild().getNodeValue();
            }
            return text;
        }

        private float getFloatValue(Element ele, String tagName) {
            return Float.parseFloat(getText(ele, tagName));
        }

        public static void main(String[] args) {
            Parser p = new Parser();
            p.doParsing();
        }
    }
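The interval-conversion program above compares DAY_OF_YEAR values, so, as its own comments note, it assumes that both dates fall in the same year. A minimal sketch of one way to lift that assumption uses the java.time API (Java 8 and later), where ChronoUnit.DAYS.between counts whole days between two dates regardless of year boundaries. This is not the author's method: the class name StayLength, the method daysInHospital, and the sample dates are hypothetical. A java.sql.Date read from a ResultSet could be converted with its toLocalDate() method before being passed in.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class StayLength {
    // Length of a hospital stay in whole days; works across year boundaries.
    public static long daysInHospital(LocalDate admission, LocalDate release) {
        return ChronoUnit.DAYS.between(admission, release);
    }

    public static void main(String[] args) {
        // Hypothetical dates, chosen so that the stay spans a year boundary.
        LocalDate admitted = LocalDate.of(2009, 12, 28);
        LocalDate released = LocalDate.of(2010, 1, 4);
        System.out.println("Days in hospital: " + daysInHospital(admitted, released));
    }
}
```

With the sample dates, a DAY_OF_YEAR difference would report 358 days, whereas the day count between the two dates is 7.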