Document: Data Preparation for Data Mining - P17 ppt


include points that should otherwise be excluded. Or again, in the nearest-neighbor methods, neighborhoods were unbalanced.

How does preparation help? Figure 12.6 shows the data range normalized in state space on the left. The data with both range and distribution normalized is shown on the right. The range-normalized and redistributed space is a “toy” representation of what full data preparation accomplishes. This data is much easier to characterize: manifolds are more easily fitted, cluster boundaries are more easily found, neighbors are more neighborly. The data is simply easier to access and work with. But what real difference does it make?

Figure 12.6 Some of the effects of data preparation: normalization of data range (left), and normalization and redistribution of data set (right).

12.3.1 Neural Networks and the CREDIT Data Set

The CREDIT data set is a derived extract from a real-world data set. Full data preparation and surveying enable the miner to build reasonable models, reasonable in terms of addressing the business objective. But what does data preparation alone achieve in this data set? In order to demonstrate that, we will look at two models of the data: one on prepared data, and the other on unprepared data.

Any difficulty in showing the effect of preparation alone is due to the fact that with ingenuity, much better models can be built with the prepared data in many circumstances than with the data unprepared. All this demonstrates, however, is the ingenuity of the miner! To try to “level the playing field,” as it were, for this example the neural network models will use all of the inputs, have the same number of nodes in the hidden layer, and will use no extracted features. There is no change in network architecture for the prepared and unprepared data sets. Thus, this uses no knowledge gleaned from either the data assay or the data survey.
Much, if not most, of the useful information discovered about the data set, and how to build better models, is simply discarded so that the effect of the automated techniques is most easily seen. The difference between the “unprepared” and “prepared” data sets is, as nearly as can be, only that provided by the automated preparation accomplished by the demonstration code.

Now, it is true that a neural network cannot take the data from the CREDIT data set in its raw form, so some preparation must be done. Strictly speaking, then, there is no such thing, for a neural network, as modeling unprepared data. What then is a fair preparation method to compare with the method outlined in this book?

StatSoft is a leading maker of statistical analysis software. Their tools reflect statistical state-of-the-art techniques. In addition to a statistical analysis package, StatSoft makes a neural network tool that uses statistical techniques to prepare data for the neural network. Their data preparation is automated and invisible to the modeler using their neural network package. So the “unprepared” data in this comparison is actually prepared by the statistical preparation techniques implemented by StatSoft. The “prepared” data set is prepared using the techniques discussed in this book.

Naturally, a miner using all of the knowledge and insights gleaned from the data using the techniques described in the preceding chapters should, using either preparation technique, be able to make a far better model than that produced by this naive approach. The object is to attempt a direct fair comparison to see the value of the automated data preparation techniques described here, if any.

As shown in Figure 12.7, the neural network architecture selected takes all of the inputs, passes them to six nodes in the hidden layer, and has one output to predict: BUYER. Both networks were trained for 250 epochs.
Because this is a neural network, the data set was balanced to be a 50/50 mix of buyers and nonbuyers.

Figure 12.7 Architecture of the neural network used in modeling both the prepared and unprepared versions of the CREDIT data set predicting BUYER. It is an all-input, six-hidden-node, one-output, standard back-propagation neural network.

Figure 12.8 shows the result of training on the unprepared data. The figure shows a number of interesting features. To facilitate training, the instances were separated into training and verification (test) data sets. The network was trained on the training data set, and errors in both the training and verification data sets are shown in the “Training Error Graph” window. This graph shows the prediction errors made in the training set on which the network learned, and also shows the prediction errors made in the verification data set, which the network was not looking at, except to make this prediction. The lower, fairly smooth line is the training set error, while the upper jagged line shows the verification set error.

Figure 12.8 Errors in the training and verification data sets for 250 epochs of training on the unprepared CREDIT data set predicting BUYER. Before the network has learned anything, the error in the verification set is near its lowest at 2, while the error in the training set is at its highest. After about 45 epochs of training, the error in the training set is low and the error in the verification set is at its lowest (about 50% error) at 1.

As the training set was better learned, so the error rate in the training set declined. At first, the underlying relationship was truly being learned, so the error rate in the verification data set declined too. At some point, overtraining began, and the error in the training data set continued to decline but the error in the verification data set increased. At that point, the network was learning noise.
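The pattern just described (training error still falling while verification error rises) can be checked mechanically once the per-epoch errors are recorded. The following is a minimal Python sketch, not the StatSoft tool's method: the function names and the error values are hypothetical and are not read from Figure 12.8.

```python
from typing import List, Tuple


def best_epoch(verification_errors: List[float]) -> Tuple[int, float]:
    """Return (1-based epoch, error) for the lowest verification error."""
    best_index = min(range(len(verification_errors)),
                     key=lambda i: verification_errors[i])
    return best_index + 1, verification_errors[best_index]


def overtraining_epochs(training_errors: List[float],
                        verification_errors: List[float]) -> List[int]:
    """Return 1-based epochs where training error fell but verification
    error rose compared with the previous epoch."""
    flagged = []
    for i in range(1, len(training_errors)):
        training_fell = training_errors[i] < training_errors[i - 1]
        verification_rose = verification_errors[i] > verification_errors[i - 1]
        if training_fell and verification_rose:
            flagged.append(i + 1)
    return flagged


# Hypothetical error curves for illustration only (assumed values).
training = [0.90, 0.70, 0.55, 0.45, 0.40, 0.36]
verification = [0.80, 0.65, 0.52, 0.50, 0.53, 0.57]

print(best_epoch(verification))                     # (4, 0.5)
print(overtraining_epochs(training, verification))  # [5, 6]
```

A simple minimum like this would also pick up a chance low value in the very first epochs, so in practice the training error at the chosen epoch should be inspected as well.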
In this particular example, in the very early epochs, long before the network actually learned anything, the lowest error rate in the verification data set was discovered! This is happenstance due to the random nature of the network weights. At the same time, the error rate in the training set was at its highest, so nothing of value was learned by then. Looking at the graph shows that as learning continued, after some initial jumping about, the relationship in the verification data set was at its lowest after about 45 epochs. The error rate at that point was about 0.5. This is really a very poor performance, since 50% is exactly the same as random guessing! Recall that the balanced data set has 50% of buyers and nonbuyers, so flipping a fair coin provides a 50% accuracy rate. It is also notable that the error rate in the training data set continued to fall so that the network continued to learn noise. So much then for training on the “unprepared” data set.

The story shown for the prepared data set in Figure 12.9 is very different! Notice that the highest error level shown on the error graph here is about 0.55, or 55%. In the previous figure, the highest error shown was about 90%. (The StatSoft window scales automatically to accommodate the range of the graph.) In this graph, three things are very notable. First, the training and verification errors declined together at first, and are by no means as far separated as they were before. Second, error in the verification declined for more epochs than before, so learning of the underlying relationship continued longer. Third, the prediction error in the verification data set fell much lower than in the unprepared data set. After about 95 epochs, the verification error fell to 0.38, or a 38% error rate. In other words, with a 38% error rate, the network made a correct prediction 62% of the time, far better than random guessing!

Figure 12.9 Training errors in the prepared data set for identical conditions as before.
Minimum error is shown at 1.

Using the same network, on the same data set, and training under the same conditions, data prepared using the techniques described here performed 25% better than either random guessing or a network trained on data prepared using the StatSoft-provided, statistically based preparation techniques. A very considerable improvement!

Also of note in comparing the performance of the two data sets is that the training set error in the prepared data did not fall as low as in the unprepared data. In fact, from the slope and level of the training set error graphs, it is easy to see that the network training in the prepared data resisted learning noise to a greater degree than in the unprepared data set.

12.3.2 Decision Trees and the CREDIT Data Set

Exposing the information content seems to be effective for a neural network. But a decision tree uses a very different algorithm. It not only slices state space, rather than fitting a function, but it also handles the data in a very different way. A tree can digest unprepared data, and also is not as sensitive to balancing of the data set as a network. Does data preparation help improve performance for a decision tree? Once again, rather than extracting features or using any insights gleaned from the data survey, and taking the CREDIT data set as it comes, how does a decision tree perform?

Two trees were built on the CREDIT data set, one on prepared data, and one on unprepared data. The tree used was KnowledgeSEEKER from Angoss Software. All of the defaults were used in both trees, and no attempt was made to optimize either the model or the data. In both cases the trees were constructed automatically. Results? The data was again divided into training and test partitions, and again BUYER was the prediction variable. The trees were built on the training partitions and tested on the test partitions. Figure 12.10 shows the results.
The upper image shows the Error Profile window from KnowledgeSEEKER for the unprepared data set. In this case the accuracy of the model built on unprepared data is 81.8182%. With prepared data the accuracy rises to 85.8283%. This represents approximately a 5% improvement in accuracy. However, the misclassification rate improves from 0.181818 to 0.141717, which is an improvement of better than 20%. For decision trees, at least in this case, the quality of the model produced improves simply by preparing the data so that the information content is best exposed.

Figure 12.10 Training a tree with Angoss KnowledgeSEEKER on unprepared data shows an 81.8182% accuracy on the test data set (top) and an 85.8283% accuracy in the test data for the prepared data set (bottom).

12.4 Practical Use of Data Preparation and Prepared Data

How does a miner use data preparation in practice? There are three separate issues to address. The first part of data preparation is the assay, described in Chapter 4. Assaying the data to evaluate its suitability and quality usually reveals an enormous amount about the data. All of this knowledge and insight needs to be applied by the miner when constructing the model. The assay is an essential and inextricable part of the data preparation process for any miner. Although there are automated tools available to help reveal what is in the data (some of which are provided in the demonstration code), the assay requires a miner to apply insight and understanding, tempered with experience.

Modeling requires the selection of a tool appropriate for the job, based on the nature of the data available. If in doubt, try several! Fortunately, the prepared data is easy to work with and does not require any modification to the usual modeling techniques.

When applying constructed models, if an inferential model is needed, data extracts for training, test, and evaluation data sets can be prepared and models built on those data sets.
For any continuously operating model, the Prepared Information Environment Input and Output (PIE-I and PIE-O) modules must be constructed to “envelop” the model so that live data is dynamically prepared, and the predicted results are converted back into real-world values.

All of these briefly touched-on points have been more fully discussed in earlier chapters. There is a wealth of practical modeling techniques available to any miner, far more than the number of tools available. Even a brief review of the main techniques for building effective models is beyond the scope of the present book. Fortunately, unlike data preparation and data surveying, much has been written about practical data modeling and model building. However, there are some interesting points to note about the state of current modeling tools.

12.5 Looking at Present Modeling Tools and Future Directions

In every case, modern data mining modeling tools are designed to attempt two tasks. The first is to extract interesting relationships from a data set. The second is to present the results in a form understandable to humans. Most tools are essentially extensions of statistical techniques. The underlying assumption is that it is sufficient to learn to characterize the joint frequencies of occurrence between variables. Given some characterization of the joint frequency of occurrence, it is possible to examine a multivariable input and estimate the probability of any particular output. Since full, multivariable joint frequency predictors are often large, unwieldy, and slow, the modeling tool provides some more compact, faster, or otherwise modified method for estimating the probability of an output. When it works, which is quite often, this is an effective method for producing predictions, and also for exploring the nature of the relationships between variables.

However, no such methods directly try to characterize the underlying relationship driving the “process” that produces the values themselves.
For instance, consider a string of values produced from sequential calls to a pseudo-random number generator. There are many algorithms for producing pseudo-random numbers, some more effective at passing tests of randomness than others. However, no method of characterizing the joint frequency of the “next” number from the “previous” numbers in the pseudo-random series is going to do much toward producing worthwhile predictions of “next” values. If the pseudo-random number generator is doing its job, the “next” number is essentially unpredictable from a study of any amount of “previous” values. But since the “next” number is, in fact, being generated by a precisely determined algorithm, there is present a completely defined relationship between any “previous” number and the “next.”

It is in this sense that joint frequency estimators are not looking for the underlying relationships. The best that they can accomplish is to characterize the results of some underlying process. There are tools, although few commercially available, that actually seek to characterize the underlying relationships and that can actually make a start at estimating, for instance, a pseudo-random number generation function. However, this is a very difficult problem, and most of the recent results from such tools are little different in practical application than the results of the joint frequency estimators. However, tools that attempt to explore the actual underlying nature of the relationships, rather than to just predict their effects, hold much promise for deeper insights.

Another new direction for modeling tools is that of knowledge schema. Chapter 11 discussed the difference between the information embedded in a data set alone, and that embodied in the associated dictionary. A knowledge schema is a method of representing the dictionary information extracted and inferred from one or more data sets. Such knowledge schema represent a form of “understanding” of the data.
Simple knowledge schema are currently being built from information extracted from data sets and offer powerful new insights into data, considerably beyond that of a mining or modeling tool that does not extract and apply such schema.

As powerful as they are, presently available modeling tools are at a first-generation level. Tools exploring and characterizing underlying relationships, and those building knowledge schema, for instance, represent the growth to a second generation of tools as astoundingly more powerful and capable as modern tools are than the statistical methods that preceded them.

Data preparation too is at a first-generation level. The tools and techniques presented in this book for data preparation represent the best that can be done today with automated general-purpose data preparation and represent advanced first-generation tools. Just as there are exciting new directions and tremendous advances in modeling tools on the horizon, so too there are new advances and techniques on the horizon for automated data preparation. Such future directions will lead to powerful new insights and understanding of data, and ultimately will enhance our understanding of, and enable us to develop, practically applicable insights into the world.

12.5.1 Near Future

Data preparation looked at here has dealt with data in the form collected in mainly corporate databases. Clearly this is where the focus is today, and it is also the sort of data on which data mining tools and data modeling tools focus. The near future will see the development of automated data preparation tools for series data. Approaches for automated series data preparation are already moving ahead and could even be available shortly after this book reaches publication.

Continuous text is a form of data that is not dealt with in this book and presents very different types of problems to those discussed here. Text can be characterized in various ways.
There is writing of all types, from business plans to Shakespeare’s plays. Preparation of text in general is well beyond the near future, but mining of one particular type of text, and preparation of that type of text for mining, is close. This is the type of text that can be described as thematic, discursive text. Thematic means that it has a consistent thread to the subject matter that it addresses. Discursive means that it passes from premises to conclusions. Text is discursive as opposed to intuitive. This book is an example of thematic, discursive text, as are sales reports, crime reports, meeting summaries, many memoranda and e-mail messages, even newspaper and news magazine articles. Such text addresses a limited domain and presents facts, interpretations of facts, deductions from facts, and the relationships and inferences between facts. It also may express a knowledge structure. Looking at data of this sort in text form is called text mining.

Much work is being done to prepare and mine such data, and there are already embryonic text mining tools available. Clearly, there is a huge range of types of even thematic, discursive text, and not all of it will soon be amenable to preparation and mining! But there are already tools, techniques, and approaches that promise success and are awaiting only the arrival of more computer power to become effective.

The near future for data preparation includes fully automated information exposure for the types of data discussed in this book, including series data, and adding thematic, discursive text preparation.

12.5.2 Farther Out

The Internet and World Wide Web present formidable challenges. So far there isn’t even any beginning to classifying the types of data available on the Internet, let alone devising strategies for mining it! The Internet is not just a collection of text pages.
It includes such a wealth and variety of information forms that they are almost impossible to list, let alone characterize: sound, audio-visual, HTML, ActiveX objects, and on and on. All of these represent a form of information communication. Computers “understand” them enough to represent them in human-understandable form, and humans can extract sense and meaning from them. Automating the mining of such data presents a formidable challenge. Yet we will surely begin the attempt.

A truly vast amount of information is enfolded in the Internet. It is arguably the repository for what we have learned about the nature of the world as well as a forum, meeting place, library, and market. It may turn out to be a development as revolutionary in terms of human development as the invention of agriculture. It is a new form of enterprise in human experience, and it is clearly beyond the scope of understanding by any individual human being. Automated discovery, cognition, and understanding are obviously going to be applied to this enormous resource. Already terms such as “web farming” are in current use as we try to come to grips with the problem.

There are many other forms of data that will benefit from automated “understanding” and so will need preparation to best expose its information content. These range from continuous speech to CAT (computerized axial tomography) scans. Just as the tools of the industrial revolution represent tools to enhance and extend our physical capabilities, so computers represent the beginnings of tools to extend the reach of our understanding. Whatever sort of data it is that we seem to understand, correct preparation of that data, as well as preparation of the miner, will always pay huge dividends.

Appendix A: Using the Demonstration Code on the CD-ROM

Data Format

The data must be in a comma-delimited ASCII file. The first line of the file contains the comma-delimited variable names.
DP will automatically type the variables as numerical if the first value contains a number or categorical if the first value starts with an alphabetic character. Various control flags can be added after the variable name in the form VarName<F>. The available flags are

N  Treat the variable as numerical. Values containing nonnumeric characters will be treated as nulls.
C  Treat the variable as categorical.
P  Treat the variable as the dependent or prediction variable.
X  Exclude the variable from analysis.
K  Key variable: exclude the variable from analysis but copy to the output file.

Control Variables

The operation of the DP program is controlled by an ASCII file containing names of control variables and their values: one control variable per line, separated by spaces from its value. Comments can be included by preceding them with a double slash (//).

IPFileName  Input file name
OPFileName  Output file name
ConfLevelSmpl  Confidence level to stop sampling
ConfLevelDrop  Confidence level to drop a variable
ConfLevelNum  Confidence level for numerating categoricals
CatCntDrop  Categorical count to Sample count drop ratio. That is, if [...]

... located in c:\data; and the control file, ctrl, is also in c:\data. To run dp10, make the c:\data directory the current directory and enter

c:\dos\dp10.exe ctrl

Sample Control File

Actual control file comments in Letter Gothic font. Comments that are not included in italics.

IPFileName cars.dat
OPFileName carDP
ConfLevelSmpl 95  Confidence level for sampling...

... how to think about interpreting the results of investigations into data. It is statistically based, but the discussions apply equally well to the results of data mining. There is little mathematics in the discussion. Useful to business managers, analysts, modelers, and miners.

• Berry, Michael J. A., and Gordon Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: John Wiley...
... level for sampling
ConfLevelDrop 70  Confidence level for dropping variables
ConfLevelNum 95  Confidence level for numerating
CatCntDrop 0.8  If > n x NoInstances drop categorical variable
ReplaceMissing 1  1 = Replace missing 0 = Don’t replace missing
MaxNumDimension 3  Max dimensions when representing categoricals
MaxNumEstimates 6  Max start var count for numerating categoricals
OutputType 2  0 = Only numerate...

... analysts, modelers, and miners.

• Gause, Donald C., and Gerald M. Weinberg. Exploring Requirements: Quality Before Design. New York: Dorset House, 1989. This is a book about exploring requirements for products and systems. Much of what they have to say is easily applicable to exploring requirements for business problems and solutions. Almost all of the points they raise, and the solutions and methods they...

... business managers, analysts, modelers, and miners.

• Jones, Morgan D. The Thinker’s Toolkit: Fourteen Skills for Making Smarter Decisions in Business and in Life. New York: Times Business Books, 1995. The jacket description for this book introduces it as “a unique collection of proven, practical methods for simplifying any problem and making faster, better decisions every time.” All of these decision-making...

... Co., 1996. This is a nonmathematical treatment of an approach to thinking about the problems, use, and applicability of understanding information enfolded in data. It is not so much about statistics as about introducing the issues in thinking about how to understand data. Almost all of the issues raised and discussed are applicable to mining and modeling. Useful to business managers, analysts, modelers,...
... BASIC. Although it does not discuss sparsely connected networks or autoassociative neural networks, the general structure of back-propagation neural networks is well presented. The demonstration code for data preparation, while not based on the code discussed in the book, uses the same constant and parameter names, and some of the same structure as the network in this book, so that the transition into...

... New York: John Wiley & Sons, 1997. This book provides a conceptual overview of various data mining techniques, looking at how each actually works at a conceptual level. This is an almost entirely nonmathematical treatment. The examples discussed are mainly business oriented.

• Deming, William E. Statistical Adjustment of Data. New York: Dover Publications, 1984. Deming, William E. Some Theory of Sampling. New...

... Sampling. New York: Dover Publications, 1984. These two books were originally written in 1938 and 1950, respectively. William Edwards Deming gave a great deal of thought to the problems of collecting data from, and using data in, the real world. These books are mathematical in their treatment of the problem and are statistically based. Many issues and problems are raised in these works that are not yet amenable...

... communication (information) theory was originally published as a paper in 1948 and as a book in 1949. Dr. Warren Weaver’s introduction is very readable. Claude Shannon’s paper, although mathematical, is a model of clarity. A huge number of books have been written in the intervening 50 years, and the theory has been much extended, but this book still holds its own.

• Smith, Murray. Neural Networks for Statistical
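The file conventions described in Appendix A (header fields of the form VarName<F>, and a control file with one name and value per line plus // comments) can be read with a few lines of Python. This is a hedged sketch only, not the DP program's own parser: the function names are hypothetical, the header variable names other than BUYER are invented for illustration, and it assumes flags are single uppercase letters and that everything after the first run of spaces is the value.

```python
import re
from typing import Dict, Tuple


def parse_header_field(field: str) -> Tuple[str, str]:
    """Split 'VarName<F>' into (name, flag); flag is '' when absent.

    Assumes F is one of the documented letters N, C, P, X, K.
    """
    match = re.fullmatch(r"\s*([^<>]+?)\s*(?:<([NCPXK])>)?\s*", field)
    if match is None:
        raise ValueError(f"unrecognized header field: {field!r}")
    return match.group(1), match.group(2) or ""


def parse_control_file(text: str) -> Dict[str, str]:
    """Read 'Name value' lines, ignoring blank lines and // comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("//", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        parts = line.split(None, 1)                # name, then the rest
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


# Hypothetical header line; only BUYER appears in the text above.
header = "ID<K>,AGE,REGION<C>,BUYER<P>"
fields = [parse_header_field(f) for f in header.split(",")]
print(fields)

# Values taken from the sample control file; the comments are illustrative.
control_text = """
// sample control file
IPFileName cars.dat
OPFileName carDP
ConfLevelSmpl 95   // confidence level for sampling
ConfLevelDrop 70
"""
settings = parse_control_file(control_text)
print(settings)
```

Values are kept as strings here because the appendix does not state a type for each control variable; converting them is left to the caller.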

Posted: 15/12/2013, 13:15
