Data Preparation for Data Mining - P13
system that can affect the outcome. The more of them there are, the more likely it is that, purely by happenstance, some particular, but actually meaningless, pattern will show up. The number of variables in a data set, or the number of weights in a neural network, all represent things that can change. So, yet again, high-dimensionality problems turn up, this time expressed as degrees of freedom. Fortunately for the purposes of data preparation, a definition of degrees of freedom is not needed as, in any case, this is a problem previously encountered in many guises. Much discussion, particularly in this chapter, has been about reducing the dimensionality/combinatorial explosion problem (which is degrees of freedom in disguise) by reducing dimensionality. Nonetheless, a data set always has some dimensionality, for if it does not, there is no data set! And having some particular dimensionality, or number of degrees of freedom, implies some particular chance that spurious patterns will turn up. It also has implications about how much data is needed to ensure that any spurious patterns are swamped by valid, real-world patterns. The difficulty is that the calculations are not exact because several needed measures, such as the number of significant system states, while definable in theory, seem impossible to pin down in practice. Also, each modeling tool introduces its own degrees of freedom (weights in a neural network, for example), which may be unknown to the miner.

The ideal, if the miner has access to software that can make the measurements (such as data surveying software), requires use of a multivariable sample determined to be representative to a suitable degree of confidence. Failing that, as a rule of thumb for the minimum amount of data to accept for mining (as opposed to data preparation), use at least twice the number of instances required for a data preparation representative sample. The key is to have enough representative instances of data to swamp the spurious patterns. Each significant system state needs sufficient representation, and having a truly representative sample of data is the best way to assure that.

10.7 Beyond Joint Distribution

So far, so good. Capturing the multidimensional distribution captures a representative sample of data. What more is needed? On to modeling!

Unfortunately, things are not always quite so easy. Having a representative sample in hand is a really good start, but it does not assure that the data set is modelable! Capturing a representative sample is an essential minimum—that, and knowing what degree of confidence is justified in believing the sample to be representative. However, the miner needs a modelable representative sample, and the sample simply being representative of the population may not be enough.

How so? Actually, there are any number of reasons, all of them domain specific, why the minimum representative sample may not suffice—or indeed, why a nonrepresentative sample is needed. (Heresy! All this trouble to ensure that a fully representative sample is collected, and now we are off after a nonrepresentative sample. What goes on here?)

Suppose a marketing department needs to improve a direct-mail marketing campaign. The normal response rate for the random mailings so far is 1.5%. Mailing rolls out, results trickle in. A (neophyte) data miner is asked to improve response.
“Aha!” says the miner, “I have just the thing. I’ll whip up a quick response model, infer who’s responding, and redirect the mail to similar likely responders. All I need is a genuinely representative sample, and I’ll be all set!” With this terrific idea, the miner applies the modeling tools, and after furiously mining, the best prediction is that no one at all will respond! Panic sets in; staring failure in the face, the neophyte miner begins the balding process by tearing out hair in chunks while wondering what to do next.

Fleeing the direct marketers with a modicum of hair, the miner tries an industrial chemical manufacturer. Some problem in the process occasionally curdles a production batch. The exact nature of the process failure is not well understood, but the COO just read a business magazine article extolling the miraculous virtues of data mining. Impressed by the freshly minted data miner (who has a beautiful certificate attesting to skill in mining), the COO decides that this is a solution to the problem. Copious quantities of data are available, and plenty more if needed. The process is well instrumented, and continuous chemical batches are being processed daily. Oodles of data representative of the process are on hand. Wielding mining tools furiously, the miner conducts an onslaught designed to wring every last confession of failure from the recalcitrant data. Using every art and artifice, the miner furiously pursues the problem until, staring again at failure and with desperation setting in, the miner is forced to fly from the scene, yet more tufts of hair flying.

Why has the now mainly hairless miner been so frustrated? The short answer is that while the data is representative of the population, it isn’t representative of the problem. Consider the direct marketing problem. With a response rate of 1.5%, any predictive system has an accuracy of 98.5% if it uniformly predicts “No response here!” Same thing with the chemical batch processing—lots of data in general, little data about the failure conditions. Both of these examples are based on real applications, and in spite of the light manner of introducing the issue, the problem is difficult to solve. The feature to be modeled is insufficiently represented for modeling in a data set that is representative of the population. Yet, if the mining results are to be valid, the data set mined must be representative of the population or the results will be biased, and may well be useless in practice. What to do?

10.7.1 Enhancing the Data Set

When the density of the feature to be modeled is very low, clearly the density of that feature needs to be increased—but in a way that does least violence to the distribution of the population as a whole. Using the direct marketing response model as an example, simply increasing the proportion of responders in the sample may not help. It’s assumed that there are some other features in the sample that actually do vary as response varies. It’s just that they’re swamped by spurious patterns, but only because of their low density in the sample. Enhancing the density of responders is intended to enhance the variability of connected features. The hope is that when enhanced, these other features become visible to the predictive mining tool and, thus, are useful in predicting likely responders. These assumptions are to some extent true. Some performance improvement may be obtained this way, usually more by happenstance than design, however.
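As a minimal sketch of that naive approach (pandas assumed; the file name, column name, and target density are hypothetical choices, not values from the text), boosting responder density by resampling might look like this:

```python
import pandas as pd

# Hypothetical prepared sample: 'responded' marks the ~1.5% of key-feature instances.
sample = pd.read_csv("mailing_sample.csv")
responders = sample[sample["responded"] == 1]
nonresponders = sample[sample["responded"] == 0]

# Naively raise responder density to roughly 20% by resampling responders with replacement.
target_density = 0.20
n_resp = int(len(nonresponders) * target_density / (1 - target_density))
boosted = pd.concat(
    [nonresponders, responders.sample(n=n_resp, replace=True, random_state=0)]
).sample(frac=1, random_state=0)  # shuffle the combined sample
```

As the next paragraphs explain, this brute-force boost also amplifies whatever noise and bias the small responder subsample happens to carry.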
The problem is that low-density features have more than just low-level interactions with other, potentially predictive features. The instances with the low-density feature represent some small proportion of the whole sample and form a subsample—the subsample containing only those instances that have the required feature. Considered alone, because it is so small, the subsample almost certainly does not represent the sample as a whole—let alone the population. There is, therefore, a very high probability that the subsample contains much noise and bias that are in fact totally unrelated to the feature itself, but are simply concomitant to it in the sample taken for modeling. Simply increasing the desired feature density also increases the noise and bias patterns that the subsample carries with it—and those noise and bias patterns will then appear to be predictive of the desired feature. Worse, the enhanced noise and bias patterns may swamp any genuinely predictive feature that is present.

This is a tough nut to crack. It is very similar to any problem of extracting information from noise, and that is the province of information theory, discussed briefly in Chapter 11 in the context of the data survey. One of the purposes of the data survey is to understand the informational structure of the data set, particularly in terms of any identified predictive variables. However, a practical approach to solving the problem does not depend on the insights of the data survey, helpful though they might be. The problem is to construct a sample data set that represents the population as much as possible while enhancing some particular feature.

Feature Enhancement with Plentiful Data

If there is plenty of data to draw upon, instances of data with the desired feature may also be plentiful. This is the case in the first example above. The mailing campaign produces many responses. The problem is their low density as a proportion of the sample. There may be thousands or tens of thousands of responses, even though the response rate is only 1.5%. In such a circumstance, the shortage of instances with the desired feature is not the problem, only their relative density in the mining sample.

With plenty of data available, the miner constructs two data sets, both fully internally representative of the population—except for the desired feature. To do this, divide the source data set into two subsets such that one subset has only instances that contain the feature of interest and the other subset has no instances that contain the feature of interest. Use the already described techniques (Chapter 5) to extract a representative sample from each subset, ignoring the effect of the key feature. This results in two separate subsets, both similar to each other and representative of the population as a whole when ignoring the effect of the key feature. They are effectively identical except that one has the key feature and the other does not. Any difference in distribution between the two subsets is due either to noise, bias, or the effect of the key feature. Whatever differences there are should be investigated and validated whatever else is done, but this procedure minimizes noise and bias since both data sets are representative of the population, save for the effect of the key feature. Adding the two subsets together gives a composite data set that has an enhanced presence of the desired feature, yet is as free from other bias and noise as possible.
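A sketch of this construction follows (pandas assumed; file and column names are hypothetical, and plain random sampling stands in for the representativeness checks of Chapter 5):

```python
import pandas as pd

source = pd.read_csv("mailing_full.csv")
with_feature = source[source["responded"] == 1]      # subset containing the key feature
without_feature = source[source["responded"] == 0]   # subset lacking the key feature

# Extract a sample from each subset, ignoring the key feature itself. In practice each
# sample should be verified as representative to the chosen degree of confidence.
n = 5_000  # illustrative sample size per subset
sample_with = with_feature.sample(n=min(n, len(with_feature)), random_state=0)
sample_without = without_feature.sample(n=min(n, len(without_feature)), random_state=0)

# The composite set carries an enhanced density of the key feature, while each half stays
# representative of the population apart from the effect of that feature.
composite = pd.concat([sample_with, sample_without]).sample(frac=1, random_state=0)
```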
Feature Enhancement with Limited Data

Feature enhancement is more difficult when there is only limited data available. This is the case in the second example of the chemical processor. The production staff bends every effort to prevent the production batch from curdling, which happens only very infrequently. The reasons for the batch failure are not well understood anyway (that is what is to be investigated), so the failure may not be reliably reproducible. Whether possible or not, batch failure is a highly expensive event, hitting directly at the bottom line, so deliberately introducing failure is simply not an option management will countenance. The miner was constrained to work with the small amount of failure data already collected.

Just as when data is plentiful, small subsamples that have the feature of interest are very likely to also carry much noise and bias. Since more data with the key feature is unavailable, the miner is constrained to work with the data at hand. There are several modeling techniques used to extract the maximum information from small subsamples, such as multiway cross-validation on the small feature sample itself, and intersampling and resampling techniques. These techniques do not affect data preparation since they are only properly applied to already prepared data. However, there is one data preparation technique used when data instances with a key feature are particularly low in density: data multiplication.

The problem with low feature-containing instance counts is that the mining tool might learn the specific pattern in each instance and take those specific patterns as predictive. In other words, low key feature counts prevent some mining tools from generalizing from the few instances available. Instead of generalizing, the mining tool learns the particular instance configurations—which is particularizing rather than generalizing.

Data multiplication is the process of creating additional data instances that appear to have the feature of interest. White (or colorless) noise is added to the key feature subset, producing a second data subset. (See Chapter 9 for a discussion of noise and colored noise.) The interesting thing about the second subset is that its variables all have the same mean values, distributions, and so on, as the original data set—yet no two instance values, except by some small chance, are identical. Of course, the noise-added data set can be made as large as the miner needs. If duplicates do exist, they should be removed. When added to the original data set, these now appear as more instances with the feature, increasing the apparent count and increasing the feature density in the overall data set. The added density means that mining tools will generalize their predictions from the multiplied data set.
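A minimal sketch of data multiplication, assuming the prepared variables are already numeric and using Gaussian white noise (numpy and pandas assumed; the noise scale is an illustrative choice, not a value from the text):

```python
import numpy as np
import pandas as pd

def multiply_instances(feature_subset: pd.DataFrame, copies: int,
                       noise_scale: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Create `copies` noisy replicas of every key-feature instance.

    White (uncorrelated) noise is added to each value independently, so the replicas
    keep roughly the original means and distributions while almost never reproducing
    an exact instance.
    """
    rng = np.random.default_rng(seed)
    spread = feature_subset.std(ddof=0).to_numpy()  # per-variable scale for the noise
    replicas = []
    for _ in range(copies):
        noise = rng.normal(0.0, noise_scale, size=feature_subset.shape) * spread
        replicas.append(feature_subset + noise)
    multiplied = pd.concat([feature_subset, *replicas], ignore_index=True)
    return multiplied.drop_duplicates()  # remove any chance duplicates, as advised above
```

For example, calling multiply_instances(failed_batches, copies=20) would turn a handful of curdled-batch records into a subset large enough for a mining tool to generalize from rather than memorize.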
A problem is that any noise or bias present in the original subsample will be multiplied too. Can this be reduced? Maybe. A technique called color matching helps. Adding white noise multiplies everything exactly as it is, warts and all. Instead of white noise, specially constructed colored noise can be added. The multidimensional distribution of a data sample representative of the population determines the precise color. Color matching adds noise that matches the multivariable distribution found in the representative sample (i.e., it is the same color, or has the same spectrum). Any noise or bias present in the original key feature subsample is still present, but color matching attempts to avoid duplicating the effect of the original bias, even diluting it somewhat in the multiplication.

As always, whenever adding bias to a data set, the miner should put up mental warning flags. Data multiplication and color matching add features to, or change features of, the data set that simply are not present in the real world—or if present, not at the density found after modification. Sometimes there is no choice but to modify the data set, and frequently the results are excellent, robust, and applicable. Sometimes even good results are achieved where none at all were possible without making modifications. Nonetheless, biasing data calls for extreme caution, with much validation and verification of the results before applying them.
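The book treats the construction of colored noise as beyond its scope (see the note in section 10.8.4 below), so the following is only a rough illustration and not the author's method. One simplified stand-in, assuming the key-feature subset and the representative sample share the same prepared numeric variables, is to draw noise from a multivariate Gaussian fitted to the representative sample, so that the noise at least reproduces the sample's covariance structure instead of being white:

```python
import numpy as np
import pandas as pd

def colored_noise(representative: pd.DataFrame, n_rows: int,
                  scale: float = 0.05, seed: int = 0) -> np.ndarray:
    """Zero-mean noise whose correlations mimic the representative sample.

    Simplifying assumption: a multivariate Gaussian fitted to the sample covariance
    stands in for the full multivariable distribution (the noise "color").
    """
    rng = np.random.default_rng(seed)
    cov = np.cov(representative.to_numpy(), rowvar=False)
    noise = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n_rows)
    return noise * scale

def multiply_with_color(feature_subset: pd.DataFrame, representative: pd.DataFrame,
                        copies: int, seed: int = 0) -> pd.DataFrame:
    """Data multiplication as above, but with color-matched rather than white noise."""
    replicas = [feature_subset + colored_noise(representative, len(feature_subset), seed=seed + i)
                for i in range(copies)]
    return pd.concat([feature_subset, *replicas], ignore_index=True).drop_duplicates()
```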
10.7.2 Data Sets in Perspective

Constructing a composite data set enhances the visibility of some pertinent feature in the data set that is of interest to the miner. Such a data set is no longer an unbiased sample, even if the original source data allowed a truly unbiased sample to be taken in the first place. Enhancing data makes it useful only from one particular point of view, or from a particular perspective. While more useful in particular circumstances, it is nonetheless not so useful in general. It has been biased, but with a purposeful bias deliberately introduced. Such data has a perspective.

When mining perspectival data sets, it is very important to use nonperspectival test and evaluation sets. With the best of intentions, the mining data has been distorted and, to at least that extent, no longer accurately represents the population. The only way the inferences or predictions can be examined, to ensure that they do not carry an unacceptable distortion through into the real world, is to test them against data that is as undistorted—that is, as representative of the real world—as possible.

10.8 Implementation Notes

Of the four topics covered in this chapter, the demonstration code implements algorithms for the problems that can be automatically adjusted without high risk of unintended data set damage. Some of the problems discussed are only very rarely encountered, or could cause more damage than benefit to the data if addressed without care. Where no preparation code is available, this section includes pointers to procedures the miner can follow to perform the particular preparation activity.

10.8.1 Collapsing Extremely Sparsely Populated Variables

The demonstration code has no explicit support for collapsing extremely sparsely populated variables. It is usual to ignore such variables, and only in special circumstances do they need to be collapsed. Recall that these variables are usually populated at levels of small fractions of 1%, so a much larger proportion than 99% of the values are missing (or empty). While the full tool from which the demonstration code was drawn will fully collapse such variables if needed, it is easy to collapse them manually using the statistics file and the complete-content file produced by the demonstration code, along with a commercial data manipulation tool, say, an implementation of SQL. Most commercial statistical packages also provide all of the necessary tools to discover the problem, manipulate the data, and create the derived variables. The manual procedure is as follows:

1. If using the demonstration code, start with the “stat” file.
2. Identify the population density for each variable.
3. Check the number of discrete values for each candidate sparse variable.
4. Look in the complete-content file, which lists all of the values for all of the variables.
5. Extract the lists for the sparse variables.
6. Access the sample data set with your tool of choice and search for, and list, those cases where the sparse variables simultaneously have values. (This won’t happen often, even in sparse data sets.)
7. Create unique labels for each specific present-value pattern (PVP).
8. Numerate the PVPs.

Now comes the only tricky part. Recall that the PVPs were built from the representative sample. (It’s representative only to some selected degree of confidence.) The execution data set may, and if large enough almost certainly will, contain a PVP that was not in the sample data set. Whether this matters is a judgment call that only the domain of the problem can answer. If it does matter, create labels for all of the possible PVPs and assign them appropriate values. Otherwise, it may be that you can ignore any unrecognized PVPs or, more likely, flag them if they are found.
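A sketch of steps 2 through 8 (pandas assumed; the 1% density threshold and the treatment of missing values as nulls are illustrative assumptions):

```python
import pandas as pd

def collapse_sparse_variables(sample: pd.DataFrame, density_threshold: float = 0.01):
    """Replace extremely sparse variables with a single numerated PVP column.

    Returns the collapsed sample and the label-to-code mapping so that an execution
    data set can be encoded identically; patterns not seen in the sample can then be
    flagged, as discussed above.
    """
    # Steps 2-3: population density and the candidate sparse variables.
    density = sample.notna().mean()
    sparse_cols = density[density < density_threshold].index.tolist()

    # Steps 6-7: label each instance by which sparse variables simultaneously hold values.
    pvp_labels = sample[sparse_cols].notna().apply(
        lambda row: "+".join(col for col, present in row.items() if present) or "none",
        axis=1,
    )

    # Step 8: numerate the PVPs.
    pvp_codes = {label: code for code, label in enumerate(sorted(pvp_labels.unique()))}
    collapsed = sample.drop(columns=sparse_cols)
    collapsed["pvp"] = pvp_labels.map(pvp_codes)
    return collapsed, pvp_codes
```

Applying the same pvp_codes mapping to the execution data set, and flagging any pattern that has no entry in it, covers the caveat described above.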
10.8.2 Reducing Excessive Dimensionality

Neural networks comprise a vast topic on their own. The brief introduction in this chapter only touched the surface. In keeping with all of the other demonstration code segments, the neural network design is intended mainly for humans to read and understand. Obviously, it also has to be read (and executed) by computer systems, but the primary focus is that the internal working of the code be as clearly readable as possible. Of all the demonstration code, this requirement for clarity most affects the network code. The network is not optimized for speed, performance, or efficiency. The sparsity mechanism is modified random assignment without any dynamic interconnection. Compression factor (hidden-node count) is discovered by random search. The included code demonstrates the key principles involved and compresses information. Code for a fully optimized autoassociative neural network, including dynamic connection search with modified cascade hidden-layer optimization, is an impenetrable beast! The full version, from which the demonstration is drawn, also includes many other obfuscating (as far as clarity of reading goes) “bells and whistles.” For instance, it includes modifications to allow maximum compression of information into the hidden layer, rather than spreading it between hidden and output layers, as well as modifications to remove linear relationships and represent those separately. While improving performance and compression, such features completely obscure the underlying principles.

10.8.3 Measuring Variable Importance

Everything just said about neural networks for data compression applies when using the demonstration code to measure variable importance. For explanatory ease, both data compression and variable importance estimation use the same code segment. A network optimized for importance search can, once again, improve performance, but the principles are as well demonstrated by any SCANN-type BP-ANN.

10.8.4 Feature Enhancement

To enhance features in a data set, build multiple representative data subsets, as described, and merge them. Describing the construction of colored noise for color matching, if needed, is unfortunately outside the scope of the present book. It involves significant multivariable frequency modeling to reproduce a characteristic noise pattern emulating the sample multivariable distribution. Many statistical analysis software packages provide the basic tools for the miner to characterize the distribution and develop the necessary noise generation function.

10.9 Where Next?

A pause at this point. Data preparation, the focus of this book, is now complete. By applying all of the insights and techniques so far covered, raw data in almost any form is turned into clean prepared data ready for modeling. Many of the techniques are illustrated with computer code on the accompanying CD-ROM, and so far as data preparation for data mining is concerned, the journey ends here. However, the data is still unmined. The ultimate purpose of preparing data is to gain understanding of what the data “means” or predicts. The prepared data set still has to be used. How is this data used? The last two chapters look not at preparing data, but at surveying and using prepared data.

Chapter 11: The Data Survey

Overview

Suppose that three separate families are planning a vacation. The Abbott family really enjoys lake sailing. Their ideal vacation includes an idyllic mountain lake, surrounded by trees, with plenty of wildlife and perhaps a small town or two nearby in case supplies are needed. They need only a place to park their car and boat trailer, a place to launch the boat, and they are happy. The Bennigans are amateur archeologists. There is nothing they like better than to find an ancient encampment, or other site, and spend their time exploring for artifacts. Their four-wheel-drive cruiser can manage most terrain and haul all they need to be entirely self-sufficient for a couple of weeks exploring—and the farther from civilization, the better they like it. The Calloways like to stay in touch with their business, even while on vacation. Their ideal is to find a luxury hotel in the sun, preferably near the beach but with nightlife. Not just any nightlife; they really enjoy cabaret, and would like to find museums to explore and other places of interest to fill their days.

These three families all have very different interests and desires for their perfect vacation. Can they all be satisfied? Of course. The locations that each family would like to find and enjoy exist in many places; their only problem is to find them and narrow down the possibilities to a final choice. The obvious starting point is with a map. Any map of the whole country indicates broad features—mountains, forests, deserts, lakes, cities, and probably roads. The Abbotts will find, perhaps, the Finger Lakes in upstate New York a place to focus their attention. The Bennigans may look at the deserts of the Southwest, while the Calloways look to Florida. Given their different interests, each family starts by narrowing down the area of search for their ideal vacation to those general areas of the country that seem likely to meet their needs and interests. Once they have selected a general area, a more detailed map of the particular territory lets each family focus in more closely. Eventually, each family will decide on the best choice they can find and leave for their various vacations.

Each family explores its own vacation site in detail. While the explorations do not seem to produce maps, they reveal small details—the very details that the vacations are aimed at.
The Abbotts find particular lake coves, see particular trees, and watch specific birds and deer. The Bennigans find individual artifacts in specific places. The Calloways enjoy particular cabaret performers and see specific exhibits at particular museums. It is these detailed explorations that each family feels to be the whole purpose for their vacations.

Each family started with a general search to find places likely to be of interest. Their initial search was easy. The U.S. Geological Survey has already done the hard work for them. Other organizations, some private survey companies, have embellished maps in particular ways and for particular purposes—road maps, archeological surveys, sailing maps (called “charts”), and so on. Eventually, the level of detail that each family needed was more than a general map could provide. Then the families constructed their own maps through detailed exploration.

What does this have to do with data mining? The whole purpose of the data survey is to help the miner draw a high-level map of the territory. With this map, a data miner discovers the general shape of the data, as well as areas of danger, of limitation, and of usefulness. With a map, the Abbotts avoided having to explore Arizona to see if any lakes suitable for sailing were there. With a data survey, a miner can avoid trying to predict the stock market from meteorological data. “Everybody knows” that there are no lakes in Arizona. “Everybody knows” that the weather doesn’t predict the stock market. But these “everybodies” only know that through experience—mainly the experience of others who have been there first. Every territory needed exploring by pioneers—people who entered the territory first to find out what there was in general—blazing the trail for the detailed explorations to follow. The data survey provides a miner with a map of the territory that guides further exploration and locates the areas of particular interest, the areas suitable for mining. On the other hand, just as with looking for lakes in Arizona, if there is no value to be found, that is well to know as early as possible.

11.1 Introduction to the Data Survey

This chapter deals entirely with the data survey, a topic at least as large as data preparation. The introduction to the use, purposes, and methods of data surveying in this chapter discusses how prepared data is used during the survey. Most, if not all, of the surveying techniques can be automated. Indeed, the full suite of programs from which the data preparation demonstration code is drawn is a full data preparation and survey tool set. This chapter touches only on the main topics of data surveying. It is an introduction to the territory itself.

The introduction starts with understanding the concept of “information.” This book mentions “information” in several places. “Information is embedded in a data set.” “The purpose of data preparation is to best expose information to a mining tool.” “Information is contained in variability.” Information, information, information. Clearly, “information” is a key feature of data preparation. In fact, information—its discovery, exposure, and understanding—is what the whole preparation-survey-mining endeavor is about. A data set may represent information in a form that is not easily, or even at all, understandable by humans. When the data set is large, understanding significant and salient points becomes even more difficult.
Data mining is devised as a tool to transform the impenetrable information embedded in a data set into understandable relationships or predictions.