Data Mining and Knowledge Discovery Handbook, 2nd Edition, Part 12

D(Pr(C \mid V_i = v_i, V_j = v_j), Pr(C \mid V_j = v_j)) = \sum_{c \in C} p(c \mid V_i = v_i, V_j = v_j) \log_2 \frac{p(c \mid V_i = v_i, V_j = v_j)}{p(c \mid V_j = v_j)}   (5.2)

For each feature i, the algorithm finds a set M_i, containing K attributes from those that remain, that is likely to include the information feature i has about the class values. M_i contains the K remaining features for which the value of Equation (5.2) is smallest. The expected cross entropy between the distribution of the class values given M_i and V_i, and the distribution of the class values given just M_i, is then calculated for each feature i. The feature for which this quantity is minimal is removed from the set. This process iterates until the user-specified number of features has been removed from the original set.

Experiments on natural domains and two artificial domains, using C4.5 and naive Bayes as the final induction algorithm, showed that the feature selector gives the best results when the size K of the conditioning set M is set to 2. In two domains containing over 1000 features the algorithm was able to reduce the number of features by more than half, while improving accuracy by one or two percent.

One problem with the algorithm is that it requires features with more than two values to be encoded as binary in order to avoid the bias that entropic measures have toward features with many values. This can greatly increase the number of features in the original data, as well as introducing further dependencies. Furthermore, the meaning of the original attributes is obscured, making the output of algorithms such as C4.5 hard to interpret.

An Instance Based Approach to Feature Selection – RELIEF

Kira and Rendell (1992) describe an algorithm called RELIEF that uses instance based learning to assign a relevance weight to each feature. Each feature's weight reflects its ability to distinguish among the class values. Features are ranked by weight, and those that exceed a user-specified threshold are selected to form the final subset. The algorithm works by randomly sampling instances from the training data. For each sampled instance, the nearest instance of the same class (nearest hit) and of the opposite class (nearest miss) is found. An attribute's weight is updated according to how well its values distinguish the sampled instance from its nearest hit and nearest miss. An attribute will receive a high weight if it differentiates between instances from different classes and has the same value for instances of the same class. Equation (5.3) shows the weight updating formula used by RELIEF:

W_X = W_X - \frac{\mathrm{diff}(X, R, H)^2}{m} + \frac{\mathrm{diff}(X, R, M)^2}{m}   (5.3)

where W_X is the weight for attribute X, R is a randomly sampled instance, H is the nearest hit, M is the nearest miss, and m is the number of randomly sampled instances. The function diff calculates the difference between two instances for a given attribute: for nominal attributes it is defined as either 1 (the values are different) or 0 (the values are the same), while for continuous attributes it is the actual difference normalized to the interval [0, 1]. Dividing by m guarantees that all weights lie in the interval [-1, 1].

RELIEF operates on two-class domains. Kononenko (1994) describes enhancements to RELIEF that enable it to cope with multi-class, noisy and incomplete domains.
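To make the update in Equation (5.3) concrete, the sketch below is one possible rendering of RELIEF for numeric, two-class data. It is not the authors' implementation; it assumes the features have already been rescaled to [0, 1] and that nearest hits and misses are found by Euclidean distance over all features.

```python
import numpy as np

def relief_weights(X, y, m, seed=None):
    """Minimal RELIEF sketch: X holds n instances with p features scaled to [0, 1],
    y the two-class labels, m the number of randomly sampled instances."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(m):
        i = rng.integers(n)
        r = X[i]
        d = np.linalg.norm(X - r, axis=1)          # distance to every instance
        d[i] = np.inf                              # exclude the sampled instance itself
        hit = X[np.argmin(np.where(y == y[i], d, np.inf))]    # nearest same-class instance
        miss = X[np.argmin(np.where(y != y[i], d, np.inf))]   # nearest other-class instance
        # Equation (5.3): penalize differences to the hit, reward differences to the miss.
        w += (np.abs(r - miss) ** 2 - np.abs(r - hit) ** 2) / m
    return w   # features whose weight exceeds a user-chosen threshold are retained
```

Because each per-feature difference is at most 1, the accumulated weights stay within [-1, 1], matching the normalization noted above.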
Kira and Rendell provide experimental evidence showing RELIEF to be effective at identifying relevant features even when they interact. Interacting features are those whose values depend on the values of other features and on the class and, as such, provide further information about the class (parity problems are a typical example). Redundant features, on the other hand, are those whose values depend on the values of other features irrespective of the class and, as such, provide no further information about the class. RELIEF, however, does not handle redundant features. The authors state: "If most of the given features are relevant to the concept, it (RELIEF) would select most of the given features even though only a small number of them are necessary for concept description."

Scherf and Brauer (1997) describe a similar instance based approach (EUBAFES) to assigning feature weights, developed independently of RELIEF. Like RELIEF, EUBAFES strives to reinforce similarities between instances of the same class while simultaneously decreasing similarities between instances of different classes. A gradient descent approach is employed to optimize the feature weights with respect to this goal.

5.2.2 Feature Wrappers

Wrapper strategies for feature selection use an induction algorithm to estimate the merit of feature subsets. The rationale for wrapper approaches is that the induction method that will ultimately use the feature subset should provide a better estimate of accuracy than a separate measure with an entirely different inductive bias (Langley and Sage, 1994). Feature wrappers often achieve better results than filters because they are tuned to the specific interaction between an induction algorithm and its training data. However, they tend to be much slower than feature filters because they must repeatedly call the induction algorithm, and they must be re-run when a different induction algorithm is used. Since the wrapper is a well defined process, most of the variation in its application is due to the method used to estimate the off-sample accuracy of the target induction algorithm, the target induction algorithm itself, and the organization of the search. This section reviews work that has focused on the wrapper approach and on methods to reduce its computational expense.

Wrappers for Decision Tree Learners

John, Kohavi, and Pfleger (1994) were the first to advocate the wrapper (Allen, 1974) as a general framework for feature selection in machine learning. They present formal definitions for two degrees of feature relevance and claim that the wrapper is able to discover relevant features. A feature X_i is said to be strongly relevant to the target concept(s) if the probability distribution of the class values, given the full feature set, changes when X_i is removed. A feature X_i is said to be weakly relevant if it is not strongly relevant and the probability distribution of the class values, given some subset S (containing X_i) of the full feature set, changes when X_i is removed. All features that are not strongly or weakly relevant are irrelevant.

Experiments were conducted on three artificial and three natural domains using ID3 and C4.5 (Quinlan, 1993, Quinlan, 1986) as the induction algorithms. Accuracy was estimated by 25-fold cross validation on the training data; a disjoint test set was used for reporting final accuracies. Both forward selection and backward elimination search were used. With the exception of one artificial domain, results showed that feature selection did not significantly change the generalization performance of ID3 or C4.5; the main effect of feature selection was to reduce the size of the trees.
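As an illustration of the wrapper scheme discussed in this section (a sketch only, not the experimental protocol of John et al.), greedy forward selection can be written independently of the induction algorithm by passing in an evaluation callback. Here cv_accuracy is a hypothetical function supplied by the caller that returns the cross-validated accuracy of the target learner on a tuple of feature indices, and min_gain is an illustrative stopping tolerance.

```python
def forward_selection(n_features, cv_accuracy, min_gain=0.0):
    """Greedy forward-selection wrapper (sketch).

    cv_accuracy(subset) -> estimated accuracy of the target induction algorithm
    trained on the feature indices in `subset` (e.g. via cross-validation);
    it must also accept the empty tuple, which serves as the baseline."""
    selected, best = [], cv_accuracy(())
    remaining = set(range(n_features))
    while remaining:
        # Score every single-feature addition to the current subset.
        scores = {f: cv_accuracy(tuple(selected) + (f,)) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] - best <= min_gain:
            break   # no addition improves the estimate enough: stop searching
        selected.append(f_best)
        remaining.remove(f_best)
        best = scores[f_best]
    return selected, best
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the estimated accuracy least.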
Like John et al., Caruana and Freitag (1994) test a number of greedy search methods with ID3 on two calendar scheduling domains. As well as backward elimination and forward selection, they also test two variants of stepwise bi-directional search, one starting with all features and the other with none. Results showed that although the bi-directional searches slightly outperformed the forward and backward searches, on the whole there was very little difference between the various search strategies except with respect to computation time. Feature selection was able to improve the performance of ID3 on both calendar scheduling domains.

Vafaie and De Jong (1995) and Cherkauer and Shavlik (1996) have both applied genetic search strategies in a wrapper framework to improve the performance of decision tree learners. Vafaie and De Jong (1995) describe a system with two genetic algorithm driven modules: the first performs feature selection and the second performs constructive induction, that is, the process of creating new attributes by applying logical and mathematical operators to the original features (Michalski, 1983). Both modules were able to significantly improve the performance of ID3 on a texture classification problem.

Cherkauer and Shavlik (1996) present an algorithm called SET-Gen which strives to improve the comprehensibility of decision trees as well as their accuracy. To achieve this, SET-Gen's genetic search uses a fitness function that is a linear combination of an accuracy term and a simplicity term:

\mathrm{Fitness}(X) = \frac{3}{4} A + \frac{1}{4} \left( 1 - \frac{S + F}{2} \right)   (5.4)

where X is a feature subset, A is the average cross-validation accuracy of C4.5, S is the average size of the trees produced by C4.5 (normalized by the number of training examples), and F is the number of features in the subset X (normalized by the total number of available features). Equation (5.4) ensures that the fittest population members are those feature subsets that lead C4.5 to induce small but accurate decision trees.

Wrappers for Instance-based Learning

The wrapper approach was proposed at approximately the same time as, and independently of, John et al. (1994) by Langley and Sage (1994) during their investigation of the simple nearest neighbor algorithm's sensitivity to irrelevant attributes. Scaling experiments showed that the nearest neighbor's sample complexity (the number of training examples needed to reach a given accuracy) increases exponentially with the number of irrelevant attributes present in the data (Aha et al., 1991, Langley and Sage, 1994). An algorithm called OBLIVION is presented which performs backward elimination of features using an oblivious decision tree as the induction algorithm (when all the original features are included in the tree, and given a number of assumptions at classification time, Langley and Sage note that the structure is functionally equivalent to the simple nearest neighbor; in fact, this is how it is implemented in OBLIVION). Experiments with OBLIVION using k-fold cross validation on several artificial domains showed that it was able to remove redundant features and learn faster than C4.5 on domains where features interact.
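Instance-based wrappers of this kind score a candidate subset by how well a nearest neighbor classifier restricted to that subset predicts the training data. The sketch below is an illustrative evaluation step (not OBLIVION itself): the leave-one-out accuracy of a 1-nearest-neighbor classifier over a given tuple of feature indices, which could serve as the cv_accuracy callback in the forward-selection sketch above.

```python
import numpy as np

def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of 1-nearest-neighbor restricted to `subset`
    (the evaluation step an instance-based wrapper plugs into its search)."""
    Z = np.asarray(X, dtype=float)[:, list(subset)]
    y = np.asarray(y)
    correct = 0
    for i in range(len(Z)):
        d = np.linalg.norm(Z - Z[i], axis=1)   # distances over the selected features only
        d[i] = np.inf                          # leave the query instance out
        correct += int(y[np.argmin(d)] == y[i])
    return correct / len(Z)
```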
Moore and Lee (1994) take a similar approach to augmenting the nearest neighbor algorithm, but their system uses leave-one-out rather than k-fold cross-validation and concentrates on improving the prediction of numeric rather than discrete classes. Aha and Blankert (1994) also use leave-one-out cross validation, but pair it with a beam search (a limited version of best first search that only remembers a portion of the search path for use in backtracking) instead of hill climbing. Their results show that feature selection can improve the performance of IB1 (a nearest neighbor classifier) on a sparse (very few instances) cloud pattern domain with many features. Moore, Hill, and Johnson (1992) encompass not only feature selection in the wrapper process, but also the number of nearest neighbors used in prediction and the space of combination functions. Using leave-one-out cross validation they achieve significant improvement on several control problems involving the prediction of continuous classes.

In a similar vein, Skalak (1994) combines feature selection and prototype selection into a single wrapper process using random mutation hill climbing as the search strategy. Experimental results showed significant improvement in accuracy for nearest neighbor on two natural domains and a drastic reduction in the algorithm's storage requirement (the number of instances retained during training).

Domingos (1997) describes a context sensitive wrapper approach to feature selection for instance based learners. The motivation for the approach is that there may be features that are relevant in only a restricted area of the instance space and irrelevant elsewhere, or relevant only given certain values of other features (weakly interacting) and otherwise irrelevant. In either case, when features are estimated globally (over the whole instance space), the irrelevant aspects of such features may overwhelm their useful aspects for instance based learners. This is true even when using backward search strategies with the wrapper. In the wrapper approach, backward search strategies are generally more effective than forward search strategies in domains with feature interactions: because backward search typically begins with all the features, the removal of a strongly interacting feature is usually detected by decreased accuracy during cross validation.

Domingos presents an algorithm called RC which can detect and make use of context sensitive features. RC works by selecting a (potentially) different set of features for each instance in the training set. It does this by using a backward search strategy and cross validation to estimate accuracy. For each instance in the training set, RC finds its nearest neighbour of the same class and removes those features in which the two differ. The accuracy on the entire training dataset is then estimated by cross validation. If the accuracy has not degraded, the modified instance is accepted; otherwise the instance is restored to its original state and deactivated (no further feature selection is attempted for it). The feature selection process continues until all instances are inactive.
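The per-instance feature-dropping step just described can be sketched roughly as follows. This is an illustration of the idea rather than Domingos' RC itself: it omits the cross-validation check and the deactivation bookkeeping, and it assumes nominally coded features so that "differ" means exact inequality.

```python
import numpy as np

def propose_instance_mask(X, y, i, active):
    """RC-style proposal (sketch): for training instance i, find its nearest
    neighbor of the same class under the currently active features and propose
    dropping every feature on which the two instances disagree."""
    Z = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mask = np.asarray(active, dtype=float).copy()   # 1.0 = feature active for instance i
    candidates = np.where((y == y[i]) & (np.arange(len(Z)) != i))[0]
    if candidates.size == 0:
        return mask
    d = np.linalg.norm((Z[candidates] - Z[i]) * mask, axis=1)   # distance over active features
    nearest = candidates[np.argmin(d)]
    mask[Z[nearest] != Z[i]] = 0.0                  # drop the disagreeing features
    return mask   # RC accepts the proposal only if cross-validated accuracy does not drop
```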
Experiments on a selection of machine learning datasets showed that RC outperformed standard wrapper feature selectors using forward and backward search strategies with instance based learners. The effectiveness of the context sensitive approach was also shown on artificial domains engineered to exhibit restricted feature dependency. When features are globally relevant or irrelevant, RC has no advantage over standard wrapper feature selection. Furthermore, when few examples are available, or the data is noisy, standard wrapper approaches can detect globally irrelevant features more easily than RC. Domingos also noted that wrappers that employ instance based learners (including RC) are unsuitable for use on databases containing many instances because they are quadratic in N (the number of instances).

Kohavi (1995) uses wrapper feature selection to explore the potential of decision table majority (DTM) classifiers. Appropriate data structures allow the use of fast incremental cross-validation with DTM classifiers. Experiments showed that DTM classifiers using appropriate feature subsets compared very favorably with sophisticated algorithms such as C4.5.

Wrappers for Bayes Classifiers

Because the naive Bayes classifier assumes that, within each class, the probability distributions for attributes are independent of each other, Langley and Sage (1994) note that its performance on domains with redundant features can be improved by removing such features. A forward search strategy is employed to select features for use with naive Bayes, as opposed to the backward strategies that are used most often with decision tree algorithms and instance based learners. The rationale for a forward search is that it should immediately detect dependencies when harmful redundant attributes are added. Experiments showed overall improvement and an increased learning rate on three out of six natural domains, with no change on the remaining three.

Pazzani (1995) combines feature selection and simple constructive induction in a wrapper framework for improving the performance of naive Bayes. Forward and backward hill climbing search strategies are compared. In the former case, the algorithm considers not only the addition of single features to the current subset, but also the creation of a new attribute by joining one of the as yet unselected features with each of the selected features in the subset. In the latter case, the algorithm considers both deleting individual features and replacing pairs of features with a joined feature. Results on a selection of machine learning datasets show that both approaches improve the performance of naive Bayes. The forward strategy does a better job of removing redundant attributes than the backward strategy. Because it starts with the full set of features and considers all possible pairwise joined features, the backward strategy is more effective at identifying attribute interactions than the forward strategy. Improvement for naive Bayes using wrapper-based feature selection is also reported in (Kohavi and Sommerfield, 1995, Kohavi and John, 1996).

Provan and Singh (1996) have applied the wrapper to select the features from which to construct Bayesian networks. Their results showed that while feature selection did not improve accuracy over networks constructed from the full set of features, the networks created after feature selection were considerably smaller and faster to learn.
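A decision table majority classifier is conceptually simple, which is part of what makes it attractive inside a wrapper. The sketch below is one plausible reading of the idea, not Kohavi's optimized, incrementally cross-validated implementation: an unseen instance is matched exactly against the training instances on the selected features, the majority class among the matches is returned, and the overall majority class is used when nothing matches.

```python
from collections import Counter

def dtm_build(X, y, subset):
    """Build a decision-table-majority classifier over a feature subset:
    map each projection of a training instance onto `subset` to class counts."""
    table = {}
    for row, label in zip(X, y):
        key = tuple(row[f] for f in subset)
        table.setdefault(key, Counter())[label] += 1
    default = Counter(y).most_common(1)[0][0]    # overall majority class
    return table, default

def dtm_predict(table, default, subset, row):
    """Exact match on the subset features, majority vote among the matches,
    falling back to the overall majority class when no training instance matches."""
    counts = table.get(tuple(row[f] for f in subset))
    return counts.most_common(1)[0][0] if counts else default
```

Wrapped in a subset search such as the forward-selection sketch given earlier, the quality of a subset is simply the cross-validated accuracy of the table built from it.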
5.3 Variable Selection

This section aims to provide a survey of variable selection. Suppose Y is a variable of interest and X_1, ..., X_p is a set of potential explanatory variables or predictors, all vectors of n observations. The problem of variable selection, or subset selection as it is often called, arises when one wants to model the relationship between Y and a subset of X_1, ..., X_p, but there is uncertainty about which subset to use. Such a situation is particularly of interest when p is large and X_1, ..., X_p is thought to contain many redundant or irrelevant variables.

The variable selection problem is most familiar in the linear regression context, where attention is restricted to normal linear models. Letting γ index the subsets of X_1, ..., X_p and letting q_γ be the size of the γ-th subset, the problem is to select and fit a model of the form:

Y = X_\gamma \beta_\gamma + \varepsilon   (5.5)

where X_γ is an n × q_γ matrix whose columns correspond to the γ-th subset, β_γ is a q_γ × 1 vector of regression coefficients and ε ~ N(0, σ²I). More generally, the variable selection problem is a special case of the model selection problem, where each model under consideration corresponds to a distinct subset of X_1, ..., X_p. Typically, a single model class is simply applied to all possible subsets.

5.3.1 Mallows Cp (Mallows, 1973)

This method minimizes the mean square error of prediction:

C_p = \frac{RSS_\gamma}{\hat{\sigma}^2_{FULL}} + 2 q_\gamma - n   (5.6)

where RSS_γ is the residual sum of squares for the γ-th model and \hat{\sigma}^2_{FULL} is the usual unbiased estimate of σ² based on the full model. The goal is to obtain a model with minimum C_p; one can thus reduce dimension by finding the minimal subset that attains the minimum C_p.

5.3.2 AIC, BIC and the F Ratio

Two of the other most popular criteria, motivated from very different points of view, are AIC (the Akaike Information Criterion) and BIC (the Bayesian Information Criterion). Letting \hat{l}_γ denote the maximum log likelihood of the γ-th model, AIC selects the model which maximizes (\hat{l}_γ - q_γ), whereas BIC selects the model which maximizes (\hat{l}_γ - (log n) q_γ / 2).

For the linear model, many of the popular selection criteria are special cases of a penalized sum-of-squares criterion, providing a unified framework for comparisons. Assuming σ² known to avoid complications, this general criterion selects the subset model that minimizes:

RSS_\gamma / \hat{\sigma}^2 + F q_\gamma   (5.7)

where F is a preset "dimensionality penalty". Intuitively, the above penalizes RSS_γ/σ² by F times q_γ, the dimension of the γ-th model. AIC and minimum C_p are essentially equivalent, corresponding to F = 2, and BIC is obtained by setting F = log n. By imposing a smaller penalty, AIC and minimum C_p will select larger models than BIC (unless n is very small).
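As a small worked illustration of Equation (5.7) (a sketch only: it estimates σ² from the full model as in Equation (5.6) and ignores intercept handling), the criterion can be evaluated for any candidate subset with ordinary least squares:

```python
import numpy as np

def penalized_rss(Y, X, subset, F=2.0):
    """Evaluate RSS_g / sigma^2 + F * q_g (Equation 5.7) for the columns in
    `subset`. F=2 corresponds roughly to AIC / minimum Cp, F=log(n) to BIC."""
    Y, X = np.asarray(Y, dtype=float), np.asarray(X, dtype=float)
    n, p = X.shape

    def rss(cols):
        Z = X[:, cols] if cols else np.ones((n, 1))      # intercept-only if subset is empty
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        resid = Y - Z @ beta
        return float(resid @ resid)

    sigma2 = rss(list(range(p))) / (n - p)               # unbiased estimate from the full model
    return rss(list(subset)) / sigma2 + F * len(subset)
```

Scanning candidate subsets and keeping the minimizer reproduces the behaviour described above: the smaller penalty F = 2 tends to retain more variables than F = log n.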
5.3.3 Principal Component Analysis (PCA)

Principal component analysis (PCA) is the best, in the mean-square error sense, linear dimension reduction technique (Jackson, 1991, Jolliffe, 1986). Being based on the covariance matrix of the variables, it is a second-order method. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loeve transform, the Hotelling transform, and the empirical orthogonal function (EOF) method. In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the PCs) of the original variables with the largest variance. The first PC, s_1, is the linear combination with the largest variance. We have s_1 = x^T w_1, where the p-dimensional coefficient vector w_1 = (w_{1,1}, ..., w_{1,p})^T solves:

w_1 = \arg\max_{\|w\| = 1} \mathrm{Var}(x^T w)   (5.8)

The second PC is the linear combination with the second largest variance and orthogonal to the first PC, and so on. There are as many PCs as original variables. For many datasets, the first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of information.

5.3.4 Factor Analysis (FA)

Like PCA, factor analysis (FA) is a linear method, based on second-order data summaries. First suggested by psychologists, FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. Typical examples include variables defined as various test scores of individuals, since such scores are thought to be related to a common "intelligence" factor. The goal of FA is to uncover such relations, and it can thus be used to reduce the dimension of datasets following the factor model.

5.3.5 Projection Pursuit

Projection pursuit (PP) is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and is therefore useful for non-Gaussian datasets. It is more computationally intensive than second-order methods. Given a projection index that defines the "interestingness" of a direction, PP looks for the directions that optimize that index. As the Gaussian distribution is the least interesting distribution (having the least structure), projection indices usually measure some aspect of non-Gaussianity. If, however, one uses the second-order maximum variance, subject to the projections being orthogonal, as the projection index, PP yields the familiar PCA.

5.3.6 Advanced Methods for Variable Selection

Chizi and Maimon (2002) describe some new methods for variable selection. These methods are based on simple algorithms and use known evaluators such as information gain, the logistic regression coefficient and random selection. All the methods are presented with empirical results on benchmark datasets and with theoretical bounds on each method. A wider survey of variable selection, together with a decomposition of the problem of dimension reduction, can be found there.

In summary, feature selection is useful for many application domains, such as Manufacturing (lr18, lr14), Security (lr7, lr10) and Medicine (lr2, lr9), and for many data mining techniques, such as decision trees (lr6, lr12, lr15), clustering (lr13, lr8), ensemble methods (lr1, lr4, lr5, lr16) and genetic algorithms (lr17, lr11).

References

Aha, D. W. and Blankert, R. L. Feature selection for case-based classification of cloud types. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, pages 106–112, 1994.
Aha, D. W., Kibler, D. and Albert, M. K. Instance based learning algorithms. Machine Learning, 6: 37–66, 1991.
Allen, D. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16: 125–127, 1974.
Almuallim, H. and Dietterich, T. G. Efficient algorithms for identifying relevant features. In Proceedings of the Ninth Canadian Conference on Artificial Intelligence, pages 38–45. Morgan Kaufmann, 1992.
Almuallim, H. and Dietterich, T. G. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 547–552. MIT Press, 1991.
Arbel, R. and Rokach, L. Classifier evaluation under limited resources. Pattern Recognition Letters, 27(14): 1619–1631, 2006.
Averbuch, M., Karson, T., Ben-Ami, B., Maimon, O. and Rokach, L. Context-sensitive medical information retrieval. In Proceedings of the 11th World Congress on Medical Informatics (MEDINFO 2004), San Francisco, CA, pages 282–286. IOS Press, 2004.
Blum, A. L. and Langley, P. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97: 245–271, 1997.
Cardie, C. Using decision trees to improve case-based learning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1995.
Caruana, R. and Freitag, D. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Cherkauer, K. J. and Shavlik, J. W. Growing simpler decision trees to facilitate knowledge discovery. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996.
Chizi, B. and Maimon, O. On dimensionality reduction of high dimensional data sets. In Frontiers in Artificial Intelligence and Applications, pages 230–236. IOS Press, 2002.
Cohen, S., Rokach, L. and Maimon, O. Decision tree instance space decomposition with grouped gain-ratio. Information Sciences, 177(17): 3592–3612, 2007.
Domingos, P. Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, 11: 227–253, 1997.
Elder, J. F. and Pregibon, D. A statistical perspective on knowledge discovery in databases. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
George, E. and Foster, D. Empirical Bayes variable selection. Biometrika, 2000.
Hall, M. Correlation-based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, 1999.
Holmes, G. and Nevill-Manning, C. G. Feature selection via the discovery of simple classification rules. In Proceedings of the Symposium on Intelligent Data Analysis, Baden-Baden, Germany, 1995.
Holte, R. C. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11: 63–91, 1993.
Jackson, J. A User's Guide to Principal Components. John Wiley and Sons, New York, 1991.
John, G. H., Kohavi, R. and Pfleger, P. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Jolliffe, I. Principal Component Analysis. Springer-Verlag, 1986.
Kira, K. and Rendell, L. A. A practical approach to feature selection. In Machine Learning: Proceedings of the Ninth International Conference, 1992.
Kohavi, R. and John, G. Wrappers for feature subset selection. Artificial Intelligence, special issue on relevance, 97(1–2): 273–324, 1996.
Kohavi, R. and Sommerfield, D. Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1995.
Kohavi, R. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford University, 1995.
Koller, D. and Sahami, M. Towards optimal feature selection. In Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann, 1996.
Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, 1994.
Langley, P. Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance. AAAI Press, 1994.
Langley, P. and Sage, S. Scaling to domains with irrelevant features. In Greiner, R., editor, Computational Learning Theory and Natural Learning Systems, volume 4. MIT Press, 1994.
Liu, H. and Setiono, R. A probabilistic approach to feature selection: A filter solution. In Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann, 1996.
Maimon, O. and Rokach, L. Data mining by attribute decomposition with semiconductors manufacturing case study. In Braha, D., editor, Data Mining for Design and Manufacturing: Methods and Applications, pages 311–336. Kluwer Academic Publishers, 2001.
Maimon, O. and Rokach, L. Improving supervised learning by feature decomposition. In Proceedings of the Second International Symposium on Foundations of Information and Knowledge Systems, Lecture Notes in Computer Science, pages 178–196. Springer, 2002.
Maimon, O. and Rokach, L. Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications. Series in Machine Perception and Artificial Intelligence, Vol. 61, World Scientific Publishing, ISBN 981-256-079-3, 2005.
Mallows, C. L. Some comments on Cp. Technometrics, 15: 661–676, 1973.
Michalski, R. S. A theory and methodology of inductive learning. Artificial Intelligence, 20(2): 111–161, 1983.
Moore, A. W. and Lee, M. S. Efficient algorithms for minimizing cross validation error. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Moore, A. W., Hill, D. J. and Johnson, M. P. An empirical investigation of brute force to choose features, smoothers and function approximations. In Hanson, S., Judd, S. and Petsche, T., editors, Computational Learning Theory and Natural Learning Systems, volume 3. MIT Press, 1992.
Moskovitch, R., Elovici, Y. and Rokach, L. Detection of unknown computer worms based on behavioral classification of the host. Computational Statistics and Data Analysis, 52(9): 4544–4566, 2008.
Pazzani, M. Searching for dependencies in Bayesian classifiers. In Proceedings of the Fifth International Workshop on AI and Statistics, 1995.
Pfahringer, B. Compression-based feature subset selection. In Proceedings of the IJCAI-95 Workshop on Data Engineering for Inductive Learning, pages 109–119, 1995.
Provan, G. M. and Singh, M. Learning Bayesian networks using feature selection. In Fisher, D. and Lenz, H., editors, Learning from Data, Lecture Notes in Statistics, pages 291–300. Springer-Verlag, New York, 1996.
Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California, 1993.
Quinlan, J. R. Induction of decision trees. Machine Learning, 1: 81–106, 1986.
Rissanen, J. Modeling by shortest data description. Automatica, 14: 465–471, 1978.