Financial distress prediction based on OR-CBR in the principle of k-nearest neighbors


Hui Li, Jie Sun *, Bo-Liang Sun

School of Business Administration, Zhejiang Normal University, sub-mailbox 91# in P.O. Box 62# at YingBinDaDao 688#, Jinhua 321004, Zhejiang Province, PR China

Abstract

Financial distress prediction, including bankruptcy prediction, has attracted broad attention since the 1960s. Various techniques have been employed in this area, ranging from statistical ones such as multiple discriminant analysis (MDA) and Logit to machine learning ones such as neural networks (NN) and support vector machines (SVM). Case-based reasoning (CBR), one of the key methodologies for problem-solving, has not received enough attention in financial distress prediction since 1996. In this study, outranking relations (OR) between cases on each feature, covering strict difference, weak difference, and indifference, are introduced to build a new feature-based similarity measure mechanism in the principle of k-nearest neighbors. It differs from traditional distance-based similarity mechanisms and from those based on NN, fuzzy set theory, decision trees (DT), etc. The accuracy of the CBR prediction method based on OR, called OR-CBR, is determined directly by four types of parameters: the difference parameter, the indifference parameter, the veto parameter, and the neighbor parameter. The OR-CBR model is described in detail from various aspects, including its background, the formalization of the specific model, and the implementation of the corresponding algorithm. With three years of real-world data from Chinese listed companies, experimental results indicate that OR-CBR outperforms MDA, Logit, NN, SVM, DT, Basic CBR, and Grey CBR in financial distress prediction, under the assessment of leave-one-out cross-validation and Max normalization. This suggests that OR-CBR may be a preferred model for financial distress prediction in China.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Financial distress prediction; Case-based reasoning; Outranking relations; k-nearest neighbors

1. Introduction

With the development of the Chinese economy and its role in economic globalization, Chinese listed companies have entered many global supply chains. To control the cost of the whole supply chain, many notable international enterprises, such as General Motors, Volkswagen, Dell, Cisco, Hitachi, and Sony, have also set up factories in China. Meanwhile, enterprises that fail to survive in the competitive business environment may go bankrupt. Bankruptcy not only brings losses to a company's stockholders, creditors, managers, employees, etc., but also shocks the whole country's, and even the world's, economic development if many enterprises go bankrupt at the same time. Hence, how to control bankruptcy is a valuable and applicable research area. Bankruptcy does not just happen by accident; it is a continuously developing process that covers a period of time. Most enterprises that ran into bankruptcy had experienced financial distress, which usually shows symptoms in the company's accounting items. It is therefore significant to explore effective financial distress prediction models with various classification and prediction techniques.
There is a large literature on bankruptcy and financial distress prediction, with research methodology typically ranging from statistical methods such as multiple discriminant analysis (MDA) (Altman, 1968; Taffler, 1982; Altman, Marco, & Varetto, 1994) and Logit (Martin, 1977; Ohlson, 1980; Andres, Landajo, & Lorca, 2005), to machine learning methods such as neural networks (NN) (Odom & Sharda, 1990; Tam & Kiang, 1992; Altman et al., 1994; Boritz & Kennedy, 1995; Barniv, Anurag, & Leach, 1997; Ahn, Cho, & Kim, 2000; Liang & Wu, 2005), support vector machines (SVM) (Shin, Lee, & Kim, 2005; Min & Lee, 2005; Min, Lee, & Han, 2006; Hui & Sun, 2006; Wu, Tzeng, & Goo, 2007; Hua, Wang, & Xu, 2007; Ding, Song, & Zen, 2007), and decision trees (DT) (Frydman, Altman, & Kao, 1985; Sun & Li, 2007a), and further to multi-classifier models that combine different classifiers (Jo & Han, 1996; Lin & McClean, 2001; Sun & Li, 2007b).

Case-based reasoning (CBR) has been demonstrated to be one of the key methodologies employed in the development of knowledge-based systems in the last decade (Liao, 2005). Such knowledge-based systems are human-centered computer programs that emulate the problem-solving behavior of experts (Humphreys, McIvor, & Chan, 2003; Pal & Shiu, 2004). Capturing experts' knowledge and their problem-solving logic in real-world problems is one of the key issues involved. The basic idea of CBR is to utilize solutions of previous problems to solve new problems (Schank, 1982; Waheed & Adeli, 2005; Schmidt & Vorobieva, 2006). In CBR, experts' knowledge is stored in a case library in the form of cases, for users to call on for specific advice as needed. Experts' knowledge is often qualitative, and even its quantified value is subject to imprecision, uncertainty and inconsistency. Hence, when unstructured experts' knowledge is structured into cases and stored in the case library, that knowledge resides in the corresponding relationships between cases and their results, which are classes when CBR is employed in financial distress prediction, a classification problem.

Compared with the abundant literature on financial distress prediction based on MDA, Logit, SVM, NN, etc. (Kumar & Ravi, 2007), only a few studies (Jo & Han, 1996; Bryant, 1997; Jo, Han, & Lee, 1997; Park & Han, 2002; Yip, 2004) have paid attention to CBR-based classification in financial distress prediction. Almost all of this research employed the traditional feature-based similarity measure mechanism of CBR based on Euclidean distance, which is named Basic CBR in this research. CBR was outperformed by, or outperformed, the other models those studies used with the data they collected. Recently, Sun and Hui (2006) refocused on employing CBR in financial distress prediction with data from Chinese listed companies, using an improved similarity measure mechanism. They integrated CBR with weighted voting on the basis of a similarity measure mechanism that introduces grey relationship degrees into the Euclidean distance metric.
Their experimental research with data from Chinese listed companies, comparing the proposed model with MDA, Logit, NN, and SVM, demonstrated that the Grey CBR they proposed in the principle of k-nearest neighbors is more applicable for predicting enterprises that may run into financial distress in less than two years.

In this research, outranking relations are introduced to improve the similarity measure mechanism of CBR, and a new CBR-based classifier based on outranking relations, named OR-CBR, is built up to predict financial distress. Experimental results indicate that OR-CBR is more accurate than MDA, Logit, NN, SVM, DT, Basic CBR, and Grey CBR when dealing with the financial data set from Chinese listed companies, under the assessment of leave-one-out cross-validation and Max normalization. The remainder of this research is organized into five sections. Section 2 describes the research background. Section 3 builds up the model of OR-CBR in the principle of k-nearest neighbors. Section 4 carries out an empirical experiment with data from Chinese listed companies to make a comparative analysis between OR-CBR and other prediction models. Section 5 draws conclusions.

2. Research background

2.1. Basic mechanism of similarity measure

The case retrieval process is a major step of the CBR cycle (Aamodt & Plaza, 1994; Finnie & Sun, 2003), and the similarity measure mechanism among cases is its core. The similarity measure mechanism greatly influences retrieval performance in this process. The basic assumption of similarity is that a target case is dealt with by resorting to a previous situation with common conceptual features (Smyth & Keane, 1998; Pal & Shiu, 2004). A CBR system describes a concept C implicitly by a pair (CB, sim), in which the concept of similarity plays a pivotal role. Tversky (1977) first defined the degree of similarity between cases a and b as a ratio whose numerator is the number of common features between a and b, and whose denominator is the total number of features. It is denoted as follows:

SIM(a, b) = (common features between a and b) / (common and different features between a and b)    (1)

The assumptions of this definition are as follows:

• Memberships of features are bivalent.
• Features are taken as equally important.

With the development of CBR, the most commonly used method nowadays for computing relatedness among cases is weighted feature-based similarity. Given a target case a, the weight for the kth feature can be denoted as w_k. The similarity between target case a and stored case b can be computed by the following mechanism:

SIM^{(w)}(a, b) = \sum_{k=1}^{m} w_k \cdot sim(a_k, b_k)    (2)

where m is the number of features, a_k is the value of the kth feature of the target case, and b_k is the value of the kth feature of the stored case. The similarity on feature k between the two cases can be defined as follows:

sim(a_k, b_k) = 0, if the kth feature is symbolic/qualitative and a_k \ne b_k
sim(a_k, b_k) = 1, if the kth feature is symbolic/qualitative and a_k = b_k
sim(a_k, b_k) = 1 - |norm(a_k) - norm(b_k)|, if the kth feature is continuous    (3)

where norm(a_k) and norm(b_k) represent the normalized feature values of a_k and b_k. The similarity value of two cases becomes larger as the distance between them becomes smaller. In Formula (3), the similarity value of two cases is defined as 1 minus the distance value, and the distance mechanism employed is the Manhattan distance. There are, of course, other similarity computing mechanisms, such as those based on the Euclidean distance, though the overall approach is similar.
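To make the weighted feature-based mechanism of Formulas (2) and (3) concrete, the following minimal Python sketch computes the feature-wise and overall similarity; the function and variable names are illustrative and not taken from the paper.

```python
from typing import Sequence, Union

Value = Union[float, str]

def feature_similarity(a_k: Value, b_k: Value) -> float:
    """Formula (3): similarity on one feature."""
    if isinstance(a_k, str) or isinstance(b_k, str):
        # symbolic/qualitative feature: exact match or nothing
        return 1.0 if a_k == b_k else 0.0
    # continuous feature: values are assumed to be already normalized
    return 1.0 - abs(a_k - b_k)

def weighted_similarity(a: Sequence[Value], b: Sequence[Value],
                        w: Sequence[float]) -> float:
    """Formula (2): weighted sum of feature-wise similarities."""
    return sum(w_k * feature_similarity(a_k, b_k)
               for a_k, b_k, w_k in zip(a, b, w))

# usage: two cases with three normalized continuous features
print(weighted_similarity([0.2, 0.9, 0.5], [0.3, 0.7, 0.5], [0.5, 0.3, 0.2]))
```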
The similarity mechanism based on the weighted Euclidean distance is shown as follows:

SIM(a, b) = \frac{1}{1 + \alpha \left( \sum_{k=1}^{m} w_k^2 \, |norm(a_k) - norm(b_k)|^2 \right)^{0.5}}    (4)

where \alpha is a positive coefficient. The ratio proposed by Tversky (1977) is a particular instance of the commonly used weighted feature-based similarity. When the distance measure dis(a_k, b_k) = |norm(a_k) − norm(b_k)| is considered directly, the distance measure for the kth feature has the following properties:

• dis(a_k, b_k) = 0 if and only if a_k = b_k.
• dis(a_k, b_k) = dis(b_k, a_k).
• dis(a_k, c_k) ≤ dis(a_k, b_k) + dis(b_k, c_k).

On the foundation of the weighted feature-based similarity mechanism, the principle of k-nearest neighbors and its derivatives, by far the most commonly used retrieval techniques, have been developed.

When carrying out further research on the similarity measure mechanism, or on CBR in general, it should first be made clear whether CBR is a technology like rule-based reasoning and NN, or whether it is better described as a methodology for problem-solving, because this determines what is being researched and how the research can be carried out. Kolodner (1993) first raised the question and believed that CBR is both a cognitive model and a method of building intelligent systems. After describing four applications of CBR, which variously use the technologies of nearest neighbors (Price & Pegler, 1995), induction such as ID3 (Heider, 1996), fuzzy logic (Leake, 1996), and database technology (Cheetham & Graf, 1997), Watson (1999) drew the conclusion that CBR is a methodology and not a technology.

Because there are no fixed technologies in CBR, any technology or approach can be introduced into the CBR life-cycle. Pal and Shiu (2004) believed that it would be hard to achieve a quantum jump in CBR by using only traditional computing and reasoning methods. In fact, the technology aggregate of soft computing (Zadeh, 1965, 1994), including NN, genetic algorithms, fuzzy set theory, etc., has already been introduced to accelerate the development of CBR. Hence, the stage of soft CBR (SCBR) has already arrived (Pal & Shiu, 2004). Soft computing techniques help CBR systems reason with uncertainty and imprecision, which is very useful for enhancing their problem-solving ability. As a matter of fact, there may be other technologies and approaches that can spur the development of CBR, such as outranking relations (Roy, 1971; Roy & Vincke, 1984; Roy, 1991; Roy, 1996; Roy & Słowiński, 2007). In this research, we deal with how outranking relations can be used to extend traditional weighted feature-based similarity computation in the principle of k-nearest neighbors.

At the same time, there are some other similarity measure mechanisms, such as fuzzy similarity obtained by introducing fuzzy set theory into case matching, the DT mechanism, the NN mechanism, etc. (Pal & Shiu, 2004). The fuzzy similarity mechanism handles linguistic memberships of features such as low, medium and high, whereas the object of this research is still numerical memberships of features. Thus, discussion of the fuzzy similarity measure mechanism is beyond this research. For the other similarity mechanisms based on DT and NN, we carry out an experiment to compare the prediction accuracies of OR-CBR and predictive models directly developed with DT and NN.
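As a complement to the sketch above, the weighted Euclidean variant of Formula (4) can be written as below; the coefficient alpha is treated as a tunable assumption, since the paper does not fix its value.

```python
import math
from typing import Sequence

def euclidean_similarity(a: Sequence[float], b: Sequence[float],
                         w: Sequence[float], alpha: float = 1.0) -> float:
    """Formula (4): similarity as a decreasing function of weighted Euclidean distance."""
    dist = math.sqrt(sum((w_k ** 2) * (a_k - b_k) ** 2
                         for a_k, b_k, w_k in zip(a, b, w)))
    return 1.0 / (1.0 + alpha * dist)

# identical cases give similarity 1; it decays toward 0 as they move apart
print(euclidean_similarity([0.2, 0.9], [0.2, 0.9], [0.6, 0.4]))
print(euclidean_similarity([0.2, 0.9], [1.0, 0.1], [0.6, 0.4]))
```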
2.2. Outranking relations

Outranking relation methods, best known as the ELECTRE family, in which an outranking relation between alternatives is constructed from pseudo-criteria, have been developed over the years. In order to deal with conflicting multiple criteria in complex systems, and with numerical values of alternatives on some criteria that are subject to imprecision, uncertainty, and incompleteness, Roy et al. (Roy, 1971; Roy & Vincke, 1984; Roy, 1991; Roy, 1996; Roy & Słowiński, 2007) developed the concept of the pseudo-criterion with its indifference, preference, and veto thresholds.

Consider two alternatives a and b, and m criteria g_k (k = 1, 2, ..., m). Let g_k(a) and g_k(b) denote the kth criterion value of alternatives a and b, respectively, and assume that a larger value is preferred to a smaller one. Consider the assertion that b is outranked by a. Then g_k(a) > g_k(b) expresses that a is preferred to b according to the kth criterion, and g_k(a) = g_k(b) expresses that a is indifferent to b according to the kth criterion. In this case, g_k is called a true criterion. In the presence of imprecision, uncertainty, and so on, it is reasonable to admit that if g_k(a) − g_k(b) is small enough, a and b can be regarded as indifferent. By introducing an indifference threshold q_k, g_k(a) − g_k(b) > q_k expresses that a is preferred to b, and g_k(a) − g_k(b) ≤ q_k expresses that a is indifferent to b. Then g_k is called a semi-criterion. To avoid an abrupt change from strict preference to indifference, a preference threshold p_k (p_k > q_k) is introduced. If g_k(a) − g_k(b) ≤ q_k, a and b are considered indifferent. If g_k(a) − g_k(b) > p_k, a is considered strictly preferred to b. The case q_k < g_k(a) − g_k(b) ≤ p_k is then considered a hesitation between indifference and preference, which is called weak preference. Then g_k is called a pseudo-criterion.

Considering the same assertion 'b is outranked by a', for each pseudo-criterion g_k an outranking relation index c_k(a, b) is defined as follows:

c_k(a, b) = 1, if g_k(a) - g_k(b) > p_k    (5)
c_k(a, b) = 0, if |g_k(a) - g_k(b)| \le q_k    (6)
c_k(a, b) \in [0, 1), otherwise    (7)

Thus the concordance index of 'a outranks b' is defined as follows:

C(a, b) = \sum_k w_k \cdot c_k(a, b)    (8)

On the other hand, by introducing a veto threshold v_k for each criterion g_k, a discordance index d_k(a, b), which rejects the assertion 'a outranks b', is defined as follows:

d_k(a, b) = 1, if g_k(b) - g_k(a) > v_k    (9)
d_k(a, b) = 0, if g_k(b) - g_k(a) \le p_k    (10)
d_k(a, b) \in (0, 1), otherwise    (11)

The degree of credibility of 'b is outranked by a' is defined as follows:

S(a, b) = C(a, b) = \sum_k w_k \cdot c_k(a, b), if d_k(a, b) \le C(a, b) for all k    (12)

S(a, b) = C(a, b) \cdot \prod_{k \in K(a, b)} \frac{1 - d_k(a, b)}{1 - C(a, b)}, otherwise    (13)

where K(a, b) is the set of criteria for which d_k(a, b) > C(a, b).
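The discordance and credibility computations of Formulas (9)–(13) can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the linear interpolation used for the "otherwise" region of the discordance index is an assumption in the spirit of ELECTRE III.

```python
from typing import Sequence

def discordance(g_a: float, g_b: float, p_k: float, v_k: float) -> float:
    """Formulas (9)-(11): discordance of criterion k against 'a outranks b'."""
    diff = g_b - g_a
    if diff > v_k:
        return 1.0
    if diff <= p_k:
        return 0.0
    # assumed linear interpolation between p_k and v_k for the 'otherwise' case
    return (diff - p_k) / (v_k - p_k)

def credibility(c: Sequence[float], d: Sequence[float],
                w: Sequence[float]) -> float:
    """Formulas (8), (12), (13): credibility of 'b is outranked by a'.

    c and d are per-criterion concordance and discordance indices;
    w are the criterion weights (assumed to sum to 1 here).
    """
    C = sum(w_k * c_k for w_k, c_k in zip(w, c))   # Formula (8)
    if all(d_k <= C for d_k in d):                 # Formula (12)
        return C
    S = C
    for d_k in d:                                  # Formula (13)
        if d_k > C:
            S *= (1.0 - d_k) / (1.0 - C)
    return S

# toy usage with three criteria
print(discordance(0.3, 0.8, p_k=0.2, v_k=0.6))
print(credibility(c=[1.0, 0.4, 0.0], d=[0.0, 0.2, 0.9], w=[0.5, 0.3, 0.2]))
```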
3. Prediction model of OR-CBR

There are three steps when employing CBR for prediction, listed as follows:

• Identifying significant features to describe a case.
• Collecting cases in the case library that are similar to the target case, based on the significant features.
• Predicting the class of the target case based on the classes of the similar cases.

This process of prediction is often called forecasting-by-analogy (Jo & Han, 1996). It can be seen that the similarity measure mechanism is the foundation of CBR-based classification such as financial distress prediction. It is also the core of Sun and Hui's (2006) research on building a new model of similarity weighted voting CBR for financial distress prediction by introducing grey relationship degrees. The reasoning structure of the OR-CBR model in financial distress prediction is shown in Fig. 1.

Fig. 1. A proposed reasoning structure of the model of OR-CBR in financial distress prediction.

The core of OR-CBR in financial distress prediction is matching similar cases to the target case. By introducing outranking relations into the weighted feature-based similarity measure mechanism, the similarity measuring matrix is transformed into a concordance indices matrix and a discordance indices matrix, based on which case matching is carried out.

For CBR-based classification problems such as financial distress prediction, cases are divided into at least two classes, which can be denoted as t (t ∈ N, t ≥ 1). The aim of a CBR-based classifier is to predict which class the target case belongs to. Prediction of OR-CBR is founded on a distance measure based on the location of objects in Manhattan space. It can be expressed in the following manner. Let CB = {(c_1, t_1), (c_2, t_2), ..., (c_n, t_n)} denote a case library with n cases. Each case is identified by a set of corresponding features, expressed as F = {F_k} (k = 1, 2, ..., m). The ith case c_i in the case base can be represented as an m-dimensional vector, c_i = (x_i1, x_i2, ..., x_im), where x_ik is the value of feature F_k (1 ≤ k ≤ m) of case c_i. Let the weight of each feature F_k be expressed as w_k (w_k ∈ [0, 1]). When the technology of outranking relations is introduced to improve similarity computation in the building-up of OR-CBR, the main difference between OR-CBR and Basic CBR is the mechanism for dealing with distance computation.

3.1. OR-CBR in the principle of nearest neighbor

When outranking relations are introduced into CBR to build a new similarity measure mechanism, the assertion that a is indifferent to b is employed, because similarity often means that there is little difference between two entities. The main definitions of the nearest-neighbor OR-CBR classifier are as follows:

Definition 1 (Standardization). Standardization maps the input data of the OR-CBR classifier into the range [0, 1] or [−1, 1]. ∀ target case c_0, ∀ c_i ∈ CB, ∀ F_k ∈ F, ∃ x_0k ∈ c_0, ∃ x_ik ∈ c_i. It is supposed that a bigger value is preferred. The standardization process for the range [−1, 1] can be denoted as

x_{ik} = \frac{x_{ik}}{\max_i |x_{ik}|}, \quad i = 0, 1, 2, \ldots, n; \; k = 1, 2, \ldots, m    (14)

where max_i |x_ik| represents the maximum absolute value of feature F_k over all stored cases and the target case, and x_ik denotes the value of case c_i on feature F_k. After standardization, it is clear that x_ik ∈ [−1, 1].
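A minimal sketch of the Max standardization of Definition 1 and Formula (14) is given below; as the definition describes, the column-wise maximum is taken over the stored cases together with the target case.

```python
from typing import List

def max_standardize(cases: List[List[float]]) -> List[List[float]]:
    """Formula (14): divide each feature column by its maximum absolute value.

    `cases` holds the target case and all stored cases as rows; a zero column
    is left unchanged to avoid division by zero (an implementation assumption).
    """
    n_features = len(cases[0])
    col_max = [max(abs(row[k]) for row in cases) for k in range(n_features)]
    return [[row[k] / col_max[k] if col_max[k] else 0.0
             for k in range(n_features)]
            for row in cases]

# target case first, then two stored cases
print(max_standardize([[2.0, -50.0], [4.0, 25.0], [-1.0, 100.0]]))
```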
Definition 2 (Outranking relations between cases on each feature, with the assertion that two cases are indifferent). ∀ target case c_0, ∀ c_i ∈ CB, ∀ F_k ∈ F, ∃ x_0k ∈ c_0, ∃ x_ik ∈ c_i, ∃ p_k ∈ [0, 2]. If the condition |x_ik − x_0k| > p_k is met, then c_i is strictly different from c_0 when feature F_k is taken into consideration, i.e. they are strictly different. This is denoted as P_k(c_i, c_0), and p_k is the difference threshold parameter. ∀ target case c_0, ∀ c_i ∈ CB, ∀ F_k ∈ F, ∃ x_0k ∈ c_0, ∃ x_ik ∈ c_i, ∃ q_k ∈ [0, 2]. If the condition |x_ik − x_0k| ≤ q_k is met, then c_i is indifferent to c_0 when feature F_k is taken into consideration. This is denoted as I_k(c_i, c_0), and q_k is the indifference threshold parameter. ∀ target case c_0, ∀ c_i ∈ CB, ∀ F_k ∈ F, ∃ x_0k ∈ c_0, ∃ x_ik ∈ c_i, ∃ q_k ∈ [0, 2], ∃ p_k ∈ [0, 2], q_k < p_k. If the condition q_k < |x_ik − x_0k| ≤ p_k is met, then c_i is weakly different from c_0 when feature F_k is taken into consideration, i.e. they are weakly different. This is denoted as W_k(c_i, c_0), where q_k is the indifference threshold parameter and p_k is the difference threshold parameter.

Definition 3 (Concordance indices between cases on each feature, with the assertion that two cases are indifferent). ∀ target case c_0, ∀ c_i ∈ CB, ∀ F_k ∈ F, ∃ x_0k ∈ c_0, ∃ x_ik ∈ c_i, ∃ q_k ∈ [0, 2], ∃ p_k ∈ [0, 2], q_k < p_k. Let the concordance index of c_i being indifferent to c_0 on feature F_k be denoted as c_k(c_i, c_0).

c_k(c_i, c_0) = 0, if P_k(c_i, c_0)    (15)
c_k(c_i, c_0) = 1, if I_k(c_i, c_0)    (16)
c_k(c_i, c_0) = \frac{p_k - |x_{ik} - x_{0k}|}{p_k - q_k}, if W_k(c_i, c_0)    (17)

Outranking relations and concordance indices on feature F_k are illustrated in Fig. 2.

Fig. 2. Outranking relations and indifference concordance indices of c_i to c_0 on feature F_k.

Definition 4 (Concordance indices between cases, with the assertion that two cases are indifferent). ∀ w_k, let C(c_i, c_0) denote the concordance index with the assertion that stored case c_i and target case c_0 are indifferent. It can be computed as follows:

C(c_i, c_0) = \frac{\sum_{k=1}^{m} w_k \cdot c_k(c_i, c_0)}{\sum_{k=1}^{m} w_k}    (18)

Definition 5 (Discordance indices on each feature, with the assertion that two cases are indifferent). ∀ target case c_0, ∀ c_i ∈ CB, ∀ F_k ∈ F, ∃ x_0k ∈ c_0, ∃ x_ik ∈ c_i, ∃ v_k ∈ [0, 2], ∃ p_k ∈ [0, 2], p_k < v_k. Let the discordance index of c_i being indifferent to c_0 on feature F_k be denoted as d_k(c_i, c_0).

d_k(c_i, c_0) = 0, if |x_{ik} - x_{0k}| \le p_k    (19)
d_k(c_i, c_0) = 1, if |x_{ik} - x_{0k}| > v_k    (20)
d_k(c_i, c_0) = \frac{|x_{ik} - x_{0k}| - p_k}{v_k - p_k}, if p_k < |x_{ik} - x_{0k}| \le v_k    (21)

In this definition, v_k is the veto threshold parameter of feature F_k, beyond which the assertion that c_i is indifferent to c_0 is refused. The discordance indices of c_i being indifferent to c_0 on feature F_k are illustrated in Fig. 3.

Fig. 3. Discordance indices of c_i indifferent to c_0 on feature F_k.

Definition 6 (Similarity between cases on the basis of concordance indices and discordance indices). ∀ target case c_0, ∀ c_i ∈ CB, ∀ F_k ∈ F. Based on the concordance and discordance indices on each feature, the similarity between stored case c_i and target case c_0 is defined as the credibility degree of indifference between them, following ELECTRE.

SIM^{(w)}_{i,0} = SIM^{(w)}(c_i, c_0) = C(c_i, c_0), if d_k(c_i, c_0) \le C(c_i, c_0) for all k    (22)

SIM^{(w)}_{i,0} = SIM^{(w)}(c_i, c_0) = C(c_i, c_0) \cdot \prod_{k \in K(c_i, c_0)} \frac{1 - d_k(c_i, c_0)}{1 - C(c_i, c_0)}, otherwise    (23)

where K(c_i, c_0) is the set of features for which d_k(c_i, c_0) > C(c_i, c_0).
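Definitions 2–6 can be combined into one similarity routine. The sketch below follows Formulas (15)–(23) with equal thresholds across features, as the paper later assumes; it is illustrative rather than the authors' implementation.

```python
from typing import Sequence

def or_cbr_similarity(c_i: Sequence[float], c_0: Sequence[float],
                      w: Sequence[float],
                      q: float, p: float, v: float) -> float:
    """Credibility that stored case c_i is indifferent to target case c_0."""
    conc, disc = [], []
    for x_ik, x_0k in zip(c_i, c_0):
        diff = abs(x_ik - x_0k)
        # concordance index, Formulas (15)-(17)
        if diff > p:
            conc.append(0.0)
        elif diff <= q:
            conc.append(1.0)
        else:
            conc.append((p - diff) / (p - q))
        # discordance index, Formulas (19)-(21)
        if diff <= p:
            disc.append(0.0)
        elif diff > v:
            disc.append(1.0)
        else:
            disc.append((diff - p) / (v - p))
    # overall concordance, Formula (18)
    C = sum(w_k * c_k for w_k, c_k in zip(w, conc)) / sum(w)
    # credibility, Formulas (22)-(23)
    if all(d_k <= C for d_k in disc):
        return C
    sim = C
    for d_k in disc:
        if d_k > C:
            sim *= (1.0 - d_k) / (1.0 - C)
    return sim

# two standardized cases with three equally weighted features
print(or_cbr_similarity([0.2, -0.1, 0.9], [0.25, 0.3, 0.8],
                        w=[1.0, 1.0, 1.0], q=0.1, p=0.3, v=0.6))
```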
The similarity measure mechanism based on outranking relations satisfies the following three characteristics:

• SIM^{(w)}_{i,0} ≥ 0.

Proof. Since c_k(c_i, c_0) ≥ 0, it follows from Formula (18) that C(c_i, c_0) ≥ 0, and hence from Formula (22) that SIM^{(w)}_{i,0} ≥ 0. Moreover, since 1 ≥ c_k(c_i, c_0) ≥ 0 and 1 ≥ d_k(c_i, c_0) ≥ 0, every factor in Formula (23) is non-negative, so SIM^{(w)}_{i,0} ≥ 0 also holds in that case.

• SIM^{(w)}_{i,i} ≥ SIM^{(w)}_{i,0}.

Proof. Since c_k(c_i, c_i) = 1 and d_k(c_i, c_i) = 0, Formula (18) gives C(c_i, c_i) = 1 and Formula (22) gives SIM^{(w)}_{i,i} = 1. On the other hand, since 1 ≥ c_k(c_i, c_0) ≥ 0 and 1 ≥ d_k(c_i, c_0) ≥ 0, Formula (18) gives 0 ≤ C(c_i, c_0) ≤ 1, so Formula (22) gives 0 ≤ SIM^{(w)}_{i,0} ≤ 1. In Formula (23), d_k(c_i, c_0) > C(c_i, c_0) for every k ∈ K(c_i, c_0), so 0 ≤ (1 − d_k(c_i, c_0))/(1 − C(c_i, c_0)) < 1 and again 0 ≤ SIM^{(w)}_{i,0} ≤ 1. Hence SIM^{(w)}_{i,i} ≥ SIM^{(w)}_{i,0}.

• SIM^{(w)}_{i,0} = SIM^{(w)}_{0,i}.

Proof. Since P_k(c_i, c_0) = P_k(c_0, c_i), I_k(c_i, c_0) = I_k(c_0, c_i), and W_k(c_i, c_0) = W_k(c_0, c_i), it follows that c_k(c_i, c_0) = c_k(c_0, c_i) and d_k(c_i, c_0) = d_k(c_0, c_i). Therefore, whether Formula (22) or Formula (23) applies, SIM^{(w)}_{i,0} = SIM^{(w)}_{0,i}.

Definition 7 (Predicting the class of the target case in the principle of the nearest neighbor). Suppose that c* denotes the nearest neighbor case, whose corresponding class value is represented as t*. Then the class value of the nearest neighbor case is taken as the predicted class value of the target case, i.e. t_0 = t*.

3.2. OR-CBR in the principle of k-nearest neighbors

When the principle of k-nearest neighbors is employed, the class values of the k-nearest neighbor cases are integrated to predict the class value of the target case by similarity weighted voting. This means a different prediction mechanism is employed than that of Definition 7.

Definition 8 (Predicting the class of the target case in the principle of k-nearest neighbors). Given the neighbor parameter denoted as per, the similarity threshold T is set as the neighbor parameter per multiplied by the maximum similarity between the target case and all stored cases:

T = per \cdot \max_i (SIM^{(w)}_{i,0})    (24)

Suppose c_{i*} denotes one of the k-nearest neighbor cases. According to the k-nearest neighbors principle, stored cases whose similarity with the target case is greater than or equal to T constitute the set of k-nearest neighbors of the target case, denoted by C*:

C^{*} = \{ c_{i*} \mid SIM^{(w)}_{i*,0} \ge T \}, \quad i* = 1, 2, \ldots, k    (25)

In the classification problem, the class value of target case c_0 is predicted from those of the k most similar cases. Suppose the corresponding class value of c_{i*} is represented as t_{i*}, and its similarity with target case c_0 is represented as SIM^{(w)}_{i*,0}. The similarity weighted voting probability of target case c_0 belonging to class t_l is defined as prob(t_l):

prob(t_l) = \frac{\sum_{t_{i*} = t_l} SIM^{(w)}_{i*,0}}{\sum_{i*=1}^{k} SIM^{(w)}_{i*,0}}, \quad l = 1, 2, \ldots, L    (26)

Here the denominator is the sum of the similarities between target case c_0 and all k-nearest neighbor cases, and the numerator is the sum of the similarities between target case c_0 and those k-nearest neighbor cases whose class value equals t_l. Let the class value of target case c_0 be denoted as t_0. According to this combination principle, the final value of t_0 is the class t* whose similarity weighted voting probability is the largest among all classes:

t_0 = t^{*}, if prob(t^{*}) = \max_{l=1}^{L} (prob(t_l))    (27)

When there are only two classes, the mechanism can be simplified as follows:

t_0 = t_1, if prob(t_1) \ge prob(t_2); \quad t_0 = t_2, if prob(t_2) \ge prob(t_1)    (28)
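Definition 8 can be sketched as follows, assuming the similarities of the target case to all stored cases have already been computed (for example with the routine sketched after Definition 6); the names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

def knn_weighted_vote(similarities: Sequence[float],
                      classes: Sequence[int],
                      per: float) -> Tuple[int, Dict[int, float]]:
    """Formulas (24)-(27): similarity-weighted voting over the neighbor set."""
    T = per * max(similarities)                                            # Formula (24)
    neighbors = [(s, t) for s, t in zip(similarities, classes) if s >= T]  # Formula (25)
    total = sum(s for s, _ in neighbors)
    votes: Dict[int, float] = defaultdict(float)
    for s, t in neighbors:                                                 # Formula (26)
        votes[t] += s / total
    predicted = max(votes, key=votes.get)                                  # Formula (27)
    return predicted, dict(votes)

# similarities of the target case to five stored cases and their classes
# (class 1 = normal, class 2 = distressed)
print(knn_weighted_vote([0.91, 0.85, 0.40, 0.88, 0.30],
                        [1, 2, 1, 2, 2], per=0.9))
```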
3.3. Algorithm description of OR-CBR in financial distress prediction

The main difference between the similarity measuring mechanism of OR-CBR and other feature-weighted ones, such as Basic CBR and Grey CBR, is shown in Fig. 4.

Fig. 4. Comparison between the similarity measure mechanism of OR-CBR and other feature-weighted ones.

In Basic CBR, similarity is computed directly through the distance matrix between memberships of the target case and stored cases on each feature; the most widely used distances are the Hamming distance and the Euclidean distance. In Grey CBR, a grey correlation measure is employed to transform the memberships of the target case and stored cases on each feature into a grey correlation matrix, based on which a summation mechanism is employed to compute the similarity between the target case and stored cases. In OR-CBR, outranking relations between cases on each feature are employed; thus, the concordance indices matrix and discordance indices matrix substitute for the distance matrix in Basic CBR and the grey correlation matrix in Grey CBR.

On the whole, the inputs of OR-CBR are as follows:

• Case library – CB.
• Target case.
• Difference threshold parameter for each feature – p_k.
• Indifference threshold parameter for each feature – q_k.
• Veto threshold parameter for each feature – v_k.
• Neighbor parameter – per.

The output of OR-CBR is the class of the target case – t*. This means there are mainly four types of parameters to be optimized, i.e. per, the difference parameters, the indifference parameters, and the veto parameters. Because a standardization process is employed, it can be assumed that the same value is used within each type of outranking parameter. As a result, there are four parameters to be optimized, i.e. per, p, q, and v. The algorithm description of OR-CBR in financial distress prediction is as follows.

Begin
  // Inputting the vectors of companies to be predicted
  Get_target_cases( );
  For each target case
    For each case in CB
      // Standardizing vectors of the target case and stored cases into the range of [−1, 1] or [0, 1]
      Data_standardization(c_0, CB);
      // Computing similarity between the target case and stored cases on each feature
      For each feature of a case
        // Getting outranking relations between the target case and stored cases
        Get_outranking_relations(c_0, c_i, q, p, v);
        // Computing concordance and discordance indices on each feature
        Get_concordance(c_0, c_i);
        Get_discordance(c_0, c_i);
      EndFor
      // Computing similarity between the target case and stored cases
      Get_similarity(c_0, c_i);
    EndFor
    // Searching the k nearest neighbors in CB, i.e. the companies similar to the target case on the features employed
    Get_k_cases(per, max(SIM^{(w)}_{i,0}));
    // Predicting the class of the target company based on the voting mechanism of the k-nearest neighbors
    Prediction([c_i]);
  EndFor
End

Because there are four parameters in OR-CBR, i.e. q, p, v, and per, they have to be optimized before prediction is carried out. On the basis of the R4 model of CBR (Aamodt & Plaza, 1994; Finnie & Sun, 2003; Pal & Shiu, 2004), the working flows of OR-CBR prediction and parameter optimization are shown in Fig. 5.

Fig. 5. Working flow of OR-CBR prediction and parameters optimization.

Firstly, the parameters of OR-CBR should be optimized to ensure a more accurate prediction. In the parameter optimization process, cross-validation is utilized. Cases in the case library are divided into several shares. Each time, one share is taken out as the target cases and the rest act as the case base. In the retrieval process of CBR, the outranking parameters, i.e. q, p, and v, and the neighbor parameter per are input into OR-CBR. Based on the final feature sets employed, similarities are computed and the k-nearest neighbors are retrieved. In the reuse process of CBR, the class values of the k-nearest neighbors are used to predict the class value of a target case, which becomes the suggested class value of that target case. In the revision process of CBR, the real class value of the target case is used to evaluate the prediction accuracy. This is repeated until the optimal parameters have been found. Then, a new company can be predicted with the optimal parameters: in the retrieval process, the k-nearest neighbors are retrieved from the case library; in the reuse process, the class values of the retrieved cases are employed to predict the class value of the target company, and a suggested class value of the target company is obtained. If the new case should be stored in the case library, the retaining process is carried out.
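The parameter search described above can be sketched as a grid search wrapped around any OR-CBR prediction routine; the candidate grids and the `predict_fn` interface below are assumptions for illustration, not values or names from the paper.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

# predict_fn(train_cases, train_classes, target_case, q, p, v, per) -> predicted class
PredictFn = Callable[..., int]

def loo_grid_search(cases: List[Sequence[float]], classes: List[int],
                    predict_fn: PredictFn,
                    q_grid: Sequence[float], p_grid: Sequence[float],
                    v_grid: Sequence[float], per_grid: Sequence[float]
                    ) -> Tuple[tuple, float]:
    """Leave-one-out accuracy for every (q, p, v, per) combination."""
    best_params, best_acc = None, -1.0
    for q, p, v, per in product(q_grid, p_grid, v_grid, per_grid):
        if not (q < p < v):          # keep thresholds consistent with Definitions 2-5
            continue
        hits = 0
        for i in range(len(cases)):  # each case in turn acts as the target case
            train_cases = cases[:i] + cases[i + 1:]
            train_classes = classes[:i] + classes[i + 1:]
            if predict_fn(train_cases, train_classes, cases[i],
                          q, p, v, per) == classes[i]:
                hits += 1
        acc = hits / len(cases)
        if acc > best_acc:
            best_params, best_acc = (q, p, v, per), acc
    return best_params, best_acc
```

In the experiments of Section 4, `predict_fn` would wrap the OR-CBR similarity and voting steps sketched in Sections 3.1 and 3.2.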
4. Experiment with real-world data

A set of in-order steps is carried out, i.e. data collection, data preprocessing, feature extraction, modeling, and assessment. The framework of the experiment is shown in Fig. 6.

4.1. Data collection

The data sample consists of financial data from Chinese listed companies. The companies are grouped into two types: specially treated (ST) companies and normal ones. A company is specially treated in China because it has had negative net profit for two continuous years, or for some other reasons. In this research, all the ST data samples collected are those specially treated for negative net profit in two continuous years. Companies' financial states are divided into two classes according to whether the company is specially treated or not:

• Normal company.
• Distressed company.

Companies that have never been specially treated are regarded as healthy, and ST companies are regarded as financially distressed. The sample data, whether distressed or not, are from companies listed on the Shanghai Stock Exchange and the Shenzhen Stock Exchange. The distressed companies in this study are considered 1, 2, and 3 years prior to running into distress; that is, three scheduling datasets are employed in the research. Let the year when a company is specially treated because of negative net profit be denoted as year t; then t−1 represents one year prior to ST, t−2 represents two years prior to ST, and t−3 represents three years prior to ST. In OR-CBR, the three datasets are treated as three case libraries. Data were collected from 135 pairs of listed companies over the span of 2000–2005. The features employed are account items from both the balance sheet and the income statement. The 35 initial features cover profitability, activity ability, debt ability and growth ability.

4.2. Data preprocessing

The main function of the data preprocessing step is to eliminate missing data and outliers from the three data sets. The operating principles are listed as follows (a minimal sketch of these steps is given below):

• Transforming the three data sets into three tables, which take the role of case libraries.
• Filtering out duplicated data.
• Eliminating samples that miss at least one feature value.

Fig. 6. Framework of the empirical experiment.
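The preprocessing principles above can be sketched with pandas; the file names and column layout are assumptions for illustration only.

```python
import pandas as pd

# one table per scheduling dataset (t-1, t-2, t-3); file names are assumed
for path in ["cases_t1.csv", "cases_t2.csv", "cases_t3.csv"]:
    cases = pd.read_csv(path)        # transform the data set into a table
    cases = cases.drop_duplicates()  # filter out duplicated data
    cases = cases.dropna()           # eliminate samples missing any feature value
    cases.to_csv(path.replace(".csv", "_clean.csv"), index=False)
```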
[Table fragment in the source: values for initial features X11–X14 and X21–X23 (group statistics with t-values and significance levels); the full table is truncated in this copy.]
