Performance of Modern Techniques for Rating Model Design


Master Thesis
Master of Advanced Studies in Finance

Supervising Professor: Prof. Dr. Uwe Schmock
Practical Supervisor: Ernst & Young Zürich
Student: Anca Antonov
Year: 2002/2003
Zürich 2004

Table of Contents

1. Introduction
2. Classification Problem
   2.1 Classification Theoretical Framework
   2.2 Classification Problem and Corporate Rating Problem
   2.3 Generalization Performance
3. Data Set Description
4. Linear Classifiers
   4.1 Linear Regression
   4.2 Gradient Descent
   4.3 Discriminant Analysis
       4.3.1 Linear Discriminant Analysis
5. Non-Linear Classifiers
   5.1 Quadratic Discriminant Analysis
   5.2 Polynomial Regression
   5.3 Logistic Regression
   5.4 K-Nearest Neighbors Regression
   5.5 Parzen Windows Density Estimator
6. Neural Networks
   6.1 Multilayer Perceptrons
       6.1.1 Backpropagation Training with Gradient Descent
       6.1.2 QuickPropagation
       6.1.3 Scaled Conjugate Gradient
       6.1.4 Stochastic Learning Process
       6.1.5 Pruning: Optimal Brain Surgeon
   6.2 Radial Basis Neural Networks
       6.2.1 Probabilistic Neural Networks
   6.3 Learning Vector Quantization
   6.4 Self-Organizing Maps
7. Conclusions and Further Research

1 Introduction

Credit risk forecasting is one of the leading topics in modern finance, as bank regulation has made increasing use of external and internal credit ratings. One of the most important examples is the package of rules for determining the required capital for the market risk in the trading book, issued by the Basel Committee on Banking Supervision. The discussion process that led to the June 1999 BCBS proposal for a revised international accord followed this trend of growing importance attached to the credit scoring process, culminating in a more prominent proposed role for credit ratings in the determination of overall capital for banking institutions. The problems that regulators and practitioners face, however, are not few.

Most of the problems in credit scoring are technical rather than theoretical. In an ideal (theoretical) world, probabilities of default (PDs) could be assigned directly to obligors: the model builder would know the probability distribution of future defaults within the population of borrowers. This information is, however, unknown to the model builder a priori. Due to this data restriction, a two-step approach is usually carried out. First, based on past information, default risk models assign a credit score to each corporate observation, which induces a ranking among the contemplated corporations. Second, given the ranking, corporations are mapped to an internal grade for which a PD has to be estimated.

Statistical scoring methods combine and weight individual accounting ratios to produce a measure, a credit risk score, that discriminates between healthy and problem firms. The most widely used statistical methods are linear discriminant analysis and logistic regression. The classic Fisher linear discriminant analysis seeks a linear function of accounting variables that maximizes the differences (variance) between the two groups of firms while minimizing the differences within each group. The variables of the scoring function are generally selected from a large set of accounting ratios on the basis of their statistical significance, and the coefficients of the scoring function represent the contributions (weights) of each ratio to the overall score; a minimal sketch of such a scoring function follows below.
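As an illustration of the Fisher approach just described, the following sketch (not from the thesis; it assumes scikit-learn is available and uses randomly generated two-class data in place of real accounting ratios) fits a linear discriminant score and reads off the ratio weights:

```python
# Minimal sketch: Fisher-style linear scoring on synthetic accounting ratios.
# The data are randomly generated placeholders, not the thesis data set.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 200
# Five ratios per firm (e.g. working capital/TA, retained earnings/TA, ...)
healthy = rng.normal(loc=[0.2, 0.15, 0.08, 0.7, 1.0], scale=0.3, size=(n, 5))
problem = rng.normal(loc=[-0.1, -0.05, -0.02, 0.3, 0.8], scale=0.3, size=(n, 5))
X = np.vstack([healthy, problem])
y = np.array([0] * n + [1] * n)             # 0 = healthy, 1 = problem

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print("ratio weights:", lda.coef_.ravel())  # contribution of each ratio to the score
print("in-sample accuracy:", lda.score(X, y))
```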
All in all, multivariate accounting-based credit-scoring models have been shown to perform quite well. In particular, linear discriminant analysis seems robust even when the underlying statistical hypotheses do not hold exactly, especially when used with large samples. Logistic analysis has produced similar results. Some recent studies (see [4] for a review of all methods applied in rating methodologies) use both methods and choose the one with the best out-of-sample performance, to avoid problems of sample-specific bias and overfitting.

A relatively new, and less thoroughly tested, approach to the problem of credit risk classification is based on artificial intelligence methods, such as expert systems and automated learning (neural networks, decision trees and genetic algorithms). These methods dispense with some of the restrictive assumptions of the earlier statistical models, as explained at the beginning of the chapter.

The scope of this master thesis is to test the efficiency and accuracy of logistic regression, linear and quadratic discriminant analysis, polynomial regression and k-nearest neighbours in comparison with neural networks. The thesis is organized as follows: Chapter 2 gives an overview of the classification problem and the appropriate statistical measures of generalization performance; the data used are analysed in Chapter 3; the statistical methods are explained in Chapters 4 to 6, together with the results; the conclusions are detailed in Chapter 7, together with some last remarks and possible extensions of the topic.

2 Classification Problem

2.1 Classification Theoretical Framework

The decision-theoretical framework is based on classical decision theory as presented in the book of Christopher Bishop [2]. An observation is a feature vector X = x, x = (x_j), j = 1, ..., N, from some space X, where all the available observations constitute the sample space. The true class membership is ω ∈ Ω = {w_1, w_2, ..., w_M}, M being the total number of classes, supposed finite. We assume that the pairs (X, ω) are drawn independently from the joint distribution p(x, ω) on X × Ω.

Given an observation (vector of features) X = x, the goal is to make a decision regarding the true class membership of the observation. This is the classification problem: predict the true class ω for an observation X = x. In this context we define the action space Y, the space of allowable decisions in our classification problem. The action space can be seen as an extension of the space of classes, Y = {Ω, 'outlier', 'doubt', ...}, as in Ripley. In our case we will assume that the action space equals the space of classes, Y = Ω. The relationship between the action space Y and the space of classes Ω is quantified by the loss function L(ω, y), which gives the loss incurred if ω is the true state of nature and action y is taken.

The link between the sample space and the action space is the classifier. The classifier is a non-randomized decision rule δ that specifies, for each observation x ∈ X, what action y ∈ Y to take if X = x is observed. The classifier is thus a map from the sample space X to the action space Y, δ: X → Y, such that

    ω_j = δ(x_j) + ε_j.

A classifier δ can be evaluated by its risk function R(ω, δ), given by

    R(ω, δ) = E_{x|ω} L(ω, δ(x)) = ∫_X L(ω, δ(x)) P(x|ω) dx.

The risk function is the expected loss incurred by using the classifier δ when ω is the true state of nature. It is desirable to choose a classifier that has a small value of the risk function R(ω, δ) for all classes ω. Comparing classifiers on the basis of their respective risk functions can be difficult, since different classifiers might be superior to others on separate subspaces of Ω.

In Bayes decision analysis one attempts to choose the classifier δ that minimizes the risk over the space of classes Ω, weighted by the prior probabilities of the classes. This is called the Bayes risk, B(π, δ), and is given by

    B(π, δ) = Σ_{ω ∈ Ω} R(ω, δ) · π(ω),

where π(ω) is the prior probability of the class ω. Minimizing the Bayes risk is achieved by choosing, for each x, the classifier δ(x) that minimizes the posterior expected loss (usually called the conditional risk)

    E_{ω|x} L(ω, δ(x)) = Σ_ω L(ω, δ(x)) p(ω|x),

where p(ω|x) is the posterior probability of class ω given x. Using the zero-one loss function

    L_{0/1}(ω, y) = 0 if ω = y, 1 otherwise,

the classifier is constructed by choosing the class with the maximum posterior probability (illustrated numerically below).

So, as stated before, in classification the aim is to predict the true class ω for a given observation. In discrimination the aim is to separate the sample space into disjoint regions for the classes {w_1, w_2, ..., w_M}; both, however, put a lot of emphasis on the posterior distribution of the class ω.
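Under zero-one loss, minimizing the conditional risk reduces to picking the class with the largest posterior. A small numeric sketch (my own illustration; the posteriors and the loss matrix are made up, not taken from the thesis) shows both the general rule and this special case:

```python
# Sketch: Bayes decision rule by minimizing the conditional risk.
# Posteriors p(omega|x) and loss matrix L are invented for illustration.
import numpy as np

posterior = np.array([0.10, 0.65, 0.25])   # p(omega|x) for classes 0, 1, 2
L = np.array([[0, 1, 4],                   # L[true, action]: asymmetric losses
              [1, 0, 2],
              [4, 2, 0]])

# Conditional risk of each action y: sum over omega of L(omega, y) p(omega|x)
cond_risk = posterior @ L
print("conditional risks:", cond_risk, "-> chosen action:", cond_risk.argmin())

# With zero-one loss, minimizing the risk is just argmax of the posterior.
L01 = 1 - np.eye(3)
assert (posterior @ L01).argmin() == posterior.argmax()
```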
2.2 Classification Problem and Corporate Rating Problem

Usually an expert committee of a specialized financial agency performs the process of bond rating, but this process is shrouded in mystery because of confidentiality issues. Accordingly, many researchers have tried to formulate alternative approaches to predicting companies' ratings; especially since the rating agencies update their "grades" infrequently, there is considerable value in being able to anticipate rating changes before they are announced.

In this framework, we can regard the process of mapping a company's ratios into a rating as a classical classification problem where the feature space is five-dimensional and contains the following ratios:

• X1: Working Capital over Total Assets
• X2: Retained Earnings over Total Assets
• X3: EBIT over Total Assets
• X4: Capital over Total Assets
• X5: Sales over Total Assets

and the categories are the Standard & Poor's classification, starting with AAA (companies with extremely strong capabilities to meet their financial commitments) and ending with D (already defaulted companies).

So far, three main approaches to constructing classifiers have been analysed, depending on the philosophy behind their construction (a toy comparison of all three follows below):

• A posteriori classifiers, which try to model the posterior probabilities p(c_K|x) directly.
• Probability density classifiers, which try to model the class-conditional probabilities and combine them with the help of Bayes' rule.
• Decision boundary classifiers, which construct the discriminant function, and thus the decision boundary, directly.
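To make the taxonomy concrete, here is a hedged sketch (my own toy example with scikit-learn; the three models are illustrative stand-ins for each family, not the thesis's exact methods) fitting one representative of each approach on the same synthetic two-class data:

```python
# Toy comparison of the three classifier philosophies on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression        # a posteriori
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis  # density + Bayes
from sklearn.svm import LinearSVC                          # decision boundary

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (150, 2)), rng.normal(1, 1, (150, 2))])
y = np.repeat([0, 1], 150)

for model in (LogisticRegression(), QuadraticDiscriminantAnalysis(), LinearSVC()):
    print(type(model).__name__, model.fit(X, y).score(X, y))
```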
2.3 Generalization Performance

The biggest statistical challenge in this setting is to differentiate the models according to their generalization performance, that is, their accuracy on never-seen data. The learning or regression process amounts to finding an estimate δ̂_λ(x; D), given the data set D, from a class of predictors δ_λ indexed by λ, where in general λ ∈ Λ = (S, A, W): S ⊂ X is a chosen subset of the available data inputs, A is a selected architecture within a class of model architectures, and W is the adjustable parameter space.

The prediction risk P(λ) is defined as the expected performance on future data,

    P(λ) = ∫ p(x) (δ(x) − δ̂_λ(x))² dx + σ_ε²,

which is approximated by the expected performance on a finite test set,

    P(λ) ≈ E[ (1/N) Σ_{j=1}^{N} (ω*_j − δ_λ(x*_j))² ],

where (x*_j, ω*_j) are new observations that were not used in constructing the classifier. Throughout this thesis, P(λ) serves as the measure of the generalization ability of a model, and our strategy will be to choose the model/architecture λ that minimizes the prediction risk.

Since we cannot calculate the prediction risk P(λ) directly, we have to estimate it from the available data set D. The standard hold-out test/validation set is not advisable when the data set is not very large; instead we can use a sample reuse method that makes maximally efficient use of the data: cross-validation (CV). Furthermore, this method has the advantage of making minimal assumptions about the statistics of the data.

Let δ_λ^{(j)}(x) be a predictor trained using all observations except (x_j, ω_j), such that δ_λ^{(j)} minimizes

    MSE_j = (1/(N−1)) Σ_{k≠j} (ω_k − δ_λ^{(j)}(x_k))².

Then an estimator for the prediction risk P(λ) is the cross-validation mean squared error

    CV(λ) = (1/N) Σ_{j=1}^{N} (ω_j − δ_λ^{(j)}(x_j))²,

which is known as leave-one-out cross-validation. However, this form of cross-validation is expensive to compute, especially for neural networks, and therefore v-fold cross-validation has been chosen instead, where larger subsets of D are withheld in the training phase. The data set D is divided into v randomly selected disjoint subsets of roughly equal size, such that

    ∪_{j=1}^{v} B_j = D,    B_i ∩ B_j = ∅ for all i ≠ j.

Let δ_λ^{(B_j)}(x) be the estimator trained on all the data except (x, ω) ∈ B_j. The cross-validation error is then defined as

    CV(λ) = (1/v) Σ_{j=1}^{v} (1/card(B_j)) Σ_{(x_k, ω_k) ∈ B_j} (ω_k − δ_λ^{(B_j)}(x_k))².

Typical choices for v are 5 or 10, and we can observe that leave-one-out cross-validation is obtained in the limit v → N; a worked v-fold sketch follows below.

Especially for neural networks, two other useful criteria have been created: Generalized Cross-Validation (GCV) and Akaike's Final Prediction Error (FPE), with the following forms:

    GCV(λ) = MSE(λ) · 1 / (1 − S(λ)/N)²,

    FPE(λ) = MSE(λ) · (1 + S(λ)/N) / (1 − S(λ)/N),

where S(λ) denotes the number of weights of model λ. Note that they are slightly different for small sample sizes, but asymptotically equivalent for large N:

    P(λ) ≈ MSE(λ) (1 + 2 S(λ)/N) ≈ GCV(λ) ≈ FPE(λ).

It has been shown by Moody [14] that FPE is an unbiased estimator of the prediction risk for neural network models, provided that the noise in the observed targets is independent and identically distributed and provided that weight decay is not used.
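The following is a minimal numpy implementation of the v-fold estimator above (my own sketch; the least-squares "model" and the synthetic data are placeholders, and the fold averaging follows the card(B_j) weighting in the formula):

```python
# Sketch: v-fold cross-validation estimate CV(lambda) of the prediction risk.
import numpy as np

def v_fold_cv(X, targets, v=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, v)            # disjoint B_1, ..., B_v
    per_fold_mse = []
    for B in folds:
        train = np.setdiff1d(idx, B)
        # Train on everything except B_j (least-squares fit with intercept).
        A = np.c_[np.ones(len(train)), X[train]]
        coef, *_ = np.linalg.lstsq(A, targets[train], rcond=None)
        pred = np.c_[np.ones(len(B)), X[B]] @ coef
        per_fold_mse.append(np.mean((targets[B] - pred) ** 2))
    return np.mean(per_fold_mse)              # (1/v) sum of per-fold MSEs

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
w = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 0.1 * rng.normal(size=200)
print("CV(lambda) estimate:", v_fold_cv(X, w))
```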
3 Data Set Description

Our final aim is to analyse the data in such a manner that we understand what separates the different categories, i.e. which features have strong discriminating power. To analyse the ability of each model to capture these features, we have used five different sets:

• one randomly generated (Random Set - Rand);
• one with an obvious clustering (Separable Set - SS);
• one with lightly overlapping distributions (Overlapping Set - OS);
• one with heavily overlapping distributions (Hard Overlapping Set - HOS);
• a real data set, a database of 720 companies for which we will try to find the right class membership, i.e. the rating (Real Data - RD).

The first four sets are shown in Fig. 1.

[Fig. 1: Random, Separable, Overlapping, and Hard Overlapping Sets]

The real data set has 720 records, with the following breakdown by category:

Group | Initial no. of observations | Initial prior probability estimate (%) | After excluding outliers | Prior probability estimate (%)
AAA   | 10  | 1.41  | 10  | 1.55
AA    | 72  | 10.14 | 66  | 10.20
A     | 214 | 30.14 | 196 | 30.30
BBB   | 249 | 35.07 | 230 | 35.55
BB    | 95  | 13.38 | 92  | 14.22
B     | 57  | 8.03  | 53  | 8.18
CCC   | 10  | 1.41  | -   | -
D     | -   | 0.42  | -   | -

Table 1: Categories summary, with and without outliers

The main statistics of the features are summarized in Table 2:

          | X1      | X2     | X3      | X4      | X5
Min       | -5.152  | -1.388 | -0.4253 | -0.5301 | 0.00
Max       | 4.217   | 0.9606 | 0.4913  | 3.46455 | 4.68
25% Q     | -0.7275 | 0.0294 | 0.0379  | 0.31552 | 0.512
75% Q     | 0.649   | 0.2681 | 0.1173  | 0.89595 | 1.1675
Mean      | 0.360   | 0.1450 | 0.0812  | 0.67701 | 0.9592
Variance  | 1.679   | 0.0523 | 0.0059  | 0.29792 | 0.4846
Skew      | -       | -1.315 | 0.0473  | 1.58505 | 1.903
Kurtosis  | -       | 8.5697 | 5.7096  | 3.65485 | 4.798

Table 2: Features summary

The above summary and the box plots presented in Fig. 2 clearly show the presence of outliers, and the density plots in the same figure, which show the estimated densities of the feature variables, also indicate some suspicious outlying values. The problem is that the usual outlier tests assume that the outliers distort a normal distribution, and test for normality. Here, instead, outliers have been defined as observations whose normalized features exceed three in absolute value:

    |X_i − μ_i| / σ_i ≥ 3,

where X_i is feature i and μ_i, σ_i are the mean and standard deviation of feature i; a short sketch of this screen is given after Table 3.

Proceeding like this, our data set is reduced to 647 records, and the last rating classes, CCC and D, are removed entirely; the category BBB remains the largest one and, with a prior probability of 35.55%, can be considered the benchmark for clever guessing.

[Fig. 2: Histograms and density plots of the data]

The first step in analysing the data is to look at scatter plots to see whether we can observe any structure in the data, especially with the non-linear regression problem in mind. We can plot both the output variable against the input variables and the feature variables against each other, but the former are the most relevant.

[Fig. 3: Scatter plots of the variables]

Based on these plots we can assume a linear relationship between the variable pairs X1 & X2, X2 & X3 and X2 & X5, which we are going to use in the polynomial regression. The collinearity test included in the FinMetrics module gives a result of 3.966, which clearly shows that the features are essentially independent: a value larger than 10 would raise doubt, and a result greater than 20 would indicate high collinearity. The same conclusion is supported by the covariance matrix, which exhibits very low values.

     | X1        | X2        | X3        | X4        | X5
X1   | 0.012645  | 0.0039532 | 0.0010997 | 0.0026837 | 0.023819
X2   | 0.0039532 | 0.031686  | 0.0041818 | 0.03103   | 0.020852
X3   | 0.0010997 | 0.0041818 | 0.004312  | 0.0067614 | 0.01077
X4   | 0.0026837 | 0.03103   | 0.0067614 | 0.19612   | 0.0070272
X5   | 0.023819  | 0.020852  | 0.01077   | 0.0070272 | 0.2993

Table 3: Covariance matrix
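Here is a short sketch of the 3-sigma screen described above, followed by the standardization used later for training (my own illustration on random data; the thesis data are not reproduced):

```python
# Sketch: drop records where any normalized feature exceeds 3 in absolute
# value, then standardize the survivors to zero mean and unit variance.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(720, 5))
X[:10] *= 8                                   # plant a few gross outliers

z = np.abs(X - X.mean(axis=0)) / X.std(axis=0)
keep = (z < 3).all(axis=1)                    # |x_i - mu_i| / sigma_i < 3 for all i
X_clean = X[keep]
print(f"kept {keep.sum()} of {len(X)} records")

# Standard normalization on the cleaned set (recompute mean/std after filtering).
X_norm = (X_clean - X_clean.mean(axis=0)) / X_clean.std(axis=0)
```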
Nevertheless, when the data are prepared for training the neural networks, a normalization must be performed such that each feature has zero mean and unit variance; this is the standard procedure. A slightly more complicated normalization is "whitening", where the idea is to decorrelate the input variables by rotating the basis vectors and to standardize them as well. The underlying idea is that the covariance matrix

    Σ = (X − 𝟙μᵀ)ᵀ (X − 𝟙μᵀ)

is symmetric and can thus be diagonalized using an orthonormal transformation Q, Σ = Q Λ Qᵀ. If we choose

    Z = (X − 𝟙μᵀ) · Q · Λ^(−1/2),

then our new variables have zero mean, are uncorrelated, and have unit variance (the covariance matrix is the identity matrix). Also, in order to make the patterns lie in the interval [−1, 1], we transform them according to the following relationship:

    pattern* = (2 · pattern − (Max + Min)) / (Max − Min).

Regarding the output variables, we can consider two approaches: a binary one, where the classes are encoded as 0/1 bit vectors and each record is assigned to the category to which its Euclidean distance is minimal, or a continuous one, where we allow the classifier to take any value between 1 and 6. Although the binary approach is the classical one, we will prefer the continuous one, as it yields similar results.

4 Linear Classifiers

4.1 Linear Regression

In the case of linear regression we assume that the classifier has the form δ(x) = wᵀ·x; in other words, we assume that there is a linear process underlying the classification decision,

    δ(x) = w*ᵀ·x + ε,

where the noise process fulfils the Gauss-Markov conditions. Applying least squares to the errors ω − Xw, we obtain

    w = (XᵀX)^(−1) Xᵀ ω.

For our data sets, the results are synthesized in Table 4.

                  | Accuracy | Class Error | CV       | GCV      | FPE      | Free Parameters
Separable Set     | 99.722   | 100         | 0.040302 | 0.049906 | 0.049752 | -
Light Overlapping | 90.972   | 100         | 0.092036 | 0.10045  | 0.10037  | -
Hard Overlapping  | 25.833   | 76.528      | 1.9399   | 2.1653   | 2.1636   | -
Random Set        | 24.861   | 71.944      | 1.8605   | 2.0469   | 2.0453   | -
Real Data Set     | 42.222   | 88.254      | 0.89425  | 1.0327   | 1.0233   | -

Table 4: Linear regression accuracy

4.2 Gradient Descent

Another fix for the singularity problem (which arises when XᵀX is not invertible) is to avoid inverting the matrix XᵀX altogether, by using a gradient descent scheme that iteratively searches for the minimum of the error defined above. The algorithm adapts the parameters in the direction of the negative gradient of the error function, which is also the basic strategy for training neural networks. The algorithm (also known as Adaline) goes as follows; a runnable sketch is given below:

1. Initialize the weights.
2. Repeat until the convergence criterion is met or the total number of iterations is reached:
   • Compute the update Δw(t) = η · Σ_{n=1}^{N} e(n)·x(n), where e(n) = δ(n) − wᵀ(t)·x(n).
   • Compute the new weights w(t+1) = w(t) + Δw(t).
   • Compute the new error E(t+1).

The crucial parameter in this case is the learning rate η. A too-large learning rate means that the algorithm will fail to converge, while a too-small η makes it very slow. It has been shown in the literature that the learning rate should satisfy 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of XᵀX.
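A compact numpy sketch of the batch Adaline update above (my own illustration on synthetic data; the learning rate is chosen inside the stability bound just stated):

```python
# Sketch: Adaline-style batch gradient descent for the linear classifier.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
w_true = np.array([0.4, -0.3, 0.2, 0.1, -0.1])
targets = X @ w_true + 0.05 * rng.normal(size=300)

lam_max = np.linalg.eigvalsh(X.T @ X).max()
eta = 1.0 / lam_max                            # well inside (0, 2/lambda_max)

w = np.zeros(5)
for t in range(500):
    e = targets - X @ w                        # e(n) = delta(n) - w^T x(n)
    delta_w = eta * X.T @ e                    # eta * sum_n e(n) x(n)
    w += delta_w
    if np.linalg.norm(delta_w) < 1e-10:        # convergence criterion
        break

print("recovered weights:", np.round(w, 3))
```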
