Data Mining and Knowledge Discovery Handbook, 2nd Edition (part 73)

Algorithm 1: Optimal Univariate Microaggregation
Data: X: original data set; k: integer
Result: X': protected data set
begin
  1. Let A = (a_1, ..., a_n) be a vector of size n containing all the values of the attribute being protected. Sort the values of A in ascending order so that if i < j then a_i ≤ a_j.
  2. Given A and k, a graph G_{k,n} is defined as follows:
     begin
     3. Define the nodes of G as the elements a_i in A plus one additional node g_0 (this node is later needed to apply the Dijkstra algorithm).
     4. For each node g_i, add to the graph the directed edges (g_i, g_j) for all j such that i + k ≤ j < i + 2k. The edge (g_i, g_j) means that the values (a_i, ..., a_j) might define one of the possible clusters.
     5. The cost of the edge (g_i, g_j) is defined as the within-group sum of squared error for such a cluster, that is, $SSE = \sum_{l=i}^{j} (a_l - \bar{a})^2$, where $\bar{a}$ is the average record of the cluster.
     end
  6. The optimal univariate microaggregation is defined by the shortest path between the nodes g_0 and g_n. This shortest path can be computed using the Dijkstra algorithm.
end

Algorithm 2: Projected Microaggregation
Data: X: original data set; k: integer
Result: X': protected data set
begin
  1. Split the data set X into r sub-data sets {X_i}, 1 ≤ i ≤ r, each one with a_i attributes of the n records, such that $\sum_{i=1}^{r} a_i = A$.
  2. foreach X_i ∈ X do
     3. Apply a projection algorithm to the attributes in X_i, which results in a univariate vector z_i with n components (one for each record).
     4. Sort the components of z_i in increasing order.
     5. Apply to the sorted vector z_i the following variant of the optimal univariate microaggregation method: use the algorithm with the cost of the edge (z_{i,s}, z_{i,t}), with s < t, defined as the within-group sum of squared error of the a_i-dimensional cluster in X_i that contains the original attributes of the records whose projected values are in the set {z_{i,s}, z_{i,s+1}, ..., z_{i,t}}.
     6. For each cluster resulting from the previous step, compute the a_i-dimensional centroid and replace all the records in the cluster by the centroid.
end

Heuristic approaches for sets of attributes can be classified into two categories. One approach consists of projecting the records (using e.g. Principal Components or Z-scores) and then applying optimal microaggregation on the projected dimension (Algorithm 2 describes this approach). The other is to develop ad hoc algorithms. The MDAV (Maximum Distance to Average Vector) algorithm (Domingo and Mateo, 2002) follows this second approach. It is explained in detail in Algorithm 3, as applied to a data set X with n records and A attributes. The implementation of MDAV for categorical data is given in (Domingo-Ferrer and Torra, 2005).

Microaggregation has been used as a way to implement k-anonymity. Section 35.4.4 discusses the relation between microaggregation and k-anonymity.

Algorithm 3: MDAV
Data: X: original data set; k: integer
Result: X': protected data set
begin
  while |X| > k do
     1. Compute the average record $\bar{x}$ of all records in X.
     2. Consider the most distant record x_r to the average record $\bar{x}$.
     3. Form a cluster around x_r. The cluster contains x_r together with the k − 1 closest records to x_r.
     4. Remove these records from the data set X.
     if |X| > k then
        5. Find the most distant record x_s from record x_r.
        6. Form a cluster around x_s. The cluster contains x_s together with the k − 1 closest records to x_s.
        7. Remove these records from the data set X.
  8. Form a cluster with the remaining records.
end
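The clustering loop of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: it assumes numerical records in a NumPy array, Euclidean distance, and that each cluster is finally replaced by its centroid; all function and variable names are ours.

```python
import numpy as np

def mdav(X, k):
    """MDAV-style microaggregation sketch: cluster records in groups of (at least) k
    following the structure of Algorithm 3 and replace every record by its cluster centroid."""
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))                  # indices not yet assigned to a cluster
    clusters = []

    def farthest_from(point, idx):
        d = np.linalg.norm(X[idx] - point, axis=1)
        return idx[int(np.argmax(d))]

    def cluster_around(center, idx):
        d = np.linalg.norm(X[idx] - X[center], axis=1)
        order = np.argsort(d)
        return [idx[i] for i in order[:k]]           # the center plus its k-1 closest records

    while len(remaining) > k:
        centroid = X[remaining].mean(axis=0)         # average record of the remaining data
        xr = farthest_from(centroid, remaining)      # most distant record from the average
        c_r = cluster_around(xr, remaining)
        clusters.append(c_r)
        remaining = [i for i in remaining if i not in c_r]
        if len(remaining) > k:
            xs = farthest_from(X[xr], remaining)     # most distant record from x_r
            c_s = cluster_around(xs, remaining)
            clusters.append(c_s)
            remaining = [i for i in remaining if i not in c_s]
    if remaining:
        clusters.append(remaining)                   # cluster with the remaining records

    Xp = X.copy()
    for c in clusters:
        Xp[c] = X[c].mean(axis=0)                    # replace records by the cluster centroid
    return Xp
```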
Additive noise

This method protects data by adding noise to the original file. That is, $X' = X + \varepsilon$, where $\varepsilon$ is the noise. The simplest approach is to require $\varepsilon$ to be such that $E(\varepsilon) = 0$ and $Var(\varepsilon) = k\,Var(X)$ for a given constant k. Uncorrelated noise addition corresponds to the case that, for variables $V_i$ and $V_j$, the noise is such that $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$. In this case, additive noise preserves means and covariances, but neither variances ($Var(X') = Var(X) + k\,Var(X)$) nor correlation coefficients. Note that

$E(X') = E(X) + E(\varepsilon) = E(X)$
$Cov(X'_i, X'_j) = Cov(X_i, X_j)$ for $i \neq j$
$Var(X') = Var(X) + k\,Var(X) = (1 + k)\,Var(X)$
$\rho_{X'_i, X'_j} = \frac{Cov(X'_i, X'_j)}{\sqrt{Var(X'_i)\,Var(X'_j)}} = \frac{Cov(X_i, X_j)}{(1 + k)\sqrt{Var(X_i)\,Var(X_j)}} = \frac{1}{1 + k}\,\rho_{X_i, X_j}$

Correlated noise addition preserves correlation coefficients and means. In this case, however, neither variances nor covariances are preserved: they are proportional to the variances and covariances of the original data set. In correlated noise addition, $\varepsilon$ follows a normal distribution $N(0, k\Sigma)$, where $\Sigma$ is the covariance matrix of X.

$E(X') = E(X) + E(\varepsilon) = E(X)$
$Cov(X'_i, X'_j) = (1 + k)\,Cov(X_i, X_j)$ for $i \neq j$
$Var(X') = Var(X) + k\,Var(X) = (1 + k)\,Var(X)$
$\rho_{X'_i, X'_j} = \frac{Cov(X'_i, X'_j)}{\sqrt{Var(X'_i)\,Var(X'_j)}} = \frac{(1 + k)\,Cov(X_i, X_j)}{(1 + k)\sqrt{Var(X_i)\,Var(X_j)}} = \rho_{X_i, X_j}$

The first extensive testing of noise addition was due to Spruill (Spruill, 1983). (Brand, 2002) gives an overview of these approaches for noise addition as well as of more sophisticated techniques. (Domingo-Ferrer et al., 2004) also describes some of the existing methods as well as the difficulties of their application in privacy. In addition, there exists a related approach known as multiplicative noise (see e.g. (Kim and Winkler, 2003, Liu et al., 2006) for details).
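The two noise addition schemes can be sketched as follows, assuming numerical data in a NumPy array with records as rows. The function names are ours, and the use of normally distributed noise in the uncorrelated case is our choice (the method only constrains the mean and variance of the noise).

```python
import numpy as np

def uncorrelated_noise(X, k, rng=np.random.default_rng()):
    """X' = X + eps with E(eps) = 0, Var(eps_j) = k Var(X_j), Cov(eps_i, eps_j) = 0 for i != j."""
    X = np.asarray(X, dtype=float)
    eps = rng.normal(0.0, np.sqrt(k) * X.std(axis=0, ddof=1), size=X.shape)
    return X + eps

def correlated_noise(X, k, rng=np.random.default_rng()):
    """X' = X + eps with eps ~ N(0, k Sigma), where Sigma is the covariance matrix of X,
    so means and correlation coefficients are preserved."""
    X = np.asarray(X, dtype=float)
    sigma = np.cov(X, rowvar=False)
    eps = rng.multivariate_normal(np.zeros(X.shape[1]), k * sigma, size=X.shape[0])
    return X + eps
```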
PRAM

PRAM, the Post-RAndomization Method (Gouweleeuw et al., 1998), is a method for categorical data where categories are replaced according to a given probability. Formally, it is based on a Markov matrix on the set of categories. Let $C = \{c_1, \ldots, c_c\}$ be the set of categories; then P is the Markov matrix on C when $P: C \times C \to [0,1]$ is such that $\sum_{c_j \in C} P(c_i, c_j) = 1$. Then, $X'$ is constructed from X by replacing, with probability $P(c_i, c_j)$, each $c_i$ in X by $c_j$.

The application of PRAM requires an adequate definition of the probabilities $P(c_i, c_j)$. (Gouweleeuw et al., 1998) proposes Invariant PRAM. Given $T = (T(c_1), \ldots, T(c_c))$, the vector of frequencies of the categories in C, it consists of defining P such that the frequencies are kept after PRAM, that is, $\sum_{i=1}^{c} T(c_i)\,p_{ij} = T(c_j)$ for all j. Then, assuming without loss of generality that $T(c_k) \leq T(c_i)$ for all i (i.e., $c_k$ is a least frequent category), and given a parameter $\theta$ such that $0 < \theta < 1$, $p_{ij}$ is defined as follows:

$p_{ij} = 1 - \theta\,T(c_k)/T(c_i)$ if $i = j$
$p_{ij} = \theta\,T(c_k)/((c - 1)\,T(c_i))$ if $i \neq j$

Note that a θ equal to zero implies no perturbation, and a θ equal to 1 implies total perturbation. So, θ permits the user to control the degree of distortion suffered by the data set.

(Gross et al., 2004) proposes the computation of the matrix P from a preference matrix $W = \{w_{ij}\}$, where $w_{ij}$ is our degree of preference for replacing category $c_i$ by category $c_j$. Formally, given W, the probabilities P are determined from the following optimization problem:

Minimize $\sum_{i,j} w_{ij}\,p_{ij}$
Subject to $p_{ij} \geq 0$, $\sum_j p_{ij} = 1$, and $\sum_{i=1}^{c} T(c_i)\,p_{ij} = T(c_j)$ for all j.

(Gross et al., 2004) use integers to express preferences: $w_{ij} = 1$ is the most preferred change, $w_{ij} = 2$ the second most preferred change, and so on.
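A minimal sketch of Invariant PRAM as described above: the Markov matrix is built from the category frequencies and the parameter θ, and each value is replaced by a category drawn from the row of its original category. The function names are ours, and the frequencies are assumed to be taken from the file itself.

```python
import numpy as np

def invariant_pram_matrix(freqs, theta):
    """Markov matrix P for Invariant PRAM, given frequencies T(c_1), ..., T(c_c):
    p_ii = 1 - theta*T(c_k)/T(c_i) and p_ij = theta*T(c_k)/((c-1)*T(c_i)) for i != j,
    with c_k a least frequent category, so T @ P == T (frequencies preserved in expectation)."""
    T = np.asarray(freqs, dtype=float)
    c = len(T)
    tk = T.min()
    P = np.full((c, c), theta * tk) / ((c - 1) * T[:, None])
    np.fill_diagonal(P, 1.0 - theta * tk / T)
    return P

def apply_pram(values, categories, P, rng=np.random.default_rng()):
    """Replace each categorical value by a category drawn according to its row of P."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [categories[rng.choice(len(categories), p=P[index[v]])] for v in values]
```

With θ = 0 the matrix is the identity (no perturbation); larger values of θ increase the perturbation while keeping the expected category frequencies.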
Lossy Compression

This approach, first proposed in (Domingo-Ferrer and Torra, 2001a), consists of viewing a numerical data file as a grey-level image: rows are records and columns are attributes. Then, a lossy compression method is applied to the image, obtaining a compressed image. This image is then decompressed, and the decompressed image corresponds to the masked file. Different compression rates lead to files with different degrees of distortion; that is, the more compression, the more distortion. (Domingo-Ferrer and Torra, 2001a) used JPEG, which is based on the DCT, for the compression. (Jimenez and Torra, 2009) uses JPEG 2000, which is based on wavelets.

The transformation of the original data file into an image requires, in general, a quantization. As JPEG 2000 allows a higher dynamic range (65536 levels of grey) than JPEG (only 256), this quantization step is more accurate in JPEG 2000.

35.4.2 Non-perturbative Methods

In this section we review some of the non-perturbative algorithms. We review generalization (also known as recoding), top and bottom coding, and suppression. Sampling (Willenborg, 2001), which is not discussed here, can also be seen as a non-perturbative method.

Generalization and Recoding

This method is mainly applied to categorical attributes. Protection is achieved by combining a few categories into a more general one. Local and global recoding can be distinguished. Global recoding (Willenborg, 2001, LeFevre et al., 2005) corresponds to the case where the same recoding is applied to all the categories in the original data file. Formally, if Π is a partition of the categories in C, then each c in the original data file is replaced by the partition element in Π that contains c. In contrast, in local recoding (Takemura, 2002) different categories might be replaced by different generalizations. Constrained local recoding is when the data space is partitioned and within each region the same recoding is used, but different regions use different recodings.

In general, global recoding has a larger information loss than local recoding (as changes are applied to all records without taking into account whether they need them to ensure privacy). Nevertheless, local recoding generates a data set with a larger diversity in the terms used to describe the records. This situation might cause difficulties when using the protected data set in analysis.

While most of the literature considers recoding as a function of a single variable (i.e., single-dimensional recoding), it is also possible to consider the recoding of several variables at once. This is multidimensional recoding (LeFevre et al., 2005). Formally, when n variables are considered, recoding is understood as a function of n values.

One of the main difficulties in applying recoding is the need for a hierarchy of the categories. In some applications such a hierarchy is assumed to be known, while in others it is constructed by the protection procedure. In this latter case, difficulties appear in determining the optimal generalization structure. In general, finding an optimal generalization for data protection is a hard problem and, besides that, the constructed hierarchy might have little semantic interpretation (e.g., generalizing the Zip codes 08192 and 09195 into 0**9*).

Top and Bottom Coding

Top and bottom coding are two methods that consist of replacing the largest and smallest values (given a threshold) by a generalized category. These two methods can be considered particular cases of global recoding. They are applied in data protection when there are only a few records with extreme values. This kind of generalization permits us to reduce the disclosure risk, as records assigned to the same category are indistinguishable.

Local Suppression

Suppression consists of replacing some values by a special label denoting that the value has been suppressed. Suppression is often applied (Samarati and Sweeney, 1998, Sweeney, 2002, Kisilevich et al., 2010) in combination with recoding, and mainly for categorical data.

35.4.3 Synthetic Data Generators

In recent years, a new research trend has emerged for protecting data. It consists of publishing synthetic data instead of the original one. The rationale is that synthetic data cannot compromise the privacy of the respondents as the data is not "real". Synthetic data generation methods consist of first constructing a model of the data, and then generating artificial data from this model.

The difficulties of the approach are twofold. On the one hand, although the data is synthetic, some reidentification is possible and, therefore, disclosure risk should be studied for this type of data. See e.g. (Torra et al., 2006), which presents the results of a series of reidentification experiments. Due to this, synthetic data do not avoid disclosure risk and still represent a threat to privacy. On the other hand, as the synthetic data is generated from a particular data model built from the original data, all those aspects that are not explicitly included in the model are often not included in the data. Due to this, unforeseen analyses on the protected data might lead to results rather different from those of the same analyses on the original data.

Several methods for synthetic data generation have been developed in the last few years. Data distortion by probability distribution (Liew et al., 1985) is one of the first protection procedures that can be classified as such. This procedure is defined by the following three steps:
• identify an underlying density function and determine its parameters
• generate a distorted series using the estimated density function
• put the distorted series in place of the original one

The procedure was originally defined for univariate density functions, although it can be applied to multivariate density functions.
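The three steps above can be sketched for a single numerical attribute as follows. The choice of a normal density and the function name are ours; the original proposal allows other (and multivariate) density functions.

```python
import numpy as np

def distort_by_distribution(values, rng=np.random.default_rng()):
    """Data distortion by probability distribution (univariate sketch)."""
    x = np.asarray(values, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)              # 1. identify the density and its parameters
    synthetic = rng.normal(mu, sigma, size=x.size)   # 2. generate a distorted series from the model
    return synthetic                                 # 3. released in place of the original series
```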
Information Preserving Statistical Obfuscation (IPSO) (Burridge, 2003) is another family of methods for synthetic data generation. It includes three different methods, named IPSO-A, IPSO-B, and IPSO-C. In this family, IPSO-A is the simplest method and IPSO-C is the most elaborate one. They satisfy the property that the data generated by IPSO-C is more similar to the original data than the data generated by IPSO-A. The larger the complexity, the more the protected data resembles the original data in terms of statistics.

All three methods presume that the variables of the original data file can be divided into two sets of variables X and Y, where X contains the confidential outcome variables and Y the quasi-identifier variables.

IPSO-A is defined as follows. Take X as independent and Y as dependent variables, and then fit a multiple regression model of Y on X. Compute $Y'_A$ with this model. IPSO-A releases the variables X and $Y'_A$ in place of X and Y.

The following can be stated with respect to IPSO-A. In the above setting, conditional on the specific confidential attributes $x_i$, the quasi-identifier attributes $Y_i$ are assumed to follow a multivariate normal distribution with covariance matrix $\Sigma = \{\sigma_{jk}\}$ and mean vector $x_i B$, where B is the matrix of regression coefficients. Let $\hat{B}$ and $\hat{\Sigma}$ be the maximum likelihood estimates of B and Σ derived from the complete data set (y, x). If a user fits a multiple regression model to $(y'_A, x)$, she will get estimates $\hat{B}_A$ and $\hat{\Sigma}_A$ which, in general, are different from the estimates $\hat{B}$ and $\hat{\Sigma}$ obtained when fitting the model to the original data (y, x). The goal of IPSO-B is to modify $y'_A$ into $y'_B$ in such a way that the estimate $\hat{B}_B$ obtained by multiple linear regression from $(y'_B, x)$ satisfies $\hat{B}_B = \hat{B}$. Finally, the goal of IPSO-C is to obtain a matrix $y'_C$ such that, when a multivariate multiple regression model is fitted to $(y'_C, x)$, both sufficient statistics $\hat{B}$ and $\hat{\Sigma}$ obtained on the original data (y, x) are preserved.

(Muralidhar and Sarathy, 2008) is another example of a data generator for synthetic data.
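A rough sketch of the IPSO-A step under the assumptions above: fit the regression of Y on X and release X together with values generated from the fitted model. The least squares fit, the fresh normal residuals, the assumption that Y is an n x q matrix with q >= 2, and all names are our choices for illustration, not Burridge's exact formulation.

```python
import numpy as np

def ipso_a(X, Y, rng=np.random.default_rng()):
    """Fit a multiple regression of Y on X and release (X, Y_a), where Y_a is
    drawn from the fitted model (X B_hat plus noise with the estimated covariance)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)                   # assumed 2-D: one column per Y attribute
    Xd = np.column_stack([np.ones(len(X)), X])       # design matrix with intercept
    B_hat, *_ = np.linalg.lstsq(Xd, Y, rcond=None)   # regression coefficients
    resid = Y - Xd @ B_hat
    sigma_hat = np.cov(resid, rowvar=False)          # estimated residual covariance
    noise = rng.multivariate_normal(np.zeros(Y.shape[1]), sigma_hat, size=len(X))
    return X, Xd @ B_hat + noise
```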
35.4.4 k-Anonymity

k-Anonymity is not a protection procedure in itself, but a condition to be satisfied by the protected data set. Nevertheless, in most cases k-anonymity is achieved via a combination of recoding and suppression. As we will discuss later, it can also be achieved via microaggregation.

Formally, a data set X satisfies k-anonymity (Samarati and Sweeney, 1998, Samarati, 2001, Sweeney, 2002) when X is partitioned into sets of at least k indistinguishable records.

In the literature there exist different approaches for achieving k-anonymity. As stated above, most of them are based on generalization and suppression. Nevertheless, optimal k-anonymity with generalization and suppression is an NP-Hard problem. Due to this, heuristic algorithms have been defined. As k-anonymity presumes no disclosure, only information loss measures are of interest here.

An alternative approach for achieving k-anonymity is the use of microaggregation methods. In this case, data homogeneity is ensured. Microaggregation has to be applied considering all the variables at once; otherwise, k-anonymity would not be guaranteed. (Nin et al., 2008a) measures the real k-anonymity obtained when not all variables are microaggregated at once.

When k-anonymity is satisfied for quasi-identifiers, disclosure might still occur if all the values of a confidential attribute are the same for a particular combination of quasi-identifiers. p-Sensitive k-anonymity was defined to avoid this type of problem.

Definition 1. (Truta and Vinay, 2006) A data set is said to satisfy p-sensitive k-anonymity for k > 1 and p ≤ k if it satisfies k-anonymity and, for each group of records with the same combination of values for the quasi-identifiers, the number of distinct values for each confidential attribute is at least p (within the same group).

To satisfy p-sensitive k-anonymity we require p different values in all sets. This requirement might be difficult to achieve in some databases.
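As an illustration, the following sketch checks whether a released table satisfies k-anonymity on its quasi-identifiers and p-sensitive k-anonymity with respect to one confidential attribute. The representation of records as tuples and all names are ours.

```python
from collections import defaultdict

def is_p_sensitive_k_anonymous(rows, qi_columns, conf_column, k, p):
    """rows: sequence of tuples; qi_columns: indices of the quasi-identifiers;
    conf_column: index of the confidential attribute."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in qi_columns)      # records indistinguishable on the quasi-identifiers
        groups[key].append(row[conf_column])
    return all(len(g) >= k and len(set(g)) >= p for g in groups.values())
```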
l-Diversity (Machanavajjhala et al., 2006) is an alternative extension of k-anonymity. Similarly to p-sensitivity, l-diversity forces l different categories in each set. However, in this case, the categories have to be well represented. Different meanings have been given to what "well represented" means.

t-Closeness (Li et al., 2007) is another alternative to standard k-anonymity. In this case it is required that the distribution of the confidential attribute in any k-anonymous subset of the database be similar to the one of the full database. Similarity is defined in terms of the distance between the two distributions, and such distance should be below a given threshold t. This approach permits the data protector to limit disclosure. Nevertheless, a low threshold forces all the sets to have the same distribution as the full data set. This might cause a large information loss: any correlation between the confidential attributes and the one used for l-diversity might be lost when forcing all the sets to have the same distribution.

35.5 Information Loss Measures

As stated in the introduction, information loss measures quantify to what extent the distortion embedded in the protected data set distorts the results that would be obtained with the original data. These measures are defined for general-purpose protection procedures.

When it is known which type of analysis the user will perform on the data, the analysis of the distortion of a particular protection procedure can be done in detail. That is, measures can be developed, and protection procedures can be compared and ranked using such measures. Specific information loss measures are the indices that permit us to quantify such distortion. Nevertheless, when the type of analysis to be performed is not known, only generic indices can be computed. Generic information loss measures are the indices to be applied in this case. They have been defined to evaluate the utility of the protected data not for a specific application but for any of them. Naturally, as these indices aggregate a few components, it might be the case that a protection procedure with a good average performance behaves badly in a specific application.

Nevertheless, information loss (either specific or generic) should not be considered in isolation; disclosure risk should also be considered. In Section 35.3 we discussed disclosure risk measures, and in Section 35.6 we will describe the tools for taking into account both elements.

In this section we describe some of the existing methods for measuring the utility of the protected data. We start with some generic information loss measures and then point out some specific ones. We focus on methods for numerical data.

35.5.1 Generic Information Loss Measures

First steps in the definition of a family of information loss measures were presented in (Domingo-Ferrer et al., 2001) and (Domingo-Ferrer and Torra, 2001b). There, information loss was defined as the discrepancy between a few matrices obtained on the original data and on the masked one. Covariance matrices, correlation matrices and a few other matrices (as well as the original file X and the masked file X') were used. Mean square error, mean absolute error and mean variation were used to compute the matrix discrepancy. E.g., the mean square error of the difference between the original data and the protected one is defined as $(\sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - x'_{ij})^2)/(np)$ (here p is the number of attributes and n is the number of records).

Nevertheless, these information loss measures are unbounded, and problems appear e.g. when the original values are close to zero. To solve these problems, (Yancey et al., 2002) proposed to replace the mean variation of X and X' by a measure that is more stable when the original values are close to zero.

Trottini (Trottini, 2003) observed that using such information loss measures in combination with disclosure risk measures for comparing masking methods was not well defined, as the information loss measures were unbounded. That is, when an overall score is computed for a method (see (Domingo-Ferrer et al., 2001) and (Yancey et al., 2002)), e.g. as the average of information loss and disclosure risk, both measures should be defined in the same commensurable range. To solve this problem, Trottini proposed to set a predefined maximum value of error. More recently, in (Mateo-Sanz et al., 2005), probabilistic information loss measures were introduced to solve the same problem, avoiding the need for such predefined values.

Definitions in (Mateo-Sanz et al., 2005) start by considering the discrepancy between a population parameter θ on X and a sample statistic Θ on X'. Let $\hat{\Theta}$ be the value of this statistic for a specific sample. Then, the standardized sample discrepancy corresponds to

$Z = \frac{\hat{\Theta} - \theta}{\sqrt{Var(\hat{\Theta})}}$

This discrepancy can be assumed to follow a N(0,1) distribution (see (Mateo-Sanz et al., 2005) for details). Then, we define the probabilistic information loss measure for $\hat{\Theta}$ as follows:

$pil(\hat{\Theta}) := 2 \cdot P\left(0 \leq Z \leq \frac{\hat{\theta} - \theta}{\sqrt{Var(\hat{\Theta})}}\right)$   (35.1)

Following (Mateo-Sanz et al., 2005, Domingo-Ferrer and Torra, 2001b), we consider five measures for analysis. They are based on the following statistics:
• Mean for attribute V (Mean(V)): $\sum_{i=1}^{n} x_{i,V}/n$
• Variance for attribute V (Var(V)): $\sum_{i=1}^{n} (x_{i,V} - Mean(V))^2/n$
• Covariance for attributes V and V' (Cov(V, V')): $\sum_{i=1}^{n} (x_{i,V} - Mean(V))(x_{i,V'} - Mean(V'))/n$
• Correlation coefficient for V and V' (ρ(V, V')): $\frac{\sum_{i=1}^{n} (x_{i,V} - Mean(V))(x_{i,V'} - Mean(V'))}{\sqrt{\sum_{i=1}^{n} (x_{i,V} - Mean(V))^2 \sum_{i=1}^{n} (x_{i,V'} - Mean(V'))^2}}$
• Quantiles for attribute V: that is, the values that divide the distribution in such a way that a given proportion of the observations are below the quantile. (Mateo-Sanz et al., 2005) uses the quantiles from 5% to 95% with increments of 5%.

pil(Mean), pil(Var), pil(Cov), pil(ρ) and pil(Q) are computed using Expression 35.1 above. In fact, for a given data file with several attributes $V_i$, these measures are computed for each $V_i$ (or each pair $V_i$, $V_j$) and the corresponding pil values are then averaged. In the particular case of the quantiles, pil(Q(V)) is first defined as the average of the set of measures $pil(Q_i(V))$ for i = 5% to 95% with increments of 5%. Then, the average of all these can be defined as a generic information loss measure. We denote it by aPil. That is,

aPil := (pil(Mean) + pil(Var) + pil(Cov) + pil(ρ) + pil(Q))/5.

This measure has been used in (Mateo-Sanz et al., 2005) and (Ladra and Torra, 2008) for evaluating and ranking different protection methods.
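As an illustration of Expression 35.1, the following sketch computes pil for the mean of one attribute. The standard error of the mean is estimated from the original data as $s/\sqrt{n}$; this variance estimate, the use of the absolute discrepancy, and the names are ours (see (Mateo-Sanz et al., 2005) for the exact setting).

```python
import numpy as np
from math import erf, sqrt

def pil_mean(original, protected):
    """Probabilistic information loss for the mean: 2 * P(0 <= Z <= |theta_hat - theta| / sd)."""
    x = np.asarray(original, dtype=float)
    xp = np.asarray(protected, dtype=float)
    theta = x.mean()                                 # population parameter on X
    theta_hat = xp.mean()                            # sample statistic on X'
    se = x.std(ddof=1) / np.sqrt(len(x))             # assumed standard error of the statistic
    z = abs(theta_hat - theta) / se
    return erf(z / sqrt(2.0))                        # equals 2 * P(0 <= Z <= z) for Z ~ N(0,1), in [0, 1]
```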
35.5.2 Specific Information Loss Measures

These measures depend on the type of application. In general, they compare the performance of an analysis applied to the original data with that of the same analysis applied to the protected data. When parametric protection procedures are considered, the more distortion is applied to the data, the larger the difference between the analysis on the original data and the analysis on the protected data.

The comparison of the analyses depends on what type of analysis is performed. In some cases, the comparison of the results is not trivial. For example, when the analysis corresponds to the application of a (crisp) clustering algorithm, measures should compare the partition obtained from the original data and the one obtained from the protected data. That is, we need functions to measure the similarity of two partitions. Similarly, in the case of fuzzy clustering, functions to measure the similarity between two fuzzy partitions are required.

In the case that the protected data has to be used for classification, the comparison should be done on the resulting classifiers themselves or on the performance of the classifiers. As an example, (Agrawal and Srikant, 2000) analyses the effects of noise addition on classifiers (decision trees). In general, the analysis of the results will show the decline of the performance of the classifiers with respect to the distortion included in the data. The best protection procedures are those that permit a large distortion (and a low disclosure risk) but with a low decline of the performance of the classifiers (or clustering).
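As one concrete choice of a function measuring the similarity of two crisp partitions (one obtained by clustering the original data, one by clustering the protected data), the following sketch computes the Rand index. The chapter only requires some partition similarity function, so this particular index and the names are our choice.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of record pairs treated consistently by both partitions:
    put together in both clusterings, or separated in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs)
    return agree / len(pairs)

# rand_index([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0: same partition up to relabelling.
```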
35.6 Trade-off and Visualization

In this chapter we have described a few protection methods and a few ways to analyse them. We have seen that the distortion caused by protection methods permits us to reduce the risk of disclosure, but at the same time it causes some information loss. In the last section we described a few measures for this purpose. To compare data protection procedures and to visualize the trade-off between information loss and disclosure risk, a few tools have been developed. We describe some of them below.

35.6.1 The Score

In (Domingo-Ferrer and Torra, 2001b), the average between disclosure risk and information loss was proposed. That is,

$Score(method) = \frac{IL(method) + DR(method)}{2}$

where IL corresponds to the information loss and DR to the disclosure risk. Different measures will, of course, result in different scores. (Domingo-Ferrer and Torra, 2001b) used generic information loss measures and a generic disclosure risk. Disclosure risk was defined as the average of interval disclosure ID (one of the measures described in Section 35.3 for attribute disclosure) and reidentification RD (a measure for individual disclosure). That is, $DR(method) = (ID(method) + RD(method))/2$. Then, with respect to the reidentification measure, as we can presume that the intruder will use the most effective method for disclosure, we define RD as the maximum percentage of reidentification obtained by a set of reidentification algorithms (RDA). That is,

$RD(method) = \max_{rda \in RDA} RD_{rda}(method)$

For illustration, Table 35.1 displays the 25 best scores from the selection of methods included in (Domingo-Ferrer and Torra, 2001b). The best methods are rank swapping (standard implementation) and microaggregation (individual ranking and MDAV). The analysis included other methods such as lossy compression (using JPEG) and noise addition.

Table 35.1. Methods with the best performance according to the score.

method     aPIL    DR           Score
Rank19     34.36   15.55220303  24.95390177
Rank20     34.46   18.856305    26.6562812
Rank17     35.88   17.66314794  26.77372912
Rank15     33.41   20.60728858  27.00680666
Rank18     34.69   20.728565    27.70810968
Rank16     33.25   22.996878    28.12175372
Rank13     31.29   25.26048534  28.27610286
Rank14     34.12   26.5717775   30.34832651
Rank11     29.79   32.09452819  30.94388101
Rank12     31.08   31.800747    31.43809107
Rank09     24.13   40.95472341  32.54038346
Rank10     28.39   41.191185    34.79133316
Rank07     20.95   52.05937146  36.5030496
Micmul03   38.36   35.61935825  36.98824959
Micmul04   42.80   31.30605884  37.05470347
Micmul05   46.47   27.98280127  37.2245081
Micmul08   51.00   23.4813759   37.24028758
Micmul06   48.60   25.94948057  37.27647059
Rank08     22.42   52.32661     37.37232476
Micmul09   52.71   22.06353532  37.38619949
Micmul07   50.55   24.37663265  37.46127816
Micmul10   53.53   21.82407737  37.6762343
Rank05     16.27   64.60353016  40.43547191
Mic4mul03  20.44   60.58282339  40.51198937
Mic4mul04  27.09   54.33719341  40.71608119

In Table 35.1 both information loss and disclosure risk are represented on the scale [0, 100]. Information loss, disclosure risk and the score are defined as in (Nin et al., 2008c). That is, disclosure risk corresponds to the average of interval disclosure and the maximum percentage of reidentification using a few reidentification algorithms (including probabilistic record linkage and distance-based record linkage, using the Euclidean and Mahalanobis distances, and also another distance based on kernel functions). Information loss was computed using aPil. For both measures, 0 is the optimal value (no information loss and no risk) and 100 is the worst value (maximum information loss and 100% of reidentifications).

Note that using the identity function (X' = X) as a protection method, we obtain a score of 50 (full risk and no loss). In the same way, a complete distortion (e.g., completely random data) also has a score equal to 50 (full loss and no risk). So, only scores lower than 50 are of interest. In comparison with (Domingo-Ferrer and Torra, 2001b), where the average was used to combine only two reidentification algorithms (probabilistic record linkage and distance-based record linkage), here the maximum over a larger set of reidentification algorithms is used.
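A minimal sketch of the score just described, assuming that the information loss (e.g. aPil), the interval disclosure measure and the reidentification percentages are already expressed on the [0, 100] scale; the function names and the numbers in the example are ours and purely hypothetical.

```python
def disclosure_risk(interval_disclosure, reidentification_percentages):
    """DR: average of interval disclosure and the maximum reidentification percentage
    over the available reidentification algorithms (the intruder's best attack)."""
    return (interval_disclosure + max(reidentification_percentages)) / 2.0

def score(information_loss, interval_disclosure, reidentification_percentages):
    """Score: average of information loss and disclosure risk, both in [0, 100]."""
    return (information_loss + disclosure_risk(interval_disclosure, reidentification_percentages)) / 2.0

# Hypothetical example: score(30.0, 20.0, [25.0, 40.0, 35.0]) == (30.0 + (20.0 + 40.0) / 2) / 2 == 30.0
```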
