John Wiley & Sons — Data Mining Techniques for Marketing, Sales (Chapter 5 excerpt)

Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Table 5.2  The 95 Percent Confidence Interval Bounds for the Champion Group

RESPONSE   SIZE      SEP       95% CONF   95% CONF * SEP             LOWER   UPPER
4.5%       900,000   0.0219%   1.96       0.0219% * 1.96 = 0.0429%   4.46%   4.54%
4.6%       900,000   0.0221%   1.96       0.0221% * 1.96 = 0.0433%   4.56%   4.64%
4.7%       900,000   0.0223%   1.96       0.0223% * 1.96 = 0.0437%   4.66%   4.74%
4.8%       900,000   0.0225%   1.96       0.0225% * 1.96 = 0.0441%   4.76%   4.84%
4.9%       900,000   0.0228%   1.96       0.0228% * 1.96 = 0.0447%   4.86%   4.94%
5.0%       900,000   0.0230%   1.96       0.0230% * 1.96 = 0.0451%   4.95%   5.05%
5.1%       900,000   0.0232%   1.96       0.0232% * 1.96 = 0.0455%   5.05%   5.15%
5.2%       900,000   0.0234%   1.96       0.0234% * 1.96 = 0.0459%   5.15%   5.25%
5.3%       900,000   0.0236%   1.96       0.0236% * 1.96 = 0.0463%   5.25%   5.35%
5.4%       900,000   0.0238%   1.96       0.0238% * 1.96 = 0.0466%   5.35%   5.45%
5.5%       900,000   0.0240%   1.96       0.0240% * 1.96 = 0.0470%   5.45%   5.55%

Response rates vary from 4.5% to 5.5%. The bounds for the 95% confidence level are calculated using 1.96 standard deviations from the mean.

Based on these possible response rates, it is possible to tell whether the confidence bounds overlap. The 95 percent confidence bounds for the challenger model were from about 4.86 percent to 5.14 percent. These bounds overlap the confidence bounds for the champion model when its response rates are 4.9 percent, 5.0 percent, or 5.1 percent. For instance, the confidence interval for a response rate of 4.9 percent goes from 4.86 percent to 4.94 percent; this does overlap 4.86 percent to 5.14 percent. Using the overlapping bounds method, we would consider these statistically the same.

Comparing Results Using Difference of Proportions

Overlapping bounds is easy to apply, but its results are a bit pessimistic. That is, even though the confidence intervals overlap, we might still be quite confident that the difference is not due to chance at some given level of confidence.
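The SEP calculation and the overlapping-bounds check above can be reproduced with Python's standard library. This is a sketch (the function names are mine, not the book's); `NormalDist().inv_cdf` supplies the 1.96 z-value used throughout the tables:

```python
from statistics import NormalDist

def sep(p, n):
    """Standard error of a proportion p observed in a sample of size n."""
    return (p * (1 - p) / n) ** 0.5

def conf_bounds(p, n, confidence=0.95):
    """Lower and upper bounds of the confidence interval around p."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    half_width = z * sep(p, n)
    return p - half_width, p + half_width

# Champion observed at 5.0% of 900,000; challenger at 5.0% of 100,000.
champ_lo, champ_hi = conf_bounds(0.050, 900_000)
chall_lo, chall_hi = conf_bounds(0.050, 100_000)

# Overlapping bounds method: do the two intervals intersect?
overlap = champ_lo <= chall_hi and chall_lo <= champ_hi
```

For the 5.0 percent rows this reproduces the table's values: an SEP of about 0.0230% for the champion, 0.0689% for the challenger, and challenger bounds of roughly 4.86 to 5.14 percent.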
Another approach is to look at the difference between the response rates, rather than at the rates themselves. Just as there is a formula for the standard error of a proportion, there is a formula for the standard error of a difference of proportions (SEDP):

    SEDP = SQRT( p1 * (1 - p1) / N1  +  p2 * (1 - p2) / N2 )

This formula is a lot like the formula for the standard error of a proportion, except that the part under the square root is repeated for each group. Table 5.3 shows this applied to the champion-challenger problem, with response rates for the champion group varying between 4.5 percent and 5.5 percent.

By the difference of proportions, three response rates on the champion have a confidence under 95 percent (that is, the p-value exceeds 5 percent). If the challenger response rate is 5.0 percent and the champion's is 5.1 percent, then the difference in response rates might be due to chance. However, if the champion has a response rate of 5.2 percent, then the likelihood of the difference being due to chance falls to under 1 percent.

WARNING: Confidence intervals only measure the likelihood that sampling affected the result. There may be many other factors that need to be taken into consideration to determine whether two offers are significantly different. Each group must be selected entirely at random from the whole population for the difference-of-proportions method to work.
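The SEDP formula translates directly into code. This sketch (names are mine, not the book's) computes the z-value and two-sided p-value for one row of the champion-challenger comparison:

```python
from statistics import NormalDist

def sedp(p1, n1, p2, n2):
    """Standard error of the difference of two proportions."""
    return (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5

# Challenger at 5.0% of 100,000 vs. a champion at 5.2% of 900,000.
p1, n1 = 0.050, 100_000
p2, n2 = 0.052, 900_000

z = (p1 - p2) / sedp(p1, n1, p2, n2)       # about -2.75
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided, about 0.6%
```

This matches the 5.2 percent row of Table 5.3: a z-value near -2.7 and a p-value of roughly 0.6 percent, so a difference that large is unlikely to be due to chance.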
Table 5.3  The 95 Percent Confidence Interval Bounds for the Difference between the Champion and Challenger Groups

CHALLENGER            CHAMPION              DIFFERENCE
RESPONSE   SIZE       RESPONSE   SIZE       VALUE    SEDP    Z-VALUE   P-VALUE
5.0%       100,000    4.5%       900,000    0.5%     0.07%   6.9       0.0%
5.0%       100,000    4.6%       900,000    0.4%     0.07%   5.5       0.0%
5.0%       100,000    4.7%       900,000    0.3%     0.07%   4.1       0.0%
5.0%       100,000    4.8%       900,000    0.2%     0.07%   2.8       0.6%
5.0%       100,000    4.9%       900,000    0.1%     0.07%   1.4       16.8%
5.0%       100,000    5.0%       900,000    0.0%     0.07%   0.0       100.0%
5.0%       100,000    5.1%       900,000    –0.1%    0.07%   –1.4      16.9%
5.0%       100,000    5.2%       900,000    –0.2%    0.07%   –2.7      0.6%
5.0%       100,000    5.3%       900,000    –0.3%    0.07%   –4.1      0.0%
5.0%       100,000    5.4%       900,000    –0.4%    0.07%   –5.5      0.0%
5.0%       100,000    5.5%       900,000    –0.5%    0.07%   –6.9      0.0%

Size of Sample

The formulas for the standard error of a proportion and for the standard error of a difference of proportions both include the sample size. There is an inverse relationship between the sample size and the width of the confidence interval: the larger the sample, the narrower the confidence interval. So, if you want to have more confidence in results, it pays to use larger samples.

Table 5.4 shows the confidence interval for different sizes of the challenger group, assuming the challenger response rate is observed to be 5 percent. For very small sizes, the confidence interval is very wide, often too wide to be useful. Earlier, we said that the normal distribution is an approximation for the estimate of the actual response rate; with small sample sizes, the approximation is not a very good one. Statistics has several methods for handling such small sample sizes. However, these are generally not of much interest to data miners, because our samples are much larger.
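The p-value column in Table 5.3 is just the two-sided normal tail probability of each z-value. A sketch (the function name is mine; results differ slightly from the table because the table's z-values are rounded to one decimal place):

```python
from statistics import NormalDist

def two_sided_p(z):
    """Probability of a deviation at least |z| standard errors from the mean."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A few of the z-values from Table 5.3 (champion rates against a 5.0% challenger).
for z in (1.4, 2.7, 4.1):
    print(f"z = {z}: p-value = {two_sided_p(z):.1%}")
```

At z = 1.96 this returns 5 percent, which is exactly the cutoff for the 95 percent confidence level used throughout the chapter.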
Table 5.4  The 95 Percent Confidence Interval for Different Sizes of the Challenger Group

RESPONSE   SIZE        SEP       95% CONF   LOWER   HIGH    WIDTH
5.0%       1,000       0.6892%   1.96       3.65%   6.35%   2.70%
5.0%       5,000       0.3082%   1.96       4.40%   5.60%   1.21%
5.0%       10,000      0.2179%   1.96       4.57%   5.43%   0.85%
5.0%       20,000      0.1541%   1.96       4.70%   5.30%   0.60%
5.0%       40,000      0.1090%   1.96       4.79%   5.21%   0.43%
5.0%       60,000      0.0890%   1.96       4.83%   5.17%   0.35%
5.0%       80,000      0.0771%   1.96       4.85%   5.15%   0.30%
5.0%       100,000     0.0689%   1.96       4.86%   5.14%   0.27%
5.0%       120,000     0.0629%   1.96       4.88%   5.12%   0.25%
5.0%       140,000     0.0582%   1.96       4.89%   5.11%   0.23%
5.0%       160,000     0.0545%   1.96       4.89%   5.11%   0.21%
5.0%       180,000     0.0514%   1.96       4.90%   5.10%   0.20%
5.0%       200,000     0.0487%   1.96       4.90%   5.10%   0.19%
5.0%       500,000     0.0308%   1.96       4.94%   5.06%   0.12%
5.0%       1,000,000   0.0218%   1.96       4.96%   5.04%   0.09%

What the Confidence Interval Really Means

The confidence interval is a measure of only one thing: the statistical dispersion of the result. Assuming that everything else remains the same, it measures the amount of inaccuracy introduced by the process of sampling. It also assumes that the sampling process itself is random, that is, that any of the one million customers could have been offered the challenger offer with equal likelihood. Random means random. The following are examples of what not to do:

■■ Use customers in California for the challenger and everyone else for the champion.
■■ Use the 5 percent lowest-value and 5 percent highest-value customers for the challenger, and everyone else for the champion.
■■ Use the 10 percent most recent customers for the challenger, and everyone else for the champion.
■■ Use the customers with telephone numbers for the telemarketing campaign, and everyone else for the direct mail campaign.

All of these are biased ways of splitting the population into groups. The previous results all assume that there is no such systematic bias.
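The inverse relationship between sample size and interval width is easy to verify. This sketch (function name mine, not the book's) reproduces the width column of Table 5.4 for a few sizes:

```python
def ci_width(p, n, z=1.96):
    """Total width of the confidence interval for a proportion p at sample size n."""
    sep = (p * (1 - p) / n) ** 0.5
    return 2 * z * sep

# A few rows of Table 5.4: a 5% response rate at growing sample sizes.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: width = {ci_width(0.05, n):.2%}")
```

Because the sample size sits under a square root, quadrupling the sample only halves the interval width; going from 1,000 to 1,000,000 customers shrinks the width from 2.70% to 0.09%.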
When there is systematic bias, the formulas for the confidence intervals are not correct. Using the formula for the confidence interval assumes that there is no systematic bias in deciding whether a particular customer receives the champion or the challenger message. For instance, perhaps there was a champion model that predicts the likelihood of customers responding to the champion offer. If this model were used, then the challenger sample would no longer be a random sample; it would consist of the leftover customers from the champion model. This introduces another form of bias.

Or, perhaps the challenger offer is only available to customers in certain markets or with certain products. This introduces other forms of bias. In such a case, these customers should be compared to the set of customers receiving the champion offer under the same constraints.

Another form of bias might come from the method of response. The challenger may only accept responses via telephone, while the champion may accept them by telephone or on the Web. In such a case, the challenger response may be dampened because of the lack of a Web channel. Or, there might need to be special training for the inbound telephone service reps to handle the challenger offer. At certain times, this might mean that wait times are longer, which is yet another form of bias.

The confidence interval is simply a statement about statistics and dispersion. It does not address all the other forms of bias that might affect results, and these forms of bias are often more important to results than sample variation. The next section talks about setting up a test and control experiment in marketing, diving into these issues in more detail.
Size of Test and Control for an Experiment

The champion-challenger model is an example of a two-way test, where a new method (the challenger) is compared to business-as-usual activity (the champion). This section talks about ensuring that the test and control groups are large enough for the purposes at hand. The previous section talked about determining the confidence interval for the sample response rate. Here, we turn this logic inside out. Instead of starting with the size of the groups, let's instead consider sizes from the perspective of test design. This requires several items of information:

■■ Estimated response rate for one of the groups, which we call p
■■ Difference in response rates that we want to consider significant (acuity of the test), which we call d
■■ Confidence level (say 95 percent)

This provides enough information to determine the size of the samples needed for the test and control. For instance, suppose that business as usual has a response rate of 5 percent and we want to measure, with 95 percent confidence, a difference of 0.2 percent. This means that if the response rate of the test group is greater than 5.2 percent, then the experiment can detect the difference with a 95 percent confidence level.

For a problem of this type, the first step is to determine the value of SEDP. That is, if we are willing to accept a difference of 0.2 percent with a confidence of 95 percent, then what is the corresponding standard error? A confidence of 95 percent means that we are 1.96 standard deviations from the mean, so the answer is to divide the difference by 1.96, which yields 0.102 percent. More generally, the process is to convert the confidence level (95 percent) to a z-value (which can be done using the Excel function NORMSINV) and then divide the desired difference by this value. The next step is to plug these values into the formula for SEDP.
For this, let's assume that the test and control groups are the same size, N:

    0.2% / 1.96 = SQRT( p * (1 - p) / N  +  (p + d) * (1 - (p + d)) / N )

Plugging in the values just described (p is 5% and d is 0.2%) results in:

    0.102% = SQRT( (5% * 95% + 5.2% * 94.8%) / N )
    N = 0.0963 / (0.00102)^2 = 92,561

So, having equal-sized groups of 92,561 makes it possible to measure a 0.2 percent difference in response rates with 95 percent confidence. Of course, this does not guarantee that the results will differ by at least 0.2 percent. It merely says that with control and test groups of at least this size, a difference in response rates of 0.2 percent should be measurable and statistically significant.

The size of the test and control groups affects how the results can be interpreted. However, this effect can be determined in advance, before the test. It is worthwhile determining the acuity of the test and control groups before running the test, to be sure that the test can produce useful results.

TIP: Before running a marketing test, determine the acuity of the test by calculating the difference in response rates that can be measured with a high confidence (such as 95 percent).

Multiple Comparisons

The discussion has so far used examples with only one comparison, such as the difference between two presidential candidates or between a test and control group. Often, though, we are running multiple tests at the same time. For instance, we might try out three different challenger messages to determine whether one of them produces better results than the business-as-usual message. Because handling multiple tests does affect the underlying statistics, it is important to understand what happens.

The Confidence Level with Multiple Comparisons

Consider that there are two groups that have been tested, and you are told that the difference between the responses in the two groups is 95 percent certain to be due to factors other than sampling variation.
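The sample-size calculation can be packaged as a small function. This is a sketch (names are mine, not the book's); note that computing without the book's intermediate rounding gives roughly 93,000 rather than the 92,561 quoted in the text:

```python
from statistics import NormalDist

def group_size(p, d, confidence=0.95):
    """Equal test/control group size needed to detect a difference d
    around a base response rate p at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # 1.96 for 95%
    target_sedp = d / z                                   # the SEDP we must reach
    variance_sum = p * (1 - p) + (p + d) * (1 - (p + d))
    return variance_sum / target_sedp ** 2

# Base rate 5%, acuity 0.2%, 95% confidence.
n = group_size(0.05, 0.002)
```

A useful property falls out immediately: doubling the acuity d cuts the required group size by roughly a factor of four, because N scales with 1/d².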
A reasonable conclusion is that there is a difference between the two groups. In a well-designed test, the most likely reason would be the difference in message, offer, or treatment. Occam's Razor says that we should take the simplest explanation and not add anything extra. The simplest hypothesis for the difference in response rates is that the difference is not significant, that the response rates are really approximations of the same number. If the difference is significant, then we need to search for the reason why.

Now consider the same situation, except that you are now told that there were actually 20 groups being tested, and you were shown only one pair. Now you might reach a very different conclusion. If 20 groups are being tested, then you should expect one of them to exceed the 95 percent confidence bound due only to chance, since 95 percent means 19 times out of 20. You can no longer conclude that the difference is due to the testing parameters. Instead, because it is likely that the difference is due to sampling variation, that is now the simplest hypothesis.

The confidence level is based on only one comparison. When there are multiple comparisons, that condition is not true, so the confidence as calculated previously is not quite sufficient.

Bonferroni's Correction

Fortunately, there is a simple correction to fix this problem, developed by the Italian mathematician Carlo Bonferroni. We have been looking at confidence as saying that there is a 95 percent chance that some value is between A and B. Consider the following situation:

■■ X is between A and B with a probability of 95 percent.
■■ Y is between C and D with a probability of 95 percent.

Bonferroni wanted to know the probability that both of these statements are true. Another way to look at it is to determine the probability that one or the other is false. This is easier to calculate.
The probability that the first statement is false is 5 percent, as is the probability of the second being false. The probability that either is false is the sum, 10 percent, minus the probability that both are false at the same time (0.25 percent). So, the probability that both statements are true is about 90 percent.

Looking at this from the p-value perspective says that the p-value of both statements together (10 percent) is approximated by the sum of the p-values of the two statements separately. This is not a coincidence. In fact, it is reasonable to approximate the p-value of any number of statements as the sum of the p-values of each one. If we had eight variables, each with a 95 percent confidence, then we would expect all eight to be in their ranges only 60 percent of the time (because 8 * 5% is a p-value of 40%).

Bonferroni applied this observation in reverse. If there are eight tests and we want an overall 95 percent confidence, then the bound for the p-value of each test needs to be 5% / 8 = 0.625%. That is, each observation needs to be at least 99.375 percent confident. The Bonferroni correction is to divide the desired bound for the p-value by the number of comparisons being made, in order to get a confidence of 1 – p for all comparisons.

Chi-Square Test

The difference of proportions method is a very powerful method for estimating the effectiveness of campaigns and for other similar situations. However, there is another statistical test that can be used. This test, the chi-square test, is designed specifically for the situation where there are multiple tests and at least two discrete outcomes (such as response and non-response).

The appeal of the chi-square test is that it readily adapts to multiple test groups and multiple outcomes, so long as the different groups are distinct from each other. This, in fact, is about the only important rule when using this test.
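The Bonferroni correction itself is a one-line computation. A sketch (function name mine, not the book's):

```python
def bonferroni_threshold(overall_p, num_tests):
    """Per-test p-value bound so that all tests together keep an
    overall p-value bound of overall_p (by the union bound)."""
    return overall_p / num_tests

# Eight tests at an overall 95% confidence: each individual test must
# clear a p-value of 0.625%, i.e. be at least 99.375% confident.
per_test = bonferroni_threshold(0.05, 8)
print(f"per-test bound: {per_test:.3%}")  # prints "per-test bound: 0.625%"
```

The correction is conservative: it only uses the sum of the p-values, ignoring the (small) probability that several statements fail at once, so the true overall confidence is slightly better than the target.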
As described in the next chapter on decision trees, the chi-square test is the basis for one of the earliest forms of decision trees.

Expected Values

The place to start with chi-square is to lay the data out in a table, as in Table 5.5. This is a simple 2 × 2 table, which represents a test group and a control group in a test that has two outcomes, say response and non-response. The table also shows the total values for each column and row; that is, the total number of responders and non-responders (each column) and the total number in the test and control groups (each row). The response column is added for reference; it is not part of the calculation.

What if the data were broken up between these groups in a completely unbiased way? That is, what if there really were no differences between the columns and rows in the table? This is a completely reasonable question. We can calculate the expected value in each cell under the assumption that the response rate is the same in both groups, given that the row and column totals stay the same as in the original data.

One way of calculating the expected values is to compute the proportion of the overall total that falls in each column, as shown in Table 5.6:

■■ The proportion of everyone who responds
■■ The proportion of everyone who does not respond

These proportions are then multiplied by the count for each row to obtain the expected value in each cell. This method for calculating the expected value also works when the tabular data has more columns or more rows.
Table 5.5  The Champion-Challenger Data Laid Out for the Chi-Square Test

             RESPONDERS   NON-RESPONDERS   TOTAL       RESPONSE
Champion     43,200       856,800          900,000     4.80%
Challenger   5,000        95,000           100,000     5.00%
TOTAL        48,200       951,800          1,000,000   4.82%

Table 5.6  Calculating the Expected Values and Deviations from Expected for the Data in Table 5.5

             ACTUAL RESPONSE                EXPECTED RESPONSE   DEVIATION
             YES      NO        TOTAL       YES      NO         YES    NO
Champion     43,200   856,800   900,000     43,380   856,620    –180   180
Challenger   5,000    95,000    100,000     4,820    95,180     180    –180
TOTAL        48,200   951,800   1,000,000   48,200   951,800

OVERALL PROPORTION: 4.82% yes, 95.18% no

The expected value is quite interesting, because it shows how the data would break up if there were no other effects. Notice that the expected value is measured in the same units as each cell, typically a customer count, so it actually has a meaning. Also, the sum of the expected values is the same as the sum of all the cells in the original table.

The table also includes the deviation, which is the difference between the observed value and the expected value. In this case, the deviations all have the same magnitude, but with different signs. This is because the original data has two rows and two columns. Later in the chapter there are examples using larger tables where the deviations differ. However, the deviations in each row and each column always cancel out, so the sum of the deviations in each row is always 0.

Chi-Square Value

The deviation is a good tool for looking at values. However, it does not provide information as to whether the deviation is expected or not expected. Doing this requires some more tools from statistics, namely, the chi-square distribution, developed by the English statistician Karl Pearson in 1900.
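The expected values and deviations in Table 5.6 follow from the row and column totals alone. A sketch (mine, not the book's code), using the proportion-times-row-count method described above:

```python
def expected_values(table):
    """Expected cell counts for a contingency table (a list of rows),
    assuming the outcome proportions are the same in every row."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # Column proportion of the grand total, scaled by each row's size.
    return [[r * c / grand for c in col_totals] for r in row_totals]

observed = [[43_200, 856_800],   # champion: responders, non-responders
            [5_000,  95_000]]    # challenger
expected = expected_values(observed)
deviations = [[o - e for o, e in zip(orow, erow)]
              for orow, erow in zip(observed, expected)]
```

Running this reproduces Table 5.6: expected counts of 43,380 and 4,820 responders, and deviations of ±180 that cancel across every row and column.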
The chi-square value for each cell is simply the calculation:

    Chi-square(x) = (x - expected(x))^2 / expected(x)

The chi-square value for the entire table is the sum of the chi-square values of all the cells in the table. Notice that the chi-square value is always 0 or positive. Also, when the values in the table match the expected values, the overall chi-square is 0. This is the best that we can do. As the deviations from the expected value get larger in magnitude, the chi-square value also gets larger.

Unfortunately, chi-square values do not follow a normal distribution. This is actually obvious, because the chi-square value is always positive, and the normal distribution is symmetric. The good news is that chi-square values follow another distribution, which is also well understood. However, the chi-square [...] bound). Although this is not an issue for large data mining problems, it can be an issue when analyzing results from a small test.

The process for using the chi-square test is:

■■ Calculate the expected values.
■■ Calculate the deviations from expected.
■■ Calculate the chi-square (square the deviations and divide by the expected).
■■ Sum for an overall chi-square value for the table.
■■ Calculate the probability [...]

Data Mining and Statistics

Many of the data mining techniques discussed in the next eight chapters were invented by statisticians or have now been integrated into statistical software; they are extensions of standard statistics. Although data miners and statisticians use similar techniques to solve similar problems, the data mining approach differs from the standard statistical approach in several areas:

■■ Data miners tend to ignore measurement error in raw data.
■■ Data miners assume that there is more than enough data and processing power.
■■ Data mining assumes dependency on time everywhere.
■■ It can be hard to design experiments in the business world.
■■ Data is truncated and censored.

These are differences of approach, rather than [...]

[...] picked up by data mining algorithms. One major difference between business data and scientific data is that the latter has many continuous values and the former has many discrete values. Even monetary amounts are discrete: two values can differ only by multiples of pennies (or some similar amount), even though the values might be represented by real numbers.

There Is a Lot of Data

Traditionally, [...] sample, for instance, might have an equal number of churners and non-churners, although the original data had different proportions. However, it is generally better to use more data rather than sample down and use less, unless there is a good reason for sampling down.

Time Dependency Pops Up Everywhere

Almost all data used in data mining has a time dependency associated with it. Customers' reactions to marketing [...] another. Data mining, on the other hand, must often consider the time component of the data.

Experimentation is Hard

Data mining has to work within the constraints of existing business practices. This can make it difficult to set up experiments, for several reasons:

■■ Businesses may not be willing to invest in efforts that reduce short-term gain for [...]

[...] hitches when planning a test; sometimes these hitches make it impossible to read the results.

Data Is Censored and Truncated

The data used for data mining is often incomplete, in one of two special ways. Censored values are incomplete because whatever is being measured is not complete. One example is customer tenures: for active customers, we know the tenure is greater than the current tenure; however, we [...] equal to the observed sales, another example of censored data.

Truncated data poses another problem in terms of biasing samples. Truncated data is not included in databases, often because it is too old. For instance, when Company A purchases Company B, their systems are merged. Often, the active customers from Company B are moved into the data warehouse for Company A. That is, all customers active on a given [...]

[...] depends on the normal distribution. Business problems often require analyzing data expressed as proportions. Fortunately, these behave similarly to normal distributions. The formula for the standard error for proportions (SEP) makes it possible to define a confidence interval on a proportion such as a response rate. The standard error for the difference of proportions (SEDP) makes it possible to determine [...]

[...] Data miners generally have lots and lots of data with few measurement errors. This data changes over time, and values are sometimes incomplete. The data miner has to be particularly suspicious about bias introduced into the data by business processes. The next eight chapters dive into more detail on more modern techniques for building models and understanding data. Many of these techniques [...]
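The chi-square process listed above, applied to the data of Table 5.5, can be computed with the standard library alone. This sketch (mine, not the book's code) uses the fact that for a 2 × 2 table, with one degree of freedom, the chi-square tail probability reduces to erfc(sqrt(x / 2)):

```python
import math

def chi_square(observed, expected):
    """Chi-square statistic: sum over cells of (obs - exp)^2 / exp."""
    return sum((o - e) ** 2 / e
               for orow, erow in zip(observed, expected)
               for o, e in zip(orow, erow))

# Observed and expected counts from Tables 5.5 and 5.6.
observed = [[43_200, 856_800], [5_000, 95_000]]
expected = [[43_380, 856_620], [4_820, 95_180]]

stat = chi_square(observed, expected)          # about 7.85

# One degree of freedom: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))       # about 0.5%
```

The statistic of about 7.85 is the square of a z-value of about 2.8, consistent with the difference-of-proportions results for the same 0.2 percent gap in Table 5.3: the two tests agree that the champion-challenger difference is very unlikely to be due to chance.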
