CHAPTER 1: DATA CLEANSING
Checking a variable for missing data in Excel
We can use the COUNTBLANK function to count the number of missing values in a variable (column) in Excel.
We see that the variable Miles has 1 missing value.
To find where the missing value is, we sort the Miles column from lowest to highest. Missing values always end up at the bottom of the table.
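The same check can be sketched outside Excel. The snippet below is a minimal illustration in Python, assuming a made-up Miles column in which None stands for a blank cell; the values are hypothetical and not taken from the data file.

```python
# Hypothetical Miles column; None marks a blank (missing) cell.
miles = [20600, 34500, None, 9700, 41200]

# Counterpart of COUNTBLANK: count the blank cells.
missing_count = sum(1 for value in miles if value is None)

# Ascending sort that pushes blanks to the end, mirroring Excel's sort.
sorted_miles = sorted(miles, key=lambda v: (v is None, v if v is not None else 0))

print(missing_count)
print(sorted_miles)
```

Sorting on the pair (is blank, value) is just one way to mimic how blanks end up in the last rows.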
Identifying outliers and erroneous values in the data set
We can identify outliers in the data by using the mean and standard deviation of the quantitative variables.
In the example data shown in the figure above, we enter the formula =AVERAGE(C2:C457) in cell H3.
Then we enter the formula =STDEV.S(C2:C457) in cell H4.
We do the same for the remaining cells.
The results show reasonable means and standard deviations for all of these variables; these statistics alone reveal no obvious outliers.
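As a rough illustration of using the mean and standard deviation to screen for outliers, the sketch below flags values that lie more than three sample standard deviations from the mean. The three-standard-deviation cutoff and the data values are assumptions made for this example; the text does not prescribe a specific rule.

```python
import statistics

# Hypothetical Life of Tire (Months) values; 601.0 mimics a data-entry error.
life_of_tire = [58.4, 17.3, 16.5, 60.1, 22.9, 601.0,
                43.0, 25.7, 31.2, 48.6, 12.4, 39.8]

mean = statistics.mean(life_of_tire)     # counterpart of =AVERAGE(range)
stdev = statistics.stdev(life_of_tire)   # counterpart of =STDEV.S(range)

# Assumed rule of thumb: a value more than 3 standard deviations
# away from the mean is treated as a suspect.
suspects = [x for x in life_of_tire if abs(x - mean) / stdev > 3]

print(round(mean, 2), round(stdev, 2))
print(suspects)
```

Note that a single extreme value also inflates the mean and standard deviation themselves, which is one reason the minimum and maximum are worth checking as well.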
We can also compute the minimum and maximum values for each variable.
Enter the formula in cell H5
Enter the formula in cell H6
=MAX(C2:C457)
We see that the minimum and maximum values of the variable Life of Tire (Months) are 1.8 months and 601.0 months. This maximum (about 50 years) is implausible for Life of Tire (Months). To identify which automobile has this outlier, we sort the entire data set by Life of Tire (Months) and scroll to the last rows of the data.
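The minimum/maximum check and the sort can be sketched as follows. The records (automobile ID, tire position, life in months) are hypothetical; only the values 1.8, 60.1, and 601.0 echo numbers mentioned in the text.

```python
# Hypothetical records: (automobile ID, tire position, Life of Tire in months).
tires = [
    (101, "LF", 58.4),
    (101, "RF", 60.1),
    (102, "LR", 601.0),   # mimics a misplaced decimal point
    (102, "RR", 60.1),
    (103, "LF", 1.8),
]

life = [row[2] for row in tires]
lowest, highest = min(life), max(life)   # counterparts of =MIN(range) and =MAX(range)

# Sort the whole table by Life of Tire; the largest values land in the last rows.
by_life = sorted(tires, key=lambda row: row[2])
suspect_row = by_life[-1]

print(lowest, highest)
print(suspect_row)
```

Comparing the suspect row with the other tires from the same automobile is what suggests a misplaced decimal rather than a genuine value.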
We see in Figure 2.34 that the observation with Life of Tire (Months) value of
Tire (Months) for the other three tires from this automobile is 60.1. This suggests that the decimal for Life of Tire (Months) for this automobile's left rear tire value
rear tire value is also misplaced. Both of these erroneous entries can now be corrected. By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the minimum and maximum values
are in error and if so, what might be the correct value.
Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find. However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge to look for erroneous values. Here we will consider the variables Tread Depth and Miles; because more miles driven should lead to less tread depth on an automobile tire, we expect these two variables to have a negative relationship. A scatter chart will
whether any of the tires in the data set have values for Tread Depth and Miles that are counter to this expectation.
The red ellipse in Figure 2.35 shows the region in which the points representing
Closer examination of outliers and potential erroneous values may reveal an error
technique. Dimension reduction is the process of removing variables from the analysis without losing crucial information. One simple method for reducing the number of variables is to examine pairwise correlations to detect variables or
may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.
A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. The treatment of categorical variables is particularly important. Typically, it is best to encode categorical variables with 0–1
different categories results in a large number of variables. In these cases, the use of PivotTables is helpful in identifying categories that are similar and can possibly be
reduce the number of 0–1 dummy variables. For example, some categorical
code, product model number) may have many possible categories such that, for the purpose of model building, there is no substantive difference between multiple
therefore the number of categories may be reduced by combining categories.
Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship. Financial data supplying information on stock
may be as useful as the derived variable representing the price/earnings (PE) ratio.
Building a frequency distribution for a qualitative variable in Excel
Suppose we have the following data table.
To build a frequency distribution for each type of drink, we use the COUNTIF function as follows.
The result is as follows.
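For readers who want to check a COUNTIF-style tally outside Excel, here is a minimal sketch in Python. The drink names and purchases are placeholders, not the data from the worksheet.

```python
from collections import Counter

# Hypothetical purchases; the drink names are placeholders.
drinks = ["Cola", "Tea", "Cola", "Juice", "Tea",
          "Cola", "Water", "Juice", "Cola", "Tea"]

# One count per category, like one COUNTIF formula per drink.
counts = Counter(drinks)

for drink in sorted(counts):
    relative = counts[drink] / len(drinks)
    print(drink, counts[drink], relative)
```

The frequencies should always add up to the number of observations, which is a quick sanity check on the table.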
Building a frequency distribution for a quantitative variable in Excel
To build a frequency distribution for a quantitative variable, we use the FREQUENCY function.
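The binning logic behind FREQUENCY can be sketched as follows, assuming the bins are given as upper limits and that a value equal to a limit falls into that bin. The data and bin limits are hypothetical; the final count covers values above the last limit.

```python
import bisect

# Hypothetical data (20 values) and bin upper limits.
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
bin_upper_limits = [14, 19, 24, 29, 34]

# One counter per bin plus one overflow counter for values above the last limit.
frequencies = [0] * (len(bin_upper_limits) + 1)
for value in data:
    # Index of the first bin whose upper limit is >= value.
    index = bisect.bisect_left(bin_upper_limits, value)
    frequencies[index] += 1

print(frequencies)
```

Each value lands in the first bin whose upper limit is at least as large as the value, so the bins cover the intervals up to 14, 15 to 19, 20 to 24, and so on for whole-number data.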
Drawing a histogram for a quantitative variable
To draw a histogram, we need the Data Analysis ToolPak.
Step 1 Click the Data tab in the Ribbon
Step 2 Click Data Analysis in the Analyze group
Step 3 When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
In the Input Range: box, enter the data range; in this example we enter A2:A21
In the Bin Range: box, enter the upper limit of each bin; here we enter C2:C6
Under Output Options:, select New Worksheet Ply:
Select the check box for Chart Output (see Figure 2.13)
Click OK
The result is a new worksheet containing the histogram.
Drawing a box plot for a quantitative variable
The step-by-step directions below illustrate how to create boxplots in Excel for
single variable and multiple variables. First we will create a boxplot for a single variable.
Step 1 Select cells B1:B13
Step 2 Click the Insert tab on the Ribbon
Click the Insert Statistic Chart button in the Charts group
Choose the Box and Whisker chart from the drop-down menu
The resulting boxplot created in Excel is shown in Figure 2.24. Comparing this figure to Figure 2.22, we see that all the important elements of a boxplot are generated here. Excel orients the boxplot vertically, and by default it also includes a marker for the mean.
Next we will use the HomeSalesComparison file to create boxplots in Excel for
variables similar to what is shown in Figure 2.26.
Step 1 Select cells B1:F11
Step 2 Click the Insert tab on the Ribbon
Click the Insert Statistic Chart button in the Charts group
Choose the Box and Whisker chart from the drop-down menu
The boxplot created in Excel is shown in Figure 2.25. Excel again orients the
CHAPTER 2: DESCRIPTIVE STATISTICS
Computing measures of central tendency for a quantitative variable
We can compute the mean, median, and mode of a quantitative variable by using the AVERAGE, MEDIAN, and MODE functions.
In the table above, our data series has two modes, 138 and 25400000, so we must use the MODE.MULTI function (MULTI stands for multimodal). If the data series has only one mode, we use the MODE.SNGL function.
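The distinction between a single mode and several modes can be illustrated with Python's statistics module. The data below are hypothetical and chosen so that two values tie for the highest frequency.

```python
import statistics

# Hypothetical data with two modes (4 and 7 each occur twice).
values = [2, 4, 4, 5, 7, 7, 9]

mean = statistics.mean(values)         # counterpart of AVERAGE
median = statistics.median(values)     # counterpart of MEDIAN
modes = statistics.multimode(values)   # all modes, similar in spirit to MODE.MULTI

print(mean, median, modes)
```

When the list of modes has more than one entry, a single-mode function would report only one of them, which is why the multimodal variant is the safer choice for data like this.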
We can compute the geometric mean with Excel.
The geometric mean is often used in analyzing growth rates in financial data. In these types of situations, the arithmetic mean or average value will provide misleading results.
To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past 10
We refer to 0.779 as the growth factor for year 1 in Table 2.10. We can compute the balance at the end of year 1 by multiplying the value invested in the fund at the
year 1 by the growth factor for year 1: $100(0.779) = $77.90.
The balance in the fund at the end of year 1, $77.90, now becomes the beginning balance in year 2. So, with a percentage annual return for year 2 of 28.7%, the
In other words, the balance at the end of year 2 is just the initial investment at the beginning of year 1 times the product of the first two growth factors. This result can be generalized to show that the balance at the end of year 10 is the initial investment times the
compute the balance at the end of year 10 for any amount of money invested at the beginning of year 1 by multiplying the value of the initial investment by 1.335. For
What was the mean percentage annual return or mean rate of growth for this investment over the 10-year period? The geometric mean of the 10 growth factors
answer this question. Because the product of the 10 growth factors is 1.335, the geometric
The geometric mean tells us that annual returns grew at an average annual rate of (1.029 - 1)100, or 2.9%. In other words, with an average annual growth rate of 2.9%, a $100 investment in the fund at the beginning of year 1 would grow to $100(1.029)^10 = $133.09 at the end of 10 years. We can use Excel to calculate the geometric mean for the data in Table 2.10 by using the function GEOMEAN. In Figure 2.17, the value for the geometric mean in cell C13 is found using the formula
=GEOMEAN(C2:C11)
It is important to understand that the arithmetic mean of the percentage annual returns does not provide the mean annual growth rate for this investment. The sum of the 10 percentage annual returns in Table 2.10 is 50.4. Thus, the arithmetic mean
mean is appropriate only for an additive process. For a multiplicative process, such as applications involving growth rates, the geometric mean is the appropriate
applications include changes in the populations of species, crop yields, pollution
death rates. The geometric mean can also be applied to changes that occur over any number of successive periods of any length. In addition to annual changes, the
often applied to find the mean rate of change over quarters, months, weeks, and even days.
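The arithmetic behind the geometric mean can be checked with a short sketch. It uses only figures quoted in the text (the product of the 10 growth factors, 1.335, and the first two growth factors, 0.779 and 1.287); the full Table 2.10 is not reproduced here, so the variable names are illustrative.

```python
import statistics

# Figures quoted in the text.
product_of_growth_factors = 1.335
n_years = 10

# Geometric mean = n-th root of the product of the n growth factors.
geometric_mean = product_of_growth_factors ** (1 / n_years)
mean_growth_rate_pct = (geometric_mean - 1) * 100

# Balance of a $100 investment after 10 years at the rounded rate of 2.9%.
balance = 100 * 1.029 ** n_years

# The same idea on the first two growth factors (year 1: 0.779, year 2: 1.287).
first_two = [0.779, 1.287]
gm_first_two = statistics.geometric_mean(first_two)

print(round(geometric_mean, 3), round(mean_growth_rate_pct, 1), round(balance, 2))
print(gm_first_two)
```

Squaring the two-year geometric mean recovers the product of the two growth factors, which is the defining property of the measure for a multiplicative process.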
Computing measures of dispersion for a quantitative variable
We can use the following formulas to compute measures of dispersion for a quantitative variable.
Note that to compute the standard deviation and the variance we use the functions ending in .S (as in STDEV.S), where S means the statistic is computed for a sample rather than for the entire population.
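The difference between the sample and population versions can be seen in a small sketch. The data are hypothetical; the sample functions divide by n - 1 while the population functions divide by n.

```python
import statistics

# Hypothetical sample of five observations.
sample = [10.0, 12.0, 14.0, 16.0, 18.0]

sample_variance = statistics.variance(sample)        # divides by n - 1, like VAR.S
sample_stdev = statistics.stdev(sample)              # like STDEV.S
population_variance = statistics.pvariance(sample)   # divides by n, like VAR.P

print(sample_variance, sample_stdev, population_variance)
```

For the same data the sample variance is always at least as large as the population variance, and the gap shrinks as n grows.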
Variance is a measure of variability in the values of a random variable. It is a weighted average of the squared deviations of a random variable from its mean where the
referred to as the variance. The notations Var(x) and σ² are both used to denote the variance of a random variable.
The calculation of the variance of the number of payments made per year by a mortgage customer is summarized in Table 4.12. We see that the variance is 42.360. The
deviation, σ, is defined as the positive square root of the variance. Thus, the standard deviation for the number of payments made per year by a mortgage
The Excel function SUMPRODUCT can be used to easily calculate equation
a custom discrete random variable. We illustrate the use of the SUMPRODUCT
We can also use Excel to find the variance directly from the data when the values
occur with relative frequencies that correspond to the probability distribution of the random variable. Cell F305 in Figure 4.12 shows that we use the Excel formula =VAR.P(F2:F301) to calculate the variance from the complete data. This formula
which is the same as that calculated in Table 4.12 and Figure 4.13. Similarly, we
formula =STDEV.P(F2:F301) to calculate the standard deviation of 6.508.
As with the AVERAGE function and expected value, we cannot use the Excel functions VAR.P and STDEV.P directly on the x values to calculate the variance and standard deviation of a custom discrete random variable if the x values are not
Instead we must either use the formula from equation (4.14) or use the Excel
the entire data set as shown in Figure 4.12.
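The probability-weighted calculation that SUMPRODUCT performs can be sketched as below. The distribution is hypothetical (it is not the mortgage-payment distribution of Table 4.12), and the last line shows why applying a population variance function directly to the distinct x values gives a different, incorrect answer when the values are not equally likely.

```python
import math
import statistics

# Hypothetical discrete random variable: values and their probabilities.
x_values = [0, 1, 2]
probabilities = [0.2, 0.5, 0.3]

# Expected value: sum of x * f(x), the SUMPRODUCT of the two columns.
expected_value = sum(x * p for x, p in zip(x_values, probabilities))

# Variance: probability-weighted average of squared deviations from the mean.
variance = sum(p * (x - expected_value) ** 2
               for x, p in zip(x_values, probabilities))
std_dev = math.sqrt(variance)

# Treating the distinct x values as equally likely data ignores the probabilities.
naive_variance = statistics.pvariance(x_values)

print(expected_value, variance, std_dev, naive_variance)
```

The naive result coincides with the weighted one only when every x value has the same probability.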
variables, each with different standard deviations and different means.
Computing measures of distribution for a quantitative variable
A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set's standard deviation. Suppose we have a sample of n values denoted by x1, x2, ..., xn. In addition, assume that the sample mean, x̄, and the sample standard deviation, s, are already computed. Associated with each
value called its z-score. Equation (2.8) shows how the z-score is computed for each xi:
zi = (xi - x̄) / s
The z-score is often called the standardized value. The z-score, zi, can be
that the value of the observation is equal to the mean.
The z-scores for the class size data are computed in Table 2.13. Recall the
deviations below the mean.
The z-score can be calculated in Excel using the function STANDARDIZE. Figure
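A z-score calculation can be sketched in a few lines. The five values below are illustrative and should not be read as the contents of Table 2.13; the helper name standardize is chosen here only to echo the Excel function.

```python
import statistics

# Illustrative sample of five values.
values = [46.0, 54.0, 42.0, 46.0, 32.0]

x_bar = statistics.mean(values)   # sample mean
s = statistics.stdev(values)      # sample standard deviation

def standardize(x, mean, standard_dev):
    """Return the z-score of x: its distance from the mean in standard deviations."""
    return (x - mean) / standard_dev

z_scores = [standardize(x, x_bar, s) for x in values]

print(x_bar, s)
print(z_scores)
```

A z-score of 0 means the value equals the mean; a negative z-score means the value lies below the mean.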
Computing the covariance between two quantitative variables
Covariance is a descriptive measure of the linear association between two
the deviation of each xi from its sample mean (xi - x̄) by the deviation of the corresponding yi from its sample mean (yi - ȳ); this sum is then divided by n - 1.
To measure the strength of the linear relationship between the high temperature x
that for our calculations, x̄ = 84.6 and ȳ = 26.3.
The covariance calculated in Table 2.15 is sxy = 12.8.
than 0, it indicates a positive relationship between the high temperature and sales
water. This verifies the relationship we saw in the scatter chart in Figure 2.26 that
high temperature for a day increases, sales of bottled water generally increase. The sample covariance can also be calculated in Excel using the COVARIANCE.S function. Figure 2.27 shows the data from Table 2.14 entered into an Excel
covariance is calculated in cell B17 using the formula =COVARIANCE.S(A2:A15,B2:B15). A2:A15 defines the range for the x variable (high temperature), and
range for the y variable (sales of bottled water).
For the bottled water, the covariance is positive, indicating that higher
are associated with higher sales (y). If the covariance is near 0, then the x and y
are not linearly related. If the covariance is less than 0, then the x and y variables are negatively related, which means that as x increases, y generally decreases.
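The sample covariance described above (products of paired deviations, summed and divided by n - 1) can be sketched as follows. The temperature and sales figures are hypothetical and are not the data of Table 2.14.

```python
# Hypothetical paired observations: daily high temperature and bottled water sales.
high_temp = [80, 82, 84, 86, 88]
sales = [20, 24, 25, 27, 29]

n = len(high_temp)
x_bar = sum(high_temp) / n
y_bar = sum(sales) / n

# Sum of products of paired deviations, divided by n - 1 (sample covariance).
s_xy = sum((x - x_bar) * (y - y_bar)
           for x, y in zip(high_temp, sales)) / (n - 1)

print(x_bar, y_bar, s_xy)
```

A positive result here reflects that higher temperatures go with higher sales in the made-up data; the size of the number still depends on the units of measurement.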
Figure 2.28 demonstrates several possible scatter charts and their associated
One problem with using covariance is that the magnitude of the covariance value is difficult to interpret. Larger sxy values do not necessarily mean a stronger linear
Computing the correlation coefficient between two quantitative variables
The correlation coefficient measures the relationship between two variables, and, unlike covariance, the relationship between two variables is not affected by the units of measurement for x and y. For sample data, the correlation coefficient is defined as
rxy = sxy / (sx sy)
scales the correlation coefficient so that it will always take values between -1 and +1.
Let us now compute the sample correlation coefficient for bottled water sales at Queensland Amusement Park. Recall that we calculated sxy = 12.8 using equation (2.9).
Using data in Table 2.14, we can compute sample standard deviations for x and y
The sample correlation coefficient is computed from equation (2.10) as follows: 12.8
The correlation coefficient can take only values between -1 and +1. Correlation coefficient values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients greater than 0 indicate a positive linear
variables. The closer the correlation coefficient is to +1, the closer the x and y
to forming a straight line that trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative linear relationship between
The closer the correlation coefficient is to -1, the closer the x and y values are to
we can see in Figure 2.26, one could draw a straight line with a positive slope that would
close to all of the data points in the scatter chart.
Because the correlation coefficient defined here measures only the strength of the
cooling) and the daily high outside temperature for 100 consecutive days. The sample correlation coefficient for these data is rxy = -0.007 and indicates that there is no linear relationship between the two variables. However, Figure 2.29 provides strong visual evidence of a nonlinear relationship. That is, we can see that as the daily high outside temperature increases, the money spent on environmental control first
less heating is required and then increases as greater cooling is required.
We can compute correlation coefficients using the Excel function CORREL. The correlation coefficient in Figure 2.27 is computed in cell B18 for the sales of
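The sketch below computes a sample correlation coefficient as the covariance divided by the product of the two sample standard deviations, and then applies it to a deliberately nonlinear data set. Both data sets are hypothetical; the second one is chosen so that y depends on x exactly (y is x squared) and yet the linear correlation is zero.

```python
import statistics

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return s_xy / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical, roughly linear data.
high_temp = [80, 82, 84, 86, 88]
sales = [20, 24, 25, 27, 29]
r_linear = sample_correlation(high_temp, sales)

# Hypothetical nonlinear (U-shaped) relationship: y = x squared.
x_u = [-2, -1, 0, 1, 2]
y_u = [4, 1, 0, 1, 4]
r_nonlinear = sample_correlation(x_u, y_u)

print(round(r_linear, 4), r_nonlinear)
```

A correlation near 0 therefore rules out only a linear relationship, which is why a scatter chart remains a useful companion to the coefficient.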
CHAPTER 3: INFERENTIAL STATISTICS
Sampling from a Finite Population
Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical
random sample of size n from a finite population of size N is defined as follows.
Procedures used to select a simple random sample from a finite population are
generated is called a random number because the mathematical procedure used by the RAND function guarantees that every number between 0 and 1 has the same
N involves two steps.
Step 1 Assign a random number to each element of the population.
Step 2 Select the n elements corresponding to the n smallest random numbers.
Because each set of n elements in the population has the same probability of being assigned the n smallest random numbers, each set of n elements has the same probability of being selected for the sample. If we select the sample using this two-step procedure, every sample of size n has the same probability of being selected; thus, the sample selected satisfies the definition of a simple random sample.
Let us consider the process of selecting a simple random sample of 30 EAI
Step 1 In cell D1, enter the text Random Numbers
Step 2 In cells D2:D2501, enter the formula =RAND()
Step 3 Select the cell range D2:D2501
Step 4 In the Home tab in the Ribbon:
Click Copy in the Clipboard group
Click the arrow below Paste in the Clipboard group. When the Paste window appears, click Values in the Paste Values area
Press the Esc key
Step 5 Select cells A1:D2501
Step 6 In the Data tab on the Ribbon, click Sort in the Sort & Filter group
Step 7 When the Sort dialog box appears:
Select the check box for My data has headers
In the first Sort by dropdown menu, select Random Numbers
30 random numbers that were generated. Hence, this group of 30 employees is a simple random sample. Note that the random numbers shown on the right in Figure
sample, and employee 13 in the population (see row 14 of the worksheet on the left) has been included as the 22nd observation in the sample (row 23 of the worksheet on the right).
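The two-step procedure (assign a random number to every element, then keep the n elements with the smallest random numbers) can be sketched as below. The population is represented by hypothetical employee numbers 1 to 2500, matching the size implied by the range D2:D2501, and the seed is fixed only so that the sketch is repeatable.

```python
import random

rng = random.Random(42)              # fixed seed: an assumption for repeatability
population = list(range(1, 2501))    # hypothetical employee numbers 1..2500
n = 30

# Step 1: assign a random number in [0, 1) to each element, like =RAND().
tagged = [(rng.random(), employee) for employee in population]

# Step 2: sort by the random numbers and keep the n smallest.
tagged.sort()
sample = [employee for _, employee in tagged[:n]]

print(sample)
```

Pasting the random numbers as values in Excel plays the same role as storing them once in the list here: the numbers must not change between generating them and sorting on them.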
Sampling from an Infinite Population
Sometimes we want to select a sample from a population, but the population is
infinite population case. With an infinite population, we cannot select a simple random sample because we cannot construct a frame consisting of all the elements. In the infinite population case, statisticians recommend selecting what is called a random sample.
Care and judgment must be exercised in implementing the selection process for obtaining a random sample from an infinite population. Each case may require a different selection procedure. Let us consider two examples to see what we mean by the conditions: (1) Each element selected comes from the same population, and (2)
With a production operation such as this, the biggest concern in selecting a random sample is to make sure that condition 1, the sampled elements are selected from the
selecting some boxes when the process is operating properly and other boxes when the process is not operating properly and is underfilling or overfilling the boxes. With a production process such as this, the second condition, each element is selected independently, is satisfied by designing the production process so that each box of cereal is filled independently. With this assumption, the quality-control inspector need only worry about satisfying the same population condition.
As another example of selecting a random sample from an infinite population,