Phan Tich Kinh Te Bang Excel.docx


CHAPTER 1: DATA CLEANSING

Checking for missing data in a variable in Excel

We can use the COUNTBLANK function to count the missing values in a variable in Excel.

We see that the variable Miles has 1 missing value.

To find where the missing value is, we sort the Miles column from lowest to highest. Missing values always end up at the bottom of the table.
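As a rough cross-check outside Excel, the same two steps (count the blanks, then sort so the blanks fall to the end) can be sketched in Python. The Miles values here are made up for illustration, and None stands in for an empty cell; this is only a sketch under those assumptions, not part of the workbook.

```python
# Hypothetical Miles column; None plays the role of a blank Excel cell.
miles = [20812, None, 47640, 4940, 31350]

# Rough equivalent of COUNTBLANK: count the entries that are missing.
missing_count = sum(1 for value in miles if value is None)

# Sort from lowest to highest, pushing missing entries to the end,
# which mirrors where blank cells land after an Excel sort.
sorted_miles = sorted(miles, key=lambda v: (v is None, v if v is not None else 0))

print(missing_count)
print(sorted_miles)
```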

Identifying erroneous outliers and other erroneous values in the data set

We can identify outliers in the data by using the mean and standard deviation of the quantitative variables.

In the example data in the figure above, we enter the formula =AVERAGE(C2:C457) in cell H3.

Then we enter the formula =STDEV.S(C2:C457) in cell H4.

We do the same for the remaining cells.

The results show that the means and standard deviations of these variables look stable; on their own they reveal no outlier anomalies.

We can also compute the minimum and maximum values for each variable. Enter the formula =MIN(C2:C457) in cell H5 and the formula =MAX(C2:C457) in cell H6.

We see that the minimum and maximum values of the variable Life of Tire (Months) are 1.8 months and 601.0 months. This maximum (about 50 years) is implausible for the life of a tire. To identify which automobile has this outlier, we sort the entire data set by the variable Life of Tire (Months) and scroll to the last rows of the data.
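The same screening (mean, standard deviation, minimum, maximum, then a sort to locate the extreme rows) can be sketched in Python. The Life of Tire values are hypothetical, with 601.0 included to mimic a misplaced decimal; the variable names are assumptions for illustration only.

```python
import statistics

# Hypothetical Life of Tire (Months) values; 601.0 mimics a misplaced decimal.
life_of_tire = [1.8, 22.3, 45.0, 60.1, 58.4, 601.0, 33.7]

summary = {
    "mean": statistics.mean(life_of_tire),    # like =AVERAGE(range)
    "stdev": statistics.stdev(life_of_tire),  # like =STDEV.S(range)
    "min": min(life_of_tire),                 # like =MIN(range)
    "max": max(life_of_tire),                 # like =MAX(range)
}

# Sort (row index, value) pairs by value and keep the tail,
# the analogue of sorting the sheet and scrolling to the last rows.
largest_rows = sorted(enumerate(life_of_tire), key=lambda pair: pair[1])[-2:]

print(summary)
print(largest_rows)
```

In this made-up data the implausible value shows up only in the maximum and in the sorted tail, which is why the minimum and maximum are worth checking alongside the mean and standard deviation.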

We see in Figure 2.34 that the observation with Life of Tire (Months) value of […] Tire (Months) for the other three tires from this automobile is 60.1. This suggests that the decimal for Life of Tire (Months) for this automobile's left rear tire value […] rear tire value is also misplaced. Both of these erroneous entries can now be corrected. By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the minimum and maximum values […] are in error and if so, what might be the correct value.

Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find. However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge to look for erroneous values. Here we will consider the variables Tread Depth and Miles; because more miles driven should lead to less tread depth on an automobile tire, we expect these two variables to have a negative relationship. A scatter chart will […] whether any of the tires in the data set have values for Tread Depth and Miles that are counter to this expectation.

The red ellipse in Figure 2.35 shows the region in which the points representing […]

Closer examination of outliers and potential erroneous values may reveal an error […]

[…] technique. Dimension reduction is the process of removing variables from the analysis without losing crucial information. One simple method for reducing the number of variables is to examine pairwise correlations to detect variables or […] may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.

A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. The treatment of categorical variables is particularly important. Typically, it is best to encode categorical variables with 0–1 […] different categories results in a large number of variables. In these cases, the use of PivotTables is helpful in identifying categories that are similar and can possibly be […] reduce the number of 0–1 dummy variables. For example, some categorical […] code, product model number) may have many possible categories such that, for the purpose of model building, there is no substantive difference between multiple […] therefore the number of categories may be reduced by combining categories.

Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship. Financial data supplying information on stock […] may be as useful as the derived variable representing the price/earnings (PE) ratio.

Building a frequency distribution for a categorical variable in Excel

Suppose we have the following data table.

To build the frequency distribution for each type of drink, we use the COUNTIF function as follows.

The result is as follows.
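For comparison, a frequency count per category can be sketched in Python with collections.Counter, which plays the role of one COUNTIF per category. The drink names and purchases are placeholders, not the data from the worksheet.

```python
from collections import Counter

# Hypothetical drink purchases; the category names are placeholders.
drinks = ["Coke", "Pepsi", "Coke", "Sprite", "Coke", "Pepsi"]

# Counter tallies how often each category occurs,
# much like =COUNTIF(range, category) evaluated once per category.
frequency = Counter(drinks)

for drink, count in sorted(frequency.items()):
    print(drink, count)
```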

Building a frequency distribution for a quantitative variable in Excel

To build a frequency distribution for a quantitative variable, we use the FREQUENCY function.
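The binning logic behind FREQUENCY can be sketched in Python as follows. The data and bin limits are hypothetical, and the helper name frequency is chosen only for this illustration. Each bin counts the values that are at most its upper limit and greater than the previous limit, with one extra count for values above the last limit.

```python
from bisect import bisect_left

def frequency(data, upper_limits):
    """Count values per bin. Bin i holds values <= upper_limits[i] and
    greater than the previous limit; the final count is the overflow bin.
    upper_limits must be sorted in ascending order."""
    counts = [0] * (len(upper_limits) + 1)
    for value in data:
        counts[bisect_left(upper_limits, value)] += 1
    return counts

# Hypothetical data and bin upper limits.
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27]
upper_limits = [14, 19, 24, 29]

counts = frequency(data, upper_limits)
print(counts)
```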

Drawing a histogram for a quantitative variable

To draw a histogram, we need the Data Analysis ToolPak.

Step 1 Click the Data tab in the Ribbon

Step 2 Click Data Analysis in the Analyze group

Step 3 When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK

In the Input Range: box, enter the data range; in this example we enter A2:A21

In the Bin Range: box, enter the upper limit of each bin; here we enter C2:C6

Under Output Options:, select New Worksheet Ply:

Select the check box for Chart Output (see Figure 2.13)

Click OK

The result is a new worksheet containing the histogram.

Drawing a box plot for a quantitative variable

The step-by-step directions below illustrate how to create boxplots in Excel for […] single variable and multiple variables. First we will create a boxplot for a single variable.

Step 1 Select cells B1:B13

Step 2 Click the Insert tab on the Ribbon

Click the Insert Statistic Chart button in the Charts group

Choose the Box and Whisker chart from the drop-down menu

The resulting boxplot created in Excel is shown in Figure 2.24. Comparing this figure to Figure 2.22, we see that all the important elements of a boxplot are generated here. Excel orients the boxplot vertically, and by default it also includes a marker for the mean.

Next we will use the HomeSalesComparison file to create boxplots in Excel for […] variables similar to what is shown in Figure 2.26.

Step 1 Select cells B1:F11

Step 2 Click the Insert tab on the Ribbon

Click the Insert Statistic Chart button in the Charts group

Choose the Box and Whisker chart from the drop-down menu

The boxplot created in Excel is shown in Figure 2.25. Excel again orients the […]

CHAPTER 2: DESCRIPTIVE STATISTICS

Computing measures of central tendency for a quantitative variable

We can compute the mean, median, and mode of a quantitative variable by using the AVERAGE, MEDIAN, and MODE functions.

In the table above, our data series has two modes, 138 and 25400000, so we must use the MODE.MULTI function (MULTI stands for multimodal). If the data series has only one mode, we use the MODE.SNGL function.
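The difference between a single mode and multiple modes can be sketched with Python's statistics module (Python 3.8 or later). The data are hypothetical and chosen so that two values tie for most frequent; statistics.multimode is used here as a loose counterpart of MODE.MULTI.

```python
import statistics

# Hypothetical data with two modes (5 and 9 each occur twice).
values = [3, 5, 5, 7, 9, 9, 12]

mean = statistics.mean(values)        # like AVERAGE
median = statistics.median(values)    # like MEDIAN
modes = statistics.multimode(values)  # all most-frequent values, like MODE.MULTI

print(mean, median, modes)
```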

We can compute the geometric mean with Excel.

The geometric mean is often used in analyzing growth rates in financial data. In these types of situations, the arithmetic mean or average value will provide misleading results.

To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past 10 years. […] We refer to 0.779 as the growth factor for year 1 in Table 2.10. We can compute the balance at the end of year 1 by multiplying the value invested in the fund at the beginning of year 1 by the growth factor for year 1: $100(0.779) = $77.90. The balance in the fund at the end of year 1, $77.90, now becomes the beginning balance in year 2. So, with a percentage annual return for year 2 of 28.7%, the […]

In other words, the balance at the end of year 2 is just the initial investment at the beginning of year 1 times the product of the first two growth factors. This result can be generalized to show that the balance at the end of year 10 is the initial investment times the […] compute the balance at the end of year 10 for any amount of money invested at the beginning of year 1 by multiplying the value of the initial investment by 1.335. For […]

What was the mean percentage annual return or mean rate of growth for this investment over the 10-year period? The geometric mean of the 10 growth factors […] answer this question. Because the product of the 10 growth factors is 1.335, the geometric mean is the 10th root of 1.335, or 1.029.

The geometric mean tells us that annual returns grew at an average annual rate of (1.029 − 1)100, or 2.9%. In other words, with an average annual growth rate of 2.9%, a $100 investment in the fund at the beginning of year 1 would grow to $100(1.029)^10 = $133.09 at the end of 10 years. We can use Excel to calculate the geometric mean for the data in Table 2.10 by using the function GEOMEAN. In Figure 2.17, the value for the geometric mean in cell C13 is found using the formula

=GEOMEAN(C2:C11)
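The arithmetic behind GEOMEAN can be sketched in Python. Only figures stated in the text are used: the growth factor 0.779 for year 1, the 28.7% return for year 2 (growth factor 1.287), and the product 1.335 of all 10 growth factors. The helper name geometric_mean is an assumption for this illustration, and Python 3.8 or later is assumed for math.prod and statistics.geometric_mean.

```python
import math
import statistics

def geometric_mean(growth_factors):
    """Return the n-th root of the product of n growth factors."""
    product = math.prod(growth_factors)
    return product ** (1 / len(growth_factors))

# Growth factors for years 1 and 2, taken from the figures quoted in the text.
first_two = [0.779, 1.287]
gm_two_years = geometric_mean(first_two)

# The text reports that the product of all 10 growth factors is 1.335.
gm_ten_years = 1.335 ** (1 / 10)
mean_growth_rate_pct = (gm_ten_years - 1) * 100

print(gm_two_years, statistics.geometric_mean(first_two))
print(round(gm_ten_years, 3), round(mean_growth_rate_pct, 1))
```

The hand-written helper and statistics.geometric_mean should agree up to floating-point error; the helper is shown only to make the n-th root of the product explicit.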

It is important to understand that the arithmetic mean of the percentage annual returns does not provide the mean annual growth rate for this investment. The sum of the 10 percentage annual returns in Table 2.10 is 50.4. Thus, the arithmetic mean […] mean is appropriate only for an additive process. For a multiplicative process, such as applications involving growth rates, the geometric mean is the appropriate […] applications include changes in the populations of species, crop yields, pollution […] death rates. The geometric mean can also be applied to changes that occur over any number of successive periods of any length. In addition to annual changes, the […] often applied to find the mean rate of change over quarters, months, weeks, and even days.

Computing measures of dispersion for a quantitative variable

We can use the following formulas to compute measures of dispersion for a quantitative variable.

Note that to compute the standard deviation and the variance we use the S versions of the functions, meaning the calculation is for a sample rather than for the entire population.
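The sample-versus-population distinction can be sketched in Python, where statistics.variance and statistics.stdev divide by n − 1 (like VAR.S and STDEV.S) and statistics.pvariance divides by n (like VAR.P). The sample values are hypothetical.

```python
import statistics

# Hypothetical sample of five observations (mean 44).
sample = [46, 54, 42, 46, 32]

sample_variance = statistics.variance(sample)       # divides by n - 1, like VAR.S
sample_stdev = statistics.stdev(sample)             # square root of the above, like STDEV.S
population_variance = statistics.pvariance(sample)  # divides by n, like VAR.P

print(sample_variance, sample_stdev, population_variance)
```

The sample variance is larger than the population variance computed on the same numbers because the sum of squared deviations is divided by n − 1 instead of n.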

Variance is a measure of variability in the values of a random variable. It is a weighted average of the squared deviations of a random variable from its mean where the […] referred to as the variance. The notations Var(x) and σ² are both used to denote the variance of a random variable.

The calculation of the variance of the number of payments made per year by a mortgage customer is summarized in Table 4.12. We see that the variance is 42.360. The standard deviation, σ, is defined as the positive square root of the variance. Thus, the standard deviation for the number of payments made per year by a mortgage […]

The Excel function SUMPRODUCT can be used to easily calculate equation […] a custom discrete random variable. We illustrate the use of the SUMPRODUCT […]

We can also use Excel to find the variance directly from the data when the values […] occur with relative frequencies that correspond to the probability distribution of the random variable. Cell F305 in Figure 4.12 shows that we use the Excel formula =VAR.P(F2:F301) to calculate the variance from the complete data. This formula […] which is the same as that calculated in Table 4.12 and Figure 4.13. Similarly, we […] formula =STDEV.P(F2:F301) to calculate the standard deviation of 6.508.

As with the AVERAGE function and expected value, we cannot use the Excel functions VAR.P and STDEV.P directly on the x values to calculate the variance and standard deviation of a custom discrete random variable if the x values are not […] Instead we must either use the formula from equation (4.14) or use the Excel […] the entire data set as shown in Figure 4.12.
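The probability-weighted calculation that SUMPRODUCT performs can be sketched in Python. The distribution is hypothetical (it is not the mortgage payment data from Table 4.12), and the comparison at the end illustrates why applying a population variance directly to the distinct x values gives a different answer when the values are not equally likely.

```python
import statistics

# Hypothetical probability distribution of a discrete random variable.
x_values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]

# Expected value: sum of x * f(x), i.e. the SUMPRODUCT of the two columns.
expected_value = sum(x * p for x, p in zip(x_values, probabilities))

# Variance: probability-weighted average of squared deviations from the mean.
variance = sum((x - expected_value) ** 2 * p for x, p in zip(x_values, probabilities))
standard_deviation = variance ** 0.5

# Treating the four x values as equally likely (what VAR.P on the x column
# alone would do) ignores the probabilities and gives a different number.
naive_variance = statistics.pvariance(x_values)

print(expected_value, variance, standard_deviation, naive_variance)
```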

[…] variables, each with different standard deviations and different means.

Computing measures of distribution for a quantitative variable

A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set's standard deviation. Suppose we have a sample of n values denoted by x_1, x_2, …, x_n. In addition, assume that the sample mean, x̄, and the sample standard deviation, s, are already computed. Associated with each […] value called its z-score. Equation (2.8) shows how the z-score is computed for each x_i:

z_i = (x_i − x̄) / s

The z-score is often called the standardized value. The z-score, z_i, can be […] that the value of the observation is equal to the mean.

The z-scores for the class size data are computed in Table 2.13. Recall the […] deviations below the mean.

The z-score can be calculated in Excel using the function STANDARDIZE. Figure […]
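The z-score calculation, z_i = (x_i − x̄) / s with s the sample standard deviation, can be sketched in Python. The values are a hypothetical sample and the helper name z_scores is an assumption; in Excel the corresponding call would be STANDARDIZE(x, mean, standard_dev) applied to each value.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / s for each x, using the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

# Hypothetical sample with mean 44 and sample standard deviation 8.
values = [46, 54, 42, 46, 32]
scores = z_scores(values)

print(scores)
```

A z-score of 0 means the value equals the mean; negative scores lie below the mean and positive scores above it, measured in standard deviations.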

Computing the covariance between two quantitative variables

Covariance is a descriptive measure of the linear association between two […] the deviation of each x_i from its sample mean (x_i − x̄) by the deviation of the corresponding y_i from its sample mean (y_i − ȳ); this sum is then divided by n − 1.

To measure the strength of the linear relationship between the high temperature x […] that for our calculations, x̄ = 84.6 and ȳ = 26.3.

The covariance calculated in Table 2.15 is s_xy = 12.8. […] than 0, it indicates a positive relationship between the high temperature and sales […] water. This verifies the relationship we saw in the scatter chart in Figure 2.26 that […] high temperature for a day increases, sales of bottled water generally increase. The sample covariance can also be calculated in Excel using the COVARIANCE.S function. Figure 2.27 shows the data from Table 2.14 entered into an Excel […] covariance is calculated in cell B17 using the formula =COVARIANCE.S(A2:A15,B2:B15). A2:A15 defines the range for the x variable (high temperature), and B2:B15 defines the range for the y variable (sales of bottled water).
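The sample covariance described above (sum the products of paired deviations, then divide by n − 1) can be sketched in Python. The temperature and sales figures are hypothetical and are not the values from Table 2.14; the helper name sample_covariance is an assumption for this illustration.

```python
def sample_covariance(x, y):
    """Sum of (x_i - x_bar) * (y_i - y_bar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross_products / (n - 1)

# Hypothetical daily high temperatures and bottled water sales.
high_temp = [78, 79, 80, 80, 82]
sales = [23, 22, 24, 22, 24]

s_xy = sample_covariance(high_temp, sales)
print(s_xy)
```

A positive result, as in this made-up data, points to a positive linear association, in the same way the text reads the sign of the COVARIANCE.S output.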

For the bottled water, the covariance is positive, indicating that higher […] are associated with higher sales (y). If the covariance is near 0, then the x and y […] are not linearly related. If the covariance is less than 0, then the x and y variables are negatively related, which means that as x increases, y generally decreases.

Figure 2.28 demonstrates several possible scatter charts and their associated […]

One problem with using covariance is that the magnitude of the covariance value is difficult to interpret. Larger s_xy values do not necessarily mean a stronger linear […]

Computing the correlation coefficient between two quantitative variables

The correlation coefficient measures the relationship between two variables, and, unlike covariance, the relationship between two variables is not affected by the units of measurement for x and y. For sample data, the correlation coefficient is defined as […] scales the correlation coefficient so that it will always take values between −1 and +1.

Let us now compute the sample correlation coefficient for bottled water sales at Queensland Amusement Park. Recall that we calculated s_xy = 12.8 using equation (2.9). Using data in Table 2.14, we can compute sample standard deviations for x and y […]

The sample correlation coefficient is computed from equation (2.10) as follows: 12.8 […]

The correlation coefficient can take only values between −1 and +1. Correlation coefficient values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients greater than 0 indicate a positive linear […] variables. The closer the correlation coefficient is to +1, the closer the x and y […] to forming a straight line that trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative linear relationship between […] The closer the correlation coefficient is to −1, the closer the x and y values are to […] we can see in Figure 2.26, one could draw a straight line with a positive slope that would […] close to all of the data points in the scatter chart. Because the correlation coefficient defined here measures only the strength of the […] cooling) and the daily high outside temperature for 100 consecutive days. The sample correlation coefficient for these data is r_xy = −0.007 and indicates that there is no linear relationship between the two variables. However, Figure 2.29 provides strong visual evidence of a nonlinear relationship. That is, we can see that as the daily high outside temperature increases, the money spent on environmental control first […] less heating is required and then increases as greater cooling is required. We can compute correlation coefficients using the Excel function CORREL. The correlation coefficient in Figure 2.27 is computed in cell B18 for the sales of […]
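The sample correlation coefficient, taken here as the usual Pearson form r_xy = s_xy / (s_x s_y), can be sketched in Python. The data are hypothetical: one series is exactly linear in x and the other is U-shaped, to illustrate the point that a correlation near 0 rules out only a linear relationship. The helper name correlation is an assumption for this illustration.

```python
import math

def correlation(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]  # exact straight line with positive slope
u_shaped = [v ** 2 for v in x]   # strong but nonlinear relationship

r_linear = correlation(x, linear)
r_u_shaped = correlation(x, u_shaped)
print(r_linear, r_u_shaped)
```

The U-shaped series depends entirely on x, yet its correlation with x is zero, which parallels the heating and cooling example in the text.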

CHAPTER 3: INFERENTIAL STATISTICS

Sampling from a Finite Population

Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical […] random sample of size n from a finite population of size N is defined as follows.

Procedures used to select a simple random sample from a finite population are […] generated is called a random number because the mathematical procedure used by the RAND function guarantees that every number between 0 and 1 has the same […] N involves two steps.

Step 1 Assign a random number to each element of the population.

Step 2 Select the n elements corresponding to the n smallest random numbers.

Because each set of n elements in the population has the same probability of being assigned the n smallest random numbers, each set of n elements has the same probability of being selected for the sample. If we select the sample using this two-step procedure, every sample of size n has the same probability of being selected; thus, the sample selected satisfies the definition of a simple random sample.

Let us consider the process of selecting a simple random sample of 30 EAI […]

Step 1 In cell D1, enter the text Random Numbers

Step 2 In cells D2:D2501, enter the formula =RAND()

Step 3 Select the cell range D2:D2501

Step 4 In the Home tab in the Ribbon:

Click Copy in the Clipboard group

Click the arrow below Paste in the Clipboard group. When the Paste window appears, click Values in the Paste Values area

Press the Esc key

Step 5 Select cells A1:D2501

Step 6 In the Data tab on the Ribbon, click Sort in the Sort & Filter group

Step 7 When the Sort dialog box appears:

Select the check box for My data has headers

In the first Sort by dropdown menu, select Random Numbers

[…] 30 random numbers that were generated. Hence, this group of 30 employees is a simple random sample. Note that the random numbers shown on the right in Figure […] sample, and employee 13 in the population (see row 14 of the worksheet on the left) has been included as the 22nd observation in the sample (row 23 of the worksheet on the right).
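The two-step procedure (assign a random number to every element, then keep the n elements with the smallest numbers) can be sketched in Python. The employee IDs 1 to 2500 are an assumption that mirrors worksheet rows 2 to 2501, and the seed parameter is added only so the illustration is repeatable; neither comes from the worksheet itself.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Select n elements by tagging each with a random number and
    keeping those with the n smallest tags."""
    rng = random.Random(seed)
    # Step 1: assign a random number in [0, 1) to each element, like =RAND().
    tagged = [(rng.random(), element) for element in population]
    # Step 2: sort by the random number and keep the first n elements.
    tagged.sort(key=lambda pair: pair[0])
    return [element for _, element in tagged[:n]]

# Hypothetical employee IDs standing in for the 2,500 worksheet rows.
employees = list(range(1, 2501))
sample = simple_random_sample(employees, 30, seed=1)

print(sample)
```

Copying the random numbers as values before sorting in Excel serves the same purpose as generating them once here: the numbers must stay fixed while the rows are reordered.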

Sampling from an Infinite Population

Sometimes we want to select a sample from a population, but the population is […] infinite population case. With an infinite population, we cannot select a simple random sample because we cannot construct a frame consisting of all the elements. In the infinite population case, statisticians recommend selecting what is called a random sample.

Care and judgment must be exercised in implementing the selection process for obtaining a random sample from an infinite population. Each case may require a different selection procedure. Let us consider two examples to see what we mean by the conditions: (1) Each element selected comes from the same population, and (2) each element is selected independently.

With a production operation such as this, the biggest concern in selecting a random sample is to make sure that condition 1, the sampled elements are selected from the […] selecting some boxes when the process is operating properly and other boxes when the process is not operating properly and is underfilling or overfilling the boxes. With a production process such as this, the second condition, each element is selected independently, is satisfied by designing the production process so that each box of cereal is filled independently. With this assumption, the quality-control inspector need only worry about satisfying the same population condition.

As another example of selecting a random sample from an infinite population,