Probability and Statistics Course Project, ĐH BK


PROBABILITY AND STATISTICS PROJECT (GROUP 05)

TABLE OF CONTENTS

LIST OF FIGURES
ACKNOWLEDGMENT
INTRODUCTION
  1.1 Topic introduction and requirements
    1.1.1 Subject
    1.1.2 R-studio
    1.1.3 Our problems
  1.2 Theoretical basis
DATA CORRECTION
  2.1 Import data
  2.2 Data cleaning
  2.3 Data clarification
  2.4 Logistic regression
  2.5 Prediction
R CODE
REFERENCES

LIST OF FIGURES

• Figure 1: R code and results after reading the data
• Figure 2: R code and results when checking for missing data in the "diabetes" file
• Figure 3: R code and results when performing descriptive statistics
• Figure 4: R code and results when performing quantitative statistics for the variable "Outcome"
• Figure 5: Histograms of the variables "Pregnancies" and "Glucose"
• Figure 6: Histograms of the variables "Blood Pressure" and "Skin Thickness"
• Figure 7: Histograms of the variables "Insulin" and "BMI"
• Figure 8: Histograms of the variables "Diabetes Pedigree Function" and "Age"
• Figure 9: R code
• Figure 10: Histogram showing the distribution of the number of pregnancies for people with and without diabetes
• Figure 11: R code
• Figure 12: Histogram showing the distribution of skin thickness for people with and without diabetes
• Figure 13: R code
• Figure 14: Histogram showing the distribution of glucose level for people with and without diabetes
• Figure 15: R code
• Figure 16: Histogram showing the distribution of blood pressure for people with and without diabetes
• Figure 17: R code
• Figure 18: Histogram showing the distribution of insulin level for people with and without diabetes
• Figure 19: R code
• Figure 20: Histogram showing the distribution of BMI (body mass index) for people with and without diabetes
• Figure 21: R code
• Figure 22: Histogram showing the distribution of the diabetes pedigree function for people with and without diabetes
• Figure 23: R code
• Figure 24: Histogram showing the distribution of age for people with and without diabetes
• Figure 25: R code and results
• Figure 26: R code and results when removing the SkinThickness variable from the model
• Figure 27: R code and results when removing the Insulin variable from the model
• Figure 28: R code and results when removing the Age variable from the model
• Figure 29: R code and results when comparing the efficiency of two models
• Figure 30: R code and results when comparing the efficiency of two models
• Figure 31: R code and results when comparing the efficiency of two models
• Figure 32: R code
• Figure 33: Results when building an equation with all variables
• Figure 34: Result when removing the SkinThickness variable from the first model
• Figure 35: R code and summary of model results
• Figure 36: R code and results
• Figure 37: R code and results
• Figure 38: R code and the results of forecasting based on the original data set, saved in the file "diabetes"
• Figure 39: R code and statistical results
• Figure 40: R code and statistical results
• Figure 41: R code and comparison results
• Figure 42: R code and test results
• Figure 43: R code and test results
• Figure 44: R code and evaluation results
ACKNOWLEDGMENT

First of all, we would like to express our gratitude to Professor Nguyen Tien Dung for giving our group the chance to work with the RStudio software, and for the abundant knowledge about Probability and Statistics he has shown us. This project is an opportunity for us to operate RStudio, and we understand that R is an important tool in the world of mathematics today. The software expands not only our knowledge but also our ideas for future projects.

PROJECT OF PROBABILITY AND STATISTICS FOR CHEMICAL ENGINEERING (MT2013)

INTRODUCTION

1.1 Topic introduction and requirements

1.1.1 Subject

Probability is the branch of mathematics that deals with numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where 0 denotes the impossibility of the event and 1 represents certainty. It is applied in fields such as mathematics, statistics, economics, gambling, science (particularly physics), artificial intelligence, machine learning, computer science, and philosophy.

Statistics is the study of several disciplines, including data analysis, interpretation, presentation, and organization. It plays a critical part in the research process by providing analytically significant statistics that help statistical analysts obtain correct results for problems connected with social activities. In summary, Probability and Statistics is becoming increasingly significant in modern life, especially for students majoring in natural science, technology, and economics.

1.1.2 R-studio

R is a programming language and environment that is widely used in statistical computing, data analysis, and scientific research. It is a popular language for data collection, cleaning, analysis, graphing, and visualization. In fact, R is the next-generation language of the "S language"; the S programming language allows users, including engineering and technology students, to calculate and manipulate data. As a language, R can be used to develop specialized software for a particular computational problem.

1.1.3 Our problems

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients are females at least 21 years old of Pima Indian heritage.

Information about the dataset attributes:

• Pregnancies: the number of pregnancies
• Glucose: the glucose level in blood
• Blood Pressure: the blood pressure measurement
• Skin Thickness: the thickness of the skin
• Insulin: the insulin level in blood
• BMI: the body mass index
• Diabetes Pedigree Function: a score expressing the likelihood of diabetes
• Age: the age of the patient
• Outcome: the final result, where 1 is Yes and 0 is No

Implementation steps:

• Import the data: diabetes.csv
• Data cleaning: handle NA (missing) data
• Data visualization:
  o Convert variables if necessary
  o Descriptive statistics using sample statistics and graphs
• Logistic regression model: use a suitable logistic regression model to evaluate the factors affecting diabetes (a minimal R sketch of these steps follows below)
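As an illustration of these steps, the sketch below loads and inspects the dataset in base R. This is a minimal sketch only, assuming the file is named diabetes.csv and that the CSV's column names follow the standard Pima spelling without spaces (e.g. Glucose, BMI, Outcome); the report's actual code is shown in its figures.

```r
# Minimal sketch of the import and cleaning steps (assumed file and column names)
diabetes <- read.csv("diabetes.csv")

# Check for missing (NA) values in each column
colSums(is.na(diabetes))

# Descriptive statistics for every variable
summary(diabetes)

# Frequency table for the categorical outcome (1 = diabetic, 0 = not)
table(diabetes$Outcome)

# Example visualization: histogram of glucose levels
hist(diabetes$Glucose, main = "Glucose", xlab = "Glucose level")
```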
1.2 Theoretical basis

Logistic regression (often referred to simply as binomial logistic regression) is used to predict the probability that an observation falls into one of two categories of the dependent variable, based on one or more independent variables that may be continuous or categorical. If, on the other hand, your dependent variable is a count, the statistical method to consider is Poisson regression; and if the dependent variable has more than two categories, multinomial logistic regression should be used. For example, you can use binomial logistic regression to understand whether test performance can be predicted from review time and test anxiety (i.e., the dependent variable is "test performance", measured on a dichotomous scale as "pass" or "fail", and you have two independent variables: "review time" and "test anxiety").

The logistic regression model

Logistic regression models are used to predict a categorical variable from one or more continuous or categorical independent variables. The dependent variable can be binary, ordinal, or multicategorical; the independent variables can be interval/scale, dichotomous, discrete, or a mixture of all of these. The logistic regression equation (in the case of a binary dependent variable) is:

$$P(Y_i = 1) = \frac{e^{\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}}}{1 + e^{\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}}}$$

Where:
• P is the probability of observing case i with outcome value Y = 1;
• e is Euler's constant, with a value close to 2.71828;
• the regression coefficients β correspond to the observed variables.

We often use regression models to estimate the effect of the X variables on the odds that Y = 1.

Effects in logistic regression

For estimation and prediction purposes, raw probabilities are severely limited. First, they are bounded to the range 0 to 1; this implies that if the estimated effect of variable X pushed the predicted outcome above 1, interpretation would be problematic. Second, a probability cannot be negative; if the effect of an independent variable on Y were negative, interpreting a coefficient that can only act on a non-negative scale would be meaningless. To solve these two problems, we take a two-step approach involving two changes of variable. First, we convert probabilities into odds (O):

$$O = \frac{P}{1-P} = \frac{\text{probability of the event happening}}{\text{probability of the event not happening}}, \qquad P = \frac{O}{1+O}$$

where O is the odds and P is the probability. That is, the odds that an event will occur is the ratio of the expected number of times the event happens to the expected number of times it does not happen; this gives a direct relationship between Odds(Y = 1) and the probability that Y = 1. Since the odds can be infinitely large, working on the odds scale allows the regression effect to take any positive value. The next step solves the second problem. Algebraically, we can restate the odds formula above in terms of the logarithm of Odds(Y = 1):

$$\log_e[\text{Odds}(Y_i = 1)] = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}$$

To illustrate the calculation of the log-odds for a case in the population, code the dependent variable Y as 1 (voted for Obama in the 2008 US election) or 0 (voted for McCain). Suppose P(Y = 1), the probability of voting for Obama, is 0.218, so 1 − P = 0.782. We calculate the odds as:

Odds = 0.218 / 0.782 = 0.279
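To make the probability-odds-logit chain concrete, the following minimal base-R sketch reproduces the example's numbers; this is an illustration only, not the report's own code.

```r
# Converting between probability, odds, and log-odds (values from the example)
p <- 0.218
odds <- p / (1 - p)   # approximately 0.279
logit <- log(odds)    # approximately -1.276, the natural log of the odds

# plogis() inverts the logit transform, recovering the original probability
plogis(logit)         # 0.218
```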
This value gives us the resulting odds. To allow the coefficients to point in either direction, we move to the logarithmic formula of the odds: the natural logarithm (log_e, symbol ln) of the odds, e.g. ln(0.279) = −1.276. The log-odds of voting for Obama is therefore −1.276. If we stopped at probabilistic predictions we could get false results (the quantity would always be positive), and, second, the true effects of the covariates involved would be underestimated. The main advantage of the logarithmic odds is that the coefficients are no longer constrained: they can be negative as well as positive, ranging from negative infinity to positive infinity. Stated this way, logistic regression looks exactly like multiple regression on the right-hand side of the log-odds equation. The left-hand side of the equation, however, is not the score of Y; it is the logarithm of Odds(Y = 1). This means that each one-unit increase in X has an effect of β on the log-odds of Y.

Estimating a logistic regression model with maximum likelihood

Because logistic regression operates on a categorical dependent variable, the ordinary least squares (OLS) method is unusable (it assumes a normally distributed dependent variable). Therefore, a more general estimator is used to find a good fit of the parameters: maximum likelihood estimation. Maximum likelihood is an iterative estimation technique that selects the parameter estimates which maximize the likelihood of the observed sample. In logistic regression, maximum likelihood selects the coefficient estimates that maximize the logarithm of the probability of observing the particular set of dependent-variable values in the sample, for the given set of X values.
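In R, this maximum likelihood fit is what glm() performs for a binomial family. Below is a minimal sketch for the diabetes data, assuming the data frame and column names from the import sketch above; the report's actual model code appears in its figures.

```r
# Sketch: fit a logistic regression for Outcome by maximum likelihood
# (glm() maximizes the binomial log-likelihood iteratively)
model_1 <- glm(Outcome ~ ., data = diabetes, family = binomial)

summary(model_1)   # coefficient estimates B on the log-odds scale, with tests
```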
Because logistic regression uses the method of maximum likelihood, the coefficient of determination (R²) cannot be estimated directly. This leaves two dilemmas for the interpretation of logistic regression: first, how can we measure the goodness of fit, i.e. test a general null hypothesis? Second, how do we estimate the partial effect of each variable X?

Statistical inference and the null hypothesis

For the first question, the statistical inference, together with the null hypothesis, is interpreted in the following steps:

• The first step in interpreting the regression is to evaluate the global null hypothesis that the independent variables have no relationship with Y. In OLS regression, this is equivalent to testing whether R² = 0 in the population using an F-test. Logistic regression instead uses maximum likelihood (not OLS): the null hypothesis H0 is β1 = β2 = ⋯ = βk = 0, and we measure the size of the residuals from this null model with a log-likelihood statistic.
• We then estimate the model again, assuming the null hypothesis is false, finding the maximum likelihood values of the coefficients β in the sample. Again, we measure the size of the residuals from this model with a log-likelihood statistic.
• Finally, we compare the two statistics by computing the test statistic

$$-2[\ln(L_{\text{null}}) - \ln(L_{\text{model}})]$$

This statistic tells us how much the residual (prediction error) is reduced by using the X variables. Under the null hypothesis the reduction is 0; if the statistic is large enough (in a chi-square test with df = number of independent variables), we reject the null hypothesis and conclude that at least one independent variable has an effect on the log-odds. SPSS also reports R²-like statistics to help evaluate the strength of association, but these are pseudo-R² measures and should not be over-interpreted, because logistic regression does not use R² the way linear regression does.

For the second question, how do we estimate the partial effect of each variable X? When the general null hypothesis is rejected, we evaluate the partial effects of the predictors. As in multiple linear regression, this involves a null hypothesis for each independent variable included in the equation: that its regression coefficient is zero, i.e. it has no effect on the log-odds. Each coefficient estimate B has a standard error, the extent to which, on average, we would expect B to vary from one sample to another by chance. To check the significance of B, a test statistic is calculated (not a t-test, but a Wald chi-squared statistic, with 1 degree of freedom). Remember that the coefficient B expresses the effect of a one-unit change in X on the log-odds; for education, for example, the effect is positive: as education increases, the log-odds also increases. The Exp(B) value of an independent variable X predicts the change in the odds of the event from a one-unit change in that variable, holding all other independent variables constant: when X increases by one, the odds of the "yes" event are multiplied by Exp(B) (this is e raised to the power B; a value of, say, 1.05 is an increase of 5%).
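In R, the likelihood-ratio comparison above can be sketched as follows, assuming the model_1 fit from the earlier sketch; anova() with test = "Chisq" computes exactly the −2[ln(L_null) − ln(L_model)] statistic against an intercept-only null model.

```r
# Global null hypothesis: compare the intercept-only (null) model
# with the full model via a likelihood-ratio chi-square test
model_null <- glm(Outcome ~ 1, data = diabetes, family = binomial)
anova(model_null, model_1, test = "Chisq")

# Equivalent by hand: -2 * [ln(L_null) - ln(L_model)]
lr_stat <- as.numeric(-2 * (logLik(model_null) - logLik(model_1)))
df <- length(coef(model_1)) - 1               # number of independent variables
pchisq(lr_stat, df = df, lower.tail = FALSE)  # p-value of the chi-square test

# Exp(B): exponentiated coefficients, i.e. odds ratios for the partial effects
exp(coef(model_1))
```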
Optimal model selection

One of the difficult problems in multivariable logistic regression analysis is choosing a model that adequately describes the data. In a study with a dependent variable y and independent variables x1, x2, and x3, we could consider the following models to predict y: y = f(x1), y = f(x2), y = f(x3), y = f(x1, x2), y = f(x1, x3), y = f(x2, x3), and y = f(x1, x2, x3), where f is a function. In general, with k independent variables x1, x2, ..., xk, there are many (2^k − 1) such models to predict y. An optimal model must meet three criteria:

• it is simple;
• it is adequate;
• it has practical meaning.

The simplicity criterion requires a model with few independent variables, because too many variables make interpretation difficult and sometimes impractical. By way of analogy, spending 50,000 VND to buy 500 pages of a book is better than spending 60,000 VND for the same number of pages. Similarly, if a model with fewer independent variables describes the data as well as a model with more, the first model is chosen. A simple model is an economical one (in English, a parsimonious model).

The adequacy criterion means that the model must describe the data satisfactorily, i.e. it must predict values close (or as close as possible) to the actually observed values of the dependent variable y. If the observed value of y is 10, a model whose prediction is close to 10 must be considered more adequate than one that predicts 6.

The criterion of practical significance means that the model must be supported by theory or carry biological significance (in biological research), clinical significance (in clinical studies), and so on. It is possible that phone numbers are somehow correlated with fracture rates, but such a model obviously makes no sense. This is an important criterion, because a statistical analysis that produces a mathematically impressive model with no practical significance is just a numbers game, without real scientific value. This third criterion belongs to the theoretical realm, and we will not discuss it here; we will discuss the criteria of simplicity and adequacy.

An important and useful metric for deciding on a simple and adequate model is the Akaike Information Criterion (AIC):

$$AIC = -2 \times \log(\text{likelihood}) + 2 \times k = 2[k - \log(\text{likelihood})]$$

A simple and adequate model is one with an AIC value as low as possible whose independent variables are statistically significant. So the problem of finding a simple and adequate model is really a search for the model (or models) with the lowest, or near-lowest, AIC value.

Model 3: remove the Insulin variable from model 2: model_3
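The code listing for this step is cut off in the source. As a rough sketch only, under the assumption that model_1 is the full fit from the earlier sketch and that model_2 is the fit with SkinThickness already removed (as the report's figures suggest), the step would typically look like this in R:

```r
# Sketch: build model_3 by dropping Insulin from model_2, then compare fits
model_2 <- update(model_1, . ~ . - SkinThickness)  # assumed, per Figure 26
model_3 <- update(model_2, . ~ . - Insulin)

AIC(model_1, model_2, model_3)           # lower AIC = simpler and more adequate
anova(model_3, model_2, test = "Chisq")  # LR test: is removing Insulin justified?
```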
