

Higher Nationals in Computing Unit 14: Business Intelligence

Assignment 1

Learner’s name: Nguyễn Lê Quang Tuấn Anh ID: GCS200729

Class: GCS0905A Subject code: 1641

Assessor name: Nguyen Xuan Sam


ASSIGNMENT 1 FRONT SHEET

Qualification BTEC Level 5 HND Diploma in Computing Unit number and title Unit 14: Business Intelligence

Submission date March 15, 2023 Date Received 1st submission Re-submission Date Date Received 2nd submission

Student declaration

I certify that the assignment submission is entirely my own work and I fully understand the consequences of plagiarism. I understand that making a false declaration is a form of malpractice.

Student’s signature Grading grid


Summative Feedback: Resubmission Feedback:

Grade: Assessor Signature: Date: IV Signature:


Assessment Brief

Student Name/ID

Unit Number and Title 14: Business Intelligence Academic Year 2019-2020

Unit Tutor Unit 14: Business Intelligence Assignment Number &

Title Assignment 1: Discover business process and BI technologies Issue Date

Submission Date March 4, 2023 IV Name & Date

Submission Format

The submission is in the form of a Microsoft® PowerPoint® style presentation to be presented to your colleagues. The presentation can include links to performance data with additional speaker notes and a bibliography using the Harvard referencing system. The presentation slides for the findings should be submitted with speaker notes as one copy. You are required to make effective use of headings, bullet points and subsections, as appropriate. Your research should be referenced using the Harvard referencing system. The recommended word limit is 500 words, including speaker notes, although you will not be penalised for exceeding the total word limit.

Unit Learning Outcomes

LO1 Discuss business processes and the mechanisms used to support business decision-making

LO2 Compare the tools and technologies associated with business intelligence functionality

Assignment Brief


Your company has been working in [Assumed Domain] for 2 years. For a new, young company, the competition in the market is very high. Therefore, the Board of Directors has decided to apply Business Intelligence to improve the company's business processes by making better decisions.

The Board of Directors assigns a small group, including you, in the Research & Development Department to study business intelligence and how to apply it in the company in the coming years. You need to research the business processes and decision support processes in the company and identify the types of data (unstructured, semi-structured or structured) generated by these processes, with examples. You also need to research the software currently used in the business processes or decision support processes and evaluate its use (benefits and drawbacks).

Next, you need to understand the types of support for decision-making at different levels (operational, tactical and strategic) within the company and study which business intelligence features can help with those types of support. Study the information systems or BI technologies that can be used in this case, then compare and contrast them to conclude which should be used. Your group needs to present the research results to the board in a presentation of 30 minutes.


Learning Outcomes and Assessment Criteria

LO1 Discuss business processes and the mechanisms used to support business decision-making

D1 Evaluate the benefits and drawbacks of using application software as a mechanism for business processing

P1 Examine, using examples, the terms ‘Business Process’ and ‘Supporting Processes’

M1 Differentiate between unstructured and semi-structured data within an organisation

LO2 Compare the tools and technologies associated with business intelligence functionality

D2 Compare and contrast a range of information systems and technologies that can be used to support organisations at operational, tactical and strategic levels.

P2 Compare the types of support available for business decision-making at varying levels within an organisation.

M2 Justify, with specific examples, the key features of business intelligence functionality.


5.1 Conclusions

5.2 Future works

References

Appendix


Table of Figures

Figure 1 The factors that impact house price
Figure 2 The summary of methodology
Figure 3 The raw dataset
Figure 4 Correlation
Figure 5 Two-variables Relationships
Figure 6 Correlation between Two Variables
Figure 7 Linear regression
Figure 8 Multiple regression
Figure 9 R-squares and Adjusted R-squares 1
Figure 10 R-squares and Adjusted R-squares 2
Figure 11 Model accuracy 1
Figure 14 Step 1: Install basic packages for this work
Figure 15 Step 2: Install packages for data visualization
Figure 16 Step 3: Install packages for modeling 1
Figure 17 Step 3: Install packages for modeling 2
Figure 18 Import in Jupyter
Figure 19 Input Data
Figure 20 Statements to describe data information
Figure 21 Heatmap
Figure 22
Figure 23
Figure 24 Explore data
Figure 25 Price versus Number of bathrooms
Figure 26 Price versus Grade
Figure 27 Price versus Square Feet of the houses excluding basement
Figure 28 Price versus Square Feet of 15 closest neighbors’ houses
Figure 29 OLS Regression Result between grade and price
Figure 30 Model visualization of grade and price
Figure 31 Appendix 1
Figure 32 Appendix 2


Scientists have already applied machine learning to a large number of data projects, and one of the most often used methods is Random Forest. Random forest is a common supervised machine learning approach for classification and regression problems (Sruthi ER, ). As we are aware, the goal of a model is to forecast future results in a variety of areas, including economics, business, and sport (Rachel, 2021). As a result, this approach is often used to develop models that use certain features to predict


1.2 Motivations

The foundations of high levels of transparency in the real estate sector include strictly enforced laws and regulations, high-quality, easily accessible market information and performance benchmarks, clear and fair practices, and high professional standards. To fulfill its role and operate efficiently, the real estate sector needs to be highly transparent. These foundations enable governments to operate efficiently, bringing long-term benefits to local communities and the environment, while helping businesses and investors to make decisions with confidence (Jeremy, 2018).

When people decide to purchase a home, they search for one that fits all of their specifications and is affordable. With the aid of machine learning, we can estimate home prices and judge whether a particular home is better suited for purchase or for resale at a higher price. In this work, I make housing price predictions for King County, Washington (WA). Predicting home prices in a region like King County is complicated, because real estate sales prices may be affected by a number of independent factors. Some characteristics, such as size, location, and housing area, can influence the price significantly.

1.3 Objectives

There are a few key goals in this work that I am concentrating on:

▪ What impact does the number of bathrooms (bathrooms) have on the price of a home?
▪ What effect does the grade of the house (grade) have on the price?
▪ How does the price of a house change with the square footage of the home excluding the basement (sqft_above)?
▪ How does the average indoor living space of the 15 nearest neighboring homes (sqft_living15) affect home prices?


I present the dataset in order to address the questions raised in the first chapter. Extracting information from raw data involves several stages. Figure 2 below illustrates these stages, such as data collection.

Figure 2 The summary of methodology

1.4 Summary

I described my work and laid out the project's goals in the first chapter. The remaining components of this work are a dataset introduction, my approach and findings, and an application demo.

2 Related works and dataset

2.1 Related works

Madhuri et al. (2019) used a variety of techniques, including multiple linear regression, ridge regression, LASSO regression, elastic net regression, and gradient boosting. The authors of that study compared these methods and measured the error of each model. Their findings indicate that multiple regression is one of the most effective models for forecasting home prices, since it has a relatively low error statistic.

The authors of another study (Rahadi et al., 2015) group the elements that influence home pricing into three categories: physical condition, concept, and location. A home's physical qualities are those that are visible to the naked eye, such as its size, the number of bedrooms, the presence of a kitchen, garage, or garden, the size of the lot and adjacent structures, and the age of the house. Conceptual characteristics, on the other hand, are ideas that developers use to attract purchasers, such as a minimalist home, a healthy and eco-friendly atmosphere, or an upscale location. A house's price is greatly influenced by its location, because the location affects the current land price (Xiao-zhu and Ling-wei, 2013). Furthermore, the location determines how convenient it is to reach family-friendly entertainment such as malls, gourmet tours, or places with breathtaking scenery. Access to public amenities such as schools, campuses, hospitals, and health centers also depends on the location (Kisilevich et al., 2013). Research has shown that these characteristics have a significant impact on home prices.

In conclusion, a lot of research has been done on how to predict home values using various machine learning techniques or models. For my project, I will develop models and make predictions using both linear regression and multiple regression. The project focuses on King County, Washington, United States. I will consider every feature in this dataset when deciding how to build a strong model.

2.2 Dataset

2.2.1 Data collection

I obtained the data from Kaggle (Lemsalu, 2017). The dataset contains home sale prices in King County, Washington, from May 2014 to May 2015. The raw dataset has 21 columns and more than 21,000 entries. The price column is the dependent variable, and all other columns, aside from id and date, are independent features. The head of the raw dataset is shown in Figure 3.

In Figure 3, price is the dependent continuous variable of this study, and the other factors are independent variables.
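The split between the dependent variable and the independent features described above can be sketched in a few lines of Python. This is only an illustrative sketch: the three rows are made-up placeholder values (not taken from the Kaggle file), only a subset of the 21 columns is shown, and the helper names `split_columns` and `load_rows` are hypothetical.

```python
import csv
import io

# Illustrative rows only: these values are made up to show the layout and
# are NOT taken from the King County dataset. Only a few columns are shown.
SAMPLE_CSV = """id,date,price,bathrooms,grade,sqft_above,sqft_living15
1,20140501,300000,1.0,7,1200,1300
2,20140502,450000,2.0,8,1800,1750
3,20140503,600000,2.5,9,2400,2300
"""

TARGET = "price"            # dependent variable
EXCLUDED = {"id", "date"}   # identifier columns, not used as features


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per house."""
    return list(csv.DictReader(io.StringIO(text)))


def split_columns(header):
    """Return (target, features) for a list of column names."""
    features = [c for c in header if c != TARGET and c not in EXCLUDED]
    return TARGET, features


rows = load_rows(SAMPLE_CSV)
target, features = split_columns(list(rows[0].keys()))

print("head of the data:", rows[0])
print("dependent variable:", target)
print("independent features:", features)
```

With the real file, the same split would leave price as the target and the remaining columns as candidate features.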

2.2.2 Dataset description

• Id: the house's individual identification number
• Date: the date when the house was sold
• Price: the home's price
• Bedrooms: the number of bedrooms
• Bathrooms: the number of bathrooms
• Sqft_living: the home's square footage
• Sqft_lot: the lot's square footage
• Floors: the number of floors
• Waterfront: whether the house has a waterfront view
• View: whether the house has a view
• Condition: the home's overall condition on a scale of 1 to 5
• Grade: the dwelling unit's overall grade on a scale of 0 to 10
• Sqft_above: the living area of the home, excluding the basement, in square feet
• Sqft_basement: the basement's area in square feet
• Yr_built: the year the house was built
• Yr_renovated: the year the house was renovated
• Zipcode: the home's zip code
• Latitude: the latitude coordinate of the house
• Longitude: the longitude coordinate of the house
• Sqft_living15: the interior living space of the homes of the 15 closest neighbors, in square feet
• Sqft_lot15: the sum of the 15 nearest neighbors' land lots in square feet

2.3 Summary

In this section, I reviewed the related work and mentioned several studies that use similar data but employ different approaches, which helps in choosing the most suitable method. I also identified the dependent and independent variables in the data and the total number of rows and columns, and defined the name of each component of the raw dataset.

3 Proposed model

3.1 Correlation

In essence, correlation measures the strength of the linear relationship between two variables (Hauke and Kossowski, 2011). The correlation coefficient is calculated with the following formula (David Groebner, 2017):

Figure 4 Correlation

The formula above is known as the Pearson product-moment correlation. The pattern of a scatter plot, as illustrated in Figure 5 below, indicates whether the two variables are correlated:

Figure 5 Two-variables Relationships

The correlation coefficient, r, ranges from -1.0 (a perfect negative correlation) to +1.0 (a perfect positive correlation). There is no linear correlation between the x and y variables if r = 0. The correlation is perfect if the scatter plot's data points all fall along a straight line. As a result, the stronger the linear relationship between the two variables, the further the correlation deviates from 0.0. The sign of the correlation coefficient shows the direction of the relationship (David Groebner, 2017).
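The behaviour of r described above can be checked with a small sketch of the Pearson product-moment correlation. The function name `pearson_r` and the sample values are assumptions made for illustration; they are not the project's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up illustrative values, not taken from the dataset.
x = [6, 7, 8, 9, 10]
y_up = [2 * v + 1 for v in x]       # points on a rising straight line
y_down = [50 - 3 * v for v in x]    # points on a falling straight line
y_flat = [1, -1, 0, -1, 1]          # no linear trend

print("rising line :", pearson_r(x, y_up))
print("falling line:", pearson_r(x, y_down))
print("no trend    :", pearson_r(x, y_flat))
```

Points on a rising line give r close to +1, points on a falling line give r close to -1, and the last series gives r close to 0.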


Figure 6 Correlation between Two Variables

3.2 Linear regression

The fundamental equation for simple linear regression (David Groebner, 2017) describes the relationship as follows, where x is the independent variable and y is the dependent variable, or outcome:

Figure 7 Linear regression
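As a minimal sketch, assuming the equation has the usual form y = b0 + b1·x (the names b0 and b1 are my notation), the least-squares intercept and slope can be computed as shown below. The function name `fit_simple_regression` and the sample values are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    b1 = s_xy / s_xx             # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Made-up illustrative values: y is exactly 1 + 2 * x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

b0, b1 = fit_simple_regression(x, y)
print("intercept b0:", b0)
print("slope b1    :", b1)
print("prediction for x = 5:", b0 + b1 * 5)
```

In the project itself, x would be a single feature such as grade and y would be price.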

3.3 Multiple regression

In this project, I use multiple regression to forecast the house price based on several features of the house. Here is the equation for multiple regression (David Groebner, 2017):

Figure 8 Multiple regression
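A minimal sketch of a multiple regression fit, assuming the usual form y = b0 + b1·x1 + ... + bk·xk (my notation), is given below. It solves the normal equations with plain Gaussian elimination so that it needs no external packages. The helper names and the sample values are hypothetical, and a real analysis would normally rely on a statistics library instead.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (features may be collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_regression(features, target):
    """Least-squares fit; returns [b0, b1, ..., bk] for k feature columns."""
    design = [[1.0] + list(row) for row in features]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up illustrative values: y is exactly 10 + 2 * x1 + 3 * x2.
features = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
target = [10 + 2 * x1 + 3 * x2 for x1, x2 in features]

coefficients = fit_multiple_regression(features, target)
print("b0, b1, b2:", coefficients)
```

Because the toy target is an exact linear combination of the two features, the fit recovers the coefficients up to rounding error.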

3.4 R-squares and Adjusted R-squares

The coefficient of determination R^2 and the adjusted R^2 are probably the most frequently used statistics in regression to assess how well a model fits the data. These statistics indicate how much variation in the response is explained by the model (Akossou and Palm, 2013).

Figure 9 R-squares and Adjusted R-squares 1

The multiple coefficient of determination, R-squared, is a statistical measure of how well the regression line represents the actual data points. In other words, it shows how closely the data match the regression model. R-squared values normally range from 0 to 1, that is, from 0% to 100%. A negative R-squared value indicates that the model's performance is subpar (Chicco et al., 2021). As an illustration, if the R-squared value is equal to 0.8, then the independent variables account for 80% of the variation in the target variable. The greater the R-squared score, the better the model fits.

Adjusted R-squared measures the percentage of variance explained by only those independent variables that substantially help explain the dependent variable. Unlike R-squared, adjusted R-squared rises only when an added independent variable has an impact on the dependent variable.
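The two statistics can be sketched as follows, assuming the standard definitions R^2 = 1 - SS_res / SS_tot and adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of observations and p is the number of independent variables. The function names and sample values are illustrative assumptions only.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when the target is constant")
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R-squared for the number of independent variables used."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more observations than features plus one")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up illustrative values, not model output from the project.
actual = [1, 2, 3, 4, 5]
good_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
bad_fit = [5, 4, 3, 2, 1]  # worse than always predicting the mean

r2_good = r_squared(actual, good_fit)
r2_bad = r_squared(actual, bad_fit)

print("R-squared, good fit:", r2_good)
print("R-squared, bad fit :", r2_bad)
print("adjusted, 1 feature :", adjusted_r_squared(r2_good, len(actual), 1))
print("adjusted, 2 features:", adjusted_r_squared(r2_good, len(actual), 2))
```

The bad fit gives a negative R-squared. For the same R-squared, the adjusted value drops as more features are counted, which is why adding a feature that explains nothing lowers it.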