Using Several Machine Learning Algorithms to Predict Students' Academic Performance

DECLARATION

I hereby declare that this master's thesis in Computer Science, titled "Using Several Machine Learning Algorithms to Predict Students' Academic Performance", is my own research work, carried out and presented under the scientific supervision of Dr. Đàm Thanh Phương, University of Information and Communication Technology, Thai Nguyen University. The findings in this thesis are entirely truthful and do not violate any intellectual property law or other law of Vietnam. If this is untrue, I take full responsibility before the law. All documents, articles, theses, and software tools by other authors that are reused in this thesis are explicitly attributed and listed in the references.

Thái Nguyên, 18 October 2020
Author of the thesis
Nguyễn Bích Quỳnh

ACKNOWLEDGEMENTS

The author sincerely thanks Dr. Đàm Thanh Phương, University of Information and Communication Technology, Thai Nguyen University, the scientific supervisor who guided the author in completing this thesis. Thanks also go to the teachers of the University of Information and Communication Technology, where the author studied and completed the master's program, for their dedicated teaching and help. Thanks to Lương Thế Vinh High School, Cẩm Phả, Quảng Ninh, where the author works, for creating favorable conditions for collecting data and completing the research and the study program. Finally, thanks to family, friends, and colleagues for their encouragement and help throughout the period of study, research, and writing.

Thái Nguyên, 18 ... 2020
Author of the thesis
Nguyễn Bích Quỳnh

LIST OF FIGURES

2.1 Information survey form
2.2 Information survey form (continued)
2.3 Some attributes (a)
2.4 Some attributes (b)
2.5 Some attributes (c)
2.6
2.7 Statistics of attributes with missing data
2.8 Clustering students by subject average
2.9 Feature selection with Lasso
3.1 Accuracy of explode_model using all features
3.2 Accuracy of explode_model using feature selection
3.3 Predicted student scores using all features
3.4 Predicted student scores using feature selection

LIST OF TABLES

3.1 Accuracy of models trained on data with all attributes
3.2 Accuracy of models trained on data with selected attributes

LIST OF SYMBOLS AND ABBREVIATIONS

R: set of real numbers
Z: set of integers
C: set of complex numbers
R^n: n-dimensional Euclidean space
C^k: space of functions with continuous derivatives up to order k
||.||: Euclidean norm
SVM: Support Vector Machine
LR: Linear Regression
NB: Naive Bayes
KNN: K Nearest Neighbors
TBCM: a student's average score across subjects
MLE: maximum likelihood estimation
MAP: maximum a posteriori estimation
NBC: Naive Bayes classifier
RF: Random Forest
AD: AdaBoost
GD: Gradient Boosting
IDE: Integrated Development Environment

TABLE OF CONTENTS

Declaration
Acknowledgements
List of figures
List of tables
List of symbols and abbreviations
Introduction
Chapter 1. OVERVIEW OF MACHINE LEARNING
1.1 Machine learning algorithms
1.2 Data
1.3 Basic problems in machine learning
1.4 Grouping of machine learning algorithms
1.5 Loss functions and model parameters
Chapter 2. DATA COLLECTION AND PROCESSING
2.1 Problem statement
2.2 Data collection
2.3 Feature engineering
Chapter 3. MODEL TRAINING AND EVALUATION OF RESULTS
3.1 Algorithms selected for model training
3.2 Model training
3.3 Selecting and optimizing model parameters
3.4 Results and evaluation
General conclusion
References
APPENDIX
3.1 Data processing: file process-data.py
3.2 File main

INTRODUCTION

As society develops, the use of computers in work and daily life has produced large and complex volumes of data (big data) that are digitized and stored on computers. These data sets include structured, unstructured, and semi-structured data: online sales records, website traffic, personal information, people's daily habits, and so on. They contain valuable information that, if properly exploited, becomes knowledge and an asset of great value. The challenge is to devise suitable methods, algorithms, and tools to analyze such large amounts of data. Computers can analyze and process large, complex data sets and find patterns and rules beyond the capacity, computing speed, and memory of the human brain. The concept of machine learning arose from this observation.

The idea of machine learning is that computers can learn automatically from experience [1]. A computer can analyze a large amount of data, find the patterns and rules hidden in it, and use those rules to describe new data automatically while continually improving.

Machine learning has applications in many fields. Search engines use it to build better relationships between search terms and web pages. By analyzing page content, a search engine determines which words and phrases matter most in characterizing a given page and uses them to return relevant results for a given search phrase [2]. Image recognition uses machine learning to identify specific objects, such as faces [5]. The learning algorithm first analyzes images that contain a given object. Given enough images, the algorithm can determine whether a new image contains that object [3]. Machine learning is also used to understand which products a customer is interested in, by analyzing the products that user bought in the past. The computer can then recommend new products that the customer is likely to buy [1].

All these examples share the same principle: the computer processes data and learns to characterize it, then uses that knowledge to make decisions about future data. Depending on the type of input data, machine learning algorithms are divided into supervised and unsupervised learning. In supervised learning, the input data is labeled and has a known structure [1], [5]. The input data is called training data. The algorithm's task is usually to build a model that predicts some attributes from known ones. Once built, the model is used to process data with the same structure as the input. In unsupervised learning, the input data is unlabeled and has no known structure. The algorithm's task is to discover structure in the data [2].

At the suggestion of my supervisor, I began to study the application of machine learning in education, with the task of predicting students' learning outcomes from data collected about them. This research direction attracts the attention of many scientists worldwide [6], [7], [8]. In [7], the authors use classification methods such as neural networks, NB, and decision trees combined with Bagging, Boosting, and Random Forest to improve prediction accuracy. The prediction results are evaluated at 80%. In [8], the authors survey a number of papers on machine learning in education and consider application trends for the coming years. [6] is a master's thesis that uses three methods, linear regression, decision trees, and Naive Bayes classification, to predict academic performance. [6] also focuses on feature engineering techniques for processing data and selecting features to improve prediction performance.

This thesis studies several machine learning algorithms, performs data cleaning and feature selection on the collected data, and from there builds a model that predicts student performance, specifically the student's average score across subjects. By analyzing the data that students provide, we hope to find how the factors represented in the feature vector correlate with and influence the outcome, that is, the student's average score. From this we hope to offer recommendations that help students build on their strengths, reduce their weaknesses, and learn more effectively.

The thesis consists of three chapters:

Chapter 1. Overview of machine learning. This chapter presents an overview of machine learning, concepts related to applying machine learning models, and related terminology. It includes:
1.1 Overview of machine learning
1.2 Data
1.3 Basic problems in machine learning
1.4 Grouping of machine learning algorithms

...

C: float, default=1.0. Inverse of the regularization strength. It controls the amount of regularization and is the reciprocal of the regularization coefficient lambda, so smaller values mean stronger regularization.

solver: {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'. The algorithm used for the optimization problem. For small data sets, 'liblinear' is a good choice, while 'sag' and 'saga' are faster on large ones. For multiclass problems, 'newton-cg', 'sag', 'saga', and 'lbfgs' can be used.

The parameter set for Logistic Regression [4]: 'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'. The accuracy achieved is 71%.

Random Forest: the important parameters of the Random Forest algorithm include [4]:

n_estimators: int, default=100. The number of trees in the forest.

max_depth: int, default=None. The maximum depth of a tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

max_features: {"auto", "sqrt", "log2"}, int or float, default="auto". The number of features to consider when looking for the best split:
If int, max_features features are considered at each split.
If float, max_features is a fraction and (max_features * n_features) features, rounded, are considered at each split.
If "auto", max_features = sqrt(n_features).
If "sqrt", max_features = sqrt(n_features) (same as "auto").
If "log2", max_features = log2(n_features).
If None, max_features = n_features.

The parameter set for RandomForestClassifier [4]: 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 100. The accuracy achieved is 79%.

Gradient Boosting: the important parameters of Gradient Boosting include [4]:

n_estimators: int, default=100. The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually gives better performance.

learning_rate: float, default=0.1. The learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

subsample: float, default=1.0. The fraction of samples used to fit each individual base learner. If smaller than 1.0, this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample smaller than 1.0 reduces variance and increases bias.

The parameter set for GradientBoostingClassifier: 'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 10, 'subsample': 1.0. The accuracy achieved is 77%.

    Best score: 0.841492 using LogisticRegression {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
    Score LogisticRegression Data Test= 0.8641304347826086
    Best score: 0.838270 using KNeighborsClassifier {'metric': 'euclidean', 'n_neighbors': 19, 'weights': 'uniform'}
    Score KNeighborsClassifier Data Test= 0.8804347826086957
    Best score: 0.838270 using SVM {'C': 50, 'gamma': 'auto', 'kernel': 'poly'}
    Score SVM Data Test= 0.8804347826086957
    Best score: 0.839892 using DecisionTreeClassifier {'criterion': 'gini', 'max_depth': 2, 'max_features': 'log2', 'min_samples_leaf': 10, 'min_samples_split': }
    Score DecisionTreeClassifier Test= 0.8804347826086957
    Best score: 0.840432 using RandomForestClassifier {'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 10}
    Score RandomForestClassifier Data Test= 0.8804347826086957

3.4 Results and evaluation

For convenience, prediction was built into a web demo. After choosing an algorithm, the user selects a test set, or the data to be predicted, and uploads it, then chooses whether to use all features or only the selected influential features. When run, the system outputs the predicted subject average score for each record together with the corresponding classification. Figures 3.3 and 3.4 show the predictions for three students, Trần Thế Anh, Vũ Tuấn Anh, and Nguyễn Minh Lâm, using all features and feature selection respectively. Results for the whole test set are displayed 10 rows per page, so only a few students are shown because the screen cannot display them all; scrolling reveals the rest. With feature selection, the results are clearly more accurate.

Figure 3.3: Predicted student scores using all features

From the objectively collected data set, the features that particularly influence students' academic performance can be identified after a series of steps: feature design, information collection, data normalization, and model building. The final result is a model that predicts students' learning ability. It can be applied and developed further to assess students' ability, identify their difficulties and obstacles, and propose suitable study methods.

Figure 3.4: Predicted student scores using feature selection

The model's accuracy reaches 80%, which is not ideal but is partly acceptable for use.

Alongside the results achieved, this thesis has many remaining limitations that need further work. The data does not yet cover the whole problem, which limits predictive ability. Many null values remain, and the collection and handling of this kind of data is not yet good. Sensitive features that cannot be computed should be limited. Further work includes collecting more data and exploring other data-processing methods, with the final goal of making the data richer and a more correct and objective reflection of reality.

GENERAL CONCLUSION

Under the dedicated guidance of the supervisor, and based on the approved thesis outline, the thesis has accomplished the following tasks:

(1) Studied models for solving real-world problems with machine learning and applied them to a concrete problem. Studied the machine learning algorithms in detail, from mathematical analysis to finding optimal solutions, and implemented them in Python with a dedicated IDE.

(2) Understood and carried out the feature-processing workflow: cleaning, normalization, imputation of missing values, and selection of the attributes that most influence the data.

(3) Understood and practised training, optimizing, and selecting parameters for machine learning models.

(4) Produced predictions and evaluations for students with and without labels.

Based on these results, the thesis is a good foundation for further research on the following topics:

• Machine learning methods for prediction and classification
• Optimizing the parameters of machine learning models
• Building practical applications that are feasible and highly effective

REFERENCES

[1] Vũ Hữu Tiệp, Machine learning. Ebook on machinelearningcoban.com, 2020.
[2] Hoàng Xuân Huấn, Giáo trình học máy. NXB ĐHQG Hà Nội, 2015.
[3] Aurélien Géron, Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, 2019.
[4] Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, Foundations of Machine Learning. Massachusetts Institute of Technology, 2018.
[5] Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge, 2012.
[6] Murat Pojon, Using Machine Learning to Predict Student Performance. M.Sc. Thesis, University of Tampere, 2017.
[7] Elaf Abu Amrieh, Thair Hamtini and Ibrahim Aljarah, Mining Educational Data to Predict Student's Academic Performance using Ensemble Methods. International Journal of Database Theory and Application, Vol. 9, No. 8 (2016), pp. 119-136.
[8] Kuca Danijel, Juricic Vedran and Dambic Goran, Machine Learning in Education - a Survey of Current Research Trends. Proceedings of the 29th DAAAM International Symposium, pp. 0406-0410, B. Katalinic (Ed.), Published by DAAAM International, ISBN 978-3902734-20-4, ISSN 1726-9679, Vienna, Austria, 2018.

APPENDIX - PROGRAM CODE

3.1 Data processing: file process-data.py

    import pickle
    import random

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sb
    from keras.layers import Dense
    from keras.models import Sequential
    from sklearn import model_selection
    # Public import path; the private modules sklearn.ensemble.bagging,
    # .gradient_boosting and .weight_boosting are not available in recent releases.
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  GradientBoostingClassifier, RandomForestClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier


    def grid_search(grid, model, X_train, y_train):
        """Exhaustive search over `grid` with repeated stratified 5-fold CV."""
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=451)
        # error_score=0 scores invalid combinations (e.g. 'l1' with 'lbfgs') as 0
        search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv,
                              scoring='accuracy', error_score=0)
        return search.fit(X_train, y_train)


    def tuning_logistic(X_train, y_train):
        model = LogisticRegression()
        grid = dict(solver=['newton-cg', 'lbfgs', 'liblinear'],
                    penalty=['l1', 'l2'],
                    C=[500, 100, 10, 1.0, 0.1, 0.01, 0.001])
        grid_result = grid_search(grid, model, X_train, y_train)
        print("Best score: %f using LogisticRegression %s"
              % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_


    def tuning_knn(X_train, y_train):
        model = KNeighborsClassifier()
        grid = dict(n_neighbors=range(1, 100, 2),
                    weights=['uniform', 'distance'],
                    metric=['euclidean', 'manhattan', 'minkowski'])
        grid_result = grid_search(grid, model, X_train, y_train)
        print("Best score: %f using KNeighborsClassifier %s"
              % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_


    def tuning_svm(X_train, y_train):
        model = SVC()
        grid = dict(kernel=['poly', 'rbf', 'sigmoid'],
                    C=[50, 10, 1.0, 0.1, 0.05, 0.01, 0.001],  # duplicate 0.01 removed
                    gamma=['scale', 'auto'])
        grid_result = grid_search(grid, model, X_train, y_train)
        print("Best score: %f using SVM %s"
              % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_


    def tuning_dt(X_train, y_train):
        model = DecisionTreeClassifier()
        grid = {
            # 'auto' was an alias of 'sqrt' for classifiers and is gone in recent releases
            'max_features': ['log2', 'sqrt'],
            'criterion': ['entropy', 'gini'],
            'max_depth': [2, 3, 5, 10, 50],
            'min_samples_split': [2, 3, 50, 100],
            'min_samples_leaf': [1, 5, 8, 10],
        }
        grid_result = grid_search(grid, model, X_train, y_train)
        print("Best score: %f using DecisionTreeClassifier %s"
              % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_


    def tuning_rf(X_train, y_train):
        model = RandomForestClassifier()
        grid = dict(n_estimators=[10, 30, 50, 70, 100],
                    max_features=['sqrt', 'log2'],
                    max_depth=[2, 3, 4])
        grid_result = grid_search(grid, model, X_train, y_train)
        print("Best score: %f using RandomForestClassifier %s"
              % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_


    def tuning_gd(X_train, y_train):
        model = GradientBoostingClassifier()
        grid = dict(learning_rate=[0.001, 0.01, 0.1],
                    n_estimators=[5, 10, 20],
                    subsample=[0.5, 0.7, 1.0],
                    max_depth=[2, 3, 4])
        grid_result = grid_search(grid, model, X_train, y_train)
        print("Best: %f using GradientBoostingClassifier %s"
              % (grid_result.best_score_, grid_result.best_params_))
        return grid_result.best_params_


    def compare_model(X, y):
        """Cross-validate the default models and draw a box plot of their accuracy."""
        models = [
            ('LR', LogisticRegression()),
            ('KNN', KNeighborsClassifier()),
            ('Tree', DecisionTreeClassifier()),
            ('NB', GaussianNB()),
            ('SVM', SVC()),
            ('RF', RandomForestClassifier()),
            ('AD', AdaBoostClassifier()),
            ('GD', GradientBoostingClassifier()),
            ('BG', BaggingClassifier()),
        ]
        results, names = [], []
        for name, model in models:
            # shuffle=True is required when random_state is set
            kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
            cv_results = model_selection.cross_val_score(model, X, y, cv=kfold,
                                                         scoring='accuracy')
            results.append(cv_results)
            names.append(name)
            print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
        # box plot comparing the algorithms
        fig = plt.figure(figsize=(20, 10))
        fig.suptitle('Algorithm Comparison')
        ax = fig.add_subplot(111)
        plt.boxplot(results)
        ax.set_xticklabels(names)
        plt.show()


    def explode_model(X, y, columns='all'):
        """Tune each model, refit it on the training split, save it, print test accuracy."""
        path_model = 'Model/'
        suffix = '' if columns == 'all' else '_selection'
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                            random_state=42)

        def save(model, name):
            with open(path_model + name + suffix + '.sav', 'wb') as f:
                pickle.dump(model, f)

        model = LogisticRegression(**tuning_logistic(X_train, y_train)).fit(X_train, y_train)
        save(model, 'logistic')
        print('Score LogisticRegression Data Test=', model.score(X_test, y_test))
        print('=' * 100)

        model = KNeighborsClassifier(**tuning_knn(X_train, y_train)).fit(X_train, y_train)
        save(model, 'knn')
        print('Score KNeighborsClassifier Data Test=', model.score(X_test, y_test))
        print('=' * 100)

        model = SVC(**tuning_svm(X_train, y_train)).fit(X_train, y_train)
        save(model, 'svm')
        print('Score SVM Data Test=', model.score(X_test, y_test))
        print('=' * 100)

        # the decision tree is evaluated but not saved, as in the original listing
        model = DecisionTreeClassifier(**tuning_dt(X_train, y_train)).fit(X_train, y_train)
        print('Score DecisionTreeClassifier Test=', model.score(X_test, y_test))
        print('=' * 100)

        model = RandomForestClassifier(**tuning_rf(X_train, y_train)).fit(X_train, y_train)
        save(model, 'rf')
        print('Score RandomForestClassifier Data Test=', model.score(X_test, y_test))
        print('=' * 100)

        model = GradientBoostingClassifier(**tuning_gd(X_train, y_train)).fit(X_train, y_train)
        save(model, 'xgb')
        print('Score GradientBoostingClassifier Data Test=', model.score(X_test, y_test))


    def create_model_mlp(X, y, ip=43, epochs=25):
        """Train a 4-class MLP; y must be one-hot encoded for categorical_crossentropy."""
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                            random_state=451)
        model = Sequential()
        model.add(Dense(128, input_dim=ip, activation='relu'))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(16, activation='relu'))
        model.add(Dense(4, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
        # the original ignored `epochs` and always trained for 25 epochs
        model.fit(X_train, y_train, epochs=epochs, batch_size=4)
        score, acc = model.evaluate(X_test, y_test, batch_size=4)
        print('Test score:', score)
        print('Test accuracy:', acc)
        path = 'Model/model_mlp' if ip > 30 else 'Model/model_mlp_select'
        with open(path + ".json", "w") as json_file:
            json_file.write(model.to_json())
        # serialize weights to HDF5
        model.save_weights(path + ".h5")
        print("Saved model to disk")

3.2 File main

The listing for this file is identical to the one in 3.1.

... an overview of machine learning and the concepts related to applying machine learning models, as a basis for the content that follows. 1.1 Machine learning algorithms. A machine learning algorithm is an algorithm that is able to learn from data...

to obtain a prediction model, a machine learning model that produces predictions. The task is to produce predicted learning outcomes from the experience of many students whose outcomes are known, where each student data point is described by its components (a data vector), the predicted outcome...

training data. A machine learning program able to predict students' academic performance is the goal that this thesis addresses. The computer relies on the data of students whose performance is known and analyzes the features that influence
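As a rough illustration of what the exhaustive search in the appendix's tuning functions costs, the sketch below counts the candidate parameter combinations and the resulting cross-validation fits for three of the grids. The grids and the 5-split, 5-repeat setting are copied from the listing; the helper `count_fits` and the dictionary layout are hypothetical names introduced only for this example, and the final refit on the full training split is not counted.

```python
from itertools import product

# Grids copied from the tuning functions in the appendix listing.
grids = {
    "LogisticRegression": {
        "solver": ["newton-cg", "lbfgs", "liblinear"],
        "penalty": ["l1", "l2"],
        "C": [500, 100, 10, 1.0, 0.1, 0.01, 0.001],
    },
    "KNeighborsClassifier": {
        "n_neighbors": list(range(1, 100, 2)),
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan", "minkowski"],
    },
    "RandomForestClassifier": {
        "n_estimators": [10, 30, 50, 70, 100],
        "max_depth": [2, 3, 4],
        "max_features": ["sqrt", "log2"],
    },
}

N_SPLITS, N_REPEATS = 5, 5  # RepeatedStratifiedKFold settings from the listing


def count_fits(grid, n_splits=N_SPLITS, n_repeats=N_REPEATS):
    """Return (candidates, cross-validation fits) for an exhaustive grid search."""
    # Every combination of one value per parameter is one candidate.
    candidates = sum(1 for _ in product(*grid.values()))
    # Each candidate is fitted once per fold, for every repeat.
    return candidates, candidates * n_splits * n_repeats


fit_counts = {name: count_fits(grid) for name, grid in grids.items()}
for name, (candidates, fits) in fit_counts.items():
    print(f"{name}: {candidates} candidates, {fits} cross-validation fits")
```

Counts like these help explain why the grids in the listing are kept small: every value added to one parameter multiplies the number of fits for that model.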
