Research on applying speech recognition technology to the development of software for practicing English pronunciation on mobile devices

MINISTRY OF EDUCATION AND TRAINING
BA RIA - VUNG TAU UNIVERSITY

REPORT ON A UNIVERSITY-LEVEL SCIENCE AND TECHNOLOGY PROJECT

Research on applying speech recognition technology to the development of software for practicing English pronunciation on mobile devices

Project leader: Dr. Phan Ngọc Hoàng
BÀ RỊA - VŨNG TÀU, 02/2020

Project code: 10201
Project leader: Dr. Phan Ngọc Hoàng, Deputy Dean, Faculty of Information Technology, Electrical and Electronics Engineering
Main participants:
+ Dr. Phan Ngọc Hoàng, Deputy Dean, Faculty of Information Technology, Electrical and Electronics Engineering
+ Dr. Bùi Thị Thu Trang, Deputy Head of the Information Technology program, Faculty of Information Technology, Electrical and Electronics Engineering

Main content: The research group aims to create a practical and suitable solution that helps learners (students and lecturers of Ba Ria-Vung Tau University in particular, and learners in the wider community in general) overcome the difficulties of practicing English pronunciation. Given the rapid progress of speech recognition technology and the convenience of mobile devices, the group's solution applies speech recognition technology to the development of software that supports English pronunciation practice on mobile devices. The ultimate goal of the solution is a mobile application that supports learners of English.

Results achieved:
+ The research group completed the development of a mobile application that supports pronunciation practice using speech recognition technology.
+ The application is built on the iOS platform and integrates the speech recognition technology used by Siri, Apple's virtual assistant.
+ The application was highly rated and won second prize in the Ba Ria-Vung Tau provincial Science and Technology Innovation Contest, 2018-2019.
+ The research results were published in one paper in one venue indexed in ISI/SCOPUS: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, Vol. 298, pp. 157-166, Springer, 2019 (SCOPUS, Q4), ISSN 1867-8211.

Research period: 11/2018 to 11/2019

Office of Science, Technology and International Cooperation
Head of Faculty / Faculty Scientific Council
Project leader

TABLE OF CONTENTS
1 PROBLEM STATEMENT
2 APPROACH
3 IMPLEMENTATION
3.1 Database design and construction
3.1.1 Lesson
3.1.2 Pronunciation guide (Pronunciation)
3.1.3 Pronunciation exercise (Practice)
3.1.4 English word used for practice (Word)
3.1.5 Building the database with Core Data
3.2 Design and development of the software on the iOS platform
3.2.1 Viewing the list of lessons
3.2.2 Viewing how a sound is pronounced
3.2.3 Viewing the list of exercises
3.2.4 Choosing the practice mode
3.2.5 Practicing with single words
3.2.6 Summarizing practice results
3.2.7 Resetting practice
4 RESULTS
REFERENCES

1 PROBLEM STATEMENT

With the trend of integration and globalization, English is regarded as the most widely used language in the world. Nearly 60 countries use English as their main language, and nearly 100 more use it as a second language alongside their mother tongue. Foreign languages are therefore an important key in the era of integration and globalization. In this context, relationships between people, cooperation and investment in fields ranging from business, trade and transport to technology, media and tourism, and opportunities to study and work all extend to every country in the world. English is an effective tool and plays an important role in the success of many individuals and businesses.

In English, as in other languages, pronunciation is a foundational skill and is decisive for beginners. Pronunciation affects the learning of all the other skills, such as vocabulary, listening, speaking, reading and writing. Correct pronunciation makes a speaker easier to understand; when pronunciation is inaccurate, listeners may not understand, or may have to work hard to grasp what the speaker means. Correct pronunciation also means that the speaker knows how a word should sound, which helps greatly in understanding others who pronounce it correctly. This in turn makes it easier to follow videos, radio programs and conversations. Conversely, a learner who mispronounces a word will almost certainly fail to recognize that word when someone else says it correctly.

Learners of English have many self-study methods and effective tools for practicing correct pronunciation. For example, they can use the classic method of pronouncing words in front of a mirror to observe the exact movement of the lips and mouth. There are now also many mobile applications for practicing English pronunciation. With these tools, learners can record what they say and compare it with a model pronunciation to correct their mistakes. Applications that support learning English pronunciation generally share this functionality: the application shows how a word is pronounced and lets the learner listen to a model pronunciation, after which the learner records their own pronunciation and compares it with the model themselves. Alternatively, the learner hears or sees a word and types the word or its phonetic transcription so that the software can mark the answer right or wrong.

Figure 1. Example of pronunciation practice software on mobile devices

Most of these applications have not yet integrated speech recognition to check the learner's pronunciation. Some have integrated it, but do not use it to check and summarize the learner's level of completion for each sound to be learned.

Figure 2. Example of pronunciation practice software on mobile devices

With this way of learning, it is difficult for learners, especially beginners, to tell whether their own pronunciation is right or wrong. To solve this problem, learners usually need direct guidance from native English teachers or experienced English teachers in courses. Learners therefore incur considerable expense and have limited opportunities to work on their English pronunciation every day.

2 APPROACH

The research group aims to create a practical and suitable solution that helps learners (students and lecturers of Ba Ria-Vung Tau University in particular, and learners in the wider community in general) overcome the difficulties in practicing pronunciation described above. Given the rapid progress of speech recognition technology and the convenience of mobile devices, the group's solution applies speech recognition technology to the development of software that supports English pronunciation practice on mobile devices. The ultimate goal of the solution is a mobile application that supports learners of English by:
+ Using speech recognition technology to help learners check their own English pronunciation and adjust it accordingly.
+ Providing the functions already expected of a pronunciation practice tool, namely a list of practice words grouped by sound, with the phonetic transcription and model pronunciation of each word.
+ Helping learners practice English pronunciation anytime, anywhere, completely free of charge.

To achieve these goals, the development team researched and built a mobile application that supports English pronunciation practice using speech recognition technology, drawing on the following work:
+ Research on methods, materials and content related to practicing English pronunciation, so that suitable content can be included in the software.
+ Research on the speech recognition technologies currently being developed and their suitability for integration into the software.
+ Research on interface design and the relevant programming languages for building the software.

The application must perform the following tasks:
+ Convert pronunciation practice content from the source materials into an information system.
+ Allow users to view the list of lessons on English sounds and choose a lesson.
+ For the selected sound, allow users to review how the sound is pronounced.
+ For the selected sound, allow users to view the list of corresponding exercises and choose an exercise to practice.
+ Allow users to choose a practice mode: practice only the words not yet completed, or practice all the words in the exercise.
+ For each practice word:
  - allow users to view the phonetic transcription of the word;
  - listen to a model pronunciation by a native English speaker;
  - check whether the word was pronounced correctly or incorrectly using speech recognition technology.
+ Based on the pronunciation results for the words in an exercise, automatically summarize and show users the overall result for that exercise.
+ Based on the results of the exercises, automatically summarize and show users the overall result for the lesson on that sound.
+ Allow users to reset the results of an exercise in order to practice it again from the beginning.
+ Allow users to reset the results of a lesson in order to practice it again from the beginning.

3 IMPLEMENTATION

3.1 Database design and construction

The database design and construction work carries out the task of converting information and materials related to English pronunciation practice into a database system that serves the development of the application.

3.1.1 Lesson

To pronounce a word correctly, we need to rely on its phonetic transcription rather than on its spelling. In the example in Figure 3, both words are written wind, yet their pronunciations are completely different. The first, a noun, is pronounced /wɪnd/; the second, a verb, is pronounced /waɪnd/.

Figure 3. Example of the importance of pronouncing from the phonetic transcription

Therefore, to pronounce a word accurately, we need to rely on its phonetic transcription. To read these transcriptions, we use the International Phonetic Alphabet (IPA) chart for English. The English IPA chart contains 44 sounds, shown in the accompanying figure. Of these, 20 are vowel sounds and 24 are consonant sounds. These sounds combine to form the pronunciation of words.

Preface

The 8th EAI International Conference on Context-Aware Systems and Applications (ICCASA 2019) and the 5th EAI International Conference on Nature of Computation and Communication (ICTCC 2019) are international scientific events for research in the field of smart computing and communication. These two conferences were jointly held during November 28-29, 2019, in My Tho City, Vietnam. The aim, for both conferences, is to provide an internationally respected forum for scientific research in the technologies and applications of smart computing and communication. These conferences provide an excellent opportunity for researchers to discuss modern approaches and techniques for smart computing systems and their applications. The proceedings of ICCASA 2019 and ICTCC 2019 are published by Springer in the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering series (LNICST; indexed by DBLP, EI, Google Scholar, Scopus, Thomson ISI).

For this eighth edition of ICCASA and fifth edition of ICTCC, repeating the success of the previous years, the Program Committee received submissions from 12 countries and each paper was reviewed by at least three expert reviewers. We chose 20 papers after intensive discussions held among the Program Committee members. We appreciate the excellent reviews and lively discussions of the Program Committee
members and external reviewers in the review process. This year we had three prominent invited speakers: Prof. Herwig Unger from FernUniversität in Hagen, Germany; Prof. Phayung Meesad from King Mongkut's University of Technology North Bangkok (KMUTNB) in Thailand; and Prof. Waralak V. Siricharoen from Silpakorn University in Thailand.

ICCASA 2019 and ICTCC 2019 were jointly organized by The European Alliance for Innovation (EAI), Tien Giang University (TGU), and Nguyen Tat Thanh University (NTTU). These conferences could not have been organized without the strong support of the staff members of these three organizations. We would especially like to thank Prof. Imrich Chlamtac (University of Trento), Lukas Skolek (EAI), and Martin Karbovanec (EAI) for their great help in organizing the conferences. We also appreciate the gentle guidance and help from Prof. Nguyen Manh Hung, Chairman and Rector of NTTU, and Prof. Vo Ngoc Ha, Rector of TGU.

November 2019
Phan Cong Vinh
Abdur Rakib

Organization

Steering Committee
Imrich Chlamtac (Chair), University of Trento, Italy
Phan Cong Vinh, Nguyen Tat Thanh University, Vietnam
Thanos Vasilakos, Kuwait University, Kuwait

Organizing Committee

Honorary General Chairs
Vo Ngoc Ha, Tien Giang University, Vietnam
Nguyen Manh Hung, Nguyen Tat Thanh University, Vietnam

General Chair
Phan Cong Vinh, Nguyen Tat Thanh University, Vietnam

Program Chairs
Abdur Rakib, The University of the West of England, UK
Vangalur Alagar, Concordia University, Canada

Publications Chair
Phan Cong Vinh, Nguyen Tat Thanh University, Vietnam

Publicity and Social Media Chair
Cao Nguyen Thi, Tien Giang University, Vietnam

Workshop Chair
Nguyen Ngoc Long, Tien Giang University, Vietnam

Sponsorship and Exhibits Chair
Bach Long Giang, Nguyen Tat Thanh University, Vietnam

Local Chair
Duong Van Hieu, Tien Giang University, Vietnam

Web Chair
Do Nguyen Anh Thu, Nguyen Tat Thanh University, Vietnam

Technical Program Committee
Chernyi Sergei, Admiral Makarov State University of Maritime and Inland Shipping, Russia
Chien-Chih Yu, National ChengChi University, Taiwan
David Sundaram, The University of Auckland, New Zealand
Duong Van Hieu, Tien Giang University, Vietnam
François Siewe, De Montfort University, UK
Gabrielle Peko, The University of Auckland, New Zealand
Giacomo Cabri, University of Modena and Reggio Emilia, Italy
Hafiz Mahfooz Ul Haque, University of Lahore, Pakistan
Huynh Trung Hieu, Industrial University of Ho Chi Minh City, Vietnam
Huynh Xuan Hiep, Can Tho University, Vietnam
Ijaz Uddin, The University of Nottingham, UK
Iqbal Sarker, Swinburne University of Technology, Australia
Issam Damaj, The American University of Kuwait, Kuwait
Krishna Asawa, Jaypee Institute of Information Technology, India
Kurt Geihs, University of Kassel, Germany
Le Hong Anh, University of Mining and Geology, Vietnam
Le Nguyen Quoc Khanh, Nanyang Technological University, Singapore
Manisha Chawla, Google, India
Muhammad Athar Javed Sethi, University of Engineering and Technology (UET) Peshawar, Pakistan
Nguyen Duc Cuong, Ho Chi Minh City University of Foreign Languages - Information Technology, Vietnam
Nguyen Hoang Thuan, Can Tho University of Technology, Vietnam
Nguyen Manh Duc, University of Ulsan, South Korea
Nguyen Thanh Binh, Ho Chi Minh City University of Technology, Vietnam
Ondrej Krejcar, University of Hradec Kralove, Czech Republic
Pham Quoc Cuong, Ho Chi Minh City University of Technology, Vietnam
Prashant Vats, Fairfield Institute of Management & Technology in Delhi, India
Rana Mukherji, The ICFAI University Jaipur, India
Tran Huu Tam, University of Kassel, Germany
Tran Vinh Phuoc, Ho Chi Minh City Open University, Vietnam
Vijayakumar Ponnusamy, SRM IST, India
Waralak V Siricharoen, Silpakorn University, Thailand
Zhu Huibiao, East China Normal University, China

Contents

ICCASA 2019

Declarative Approach to Model Checking for Context-Aware Applications
  Ammar Alsaig, Vangalur Alagar, and Nematollaah Shiri
Planquarium: A Context-Aware Rule-Based Indoor Kitchen Garden
  Rahat Khan, Altaf Uddin, Ijaz Uddin, Rashid Naseem, and Arshad Ahmad
Text to Code: Pseudo Code Generation
  Altaf U Din and Awais Adnan
Context-Aware Mobility Based on π-Calculus in Internet of Thing: A Survey
  Vu Tuan Anh, Pham Quoc Cuong, and Phan Cong Vinh
High-Throughput Machine Learning Approaches for Network Attacks Detection on FPGA
  Duc-Minh Ngo, Binh Tran-Thanh, Truong Dang, Tuan Tran, Tran Ngoc Thinh, and Cuong Pham-Quoc
IoT-Based Air-Pollution Hazard Maps Systems for Ho Chi Minh City
  Phuc-Anh Nguyen, Tan-Ri Le, Phuc-Loc Nguyen, and Cuong Pham-Quoc
Integrating Retinal Variables into Graph Visualizing Multivariate Data to Increase Visual Features
  Hong Thi Nguyen, Lieu Thi Le, Cam Thi Ngoc Huynh, Thuan Thi My Pham, Anh Thi Van Tran, Dang Van Pham, and Phuoc Vinh Tran
An Approach of Taxonomy of Multidimensional Cubes Representing Visually Multivariable Data
  Hong Thi Nguyen, Truong Xuan Le, Phuoc Vinh Tran, and Dang Van Pham
A System and Model of Visual Data Analytics Related to Junior High School Students
  Dang Van Pham and Phuoc Vinh Tran
CDNN Model for Insect Classification Based on Deep Neural Network Approach
  Hiep Xuan Huynh, Duy Bao Lam, Tu Van Ho, Diem Thi Le, and Ly Minh Le
Predicting of Flooding in the Mekong Delta Using Satellite Images
  Hiep Xuan Huynh, Tran Tu Thi Loi, Toan Phung Huynh, Son Van Tran, Thu Ngoc Thi Nguyen, and Simona Niculescu
Development English Pronunciation Practicing System Based on Speech Recognition
  Ngoc Hoang Phan, Thi Thu Trang Bui, and V G Spitsyn
Document Classification by Using Hybrid Deep Learning Approach
  Bui Thanh Hung
A FCA-Based Concept Clustering Recommender System
  G Chemmalar Selvi, G G Lakshmi Priya, and Rose Bindu Joseph
Hedge Algebra Approach for Semantics-Based Algorithm to Improve Result of Time Series Forecasting
  Loc Vuminh, Dung Vuhoang, Dung Quachanh, and Yen Phamthe

ICTCC 2019

Post-quantum Commutative Encryption Algorithm
  Dmitriy N Moldovyan, Alexandr A Moldovyan, Han Ngoc Phieu, and Minh Hieu Nguyen
Toward Aggregating Fuzzy Graphs a Model Theory Approach
  Nguyen Van Han, Nguyen Cong Hao, and Phan Cong Vinh
An Android Business Card Reader Based on Google Vision: Design and Evaluation
  Nguyen Hoang Thuan, Dinh Thanh Nhan, Lam Thanh Toan, Nguyen Xuan Ha Giang, and Quoc Bao Truong
Predicted Concentration TSS (Total Suspended Solids) Pollution for Water Quality at the Time: A Case Study of Tan Hiep Station in Dong Nai River
  Cong Nhut Nguyen
Applying Geostatistics to Predict Dissolvent Oxygen (DO) in Water on the Rivers in Ho Chi Minh City
  Cong Nhut Nguyen

Author Index

Development English Pronunciation Practicing System Based on Speech Recognition

Phan Ngoc Hoang*, Bui Thi Thu Trang* and Spitsyn V.G.**
* Ba Ria-Vung Tau University, 80, Truong Cong Dinh, Vung Tau, Ba Ria-Vung Tau, Vietnam
** National Research Tomsk Polytechnic University, 30, Lenin Avenue, Tomsk, Russia
{hoangpn285,trangbt.084}@gmail.com
{hoangpn,trangbtt}@bvu.edu.vn
{spvg}@tpu.ru

Abstract. The relevance of the research stems from the need to apply speech recognition technology to language teaching. Speech recognition is one of the most important tasks in the fields of signal processing and pattern recognition. Speech recognition technology allows computers to understand human speech, and it plays a very important role in people's lives. This technology can be used to help people in a variety of ways, such as controlling smart homes and devices, using robots to perform job interviews, converting audio into text, etc. However, there are not many applications of speech recognition technology in education, especially in English teaching. The main aim of the research is to propose an algorithm in which speech recognition technology is used in English language teaching. The objects of research are speech recognition technologies and frameworks and the system of English spoken sounds. Research results: The authors have proposed an algorithm based on a speech
recognition framework for English pronunciation learning. The proposed algorithm can be applied to other speech recognition frameworks and to different languages. The authors also demonstrate how to use the proposed algorithm to develop an English pronunciation practicing system on the iOS mobile app platform. The system allows language learners to practice English pronunciation anywhere and anytime at no cost.

Keywords: Speech recognition, English pronunciation, Hidden Markov Models, Neural networks, mobile application

1 Introduction

1.1 Speech recognition technology

Speech recognition technology has been researched and developed over the past several decades. In the 1960s this technology was based on filter-bank analyses, simple time normalization methods and the beginnings of sophisticated dynamic programming methodologies. At that time the technology could recognize small vocabularies (10-100 words) of isolated words using simple acoustic-phonetic properties of speech sounds [1].

In the 1970s the key technologies of speech recognition were pattern recognition models, spectral representation using LPC methods, speaker-independent recognizers using pattern clustering methods, and dynamic programming methods for connected word recognition. During this time it became possible to recognize medium vocabularies (100-1000 words) using simple template-based and pattern recognition methods [1].

In the 1980s speech recognition technology started to tackle large vocabularies (1000 to an unlimited number of words) using statistical methods and neural networks for handling language structures. The important technologies of this period were the Hidden Markov Model (HMM) and the stochastic language model [1]. HMMs made it possible to combine different knowledge sources, such as acoustics, language, and syntax, in a unified probabilistic model.

In the 1990s the key technologies of speech recognition were stochastic language understanding methods, statistical learning of acoustic and language models, the finite state transducer framework and the FSM library. During this period speech recognition technology made it possible to build large vocabulary systems using unconstrained language models and constrained task syntax models for continuous speech recognition and understanding [1].

In the last few years, speech recognition technology has been able to handle very large vocabulary systems based on full semantic models, integrated with text-to-speech (TTS) synthesis systems and multi-modal inputs. The key technologies of this period are highly natural concatenative speech synthesis systems and machine learning to improve both speech understanding and speech dialogs [1].

1.2 Key speech recognition methods

Dynamic time warping (DTW)

Dynamic time warping (DTW) is an approach that was historically used for speech recognition. It was used to recognize vocabularies of about 200 words [2]. DTW divides speech into short frames (e.g. 10 ms segments) and processes each frame as a single unit. In the DTW era, speaker independence remained an unsolved problem. DTW was applied to automatic speech recognition to cope with different speaking speeds: it finds an optimal match between two given sequences (e.g., time series) under certain restrictions.

Hidden Markov Models (HMM)

DTW has been displaced by the more successful approach based on Hidden Markov Models. HMMs are statistical models that output a sequence of symbols or quantities. HMMs treat a speech signal as a piecewise stationary or short-time stationary signal: over a short time scale (e.g., 10 milliseconds), speech can be approximated as a stationary process. By the mid-1980s a voice-activated typewriter called Tangora had been created. It could handle a 20,000-word vocabulary [3] and processed speech using statistical modeling techniques such as HMMs. HMMs are too simplistic to account for many common features of human languages [4]. Nevertheless, they proved to
be a highly efficient model for speech recognition algorithms in the 1980s [1].

Neural networks

Neural networks have been used in speech recognition to solve many problems, such as phoneme classification, isolated word recognition, audiovisual speech recognition, audiovisual speaker recognition and speaker adaptation [5, 6]. Compared with HMMs, neural networks make fewer explicit assumptions about the statistical properties of features. Neural networks allow discriminative training in a natural and efficient manner, so they are effective at classifying short-time units such as individual phonemes and isolated words [7]. However, because of their limited ability to model temporal dependencies, neural networks have not been used successfully for continuous speech recognition on their own. To address this, neural networks are used to pre-process the speech signal (e.g. feature transformation or dimensionality reduction), and an HMM then recognizes speech based on the features produced by the neural network [8]. Recently, Recurrent Neural Networks (RNNs) have shown improved performance in speech recognition [9-11].

Like shallow neural networks, Deep Neural Networks (DNNs) can be used to model complex non-linear relationships. The architectures of DNNs generate compositional models, so DNNs have a huge learning capacity and great potential for modeling complex patterns in speech data [12]. In 2010, DNNs with large output layers based on context-dependent HMM states constructed by decision trees were successfully applied to large vocabulary speech recognition [13-15].

End-to-end automatic speech recognition

Traditional HMM-based approaches require separate components and separate training for the pronunciation, acoustic and language models. In addition, a typical n-gram language model, required by all HMM-based systems, often takes several gigabytes of memory, which makes it impractical to deploy on mobile devices [16]. Because of that, modern commercial ASR systems from Google and Apple are deployed in the cloud. Since 2014, however, end-to-end ASR models have learned all the components of the speech recognizer jointly, which simplifies the training and deployment process.

The first end-to-end ASR systems were based on Connectionist Temporal Classification (CTC) and were introduced by Alex Graves of Google DeepMind and Navdeep Jaitly of the University of Toronto in 2014 [17]. In 2016, the University of Oxford presented LipNet, which uses spatiotemporal convolutions coupled with an RNN-CTC architecture. It was the first end-to-end sentence-level lip reading model, and it surpassed human-level performance on a restricted-grammar dataset [18]. In 2018 Google DeepMind presented a large-scale CNN-RNN-CTC architecture; this system was reported to perform better than human experts [19].

1.3 Speech recognition applications

With speech recognition technology, computers can now hear and understand what people say to them and do what people want them to do. Speech recognition technology can be used in a variety of ways and plays a very important role in people's lives. For example, it can be used in in-car systems or smart home systems to help people do simple things by voice command, such as playing music or selecting a radio station, initiating phone calls, or turning lights, televisions and other electrical devices on and off.

In education, speech recognition technology can help students who are blind or have very low vision. They can operate a computer with voice commands instead of looking at the screen and keyboard [20]. Students who are physically disabled, or whose injuries make writing, typing or working difficult, can also benefit from this technology. They can use speech-to-text programs to do their homework or school assignments [21].

Speech recognition technology can also help students become better writers. They can improve the fluidity of their writing by using speech-to-text programs: when they dictate to a computer, they do not have to worry about spelling, punctuation, and other mechanics
of writing [21]. In addition, speech recognition technology can be useful for language learning. It can teach people proper pronunciation and help them develop their speaking skills [22].

Nowadays, most people have their own mobile devices and can use them anywhere, anytime. Most mobile apps and devices run on one of two main operating systems: iOS and Android. These operating systems are equipped with advanced speech recognition technology developed by Apple and Google. Many mobile apps use these speech recognition technologies for playing games, controlling devices, making phone calls, sending text messages, etc.

There are also many software applications for practicing English pronunciation on mobile devices. With these tools, learners can record what they say and compare it with the sample pronunciation of native speakers to correct errors. The applications typically display the pronunciation of a word and let learners listen to a sample pronunciation; the learners then record their own pronunciation and compare it with the sample themselves. These applications have generally not integrated speech recognition to test the learner's pronunciation. For this reason, building a mobile app that uses speech recognition technology for pronunciation learning is both timely and promising.

In this paper we present an algorithm that uses speech recognition technology to help people determine whether they pronounce an English sound properly. The proposed algorithm is used to build a mobile app based on speech recognition technology, and it is tested with the Apple speech recognition framework.

2 Proposed algorithm

In this paper, we propose an algorithm based on a speech recognition framework for English pronunciation learning. The framework used to test the proposed algorithm is Apple speech recognition technology [23]. We also demonstrate how to use the proposed algorithm to develop an English pronunciation practicing system on the iOS mobile app platform. The proposed algorithm can be applied to other speech recognition frameworks (e.g. Google speech recognition) and to different languages.

The main aim of the proposed algorithm is to help learners use speech recognition technology to test their own English pronunciation and make appropriate adjustments. The application provides learners with the inherent functions of an English pronunciation training tool and lets them practice English pronunciation anytime, anywhere, completely free of charge.

2.1 Apple speech recognition technology

The Apple speech recognition framework recognizes spoken words in recorded or live audio. It can be used to transcribe audio content to text, recognize verbal commands, etc. The framework is fast and works in near real time. It is also accurate and can interpret over 50 languages and dialects [23]. The process of a speech recognition task using Apple technology is shown in Fig. 1.

Fig. 1. Process of a speech recognition task in the speech recognition framework

Audio Input is the audio source from which transcription should occur. The audio source can be a recorded audio file or captured audio content, such as audio from the device's microphone. The audio input is then sent to the Recognizer, which checks the availability of the speech recognition service and initiates the speech recognition process. At the end, the process gives the partial or final results of speech recognition [23].

2.2 One-word pronunciation assessment

Based on this speech recognition framework, we propose an algorithm to assess the language learner's pronunciation. The process of pronunciation assessment for one word is presented in Fig. 2.

Fig. 2. Process of pronunciation assessment for one word

First, the language learner pronounces a word that is being used to practice pronunciation. The learner's pronunciation is then handled by the speech recognition framework, which gives the recognition result. After that, the recognition
result is compared with target word to determine if the learner correctly pronounce the target word (Fig 3) Fig Learner’s pronunciation assessment for one word 2.3 One sound pronunciation assessment In order to assess one sound pronunciation, we need to assess the pronunciations of list of words which contain the target sound The process of pronunciation assessment for one sound can be then presented in Figure Fig Process of pronunciation assessment for one sound At first the language learner pronounces one word of the list which contains the sound used to practice pronunciation Then the learner’s pronunciation is handled by recognition process After that the recognition result are processed by pronunciation asserting The language learner repeats these steps for other words of the list until all words of the list have been pronounced Based on the pronunciation results of words in the list, we can calculate the sound pronunciation fluency of the language learner by following formula: Sound pronunciation fluency = Total number of correctly pronounced words / Total number of words in the list 2.4 English pronunciation practicing system The English language contains 44 sounds divided into three main groups: vowels (12 sounds), diphthongs (8 sounds) and consonants (24 sounds) The vowel sounds consist of two sub-groups: long sounds and short sounds The consonant sounds consist of three sub-groups: voiced consonants, voiceless consonants and other consonants The phonemic chart of 44 English spoken sounds is presented in Table Based on the phonemic chart of spoken English sounds, proposed algorithm for word and sound pronunciation asserting, we developed an iOS app for English pronunciation practicing system The main aim of this system is to allow language learners can know if they correctly pronounce English sounds Based on the results, provided by this system, language learners will have proper adjustment to improve their English pronunciation Besides the app allows 
language learners to practice pronunciation freely, anywhere and anytime.

Table. Phonemic chart of English sounds

Group        Sub-group              Sounds
Vowels       Short sounds           ɪ e æ ʌ ʊ ə ɒ
Vowels       Long sounds            i: ɜ: u: ɔ: ɑ:
Diphthongs                          eɪ ɔɪ aɪ eə ɪə ʊə əʊ aʊ
Consonants   Voiceless consonants   p f θ t s ʃ ʧ k
Consonants   Voiced consonants      b v ð d z ʒ ʤ g
Consonants   Other consonants       m n ŋ h w l r j

The English pronunciation practicing system consists of 44 lessons, one for each of the 44 spoken English sounds (Fig. 5a). Each lesson has its own practice exercises. Depending on the sound, these exercises are normally divided into the following types: the sound is at the beginning of words; the sound is in the middle of words; the sound is at the end of words; the sound is followed by a vowel/consonant; the sound comes after a vowel/consonant (Fig. 5b).

Fig. 5. English pronunciation practicing system: a) list of lessons, b) examples of exercise types for the sound p

The language learner must practice all the words in an exercise list, after which the system automatically gives the recognition and pronunciation results for each word (Fig. 6). The system then calculates the pronunciation fluency for each sound and shows the results to the learner (Fig. 7).

Fig. 6. Example of one practice: a) practice overview and mode, b) practice answer mode, c) pronunciation result of one word

Fig. 7. Example of pronunciation assessment: a) pronunciation result for one practice, b) pronunciation result for practices, c) pronunciation result for sound

Conclusion

In this paper, we propose an algorithm based on a speech recognition framework for English pronunciation learning. The proposed algorithm can be applied to other speech recognition frameworks (e.g., Google speech recognition) and to other languages. We also demonstrate how to use the proposed algorithm to develop an English pronunciation practicing system as an iOS mobile app. This system allows language learners to determine whether they pronounce English sounds correctly. Based on these results,
the language learners can make appropriate adjustments to improve their English pronunciation. The system also allows language learners to practice English pronunciation anywhere and anytime at no cost, which is not possible in the classroom.

References

1. Juang B. H., Rabiner L. R. (2015) Automatic speech recognition–a brief history of the technology development [Online]. Available: https://web.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALI-ASRHistory-final10-8.pdf
2. Benesty J., Sondhi M. M., Huang Y., Springer Handbook of Speech Processing, Springer Science & Business Media, 2008.
3. Jelinek F. (2015) Pioneering Speech Recognition [Online]. Available: https://www.ibm.com/ibm/history/ibm100/us/en/icons/speechreco/
4. Huang X., Baker J., Reddy R., A Historical Perspective of Speech Recognition, Communications of the ACM, vol. 57, no. 1, pp. 94-103, 2014.
5. Hanazawa T., Hinton G., Shikano K., Lang K. J., Phoneme recognition using time-delay neural networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, 1989.
6. Wu J., Chan C., Isolated Word Recognition by Neural Network Models with Cross-Correlation Coefficients for Speech Dynamics, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1174-1185, 1993.
7. Zahorian S. A., Zimmer A. M., Meng F., Vowel Classification for Computer based Visual Feedback for Speech Training for the Hearing Impaired, ICSLP, 2002.
8. Hu H., Zahorian S. A., Dimensionality Reduction Methods for HMM Phonetic Recognition, ICASSP, 2010.
9. Sak H., Senior A., Rao K., Beaufays F., Schalkwyk J., Google voice search: faster and more accurate, Wayback Machine, 2016.
10. Fernandez S., Graves A., Hinton G., Sequence labelling in structured domains with hierarchical recurrent neural networks, Proceedings of IJCAI, 2007.
11. Graves A., Mohamed A., Schmidhuber J., Speech recognition with deep recurrent neural networks, ICASSP, 2013.
12. Deng L., Yu D., Deep Learning: Methods and Applications, Foundations and Trends in Signal
Processing, vol. 7, no. 3, pp. 197-387, 2014.
13. Yu D., Deng L., Dahl G., Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition, NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
14. Dahl G. E., Yu D., Deng L., Acero A., Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, IEEE Transactions on Audio, Speech, and Signal Processing, vol. 20, no. 1, pp. 30-42, 2012.
15. Deng L., Li J., Huang J., Yao K., Yu D., Seide F., Recent Advances in Deep Learning for Speech Research at Microsoft, ICASSP, 2013.
16. Jurafsky D., James H. M., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Stanford University, 2018.
17. Graves A., Towards End-to-End Speech Recognition with Recurrent Neural Networks, ICML, 2014.
18. Yannis M. A., Brendan S., Shimon W. N., Nando de Freitas, LipNet: End-to-End Sentence-level Lipreading, Cornell University, 2016.
19. Brendan S., Yannis A., Hoffman M. W. and others, Large-Scale Visual Speech Recognition, Cornell University, 2018.
20. National Center for Technology Innovation (2010) Speech Recognition for Learning [Online]. Available: http://www.ldonline.org/article/38655/
21. Follensbee B., McCloskey-Dale S., Speech recognition in schools: An update from the field, Technology and Persons with Disabilities Conference, 2018.
22. Forgrave K. E., Assistive Technology: Empowering Students with Disabilities, The Clearing House, vol. 7, no. 3, pp. 122-126, 2002.
23. Apple Inc. (2010) Speech framework [Online]. Available: https://developer.apple.com/documentation/speech
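
As an illustration of the assessment steps in Sections 2.2 and 2.3, the following Python sketch compares a recognition result with the target word and computes the sound pronunciation fluency as the ratio of correctly pronounced words to the total number of words in the list. It is a minimal sketch under stated assumptions, not the app's implementation: the recognition results are supplied as plain strings rather than obtained from a speech recognition framework, and the function names (normalize, is_word_correct, sound_fluency), the case- and punctuation-insensitive comparison, and the sample word list are all hypothetical.

```python
import string
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace.

    Assumption: the comparison ignores case and punctuation; the paper
    only states that the result is compared with the target word.
    """
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def is_word_correct(recognized: str, target: str) -> bool:
    """One-word assessment: the recognized text must match the target word."""
    return normalize(recognized) == normalize(target)


def sound_fluency(attempts: List[Tuple[str, str]]) -> float:
    """Fluency = correctly pronounced words / total words in the list.

    Each attempt is a (target_word, recognized_text) pair.
    """
    if not attempts:
        raise ValueError("the word list must not be empty")
    correct = sum(is_word_correct(rec, tgt) for tgt, rec in attempts)
    return correct / len(attempts)


# Hypothetical practice list for the sound "p" with made-up recognition results.
attempts = [
    ("pen", "pen"),
    ("happy", "Happy"),
    ("cup", "cub"),      # recognized as a different word: counted as incorrect
    ("stop", "stop."),
]
print(sound_fluency(attempts))
```

For this sample list, three of the four words match their targets, so the computed fluency is 0.75.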
