Improving Statistical Machine Translation Using Dependency Syntax Information

VIETNAM NATIONAL UNIVERSITY, HANOI

FINAL REPORT ON THE RESULTS OF A VNU-LEVEL SCIENCE AND TECHNOLOGY PROJECT

Project title: "Improving Statistical Machine Translation Using Dependency Syntax Information"
Project code: QG.15.23
Principal investigator: Dr. Nguyễn Văn Vinh

PART I. GENERAL INFORMATION

1.1 Project title: Improving Statistical Machine Translation Using Dependency Syntax Information

1.2 Project code: QG.15.23

1.3 Principal investigator and project members

No.  Title, degree, full name              Affiliation                                           Role in the project
1    Dr. Nguyễn Văn Vinh                   University of Engineering and Technology, VNU Hanoi   Principal investigator
2    PhD candidate Trần Hồng Việt          University of Economics - Technology for Industries   Project secretary
3    Assoc. Prof. Dr. Lê Anh Cường         University of Engineering and Technology, VNU Hanoi   Member
4    Assoc. Prof. Dr. Nguyễn Phương Thái   University of Engineering and Technology, VNU Hanoi   Member
5    PhD candidate Phạm Nghĩa Luân         Haiphong University                                   Member

1.4 Host institution: University of Engineering and Technology, Vietnam National University, Hanoi

1.5 Implementation period
1.5.1 Per contract: 24 months, from 2015 to 2017
1.5.2 Extension (if any): within 2017
1.5.3 Actual implementation: from 2015 to December 2017

1.6 Changes from the original proposal (if any) (regarding objectives, content, methods, research results, and organization of implementation; reasons; opinion of the managing agency)

1.7 Total approved budget: 200 million VND

PART II. OVERVIEW OF RESEARCH RESULTS

Introduction

The rapid growth of machine translation approaches has produced commercial products that are widely used around the world (Google Translate 1, Microsoft Translator 2, ...). One of the important problems in phrase-based statistical machine translation is how to generate the correct word (phrase) order in the target language. In Phrase-Based Statistical Machine Translation (PBSMT), phrase reordering is still simple and its quality is not yet high. In addition, languages differ in many characteristics (especially in word order), so the translation process cannot be modeled accurately [Och and Ney, 2004]. This has motivated many research directions that address word reordering within phrase-based statistical machine translation systems. Several studies that treat reordering as a preprocessing step have reported good results [Peng Xu et al., 2009; Jason Katz-Brown et al., 2011; Cai et al., 2014].

1 https://translate.google.com
2 http://www.microsofttranslator.com

The main idea of preprocessing-based reordering is to preprocess sentences in the source language (English) so that their word order is as close as possible to that of the target language (Vietnamese). The two main preprocessing-based research directions are constituency parsing of the source sentence and dependency parsing of the source sentence. A number of studies have used syntactic information to solve the word reordering problem. One method parses the source language and applies reordering rules as a preprocessing step. The idea is to transform the source sentence so that its word order is as close as possible to that of the target sentence, which makes training easier and improves word alignment quality.

Objectives

• Propose improved methods for phrase reordering in phrase-based statistical machine translation, following the preprocessing approach based on dependency parsing.
• Find ways to integrate dependency parsing information into statistical machine translation systems (selecting syntactic information, building manual and automatic reordering rules between the two languages). Focus experiments and evaluation on the English-Vietnamese language pair.
• Build an experimental Vietnamese-to-English translation program that integrates the techniques proposed and improved in the project.

Research methods

We apply the following methods and techniques:
- Use the dependency syntax of the source language and linguistic information to provide effective solutions to phrase reordering in the preprocessing step, applied to the English-Vietnamese language pair.
- Use machine learning techniques to integrate dependency syntax information effectively into statistical machine translation systems. Study, mine, and extend manual rules (built by hand) and
automatic rules (extracted automatically from corpora) for transformation between the two languages, and apply them to improve the quality of statistical machine translation.
- Propose techniques for effectively integrating linguistic knowledge (dependency syntax) into statistical machine translation systems.

Summary of research results

The project carried out the following research contents:

Content 1: Study methods for solving phrase reordering based on the preprocessing approach
- Study models and techniques for phrase reordering between language pairs based on preprocessing
- Implement and test phrase reordering techniques based on preprocessing

Content 2: Study how to integrate dependency syntax information into statistical machine translation systems
- Study techniques for integrating dependency syntax information into statistical machine translation systems
- Implement and test the integrated model

Content 3: Collect resources and preprocess them for bilingual data mining
- Study suitable text resources for collection
- Study and build a tokenization module for English
- Study and apply word segmentation techniques for Vietnamese

Content 4: Build an experimental Vietnamese-English statistical machine translation system
- Build a baseline Vietnamese-English translation system

Evaluation of results and conclusion

Results achieved:
- Studied methods for solving phrase reordering based on the preprocessing approach
- Studied models and techniques for phrase reordering between language pairs based on preprocessing
- Tested phrase reordering techniques based on preprocessing
- Studied how to integrate dependency syntax information into statistical machine translation systems
- Studied techniques for integrating dependency syntax information into statistical machine translation systems; implemented and tested the integrated model

Products:
- A program module for the phrase reordering technique
- An English-Vietnamese bilingual corpus
- 02 reports on phrase reordering models and techniques
- 01 journal article and 05 papers on phrase reordering methods at international conferences
- An experimental Vietnamese-English statistical machine translation system

Summary of results

Improving Statistical Machine Translation Using Dependency Syntax Information

Abstract: The rapid growth of machine translation approaches has produced commercial products that are widely used around the world (Google 3, Microsoft 4, Facebook, ...). One of the key problems in phrase-based statistical machine translation is how to generate the correct word (phrase) order in the target language. In Phrase-Based Statistical Machine Translation (PBSMT), phrase reordering is still simple and its quality is not yet high. In addition, languages differ in many characteristics (especially in word order), so the translation process cannot be modeled accurately. This has motivated many research directions that address word reordering within phrase-based statistical machine translation systems. The main idea of preprocessing-based reordering is to preprocess sentences in the source language (English) so that their word order is as close as possible to that of the target language (Vietnamese). The two main preprocessing-based research directions are constituency parsing of the source sentence and dependency parsing of the source sentence.

3 https://translate.google.com
4 http://www.microsofttranslator.com

There have been some studies on phrase-based statistical machine translation for the English-Vietnamese language pair. Research on phrase-based statistical machine translation that uses preprocessing with dependency syntax is still limited. Research on phrase reordering with preprocessing has mainly addressed the English-Vietnamese direction using constituency syntax. The challenges are:
- Existing studies mainly address the English-Vietnamese direction; the Vietnamese-English direction has not been covered.
- Some studies have applied dependency-based reordering to English-Vietnamese translation. However, they mainly use hand-written rules and have not applied automatically extracted rules to the translation task.
- Existing studies that use dependency-based preprocessing for the Vietnamese-English direction still have many limitations and need improvement to raise translation quality.

The project "Improving Statistical Machine Translation Using Dependency Syntax Information" focuses on these challenges in order to improve the quality of statistical machine translation, applying a range of research efforts that use dependency parsing in statistical translation.

PART III. PRODUCTS, PUBLICATIONS, AND TRAINING RESULTS
OF THE PROJECT

3.1 Research results

                                                                                  Scientific requirements and/or
                                                                                  economic-technical targets
No.  Product name                                                                 Registered   Achieved
1    Program module for the phrase reordering technique                           01           01
2    Reports on phrase reordering models and techniques                           03           03
3    Papers on phrase reordering methods                                          03           05
4    English-Vietnamese bilingual corpus                                          01           01
5    Experimental Vietnamese-English statistical machine translation system       01           01

3.2 Form and level of publication of results

For each product the form records: Status (published / accepted for publication / application filed / application accepted as valid / intellectual property certificate granted / product use confirmed); whether the VNU address and funding acknowledgment are stated as required; and the overall assessment (Pass / Fail).

1 Works published in international scientific journals indexed by ISI/Scopus
  1.1
  1.2

2 Monographs published or under a publishing contract
  2.1
  2.2

3 Intellectual property registrations
  3.1
  3.2

4 International papers not indexed by ISI/Scopus
  4.1
  4.2

5 Papers in VNU scientific journals, national specialized scientific journals, or papers published in international conference proceedings
  5.1 Viet-Hong Tran, Huyen Vu Thuong, Vinh Van Nguyen and Minh Le Nguyen, "Dependency-based Pre-ordering For English-Vietnamese Statistical Machine Translation", In VNU Journal of Science.
      Status: Published. Acknowledgment: Yes. Assessment: Pass.
  5.2 Viet Tran Hong, Vinh Van Nguyen and Minh Le Nguyen, "Improving English-Vietnamese Statistical Machine Translation Using Preprocessing Dependency Syntactic", Proceedings of Pacling 2015.
      Status: Published. Acknowledgment: Yes. Assessment: Pass.
  5.3 Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, "The English-Vietnamese Machine Translation System for IWSLT 2015", Proceedings of IWSLT 2015.
      Status: Published. Acknowledgment: Yes. Assessment: Pass.
  5.4 Viet Tran Hong, Huyen Vu Thuong, Vinh Nguyen Van and Minh Nguyen Le, "A Classifier-based Preordering Approach for English-Vietnamese Statistical Machine Translation", Proceedings of CICLing 2016 (ISI/Scopus).
      Status: Accepted for publication. Acknowledgment: Yes. Assessment: Pass.
  5.5 Viet Tran Hong, Huyen Vu Thuong, Thu Pham Hoai, Vinh Nguyen Van and Nguyen Le Minh, "A Reordering Model For Vietnamese-English Statistical Machine Translation Using Dependency Information", Proceedings of RIVF 2016 (IEEE).
      Status: Published. Acknowledgment: Yes. Assessment: Pass.
  5.6 Luan Nghia Pham, Vinh Van Nguyen and Huy Quang Nguyen, "Translation model adaptation for Statistical Machine Translation with domain classifier", Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 31), 2017 (Scopus).
      Status: Accepted for publication. Acknowledgment: Yes. Assessment: Pass.

6 Scientific reports giving recommendations or policy advice commissioned by the user organization
  6.1
  6.2

7 Results expected to be applied at policy-making agencies or science and technology application establishments
  7.1
  7.2

Notes:
- In the science and technology product column, list the product information in order.
- Scientific publications (papers, scientific reports, monographs, ...) are accepted only if they state the address and acknowledge VNU funding as required.
- Photocopies of the full text of the publications must be included in the evidence appendix of the report. For monographs, photocopies of the front cover, the first page, and the last page showing the publication code are required.

3.3 Training results

For each participant the form records: full name; time and funding for participation in the project (number of months / amount); related publications (science and technology products, dissertation, thesis); and whether the thesis has been defended.

PhD candidates

1 Trần Hồng Việt, 24 months
  - Viet Tran Hong, Vinh Van Nguyen and Minh Le Nguyen, "Improving English-Vietnamese Statistical Machine Translation Using Pre-processing Dependency Syntactic", Proceedings of the Pacific Association for Computational Linguistics (Pacling) 2015, p115-p121.
  - Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, "The English-Vietnamese Machine Translation System for IWSLT 2015", Proceedings of the 12th International Workshop on Spoken Language Translation, 2015, p80-p84.
  - Viet Tran Hong, Huyen Vu Thuong, Vinh Nguyen Van and Minh Nguyen Le, "A Classifier-based Preordering Approach for English-Vietnamese Statistical Machine Translation", Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, 2016. Available: http://site.cicling.org/2016/accepted.html
  - Viet Tran Hong, Huyen Vu Thuong, Thu Pham Hoai, Vinh Nguyen Van and Nguyen Le Minh, "A Reordering Model For Vietnamese-English Statistical Machine Translation Using Dependency Information", Proceedings of the International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2016.
  - Viet-Hong Tran, Huyen Vu Thuong, Vinh Van Nguyen and Minh Le Nguyen, "Dependency-based Pre-ordering For English-Vietnamese Statistical Machine Translation", In VNU Journal of Science: Computer Science and Communication Engineering, pages 175-179, 2017.

2 Phạm Nghĩa Luân, 24 months
  - Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, "The English-Vietnamese Machine Translation System for IWSLT 2015", Proceedings of the 12th International Workshop on Spoken Language Translation, 2015, p80-p84. Available: http://workshop2015.iwslt.org
  - Luan Nghia Pham, Vinh Van Nguyen and Huy Quang Nguyen, "Translation model adaptation for Statistical Machine Translation with domain classifier", Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 31), 2017.

Master's students

1 Vũ Thương Huyền, 12 months. Defended.
  - Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, "The English-Vietnamese Machine Translation System for IWSLT 2015",

V. H. Tran, H. T. Vu, V. V. Nguyen, M. L. Nguyen

Number of children of head   Number   Description
                             79142    Family has children
                             40822    Family has children
                             26008    Family has children
                             15990    Family has children
                             7442     Family has children
                             2728     Family has children
                             942      Family has children
                             307      Family has children
                             83       Family has children

Table: Number of families in the English-Vietnamese corpus.

Analysis and Discussion

We found that the results of our experiments correlate well with translation quality as judged manually. We also identified several sources of error, such as the quality of the source-sentence parse trees, the quality of the word alignment, and the quality of the corpus. All of these errors can affect the automatically extracted reordering rules. We focus mainly on some typical relations, such as noun phrases,
adjectival and adverbial phrases, and prepositions, and created a manually written set of reordering rules for the English-Vietnamese language pair. Our study employed dependency syntax and transformation rules to reorder the source sentence, and applied them to English-to-Vietnamese translation systems. For example, a noun phrase always contains a head noun together with the components before and after it. These auxiliary components move to new positions according to Vietnamese word order. The rules map common source-language phenomena to their target-language equivalents, as follows:
- the phrase-based systems applying rules with category JJ or JJS
- the phrase-based systems applying rules with category NN or NNS
- the phrase-based systems applying rules with category IN or TO

Based on these phenomena, translation quality improved significantly. We carried out an error analysis on sentences and compared the output with the gold reordering. Our analysis also shows the benefits of automatic reordering rules for translation quality. In combination with the machine learning method of related work [10], it shows that a classifier-based method can solve reordering problems automatically.

Based on the typical word-order differences between English and Vietnamese, we created a set of automatic rules for reordering the words of an English sentence according to Vietnamese word order. The rule types cover noun phrases, adjectival and adverbial phrases, and prepositional phrases. The table gives statistics on the families in our corpus whose number of children is at least a given threshold. Our approach limits the number of children in each family, so in the target language (Vietnamese) the number of children in each family is the same.

We compared the experimental results of phrase-based SMT systems applying manual rules with those of phrase-based SMT systems applying automatic rules. Because the manual rules are of good quality [7, 1], the phrase-based SMT systems applying manual rules perform better than the phrase-based SMT systems applying automatic rules. We believe that the quality of the phrase-based SMT systems applying automatic rules will improve when a better corpus is available.

A Classifier-based Preordering Approach for English-Vietnamese SMT

Conclusion

In this study, we presented a preprocessing approach based on a dependency parser. We used classifier-based preordering, training feature-rich discriminative classifiers with an SVM classification model [18] in the Weka toolkit [19] to extract rules automatically, and applied these rules to reorder the words of English sentences according to Vietnamese word order. We evaluated our approach on English-Vietnamese machine translation tasks. The experimental results showed that our approach achieved an improvement of 0.96 BLEU points over a state-of-the-art phrase-based baseline system. Our rules are learned automatically from the corpus and can cover many linguistic reordering phenomena. We believe that such reordering rules benefit the English-Vietnamese language pair.

In the future, we plan to investigate further along this direction and extend the rules to other languages. We would like to evaluate our method on trees with higher and deeper syntactic structure and on a larger corpus. We will also attempt to create more efficient preordering rules by exploiting the information in dependency structures.

Acknowledgements

The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).

References

1. Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of HLT-NAACL 2003, Edmonton, Canada (2003) 127-133
2. Och, F.J., Ney, H.: The alignment template approach to statistical machine translation. Computational Linguistics 30(4) (2004) 417-449
3. Chiang, D.: A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd
Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan (June 2005) 263-270
4. Zhang, Y., Zens, R., Ney, H.: Chunk-level reordering of source language sentences with automatically learned rules for statistical machine translation. In: Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation (2007) 1-8
5. Collins, M., Koehn, P., Kučerová, I.: Clause restructuring for statistical machine translation. In: Proc. ACL 2005, Ann Arbor, USA (2005) 531-540
6. Quirk, C., Menezes, A., Cherry, C.: Dependency treelet translation: Syntactically informed phrasal SMT. In: Proceedings of ACL 2005, Ann Arbor, Michigan, USA (2005) 271-279
7. Xia, F., McCord, M.: Improving a statistical MT system with automatically learned rewrite patterns. In: Proceedings of Coling 2004, Geneva, Switzerland, COLING (Aug 23-Aug 27 2004) 508-514
8. Xu, P., Kang, J., Ringgaard, M., Och, F.: Using a dependency parser to improve SMT for subject-object-verb languages. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado, Association for Computational Linguistics (June 2009) 245-253
9. Genzel, D.: Automatically learning source-side reordering rules for large scale machine translation. In: Proceedings of the 23rd International Conference on Computational Linguistics. COLING '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 376-384
10. Lerner, U., Petrov, S.: Source-side classifier preordering for machine translation. In: EMNLP (2013) 513-523
11. Li, C.H., Li, M., Zhang, D., Li, M., Zhou, M., Guan, Y.: A probabilistic approach to syntax-based reordering for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics. Volume 45 (2007) 720
12. Yang, N., Li, M., Zhang, D., Yu, N.: A ranking-based approach to word reordering for statistical machine translation. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Association for Computational Linguistics (2012) 912-920
13. Jehl, L., de Gispert, A., Hopkins, M., Byrne, B.: Source-side preordering for translation using logistic regression and depth-first branch-and-bound search. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, Association for Computational Linguistics (April 2014) 239-248
14. Habash, N.: Syntactic preprocessing for statistical machine translation. Proceedings of the 11th MT Summit (2007)
15. Jingsheng Cai, Masao Utiyama, E.S.Y.Z.: Dependency-based pre-ordering for Chinese-English machine translation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (2014)
16. Hoshino, S., Miyao, Y., Sudoh, K., Hayashi, K., Nagata, M.: Discriminative preordering meets Kendall's τ maximization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, Association for Computational Linguistics (July 2015) 139-144
17. Nakagawa, T.: Efficient top-down BTG parsing for machine translation preordering. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, Association for Computational Linguistics (July 2015) 208-218
18. Wang, L.: Support Vector Machines: theory and applications. Volume 177. Springer Science & Business Media (2005)
19. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explor. Newsl. 11(1) (November 2009) 10-18
20. Cer, D., de Marneffe, M.C., Jurafsky, D., Manning, C.D.: Parsing to Stanford dependencies: Trade-offs between speed and accuracy. In: 7th International Conference on Language Resources and Evaluation (LREC 2010) (2010)
21. Tran, V.H., Nguyen, V.V., Nguyen, M.L.: Improving English-Vietnamese statistical machine translation using preprocessing dependency syntactic. In: Proceedings of the 2015 Conference of the Pacific Association for Computational Linguistics (Pacling 2015) 115-121
22. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of ACL, Demonstration Session (2007)
23. Nguyen, T.P., Shimazu, A., Ho, T.B., Nguyen, M.L., Nguyen, V.V.: A tree-to-string phrase-based model for statistical machine translation. In: Proceedings of the Twelfth Conference on Computational Natural Language Learning (CoNLL 2008), Manchester, England, Coling 2008 Organizing Committee (August 2008) 143-150
24. Stolcke, A.: SRILM - an extensible
language modeling toolkit. In: Proceedings of International Conference on Spoken Language Processing. Volume 29 (2002) 901-904
25. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1) (2003) 19-51

Translation model adaptation for Statistical Machine Translation with domain classifier

Luan Nghia Pham
Haiphong University, Haiphong, Vietnam
luanpn@dhhp.edu

Vinh Van Nguyen
University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
vinhnv@vnu.edu

Huy Quang Nguyen
IT Department, Vocational College of Vinhphuc, Vinhphuc, Vietnam
quanghuyinfor@gmail.com

Abstract

In this paper, we propose a new domain adaptation method for SMT systems. Specifically, our method uses only monolingual data in the target language to adapt the translation model, and our system improves on the SMT baseline system. We use two techniques to improve the quality of the SMT system: (i) classifying the phrases in the phrase table of the SMT baseline system, and (ii) adapting the translation model by updating the direct translation probability of phrases. Our experiments are carried out on the English-Vietnamese language pair and use law domain (in-domain) and general domain (out-domain) data sets. The English-Vietnamese parallel corpus is provided by the IWSLT organizers, and the experimental results show that our method significantly outperforms the baseline system. Our system improved the quality of machine translation in the law domain by up to 0.9 BLEU points over the baseline trained on the English-Vietnamese data set.

1 Introduction

Statistical Machine Translation (SMT) systems (Koehn et al., 2003) are usually trained on large amounts of bilingual data and monolingual target-language data. In general, these corpora may cover quite heterogeneous topics, and each topic usually defines its own terminological lexicon. Terminologies need to be translated taking into
account the semantic context in which they appear.

The Neural Machine Translation (NMT) approach (Wu et al., 2016) has recently been proposed for machine translation. However, NMT has some limitations: NMT systems are computationally costly and resource-intensive, and they require much more training time than SMT systems (Bentivogli et al., 2016). Therefore, we continue our research on SMT.

Monolingual data are usually available in large amounts, whereas bilingual data are a sparse resource for most language pairs. Collecting sufficiently large, high-quality bilingual data is hard, especially for domain-specific data. For this reason, most of the world's languages are resource-poor for statistical machine translation, including the English-Vietnamese language pair.

When an SMT system is trained on a small amount of domain-specific data, its lexical coverage is narrow, which in turn results in low translation quality. On the other hand, SMT systems that are trained and tuned on domain-specific data perform well on the corresponding domains, but their performance deteriorates for out-domain sentences (Haddow and Koehn, 2012). Therefore, SMT systems often suffer from the domain adaptation problem in practical applications. When the test data and the training data come from the same domain, SMT systems can achieve good quality. Otherwise, the translation quality degrades dramatically. Domain adaptation is therefore of significant importance for developing translation systems that can be effectively transferred from one domain to another.

In recent years, the domain adaptation problem in SMT has become more important (Banerjee et al., 2010), and it is an active field of research in SMT, with more and more techniques being proposed and put into practice (Foster and Kuhn, 2007); (Banerjee et al., 2010); (Foster et al., 2013); (Dahlmeier et al., 2013); (Chen et al., 2013); (Hasler et al., 2014); (Masumura et al., 2015); (Onrust et al., 2016); (Cuong et al., 2016). The common techniques adapt the two main components of contemporary state-of-the-art SMT systems: the language model and the translation model. In addition, there are also some proposals for adapting a Neural Machine Translation (NMT) system to a new domain (Freitag and Al-Onaizan, 2016); (Chu et al., 2017). Although NMT systems are being studied more and more, domain adaptation for SMT systems still plays an important role, especially for machine translation systems for the English-Vietnamese language pair.

In this paper, we propose a new method to adapt the translation model of an SMT system for the English-Vietnamese language pair. Our method uses monolingual data in the target language and is based on classifying the phrases of the phrase table into a specific domain. This is the first adaptation method for the translation model of an SMT system for the English-Vietnamese language pair. Experimental results showed that our method significantly outperforms the baseline system. Our system improved the translation quality of the machine translation system on the in-domain data (law domain) by up to 0.9 BLEU points over the baseline.

The paper is organized as follows. In the next section, we present related work on the adaptation problem in SMT; Section 3 introduces our method; the following section describes and discusses the experimental results. Finally, we end with a conclusion and future work in the last section.

2 Related works

Domain adaptation for machine translation is known to be a challenging research problem with substantial real-world application, and it has been a topic of increasing interest in recent years. Recent work on machine translation domain adaptation has focused on either the language model component or the translation model component of an SMT system. Some authors used monolingual in-domain data and adapted the language model. The main advantage of language model adaptation in contrast to translation model adaptation is that
only monolingual in-domain data is needed.

(Xu et al., 2007) used a source-document classifier to classify an input document into a domain. This work makes the translation model shared across different domains.

For many language pairs and domains, no new-domain parallel training data is available. (Wu et al., 2008) machine-translate new-domain source-language monolingual corpora and use the resulting synthetic parallel corpus as additional training data, using handmade dictionaries and monolingual source and target language text.

(Banerjee et al., 2010) build several domain-specific translation systems and train a classifier to assign each incoming sentence to a domain; the domain-specific system is then used to translate the corresponding sentence. They assume that each sentence in the test set belongs to one of the already existing domains.

(Cuong et al., 2016) address multiple domains by training, tuning, and deploying a single translation system that is capable of producing adapted domain translations while preserving the original generic accuracy. The approach unifies automatic domain detection and domain model parameterization into one system.

The related works above automatically detect the domain, and the classifier works as a "switch" between two independent MT decoding runs.

To adapt a translation model trained on data from one domain to another, previous works paid more attention to parallel corpora while ignoring in-domain monolingual corpora, which can be obtained more easily.

Our method differs from the above methods in two ways. Firstly, we use a Maximum Entropy (ME) classification model to estimate the probability that a target phrase is in-domain. Secondly, our method uses monolingual data in the target language, and we study adaptation for the English-Vietnamese language pair.

3 Our method

In phrase-based SMT, a system is trained on a large parallel corpus of "general" (out-domain) training data, of which some subset (unknown domain) is
"in-domain". A phrase in the source language may have many translation hypotheses in the target language, each with a different probability. The hypothesis selected for translation into the target language is usually the one with the highest probability; this selection is defined by the translation model's scoring formula.

Monolingual in-domain data are usually available in large amounts. Thus, two basic adaptation approaches are usually pursued:
• Generating a synthetic parallel corpus with the SMT system and using this corpus to adapt its translation and reordering models;
• Using synthetic or provided target texts to adapt its language model.

Most of the related works in Section 2 use monolingual data to adapt the language model, or to synthesize bilingual data, or to classify source data into specific domains and then train in-domain translation models. Vietnamese is resource-poor in parallel corpora for SMT, so we propose a new method that uses only monolingual in-domain data to adapt the translation model, by classifying the phrases in the phrase table and updating each phrase's direct translation probability.

3.1 Overview of Phrase-Based Statistical Machine Translation

Our baseline is a standard phrase-based SMT system (Koehn et al., 2003). The statistical machine translation approach is based on the noisy-channel model. The best translation for a foreign sentence f is:

$$\hat{e} = \arg\max_{e} \; p(e)\, p(f \mid e) \qquad (1)$$

The model consists of two components: a language model that assigns a probability p(e) to any target sentence e, and a translation model that assigns a conditional probability p(f|e). The language model is learned from a monolingual corpus in the target language. The parameters of the translation model are estimated from a parallel corpus, i.e. the set of foreign sentences and their corresponding translations into the target language.

MT systems are phrase-based: parallel data is used to derive a phrase-based lexicon (Koehn et al., 2003). The resulting lexicon consists of a list of pairs (seq_f, seq_e), where seq_f is a sequence of one or more foreign words and seq_e is a predicted translation. Each pair comes with an associated score. At decoding time, all phrases from the sentence f are collected together with their corresponding translations observed in training. These are scored together with the language model scores and may include other features. The phrase-based approach of (Koehn et al., 2003) uses a log-linear model (Och and Ney, 2002), and the best translation maximizes the following:

$$\hat{e} = \arg\max_{e} \; p(e \mid f) \qquad (2)$$

$$\phantom{\hat{e}} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f) \qquad (3)$$

where h_m is a feature function, such as the language model score or a translation score, and λ_m is the corresponding feature weight.

Figure 1 presents the architecture of a phrase-based statistical machine translation system. Several kinds of translation knowledge can be used, such as language models and translation models; the system combines its component models (language model, translation model, word sense disambiguation, reordering model, ...).

Figure 1: Architecture of Phrase-based Statistical Machine Translation

3.2 Translation model adaptation based on phrase classification

In this section, we describe the domain adaptation techniques that we applied in our experiments for translation model adaptation:
- Train the basic component models of the SMT system, including the translation model and the language model.
- Apply the classification model to the phrases of the phrase table.

State-of-the-art SMT systems use a log-linear combination of models to decide the best-scoring target sentence given a source sentence. Among these models, the basic ones are a translation model P(e|f) and a target language model P(e). The translation model is based on phrases: we have a table of the probabilities of translating a specified source phrase f into a specified target phrase e, including phrase translation probabilities in both translation directions. In this paper we conduct the experiments with translation from English to
Vietnamese, we consided the direct phrase translation probability (?Helf) To build a classification model, we use the Stanford C la ssifie r to o lk it1 w ith Standard íigu ration s This toolkit uses a maximum entropy classifíer with character n-grams íeatures, The M axim um Entropy classiíier is a probabilistic classiíier which belongs to the class of exponential models The Maximum Entropy is baseđ on the principle o f M aximum Entropy and from all the models that fit training data, selects the one which has the largest estimate probability The M aximum Entropy Classiíìer is uscd to classiíy effectively text M aximum Entropy m odels can be shovvn to have the fo]lowing íòrmula: exp ( T , h f k ( x , y ) ) p{y\x) = ^ - * 77 ỹr E e x p ( E Afc/fc(x, z)) (4) k where Xk are model parameters and f k are features of the model (Berger et al., 1996) We trained classification model with classes, Law and G eneral classes, Our m ethod to translation model adaptation can be sum m arised by the following general algorithm and our system can be described in Figure Buiìd a phrase classiíication model in the target language on the in-domain data !https://n]p.stanford,edu/software/classifier.html Update the phrase’s direct translation probability if that phrase is within the specific indomain class Algorithm: Update the phrase’s đirect translation probability Input: all phrases of phrase-table Output:translation model is adapted 1:for all phrase in phrase-table 2: classify phrases 3: if phrase has law lable then 4: update the phrase's direct translation probability 6: end if 7:end for 8:return translation model is adapted Experiments and Results In this section, we describe the domain adaptation technique that we applied to our experiments for translation model adaptation 4.1 D ata sets We conduct our experiments on the data sets of the English-Vietnamese language pair We consider two very different domains that are law domaứi and general domain Detailed statistics for the data 
sets are given in Table 1.

In-domain data: We use monolingual law data in Vietnamese, collected from documents and dictionaries of the legal domain. This data set consists of 2238 manually labeled phrases, including 526 in-domain phrases (in the law domain, labeled Law) and 1712 out-domain phrases (in the general domain, labeled General). This data set is used to train the classification model with two classes, Law and General. Additionally, 500 parallel sentences in the law domain are used as the test set.

Out-domain data: We use all the general parallel training data provided by the IWSLT 2015 organisers for the English-Vietnamese translation task. Additionally, 1000 parallel sentences are used as the development set and another set of parallel sentences is used as the general-domain test set.

Data set   Statistic        English   Vietnamese
Training   Sentences        122132
           Average length   15.93     15.58
           Words            1946397   1903504
           Vocabulary       40568     284
Dev        Sentences        745
           Average length   16.61     15.97
           Words            12397     11921
           Vocabulary       2230      1986
G_test     Sentences        1046
           Average length   16.25     15.97
           Words            17023     16889
           Vocabulary       2701      275
L_test     Sentences        500
           Average length   15.21     15.48
           Words            7605      7740
           Vocabulary       1530      1429

Table 1: Summary statistics of the data sets: English-Vietnamese

Figure 2: Architecture of our translation model adaptation system

Preprocessing: Data pre-processing plays a very important role in any data-driven method. We carried out preprocessing in two steps:

• Cleaning data: We performed cleaning in two phases:
  - Phase 1: following the cleaning process described in (Pal et al., 2015).
  - Phase 2: using the corpus cleaning scripts in the Moses toolkit (Koehn et al., 2007), with limits on the minimum and maximum number of tokens per sentence (the maximum set to 80).
• Word segmentation:
  - Morphology in Vietnamese: Vietnamese does not have morphology (Thompson, 1963; Aronoff and Fudeman, 2011). In Vietnamese, whitespace is not used to separate words. The smallest meaningful part of Vietnamese orthography is a syllable (Ngo, 2001). Some examples of Vietnamese words are as follows:
    + Single words: "nhà" - house, "nhặt" - pick up, "mua" - buy and "bán" - sell.
    + Compound words: "mua bán" - buy and sell, "bàn ghế" - table and chair, "cây cối" - trees, "đường xá" - street, "hành chính" - administration.
    Thus, a word in Vietnamese may consist of several syllables separated by whitespace.
  - Morphology in English: In English, whitespace is used to separate words (Aronoff and Fudeman, 2011).
  We used the vnTokenizer script (Phuong et al., 2008), a popular toolkit for Vietnamese segmentation, to segment the Vietnamese data sets, and the tokenizer script in Moses to segment the English data sets.

4.2 Experiments

We performed experiments on the MT1 and MT2 systems:

• MT1 is an SMT baseline system. It is a phrase-based statistical machine translation system with standard settings in the Moses toolkit (Koehn et al., 2007). MT1 is trained on the General-domain (out-domain) data set and is evaluated sequentially on the G_test and L_test data sets.
• MT2 is the MT1 baseline system after its translation model has been adapted by applying the classification model to the phrases of the phrase-table and then updating a phrase's direct translation probability if that phrase is within the specific in-domain class. MT2 is evaluated on the L_test data set.

We use the BLEU score (Papineni et al., 2002) to evaluate the translation quality of the MT1 and MT2 systems.

4.2.1 Baseline System

Our MT1 baseline system is a standard phrase-based SMT system based on the Moses SMT toolkit (http://www.statmt.org/moses/) (Koehn et al., 2007), a state-of-the-art open-source phrase-based SMT system. It uses fourteen feature functions in a standard configuration. The system scores translation candidates using standard features: phrase-table features, including inverse phrase translation
probability $\phi(f \mid e)$, inverse lexical weighting $\mathrm{lex}(f \mid e)$, direct phrase translation probability $\phi(e \mid f)$ and direct lexical weighting $\mathrm{lex}(e \mid f)$; a phrase penalty (always $\exp(1) = 2.718$); a 7-feature lexicalized re-ordering model; and one language model. The translation and re-ordering models relied on "grow-diag-final" symmetrized word-to-word alignments built using GIZA++ (Koehn et al., 2003) and the training script of Moses. In our systems, feature weights were optimized using MERT (Och, 2003). A 4-gram language model with Kneser-Ney smoothing was used in all the experiments. We used SRILM (http://www.speech.sri.com/projects/srilm/) (Stolcke, 2002) as the language model toolkit.

4.2.2 Results

The MT1 baseline system is trained on the general domain data set and is evaluated sequentially on the G_test and L_test data sets. The MT2 system (after being adapted) is evaluated on the L_test data set. Experimental results of the MT1 and MT2 systems show that our system performs better than the baseline. The results are shown in Table 3, and some examples from our experiments are shown in Table 2.

[Table 2: Some examples in our experiments. Columns: source sentences (test on law domain); target sentences: reference sentences, MT1 system (baseline), MT2 system (our system).]

The phrase "took note" in the context "the working party took note of this commitment" (a source sentence in Table 2) should be translated as "ghi_nhận", as in the reference sentence. The MT1 baseline system translated the phrase "took note" as "nốt_nhạc", whereas our MT2 system translated it as "ghi_nhận", matching the reference sentence. Some other phrases in the source sentences of Table 2 also show that translation quality is better with our MT2 system than with the MT1 baseline system.

SYSTEM               BLEU SCORE
MT1 (with G_test)    31.3
MT1 (with L_test)    28.8
MT2 (with L_test)    29.7

Table 3: The experiment results of the MT1 and MT2 systems

Table 3 shows that, when the MT1 baseline system is trained on the general domain data set and the test data (here the G_test data set) is in the same domain as the training data, the BLEU score is 31.3. If the test data is in the law domain (here the L_test data set), the BLEU score is 28.8. And if the MT2 system (the MT1 baseline system after its translation model has been adapted) is tested on the law domain (here the L_test data set), the BLEU score is 29.7.

The experimental results in Table 3 show that, for an SMT system trained on the general domain, translation quality drops if the test domain differs from the training domain. In these experiments, the BLEU score fell by 2.5 points, from 31.3 to 28.8. Adapting the MT1 system with our technique improves translation quality on the same L_test data set in the law domain: the BLEU score improved by 0.9 points, from 28.8 to 29.7.

5 Conclusions and Future Works

In this paper, we presented a new method for domain adaptation in SMT. Our method uses only monolingual in-domain data to adapt the translation model, by classifying phrases in the phrase-table and updating each phrase's direct translation probability. Our system improved machine translation quality in the law domain by up to 0.9 BLEU points over the baseline. Experimental results show that our method is effective in improving translation accuracy.

In future work, we intend to study the problem in other domains and in multi-domain settings, and to integrate our technique automatically into the decoder of the SMT system.

Acknowledgement

The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).

References

Mark Aronoff and Kirsten Fudeman. 2011. What is Morphology. John Wiley and Sons.
Pratyush Banerjee, Jinhua Du, Baoli Li, Sudip Kr Naskar, Andy Way, and Josef van Genabith. 2010. Combining multi-domain statistical machine translation models using automatic classifiers. In Proceedings of AMTA 2010.
Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study.
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22.
Boxing Chen, Roland Kuhn, and George Foster. 2013. Vector space model for adaptation in statistical machine translation. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.
Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of simple domain adaptation methods for neural machine translation.
Hoang Cuong, Khalil Sima'an, and Ivan Titov. 2016. Adapting to all domains at once: Rewarding domain invariance in SMT. Transactions of the Association for Computational Linguistics (TACL).
Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications.
George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. Proceedings of the Second Workshop on Statistical Machine Translation, Prague. Association for Computational Linguistics.
George Foster, Boxing Chen, and Roland Kuhn. 2013. Simulating discriminative training for linear mixture adaptation in statistical machine translation. Proceedings of the MT Summit.
Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation.
Barry Haddow and Philipp Koehn. 2012. Analysing the effect of out-of-domain data on SMT systems. In Proceedings of the Seventh Workshop on Statistical Machine Translation.
Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, Edmonton, Canada.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In ACL-2007: Proceedings of demo and poster sessions, Prague, Czech Republic, pp. 177-180.
Ryo Masumura, Taichi Asami, Takanobu Oba, Hirokazu Masataki, Sumitaka Sakauchi, and Akinori Ito. 2015. Hierarchical latent words language models for robust modeling to out-of domain tasks. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1896-1901.
Binh N. Ngo. 2001. The Vietnamese language learning framework. Journal of Southeast Asian Language Teaching, 10.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL 2003, pages 160-167.
Louis Onrust, Antal van den Bosch, and Hugo Van hamme. 2016. Improving cross-domain n-gram language modelling with skipgrams. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.
Santanu Pal, Sudip Naskar, and Josef van Genabith. 2015. UdS-Sant: English-German hybrid machine translation system. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. ACL.
Le Hong Phuong, Nguyen Thi Minh Huyen, Azim Roussanaly, and Ho Tuong Vinh. 2008. A hybrid approach to word segmentation of Vietnamese texts.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.
Laurence C. Thompson. 1963. The problem of the word in Vietnamese. Word, Journal of the International Linguistic Association, 19(1):39-52.
Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 08), Manchester, UK.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg
Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
Jia Xu, Yonggang Deng, Yuqing Gao, and Hermann Ney. 2007. Domain dependent statistical machine translation. In Proceedings of the MT Summit XI.

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
No.: .../QĐ-ĐT
Hanoi, ... November 20...

DECISION
On recognizing the doctoral thesis title and supervisors for PhD students of the 2013 intake

THE RECTOR OF THE UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Pursuant to the Regulation on the Organization and Operation of Vietnam National University, Hanoi (VNU Hanoi), issued under Decision No. 600/TCCB dated 01/10/2001 of the Director of VNU Hanoi, which stipulates the powers and responsibilities of the rectors of member universities;
Pursuant to the Regulation on Postgraduate Training at VNU Hanoi, issued under Decision No. 1555/QĐ-ĐHQGHN dated 25/05/2011 and amended and supplemented under Decision No. 3050/QĐ-ĐHQGHN dated 17/09/2012 of the Director of VNU Hanoi;
Pursuant to Decision No. 637/QĐ-CTSV dated 16/09/2013 of the Rector of the University of Engineering and Technology on the recognition of PhD students of cohort K20;
Pursuant to Official Letter No. 197/CNTT-ĐT dated 22/10/2013 of the Dean of the Faculty of Information Technology on the assignment of supervisors for PhD students in 2013;
At the proposal of the Head of the Academic Affairs Office,

DECIDES:

Article 1. To recognize the doctoral thesis title and supervisors for PhD student Trần Hồng Việt, born 16/11/1979 in Hanoi, trained at the University of Engineering and Technology, VNU Hanoi, cohort 20 (enrolled in 2013), as follows:
Thesis title: Improving the quality of statistical machine translation for the English-Vietnamese language pair based on dependency parsing
Major: Computer Science. Code: 62 48 01 01
Principal supervisor: Dr. Nguyễn Văn Vinh, University of Engineering and Technology, VNU Hanoi
Co-supervisor: Dr. Nguyễn Lê Minh, Japan Advanced Institute of Science and Technology
Article 2. The supervisors and the PhD student named above enjoy the rights and are responsible for fully performing the obligations set out in the Regulation on Postgraduate Training of VNU Hanoi and other regulations of the University of Engineering and Technology.
Article 3. The Head of the Organization and Administration Office, the Head of the Academic Affairs Office, the Dean of the Faculty of Information Technology, the heads of the relevant offices, and the supervisors and PhD student named in Article 1 are responsible for implementing this decision.

Recipients:
- As in Article 3;
- VNU Hanoi (for reporting);
- Filed: VT, ĐT, DT10.

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
No.: .../QĐ-ĐT
Hanoi, ... October 20...

DECISION
On recognizing the doctoral thesis title and supervisors for PhD student Phạm Nghĩa Luân, cohort QH-2014

THE RECTOR OF THE UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Pursuant to the Regulation on the Organization and Operation of the member units and affiliated units of Vietnam National University, Hanoi (VNU Hanoi), issued under Decision No. 3568/QĐ-ĐHQGHN dated 08/10/2014 of the Director of VNU Hanoi;
Pursuant to the Regulation on Postgraduate Training at VNU Hanoi, issued under Decision No. 1555/QĐ-ĐHQGHN dated 25/05/2011 and amended and supplemented under Decision No. 3050/QĐ-ĐHQGHN dated 17/09/2012 of the Director of VNU Hanoi;
Pursuant to Decision No. 642/QĐ-CTSV dated 15/09/2014 of the Rector of the University of Engineering and Technology on the recognition of PhD students of cohort K21;
Pursuant to Official Letter No. 162/CNTT-ĐT dated 10/10/2014 of the Dean of the Faculty of Information Technology on the assignment of supervisors for PhD students in 2014;
At the proposal of the Head of the Academic Affairs Office,

DECIDES:

Article 1. To recognize the doctoral thesis title and supervisors for PhD student Phạm Nghĩa Luân, born 17/03/1983 in Hải Phòng, trained at the University of Engineering and Technology, VNU Hanoi, cohort 21 (enrolled in 2014), as follows:
Thesis title: Research on domain adaptation in statistical machine translation
Major: Information Systems. Code: 62 48 01 04
Principal supervisor: Dr. Nguyễn Văn Vinh, University of Engineering and Technology, VNU Hanoi
Co-supervisor: Dr. Phạm Việt Thắng, VU University Medical Center, the Netherlands
Article 2. The supervisors and the PhD student named in Article 1 enjoy the rights and are responsible for fully performing the obligations set out in the Regulation on Postgraduate Training of VNU Hanoi and other regulations of the University of Engineering and Technology.
Article 3. The Head of the Organization and Administration Office, the Head of the Academic Affairs Office, the Dean of the Faculty of Information Technology, the heads of the relevant offices and departments, and the supervisors and PhD student named in Article 1 are responsible for implementing this decision.

RECTOR
Nguyễn Việt Hà

Recipients:
- As in Article 3;
- VNU Hanoi (for reporting);
- Filed: VT, ĐT, DT10.

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VŨ THƯƠNG HUYÊN
RESEARCH ON NEURAL NETWORK-BASED LANGUAGE MODELS
Field: Software Engineering
Major: Information Technology
Code: 60 48 10
MASTER'S THESIS
SUPERVISOR: Dr. NGUYỄN VĂN VINH
Hanoi, 2013

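As a supplementary illustration of Section 3.2 of the paper reproduced above, the adaptation loop (classify each target phrase, then update the direct translation probability of phrases labeled Law) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The paper does not give the update formula, so the `interpolate` rule and its weight are hypothetical. `toy_classifier` is a keyword stand-in for the trained maximum entropy classifier, and the two phrase-table entries and their scores are invented for illustration. The sketch assumes the usual Moses text phrase-table layout, with fields separated by ` ||| ` and the third field holding the four scores φ(f|e), lex(f|e), φ(e|f), lex(e|f) in that order.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Position of the direct phrase translation probability phi(e|f) among the
# four phrase-table scores: phi(f|e) lex(f|e) phi(e|f) lex(e|f) (assumed layout).
DIRECT_PROB_INDEX = 2

# target phrase -> (predicted label, classifier probability of that label)
Classifier = Callable[[str], Tuple[str, float]]
# (old phi(e|f), classifier probability) -> new phi(e|f)
UpdateRule = Callable[[float, float], float]


def interpolate(old_prob: float, law_prob: float, weight: float = 0.5) -> float:
    """Hypothetical update rule: mix the old score with the classifier probability."""
    return (1.0 - weight) * old_prob + weight * law_prob


def adapt_phrase_table(
    lines: Iterable[str],
    classify: Classifier,
    update: UpdateRule = interpolate,
    in_domain_label: str = "Law",
) -> Iterator[str]:
    """Yield phrase-table lines, rewriting phi(e|f) for in-domain target phrases."""
    for line in lines:
        fields = line.rstrip("\n").split(" ||| ")
        if len(fields) < 3:
            # Not a "source ||| target ||| scores" entry: pass it through unchanged.
            yield line.rstrip("\n")
            continue
        scores = fields[2].split()
        label, label_prob = classify(fields[1])
        if label == in_domain_label and len(scores) > DIRECT_PROB_INDEX:
            old_prob = float(scores[DIRECT_PROB_INDEX])
            scores[DIRECT_PROB_INDEX] = f"{update(old_prob, label_prob):.6g}"
            fields[2] = " ".join(scores)
        yield " ||| ".join(fields)


def toy_classifier(target_phrase: str) -> Tuple[str, float]:
    """Stand-in for the trained classifier: a keyword lookup with fixed probabilities."""
    law_terms = {"ghi_nhận", "bộ_luật"}
    is_law = any(token in law_terms for token in target_phrase.split())
    return ("Law", 0.9) if is_law else ("General", 0.8)


if __name__ == "__main__":
    # Invented entries for illustration only.
    table = [
        "took note ||| ghi_nhận ||| 0.2 0.1 0.3 0.1",
        "took note ||| nốt_nhạc ||| 0.4 0.2 0.5 0.2",
    ]
    for entry in adapt_phrase_table(table, toy_classifier):
        print(entry)
```

In this toy run only the entry whose target phrase is labeled Law has its third score changed; the other entry passes through untouched. A real system would also have to decide whether φ(e|f) is renormalized over all target phrases of a source phrase after the update, which the paper does not specify.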