Research on Named Entity and Phenotype Entity Recognition in Text and Its Applications

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

TRẦN MAI VŨ

RESEARCH ON NAMED ENTITY AND PHENOTYPE ENTITY RECOGNITION IN TEXT AND ITS APPLICATIONS

Major: Information Systems
Code: 62.48.05.01

DOCTORAL DISSERTATION IN INFORMATION TECHNOLOGY

SCIENTIFIC SUPERVISORS:
Assoc. Prof. Dr. Hà Quang Thụy
Assoc. Prof. Dr. Nguyễn Lê Minh

Hanoi, 2018

DECLARATION

I declare that this dissertation is my own research work. Results written jointly with other authors have been included with the consent of the co-authors. The results presented in the dissertation are truthful and have not been published in any other work.

Author
Trần Mai Vũ

ACKNOWLEDGEMENTS

This dissertation was carried out at the Department of Information Systems, Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, under the scientific supervision of Assoc. Prof. Dr. Hà Quang Thụy and Assoc. Prof. Dr. Nguyễn Lê Minh.

First of all, I would like to express my deep gratitude to Assoc. Prof. Dr. Hà Quang Thụy and Assoc. Prof. Dr. Nguyễn Lê Minh, who introduced me to this research field. They taught and guided me with dedication and helped me find my way to success in research. Their constant encouragement and guidance made it possible for me to complete this dissertation.

I would like to thank the lecturers of the Faculty of Information Technology and the staff of the Training Office of the University of Engineering and Technology for the favorable conditions and help they gave me during my study and research at the university.

I thank Assoc. Prof. Dr. Nigel Collier and his colleagues for their valuable comments, which helped me improve the dissertation.

The encouragement and support of my friends were an important source of motivation for completing this work.

I am deeply grateful to my family and my wife, whose steady support made today's achievements possible.

Author
Trần Mai Vũ

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES AND GRAPHS
INTRODUCTION
Rationale for the topic
Specific objectives and scope of the dissertation
Structure of the dissertation
Chapter 1 - OVERVIEW OF ENTITY RECOGNITION
1.1 Basic concepts
1.1.1 Definition of the entity recognition problem
1.1.2 Challenges
1.1.3 Evaluation measures
1.1.4 Applications of entity recognition
1.2 Brief history of the research and some approaches to the problem
1.3 Entity recognition in Vietnamese text and related work
1.3.1 Challenges in processing Vietnamese data
1.3.2 Research motivation
1.3.3 Related work
1.4 Entity recognition in English biomedical text and related work
1.4.1 Challenges in processing biomedical data
1.4.2 Research motivation
1.4.3 Related work
1.5 Chapter summary
Chapter 2 - PERSON NAME ENTITY RECOGNITION COMBINED WITH NAMED ENTITY ATTRIBUTE RECOGNITION IN VIETNAMESE TEXT
2.1 Introduction
2.2 Related work
2.2.1 Related work worldwide
2.2.2 Related work in Vietnam
2.3 A model for person name entity recognition combined with entity attribute recognition
2.3.1 Maximum entropy model with beam search decoding (MEM+BS)
2.3.2 Conditional random fields (CRF)
2.3.3 Proposed model
2.3.4 Feature set
2.4 Experiments, results, and evaluation
2.4.1 Tools and evaluation data
2.4.2 Experimental results for the whole system
2.4.3 Experimental results per label
2.5 Applying the model to a Vietnamese person name question answering system
2.5.1 Problem overview
2.5.2 Characteristics of questions about person name entities in Vietnamese
2.5.3 Proposed model
2.5.4 Evaluation method and data for the automatic question answering model
2.5.6 Experiments and evaluation
2.6 Chapter summary
Chapter 3 - PHENOTYPE ENTITY RECOGNITION IN ENGLISH BIOMEDICAL TEXT
3.1 Introduction
3.1.1 Motivation and overview of the phenotype entity recognition problem
3.1.2 Concepts related to phenotype entities and some related entities
3.1.3 Domain adaptation in biomedical entity recognition
3.2 A model for recognizing phenotype entities and some related entities
3.2.1 Theoretical background
3.2.2 Evaluation data and supporting resources
3.2.3 Proposed model
3.2.4 Feature set and feature evaluation
3.2.5 Evaluation method
3.3 Experiments
3.3.1 Experiment 1: evaluating the effectiveness of the proposed model with different machine learning techniques
3.3.2 Experiment 2: comparing the results of the proposed model with related studies
3.3.3 Experiment 3: evaluating the contribution of resources to entity recognition results
3.3.4 Experiment 4: applying the proposed model to biomedical entity recognition in the BioCreAtIvE V CDR Task
3.4 Data domain adaptation in biomedical entity recognition
3.4.1 Experiments
3.4.2 Results and evaluation
3.5 Chapter summary
Chapter 4 - A MODEL FOR IMPROVING BIOMEDICAL ENTITY RECOGNITION BASED ON HYBRID COMBINATION AND LEARNING TO RANK
4.1 An improved model for recognizing phenotype entities and related entities
4.2 Proposed combination methods
4.2.1 Rule-based combination
4.2.2 Combination using sequence-labeling machine learning
4.2.3 Combination using learning to rank
4.3 Experiments and evaluation of results
4.3.1 Evaluation method
4.3.2 Experiment evaluating the effectiveness of the combination methods
4.3.3 Experiment testing the reliability of the resource effectiveness evaluation
4.3.4 Discussion and error analysis
4.4 Chapter conclusion
CONCLUSION
LIST OF THE AUTHOR'S SCIENTIFIC PUBLICATIONS RELATED TO THE DISSERTATION
REFERENCES

LIST OF SYMBOLS AND ABBREVIATIONS

Abbreviation; English; Vietnamese
NER; Named Entity Recognition; Nhận dạng thực thể định danh
NLP; Natural Language Processing; Xử lý ngôn ngữ tự nhiên
BioNLP; Biomedical Natural Language Processing; Xử lý ngôn ngữ tự nhiên cho dữ liệu y sinh
IE; Information Extraction; Trích xuất thông tin
CRF; Conditional Random Fields; Trường ngẫu nhiên có điều kiện
SVM; Support Vector Machine; Máy véctơ hỗ trợ
SVM-LTR; SVM-Learn to rank; Học xếp hạng máy véctơ hỗ trợ
MEModel, Maxent Model; Maximum Entropy Model; Mô hình Entropy cực đại
MEM+BS; Maximum Entropy with Beam Search Model; Mô hình Entropy cực đại với giải mã tìm kiếm chùm

LIST OF TABLES

Table 2.1 An example of extracting a person name entity and its related attributes
Table 2.2 Labels used in the model
Table 2.3 Feature set used
Table 2.4 Statistics of entities in the labeled dataset
Table 2.5 Whole-system evaluation results of the two models with the two methods MEM+BS and CRF
Table 2.6 Experimental results per label
Table 2.7 Examples of some question components
Table 2.8 Components appearing in questions about person name entities
Table 2.9 Example of general labeling for a Vietnamese person name entity question
Table 2.10 Statistics of the evaluation question dataset
Table 2.11 Evaluation results of the question analysis component
Table 2.12 Evaluation results of the automatic answering system
Table 3.1 List of autoimmune diseases used to build the Phenominer A dataset
Table 3.2 Characteristics of the Phenominer A (autoimmune diseases) and Phenominer B (cardiovascular diseases) datasets
Table 3.3 Features used in the experiments
Table 3.4 Experiment comparing different machine learning methods
Table 3.5 Experiment comparing the proposed model with other systems
Table 3.6 Evaluation results of resources in the entity recognition model
Table 3.7 Statistics of the three datasets of the CDR task [WPL15]
Table 3.8 Results of the recognition model on the test dataset
Table 3.9 F1 results of the NER systems using the methods in experiments 1-6
Table 4.1 Features used by MEM+BS to decide results
Table 4.2 Results of the model on the Phenominer A dataset using different methods to combine results

LIST OF THE AUTHOR'S SCIENTIFIC PUBLICATIONS RELATED TO THE DISSERTATION

[CTLA1] Nigel Collier, Ferdinand Paster, Mai-Vu Tran (2014). The impact of near domain transfer on biomedical named entity recognition. LOUHI 2014, EACL 2014, Sweden, 2014.
[CTLA2] Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Quang-Thuy Ha, Anika Oellrich, Dietrich Rebholz-Schuhmann (2013). Learning to Recognize Phenotype Candidates in the Auto-Immune Literature Using SVM Re-Ranking. PLoS ONE 8(10): e72965, October 2013.
[CTLA3] Mai-Vu Tran, Duc-Trong Le (2013). vTools: Chunker and Part-of-Speech tools. RIVF-VLSP 2013 Workshop.
[CTLA4] Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Anika Oellrich, Ai Kawazoe,
Martin Hall-May, Dietrich Rebholz-Schuhmann (2012). A Hybrid Approach to Finding Phenotype Candidates in Genetic Texts. COLING 2012: 647-662.
[CTLA5] Mai-Vu Tran, Duc-Trong Le, Xuan-Tu Tran and Tien-Tung Nguyen (2012). A Model of Vietnamese Person Named Entity Question Answering System. PACLIC 2012, Bali, Indonesia, October 2012.
[CTLA6] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha (2011). An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. IALP 2011: 115-118.
[CTLA7] Hoang-Quynh Le, Mai-Vu Tran, Thanh Hai Dang, Nigel Collier (2015). The UET-CAM System in the BioCreAtIvE V CDR Task. In Proceedings of the fifth BioCreative challenge evaluation workshop, Sevilla, Spain, 2015.

REFERENCES

Vietnamese

[DH96] Diệp Quang Ban (chief editor), Hoàng Văn Thung (1996). Ngữ pháp tiếng Việt (Vietnamese Grammar), vols. 1 and 2. NXB Giáo dục (Education Publishing House), Hanoi.
[NTH11] Nguyễn Thanh Hiên (2011). Named entity disambiguation based on closed and open ontologies (in Vietnamese). Doctoral dissertation. University of Technology, Vietnam National University, Ho Chi Minh City.
[SC13] Sam Chanrathany (2013). Extraction of named entities and entity relations in Vietnamese text (in Vietnamese). Doctoral dissertation. Hanoi University of Science and Technology.

English

[AHB93] Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D., & Tyson, M. (1993, August). FASTUS: A finite-state processor for information extraction from real-world text. In IJCAI (Vol. 93, pp. 1172-1178).
[AZ05] Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6, 1817-1853.
[AZ11b] A. B. Abacha and P. Zweigenbaum. Medical entity recognition: A comparison of semantic and statistical methods. In Proceedings of BioNLP 2011 Workshop, pages 56–64, 2011.
[AZ12] Aggarwal, C. C., & Zhai, C. (2012). Mining text data. Springer Science & Business Media.
[BBD02] Banko, M., Brill, E., Dumais, S., & Lin, J. (2002, March). AskMSR: Question answering using the worldwide Web. In Proceedings of 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases (pp. 7-9).
[BPP96] Berger, A. L., Pietra, V. J. D., & Pietra, S. A. D. (1996). A maximum entropy approach to natural language processing. Computational linguistics, 22(1), 39-71.
[BR04] Bard, J. B., & Rhee, S. Y. (2004). Ontologies in biology: design, applications and future challenges. Nature Reviews Genetics, 5(3), 213-222.
[BSS03] Blake, A., Sinclair, M. T., & Sugiyarto, G. (2003). Quantifying the impact of foot and mouth disease on tourism and the UK economy. Tourism Economics, 9(4), 449-465.
[BSS08] Beisswanger, E., Schulz, S., Stenzhorn, H., & Hahn, U. (2008). BioTop: An upper domain ontology for the life sciences: A description of its current structure, contents and interfaces to OBO ontologies. Applied Ontology, 3(4), 205-212.
[CC03] Curran, J. R., & Clark, S. (2003, May). Language independent NER using a maximum entropy tagger. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume (pp. 164-167). Association for Computational Linguistics.
[CC09] Cai, Y., & Cheng, X. (2009, October). Biomedical named entity recognition with tri-training learning. In Biomedical Engineering and Informatics, 2009. BMEI'09. 2nd International Conference on (pp. 1-5). IEEE.
[COG15] Collier, N., Oellrich, A., & Groza, T. (2015). Concept selection for phenotypes and diseases using learn to rank. Journal of biomedical semantics, 6(1), 24.
[CF04] Chen, L., & Friedman, C. (2004). Extracting phenotypic information from the literature via natural language processing. Medinfo, 11(Pt 2), 758-62.
[CGE11] Cohen, R., Gefen, A., Elhadad, M., & Birk, O. S. (2011). CSI-OMIM-Clinical Synopsis Search in OMIM. BMC bioinformatics, 12(1), 65.
[COG13] Collier, N., Oellrich, A., & Groza, T. (2013). Toward knowledge support for analysis and interpretation of complex traits. Genome biology, 14(9), 214.
[CTX06] Cam-Tu Nguyen, Trung Kien Nguyen, Xuan Hieu Phan, Le Minh Nguyen, and Quang Thuy Ha. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation. The 20th Pacific Asia Conference on Language, Information, and Computation (PACLIC), 1st-3rd November, 2006, Wuhan, China.
[CH08] Cohen, K. B., & Hunter, L. (2008). Getting started in text mining. PLoS computational biology, 4(1), e20.
[DA07] H. Daume III. 2007. Frustratingly easy domain adaptation. In Annual meeting of the Association for Computational Linguistics (ACL 2007), pages 256–263.
[DCX12] Doan, S., Collier, N., Xu, H., Duy, P. H., & Phuong, T. M. (2012). Recognition of medication information from discharge summaries using ensembles of classifiers. BMC medical informatics and decision making, 12(1), 36.
[DDS09] Nguyen, D. Q., Nguyen, D. Q., & Pham, S. B. (2009, October). A vietnamese question answering system. In Knowledge and Systems Engineering, 2009. KSE'09. International Conference on (pp. 26-32). IEEE.
[DMP04] Doddington, G. R., Mitchell, A., Przybocki, M. A., Ramshaw, L. A., Strassel, S., & Weischedel, R. M. (2004, May). The Automatic Content Extraction (ACE) Program-Tasks, Data, and Evaluation. In LREC.
[ES13] Ekbal, A., & Saha, S. (2013). Stacked ensemble coupled with feature selection for biomedical entity extraction. Knowledge-Based Systems, 46, 22-32.
[EUL01] Eduard Hovy, Ulf Hermjakob and Lin, C.-Y. The Use of External Knowledge in Factoid QA. Paper presented at the Tenth Text REtrieval Conference (TREC 10), Gaithersburg, MD, 2001, November 13-16.
[FEO02] K. Franzén, G. Eriksson, F. Olsson, L. Asker, P. Lidén, and J. Coster. Protein names and how to find them. International Journal of Medical Informatics, 67(1-3):49–61, 2002.
[FIJ03] Florian, R., Ittycheriah, A., Jing, H. and Zhang, T. (2003). Named Entity Recognition through Classifier Combination. Proceedings of CoNLL-2003. Edmonton, Canada.
[FPS96] Fayyad, Piatetsky-Shapiro, Smyth. From Data Mining to Knowledge Discovery: An Overview. In Fayyad, Piatetsky-Shapiro, Smyth, Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, Menlo Park, 1996, 1-34.
[FS03] Freimer, N., & Sabatti, C. (2003). The human
phenome project. Nature genetics, 34(1), 15-21.
[FTT98] Fukuda, K. I., Tsunoda, T., Tamura, A., & Takagi, T. (1998, January). Toward information extraction: identifying protein names from biological papers. In Pac Symp Biocomput (Vol. 707, No. 18, pp. 707-718).
[GCS11] Gremse, M., Chang, A., Schomburg, I., Grote, A., Scheer, M., Ebeling, C., & Schomburg, D. (2011). The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources. Nucleic acids research, 39(suppl 1), D507-D513.
[GFH08] Danilo Giampiccolo, Pamela Forner, Jesús Herrera, Anselmo Peñas, Christelle Ayache, Corina Forascu, Valentin Jijkoun, Petya Osenova, Paulo Rocha, Bogdan Sacaleanu, Richard F. E. Sutcliffe (2008). Overview of the CLEF 2007 multilingual question answering track. In Advances in Multilingual and Multimodal Information Retrieval (pp. 200-236). Springer Berlin Heidelberg.
[GKD15] Groza, T., Köhler, S., Doelken, S., Collier, N., Oellrich, A., Smedley, D., & Robinson, P. N. (2015). Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora. Database, 2015.
[GHZ12] Groza, T., Hunter, J., & Zankl, A. (2012). Supervised segmentation of phenotype descriptions for the human skeletal phenome using hybrid methods. BMC bioinformatics, 13(1), 265.
[GHZ13] Groza, T., Hunter, J., & Zankl, A. (2013). Decomposing phenotype descriptions for the human skeletal phenome. Biomedical informatics insights, 6.
[GLR06] Giuliano, C., Lavelli, A., & Romano, L. (2006, April). Exploiting shallow linguistic information for relation extraction from biomedical literature. In EACL (Vol. 18, pp. 401-408).
[GNB10] Gerner, M., Nenadic, G., & Bergman, C. M. (2010). LINNAEUS: a species name identification system for biomedical literature. BMC bioinformatics, 11(1), 85.
[GR08] Girju, R. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.
[GZH12] Groza, T., Zankl, A., & Hunter, J. (2012). Experiences with modeling composite phenotypes in the SKELETOME project. In The Semantic Web–ISWC 2012 (pp. 82-97). Springer Berlin Heidelberg.
[HBK12] Hirschman, L., Burns, G. A. C., Krallinger, M., Arighi, C., Cohen, K. B., Valencia, A., & Winter, A. G. (2012). Text mining for the biocuration workflow. Database, 2012, bas020.
[HC03] W.-J. Hou and H.-H. Chen. Enhancing performance of protein name recognizers using collocation. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, Volume 13, pages 25–32, 2003.
[HEG00] Hovy, Eduard and Gerber, Laurie and Hermjakob, Ulf and Junk, Michael and Lin, Chin-yew (2000). Question answering in Webclopedia. In Proceedings of the Ninth Text REtrieval Conference (TREC-9).
[HHH12] Hoehndorf, R., Harris, M. A., Herre, H., Rustici, G., & Gkoutos, G. V. (2012). Semantic integration of physiology phenotypes with an application to the Cellular Phenotype Ontology. Bioinformatics, 28(13), 1783-1789.
[HL15] Huang, C. C., & Lu, Z. (2015). Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings in bioinformatics, bbv024.
[HOR10] Hoehndorf, R., Oellrich, A., & Rebholz-Schuhmann, D. (2010). Interoperability between phenotype and anatomy ontologies. Bioinformatics, 26(24), 3112-3118.
[HSG11] Hoehndorf, R., Schofield, P. N., & Gkoutos, G. V. (2011). PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic acids research, 39(18), e119-e119.
[HSS09] Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Van Mulligen, E. M., & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22), 2983-2991.
[HWY05] Huang, J., Wang, C., Yang, C., Chiu, M. and Yee, G. 2005. Applying Word Sense Disambiguation to Question Answering System for E-Learning. In Proceedings of the 19th International Conference on Advanced Information Networking and Applications. Taipei, Taiwan, pp. 157-62.
[JAJ10] Javier Artiles, Andrew Borthwick, Julio Gonzalo, Satoshi Sekine, and Enrique Amigó. WePS-3 Evaluation Campaign: Overview of the Web People Search Clustering and Attribute Extraction Tasks. In the 3rd Web People Search Evaluation Workshop (WePS 2010).
[Kai08] Kaisser, M. (2008, June). The QuALiM question answering demo: Supplementing answers with paragraphs drawn from Wikipedia. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session (pp. 32-35). Association for Computational Linguistics.
[KCO05] S. Kinoshita, K. B. Cohen, P. Ogren, and L. Hunter. BioCreAtIvE task 1A: Entity identification with a stochastic tagger. BMC Bioinformatics, 6(Suppl 1):S4, 2005.
[KLR15] Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., & Valencia, A. (2015). CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform, 7(Suppl 1), S1.
[KM14] Khordad, Maryam (2014). Investigating Genotype-Phenotype relationship extraction from biomedical text. Doctoral dissertation. University of Western Ontario.
[KMR11] Khordad, M., Mercer, R. E., & Rogan, P. (2011). Improving phenotype name recognition. In Advances in Artificial Intelligence (pp. 246-257). Springer Berlin Heidelberg.
[KOT03] Kim, J. D., Ohta, T., Tateisi, Y., & Tsujii, J. I. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl 1), i180-i182.
[KOT04] Kim, J. D., Ohta, T., Tsuruoka, Y., Tateisi, Y., & Collier, N. (2004, August). Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (pp. 70-75). Association for Computational Linguistics.
[LDN13] Le, N. M., Do, B. N., Nguyen, V. D., & Nguyen, T. D. (2013, December). VNLP: an open source framework for Vietnamese natural language processing. In Proceedings of the Fourth Symposium on Information and Communication Technology (pp. 88-93). ACM.
[LLL14] Le Trung, H., Le Anh, V., & Le Trung, K. (2014). Bootstrapping and Rule-Based Model for Recognizing
Vietnamese Named Entity. In Intelligent Information and Database Systems (pp. 167-176). Springer International Publishing.
[LMP01] Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
[LN10] Le, H. T., & Nguyen, T. H. (2010, August). Name entity recognition using inductive logic programming. In Proceedings of the 2010 Symposium on Information and Communication Technology (pp. 71-77). ACM.
[LTC04] Lin, Y. F., Tsai, T. H., Chou, W. C., Wu, K. P., Sung, T. Y., & Hsu, W. L. (2004, August). A maximum entropy approach to biomedical named entity recognition. In BIOKDD (pp. 56-61).
[LV13] Le, H. T., & Van Tran, L. (2013, December). Automatic feature selection for named entity recognition using genetic algorithm. In Proceedings of the Fourth Symposium on Information and Communication Technology (pp. 81-87). ACM.
[MAC07] Mabee, P. M., Ashburner, M., Cronk, Q., Gkoutos, G. V., Haendel, M., Segerdell, E., & Westerfield, M. (2007). Phenotype ontologies: the bridge between genomics and evolution. Trends in ecology & evolution, 22(7), 345-350.
[MC07] McKusick, V. A. (2007). Mendelian Inheritance in Man and its online version, OMIM. American journal of human genetics, 80(4), 588.
[MFM05] Mitsumori, T., Fation, S., Murata, M., Doi, K., & Doi, H. (2005). Gene/protein name recognition based on support vector machine using dictionary as features. BMC bioinformatics, 6(Suppl 1), S8.
[MFP00] McCallum, A., Freitag, D., & Pereira, F. C. (2000, June). Maximum Entropy Markov Models for Information Extraction and Segmentation. In ICML (pp. 591-598).
[MHC04] A. A. Morgan, L. Hirschman, M. Colosimo, A. S. Yeh, and J. B. Colombe. Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics, 37(6):396–410, 2004.
[ML03] McCallum, A., & Li, W. (2003, May). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume (pp. 188-191). Association for Computational Linguistics.
[MO08] Michele Banko, Oren Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. ACL 2008: 28-36.
[MPH03] Moldovan, D., Paşca, M., Harabagiu, S., & Surdeanu, M. (2003). Performance issues and error analysis in an open-domain question answering system. ACM Transactions on Information Systems (TOIS), 21(2), 133-154.
[MR04] Mika, S., & Rost, B. (2004). Protein names precisely peeled off free text. Bioinformatics, 20(suppl 1), i241-i247.
[MY14] Miwa, Makoto, and Yutaka Sasaki. "Modeling Joint Entity and Relation Extraction with Table Representation." EMNLP 2014.
[NBK13] Nédellec, C., Bossy, R., Kim, J. D., Kim, J. J., Ohta, T., Pyysalo, S., & Zweigenbaum, P. (2013, August). Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop (pp. 1-7).
[NC12] Nguyen, T. T., & Cao, T. H. (2012, February). Linguistically Motivated and Ontological Features for Vietnamese Named Entity Recognition. In Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2012 IEEE RIVF International Conference on (pp. 1-6). IEEE.
[NCT99] C. Nobata, N. Collier, and J.-i. Tsujii. Automatic term identification and classification in biology texts. In Proceedings of the Natural Language Pacific Rim Symposium, pages 369–374, 1999.
[NE05] Nédellec, C. (2005, August). Learning language in logic-genic interaction extraction challenge. In Proceedings of the 4th Learning Language in Logic Workshop (LLL05) (Vol. 7).
[NN13] Nguyen, M. T., & Nguyen, T. T. (2013, December). Extraction of disease events for a real-time monitoring system. In Proceedings of the Fourth Symposium on Information and Communication Technology (pp. 139-147). ACM.
[NP12] Nguyen, D. B., & Pham, S. B. (2012). Ripple down rules for vietnamese named entity recognition. In Computational Collective Intelligence. Technologies and Applications (pp. 354-363). Springer Berlin Heidelberg.
[NRV03] M. Narayanaswamy, K. E. Ravikumar, and K. Vijay-Shanker. A biological named entity recognizer. In Pacific Symposium on Biocomputing, pages 427–438, 2003.
[NHP10] Nguyen, D. B., Hoang, S. H., Pham, S. B., & Nguyen, T. P. (2010). Named entity recognition for Vietnamese. In Intelligent Information and Database Systems (pp. 205-214). Springer Berlin Heidelberg.
[OCQ09] Oanh Thi Tran, Cuong Anh Le, Quang-Thuy Ha and Quynh Hoang Le. An Experimental Study on Vietnamese POS tagging. International Conference on Asian Language Processing (IALP 2009): 23-27, Dec 7-9, 2009, Singapore.
[OMT06] D. Okanohara, Y. Miyao, Y. Tsuruoka, and J. Tsujii. Improving the scalability of semi-Markov conditional random fields for named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 465–472, 2006.
[OOG05] Özgür, A., Özgür, L., & Güngör, T. (2005). Text categorization with class-based and corpus-based keyword selection. In Computer and Information Sciences-ISCIS 2005 (pp. 606-615). Springer Berlin Heidelberg.
[PGH07] Pyysalo, S., Ginter, F., Heimonen, J., Björne, J., Boberg, J., Järvinen, J., & Salakoski, T. (2007). BioInfer: a corpus for information extraction in the biomedical domain. BMC bioinformatics, 8(1), 50.
[PNH10] Phan, T. T., Nguyen, T. C., & Huynh, T. N. (2010). Question semantic analysis in Vietnamese QA system. In Advances in Intelligent Information and Database Systems (pp. 29-40). Springer Berlin Heidelberg.
[PY10] Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10), 1345-1359.
[QU93] Quinlan, J. R. (1993). C4.5: programs for machine learning (Vol. 1). Morgan Kaufmann.
[RA89] Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
[RA91] Rau, L. F. (1991, February). Extracting company names from text. In Artificial Intelligence Applications, 1991.
Proceedings., Seventh IEEE Conference on (Vol. 1, pp. 29-32). IEEE.
[RA96] Ratnaparkhi, A. (1996, May). A maximum entropy model for part-of-speech tagging. In Proceedings of the conference on empirical methods in natural language processing (Vol. 1, pp. 133-142).
[RHT10] Rathany Chan Sam, Huong Thanh Le, Thuy Thanh Nguyen, The Minh Trinh. Relation Extraction in Vietnamese Text Using Conditional Random Fields. AAIRS 2010: 330-339.
[RM95] L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In 3rd ACL SIGDAT Workshop on Very Large Corpora, pages 82–94, 1995.
[RR09] Ratinov, L., & Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (pp. 147-155). Association for Computational Linguistics.
[SCW09] Scheuermann, R. H., Ceusters, W., & Smith, B. (2009). Toward an ontological treatment of disease and diagnosis. Summit on translational bioinformatics, 2009, 116.
[SE04] Settles, B. (2004, August). Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (pp. 104-107). Association for Computational Linguistics.
[SE09] Smith, C. L., & Eppig, J. T. (2009). The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 1(3), 390-399.
[SGE04] Smith, C. L., Goldsmith, C. A. W., & Eppig, J. T. (2004). The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome biology, 6(1), R7.
[SJ09] Satoshi Sekine and Javier Artiles. WePS2 Attribute Extraction Task. In the 2nd Web People Search Evaluation Workshop (WePS 2, 2009).
[SLT11a] Sam, R. C., Le, H. T., Nguyen, T. T., & Nguyen, T. H. (2011). Combining proper name-coreference with conditional random fields for semi-supervised named entity recognition in Vietnamese text. In Advances in Knowledge Discovery and Data Mining (pp. 512-524). Springer Berlin Heidelberg.
[SLT11b] Sam, R. C., Le, H. T., Nguyen, T. T., Le, D. A., & Nguyen, N. M. T. (2011, October). Semi-supervised learning for relation extraction in Vietnamese text. In Proceedings of the Second Symposium on Information and Communication Technology (pp. 100-105). ACM.
[SMY15] Sun, H., Ma, H., Yih, W. T., Tsai, C. T., Liu, J., & Chang, M. W. (2015, May). Open Domain Question Answering via Semantic Enrichment. In Proceedings of the 24th International Conference on World Wide Web (pp. 1045-1055). International World Wide Web Conferences Steering Committee.
[SOK13] Smedley, D., Oellrich, A., Köhler, S., Ruef, B., Westerfield, M., Robinson, P., & Mungall, C. (2013). PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database, 2013, bat025.
[SSM09] S. K. Saha, S. Sarkar, and P. Mitra. Feature selection techniques for maximum entropy based biomedical named entity recognition. Journal of Biomedical Informatics, vol. 42, no. 5, pp. 905–911, 2009.
[STM08] Y. Sasaki, Y. Tsuruoka, J. McNaught, and S. Ananiadou. How to make the most of NE dictionaries in statistical NER. BMC Bioinformatics, 9(Suppl 11):S5, 2008.
[TC05] K. Takeuchi and N. Collier. Bio-medical entity extraction using support vector machines. Artificial Intelligence in Medicine, 33(2):125–137, 2005.
[TLH10] Tran Thi Oanh, Le Cuong Anh, Ha Thuy Quang. Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources. Journal of Natural Language Processing 17(3): 41-60 (2010).
[TOH05] Tu, N. C., Oanh, T. T., Hieu, P. X., & Thuy, H. Q. (2005). Named entity recognition in vietnamese free-text and web documents using conditional random fields. In The 8th Conference on Some selection problems of Information Technology and Telecommunication.
[TTD07] Thao, P. T. X., Tri, T. Q., Dien, D., & Collier, N. (2007). Named entity recognition in Vietnamese using classifier voting. ACM Transactions on Asian Language Information Processing (TALIP), 6(4).
[TTK05] Tsuruoka, Y., Tateishi, Y., Kim, J. D., Ohta, T., McNaught, J., Ananiadou, S., & Tsujii, J. I. (2005). Developing a robust part-of-speech tagger for biomedical text. In Advances in informatics (pp. 382-392). Springer Berlin Heidelberg.
[TTQ07] Tran, Q. T., Pham, T. T., Ngo, Q. H., Dinh, D., & Collier, N. (2007). Named entity recognition in Vietnamese documents. Progress in Informatics Journal, 5, 14-17.
[TWC06] Tzong-Han Tsai, Richard; Wu, S.-H.; Chou, W.-C.; Lin, Y.-C.; He, D.; Hsiang, J.; Sung, T.-Y.; Hsu, W.-L. 2006. Various Criteria in the Evaluation of Biomedical Named Entity Recognition. BMC Bioinformatics 7:92, BioMed Central.
[UCO11] Y. Usami, H.-C. Cho, N. Okazaki, and J. Tsujii. Automatic acquisition of huge training data for bio-medical named entity recognition. In Proceedings of BioNLP 2011 Workshop, pages 65–73, 2011.
[USC10] Uzuner, Ö., Solti, I., & Cadag, E. (2010). Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5), 514-518.
[USS10] Uzuner, Ö., South, B. R., Shen, S., & DuVall, S. L. (2011). 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association.
[VA10] Vlachos, A. (2010). Semi-supervised learning for biomedical information extraction. Doctoral dissertation. Computer Laboratory, University of Cambridge.
[VED01] Voorhees, Ellen M., and Donna Harman. Overview of TREC 2001. TREC 2001.
[Vo03] E. M. Voorhees. Overview of the TREC 2003 Question Answering Track. TREC 2003: 54-68.
[VVO09] Vu Mai Tran, Vinh Duc Nguyen, Oanh Thi Tran, Uyen Thu Thi Pham, Thuy Quang Ha. An Experimental Study of Vietnamese Question Answering System. In Proceedings of IALP'2009, pp. 152-155.
[WAC12] Wu, C. H., Arighi, C. N., Cohen, K. B., Hirschman, L., Krallinger, M., Lu, Z., & Wilbur, W. J. (2012). BioCreative-2012 Virtual Issue. Database: The Journal of Biological Databases and Curation, 2012.
[WGM14] West, R., Gabrilovich, E., Murphy, K., Sun, S., Gupta,
R., & Lin, D. (2014, April). Knowledge base completion via search-based question answering. In Proceedings of the 23rd international conference on World wide web (pp. 515-526). ACM.
[WKS09] Wang, Y., Kim, J. D., Sætre, R., Pyysalo, S., & Tsujii, J. I. (2009). Investigating heterogeneous protein annotations toward cross-corpora utilization. BMC bioinformatics, 10(1), 403.
[WPL15] Wei, C. H., Peng, Y., Leaman, R., Davis, A. P., Mattingly, C. J., Li, J., & Lu, Z. (2015). Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the fifth BioCreative challenge evaluation workshop, Sevilla, Spain.
[WTJ13] Wagholikar, K. B., Torii, M., Jonnalagadda, S., & Liu, H. (2013). Pooling annotated corpora for clinical concept extraction. J Biomedical Semantics, 4.
[YD14] Yao, X., & Van Durme, B. (2014). Information extraction over structured data: Question answering with Freebase. In Proceedings of ACL.
[YYW15] Yang, Y., Yih, W. T., & Meek, C. (2015). WIKIQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[ZD09] Zweigenbaum, P., & Demner-Fushman, D. (2009). Advanced literature-mining tools. In Bioinformatics (pp. 347-380). Springer New York.
[ZDY07] Zweigenbaum, P., Demner-Fushman, D., Yu, H., & Cohen, K. B. (2007). Frontiers of biomedical text mining: current progress. Briefings in bioinformatics, 8(5), 358-375.
[ZSZ05] G. Zhou, D. Shen, J. Zhang, J. Su, and S. Tan. Recognition of protein/gene names from text using an ensemble of classifiers. BMC Bioinformatics, 6(Suppl 1):S7, 2005.
