DSpace at VNU: VnLoc: A real-time news event extraction framework for Vietnamese


2012 Fourth International Conference on Knowledge and Systems Engineering

VnLoc: A Real-time News Event Extraction Framework for Vietnamese

Mai-Vu Tran∗ (vutm@vnu.edu.vn), Minh-Hoang Nguyen∗ (hoangnm_53@vnu.edu.vn), Minh-Tien Nguyen∗∗ (tiennm@utehy.edu.vn), Sy-Quan Nguyen∗ (quanns_53@vnu.edu.vn), Xuan-Hieu Phan∗ (hieupx@vnu.edu.vn)

∗ KTLab, Faculty of Information Technology, College of Technology, Vietnam National University, Hanoi (VNU), Hanoi, Vietnam
∗∗ Faculty of Information Technology, Hung Yen University of Technology and Education, Hungyen, Vietnam

978-0-7695-4760-2/12 $26.00 © 2012 IEEE. DOI 10.1109/KSE.2012.34

Abstract

Event extraction is a complex and interesting topic in Information Extraction that covers methods for extracting events from free text or web data. The results of event extraction systems can be used in several fields, such as risk analysis systems, online monitoring systems and decision support tools [4]. In this paper, we introduce a method that combines lexico-semantic rules and machine learning to extract events from Vietnamese news. We then describe VnLoc, an online event monitoring system built on this method. In the experiment phase, we evaluated the method using precision, recall and the F1 measure. At the time of the experiment, we investigated only three types of event: FIRE, CRIME and TRANSPORT ACCIDENT.

1 Introduction

The information explosion and the development of information and communication technology make it easy for people to reach information. As a result, information is becoming richer and more diverse. Information arriving from many different sources (newspapers, blogs, social networks, ...) is a main cause of information chaos. Extracting the useful information that readers are interested in from the daily news is therefore necessary. One of the biggest problems we face is how to get information in the shortest possible time. This is a real challenge in Web Mining. Event extraction can address this issue because information usually contains event content. Through event extraction, readers can learn the details of a news item without reading its entire content. In addition, the results of event extraction can be used in online monitoring systems where users can catch information easily.

Recently, event extraction has received more attention from researchers in Natural Language Processing and Data Mining around the world. In 1987, event extraction became a main topic of the Message Understanding Conference (MUC) [5]. In this conference, an event was defined as follows: an event must have an actor, a time, a place and an impact on the surrounding environment. In addition, the Automatic Content Extraction program defined an event as an activity created by participants and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel and Justice. According to Allan's definition, an event has four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic) and tense (Past, Present, Future, Unspecified) [1].

Based on an investigation and analysis of event extraction, we propose an event extraction method for the Vietnamese language and build an online event monitoring system named VnLoc. The proposed method combines lexico-semantic rules with machine learning. The system's data are gathered from news through RSS feeds. We then apply our method to classify each news item as EVENT or NON-EVENT based on its title. After that, we extract event attributes from the items classified as events, and the final result is visualized on an online map. In the experiment phase, we evaluated our approach by cross-validation, obtaining precision ≈ 92.85%, recall ≈ 90.39% and F1 ≈ 91.61%.

Section 2 reviews related research, whereas Section 3 describes our method and the VnLoc online event monitoring system in more detail. Section 4 presents our experiment and evaluates the results on real data. The last section is the conclusion.

2 Related Work

In [7], Ralph Grishman et al. investigated Maximum Entropy for event detection. They used three classifiers for individual tasks: an argument classifier, a role classifier and an event classifier. Event coreference is resolved by another Maximum Entropy classifier using features such as the event type, the event subtype, the anaphor anchor and the distance between anaphor and anchor. In another study, Heng Ji and Ralph Grishman explored Maximum Entropy to identify events of a separate type [6]. Theirs is a sentence-level classifier that processes each sentence in the document and attempts to determine its event type.

Lexico-semantic patterns can be used for various purposes in many domains. Cohen, Verspoor et al. [3] applied semantic rules as patterns to extract events in the biological domain. They divided biological events into six types: binding, gene expression, localization, phosphorylation, protein catabolism and transcription. Biological events are extracted through patterns, each of which is a set of semantic words. In other work, Jethro Borsje et al. [2] proposed using lexico-semantic patterns to detect financial events from RSS news feeds. These patterns were organized in a financial ontology expressed in OWL. Each pattern has a triple format with three elements: a subject, a relation and an optional object.

In addition, several systems extract events from online news in other domains. Collier et al. built the BioCaster system, which tracks several event types around the world (http://born.nii.ac.jp). The HealthMap system, built by Freifeld and Brownstein, lets users monitor disease types around the world (http://www.healthmap.org). Likewise, the Frontex system was developed by Atkinson, Piskorski et al. (http://frontex.europa.eu) for monitoring by a European agency.

3 Implemented System

3.1 System Architecture

VnLoc is an event monitoring system that is horizontally scalable and distributed. Its architecture is illustrated in Figure 1. VnLoc consists of six components: a scalable data repository, a news crawler, an event detector, an event extractor, a plugin engine and a web-based visualizer.

Figure 1: VnLoc's Architecture

We organize the data repository into three parts: a news database, an event database and a data corpus. The first stores detailed information about the gathered news. The second stores the event information produced by the event extractor. Both are kept in MongoDB to attain a key property: high scalability. The last holds several corpora that support both the machine learning process and the extraction process.

Next, the news crawler fetches news through RSS resources supplied by many websites, such as VnExpress (www.vnexpress.net), VietNamNet (www.vietnamnet.vn) and DanTri (www.dantri.com). Thanks to the structure of the RSS format, useful information can be extracted from each feed by an XML parser and saved in the news database of the data repository.

Furthermore, the visualizer is a map that shows events on the web interface. Data is pulled from the event database and rendered through the Google Map API with some modifications. In the following, the two main elements, the event detector and the event extractor, are explained to make the VnLoc system clearer.

3.2 Event Detector

When a news item is gathered, the event detector determines whether it contains an event. To settle this task, we used a binary classification approach based on the Maximum Entropy method. We examined the domain carefully and found that most news titles express their content evidently. Therefore, our problem is sentence-level classification. First, a set of features is chosen based on the χ2 weight computed on previously gathered offline data. At the same time, the N-gram method is used to select phrases as features. In this paper, we choose uni-grams, bi-grams and tri-grams as the three phrase types. Moreover, each feature is tagged with a semantic label to enhance its meaning. Table 1 shows some examples. After that, the Maximum Entropy classifier is applied to divide the set of titles into two categories: EVENT and NON-EVENT. This step is a precondition for the event extractor in the next phase.

Table 1: Features Description

Feature                 χ2 weight     Freq   Semantic label
chém (cutting)          0.70329136    240    CRIME
giết (kill)             0.6890592     530    CRIME
cháy (fire)             0.5872597     201    FIRE
gây tai nạn (crashed)   0.5106312     374    TRANSPORT ACCIDENT

3.3 Event Extractor

In the second important part, an event and its information, such as time, place and participants, are extracted from news that the event detector has predicted to contain an event. Our approach is clear and knowledge-driven. Many rules are generated and applied to the news items passed on from the previous phase. In this paper, we use several types of rules for this purpose. To extract the event, rules 1, 2, 3 and 4 are applied:

• FIRE
    <PRE> <FIRE> <POST>        (1)
• CRIME
    <PRE> <CRIME> <POST>       (2)
• TRANSPORT ACCIDENT
    <PRE> <ACCIDENT> <POST>    (3)
    <PRE> <DAMAGE> <POST>      (4)

Here PRE and POST are phrases or words surrounding the keywords, and the keyword classes are defined as:

FIRE := vụ cháy (fire) | bùng cháy (burning) | cháy rụi (burned) | ... (18 rules)
CRIME := ẩu đả (brawl) | băng cướp (bandits) | bị đâm chết (stabbed) | ... (90 rules)
ACCIDENT := đâm xe (car crash) | cán chết (crashed) | lật tàu (boat capsized) | ... (27 rules)
DAMAGE := thiệt mạng (die) | chết thảm (pitiful death) | chấn thương sọ não (brain injury) | ... (22 rules)

To extract the time when an event happened, we use two methods: direct and indirect. The former applies when the time is stated completely in the content; we use regular expressions to accomplish this task. The latter applies when the time is not concrete. For instance, in "Hôm nay, hai vụ tai nạn giao thông xảy ra trên đường Khuất Duy Tiến." ("Today, two transport accidents happened on Khuat Duy Tien Street."), Hôm nay (Today) is a relative adverb that does not denote exactly when the event occurred. We solve this problem by matching against a dictionary of relative keys and relative values, defined below. Then, rule 5 is used to extract the time.

RELATIVETIME = (RELATIVE_WORD, BIAS) = {(hôm nay (today), 0), (rạng sáng (this morning), 0), (hôm qua (yesterday), −1), ..., (hai ngày trước (two days ago), −2)}

<DATE> = <PUBLISH_TIME> + <BIAS>    (5)

In the next step, we extract the location where the event occurred. As shown in rule 6, we use two constituents to find the proper location. The first is LOCPREP, a set of prepositions that come before the place; the second is LOCPREFIX, a set of prefixes that come after those prepositions but before the place. We then apply rule 6 to perform this task.

LOCPREP = {ở (in), tại (at), trong (in), gần (nearby, near), vào (into)}
LOCPREFIX = {thành phố (city), tỉnh (province), quận (district), thị xã (town), xã (village), phố (street)}
LOCATION = {loc_i | loc_i ∈ location dictionary}

<LOCPREP> <LOCPREFIX> <LOCATION>    (6)

Finally, participants are considered. As with the previous event information, we use a rule, shown in rule 7:

<PRE> <PERSON>    (7)

PRE := ông (Mr) | bà (Mrs/Ms) | gia đình (family) | nghi can (suspect) | bị cáo (defendant)

4 Experiment and Result

Our experiment was conducted on a data set of 18,400 titles extracted from 3,842,137 news titles gathered from BAOMOI (http://www.baomoi.com/) through RSS. The news components are described in Table 2.

Table 2: News' elements

Element        Description
Title          News headline
Abstract       Short paragraph that summarizes the news content
Publish time   Time when the news is published; may support the time extraction process
Link           Link to the original news

We evaluated event detection by evaluating the event classification process with 10-fold cross-validation. The data set is separated into 10 parts at a ratio of 9:1: 9 parts are used as the training data set and 1 part as the testing data. The classification results are shown in Table 3 and Figure 2.

Table 3: Result of classification

           Precision   Recall   F1
Fold 1     92.70       89.23    90.93
Fold 2     93.08       91.39    92.23
Fold 3     93.32       91.54    92.42
Fold 4     93.32       91.54    92.42
Fold 5     93.68       91.78    92.72
Fold 6     93.50       91.60    92.54
Fold 7     92.95       90.81    91.87
Fold 8     92.39       89.01    90.67
Fold 9     91.81       88.65    90.20
Fold 10    91.68       88.51    90.07
Average    92.85       90.39    91.61

Figure 2: Result of classification

After the system went online at http://vnloc.com (Figure 3), we evaluated the event extraction process manually by checking each event shown on the system from 13/04/2012 to 22/04/2012. The precision of event extraction is shown in Table 4. Based on the articles detected as containing events, the statistics in Table 4 indicate that an event extraction strategy using lexico-semantic rules and machine learning is appropriate for Vietnamese news. In some cases, event extraction fails because of ambiguity among places: many locations have similar names, and the article does not state the position fully.

Figure 3: VnLoc at http://vnloc.com

Table 4: Event extraction result (in quantity)

Date         Extracted   Correct   Precision (%)
13/04/2012   47          43        91.49
14/04/2012   61          58        95.08
15/04/2012   65          59        90.77
16/04/2012   59          54        91.52
17/04/2012   48          43        89.58
18/04/2012   55          49        89.09
19/04/2012   71          64        90.14
20/04/2012   56          50        89.28
21/04/2012   60          54        90.00
22/04/2012   63          57        90.47

5 Conclusion

In this paper, we have presented a method that combines lexico-semantic rules and machine learning (Maximum Entropy) for event extraction on Vietnamese domain data, and we have described the VnLoc system. The experimental results demonstrate that combining lexico-semantic rules and machine learning achieves good results on Vietnamese domain data. The Maximum Entropy method is used for binary classification and keeps only the events that fit the features in the training data set. Lexico-semantic rules are applied to extract the useful information of an event. Finally, the event's attributes are extracted based on rules and visualized on an online map. Furthermore, we have described the system architecture in detail. In particular, we concentrated on the Event Detector component, which uses the proposed method to recognize an event, and the Event Extractor, which uses lexico-semantic rules to extract the event's attributes.

Although we have achieved good results, the system needs some improvements to enhance its quality in the future. Firstly, the precision of the Maximum Entropy classifier must be enhanced by adding useful information. Secondly, we aim to expand to other areas such as disaster (disease, earthquake, tsunami), culture and finance. To that end, an ontology is being built so that plugin modules for these purposes can be integrated more easily.

Acknowledgement

This research work was partly supported by The National Major Research Program KC.01/11-15 (code KC.01.TN04/11-15) under the project "Analyzing opinion's trend based on social network and its application in tourism and technology products".

References

[1] J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '98, pages 37–45, New York, NY, USA, 1998. ACM.
[2] J. Borsje, F. Hogenboom, and F. Frasincar. Semi-automatic financial events discovery based on lexico-semantic patterns. Int. J. Web Eng. Technol., 6(2):115–140, Jan. 2010.
[3] K. B. Cohen, K. Verspoor, H. L. Johnson, C. Roeder, P. V. Ogren, W. A. Baumgartner, Jr., E. White, H. Tipney, and L. Hunter. High-precision biological event extraction with a concept recognizer. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, BioNLP '09, pages 50–58, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[4] F. Hogenboom, F. Frasincar, U. Kaymak, and F. de Jong. An overview of event extraction from text. Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web, 2011.
[5] R. Grishman and B. Sundheim. Message understanding conference-6: a brief history. In Proceedings of the 16th conference on Computational linguistics - Volume 1, COLING '96, pages 466–471, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.
[6] H. Ji and R. Grishman. Refining event extraction through cross-document inference. In Proc., 2008.
[7] R. Grishman, D. Westbrook, and A. Meyers. NYU's English ACE 2005 system description. ACE Program, 2005.
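
Note on Section 3.2 (feature selection). The χ2 weighting of uni-gram, bi-gram and tri-gram features can be sketched as below. This is a minimal Python sketch, not the authors' implementation: it assumes whitespace tokenization and the standard χ2 statistic for a 2x2 term/class contingency table, and the names `ngrams`, `chi_square` and `score_features` are hypothetical. The weights listed in Table 1 lie between 0 and 1, so the paper presumably applies some normalization that it does not specify; the statistic here is left unnormalized.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[Gram]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: EVENT titles containing the term    b: NON-EVENT titles containing it
    c: EVENT titles without the term       d: NON-EVENT titles without it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - c * b) ** 2 / denom


def score_features(
    titles: Iterable[Tuple[List[str], bool]], max_n: int = 3
) -> Dict[Gram, float]:
    """Score every 1..max_n-gram seen in labelled titles (tokens, is_event)."""
    with_term = {True: Counter(), False: Counter()}
    totals = Counter()
    for tokens, is_event in titles:
        totals[is_event] += 1
        # Count each n-gram at most once per title (document frequency).
        grams = {g for n in range(1, max_n + 1) for g in ngrams(tokens, n)}
        with_term[is_event].update(grams)
    scores = {}
    for gram in set(with_term[True]) | set(with_term[False]):
        a, b = with_term[True][gram], with_term[False][gram]
        scores[gram] = chi_square(a, b, totals[True] - a, totals[False] - b)
    return scores
```

Features with the highest scores would then be kept and passed, together with their semantic labels, to the Maximum Entropy classifier; how many features the authors retain is not stated in the paper.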
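
Note on rule (5) in Section 3.3 (indirect time extraction). The dictionary lookup followed by DATE = PUBLISH_TIME + BIAS can be sketched as below. This is an illustrative Python sketch under stated assumptions: the dictionary holds only the four entries quoted in the paper, the function name `resolve_relative_date` is hypothetical, matching is a plain substring search on the lower-cased text, and the publish date used in the usage note is invented for illustration.

```python
import unicodedata
from datetime import date, timedelta
from typing import Optional

# (RELATIVE_WORD, BIAS) pairs quoted in the paper; BIAS is an offset in days.
RELATIVE_TIME = {
    "hôm nay": 0,          # today
    "rạng sáng": 0,        # this morning
    "hôm qua": -1,         # yesterday
    "hai ngày trước": -2,  # two days ago
}


def _normalize(text: str) -> str:
    """Lower-case and NFC-normalize so that diacritics compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


def resolve_relative_date(text: str, publish_date: date) -> Optional[date]:
    """Apply rule (5): DATE = PUBLISH_TIME + BIAS for the first matching key."""
    normalized = _normalize(text)
    # Try longer keys first so a short key cannot shadow a longer one.
    for word in sorted(RELATIVE_TIME, key=len, reverse=True):
        if _normalize(word) in normalized:
            return publish_date + timedelta(days=RELATIVE_TIME[word])
    return None  # no relative expression found; fall back to the direct method
```

For the example title from Section 3.3, `resolve_relative_date("Hôm nay, hai vụ tai nạn giao thông xảy ra trên đường Khuất Duy Tiến.", date(2012, 4, 13))` returns the publish date itself, because the bias for "hôm nay" is 0.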
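
Note on rule (6) in Section 3.3 (location extraction). The pattern LOCPREP LOCPREFIX LOCATION maps naturally onto a regular expression. The sketch below is one possible reading in Python, not the authors' code. Assumptions: the prepositions tại, trong and vào are inferred from the English glosses (at, in, into); the three-entry `LOCATION_DICTIONARY` and the name `extract_location` are hypothetical; and constituents are assumed to be separated by whitespace.

```python
import re
import unicodedata
from typing import List, Optional

LOCPREP = ["ở", "tại", "trong", "gần", "vào"]
LOCPREFIX = ["thành phố", "tỉnh", "quận", "thị xã", "xã", "phố"]
# Hypothetical stand-in for the paper's location dictionary.
LOCATION_DICTIONARY = ["Hà Nội", "Cầu Giấy", "Hưng Yên"]


def _alternation(items: List[str]) -> str:
    """Build a regex alternation, longest item first ("thị xã" before "xã")."""
    ordered = sorted(items, key=len, reverse=True)
    return "|".join(re.escape(unicodedata.normalize("NFC", i)) for i in ordered)


# <LOCPREP> <LOCPREFIX> <LOCATION>, anchored on word boundaries.
LOCATION_RULE = re.compile(
    r"(?<!\w)(?:" + _alternation(LOCPREP) + r")\s+"
    r"(?P<prefix>" + _alternation(LOCPREFIX) + r")\s+"
    r"(?P<location>" + _alternation(LOCATION_DICTIONARY) + r")(?!\w)",
    re.IGNORECASE,
)


def extract_location(text: str) -> Optional[str]:
    """Return "<prefix> <location>" for the first match of rule (6), else None."""
    match = LOCATION_RULE.search(unicodedata.normalize("NFC", text))
    if match is None:
        return None
    return f"{match.group('prefix')} {match.group('location')}"
```

A dictionary lookup of this kind cannot by itself resolve the ambiguity noted in Section 4, where several places share similar names; it only returns the first dictionary entry that fits the pattern.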
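
Note on Section 4 (evaluation measures). The figures in Tables 3 and 4 follow from the usual definitions: precision is the share of extracted events that are correct, and F1 is the harmonic mean of precision and recall. The short Python sketch below recomputes them; the function names are hypothetical. Two caveats apply. The averaged F1 of 91.61 in Table 3 is the mean of the per-fold F1 values, so it differs slightly from the harmonic mean of the averaged precision and recall. A few Table 4 entries (for example 54 of 59) appear to be truncated rather than rounded, so recomputed values can differ in the last digit.

```python
def precision_pct(correct: int, extracted: int) -> float:
    """Precision in percent: correct extractions over all extractions."""
    if extracted <= 0:
        raise ValueError("extracted must be positive")
    return 100.0 * correct / extracted


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For the first day in Table 4, `round(precision_pct(43, 47), 2)` gives 91.49, which matches the reported value.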

Date posted: 16/12/2017, 02:57
