Master's thesis: Applying neural networks to public opinion analysis

UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI

PHAM DINH TAI

SENTIMENT ANALYSIS USING NEURAL NETWORK

Major: Computer Science
Code: 60.48.01.01

MASTER OF COMPUTER SCIENCE

Supervisor: Assoc. Prof. Dr. Le Anh Cuong

Ha Noi - 2016

ORIGINALITY STATEMENT

I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have studied at UET or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.

Signature

Abstract

Sentiment analysis and opinion mining is an important task in natural language processing and data mining. Opinions in users' comments on social networks, forums, blogs, and similar sites are very useful for new users who are looking for a good service or product. They are also useful for service providers and companies that want to improve their products based on comments from customers. Therefore, a large number of recent studies have focused on the problem of opinion mining and sentiment analysis. In this research field, there are several essential problems, including subjectivity classification, polarity classification, aspect-based sentiment analysis, and sentiment rating. This thesis focuses on two of the above problems. For the
first one, subjectivity classification, a review is classified into one of two classes, subjective or objective. An objective text expresses factual information, while a subjective one usually gives personal views and opinions. In fact, subjective sentences can express many types of information, e.g., opinions, evaluations, emotions, beliefs, speculations, judgments, allegations, stances, etc. Given a text, we determine whether it is subjective or objective. The second problem we address is review rating. We use a neural network to solve this problem.

Acknowledgements

First and foremost, I would like to offer my sincerest gratitude to my supervisor, Assoc. Prof. Dr. Le Anh Cuong, who always supported me throughout my research with patience. He was always there when I needed help, and responded to queries helpfully and promptly. I attribute the level of my Master's degree to his encouragement and effort. Without him, this thesis would not have come into being. I could never wish for a better or kinder supervisor.

I would like to give my honest appreciation to my group friends, Le Ngoc Anh, Nguyen Ngoc Truong, and Dao Bao Linh, who study at my school, for everything they did for me.

I am very grateful to Mrs. Nguyen Thi Xuan Huong and Mr. Pham Duc Hong, graduate students at University of Engineering and Technology (UET), for providing me with the methods and data required for sentiment analysis.

Special thanks to Trinh Quyet Thang, a student at University of Engineering and Technology (UET), for providing me with the forum data and helping me with the source code required for sentiment analysis.

Last but not least, I am very grateful to my family, whom I love the most in this world and without whom I cannot imagine living my life. Thank you!
Contents

Acknowledgements
Contents
List of Tables
List of Figures
List of Abbreviations
Chapter 1 Introduction
  1.1 Motivation
  1.2 Sentiment Analysis Problems
    1.2.1 Problem Description
    1.2.2 Different Levels of Analysis
    1.2.3 Natural Language Processing Issues
  1.3 About This Thesis
    1.3.1 Thesis Aims
    1.3.2 Thesis Structure
Chapter 2 Sentiment Analysis and Methods
  2.1 Opinion Definition
  2.2 Sentiment Analysis Tasks
  2.3 Subjectivity and Emotion
  2.4 Document Sentiment Classification
    2.4.1 Sentiment Classification Using Supervised Learning
    2.4.2 Sentiment Rating Prediction
  2.5 Dictionary based Approach & Corpus Approach
Chapter 3 Subjective Document Detection
  3.1 Subjectivity Classification Problem
  3.2 General Framework
  3.3 Building the Classifier
Chapter 4 Sentiment Analysis with Neural Networks
  4.1 Neural Network
  4.2 Problem of Sentiment Rating
    4.2.1 Formulating the Problem
Chapter 5 Experiments
  5.1 Data set
  5.2 Sentiment Analysis with Subjectivity
    5.2.1 Data presentation
    5.2.2 Feature extraction
    5.2.3 Experimental Results
  5.3 Sentiment analysis with ratings
    5.3.1 Dataset
    5.3.2 Feature Extraction
    5.3.3 Machine learning
Conclusion

List of Tables

Table 5.1 Data set
Table 5.2 Result of machine learning
Table 5.3 Result using perceptron with 200 loops
Table 5.4 Result with 200 iterations

List of Figures

1.1 Example of a hotel review by a customer
2.3 Example of an opinion by a user
3.2 General framework for subjectivity classification
4.1 Simple structure of a biological neural network
4.2 Model of a neural network with one neuron
4.3 Neural network by axes of coordinates
4.4 General model for learning overall rating from sentiment words using a neural network

List of Abbreviations

NLP: Natural Language Processing
SVM: Support Vector Machines
POS: Part Of Speech
OVA: One vs All
NNRating: Neural Network Rating
BP: Back-Propagation
UET: University of Engineering and Technology

3.3 Building the Classifier

Feature Selection: Many approaches to the subjectivity classification problem have been presented in previous research on English data, and these approaches can also be applied to Vietnamese data. For classification in general, and subjectivity classification in particular, identifying the set of features plays an important role and affects the performance of the system. Previous research has proposed many methods for building the set of features. These methods usually use sequences of words (n-grams) generated from the text as features. In addition, patterns have been proposed to enrich the features. Some studies used patterns to extract phrases that contain adjectives or verbs. These approaches require knowledge of the grammar and vocabulary of the specific language in order to extract the patterns. Other expressions, such as word lengthening, emoticons, and domain, are also used for feature representation. In summary, the features may contain lexical, syntactic, or semantic information. They take the form of N-grams, or of patterns such as a sequence of linguistic elements, where each element contains a word surface form, a syntactic label, or a Part-Of-Speech tag.

To extract linguistic features from sentences, related studies usually use syntactic information, which can be generated with the Stanford Parser tool (for English). Stanford Parser is a natural language parser developed by the Stanford Natural Language Processing Group. It uses probabilistic methods to work out parse trees for texts. Take the following sentence as an example: "smart and alert, thirteen conversations about one thing is a small gem."
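As a minimal sketch of the N-gram features discussed above, the following Python snippet extracts word 1-grams and 2-grams from the example sentence. The function name `ngram_features`, the regular-expression tokenizer, and the tiny stop-word set are assumptions made for illustration; the thesis does not specify its tokenizer or stop-word list for English.

```python
import re

# Hypothetical, minimal stop-word set used only for this illustration.
STOP_WORDS = {"and", "is", "a"}

def ngram_features(sentence, n, stop_words=frozenset()):
    """Return the word n-grams of a sentence (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    tokens = [t for t in tokens if t not in stop_words]
    # Slide a window of n tokens over the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "smart and alert, thirteen conversations about one thing is a small gem."
unigrams = ngram_features(sentence, 1, STOP_WORDS)  # stop words removed
bigrams = ngram_features(sentence, 2)               # every adjacent word pair
```

With this stop-word set, the 1-grams are smart, alert, thirteen, conversations, about, one, thing, small, gem. The 2-gram call keeps stop words and ignores punctuation, so it returns every adjacent pair of words in the sentence.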
For the example sentence above, the Stanford Parser generates the following output:

(ROOT
  (S
    (NP
      (NP
        (ADJP (JJ smart)
          (CC and)
          (JJ alert) (, ,) (JJ thirteen))
        (NNS conversations))
      (PP (IN about)
        (NP (CD one) (NN thing))))
    (VP (VBZ is)
      (NP (DT a) (JJ small) (NN gem)))
    (. .)))

If we use 1-grams, we have the following features (note that stop words should be removed): smart, alert, thirteen, conversations, about, one, thing, small, gem.

If we use 2-grams, we have the following features: smart and, and alert, thirteen conversations, conversations about, one thing, thing is, a small, small gem.

If we use the 2-gram pattern containing adjectives, we have the following features: (JJ alert), (JJ thirteen), (JJ small).

This work restricts its features to N-grams and applies them to a Vietnamese corpus.

Classification Methods: After the features are extracted from the data, many classification techniques from statistical machine learning have been applied to the sentiment classification task, such as Support Vector Machines (SVM), Naive Bayes (NB), and Maximum Entropy (ME). Previous studies achieved fairly good results across different domains. Because these classification models are well known in the community, we do not describe them in detail here. Note that in this work we use SVM as the machine learning method.

Chapter 4 Sentiment Analysis with Neural Networks

Sentiment rating is a main task in this thesis. We first introduce the basic neural network model, and then introduce the problem of sentiment rating using the NN model.

4.1 Neural Network

The content of this section is mainly borrowed from [11]. We present it here only to introduce the fundamental knowledge. A neural network is a mathematical model inspired by the way biological neural networks function in the brain. An artificial neural network is a framework for distributed representation capable of performing both classification and clustering. In biological neural networks, the central nervous system consists of neurons, axons, dendrites, and synapses. Dendrites,
synapses, and axons help transmit electrical pulses between the neurons.

Figure 4.1 Simple structure of a biological neural network

An artificial neural network is made up of simple processing units, called neurons or cells, that communicate with each other by sending signals over a large number of weighted connections. Every neuron has a state of activation θk. The connection between neurons is defined by xik, which determines the effect that the signal of neuron i has on neuron k. A propagation rule ŷi determines the effective input a neuron gets from its external inputs.

Figure 4.2 Model of a neural network with one neuron

The activation function ŷi = P(yi = 1 | xi, θ) determines the new level of activation based on ŷi and the current activation θ. A unit also has an output uk, a method for information gathering, and an environment within which the system must operate. The total input to a unit is defined by

\[ \hat{y}_i = \sum_{l=1}^{k} x_l \, \theta_{il}(t) + \theta_{i0}(t) \]

that is, the weighted sum of the separate outputs from each of the connected neurons, plus θi0 as the bias or offset. The effect of the total input on the activation of a unit is defined by the activation function. The weights θil and the bias θi0 are updated at time t+1 as follows:

\[ \theta_{il}(t+1) = \theta_{il}(t) + \Delta\theta_{il}(t), \quad 1 \le l \le k \]
\[ \theta_{i0}(t+1) = \theta_{i0}(t) + \Delta\theta_{i0}(t) \]

Generally, neural networks can be divided by topology into feed-forward and recurrent neural networks. In feed-forward neural networks, data flows only from input units to output units, and is processed over multiple layers of units. Recurrent networks contain feedback connections from the outputs of neurons to the inputs of neurons that processed the data earlier.

A neural network is a "connectionist" computational system. The computational systems we usually write are procedural: a program starts at the first line of code, executes it, and goes on to the next, following instructions in a linear fashion. A true neural network does not follow a linear path. Rather, information is processed collectively, in parallel, throughout a network of nodes (the nodes, in
this case, being neurons). The perceptron itself can be diagrammed as follows:

[Figure: axes x1 and x2, with weight vectors (θ1, θ2) and (θ1′, θ2′)]
Figure 4.3 Neural network by axes of coordinates

The simple perceptron algorithm:
1) θ0 ← 0; i ← 0;
2) Get (u, ŷi)
3) Predict P ← sign(θi^T u)
4) If Mistake (P ≠ ŷi):
   4.1) θi+1 ← θi + ŷi⋅u;
   4.2) i ← i + 1;
5) Goto 2)

The most well-known and widely used learning method for multi-layer feed-forward neural networks is back-propagation (BP). The main idea behind back-propagation is that the errors of the units of the hidden layer are determined by the errors of the units of the output layer. When a learning pattern is fed into the network, activation values are propagated to the output and then compared to the desired output. The error εo for a particular output unit u, i.e. the difference between the back-propagation output and the target output, governs the training process, whose ultimate aim is to bring the error εo close to zero.

4.2 Problem of Sentiment Rating

We formally define the problem of sentiment rating using Neural Network Rating (NNRating). As a computational problem, NNRating assumes that the input is a set of reviews of some entity of interest (e.g., forums, hotels, mobile phones), where each review has an overall rating. Such a review format is quite common on most merchant web sites, e.g. Amazon (www.amazon.com), Yelp (www.yelp.com), TripAdvisor (www.tripadvisor.com), and thegioididong.com, and the number of such reviews is growing constantly.

Formally, let D = {x1, x2, ..., x|D|} be a set of review text documents for an entity or topic of interest, where each review document x ∈ D is associated with an overall rating rx. We also assume that there are n unique sentiment words in the vocabulary. Specifically, we assume that the reviewer generates the overall rating of a review based on a weighted combination of his or her ratings on all sentiment words. We formulate the problem as learning the weights of the sentiment words so as to generate the overall ratings given in the training data.
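The weighted-combination assumption above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the activation is taken to be the identity, the weights are fitted by plain gradient descent on the squared error, and the function names, learning rate, epoch count, sentiment dictionary, and toy reviews are all made up for the example.

```python
def featurize(text, sentiment_words):
    """Count how often each dictionary word occurs in the review text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in sentiment_words]

def predict(weights, bias, features):
    """Overall rating as a weighted combination of sentiment-word features.
    Assumption: identity activation, so the rating is the weighted sum itself."""
    return sum(w * b for w, b in zip(features, weights)) + bias

def train(samples, n_words, lr=0.05, epochs=2000):
    """Fit the weights by gradient descent on the squared rating error."""
    weights, bias = [0.0] * n_words, 0.0
    for _ in range(epochs):
        for features, rating in samples:
            err = rating - predict(weights, bias, features)
            # Move each weight in proportion to the error and its input.
            weights = [b + lr * err * w for b, w in zip(weights, features)]
            bias += lr * err
    return weights, bias

# Toy, made-up dictionary and reviews, for illustration only.
sentiment_words = ["good", "bad"]
reviews = [("good good phone", 5.0), ("good phone", 4.0),
           ("bad phone", 2.0), ("bad bad phone", 1.0)]
samples = [(featurize(text, sentiment_words), rating) for text, rating in reviews]
weights, bias = train(samples, len(sentiment_words))
```

On this toy data the learned weight for "good" is positive and the weight for "bad" is negative, so each occurrence of a sentiment word moves the predicted rating up or down from the bias.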
Figure 4.4 below shows the formulation of this problem.

Figure 4.4: General model for learning overall rating from sentiment words using Neural Network

4.2.1 Formulating the Problem

An overall rating rx of a review document x is a numerical rating indicating different levels of overall opinion of x, i.e. rx ∈ [rmin, rmax], where rmin and rmax are the minimum and maximum ratings respectively. We further assume that we are given a set of sentiment words, which are rating factors that potentially affect the overall rating of the given topic. We first build a dictionary of sentiment words; then, given a text, we look up which of its words belong to the dictionary. This gives a list of the sentiment words appearing in the input documents. These words, with their weights, are combined to generate the overall rating.

Let β = (β1, β2, ..., βn) be the weight vector of the input words, and let W = (w1, w2, ..., wn) be the feature vector (i.e. the vector of sentiment words) representing an input document. Then, according to the NN model,

\[ y = g(v) = g\left( \sum_{i=1}^{n} w_i \beta_i + \beta_0 \right) \]

Given the training dataset D = {(wj, sj)}, 1 ≤ j ≤ m, let yj denote the output for input wj. Then the error function is defined by:

\[ E = \frac{1}{2} \sum_{j=1}^{m} (s_j - y_j)^2 \]

To learn the weights, we use the back-propagation algorithm. The main steps of the algorithm are:

Step 1: Initialize β = (β1, β2, ..., βn) and compute the function y = g(v) with

\[ v = \sum_{i=1}^{n} w_i \beta_i + \beta_0 \]

Step 2: Update β according to the following formulation:

\[ \beta_i(t+1) = \beta_i(t) + \Delta\beta_i(t) \]

with

\[ \Delta\beta_i(t) = -\eta \frac{\partial E}{\partial \beta_i(t)} = \eta \, (s_j - y_j(t)) \, g'(v) \, \frac{\partial v}{\partial \beta_i(t)} \]

Step 2 is repeated until convergence.

Output: β = (β1, β2, ..., βn) and β0.

Chapter 5 Experiments

5.1 Data set

We crawled 5316 mobile phone reviews from www.thegioididong.com, tinhte.vn, sohoa.vnexpress.net, fptshop.vn, and pico.vn. We chose this data set for evaluation because, in addition to the overall rating, reviewers also provided aspect ratings in each review: fact, advertise, question, answer, spam, sentiment, objective, ranking
by level, which can serve as ground truth for quantitative evaluation of latent aspect rating prediction.

5.2 Sentiment Analysis with Subjectivity

5.2.1 Data presentation

In practice, a text crawled from social media does not fall into only two states, having an opinion or not having one. There are other categories that are also valuable to classify and detect, beyond assigning a text to the opinion or fact group, for example question, answer, and advertising. For that reason, we extend the task of subjectivity classification from two classes (subj and obj) to multiple classes, as follows:
+ "sent" - sentiment: containing opinion
+ "answ" - answer: containing answers
+ "ques" - question: containing questions
+ "ad" - advertise: containing advertisement
+ "spam" - spam: not related to the entity in focus
+ "sentiment": containing opinion (but not advertisement)
+ "objective": containing factual information

We first perform simple pre-processing on these reviews:
a. Converting words to lower case;
b. Removing punctuation, stop words defined in [12], and rare terms (those occurring fewer than a minimum number of times in the corpus);
c. Stemming each word to its root with the Porter Stemmer [13].

After aspect segmentation, we discarded the sentences that could not be associated with any aspect. If we required every review to contain all the aspect descriptions, there would be 5316 reviews left covering mobile phones. To avoid sparseness and missing aspect descriptions in the reviews, we concatenated all the reviews commenting on the same mobile phone into a new "review" (which we call an "h-review") and averaged the overall and aspect ratings over them as the ground-truth ratings. After this processing, we have a corpus of mobile phone h-reviews built from 5316 reviews; the details are given in Table 5.1.

Name                 Sum
Number of reviews    5316
Fact                 372
Advertise            379
Question             353
Answer               190
Spam                 581
Sentiment            7375
Objective            2898

Table 5.1: Data
set

5.2.2 Feature extraction

- Convert words to lower case and apply tokenization and word segmentation. Then use the "bag of words" method, which treats each comment as an unordered collection of words. We split each comment into 1-grams and 2-grams to obtain the set of features for the comments.

For example:
Công nhận camera của e này chụp chán thật. Chụp chữ viết nhòe, không nhìn thấy
("Admittedly, this one's camera takes really poor pictures. Photographed text comes out blurry and unreadable.")

After pre-processing we have:
công_nhận camera của e này chụp chán thật chụp chữ viết nhòe , không nhìn thấy

and the features after splitting:
1-gram: công_nhận, camera, của, e, này, chụp, chán, thật, chữ, viết, nhòe, không, nhìn, thấy, ...
2-gram: công_nhận camera, camera của, của e, e này, này chụp, ...

Finally, after splitting the comments and converting them to features, we use tf-idf weighting to convert the features into vectors.

Feature selection: Using the Chi-square test to select features, we obtain 3200 strong features. Some of these features:
1-gram: p, quá, hơn, ngon, b n, nhưng, rất, thích, ...
2-gram: sở_thích, thèm_quá, u_thua, tuổi_gì, giật_lag, cảm_tình, ...

5.2.3 Experimental Results

With 80% of the data used for training and 20% for testing, we obtain:

Precision    0.823
Recall       0.814
F-Measure    0.816

Table 5.2: Result of subjectivity classification

5.3 Sentiment analysis with ratings

5.3.1 Dataset

The training data consists of users' comments from mobile phone reviews, both positive and negative. Reviews are rated from 1 to 5, where 1 is the most negative and 5 is the most positive.
- Data: 2279 reviews labelled 1* to 5*

5.3.2 Feature Extraction

- Convert words to lower case and apply tokenization and word segmentation
- Features: 1-grams and 2-grams
- Feature selection: Chi-square test

5.3.3 Machine learning

Running the NNRating algorithm for 200 loops, the results for the final loops are:

Loop    Result
191     0.8207769873014197
192     0.8205033745111205
193     0.820230714193611
194     0.8199591343351208
195     0.8196887288557506
196     0.819419253020942
197     0.8191511045926922
198     0.8188839100763329
199     0.8186182654080822
200     0.8183597948585247

Table
5.3: Result using perceptron with 200 loops

And the final result:

Mean error on the training set    0.82/5
Mean error on the test set        0.83/5

Table 5.4: Result with 200 iterations

Conclusion

In this thesis, we have studied the problem of opinion mining and sentiment analysis, focusing on two sub-tasks: subjectivity classification and sentiment rating. We have achieved the following results:
- Firstly, we have studied and understood the basic concepts, issues, general approaches, and methods in the field of opinion mining and sentiment analysis.
- Secondly, we have focused on two problems: subjectivity classification and sentiment rating.
- Thirdly, we have followed established approaches to the two problems, using SVM for subjectivity classification and a neural network for sentiment rating.
- Finally, we have prepared the training data and implemented the necessary experiments for the two problems.

In future work, we will study feature selection in more depth and use stronger machine learning methods to improve accuracy on both problems.

REFERENCES

[1] Hu, Minqing and Bing Liu. Mining and summarizing customer reviews. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004). 2004.
[2] Riloff, Ellen, Siddharth Patwardhan, and Janyce Wiebe. Feature subsumption for opinion analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2006). 2006.
[3] Wiebe, Janyce and Ellen Riloff. Creating subjective and objective sentence classifiers from unannotated texts. Computational Linguistics and Intelligent Text Processing, p. 486-497. 2005.
[4] Zhang, Lei and Bing Liu. Identifying noun product features that imply opinions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (short paper) (ACL-2011). 2011.
[5] Parrott, W. Gerrod. Emotions in social psychology: Essential readings. Psychology Press. 2001.
[6] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?
sentiment classification using machine learning techniques. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2002). 2002.
[7] Pang, Bo and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of Meeting of the Association for Computational Linguistics (ACL-2005). 2005.
[8] Goldberg, Andrew B. and Xiaojin Zhu. Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. In Proceedings of HLT-NAACL 2006 Workshop on Textgraphs: Graph-based Algorithms for Natural Language Processing. 2006.
[9] Wan, Xiaojun. Co-training for cross-lingual sentiment classification. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP (ACL-IJCNLP-2009). 2009.
[10] B. Liu. Sentiment analysis and subjectivity. Available from http://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-analysis.pdf, viewed on 30/08/2011.
[11] M. H. Hassoun. Fundamentals of artificial neural networks. The MIT Press, 1995.
[12] Onix text retrieval toolkit stopword list. http://www.lextek.com/manuals/onix/stopwords1.html
[13] M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
