Scientific report: "Word representations: A simple and general method for semi-supervised learning"

Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics

Word representations: A simple and general method for semi-supervised learning

Joseph Turian
Département d'Informatique et Recherche Opérationnelle (DIRO), Université de Montréal, Montréal, Québec, Canada, H3T 1J4
lastname@iro.umontreal.ca

Lev Ratinov
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
ratinov2@uiuc.edu

Yoshua Bengio
Département d'Informatique et Recherche Opérationnelle (DIRO), Université de Montréal, Montréal, Québec, Canada, H3T 1J4
bengioy@iro.umontreal.ca

Abstract

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/

1 Introduction

By using unlabelled data to reduce data sparsity in the labeled training data, semi-supervised approaches improve generalization accuracy. Semi-supervised models such as Ando and Zhang (2005), Suzuki and Isozaki (2008), and Suzuki et al. (2009) achieve state-of-the-art accuracy. However, these approaches dictate a particular choice of model and training regime. It can be tricky and time-consuming to adapt an existing supervised NLP system to use these semi-supervised techniques.
It is preferable to use a simple and general method to adapt existing supervised NLP systems to be semi-supervised.

One approach that is becoming popular is to use unsupervised methods to induce word features (or to download word features that have already been induced), plug these word features into an existing system, and observe a significant increase in accuracy. But which word features are good for what tasks? Should we prefer certain word features? Can we combine them?

A word representation is a mathematical object associated with each word, often a vector. Each dimension's value corresponds to a feature and might even have a semantic or grammatical interpretation, so we call it a word feature. Conventionally, supervised lexicalized NLP approaches take a word and convert it to a symbolic ID, which is then transformed into a feature vector using a one-hot representation: The feature vector has the same length as the size of the vocabulary, and only one dimension is on. However, the one-hot representation of a word suffers from data sparsity: Namely, for words that are rare in the labeled training data, their corresponding model parameters will be poorly estimated. Moreover, at test time, the model cannot handle words that do not appear in the labeled training data. These limitations of one-hot word representations have prompted researchers to investigate unsupervised methods for inducing word representations over large unlabeled corpora. Word features can be hand-designed, but our goal is to learn them.

One common approach to inducing unsupervised word representation is to use clustering, perhaps hierarchical. This technique was used by a variety of researchers (Miller et al., 2004; Liang, 2005; Koo et al., 2008; Ratinov & Roth, 2009; Huang & Yates, 2009).
This leads to a one-hot representation over a smaller vocabulary size. Neural language models (Bengio et al., 2001; Schwenk & Gauvain, 2002; Mnih & Hinton, 2007; Collobert & Weston, 2008), on the other hand, induce dense real-valued low-dimensional word embeddings using unsupervised approaches. (See Bengio (2008) for a more complete list of references on neural language models.)

Unsupervised word representations have been used in previous NLP work, and have demonstrated improvements in generalization accuracy on a variety of tasks. But different word representations have never been systematically compared in a controlled way. In this work, we compare different techniques for inducing word representations, evaluating them on the tasks of named entity recognition (NER) and chunking.

We retract former negative results published in Turian et al. (2009) about Collobert and Weston (2008) embeddings, given training improvements that we describe in Section 7.1.

2 Distributional representations

Distributional word representations are based upon a cooccurrence matrix F of size W × C, where W is the vocabulary size, each row F_w is the initial representation of word w, and each column F_c is some context. Sahlgren (2006) and Turney and Pantel (2010) describe a handful of possible design decisions in constructing F, including choice of context types (left window? right window? size of window?) and type of frequency count (raw? binary? tf-idf?). F_w has dimensionality C, which can be too large to use F_w as features for word w in a supervised model. One can map F to matrix f of size W × d, where d ≪ C, using some function g, where f = g(F). f_w represents word w as a vector with d dimensions.
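As a rough illustration of the matrix F described above, the following Python sketch builds a small cooccurrence matrix from a toy corpus. The function name build_cooccurrence, the toy corpus, the window size, and the use of separate left and right context columns are assumptions made for illustration; they are not the settings of any work cited here.

```python
from collections import Counter

def build_cooccurrence(sentences, window=2, binary=False):
    """Build a W x 2W cooccurrence matrix F as nested lists.

    Column j < W counts vocabulary word j seen to the LEFT of the row word;
    column W + j counts it to the RIGHT (an illustrative design decision).
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    W = len(vocab)
    counts = Counter()
    for sent in sentences:
        for pos, word in enumerate(sent):
            lo = max(0, pos - window)
            hi = min(len(sent), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos == pos:
                    continue
                # Left contexts use columns [0, W); right contexts use [W, 2W).
                offset = 0 if ctx_pos < pos else W
                counts[(index[word], offset + index[sent[ctx_pos]])] += 1
    F = [[0] * (2 * W) for _ in range(W)]
    for (row, col), n in counts.items():
        # "binary" switches from raw counts to presence/absence.
        F[row][col] = 1 if binary else n
    return vocab, F

# Toy corpus (hypothetical), one-word window on each side.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, F = build_cooccurrence(corpus, window=1)
```

Each row F[i] is the initial representation of vocab[i]. Changing window or binary corresponds to the context-type and frequency-count decisions listed above.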
The choice of g is another design decision, although perhaps not as important as the statistics used to initially construct F.

The self-organizing semantic map (Ritter & Kohonen, 1989) is a distributional technique that maps words to two dimensions, such that syntactically and semantically related words are nearby (Honkela et al., 1995; Honkela, 1997).

LSA (Dumais et al., 1988; Landauer et al., 1998), LSI, and LDA (Blei et al., 2003) induce distributional representations over F in which each column is a document context. In most of the other approaches discussed, the columns represent word contexts. In LSA, g computes the SVD of F.

Hyperspace Analogue to Language (HAL) is another early distributional approach (Lund et al., 1995; Lund & Burgess, 1996) to inducing word representations. They compute F over a corpus of 160 million word tokens with a vocabulary size W of 70K word types. There are 2·W types of context (columns): The first or second W are counted if the word c occurs within a window of 10 to the left or right of the word w, respectively. f is chosen by taking the 200 columns (out of 140K in F) with the highest variances. ICA is another technique to transform F into f (Väyrynen & Honkela, 2004; Väyrynen & Honkela, 2005; Väyrynen et al., 2007). ICA is expensive, and the largest vocabulary size used in these works was only 10K. As far as we know, ICA methods have not been used when the size of the vocabulary W is 100K or more.

Explicitly storing the cooccurrence matrix F can be memory-intensive, and transforming F to f can be time-consuming. It is preferable that F never be computed explicitly, and that f be constructed incrementally. Řehůřek and Sojka (2010) describe an incremental approach to inducing LSA and LDA topic models over 270 million word tokens with a vocabulary of 315K word types. This is similar in magnitude to our experiments.

Another incremental approach to constructing f is using a random projection: the linear mapping g multiplies F by a random matrix chosen a priori.
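The random projection just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name random_projection, the toy matrix, and the choice of random entries drawn from {-1, +1} scaled by 1/sqrt(d) are ours, not taken from the cited work.

```python
import random

def random_projection(F, d, seed=0):
    """Map each row of F (length C) to d dimensions: f = F R.

    R is a C x d matrix with entries drawn from {-1, +1}, scaled by
    1/sqrt(d); this particular distribution is an illustrative choice.
    """
    rng = random.Random(seed)  # fixed seed: R is "chosen a priori"
    C = len(F[0])
    scale = 1.0 / d ** 0.5
    R = [[rng.choice((-1.0, 1.0)) * scale for _ in range(d)] for _ in range(C)]
    return [[sum(row[c] * R[c][k] for c in range(C)) for k in range(d)]
            for row in F]

# Toy cooccurrence matrix (hypothetical): 3 words, 4 contexts.
F = [[1, 0, 2, 0], [0, 3, 0, 1], [1, 1, 1, 1]]
f = random_projection(F, d=2)
```

Because the mapping is linear, the projection of a sum of count vectors equals the sum of their projections. This is what allows f to be accumulated one cooccurrence at a time without ever storing F.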
This random indexing method is motivated by the Johnson-Lindenstrauss lemma, which states that for certain choices of random matrix, if d is sufficiently large, then the original distances between words in F will be preserved in f (Sahlgren, 2005). Kaski (1998) uses this technique to produce 100-dimensional representations of documents. Sahlgren (2001) was the first author to use random indexing using narrow context. Sahlgren (2006) does a battery of experiments exploring different design decisions involved in constructing F, prior to using random indexing. However, like all the works cited above, Sahlgren (2006) only uses distributional representation to improve existing systems for one-shot classification tasks, such as IR, WSD, semantic knowledge tests, and text categorization. It is not well-understood what settings are appropriate to induce distributional word representations for structured prediction tasks (like parsing and MT) and sequence labeling tasks (like chunking and NER). Previous research has achieved repeated successes on these tasks using clustering representations (Section 3) and distributed representations (Section 4), so we focus on these representations in our work.

3 Clustering-based word representations

Another type of word representation is to induce a clustering over words. Clustering methods and distributional methods can overlap. For example, Pereira et al. (1993) begin with a cooccurrence matrix and transform this matrix into a clustering.

3.1 Brown clustering

The Brown algorithm is a hierarchical clustering algorithm which clusters words to maximize the mutual information of bigrams (Brown et al., 1992). So it is a class-based bigram language model. It runs in time O(V·K²), where V is the size of the vocabulary and K is the number of clusters.

The hierarchical nature of the clustering means that we can choose the word class at several levels in the hierarchy, which can compensate for poor clusters of a small number of words.
One downside of Brown clustering is that it is based solely on bigram statistics, and does not consider word usage in a wider context.

Brown clusters have been used successfully in a variety of NLP applications: NER (Miller et al., 2004; Liang, 2005; Ratinov & Roth, 2009), PCFG parsing (Candito & Crabbé, 2009), dependency parsing (Koo et al., 2008; Suzuki et al., 2009), and semantic dependency parsing (Zhao et al., 2009).

Martin et al. (1998) present algorithms for inducing hierarchical clusterings based upon word bigram and trigram statistics. Ushioda (1996) presents an extension to the Brown clustering algorithm that learns hierarchical clusterings of words as well as phrases, and applies them to POS tagging.

3.2 Other work on cluster-based word representations

Lin and Wu (2009) present a K-means-like non-hierarchical clustering algorithm for phrases, which uses MapReduce.

HMMs can be used to induce a soft clustering, specifically a multinomial distribution over possible clusters (hidden states). Li and McCallum (2005) use an HMM-LDA model to improve POS tagging and Chinese Word Segmentation. Huang and Yates (2009) induce a fully-connected HMM, which emits a multinomial distribution over possible vocabulary words. They perform hard clustering using the Viterbi algorithm. (Alternately, they could keep the soft clustering, with the representation for a particular word token being the posterior probability distribution over the states.) However, the CRF chunker in Huang and Yates (2009), which uses their HMM word clusters as extra features, achieves F1 lower than a baseline CRF chunker (Sha & Pereira, 2003). Goldberg et al. (2009) use an HMM to assign POS tags to words, which in turn improves the accuracy of the PCFG-based Hebrew parser. Deschacht and Moens (2009) use a latent-variable language model to improve semantic role labeling.

4 Distributed representations

Another approach to word representation is to learn a distributed representation.
(Not to be confused with distributional representations.) A distributed representation is dense, low-dimensional, and real-valued. Distributed word representations are called word embeddings. Each dimension of the embedding represents a latent feature of the word, hopefully capturing useful syntactic and semantic properties. A distributed representation is compact, in the sense that it can represent an exponential number of clusters in the number of dimensions.

Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model (Bengio, 2008). Historically, training and testing of neural language models has been slow, scaling as the size of the vocabulary for each model computation (Bengio et al., 2001; Bengio et al., 2003). However, many approaches have been proposed in recent years to eliminate that linear dependency on vocabulary size (Morin & Bengio, 2005; Collobert & Weston, 2008; Mnih & Hinton, 2009) and allow scaling to very large training corpora.

4.1 Collobert and Weston (2008) embeddings

Collobert and Weston (2008) presented a neural language model that could be trained over billions of words, because the gradient of the loss was computed stochastically over a small sample of possible outputs, in a spirit similar to Bengio and Sénécal (2003). This neural model of Collobert and Weston (2008) was refined and presented in greater depth in Bengio et al. (2009).

The model is discriminative and non-probabilistic. For each training update, we read an n-gram x = (w_1, …, w_n) from the corpus. The model concatenates the learned embeddings of the n words, giving e(w_1) ⊕ … ⊕ e(w_n), where e is the lookup table and ⊕ is concatenation. We also create a corrupted or noise n-gram x̃ = (w_1, …, w_{n−1}, w̃_n), where w̃_n ≠ w_n is chosen uniformly from the vocabulary. (Footnote 1: In Collobert and Weston (2008), the middle word in the n-gram is corrupted.) For convenience, we write e(x) to mean e(w_1) ⊕ … ⊕ e(w_n).
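The lookup-and-concatenate step and the construction of a corrupted n-gram can be sketched as follows. This is only an illustration: the toy lookup table, the helper names embed_ngram and corrupt_last_word, and the fixed random seed are hypothetical and are not part of the released code.

```python
import random

def embed_ngram(ngram, table):
    """e(x): concatenate the embedding of each word in the n-gram."""
    vec = []
    for word in ngram:
        vec.extend(table[word])
    return vec

def corrupt_last_word(ngram, vocab, rng):
    """Replace the last word with a different word drawn uniformly."""
    candidates = [w for w in vocab if w != ngram[-1]]
    return ngram[:-1] + (rng.choice(candidates),)

# Toy lookup table with 2-dimensional embeddings (hypothetical values).
table = {"the": [0.1, 0.2], "cat": [0.3, 0.4],
         "sat": [0.5, 0.6], "dog": [0.7, 0.8]}
rng = random.Random(0)
x = ("the", "cat", "sat")
x_tilde = corrupt_last_word(x, sorted(table), rng)
e_x = embed_ngram(x, table)
e_x_tilde = embed_ngram(x_tilde, table)
```

Both e_x and e_x_tilde have length n·d (here 3 × 2 = 6), so the same network can score the observed and the corrupted n-gram.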
We predict a score s(x) for x by passing e(x) through a single hidden layer neural network. The training criterion is that n-grams that are present in the training corpus like x must have a score at least some margin higher than corrupted n-grams like x̃. Specifically:

    L(x) = max(0, 1 − s(x) + s(x̃))

We minimize this loss stochastically over the n-grams in the corpus, doing gradient descent simultaneously over the neural network parameters and the embedding lookup table.

We implemented the approach of Collobert and Weston (2008), with the following differences:
• We did not achieve as low log-ranks on the English Wikipedia as the authors reported in Bengio et al. (2009), despite initially attempting to have identical experimental conditions.
• We corrupt the last word of each n-gram.
• We had a separate learning rate for the embeddings and for the neural network weights. We found that the embeddings should have a learning rate generally 1000–32000 times higher than the neural network weights. Otherwise, the unsupervised training criterion drops slowly.
• Although their sampling technique makes training fast, testing is still expensive when the size of the vocabulary is large. Instead of cross-validating using the log-rank over the validation data as they do, we used the moving average of the training loss on training examples before the weight update.

4.2 HLBL embeddings

The log-bilinear model (Mnih & Hinton, 2007) is a probabilistic and linear neural model. Given an n-gram, the model concatenates the embeddings of the n − 1 first words, and learns a linear model to predict the embedding of the last word. The similarity between the predicted embedding and the current actual embedding is transformed into a probability by exponentiating and then normalizing.
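The log-bilinear prediction step described above can be sketched in plain Python. Several simplifications here are our assumptions: similarity is taken to be a dot product, bias terms are omitted, a single matrix C acts on the concatenated context, and normalization runs over the full vocabulary rather than a hierarchy. The function name lbl_probabilities and the toy values are hypothetical.

```python
import math

def lbl_probabilities(context, table, C, vocab):
    """Distribution over the next word under a log-bilinear-style model.

    context: the n-1 preceding words; C: a d x ((n-1)*d) matrix mapping the
    concatenated context embeddings to a predicted embedding.
    """
    concat = [v for w in context for v in table[w]]
    # Linear prediction of the last word's embedding.
    predicted = [sum(c * v for c, v in zip(row, concat)) for row in C]
    # Similarity (assumed: dot product) with every candidate's embedding.
    scores = [sum(p * q for p, q in zip(predicted, table[w])) for w in vocab]
    # Exponentiate and normalize; subtracting the max avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {w: e / total for w, e in zip(vocab, exps)}

# Toy 2-dimensional embeddings and a 2 x 4 prediction matrix (hypothetical).
table = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
C = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
probs = lbl_probabilities(("a", "b"), table, C, ["a", "b", "c"])
```

The sum over the whole vocabulary in the normalization step is the cost that grows linearly with vocabulary size.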
Mnih and Hinton (2009) speed up model evaluation during training and testing by using a hierarchy to exponentially filter down the number of computations that are performed. This hierarchical evaluation technique was first proposed by Morin and Bengio (2005). The model, combined with this optimization, is called the hierarchical log-bilinear (HLBL) model.

(Footnote 1, continued: In Bengio et al. (2009), the last word in the n-gram is corrupted.)

5 Supervised evaluation tasks

We evaluate the hypothesis that one can take an existing, near state-of-the-art, supervised NLP system, and improve its accuracy by including word representations as word features. This technique for turning a supervised approach into a semi-supervised one is general and task-agnostic.

However, we wish to find out if certain word representations are preferable for certain tasks. Lin and Wu (2009) find that the representations that are good for NER are poor for search query classification, and vice-versa. We apply clustering and distributed representations to NER and chunking, which allows us to compare our semi-supervised models to those of Ando and Zhang (2005) and Suzuki and Isozaki (2008).

5.1 Chunking

Chunking is a syntactic sequence labeling task. We follow the conditions in the CoNLL-2000 shared task (Sang & Buchholz, 2000).

The linear CRF chunker of Sha and Pereira (2003) is a standard near-state-of-the-art baseline chunker. In fact, many off-the-shelf CRF implementations now replicate Sha and Pereira (2003), including their choice of feature set:
• CRF++ by Taku Kudo (http://crfpp.sourceforge.net/)
• crfsgd by Léon Bottou (http://leon.bottou.org/projects/sgd)
• CRFsuite by Naoaki Okazaki (http://www.chokkan.org/software/crfsuite/)

We use CRFsuite because it makes it simple to modify the feature generation code, so one can easily add new features. We use SGD optimization, and enable negative state features and negative transition features.
("feature.possible_transitions=1, feature.possible_states=1")

Table 1 shows the features in the baseline chunker. As you can see, the Brown and embedding features are unigram features, and do not participate in conjunctions like the word features and tag features do. Koo et al. (2008) see further accuracy improvements on dependency parsing when using word representations in compound features.

The data comes from the Penn Treebank, and is newswire from the Wall Street Journal in 1989. Of the 8936 training sentences, we used 1000 randomly sampled sentences (23615 words) for development. We trained models on the 7936 training partition sentences, and evaluated their F1 on the development set. After choosing hyperparameters to maximize the dev F1, we would retrain the model using these hyperparameters on the full 8936 sentence training set, and evaluate on test. One hyperparameter was l2-regularization sigma, which for most models was optimal at 2 or 3.2. The word embeddings also required a scaling hyperparameter, as described in Section 7.2.

• Word features: w_i for i in {−2, −1, 0, +1, +2}, w_i ∧ w_{i+1} for i in {−1, 0}.
• Tag features: t_i for i in {−2, −1, 0, +1, +2}, t_i ∧ t_{i+1} for i in {−2, −1, 0, +1}, t_i ∧ t_{i+1} ∧ t_{i+2} for i in {−2, −1, 0}.
• Embedding features [if applicable]: e_i[d] for i in {−2, −1, 0, +1, +2}, where d ranges over the dimensions of the embedding e_i.
• Brown features [if applicable]: substr(b_i, 0, p) for i in {−2, −1, 0, +1, +2}, where substr takes the p-length prefix of the Brown cluster b_i.
Table 1: Feature templates used in the CRF chunker.

5.2 Named entity recognition

NER is typically treated as a sequence prediction problem. Following Ratinov and Roth (2009), we use the regularized averaged perceptron model. Ratinov and Roth (2009) describe different sequence encodings like BILOU and BIO, and show that the BILOU encoding outperforms BIO, and that greedy inference performs competitively with Viterbi while being significantly faster.
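To make the difference between the two encodings concrete, the following sketch converts a BIO tag sequence to BILOU. The function name bio_to_bilou and the example tags are hypothetical, and the code assumes well-formed BIO input; it is not taken from the Ratinov and Roth (2009) implementation.

```python
def bio_to_bilou(tags):
    """Convert BIO tags (e.g. B-PER, I-PER, O) to BILOU tags.

    BILOU adds L- (last token of a multi-token chunk) and U- (unit-length
    chunk) to the B-, I-, and O tags of BIO.
    """
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The chunk continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            bilou.append(("I-" if continues else "L-") + label)
    return bilou

tags = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]
converted = bio_to_bilou(tags)
# converted == ["B-PER", "L-PER", "O", "U-LOC", "B-ORG", "I-ORG", "L-ORG"]
```

The extra L- and U- tags mark chunk boundaries explicitly.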
Accordingly, we use greedy inference and BILOU text chunk representation. We use the publicly available implementation from Ratinov and Roth (2009) (see the end of this paper for the URL). In our baseline experiments, we remove gazetteers and non-local features (Krishnan & Manning, 2006). However, we also run experiments that include these features, to understand if the information they provide mostly overlaps with that of the word representations.

After each epoch over the training set, we measured the accuracy of the model on the development set. Training was stopped after the accuracy on the development set did not improve for 10 epochs, generally about 50–80 epochs total. The epoch that performed best on the development set was chosen as the final model.

We use the following baseline set of features from Zhang and Johnson (2003):
• Previous two predictions y_{i−1} and y_{i−2}
• Current word x_i
• x_i word type information: all-capitalized, is-capitalized, all-digits, alphanumeric, etc.
• Prefixes and suffixes of x_i; if the word contains hyphens, then also the tokens between the hyphens
• Tokens in the window c = (x_{i−2}, x_{i−1}, x_i, x_{i+1}, x_{i+2})
• Capitalization pattern in the window c
• Conjunction of c and y_{i−1}.

Word representation features, if present, are used the same way as in Table 1.

When using the lexical features, we normalize dates and numbers. For example, 1980 becomes *DDDD* and 212-325-4751 becomes *DDD*-*DDD*-*DDDD*. This allows a degree of abstraction to years, phone numbers, etc. This delexicalization is performed separately from using the word representation. That is, if we have induced an embedding for 12/3/2008, we will use the embedding of 12/3/2008, and *DD*/*D*/*DDDD* in the baseline features listed above.

Unlike in our chunking experiments, after we chose the best model on the development set, we used that model on the test set too.
(In chunking, after finding the best hyperparameters on the development set, we would combine the dev and training set, train a model over this combined set, and then evaluate on test.)

The standard evaluation benchmark for NER is the CoNLL03 shared task dataset drawn from the Reuters newswire. The training set contains 204K words (14K sentences, 946 documents), the test set contains 46K words (3.5K sentences, 231 documents), and the development set contains 51K words (3.3K sentences, 216 documents).

We also evaluated on an out-of-domain (OOD) dataset, the MUC7 formal run (59K words). MUC7 has a different annotation standard than the CoNLL03 data. It has several NE types that don't appear in CoNLL03: money, dates, and numeric quantities. CoNLL03 has MISC, which is not present in MUC7. To evaluate on MUC7, we perform the following postprocessing steps prior to evaluation:
1. In the gold-standard MUC7 data, discard (label as 'O') all NEs with type NUMBER/MONEY/DATE.
2. In the predicted model output on MUC7 data, discard (label as 'O') all NEs with type MISC.

These postprocessing steps will adversely affect all NER models across the board, while nonetheless allowing us to compare different models in a controlled manner.

6 Unlabeled Data

Unlabeled data is used for inducing the word representations. We used the RCV1 corpus, which contains one year of Reuters English newswire, from August 1996 to August 1997, about 63 million words in 3.3 million sentences. We left case intact in the corpus. By comparison, Collobert and Weston (2008) downcase words and delexicalize numbers.

We use a preprocessing technique proposed by Liang (2005, p. 51), which was later used by Koo et al. (2008): Remove all sentences that are less than 90% lowercase a–z. We assume that whitespace is not counted, although this is not specified in Liang's thesis. We call this preprocessing step cleaning.

In Turian et al.
(2009), we found that all word representations performed better on the supervised task when they were induced on the clean unlabeled data, both embeddings and Brown clusters. This is the case even though the cleaning process was very aggressive, and discarded more than half of the sentences. According to the evidence and arguments presented in Bengio et al. (2009), the non-convex optimization process for Collobert and Weston (2008) embeddings might be adversely affected by noise and the statistical sparsity issues regarding rare words, especially at the beginning of training. For this reason, we hypothesize that learning representations over the most frequent words first and gradually increasing the vocabulary, a curriculum training strategy (Elman, 1993; Bengio et al., 2009; Spitkovsky et al., 2010), would provide better results than cleaning.

After cleaning, there are 37 million words (58% of the original) in 1.3 million sentences (41% of the original). The cleaned RCV1 corpus has 269K word types. This is the vocabulary size, i.e. how many word representations were induced. Note that cleaning is applied only to the unlabeled data, not to the labeled data used in the supervised tasks.

RCV1 is a superset of the CoNLL03 corpus. For this reason, NER results that use RCV1 word representations are a form of transductive learning.

7 Experiments and Results

7.1 Details of inducing word representations

The Brown clusters took roughly 3 days to induce, when we induced 1000 clusters, the baseline in prior work (Koo et al., 2008; Ratinov & Roth, 2009). We also induced 100, 320, and 3200 Brown clusters, for comparison. (Because Brown clustering scales quadratically in the number of clusters, inducing 10000 clusters would have been prohibitive.) Because Brown clusters are hierarchical, we can use cluster supersets as features. We used clusters at path depth 4, 6, 10, and 20 (Ratinov & Roth, 2009).
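A word's Brown cluster can be written as a bit string giving its path from the root of the hierarchy, so a cluster superset is simply a prefix of that string. The sketch below shows how prefix features at the four depths could be generated. The bit string, the feature-name format, and the function name brown_prefix_features are hypothetical.

```python
def brown_prefix_features(bit_path, position, depths=(4, 6, 10, 20)):
    """Features from prefixes of a word's Brown cluster bit string.

    A path shorter than a given depth yields the whole path.
    """
    return ["brown[%+d]:p%d=%s" % (position, p, bit_path[:p]) for p in depths]

# Hypothetical cluster path for the current word (position 0).
features = brown_prefix_features("110100101100", 0)
# features == ["brown[+0]:p4=1101", "brown[+0]:p6=110100",
#              "brown[+0]:p10=1101001011", "brown[+0]:p20=110100101100"]
```

Shorter prefixes give coarser classes shared by more words; longer prefixes give finer distinctions.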
These are the prefixes used in Table 1.

The Collobert and Weston (2008) (C&W) embeddings were induced over the course of a few weeks, and trained for about 50 epochs. One of the difficulties in inducing these embeddings is that there is no stopping criterion defined, and that the quality of the embeddings can keep improving as training continues. Collobert (p.c.) simply leaves one computer training his embeddings indefinitely. We induced embeddings with 25, 50, 100, or 200 dimensions over 5-gram windows.

In comparison to Turian et al. (2009), we use improved C&W embeddings in this work:
• They were trained for 50 epochs, not just 20 epochs.
• We initialized all embedding dimensions uniformly in the range [−0.01, +0.01], not [−1, +1]. For rare words, which are typically updated only 143 times per epoch (footnote 2), and given that our embedding learning rate was typically 1e-6 or 1e-7, this means that rare word embeddings will be concentrated around zero, instead of spread out randomly.

The HLBL embeddings were trained for 100 epochs (7 days) (footnote 3). Unlike our Collobert and Weston (2008) embeddings, we did not extensively tune the learning rates for HLBL. We used a learning rate of 1e-3 for both model parameters and embedding parameters. We induced embeddings with 100 dimensions over 5-gram windows, and embeddings with 50 dimensions over 5-gram windows. Embeddings were induced using the one-pass approach with a random tree, not two passes with an updated tree and embeddings re-estimation.

Footnote 2: A rare word will appear 5 (window size) times per epoch as a positive example, and 37M (training examples per epoch) / 269K (vocabulary size) = 138 times per epoch as a corruption example, for a total of 5 + 138 = 143.
Footnote 3: The HLBL model updates require fewer matrix multiplies than Collobert and Weston (2008) model updates. Additionally, HLBL models were trained on a GPGPU, which is faster than conventional CPU arithmetic.

7.2 Scaling of Word Embeddings

Like many NLP systems, the baseline system contains only binary features.
The word embeddings, however, are real numbers that are not necessarily in a bounded range. If the range of the word embeddings is too large, they will exert more influence than the binary features.

We generally found that embeddings had zero mean. We can scale the embeddings by a hyperparameter, to control their standard deviation. Assume that the embeddings are represented by a matrix E:

    E ← σ · E / stddev(E)    (1)

σ is a scaling constant that sets the new standard deviation after scaling the embeddings.

[Plots omitted. x-axis: scaling factor σ, from 0.001 to 1 (log scale); y-axis: validation F1. Curves: C&W with 25, 50, 100, and 200 dimensions; HLBL with 50 and 100 dimensions; baseline.]
Figure 1: Effect as we vary the scaling factor σ (Equation 1) on the validation set F1. We experiment with Collobert and Weston (2008) and HLBL embeddings of various dimensionality. (a) Chunking results. (b) NER results.

Figure 1 shows the effect of scaling factor σ on both supervised tasks. We were surprised to find that on both tasks, across Collobert and Weston (2008) and HLBL embeddings of various dimensionality, all curves had similar shapes and optima. This is one of the contributions of our work. In Turian et al. (2009), we were not able to prescribe a default value for scaling the embeddings. However, these curves demonstrate that a reasonable choice of scale factor is such that the embeddings have a standard deviation of 0.1.

7.3 Capacity of Word Representations

[Plots omitted. x-axes: number of Brown clusters (100, 320, 1000, 3200) and number of embedding dimensions (25, 50, 100, 200); y-axis: validation F1. Curves: C&W, HLBL, Brown, baseline.]
Figure 2: Effect as we vary the capacity of the word representations on the validation set F1. (a) Chunking results.
(b) NER results.

There are capacity controls for the word representations: number of Brown clusters, and number of dimensions of the word embeddings. Figure 2 shows the effect on the validation F1 as we vary the capacity of the word representations.

In general, it appears that more Brown clusters are better. We would like to induce 10000 Brown clusters; however, this would take several months.

In Turian et al. (2009), we hypothesized on the basis of solely the HLBL NER curve that higher-dimensional word embeddings would give higher accuracy. Figure 2 shows that this hypothesis is not true. For NER, the C&W curve is almost flat, and we were surprised to find that even 25-dimensional C&W word embeddings work so well. For chunking, 50-dimensional embeddings had the highest validation F1 for both C&W and HLBL. These curves indicate that the optimal capacity of the word embeddings is task-specific.

System                              Dev     Test
Baseline                            94.16   93.79
HLBL, 50-dim                        94.63   94.00
C&W, 50-dim                         94.66   94.10
Brown, 3200 clusters                94.67   94.11
Brown+HLBL, 37M                     94.62   94.13
C&W+HLBL, 37M                       94.68   94.25
Brown+C&W+HLBL, 37M                 94.72   94.15
Brown+C&W, 37M                      94.76   94.35
Ando and Zhang (2005), 15M          -       94.39
Suzuki and Isozaki (2008), 15M      -       94.67
Suzuki and Isozaki (2008), 1B       -       95.15
Table 2: Final chunking F1 results.
In the last section, we show how many unlabeled words were used.

System                              Dev     Test    MUC7
Baseline                            90.03   84.39   67.48
Baseline+Nonlocal                   91.91   86.52   71.80
HLBL 100-dim                        92.00   88.13   75.25
Gazetteers                          92.09   87.36   77.76
C&W 50-dim                          92.27   87.93   75.74
Brown, 1000 clusters                92.32   88.52   78.84
C&W 200-dim                         92.46   87.96   75.51
C&W+HLBL                            92.52   88.56   78.64
Brown+HLBL                          92.56   88.93   77.85
Brown+C&W                           92.79   89.31   80.13
HLBL+Gaz                            92.91   89.35   79.29
C&W+Gaz                             92.98   88.88   81.44
Brown+Gaz                           93.25   89.41   82.71
Lin and Wu (2009), 3.4B             -       88.44   -
Ando and Zhang (2005), 27M          93.15   89.31   -
Suzuki and Isozaki (2008), 37M      93.66   89.36   -
Suzuki and Isozaki (2008), 1B       94.48   89.92   -
All (Brown+C&W+HLBL+Gaz), 37M       93.17   90.04   82.50
All+Nonlocal, 37M                   93.95   90.36   84.15
Lin and Wu (2009), 700B             -       90.90   -
Table 3: Final NER F1 results, showing the cumulative effect of adding word representations, non-local features, and gazetteers to the baseline. To speed up training, in combined experiments (C&W plus another word representation), we used the 50-dimensional C&W embeddings, not the 200-dimensional ones. In the last section, we show how many unlabeled words were used.

7.4 Final results

Table 2 shows the final chunking results and Table 3 shows the final NER F1 results. We compare to the state-of-the-art methods of Ando and Zhang (2005), Suzuki and Isozaki (2008), and, for NER, Lin and Wu (2009). Tables 2 and 3 show that accuracy can be increased further by combining the features from different types of word representations. But, if only one word representation is to be used, Brown clusters have the highest accuracy. Given the improvements to the C&W embeddings since Turian et al. (2009), C&W embeddings outperform the HLBL embeddings. On chunking, there is only a minute difference between Brown clusters and the embeddings.
Combining representations leads to small increases in the test F1. In comparison to chunking, combining different word representations on NER seems to give larger improvements on the test F1.

[Plots omitted. x-axis: frequency of word in unlabeled data (0, 1, 10, 100, 1K, 10K, 100K, 1M); y-axis: number of per-token errors on the test set (0 to 250). Curves: (a) C&W 50-dim and Brown 3200 clusters; (b) C&W 50-dim and Brown 1000 clusters.]
Figure 3: For word tokens that have different frequency in the unlabeled data, what is the total number of per-token errors incurred on the test set? (a) Chunking results. (b) NER results.

On NER, Brown clusters are superior to the word embeddings. Since much of the NER F1 is derived from decisions made over rare words, we suspected that Brown clustering has a superior representation for rare words. Brown makes a single hard clustering decision, whereas the embedding for a rare word is close to its initial value since it hasn't received many training updates (see Footnote 2). Figure 3 shows the total number of per-token errors incurred on the test set, depending upon the frequency of the word token in the unlabeled data. For NER, Figure 3 (b) shows that most errors occur on rare words, and that Brown clusters do indeed incur fewer errors for rare words. This supports our hypothesis that, for rare words, Brown clustering produces better representations than word embeddings that haven't received sufficient training updates. For chunking, Brown clusters and C&W embeddings incur almost identical numbers of errors, and errors are concentrated around the more common words. We hypothesize that non-rare words have good representations, regardless of the choice of word representation technique.
For tasks like chunking in which a syntactic decision relies upon looking at several tokens simultaneously, compound features that use the word representations might increase accuracy more (Koo et al., 2008).

Using word representations in NER brought larger gains on the out-of-domain data than on the in-domain data. We were surprised by this result, because the OOD data, unlike the in-domain data, was not even used during the unsupervised word representation induction. We are curious to investigate this phenomenon further.

Ando and Zhang (2005) present a semi-supervised learning algorithm called alternating structure optimization (ASO). They find a low-dimensional projection of the input features that gives good linear classifiers over auxiliary tasks. These auxiliary tasks are sometimes specific to the supervised task, and sometimes general language modeling tasks like “predict the missing word”. Suzuki and Isozaki (2008) present a semi-supervised extension of CRFs. (In Suzuki et al. (2009), they extend their semi-supervised approach to more general conditional models.) One of the advantages of the semi-supervised learning approach that we use is that it is simpler and more general than that of Ando and Zhang (2005) and Suzuki and Isozaki (2008). Their methods dictate a particular choice of model and training regime and could not, for instance, be used with an NLP system based upon an SVM classifier.

Lin and Wu (2009) present a K-means-like non-hierarchical clustering algorithm for phrases, which uses MapReduce. Since they can scale to millions of phrases, and they train over 800B unlabeled words, they achieve state-of-the-art accuracy on NER using their phrase clusters. This suggests that extending word representations to phrase representations is worth further investigation.

8 Conclusions

Word features can be learned in advance in an unsupervised, task-inspecific, and model-agnostic manner.
These word features, once learned, are easily shared with other researchers, and easily integrated into existing supervised NLP systems. The disadvantage, however, is that accuracy might not be as high as a semi-supervised method that includes task-specific information and that jointly learns the supervised and unsupervised tasks (Ando & Zhang, 2005; Suzuki & Isozaki, 2008; Suzuki et al., 2009).

Unsupervised word representations have been used in previous NLP work, and have demonstrated improvements in generalization accuracy on a variety of tasks. Ours is the first work to systematically compare different word representations in a controlled way. We found that Brown clusters and word embeddings both can improve the accuracy of a near-state-of-the-art supervised NLP system. We also found that combining different word representations can improve accuracy further. Error analysis indicates that Brown clustering induces better representations for rare words than C&W embeddings that have not received many training updates.

Another contribution of our work is a default method for setting the scaling parameter for word embeddings. With this contribution, word embeddings can now be used off-the-shelf as word features, with no tuning.

Future work should explore methods for inducing phrase representations, as well as techniques for increasing accuracy by using word representations in compound features.

Replicating our experiments

You can visit http://metaoptimize.com/projects/wordreprs/ to find: the word representations we induced, which you can download and use in your experiments; the code for inducing the word representations, which you can use to induce word representations on your own data; and the NER and chunking system, with code for replicating our experiments.

Acknowledgments

Thank you to Magnus Sahlgren, Bob Carpenter, Percy Liang, Alexander Yates, and the anonymous reviewers for useful discussion. Thank you to Andriy Mnih for inducing his embeddings on RCV1 for us.
Joseph Turian and Yoshua Bengio acknowledge the following agencies for research funding and computing support: NSERC, RQCHP, CIFAR. Lev Ratinov was supported by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-09-C-0181. Any opinions, findings, and conclusion or recommendations expressed in this material are those of the author and do not necessarily reflect the view of the Air Force Research Laboratory (AFRL).

References

Ando, R., & Zhang, T. (2005). A high-performance semi-supervised learning method for text chunking. ACL.
Bengio, Y. (2008). Neural net language models. Scholarpedia, 3, 3881.
Bengio, Y., Ducharme, R., & Vincent, P. (2001). A neural probabilistic language model. NIPS.
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. ICML.
Bengio, Y., & Sénécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. AISTATS.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467–479.
Candito, M., & Crabbé, B. (2009). Improving generative statistical parsing with semi-supervised word clustering. IWPT (pp. 138–141).
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML.
Deschacht, K., & Moens, M.-F. (2009). Semi-supervised semantic role labeling using the Latent Words Language Model. EMNLP (pp. 21–29).
Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. SIGCHI Conference on Human Factors in Computing Systems (pp. 281–285). ACM.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781–799.
Goldberg, Y., Tsarfaty, R., Adler, M., & Elhadad, M. (2009). Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. EACL.
Honkela, T. (1997). Self-organizing maps of words for natural language processing applications. Proceedings of the International ICSC Symposium on Soft Computing.
Honkela, T., Pulkki, V., & Kohonen, T. (1995). Contextual relations of words in grimm tales, analyzed by self-organizing map. ICANN.
Huang, F., & Yates, A. (2009). Distributional representations for handling sparsity in supervised sequence labeling. ACL.
Kaski, S. (1998). Dimensionality reduction by random mapping: Fast similarity computation for clustering. IJCNN (pp. 413–418).
Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. ACL (pp. 595–603).
Krishnan, V., & Manning, C. D. (2006). An effective two-stage model for exploiting non-local dependencies in named entity recognition. COLING-ACL.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 259–284.
Li, W., & McCallum, A. (2005). Semi-supervised sequence modeling with syntactic topic models. AAAI.
Liang, P. (2005). Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology.
Lin, D., & Wu, X. (2009). Phrase clustering for discriminative learning. ACL-IJCNLP (pp. 1030–1038).
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28, 203–208.
Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. Cognitive Science Proceedings, LEA (pp. 660–665).
Martin, S., Liermann, J., & Ney, H. (1998). Algorithms for bigram and trigram word clustering. Speech Communication, 24, 19–37.
Miller, S., Guinness, J., & Zamanian, A. (2004). Name tagging with word clusters and discriminative training. HLT-NAACL (pp. 337–342).
Mnih, A., & Hinton, G. E. (2007). Three new graphical models for statistical language modelling. ICML.
Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS (pp. 1081–1088).
Morin, F., & Bengio, Y. (2005). Hierarchical probabilistic neural network language model. AISTATS.
Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of english words. ACL (pp. 183–190).
Ratinov, L., & Roth, D. (2009). Design challenges and misconceptions in named entity recognition. CoNLL.
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. LREC.
Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 241–254.
Sahlgren, M. (2001). Vector-based semantic analysis: Representing word meanings based on random labels. Proceedings of the Semantic Knowledge Acquisition and Categorisation Workshop, ESSLLI.
Sahlgren, M. (2005). An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE).
Sahlgren, M. (2006). The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Doctoral dissertation, Stockholm University.
Sang, E. T., & Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. CoNLL.
Schwenk, H., & Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 765–768). Orlando, Florida.
Sha, F., & Pereira, F. C. N. (2003). Shallow parsing with conditional random fields. HLT-NAACL.
Spitkovsky, V., Alshawi, H., & Jurafsky, D. (2010). From baby steps to leapfrog: How “less ... dependency parsing. NAACL-HLT.
Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. ACL-08: HLT (pp. 665–673).
Suzuki, J., Isozaki, H., Carreras, ... (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. EMNLP.
Turian, J., Ratinov, L., Bengio, Y., & Roth, D. (2009). A preliminary evaluation of word representations for named-entity recognition. NIPS Workshop on Grammar Induction, Representation of Language and Language Learning.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
Ushioda, A. (1996). Hierarchical clustering of words. COLING (pp. 1159–1162).
Väyrynen, J. J., & Honkela, T. (2004). Word category maps based on emergent features created by ICA. Proceedings of the STeP’2004 Cognition + Cybernetics Symposium (pp. 173–185). Finnish Artificial Intelligence Society.
Väyrynen, J., & Honkela, T. (2005). Comparison of independent component analysis and singular value ... context analysis. AKRR’05, International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.
Väyrynen, J. J., Honkela, T., & Lindqvist, L. (2007). Towards explicit ... semantic features using independent component analysis. Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR). Stockholm, Sweden: Swedish Institute of Computer Science.
Zhang, T., & Johnson, D. (2003). A robust risk minimization based named entity recognition system. CoNLL.
Zhao, H., Chen, W., Kit, C., & Zhou, G. (2009). Multilingual dependency learning: a huge feature engineering method to semantic dependency parsing. CoNLL (pp. 55–60).