... for automatic evaluation of Chinese translation output. Our analysis suggests several key reasons behind this finding. 2 Chinese Translation Evaluation Automatic MT evaluation aims at formulating ... Association for Computational Linguistics Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level? Maoxi Li Chengqing Zong Hwee Tou Ng National Laboratory of Pattern ... character-level metrics more suitable for the automatic evaluation of Chinese translation output. 6 Conclusion In this paper, we conducted a detailed study of the relative merits of word-level versus...
... human evaluators. For the input of automatic evaluation methods, we created three evaluation sets from the MT outputs: 1. Case set: The original system outputs with case information. 2. ... section, we present the evaluations of ROUGE-L, ROUGE-S, and compare their performance with other automatic evaluation measures. 5 Evaluations One of the goals of developing automatic evaluation ... construction of N-best translation lexicons from parallel text. Melamed (1995) used the ratio (LCSR) between the length of the LCS of two words and the length of the longer word of the two words...
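The LCSR quoted in the excerpt above (length of the longest common subsequence of two words, divided by the length of the longer word) can be illustrated with a minimal Python sketch. The function names `lcs_length` and `lcsr` are hypothetical, and returning 0.0 for two empty strings is an assumption made here, not something the excerpt states.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer of the two words."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # assumption: the ratio of two empty strings is taken as 0
    return lcs_length(a, b) / longer


if __name__ == "__main__":
    print(lcsr("government", "gouvernement"))
```

Here every letter of "government" appears in order inside "gouvernement", so the LCS has length 10 and the ratio is 10/12.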
... Proceedings of the First Conference of the Association for Machine Translation in the Americas, pages 193–205, Columbia, Maryland. BLEU: a Method for Automatic Evaluation of Machine Translation Kishore ... judges which substitutes for them when there is need for quick or frequent evaluations.1 1 Introduction 1.1 Rationale Human evaluations of machine translation (MT) weigh many aspects of translation, including ... “perfect” translations of a given source sentence. These translations may vary in word choice or in word order even when they use the same words. And yet humans can clearly distinguish a good translation...
... Method for Automatic Evaluation of Machine Translation. Technical Report RC22176, IBM. Joseph Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and its evaluation. ... types of language model for the generated text. Then in Section 6 we conclude. 2 Related Work In terms of human evaluation, there is no uniform view on what constitutes the notion of fluency, or ... letters for correct spelling, good grammar, rhythm and flow, appropriateness of tone, and several other specific characteristics of good text. In terms of automatic evaluation, we are not aware of any...
... Morpho-syntactic information for automatic error analysis of statistical machine translation output. In Proceedings of WMT, pages 1–6, New York City, New York, USA. Matthew Snover, Bonnie Dorr, ... though the main purpose of BLAST is for error annotation of machine translation output, the freedom in the use of error typologies and support annotations also makes it suitable for other tasks where ... colored text, where the color depends on the main type of the error. For each new annotation the user selects the word or words that are wrong, and selects an error type. In figure 1, the words...
... Language Mean # of Translation Words per Source Word Mean Context-freeness (# of Word Link = 4) 28,300 words (49.5%) 3,107 words 1.51 trans./word 4.45 20,722 words (34.0%) 3,601 words 1.94 trans./word 4.21 Rewritten ... Agreement: The word order of a source sentence agrees substantially with that of a target sentence. This measure ensures that the cost of word order adjustment is small. • Word Translation Stability: ... trans./word 4.21 Rewritten Translations Original Translations Table 1: Comparison of TDMT Training Translations and Original Translations the rule writers rewrote translations to make target words as simple...
... of systems, ignoring the linguistic quality of the system output. Part of the reason for this imbalance is the existence of ROUGE (Lin and Hovy, 2003; Lin, 2004), the system for automatic evaluation ... pair-wise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input. 1 Introduction Efforts for the development of automatic text summarizers ... are the total number of word types from both sentences s_i and s_{i+1}. Stop words were retained. The value of each dimension for a sentence is the number of tokens of that word type in that sentence....
... measures of fluency, i.e. syntactic correctness. In this paper, we carry out experiments to correlate automatic evaluation of the output of a surface realisation ranking system for German ... 0.3333 Table 2: Correlation between human judgements for experiment A (rank 1–3) and automatic metrics for attempting to account for different but equivalent translations of a given source word, typically by ... between the automatic evaluation of system output and human (expert and non-expert) evaluation of the same data (English weather forecasts). Their findings show that the NIST metric correlates...
... size of the evaluation corpus. Furthermore, in order to find the nearest neighbor solution, we have to calculate all BLEU scores of the neighborhood. If the number of rules is R and the neighborhood is ... domain, bilingual corpora of the new domain are necessary. However, the sizes of the new corpora are generally smaller than that of the original corpus because the collection of bilingual sentences ... 149,882 Corpus # of Words 868,087 984,197 Evaluation # of Sentences 10,145 Corpus # of Words 59,533 67,554 Test # of Sentences 10,150 Corpus # of Words 59,232 67,193 Table 1: Corpus Size the rule subsets cannot...
... development of systems. Automatic evaluation requires a corpus of questions and answers, a definition of what is a correct answer, and a way to compare the correct answers to automatic answers produced ... selection of natural questions. The articles varied in topic, degree of formality and the amount of details; from "Horror film" and "Christmas worldwide" to "G-Man (Half-Life)" and "History of London". ... and Tsunenori Mori. 2007. An Overview of the 4th Question Answering Challenge (QAC-4) at NTCIR workshop 6. In Proceedings of the Sixth NTCIR Workshop Meeting, pages 433–440. Chiori Hori, Takaaki Hori,...
... ing, etc., from English word forms, thereby reducing the inflected word forms to more or less standard stem forms. Machinable rules for the automatic splitting of word affixes, a process ... order according to rule (1), added new English correspondents according to rule (2), deleted the translations of homographic Russian words according to rule (3), inserted short words according ... English correspondent, the inflection of a correspondent into the plural, etc.2 (2) A determiner formula Dr for the desired algorithm. This is the portion of the algorithm known beforehand;...
... point for word segmentation and 1 point for Joint S&T, with corresponding error reductions of 30.2% and 14%. The final result outperforms the latest work on the same corpus which uses more ... representation of Ng and Low (2004). For word segmentation only, there are four boundary tags: • b: the begin of the word • m: the middle of the word • e: the end of the word • s: a single-character word while ... Processing norms of modern chinese corpus. Technical report. Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of the 45th Annual Meeting of...
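The four boundary tags listed in the excerpt above (b, m, e, s) can be illustrated with a short Python sketch that turns an already segmented sentence into one tag per character. The function name `boundary_tags` and the example sentence are hypothetical, chosen only for illustration, and skipping empty strings is an assumption made here. This shows the tagging scheme only, not the cited paper's model.

```python
from typing import List


def boundary_tags(words: List[str]) -> List[str]:
    """Map each character of a segmented sentence to a boundary tag.

    b: first character of a multi-character word
    m: middle character of a multi-character word
    e: last character of a multi-character word
    s: a single-character word
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # assumption: empty strings carry no characters to tag
        if len(word) == 1:
            tags.append("s")
        else:
            tags.extend(["b"] + ["m"] * (len(word) - 2) + ["e"])
    return tags


if __name__ == "__main__":
    # Hypothetical segmentation of a short sentence into three words.
    print(boundary_tags(["我", "喜欢", "自然语言"]))
```

The result contains exactly one tag per character, so a character-level tagger can be trained against it and the word boundaries recovered from the predicted tags.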
... Utsuro and Noriko Kando. 2009. Meta-Evaluation of Automatic Evaluation Methods for Machine Translation using Patent Translation Data in NTCIR-7. In Proc. of the 3rd Workshop on Patent Translation, ... National Institute of Informatics (NII) provided corpora used in this work. The author gratefully acknowledges JAPIO and NII for their support. Moreover, this work was partially supported by Grants ... a parameter for the weight of score_np: it is between 0.0 and 1.0. The ratio of score_wd to score_np is 1:1 when δ is 1.0. Moreover, score_wd_multi and score_np_multi are used for Eq. (13) in...
... the length, i.e., number of words, among the analyzed phrases differs and our algorithm does not use semantical information, in order to avoid the distortion of the original phrase, in our system the ... generator for open domain dialogue systems, or chatbots. In this paper, we propose the automatic generation of trivial dialogue phrases through the application of a transfer-like genetic algorithm ... which one of the generated expressions is correct. Our method differs in the application of a GA-like transfer process in order to automatically insert new features on the selected original...
... score for a set of generator forecasts is the average of scores for individual forecasts. Human evaluations: We recruited 9 experts (people with experience reading forecasts for offshore oil ... types of NLG systems for which the correlation of automatic and human evaluation was particularly good or bad. Data: We extracted from each forecast in the SUMTIME corpus the first description of ... top of Table 1). We randomly selected a subset of 21 forecast dates for use in human evaluations. For these 21 forecast dates, we also asked two meteorologists who had not contributed to the original...