A study on the reliability of the end-of-semester-one writing test for second-year English majors at Nghe An Junior Teacher Training College, and some suggestions for change


INTRODUCTION

RATIONALE

There is no doubt that testing is an essential part of language teaching and learning. A language test, in general, is a way to "sample language behavior and infer general ability in the language learnt" (Brown, 1994: 252). In other words, from the results of the test, and depending on the kind of test and its purpose, the teacher infers a certain level of language competence of his students in such different areas as grammar, vocabulary and pronunciation, or speaking, listening, writing and reading. It is obvious that the teacher plays a very important role in the process of assessment and measurement, which is conducted through testing. It is said that "language testing is a form of measurement. It is so closely related to teaching that we cannot work in testing without being constantly concerned with teaching" (Heaton, 1988: 5).

There are various types of test which serve different purposes in foreign language teaching and learning. Among them, writing tests are said to be less reliable from the point of view of both scorer and testee. This situation can be seen clearly at Nghe An Junior Teacher Training College. For many years, teachers there have considered English writing the most difficult skill to test. Teachers have found it difficult to mark achievement writing tests accurately, in particular to mark compositions, complaining that there is no rating scale for scoring compositions, or that the rating scale provided is too general. Apart from this, many students are worried about the results of the writing achievement tests, especially the composition task, as they wonder whether their writings are accurately evaluated by the raters. That is the reason for choosing the topic of this research: a study on the reliability of the achievement writing test for the second-year English major students at N.A. JTTC. It is hoped that the study will be helpful to the author, to the teachers of the English department of N.A. JTTC, and to those who are concerned with language testing in general and with the reliability of writing achievement tests in particular.

AIMS OF THE STUDY

The major aims of this study are:
- to explore the relevant notions of language testing;
- to analyze the achievement writing test for the second-year English major students on the basis of the syllabus, the purposes of teaching and testing, and available data such as test scores and the scores of sample compositions, for evidence on its validity and reliability, with a focus on reliability;
- to provide some suggestions for test designers as well as raters.

SCOPE OF THE STUDY

Evaluating an achievement writing test consists of complex procedures and needs a number of criteria to be set up. However, due to the availability of data and the limitation of time, this study focuses mainly on the reliability of the achievement writing test for the second-year English major students at N.A. JTTC. The results can be seen as the basis for some suggestions for test designers as well as raters.

METHODS OF THE STUDY

On the basis of an analysis of the teaching aims and syllabus for the second-year English major students, as well as of the content of the term-1 writing test, which together form the practical base for the study, the quantitative method is used to measure the reliability of the test. It focuses on analyzing the test scores of 156 second-year students and the scores of 15 sample compositions collected at random.

DESIGN OF THE STUDY

The study is comprised of three parts:
Part I: Introduction provides information on the rationale for choosing the topic, the aims, the scope, and the methods of the study.

Part II: Development (divided into three chapters)

- Chapter one reviews the literature related to language testing (definitions, approaches, roles, purposes, and the relationship between teaching, learning and testing), testing writing (types of writing, criteria for testing writing, and problems in testing writing), and the criteria of a good test, with a focus on reliability, covering the factors affecting language test scores, the methods of determining the reliability of a language test, and measures for improving test reliability.
- Chapter two presents an overview of the teaching, learning and testing situation at Nghe An Junior Teacher Training College, including a description of the second-year English teaching aims, the writing syllabus and course book, as well as the content of the writing achievement test.
- Chapter three presents the methodology of the analysis of the format of the writing achievement test and of the data collected from the test scores of 156 second-year students and the scores of 15 random sample compositions, in order to answer the question: to what extent is the marking of the writing achievement test reliable?

Part III: Conclusion presents a summary and some recommendations for test designers and raters.

DEVELOPMENT

CHAPTER 1: LITERATURE REVIEW

In this chapter, the theoretical background for the study is established. Firstly, the term 'language testing', including the approaches, roles and purposes of testing as well as the relationships between language testing, teaching and learning, will be explored. Then the testing of writing will be discussed, followed by an examination of the criteria of a good language test with a focus on reliability.

1.1 Language testing

1.1.1 Definitions of language testing

Testing is an important part of every teaching and learning experience and has become one of the main aspects of methodology. Many researchers have defined testing from different points of view. Allen (1974: 313) emphasizes testing as an instrument to ensure that students have a sense of competition rather than to know how good their performance is, and specifies the conditions under which a test can take place: a 'test is a measuring device which we use when we want to compare an individual with other individuals who belong to the same group.'

According to Carroll (1968: 46), a psychological or educational test is a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual. In other words, a test is a measurement instrument designed to elicit a particular behavior of each individual.

According to Bachman (1990: 20), what distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of behavior. This distinction is believed to be of great importance because it reflects the primary justification for the use of language tests and has implications for how we design, develop and use them. Thus, language tests can provide the means for focusing on the specific abilities of interest.

Besides, Ibe (1981: 1) points out that a test is "a sample of behavior under the control of specified conditions aimed towards providing a basis for performing judgment." The term 'a sample of behavior' used here is rather broad, and it means something other than the traditional paper-and-pencil types.
Read (1983) shares the same idea with Ibe, in the sense that 'a sample of behavior' suggests that language testing certainly includes the listening and speaking skills as well as the reading and writing ones.

However, Heaton (1988: 5) looks at testing in a different way. In his opinion, tests are a means of assessing the students' performance and of motivating the students. He looks at tests with positive eyes, as many students are eager to take tests at the end of the semester to find out how much knowledge they have gained. One important thing is that he points out the relationship between testing and teaching.

In short, from the above descriptions, testing is an effective means of measuring and assessing students' language knowledge and skills, and it is of great use to both language teaching and learning. In order to understand more about language testing, we should look at the different approaches to language testing in the following part.

1.1.2 Approaches to language testing

According to Heaton (1988: 15), there are four main approaches to testing: (i) the essay-translation approach, (ii) the structuralist approach, (iii) the integrative approach, and (iv) the communicative approach.

(i) The essay-translation approach is an old method in which tests often focus on essay writing, translation and grammatical analysis. This approach requires no special skills or expertise in testing, but the subjective judgment of the teacher is considered to be of paramount importance.

(ii) The structuralist approach places the main focus on testing discrete skills without relating them to context. The skills of listening, speaking, reading and writing are separated from one another, as it is considered essential to test each one at a time. The learners' mastery of the separate elements of the target language (phonology, vocabulary and grammar) is also tested, using words and sentences completely out of context, so that a large number of samples of language forms can be covered in the test in a comparatively short time. Thus, the results of the students' tests depend entirely on the accurate forms of the separate language aspects or skills tested, rather than on the total meaning of the discourse or the ability to use the language appropriately and effectively. However, this approach is still valid for some types of tests for specific purposes, as it is considered to be objective, precise, reliable and scientific. That is why the typical test type following this approach, multiple choice, is still widely used nowadays, even though there is only a limited use for multiple-choice items in many communicative tests.

(iii) The integrative approach, in contrast, involves the testing of language in context and is thus concerned primarily with meaning and the total communicative effect of discourse (Heaton, 1988). Integrative tests, instead of separating the language into different aspects, are designed to test two or more skills at the same time (especially reading and listening, or language components such as grammar and vocabulary in integration). In other words, this type of test is concerned with students' global proficiency, not with their mastery of separate elements or skills. The typical types of tests following this approach are cloze tests, dictation, oral interviews, translation and essay writing. However, according to Heaton (1988: 16), integrative testing involves 'functional language' but not the use of functional language, and such tests are thus weak in communication.

(iv) The communicative approach is considered to be interactive, purposive, authentic and contextualized, and performance should be assessed in terms of behavioral outcomes.
Although both the integrative and the communicative approaches emphasize the importance of the meaning of utterances rather than their form and structure, communicative tests are concerned primarily with how language is used in communication (Heaton, 1988: 19). The communicative approach emphasizes the evaluation of language use rather than usage ('use' is concerned with how people actually use language for different purposes, while 'usage' concerns the formal patterns of language). However, the communicative approach is claimed to be less reliable because of the variety of real-life situations in different areas and countries, and, according to Heaton (1988), in order to increase reliability, especially in scoring, very carefully drawn-up and well-established criteria must be designed.

In brief, each approach to language testing has its weak points as well as its strong points. Therefore, in Heaton's view, a useful test will generally incorporate features of several of these approaches (Heaton, 1988: 15).

1.1.3 The roles of language testing

According to McNamara (2000: 4), language tests play a powerful role in many people's lives, acting as gateways at important transitional moments in education, in employment, and in moving from one country to another. Language testing is used for assessment, employment and selection, and is considered by raters as a means of placing students on particular courses. Moreover, language tests are also the criterion for evaluating the language proficiency of researchers who want to carry out research in language study.

Bachman (1990: 2) has also mentioned the importance of language testing. Language tests can be valuable sources of information about the effectiveness of learning and teaching: language teachers regularly use tests to help diagnose students' strengths and weaknesses, to assess students' progress, and to assist in evaluating students' achievements. Language tests are also frequently used as sources of information in evaluating the effectiveness of different approaches to language teaching, and as sources of feedback on learning and teaching; thus language tests can provide useful input into the process of language teaching. Conversely, insights gained from language learning and teaching can provide valuable information for designing and developing more useful tests.

1.1.4 The purposes of language testing

Language tests usually differ according to their purposes. Hughes (1989: 7) points out the different purposes of testing based on the different kinds of tests, such as:
- to measure language proficiency regardless of any language courses that candidates may have followed;
- to discover how far students have achieved the objectives of a course of study;
- to diagnose students' strengths and weaknesses, to identify what they know and what they do not know;
- to assist placement of students by identifying the stage or part of a teaching program most appropriate to their ability.

According to Henning (1987: 1-3), there are six major purposes of language testing, which can be represented as follows:
(i) Diagnosis and Feedback (to find out students' strengths and weaknesses);
(ii) Screening and Selection (to decide who should be allowed to participate in a particular program of study);
(iii) Placement (to identify a student's particular level and place him or her in a particular program of study);
(iv) Program Evaluation (to provide information about the effectiveness of programs);
(v) Providing Research Criteria (to provide a standard of judgment in a variety of other research contexts);
(vi) Assessment of Attitudes and Sociopsychological Differences.

1.1.5 Relationship between language testing and language teaching and learning

Though a large number of examinations and tests in the past tended to separate testing from teaching, Heaton (1988: 5) emphasizes that teaching and testing are in some ways so interwoven and interdependent that it is very difficult to tease them apart: "Both testing and teaching are so closely interrelated that it is virtually impossible to work in either field without being constantly concerned with the other." Heaton (1988: 5) also notes: "Tests may be constructed primarily as devices to reinforce learning and motivate the students or as a means of assessing the students' performance in the language." In the former case, testing is geared to the teaching, whereas in the latter case, teaching is often geared largely to the testing.

According to Hughes (1989: 1), the effect of testing on teaching and learning is known as backwash, which can be harmful or beneficial. He puts more focus on harmful backwash: if the test content does not match the objectives of the course, the backwash can be really harmful, and it leads to the problem of teaching in one way but testing in another. In his view, some language tests have harmful effects on teaching and often fail to measure accurately whatever they are intended to measure. This is proved in the case of a writing test consisting only of multiple-choice items, in which learners concentrate on practicing such items rather than practicing the skill of writing itself.

In general, testing and the teaching-learning process have a very close relationship. Tests can be effective for both the teaching and the learning process, and vice versa. Therefore, good tests are useful and desirable, and can be used as a valuable teaching device. For educational purposes, educators should improve both language tests and language teaching methods in order to obtain beneficial backwash.

1.2 Testing writing

In this section, to understand what testing writing is, it is crucial to discuss the types of writing, the criteria needed to test writing accurately, and the problems in testing writing.

1.2.1 Types of writing

Writing is defined as the productive skill in the written mode. It is complex and difficult both to teach and to test, since it requires mastery not only of grammatical and rhetorical devices but also of conceptual and judgmental elements, among which the use of judgment is considered the most important in the teaching and testing of writing. Normally, after clarifying the purpose of the writing, the particular audience and the register, an essay or a composition may be written in accordance with the four main types: narrative, descriptive, expressive, or argumentative. However, Heaton (1988: 136) divides the types of writing according to learners' levels. In his view, there are three main levels: basic, intermediate and advanced. At the basic level, learners learn how to write letters, postcards, diaries and forms. Building on the types already learnt at the basic level, learners should be taught how to write guides and sets of instructions at the intermediate level. And at the advanced level, learners additionally learn how to write newspaper reports and notes.

1.2.2 Criteria of testing writing

There is a wide variety of writing tests, needed to test the many kinds of writing task that we engage in.
According to Madsen (1983: 101), there are three main types of writing task, and of writing test as well: controlled, guided and free writing, which correspond with their testing criteria. He points out the advantages and limitations of each type, among which the scoring of guided and free writing is considered to be subjective. Thus, McNamara (2000: 38) notes that in the past, writing skills were assessed indirectly, through examinations of control over the grammatical system and knowledge of vocabulary, because of this problem of subjectivity. Nowadays, however, educators place more emphasis on teaching and testing learners' communicative language abilities. That is why there is a current trend of shifting from testing isolated items to testing written compositions, in which managing the rating process has become an urgent necessity. John Boker indicates the disadvantages of testing written compositions, such as the limited amount of content sampled, the time-consuming nature of scoring, and low reliability in scoring. Therefore, in order to evaluate learners' writing ability accurately, raters have to be carefully trained in rating learners' writings and must pay close attention to the micro-skills involved in the writing process that learners need to acquire. Heaton (1988: 135) describes those micro-skills as follows:

- language use: the ability to write correct and appropriate sentences;
- mechanical skills: the ability to use correctly those conventions peculiar to the written language (e.g. punctuation, spelling);
- treatment of content: the ability to think creatively and develop thoughts, excluding all irrelevant information;
- stylistic skills: the ability to manipulate sentences and paragraphs, and use language effectively;
- judgment skills: the ability to write in an appropriate manner for a particular purpose with a particular audience in mind, together with an ability to select, organize and order relevant information.

He also presents the minimum criteria of testing writing which learners need to meet at each level. These can be shown as follows:

- Basic level: no confusing errors of grammar or vocabulary; a piece of writing legible and readily intelligible; able to produce simple unsophisticated sentences.
- Intermediate level: accurate grammar, vocabulary and spelling, though possibly with some mistakes which do not destroy communication; handwriting generally legible; expression clear and appropriate, using a fair range of language; able to link themes and points coherently.
- Advanced level: extremely high standards of grammar, vocabulary and spelling; easily legible handwriting; no obvious limitations on the range of language the candidate is able to use accurately and appropriately; ability to produce organized, coherent writing, displaying considerable sophistication.

(c) Measures of distribution

Scores (x)   Mean (M)   Deviation (x - M)   Deviation squared (x - M)²
3            5.6        -2.6                 6.76
4            5.6        -1.6                 2.56
5            5.6        -0.6                 0.36
6            5.6         0.4                 0.16
7            5.6         1.4                 1.96
8            5.6         2.4                 5.76
9            5.6         3.4                11.56
                                   Σ(x - M)² = 29.12

Table 7: Measures of distribution

From the results of the above table, we can calculate the standard deviation (SD) by the following formula:

SD = √( Σ(x - M)² / N )

where x is a student's score, M is the mean, N is the number of students, Σ denotes the sum, and √ the square root. The standard deviation is then calculated as follows:

SD = √( 29.12 / 156 ) = 1.19

According to the result of the standard deviation, we can come to the following conclusions:
(i) there is a quite wide range of ability among the testees;
(ii) the score distribution is rather wide;
(iii) the test has spread the students out.

As mentioned above, the reliability coefficient is calculated as follows (with K = 30):

Rxx = 1 - M(K - M) / (K × SD²) = 1 - 5.6 × (30 - 5.6) / (30 × 1.19²) = 1 - 136.64 / 42.483 ≈ -2.2
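To make the computation above concrete, the following sketch reproduces the mean, standard deviation and reliability calculation in Python. The score frequencies are hypothetical (the frequency table itself is not part of this extract; only N = 156, the mean of about 5.6 and the reported SD of 1.19 are given), so the printed values illustrate the method rather than reproduce the study's exact figures.

```python
import math

K = 30  # value used for K in the reliability formula above

# Hypothetical score frequencies (score -> number of students), chosen only
# so that N = 156 and the mean comes out near the reported 5.6; the study's
# actual frequency table is not shown in this extract.
freq = {3: 10, 4: 23, 5: 39, 6: 45, 7: 25, 8: 10, 9: 4}

N = sum(freq.values())                          # 156 students
mean = sum(x * f for x, f in freq.items()) / N  # M = sum(xf) / N

# Standard deviation: square root of the mean squared deviation from M.
variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / N
sd = math.sqrt(variance)

# Reliability coefficient as applied in the text (a KR-21-style estimate):
# Rxx = 1 - M(K - M) / (K * SD^2)
rxx = 1 - (mean * (K - mean)) / (K * sd ** 2)

print(f"N = {N}, M = {mean:.2f}, SD = {sd:.2f}, Rxx = {rxx:.2f}")
```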
In the literature, as stated in chapter 1, the appropriate reliability coefficients for writing tests range from 0.7 to 0.79. Accordingly, the result suggests that there is some degree of unreliability in the writing achievement test. Only by observing and analyzing the test and the test scores for each task could we realize that task 4, writing a composition (a narrative), may cause some degree of unreliability, as it was scored subjectively. Therefore, the marking process is considered one of the causes of the unreliability of the test. In the following part, an analysis of 15 sample compositions is conducted to see how much the marking process has influenced the reliability of the test.

3.3.3 Analysis of the sample compositions

As mentioned above, there is an ambiguity in task 4, writing a composition, which was marked by only one rater, and this ambiguity may occur in the rating process. That is why we should examine inter-rater reliability in order to analyze and find out the differences in scoring. Our experiment was carried out as follows:

a) First, we randomly collected the writing papers of 15 students. These writing papers are called the sample compositions (see Appendix 3).

b) Then, those writing papers were rated by three volunteer teachers of English who are in charge of teaching the writing skill in our department. They were required to rate the same compositions through the same marking process: the 15 sample compositions were copied and delivered to the three raters with no specific criteria for marking them.

c) Finally, we carried out the analysis of the sample compositions based on the scores collected, shown in the following table:

No.   Student   Rater 1   Rater 2   Rater 3   Mean   Range
1     d128      2.0       2.5       3.5       2.7    2.0-3.5
2     d130      3.5       4.0       4.0       3.8    3.5-4.0
3     d132      3.0       3.5       4.0       3.5    3.0-4.0
4     d134      3.5       4.5       4.5       4.0    3.5-4.5
5     d136      2.0       3.0       3.0       2.7    2.0-3.0
6     d138      2.0       3.0       2.5       2.5    2.0-3.0
7     d140      2.5       3.0       2.5       2.7    2.5-3.0
8     d142      2.0       3.0       3.5       2.8    2.0-3.5
9     d144      2.0       3.0       1.5       2.2    1.5-3.0
10    d146      3.5       4.5       4.2       4.0    3.5-4.5
11    d148      2.5       3.5       4.5       3.5    2.5-4.5
12    d150      2.0       2.5       1.5       2.0    1.5-2.5
13    d152      2.5       3.0       3.5       3.0    2.5-3.5
14    d154      1.8       3.0       3.0       2.6    1.8-3.0
15    d156      3.0       4.0       4.2       3.7    3.0-4.2
Range           1.8-3.5   2.5-4.5   1.5-4.5
Mean            2.5       3.4       3.3

Table 8: The scores of 15 sample compositions given by three raters

Looking at the results of the scores, we can see that there is a difference in the marks awarded by the different raters. Obviously, the main reason for this is the lack of explicit agreed criteria for carrying out the marking task. Whatever the reason, candidates would have been affected by the choice of the rater assigned to mark their writings. Compare the marks of the first rater with those of the second: who would you prefer to be marked by?
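The comparison that follows can be reproduced directly from Table 8. The sketch below computes each rater's mean and each composition's score range in Python. One caveat: the first mark for d128 is reconstructed from its reported mean and range, so treat that entry as an assumption; the printed means may differ from Table 8 in the last decimal for that reason.

```python
# Scores from Table 8: student -> (rater 1, rater 2, rater 3).
# The first mark for d128 is reconstructed from its mean and range (assumption).
scores = {
    "d128": (2.0, 2.5, 3.5), "d130": (3.5, 4.0, 4.0), "d132": (3.0, 3.5, 4.0),
    "d134": (3.5, 4.5, 4.5), "d136": (2.0, 3.0, 3.0), "d138": (2.0, 3.0, 2.5),
    "d140": (2.5, 3.0, 2.5), "d142": (2.0, 3.0, 3.5), "d144": (2.0, 3.0, 1.5),
    "d146": (3.5, 4.5, 4.2), "d148": (2.5, 3.5, 4.5), "d150": (2.0, 2.5, 1.5),
    "d152": (2.5, 3.0, 3.5), "d154": (1.8, 3.0, 3.0), "d156": (3.0, 4.0, 4.2),
}

# Per-rater mean: how severe or lenient each rater is overall.
for i in range(3):
    marks = [s[i] for s in scores.values()]
    print(f"rater {i + 1}: mean = {sum(marks) / len(marks):.1f}")

# Per-composition gap between the highest and lowest mark: the simplest
# indicator of inter-rater disagreement for a single candidate.
gaps = {sid: max(s) - min(s) for sid, s in scores.items()}
worst = max(gaps, key=gaps.get)
print(f"average gap = {sum(gaps.values()) / len(gaps):.1f}")
print(f"largest gap = {worst} ({gaps[worst]:.1f})")
```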
The mean of the first rater is 2.5, while that of the second is 3.4. Moreover, the mark range given to each composition shows some degree of unreliability in the writing scores. Looking at the ranges of d148 (2.5-4.5), d128 (2.0-3.5) or d144 (1.5-3.0), it is clear that there is a rather big gap between the lowest and the highest marks. The average gap of the ranges given by the different raters is 1.0, and the smallest is 0.5 (d130, d140). Therefore, we must seek ways to bring raters closer together, both in the marks they award and in the consistency of their own judgments. This involves the development of appropriate rating scales and the standardization of raters to those scales.

Looking at the particular differences in the scores of some testees, we can see the great differences in the three raters' scoring and evaluating. Typically, d148's writing paper, which received three different marks (2.5, 3.5, 4.5), shows that the first rater seemed to focus more on grammar, while the third rater highly appreciated the ideas and logical organization of the writing, even though the writer made a lot of grammatical mistakes. Besides, we can recognize the differences in scoring among the raters in the case of d144 (with the scores 2.0, 3.0, 1.5). In the first and third raters' opinions, he deserved a mark of 2.0 or 1.5, as he not only made a lot of grammatical mistakes but had not even completed his writing. However, the second rater seemed to be interested in the story about the first meeting between the writer and a girl, so she gave a mark of 3.0 for that writing. In some other cases, however, such as d130, d132, d134, d146 and d156, the three raters made nearly the same comments, though there were still small differences between their marks. They said that there were only a few grammatical mistakes in those writings, and that the writings were generally complete and clearly organized. Those writings were highly appreciated by all three raters.

In conclusion, based on the statistics and analysis above, in combination with the information collected from the three raters through informal talks, we have drawn some brief comments on those raters:

(i) The first rater seemed to be tight in marking. She tended to underline a few mistakes (without correcting them). She did not give a marking scale or any comments on the writings.

(ii) Like the first rater, the second rater did not provide any rating scale or feedback on the writings. However, she carefully underlined and corrected a lot of the test-takers' mistakes. She seemed to focus more on the mistakes. The scores she gave were above average.

(iii) The third rater paid more attention to the ideas and the organization of the writings. She underlined the mistakes with some corrections. In particular, she applied a marking scale to each writing. Some of her marks were higher than the other raters'.

Obviously, the three raters yielded quite different results in the scoring process, even though they did not have many writings to score and there was no time limit or psychological pressure. In practice, therefore, it is very difficult to obtain high reliability when scoring a large number of writing papers under time pressure, even excluding other affecting factors (weather, noise, etc.). Moreover, the writing test was scored by only one rater. Thus, there should be a detailed rating scale, and at least two raters should be required for scoring each writing paper.

3.3.4 Results
Based on the practical context of the study and the results of the analysis of the achievement writing test, we have found that the test is valid. However, the results of the analysis of the 156 test scores and of the scores of the 15 sample compositions marked by three different raters show that there is some degree of unreliability in the scores. The results are represented as follows:

(i) The mean of the achievement writing test scores:

M = Σxf / N = 878 / 156 ≈ 5.6

(ii) The standard deviation of the achievement writing test scores:

SD = √( Σ(x - M)² / N ) = √( 29.12 / 156 ) = 1.19

(iii) The reliability coefficient:

Rxx = 1 - M(K - M) / (K × SD²) = 1 - 5.6 × (30 - 5.6) / (30 × 1.19²) = 1 - 136.64 / 42.483 ≈ -2.2

(iv) The scores of the 15 sample compositions given by the three different raters (shown in Table 8).

Given the above results, it can be concluded that test unreliability has occurred in the scoring process. However, the question we should raise here is: how can greater reliability be obtained in such a writing test? This question is clarified in the next part.

CONCLUSION

In this part, the author draws together a summary of what has been done and makes some suggestions for the writing achievement test itself, and for testing writing in general, so that reliability can be achieved.

1. Discussion

Obviously, a well-designed, well-administered and carefully scored test not only helps the teacher to evaluate the students' proficiency precisely but also motivates the teacher to teach better and the students to learn more. Being a teacher of English with a great interest in the field of testing, the author would have liked to evaluate the whole writing achievement test used for the students (covering both its design and its administration), as well as to look for further evidence from the students and teachers, in order to measure the reliability of the test. However, due to the limitation of time and the unavailability of data, the study focuses only on analyzing the test scores of 156 students and the 15 sample compositions, which are considered the problematic factor behind the unreliability of the test.

Basing the work on the literature reviewed in chapter 1, which serves as the theoretical framework for the study, in combination with the practical base of the teaching, learning and testing situation of the second-year English major students at N.A. JTTC (chapter 2), an analysis of the 156 test scores and of the scores of the 15 sample compositions has been conducted. Along with the results for the mean, standard deviation and reliability coefficient of the 156 test scores, a comparison of the scores of the 15 sample compositions marked by different raters has been made, and it was found that there was some degree of unreliability in the achievement writing test.

2. Suggestions

As defined by Henning (1987: 74), reliability, one of the criteria of a good test, is a measure of the accuracy, consistency, dependability or fairness of scores resulting from the administration of a particular examination. Unreliability is often found in writing and speaking tests; however, making such tests more reliable is a very complex issue. In this research, we have found that there is some degree of unreliability in the achievement writing test for the second-year English major students, and that its main cause is the scoring process. Therefore, in this section some suggestions are offered to test designers and raters, for the composition task itself and for testing writing in general.

2.1 Suggestions for test designers

'Test designers' is the term used to refer to the people who are responsible for writing a test and for ensuring its validity and reliability.
In the previous chapters, the writing achievement test was shown to be valid; however, there is some degree of unreliability. Therefore, test designers are recommended to pay more attention to the following issues in order to obtain higher reliability.

First, the format of the test shows that it consists of both controlled writing (tasks 1, 2 and 3) and free writing (task 4); however, task 4 was found to be the most difficult and proved to be the least reliable. Therefore, the test designers are advised either to add one more composition-writing task, a guided writing, or to turn task 4 into a guided writing with a detailed marking scale, in order to measure the students' writing skill accurately and not to confuse the students before they write the second composition, the free writing.

Second, looking at the marking scale (Table 4), we can see that the marking criteria are too general for the raters (5 points for the whole composition); in fact, there was no detailed rating scale for the composition. In addition, as discussed above, most of the raters put more emphasis on grammar errors when marking the composition. Therefore, in order to make the rating process more reliable, the test designers should establish a detailed rating scale for the raters. According to scholars such as Alderson (1995), Hughes (1989) and Weir (2005), rating scales fall into two categories, holistic scales and analytic scales. They are described as follows:

(i) Holistic scales involve the assignment of a single score to a piece of writing on the basis of an overall impression of it. A holistic scale is generally found to be much faster to use than an analytic scale. Heaton and Hughes (1989) show that having each student's work scored by four different trained impression scorers can result in high scorer reliability. Alderson (1995) offers an example of a holistic scale, adapted from the UCLES International Examination in English as a Foreign Language General Handbook of 1987. It is represented in the following table:

Scores   Level       Description
18-20    Excellent   Natural English, with minimal errors and complete realization of the task set.
16-17    Very good   More than a collection of simple sentences, with good vocabulary and structures. Some non-basic errors.
12-15    Good        Simple but accurate realization of the task set, with sufficient naturalness of English and not many errors.
8-11     Pass        Reasonably correct but awkward and non-communicating, OR fair and natural treatment of the subject, with some serious errors.
5-7      Weak        Original vocabulary and grammar both inadequate to the subject.
0-4      Very poor   Incoherent. Errors show a lack of basic knowledge of English.

Table 9: A sample holistic scale (Alderson, Clapham and Wall, 1995: 108)
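Read as data, a holistic scale is simply a mapping from score bands to level labels. The short sketch below encodes Table 9 in Python and looks up the band for a given mark; it is an illustration of how such a scale is applied, not part of the cited handbook.

```python
# Table 9 encoded as (lower bound, upper bound, level) bands.
HOLISTIC_BANDS = [
    (18, 20, "Excellent"), (16, 17, "Very good"), (12, 15, "Good"),
    (8, 11, "Pass"), (5, 7, "Weak"), (0, 4, "Very poor"),
]

def holistic_level(score: int) -> str:
    """Return the level label for a whole-number mark out of 20."""
    for low, high, level in HOLISTIC_BANDS:
        if low <= score <= high:
            return level
    raise ValueError(f"score {score} is outside the 0-20 scale")

print(holistic_level(14))  # -> Good
```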
Obviously, with score bands as broad as those in Table 9, raters often find it difficult to assign an accurate score (for instance, anywhere between 12 and 15 for the same level, 'Good'). Weir (2005: 188) quotes Weigle's statement of the disadvantages of holistic scoring: "One drawback to holistic scoring is that a single score does not provide useful diagnostic information about a person's writing ability. This is especially problematic for second language writers, since different aspects of writing ability develop at different rates for different writers. Another disadvantage of holistic scoring is that holistic scores are not always easy to interpret, as raters do not necessarily use the same criteria to arrive at the same scores." According to Alderson, Clapham & Wall (1995: 108), when raters use this type of scale, they are asked not to pay too much attention to any particular aspect of the candidate's production, but rather to make a judgment of its overall effectiveness, and to make their judgments quickly. Therefore, this type of scale is not recommended for scoring a composition, which requires several components of a performance, such as grammar, vocabulary and handwriting, to be judged separately. Alderson, Clapham & Wall, Hughes, Weir and Heaton all recommend analytic scales for scoring compositions.

(ii) Analytic scales, defined as scales giving a score for each component of a composition, are believed to suit the process of rating a composition best, especially 'where most teachers have little opportunity to enlist the services of two or three colleagues in marking class compositions' (Heaton, 1989: 148). Hughes (1989: 94) points out the disadvantages as well as the advantages of analytic scales and provides readers with a clear sample of an analytic rating scale. Another model of an analytic rating scale, adapted from Alderson, Clapham & Wall (1995: 109), is presented below, with descriptors banded from 0 to 3 for each criterion:

Relevance and adequacy of content:
0. The answer bears almost no relation to the task set; totally inadequate answer.
1. Answer of limited relevance to the task set; possibly major gaps in treatment of topic and/or pointless repetition.
2. For the most part answers the task set, though there may be some gaps or redundant information.
3. Relevant and adequate answer to the task set.

Compositional organization:
0. No apparent organization of content.
1. Very little organization of content; underlying structure not sufficiently apparent.
2. Some organizational skills in evidence, but not adequately controlled.
3. Overall shape and internal pattern clear; organizational skills adequately controlled.

Cohesion:
0. Cohesion almost totally absent; writing so fragmentary that comprehension of the intended communication is virtually impossible.
1. Unsatisfactory cohesion may cause difficulty in comprehension of most of the intended communication.
2. For the most part satisfactory cohesion, though occasional deficiencies may mean that certain parts of the communication are not always effective.
3. Satisfactory use of cohesion resulting in effective communication.

Adequacy of vocabulary for purpose:
0. Vocabulary inadequate even for the most basic parts of the intended communication.
1. Frequent inadequacies in vocabulary for the task; perhaps frequent lexical inadequacies and/or repetition.
2. Some inadequacies in vocabulary for the task; perhaps some lexical inappropriacies and/or circumlocution.
3. Almost no inadequacies in vocabulary for the task; only rare inappropriacies and/or circumlocution.

Grammar:
0. Almost all grammatical patterns inaccurate.
1. Frequent grammatical inaccuracies.
2. Some grammatical inaccuracies.
3. Almost no grammatical inaccuracies.

Mechanical accuracy I (punctuation):
0. Ignorance of conventions of punctuation.
1. Low standard of accuracy of punctuation.
2. Some inaccuracies of punctuation.
3. Almost no inaccuracies of punctuation.

Mechanical accuracy II (spelling):
0. Almost all spelling inaccurate.
1. Low standard of accuracy in spelling.
2. Some inaccuracies in spelling.
3. Almost no inaccuracies in spelling.

In total: Content ___ Organization ___ Cohesion ___ Vocabulary ___ Grammar ___ Punctuation ___ Spelling ___ SCORE ___

Table 10: A sample analytic scale (Alderson, Clapham & Wall, 1995: 109-110)

Based on the above detailed rating scale, it is obvious that analytic scales are entirely appropriate for rating writing tests in general, and compositions in particular, as they help students as well as scorers to see the assessment of every component of a composition and avoid the problem of uneven development of subskills.
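Because an analytic scale yields one mark per criterion, the final score is a simple aggregation, and the component profile can be reported alongside it for diagnostic feedback. The sketch below is a minimal illustration: the 0-3 bands follow Table 10, but the component marks and the equal weighting of criteria are assumptions.

```python
# Component marks for one composition on the 0-3 bands of Table 10
# (illustrative values; equal weighting of the criteria is an assumption).
components = {
    "content": 2, "organization": 2, "cohesion": 1, "vocabulary": 2,
    "grammar": 1, "punctuation": 3, "spelling": 2,
}

total = sum(components.values())  # out of 7 criteria x 3 points = 21
print(f"total = {total}/21")
for name, mark in components.items():
    print(f"  {name}: {mark}/3")  # diagnostic profile for feedback
```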
Moreover, Hughes states that a set of component scores may make the scoring more reliable.

In conclusion, according to Hughes (1989: 97), both kinds of rating scale are applicable for scoring, and the choice between them depends on the purpose of the testing and the circumstances of the scoring. According to Madsen (1983: 170), holistic scoring of compositions is recommended where the raters are well and specially trained for rating compositions; in his point of view, holistic scoring is one of the better ways to evaluate the complex communicative act of writing. However, in our testing situation, where there is a lack of raters of writing tests, especially skillful ones, analytic scales are recommended for rating a composition, and this type of scoring can sometimes be combined with a holistic scale in order to obtain higher reliability.

2.2 Suggestions for raters

In the rating process, raters play an important role, especially when ratings are subjective, as McNamara (2000: 36) says: 'the rating given to a candidate is a reflection, not only of the quality of the performance, but of the qualities as a rater of the person who has judged it.' Indeed, the results for the 15 sample compositions show that the raters' scoring is one of the main factors causing the unreliability. Therefore, in this section we attempt to find out how to obtain greater inter-rater reliability. Based on the ideas of Alderson, Clapham & Wall (1995) and McNamara (2000), we set out the steps the raters at N.A. JTTC need to follow in the rating process in order to obtain high reliability. Those steps are as follows:

(i) We should choose and carefully train a small group of teachers who are responsible for rating writing tests. Among them, a chief rater will be in charge of organizing and leading the group, and at least two raters should be required to score the same performance.

(ii) There should be a meeting of the group of raters before scoring is conducted. In the meeting, they have to read the writing test to find any problems that could cause the students' misunderstanding or confuse them in their performances (such as unclear instructions or wrong spellings), and then analyze the test designers' rating scale carefully, adjusting it if necessary.

(iii) The next step is to try out the rating scale provided: choose a number of students' writings at random as samples, give copies to the raters, and ask them to score those samples seriously. Then the raters should compare their marks, discuss any differences in order to reach an agreement, and refine the rating scale so that it is easier to understand and use.

(iv) At last, the raters receive copies of the refined rating scale and score the students' writings. After the first rater's work, each writing is checked by a second rater or by the chief rater, in order to ensure the test's reliability and to make the raters feel more responsible for their scoring (a mechanical check of this kind is sketched below).
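Step (iv), the second-marking check, can itself be made routine: once two raters have scored each paper, any pair of marks that differs by more than an agreed tolerance is referred to the chief rater. The sketch below is one possible implementation of such a routine; the score pairs are illustrative values loosely drawn from Table 8, and the 1.0-point tolerance is an assumption, not a figure from the study.

```python
# Double-marked scores: student -> (first rater, second rater).
# Illustrative pairs only, loosely drawn from Table 8.
pairs = {"d128": (2.0, 2.5), "d144": (2.0, 3.5), "d148": (2.5, 4.5)}

TOLERANCE = 1.0  # maximum acceptable difference (an assumed threshold)

for sid, (first, second) in pairs.items():
    if abs(first - second) > TOLERANCE:
        # Disagreement too large: send to the chief rater for a third opinion.
        print(f"{sid}: {first} vs {second} -> refer to chief rater")
    else:
        # Acceptable agreement: report the average of the two marks.
        print(f"{sid}: final mark {(first + second) / 2:.1f}")
```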
Besides, giving feedback is also useful and necessary for students and raters alike, since it helps students recognize their strengths and weaknesses, reinforces the raters' responsibility for their work, and makes the test more reliable. In reality, after each test our students are only informed of their marks, and there is no other feedback. That is why the author would like to recommend some forms of feedback for raters. They are as follows:

(i) During the rating process, the raters take notes of the strong and weak points of the students' writings; then, after the exam and based on the notes, general feedback about the writings is given to the class of students.

(ii) The scored writing papers are given back to the students. The students have some time to look back at their writings, and then the teachers give general feedback about the students' strengths and weaknesses in their writings.

In general, the second way is more effective, as it provides the students with more details about their writings and requires more responsibility from the raters. However, the first one is easier to conduct, as it takes less time and is more convenient.

In summary, the above are some useful recommendations for test designers and raters to improve the reliability of writing achievement tests, especially of compositions. It is hoped that writing tests will obtain higher reliability with a detailed rating scale made by the test designers and adapted carefully by the raters. Finally, the raters and their scoring can be shown to be reliable through the feedback given to the students after each exam, which both motivates the students and gives them a chance to consolidate what they have learnt.

REFERENCES

1. Allen, J.P.B. and Corder, S.P. (1974). Techniques in Applied Linguistics. Oxford: Oxford University Press.
2. Alderson, J.C., Clapham, C. and Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
3. Bachman, L.F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
4. Bachman, L.F. and Palmer, A.S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
5. Carroll, J.B. (1968). The psychology of language testing. In A. Davies (ed.), Language Testing Symposium: A Psycholinguistic Perspective. London: Oxford University Press.
6. Finocchiaro, M. and Sako, S. (1983). Foreign Language Testing: A Practical Approach. New York: Regents Publishing Company, Inc.
7. Harrison, A. (1987). A Language Testing Handbook. Macmillan Publishers.
8. Madsen, H.S. (1983). Techniques in Testing. Oxford: Oxford University Press.
9. Heaton, J.B. (1988). Writing English Tests. London: Longman.
10. Henning, G. (1987). A Guide to Language Testing. Cambridge: Newbury House Publishers.
11. Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.
12. Ibe, M.D. (1981). Language test analysis beyond the validity and reliability criteria. In J.A.S. Read (ed.), Papers on Language Testing, Occasional Papers No. 18, p. 1. Singapore: SEAMEO Regional Language Centre.
13. McNamara, T. (2000). Language Testing. Oxford: Oxford University Press.
14. McNamara, T., Davies, A., Brown, A., Elder, C., Hill, K. and Lumley, T. (1999). Studies in Language Testing: A Dictionary of Language Testing. University of Melbourne.
15. Oller, J.W. Jr. (1979). Language Tests at School. London: Longman.
16. Read, J.A.S. (1983). What is a good classroom test? In D. Crabbe (ed.), Guidelines: A Periodical for Classroom Language Teachers, Vol. 5, No. 1, pp. 1-7. Singapore: SEAMEO Regional Language Centre.
17. Shohamy, E. (1985). A Practical Handbook in Language Testing for the Second Language Teachers. Tel-Aviv: Tel-Aviv University Press.
18. Wallen, N. and Fraenkel, J.R. (1996). How to Design and Evaluate Research in Education. The McGraw-Hill Companies.
19. Weir, C.J. (2005). Language Testing and Validation: An Evidence-Based Approach. Palgrave Macmillan.
20. Wiersma, W. and Jurs, S.G. (1990). Educational Measurement and Testing. Boston: Allyn and Bacon.