Validity and fairness

Michael Kane
National Conference of Bar Examiners, USA

Language Testing 27(2), 177–182. DOI: 10.1177/0265532209349467. © The Author(s) 2010.

Keywords: assessment, bias, consequential evidence for validity, fairness, validity

Xi's article (this volume) lays out a broad framework for studying fairness as comparable validity across groups within the population of interest. She proposes to develop a fairness argument that would identify and evaluate potential fairness-based objections to proposed interpretations and uses of the test scores. The fairness argument would focus on whether an interpretation is equally plausible for different groups and whether the decision rules are appropriate for the groups. The model combines fairness and validity in a common framework.

Under Xi's model (this volume), a fairness analysis would evaluate a range of potential challenges to the comparability of score-based inferences and decisions across various groups within a population. The resulting fairness argument would suggest a range of empirical hypotheses about the comparability of interpretations and decision outcomes across groups, and thereby identify potential threats to fairness/validity. This argument-based approach would address how fairness issues play out in score-based interpretations, decisions, and consequences. By organizing fairness investigations within a well-defined, argument-based framework, this approach can be particularly useful in focusing research on specific threats to fairness/validity.

In general, the relationship between validity and fairness depends on how we define these two concepts, and perhaps more to the point, how broadly we define each of them. If we define validity narrowly (e.g. Popham, 1997) and fairness broadly (Kunnan, 2000), then validity could be conceived of as a component of fairness; if we define fairness in terms of a specific technical issue, like predictive bias (e.g. Cleary, 1968), and validity broadly (Cronbach, 1971; Messick, 1989; Kane, 2006), then fairness could be conceived of as a component of validity. Conceptually, I am inclined to define both validity and fairness quite broadly, and as a result, I don't see either of these concepts as being completely included within the other. Rather, I see them as closely related ways of looking at the same basic question: Are the proposed interpretations and uses of the test scores appropriate for a population over some range of contexts? Validity and fairness come at the question from somewhat different points of view and involve different emphases, but the overlap is more pronounced than the differences.
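For readers less familiar with Cleary's regression-based notion of predictive bias mentioned above, a minimal sketch of its usual formalization follows; this is standard textbook notation, not part of Kane's text. A test is considered unbiased as a predictor of a criterion when a single regression line fits all groups, so that group membership adds no predictive information beyond the test score.

```latex
% A common formalization of predictive bias in the Cleary (1968) sense (illustrative only).
% Y = criterion measure, X = test score, G = group indicator; the symbols are ours, not Kane's.
\[
  Y = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 (X \cdot G) + \varepsilon
\]
% The test is free of predictive bias in this narrow, technical sense when
% H_0 : \beta_2 = \beta_3 = 0 cannot be rejected, i.e. one regression line serves all groups.
```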
Fairness

Fairness is a very complex and potentially contentious issue, with many possible definitions (Camilli, 2006; Xi, this volume). In the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999), four possible definitions of fairness are considered, along with some variations on these four possibilities. I will distinguish two general conceptions of fairness, which have a long tradition in political thought and are specified in the United States' legal system in terms of two kinds of due process. The due process requirement in the Constitution (5th and 14th Amendments) 'prohibits the government from unfairly or arbitrarily depriving a person of life, liberty or property' (Garner, 2001, p. 223). Fairness conceived of as the absence of bias or arbitrariness is a core value in democratic societies (although often honored in the breach) and in measurement (Porter, 2003).

Procedural due process requires that the same rules be applied to everyone in more or less the same way. It requires that legal proceedings be conducted 'according to established rules and principles for the protection and enforcement of private rights, including notice and the right to a fair hearing before a tribunal with the power to decide the case' (Garner, 2001, p. 223). Substantive due process is the doctrine that 'the Due Process Clauses of the 5th and 14th Amendments require legislation to be fair and reasonable in content and to further a legitimate governmental objective' (Garner, 2001, p. 223). Substantive due process requires that the procedures to be applied be reasonable in general and in the context in which they are applied.

Applying these general legal concepts to fairness in assessment, procedural fairness can be said to require that all test takers be treated in essentially the same way: that they take the same test or equivalent tests, under the same or equivalent conditions, and that their performances be evaluated using the same (or essentially the same) rules and procedures. This is a very basic notion of fairness; everyone is to be treated in the same way, unless there is some good reason to adjust the procedures in particular cases (e.g. by providing an audio version of the test for a blind test taker). Procedural fairness can be viewed as a lack of bias for or against any individual or group, and corresponds roughly to the first two definitions of fairness in the Standards, which relate fairness 'to absence of bias and to equitable treatment of all examinees in the testing process' (AERA, APA, NCME, 1999, p. 74). That is, testing programs should be free from bias in the materials and procedures used.

Procedural fairness is an essential requirement for both fairness and validity in testing. The whole point of standardized testing is to treat everyone in the same way, or, if adjustments in procedures or materials are necessary, to treat them in equivalent ways (i.e. in ways that 'level the playing field').
Substantive fairness in testing requires that the score interpretation and any test-based decision rule be reasonable and appropriate, and in particular, that they be (at least roughly) equally appropriate for all test takers. Substantive fairness includes a much wider range of issues than procedural fairness and is therefore more difficult to evaluate. For example, substantive fairness would include all of the issues subsumed under the Standards' third definition of fairness, 'that examinees of equal standing with respect to the construct the test is intended to measure should on average earn the same test score, irrespective of group membership', and the fourth definition, 'equity in opportunity to learn the material covered in an achievement test' (AERA et al., 1999, p. 74), but it could also include other potential concerns, including the consequences of any decisions being made and, particularly, differential outcomes across groups.

Procedural fairness is concerned with how we treat test takers, in particular with how consistently and fairly we treat them, and is therefore largely under our control. Substantive fairness is concerned with how the testing program functions and, in particular, with how well it functions for different groups. Substantive fairness is not completely under our control up front; we can design testing programs to work well for all groups, but empirical evidence for how well the program is working for different groups is generally available only after the program is in operation.

Note that, to the extent that the Standards' third definition of fairness were seriously violated, test scores would not reflect standing on the construct of interest for at least some test takers, and therefore the basic score interpretation would be invalidated (for at least some test takers). Validity and fairness are intertwined. Equity in opportunity to learn material is also a validity issue if the purpose of the assessment is to evaluate how much has been learned in a particular educational program, but not if the purpose of the assessment is to measure level of achievement in some specified domain (e.g. in a licensure or certification test); in this latter case, a lack of opportunity to learn would not necessarily indicate a lack of fairness or validity.
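The Standards' third definition quoted above also suggests a straightforward empirical check: match examinees on an independent indicator of standing on the construct and compare mean test scores across groups within each matched stratum. The sketch below is only an illustration of that logic, not a procedure taken from Kane or the Standards; the data, column names, and the assumption that an external measure of construct standing is available are all hypothetical.

```python
# Illustrative check of the Standards' third definition of fairness: examinees of
# equal standing on the construct should, on average, earn the same test score,
# irrespective of group membership. Data and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "construct": [47, 51, 55, 58, 61, 48, 52, 56, 59, 60],  # external indicator of construct standing
    "score":     [20, 23, 26, 29, 32, 21, 24, 27, 28, 30],  # observed test score
    "group":     ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

# Match examinees by binning the construct measure into strata, then compare
# mean observed test scores across groups within each stratum.
df["stratum"] = pd.qcut(df["construct"], q=2, labels=["low", "high"])
means = df.groupby(["stratum", "group"], observed=True)["score"].mean().unstack("group")
means["gap"] = means["A"] - means["B"]
print(means)  # large within-stratum gaps would flag a potential fairness/validity problem
```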
Shepard (1993) provides a good example, in terms of a kindergarten 'readiness' test, of how some of these fairness and validity issues play out in practice. Suppose that a standardized 'readiness' test is shown to be an accurate measure of certain basic skills, and it is also found to be an excellent predictor of performance in kindergarten. The assessment can be considered procedurally fair, and therefore would not be invalidated on that account. By assumption, it is also valid and fair as a measure of skill level and as a predictor of performance in kindergarten. Does a positive judgment about procedural fairness/validity justify the test's use in deciding whether to admit children to kindergarten this year or to hold them back for a year? Shepard (1993) suggests that, if a low 'readiness' score indicates a developmental lag that will be resolved by waiting a year, this strategy would seem to make sense; however, if low scores indicate a home environment that does not promote development of the skills (i.e. a lack of opportunity to learn), keeping the child out of school for another year would seem to be counterproductive. In this second case, the use of the test to make admission decisions is inappropriate/invalid and unfair to those children who have not had an opportunity to learn the skills. If most of the children for whom counterproductive decisions are being made are in a particular subgroup within the population (defined by socio-economic class, race, or gender), the decision procedure is arguably unfair to that subgroup, even though the testing process is unbiased. The assessment is procedurally fair, and the interpretations of scores in terms of current achievement are valid/fair, but the decisions being made are substantively unfair for children who have not had an opportunity to learn the skills at home.

Procedural fairness is a necessary requirement for fair and valid assessment, but it is not enough. It is also necessary that scores have comparable meaning in different groups and that any decisions based on the test scores be appropriate for all of the groups.

Validity

Validation involves an evaluation of the credibility, or plausibility, of the proposed interpretations and uses of test scores (Cronbach, 1971; Messick, 1989; Kane, 2006). Effective validation, therefore, depends on a clear, explicit statement of the proposed interpretations and uses, with the statement including a specification of the population and of the range of contexts in which the interpretations and uses will occur. The assumptions inherent in the proposed interpretations and uses of test scores can be made explicit in the form of an interpretive argument that lays out the details of the reasoning leading from the test performances to the conclusions included in the interpretation and to any decisions based on the interpretation (Kane, 2006).

Each inference in an argument is based on a rule of inference, or warrant, that allows for the inference (Toulmin, 2003), and the purpose of the interpretive argument is to make these warrants and their supporting assumptions explicit, and therefore available for inspection. Warrants are supported by evidence, or backing (e.g. expert judgment can provide adequate backing for a scoring key). The warrants, and therefore the inferences, depend on assumptions (e.g. that the rubric developed to translate test performances into test scores is reasonable, that it is applied correctly, and that the test was administered under standard conditions). The testing procedures and the proposed interpretations are developed for specific populations, defined by age, educational background, language proficiency, etc., and some of the assumptions built into the interpretive argument may depend on the population (e.g. the expectations for a written response would vary as a function of age).

The interpretive argument provides a generic version of the proposed interpretation/use of scores, which can be applied to some population of interest. The interpretations and uses of the test results for individuals in the population are instances of this generic interpretive argument. The generic interpretive argument provides the rationale for the interpretation of each examinee's assessment results and for any decisions based on these results.
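To make the structure Kane describes concrete, the sketch below records a chain of inferences, each with its warrant, backing, and assumptions, as a small data structure. This is purely schematic: the class names, field names, and the particular chain shown are illustrative choices, not part of the argument-based framework itself.

```python
# Schematic sketch (illustrative, not Kane's notation) of an interpretive argument as a
# chain of inferences, each licensed by a warrant, supported by backing, and resting on
# assumptions that validation should make available for inspection.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Inference:
    moves_from: str                 # e.g. "observed performance"
    moves_to: str                   # e.g. "test score"
    warrant: str                    # the rule that licenses the inference (Toulmin, 2003)
    backing: str                    # evidence supporting the warrant
    assumptions: List[str] = field(default_factory=list)

# One possible generic chain from performances to a decision, to be instantiated
# for each examinee in the intended population.
interpretive_argument = [
    Inference("observed performance", "test score",
              warrant="scoring rubric",
              backing="judgments by panels of content experts",
              assumptions=["rubric applied correctly",
                           "test administered under standard conditions"]),
    Inference("test score", "interpretation (standing on the construct)",
              warrant="generalization/extrapolation to the target domain",
              backing="reliability, equating, and relational evidence",
              assumptions=["tasks represent the domain for this population"]),
    Inference("interpretation", "decision (e.g. placement)",
              warrant="decision rule",
              backing="evidence that resulting decisions are appropriate",
              assumptions=["rule is appropriate for all groups in the population"]),
]

for inf in interpretive_argument:
    print(f"{inf.moves_from} -> {inf.moves_to}: warrant = {inf.warrant}")
```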
The interpretive argument may include statistical models, psychometric models, formal and informal theories about learning and performance, and general rules of formal and informal logic, in some combination, for each test–interpretation/use–population–context combination (Kane, 1992, 2006).

A validity argument is intended to provide an overall evaluation of the evidence for and against the proposed interpretation/use (i.e. for and against the interpretive argument). The plausibility of the interpretive argument can be evaluated by analyzing its overall clarity and coherence and by assessing the plausibility of its inferences and assumptions. This evaluation, or validation, of the interpretive argument generally requires many different kinds of analysis and evidence, which are used to evaluate the different parts of the interpretive argument. Some of the evidence may be empirical (e.g. studies of reliability, equating, relationships to other variables) and some will be judgmental (e.g. analyses of how scoring keys were developed and applied), and the validity argument relies on all of this evidence to reach general conclusions about how much confidence we can place in the proposed interpretations and uses.

Once the interpretive argument is developed, it can presumably be applied to a population of test takers under some range of conditions. So, for example, the scores on a literacy test given to all fourth graders in a school district would all be interpreted in essentially the same way, based on an interpretive argument.

A basic case for the validity of the proposed interpretation and use (i.e. the generic interpretive argument) is typically made during test development. This development stage tends to have a confirmationist bias, because it is an integral part of developing an assessment that is designed to support certain interpretations and uses (Kane, 2006). In the second stage of validation, which focuses on a more objective, critical appraisal of the proposed interpretation and use of scores for a particular purpose, in particular contexts, and with a particular population, the evaluation of potential challenges to the proposed interpretive argument takes center stage.

When an inference is drawn in a particular case (e.g. from an observed performance to a score), the warrant (e.g. the scoring rule) and its backing (e.g. judgments by the panels of content experts who developed the scoring rule) may not be explicitly mentioned, especially if the inference is fairly routine in a particular context and the audience is friendly, but the warrant is invoked every time a score is interpreted.

Validity and Fairness

Validity and fairness are closely connected. An assessment that is unfair, in the sense that it systematically misrepresents the standing of some individuals or some groups of individuals on the construct being measured, or that tends to make inappropriate decisions for individuals or groups, is, to that extent, not valid for that interpretation or use. Similarly, an assessment that is not valid, in the sense that it tends to generate misleading conclusions or inappropriate decisions for some individuals or groups, will also be unfair.

Validity theory has tended to focus on the accuracy and appropriateness of score-based interpretations and decisions about all of the individuals in the population of interest.
Analyses of fairness have tended to focus on group differences and on differences in the accuracy and appropriateness of interpretations and decisions across groups, which are defined in terms of race/ethnicity, gender, age, and so on. The issues being addressed are basically the same.

Xi's model (this volume) makes this relationship explicit. My main reason for drawing a distinction between an interpretive argument and a validity argument was to make the point that, if validation is to provide an evaluation of the proposed interpretations and uses of test scores, it should begin with a clear statement of what is being claimed. The interpretive argument is to provide an explicit statement of the reasoning leading from test performances to conclusions and decisions. The validity argument would then provide an evaluation of the plausibility of the interpretive argument. Xi (this volume) has extended this framework by introducing a fairness argument, and thereby giving much more attention to group differences and the implications of these differences for fairness.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 221–256). Westport, CT: American Council on Education and Praeger.

Cleary, T. A. (1968). Test bias: Prediction of grades of negro and white students in integrated colleges. Journal of Educational Measurement, 5, 115–124.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.

Garner, B. (Ed.) (2001). Black's law dictionary (2nd pocket ed.). St Paul, MN: West Group.

Kane, M. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.

Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education and Praeger.

Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 1–14). Cambridge, UK: Cambridge University Press.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.

Popham, W. J. (1997). Consequential validity: Right concern – wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13.

Porter, T. (2003). Measurement, objectivity, and trust. Measurement: Interdisciplinary Research and Perspectives, 1, 241–255.

Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of research in education (Vol. 19, pp. 405–450). Washington, DC: American Educational Research Association.

Toulmin, S. (2003). The uses of argument (2nd ed.). Cambridge: Cambridge University Press.