The Platinum Standard
Psychometrics is an important part of test development and operations. Unfortunately, many psychometric practices undermine validity in the pursuit of computational convenience, scoring model robustness and test reliability.
Reliability
There is no question that reliability is incredibly important when measuring anything. Any tool that cannot be trusted to return consistent results cannot be trusted at all. This is the principle of reliability. Of course, small deviations from measurement to measurement are not necessarily a problem. That is, the precision of a tool is related to its reliability. If a tool can only be relied upon to give accurate readings down to the centimeter, but you need accurate measurement down to the millimeter, the tool is useless. If the measurement scheme works in some conditions, but not all the conditions in which it needs to be used, it might be useless.
Reliability is incredibly important.
But reliability is not the most important issue. The most important issue is validity. That is, does the tool measure what it purports to measure? A tool meant to measure health that instead measures height might be a problem. A tool meant to measure math proficiency that instead measures reading proficiency is almost certainly a problem.
This leads to perhaps the most important question in statistics and in educational measurement. Would you rather have a tool that:
- Measures the wrong thing, but does it remarkably consistently, or
- Measures the right thing, but is notably inconsistent?
RTD takes a strong stance that measuring the right thing is more important than maximizing measurement consistency. Validity is more important than reliability.
We know the classic response: if your measurement is not consistent, how can you know what you are measuring or when you can trust it? We know that reliability is an upper bound on validity. But we do not fall for the idea that simply raising the ceiling makes anything in the room taller. Yes, we are willing to sacrifice some reliability in order to increase validity. Unfortunately, psychometrics has no way to mathematically model validity, and so it generally focuses on maximizing reliability—often pushing us to use rather short objects under a very tall ceiling.
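For readers who want the textbook version of that "upper bound" claim, classical test theory puts it this way (standard notation, not anything particular to this piece):

```latex
% The observed validity coefficient r_{XY} (test X against criterion Y)
% is bounded by the reliabilities r_{XX'} and r_{YY'} of the two measures:
r_{XY} \le \sqrt{r_{XX'}\, r_{YY'}}
% With a perfectly reliable criterion (r_{YY'} = 1), the test's own
% reliability caps its validity: r_{XY} \le \sqrt{r_{XX'}}.
```

Raising r_{XX'} raises the ceiling; by itself it does nothing to raise r_{XY}, which is the point of the ceiling metaphor above.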
Criteria- vs. Norm-Based Tests
Without delving too deeply into the differences, criteria-based assessments are about measuring against some external standard, idea or construct, and norm-based assessments are about sorting and ranking test takers. There may be good uses for norm-based testing, and there certainly are important uses for criteria-based assessments. Rigorous Test Development assumes that tests are designed and built to assess test takers’ proficiencies with some sort of external construct (i.e., defined in a domain model). The classic example is an educational test designed to assess proficiency with state learning standards. Short quizzes often focus on a single skill, but even chapter tests usually cover a range of knowledge, skills and/or abilities (KSAs). Certainly, large scale assessment usually aims to assess a range of KSAs. This contrasts with the norm-based tests often used in psychological applications, which aim to measure just a single construct.
Unfortunately, the desire to rank and sort students often overwhelms the important use of getting more granular information about how proficient students are on a variety of learning standards or goals. This shows up in score reporting and in psychometric models and tools. We refer to this by the technical name, unidimensionality. That is, norm-based reporting and many psychometric tools reduce the complexity and validity of those domain models (e.g., the collection of state standards) to a single dimension, either to make the models simpler or to report how test takers rate relative to each other. We sometimes refer to unidimensionality as the original sin of educational measurement.
Item Difficulty
Psychometric practice often has items evaluated based on their empirical difficulty. This is not about their conceptual difficulty, how difficult they are to explain or teach, or how difficult they are to learn. Obviously, these things are beyond the scope of psychometrics. Rather, empirical difficulty is simply the proportion of test takers who responded to an item successfully.
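For concreteness, here is a minimal sketch of that calculation (the scored responses are invented purely for illustration): empirical difficulty, often reported as the item p-value, is just the proportion of correct responses, i.e., the column mean of a 0/1 response matrix.

```python
import numpy as np

# Scored responses: rows are test takers, columns are items
# (1 = correct, 0 = incorrect). Illustrative data only.
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

# Empirical difficulty (the p-value) is the proportion of test takers
# who answered each item correctly.
p_values = responses.mean(axis=0)
print(p_values)  # -> [0.6 0.4 0.6 1. ]
```

Note that a "hard" p-value of 0.4 says nothing about whether the concept is hard, only that 40 percent of this particular group answered correctly, which is the point of this section.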
Obviously, this is a product of the quality and quantity of the teaching and learning about the targeted cognition. It is also a product of the clarity of the item. It is a product of the complexity of the task, which may be conceived of as the diversity of knowledge, skills and abilities required to complete it successfully. It is a product of test takers’ familiarity with the task in the item, including its application and context. None of this is simply intrinsic to the item, as test takers’ prior experiences—including educational experiences—factor into all of them.
And yet, psychometric practices often prescribe an acceptable range of empirical item difficulty for an item. This is rarely appropriate. Rather, items should be as difficult as the concepts and contents of the standards they are aligned to. If teachers have taught a difficult concept well, and students have mastered it, then the item might appear empirically quite easy—which is not a problem. If teachers have taught a simple concept incredibly poorly, it might show up as an empirically difficult item. But all of those empirical results merely report on aggregated test taker performances on the items, not on their conceptual difficulty. So long as items elicit high quality evidence of the targeted cognition for the range of typical test takers, empirical difficulty should not matter. It certainly is no reason to exclude items.
Now, there are two valid uses of empirical difficulty. First, it can be used as a check by subject matter experts on their expert judgment about the difficulty and relative difficulty of items. If an item that they think—based upon their knowledge of the content, the construct and pedagogical practices—should show some particular level of empirical difficulty instead produces surprising field test or operational test results, those results may help them to spot issues in the item that they had not previously considered. To be sure, relative item difficulty—even relative empirical item difficulty—should vary based upon the nuance, context and complexity of the application of the targeted cognition. But whether there are issues there is something for subject matter experts to examine, not something for psychometric thresholds. Second, item difficulty statistics may be useful when building computer adaptive tests.
Item Discrimination
Perhaps the biggest psychometric obstacle to getting high quality items onto large scale assessments is the use of item discrimination statistics (e.g., point-biserial correlations) as a measure of item quality. Item discrimination is the question of how well an item differentiates lower scoring test takers from higher scoring test takers. That is, how well a single item’s results predict the aggregate results of the other items. This idea is built on the assumption of unidimensionality. Item discrimination assumes that there is a single scale that well describes the knowledge, skills and abilities that the test in question assesses.
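To make the mechanics concrete, here is a minimal sketch of the most common discrimination index, the corrected item-total (point-biserial) correlation. The data and the function name are invented for illustration; operational programs use more elaborate versions of the same idea.

```python
import numpy as np

# Scored responses (1 = correct, 0 = incorrect); rows are test takers,
# columns are items. Illustrative data only.
responses = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
])

def corrected_item_total(responses):
    """Point-biserial correlation of each item with the rest-score
    (total score excluding that item), the usual discrimination index."""
    n_items = responses.shape[1]
    totals = responses.sum(axis=1)
    stats = []
    for j in range(n_items):
        rest = totals - responses[:, j]  # exclude the studied item itself
        r = np.corrcoef(responses[:, j], rest)[0, 1]
        stats.append(r)
    return np.array(stats)

print(corrected_item_total(responses).round(2))
```

Items whose correlation with the rest-score falls below some program-specific threshold are typically flagged or dropped; that gatekeeping role is exactly what this section objects to.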
This assumption is almost invariably at odds with how the content area/construct being assessed is understood by those who know and appreciate it.
Even when scoring models are robust to some multi-dimensionality or are based upon multi-dimensional views, the use of item discrimination statistics acts as a gatekeeper that prevents items from being used on operational tests. It undermines construct and content validity by replacing the multi-dimensional structure of the targeted construct with a unidimensional view. The Standards for Educational and Psychological Testing offer that evidence based on the internal structure of a test may constitute evidence of the validity of a test(‘s use and purpose). Use of item discrimination statistics imposes a unidimensional structure on what is virtually always a multi-dimensional construct, when it comes to large scale standardized assessment. At best, it imposes an arbitrary and capricious composite of those many dimensions that is likely unexamined and certainly has not even been validated by careful review by true experts.
Use of item discrimination statistics should not be the standard psychometric practice that it is. Rather, it should be reserved only for those tests that aim to measure a single unidimensional construct—which rarely are found in large scale educational assessment.
DIF
Psychometrics has a fairly straightforward quantitative method to look for biased items. That is, to find items that show odd patterns of performance when test takers are grouped by identified demographic characteristics (e.g., gender, ethnicity). DIF (i.e., differential item functioning) looks for items that do not follow the general pattern of who does well and who does poorly on them within a particular subgroup. For example, overall high scoring Asian-American test takers might do significantly less well on a particular item than equivalently high scoring White test takers.
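For readers who have not seen the machinery, here is a minimal sketch of the Mantel-Haenszel approach that underlies much DIF screening. The function, its parameters, and the equal-width score binning are illustrative assumptions, not the procedure of any particular testing program: test takers are matched on total score, and within each score stratum the odds of success on the studied item are compared across the reference and focal groups.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group, n_strata=5):
    """Mantel-Haenszel common odds ratio for one studied item.

    item  : 0/1 scores on the studied item
    total : matching variable (here, total test score)
    group : 'ref' or 'focal' label for each test taker
    """
    item, total, group = np.asarray(item), np.asarray(total), np.asarray(group)

    # Match test takers on total score by binning into score strata
    # (equal-width bins here, purely for illustration).
    edges = np.linspace(total.min(), total.max(), n_strata + 1)[1:-1]
    strata = np.digitize(total, edges)

    num = den = 0.0
    for k in np.unique(strata):
        in_k = strata == k
        ref, foc = in_k & (group == "ref"), in_k & (group == "focal")
        a = item[ref].sum()          # reference group, correct
        b = (1 - item[ref]).sum()    # reference group, incorrect
        c = item[foc].sum()          # focal group, correct
        d = (1 - item[foc]).sum()    # focal group, incorrect
        n = in_k.sum()
        num += a * d / n
        den += b * c / n

    # An odds ratio near 1 means the item behaves similarly for both groups
    # once overall performance is matched; values far from 1 flag the item.
    return num / den if den > 0 else np.nan
```

In line with the flagging use described below, the statistic only points at items; deciding whether an item is actually unfair remains a qualitative, content-based judgment.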
DIF also rests on assumptions of unidimensionality. It would not be difficult to do DIF analysis in the context of a multi-dimensional construct in a way that respects that complexity, but it would rely on having a sufficient number of items for each dimension. By aggregating all the items together to get the overall pattern (i.e., including that simplified ranking of test takers), DIF can be applied with more precision and perhaps more statistical sensitivity—though with the distortions of imposing unidimensionality.
This truly leaves us with a quandary. We appreciate any tool that helps us to find inappropriate bias in test items. However, we are quite wary of the ways that the assumption of unidimensionality distorts the construct that is being measured. We are glad that in practice DIF is generally used as a flag that calls for further qualitative and content-based examination of items. DIF has proven important in the context of legal proceedings over the fairness of tests, but that context is a whole different approach with different values and lenses than RTD uses.
Which leaves RTD accepting DIF as a flag, but we fervently wish it were used in a way that better respects the complexity of the constructs being assessed.