Fairness is one of the cornerstones of the RTD Project. One of our Core Principles is that understanding the perspective of the range of typical test takers is the true heart of assessment development. While it may not be the most intellectualized part of content development work, it is definitely a vital organ. The RTD Mantra likewise specifies that valid items must work for the range of typical test takers. Our pillar practice of radical empathy is all about fairness. Fairness is not a marginal or late-in-the-process concern. For RTD-informed work, it is the true heart.
Alignment, validity and reliability mean nothing — are worth nothing — if assessments are not fair. And RTD explicitly addresses many aspects of fairness in the packets and explanations downloadable from the sidebar to the left.
Now, in the world of assessment, “fairness” is not simply a colloquial term. Fairness is often split into two (or three) different categories: bias, sensitivity and accessibility. All of them refer to ways in which some test takers are particularly advantaged or disadvantaged compared to others, even when they have the same level of proficiency in what the item or test is trying to measure. And while each of these categories focuses on a different aspect of fairness, in fact they often overlap. They are each a lens that focuses on a particular set of issues, and some issues lie in more than one set.
Bias
In the world of statistics, bias technically refers to consistent error in a particular direction. A measuring cup that was actually a bit smaller than listed would fit this idea. Error generally refers to the little random deviations from the actual value — that little amount that we slightly under- or over-fill a measuring cup. But bias is not random error; rather, it is consistent. Bias in this sense is not about values, goals or interests, as the colloquial term often is.
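The measuring cup metaphor can be made concrete with a toy simulation (all numbers are invented for illustration): a cup that actually holds 10 ml less than its label contributes a consistent bias, while sloppy hand-filling contributes random error that averages away.

```python
import random

random.seed(0)

LABELED_ML = 240       # what the cup claims to hold
TRUE_CUP_ML = 230      # what it actually holds: a consistent 10 ml bias
FILL_NOISE_SD = 5      # random under-/over-filling: error, not bias

# Simulate many fills of this mislabeled cup.
measurements = [TRUE_CUP_ML + random.gauss(0, FILL_NOISE_SD)
                for _ in range(10_000)]

# Averaging cancels the random error but leaves the consistent bias behind.
mean_deviation = sum(measurements) / len(measurements) - LABELED_ML
print(f"average deviation from label: {mean_deviation:.1f} ml")  # close to -10
```

The random deviations mostly cancel over many measurements; the systematic 10 ml shortfall does not, which is exactly the sense in which bias is consistent rather than random.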
Bias does not have to impact all test takers equally. In fact, that kind of bias is generally addressed by adjusting for difficulty or for content validity. In the world of assessment, bias usually refers to problems that consistently hit one identifiable group of test takers more (or less) than others. Asking test takers to spell “magnolia” might be biased in favor of southerners, and asking them about the texture and taste of snow might be biased in favor of northerners.
The most common identifiable groups are categories of gender (or sex), race/ethnicity and socio-economic status. We think that urbanity should also be in this group. These characteristics are often recorded, and so psychometricians can examine results from field testing and from operational testing to see whether any items show signs of bias — usually in DIF (differential item functioning) studies. This kind of psychometric bias review can only be done after the fact, after test takers have attempted the items. It is based on looking for issues across identifiable groups of test takers, and is usually the work of psychometricians — though when they find evidence of DIF, they run it by content development professionals (CDPs).
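DIF procedures vary, but one classic approach is the Mantel-Haenszel method: test takers are stratified by matched total score, and within each stratum the odds of answering the item correctly are compared between a reference group and a focal group. A rough sketch follows; the counts are entirely hypothetical, and operational analyses involve many more strata and significance tests.

```python
# Each stratum groups test takers with the same matched total score.
# Tuples are (ref_correct, ref_incorrect, focal_correct, focal_incorrect);
# all counts below are invented for illustration.
strata = [
    (30, 10, 20, 20),
    (40, 10, 30, 20),
]

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across score strata.

    Values near 1 suggest no DIF; values far from 1 suggest the item
    functions differently for the two groups at matched proficiency.
    """
    num = den = 0.0
    for a, b, c, d in strata:   # a, b = reference; c, d = focal
        t = a + b + c + d
        num += a * d / t
        den += b * c / t
    return num / den

alpha = mantel_haenszel_odds_ratio(strata)
print(f"MH odds ratio: {alpha:.2f}")  # above 1 here: item favors the reference group
```

The key idea is the stratification: because groups are compared only among test takers with the same overall score, a flagged item is suspect for reasons beyond a simple difference in proficiency between the groups.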
Of course, psychometricians do not have data on all the theoretically identifiable groups. Items may inappropriately disfavor test takers from different cultural backgrounds or with different kinds of upbringings, experiences or even educational experiences. That is, items may offer a shortcut around the targeted cognition for test takers with particular experiences, or be harder to make sense of for test takers who lack particular experience — all despite their actual proficiency with the targeted cognition. Therefore, review panels (see below) examine items while they are still in development to try to spot potential issues when they can still be fixed.
In the world of assessment, bias is about differences in performance across identifiable groups.
Sensitivity
Sensitivity issues arise when an item causes a distracting emotional reaction in test takers — a reaction strong enough to impact their performance. For example, a K-12 assessment is quite unlikely to have a reading passage about the death of a loved one because test takers who have recently lost a loved one (e.g., a grandparent) can easily be so upset by this reminder of this emotional event that they would be distracted from their best performance. Similarly, other deeply traumatic events that children might experience are also unlikely to appear on tests. Stories or problems should not address eating pork because some religious groups find such practices offensive.
Many organizations have sensitivity topic lists that are meant to prevent such topics from ever appearing on a test. We are skeptical about this approach. It is not that the lists themselves are problematic, but rather that they are sometimes treated as a formulaic solution to this potential problem. We believe that the mere mention of a topic does not necessarily create sufficient problems to absolutely bar its use. Topics can be handled less carefully or more carefully. The nature of the testing population matters. All of this requires careful, nuanced and professional judgment to determine whether an item is problematic, and then whether it can be fixed. An item is not automatically acceptable simply because it has steered clear of the listed topics, and an item is not automatically unacceptable simply because it touches on a topic that is on a list.
Among the various issues that sensitivity includes are negative depictions of particular groups or populations. We do not use the term stereotypical because not every stereotype is negative: men stereotypically have short hair, and depicting men with short hair is not problematic. In fact, many positive historical depictions can also be termed stereotypical. Yes, these stereotypes can become problematic when they are projected upon an entire group and individual members of that group are all expected to meet those expectations. But avoiding all stereotypes is not the answer to such problems. Again, informed, nuanced and professional judgment must be used to determine whether a negative depiction could cause a significantly distracting emotional reaction for some significant group of test takers.
Accessibility
Accessibility focuses on fairness for test takers who have particular relevant disabilities. In our view, accessibility is about ensuring that these test takers have a fair opportunity to demonstrate that they have the proficiencies being assessed. Perhaps most obviously, those with vision impairments might need enlarged text, and those with complete loss of vision need either braille text or audio presentation of the stimuli and items on a test.
These kinds of barriers — to accessing test content and to supplying responses to it — are the unique concern of accessibility, which is sometimes as focused on the form and platform of test delivery as it is on the contents of individual items.
Accessibility issues can also address the content of test items, though. Test takers with disabilities can face specific concerns — both strong emotional responses to material and differences in exposure to or experience of elements of items — that are not about the actual targeted cognition. There is also often a lot of work that goes into adapting items written for a general audience of test takers for use by test takers with disabilities. For example, illustrations or diagrams might need so-called alt text written to describe them for test takers with visual impairments.
Fairness Review Panels
Fairness issues are so important to assessment that the item development process includes review panels to examine items for such issues. These panels are made up of outside experts of various sorts who collectively can bring expertise, experience and understanding of perspectives that go beyond the experience and expertise of the CDPs who have been working on the items. Those CDPs remain primarily responsible for spotting and correcting fairness issues, but these panels are invaluable for pointing out potential issues that might have been missed, and thereby helping the CDPs further broaden their own knowledge of different sorts of test takers. Thus, these panels should point out issues and perhaps suggest solutions, but it is up to the CDPs to weigh and consider the various suggestions and select the most appropriate course of action.
Review panels bring that diversity of experience, perspective, identity and understanding to this work both through their own views and direct experience and through their understanding of others. For example, teachers with deep experience working with a particular population may have insight into fairness issues that may impact that group, even though they themselves are not a member of that group. More generally, everyone who serves on these panels should consider the entire range of typical test takers, and not focus exclusively on just a narrow subset.
Thresholds
RTD says that fairness problems arise when a significant group of test takers' chances of responding successfully to an item are significantly impacted by the issue. But that begs the question (in the new-school sense) of what those thresholds of significance are. Unfortunately, we cannot offer an easy answer to that question.
There is no way to prevent any test taker from ever encountering content on a test that elicits any emotional reaction. It is easy to imagine a negative version of every human experience — particularly if one considers the possibility of a parent recently yelling at a child somehow in connection with such an experience. Furthermore, literature is usually rooted in some kind of drama or tension — all of which could evoke a negative association for some test takers. Even nursery rhymes offer examples: Jack broke his crown while fetching water, and Humpty Dumpty evokes dropping eggs in the kitchen.
Ideally, every test would give every test taker absolutely equal opportunities to demonstrate their skills and proficiencies. But that would require a level of customization that is not possible — because people simply vary too much and in too many ways. Instead, each assessment project needs to determine how it is going to set those thresholds.
Culturally Responsive Assessment
In recent years, a new term has appeared: culturally responsive assessment. Obviously, this term follows from the older concept of culturally responsive pedagogy. In our view, most of the concerns that are under the umbrella of culturally responsive assessment are fairness concerns. This term appears to have arisen from an effort to raise the salience of these sorts of issues — an effort that RTD entirely supports. When fairness is viewed too technically and/or is only viewed through the lens of a narrow slice of the population, its moral urgency can be lost.
We are less sure about the demands under the umbrella of culturally responsive assessment that challenge the very constructs and proficiencies being assessed. That is not to say that those constructs and proficiencies are always selected well, but rather that those decisions are usually made upstream from assessment development. That is, questions about what should be assessed follow from decisions that have already been made about what should be taught. If those earlier decisions are being challenged, it is not for assessment professionals to decide them. Assessment simply has too low a profile and lacks the democratic legitimacy to make such decisions.