RTD’s Assessment Glossary

Accessibility: Set of fairness concerns and approaches that focus on removing as many barriers to performance as possible for the broadest range of students. Considerations include UDL (universal design for learning) and specific disability accommodations, among other approaches.

Additional KSAs: The knowledge, skills and/or abilities that a test taker makes use of when reading, making sense of and responding to an item, other than the specific cognition targeted by the item. (An ECD term.)

Alignment: The quality of two different considerations or elements lining up or pointing to the same result. For example, if a test is aligned with instruction, then the test measures the same thing – or construct – as was taught. Alignment may be considered on a broad level or through fine grained examination. 

Alternative Task: Cognition, cognitive processes and cognitive steps that a test taker engages in instead of the intended task, upon reading an item. An alternative task might or might not rely on the targeted cognition, and might not rely on it appropriately. (An RTD term.)

Anchor Item: An item that appears on multiple tests or multiple forms of a test to aid the linking of scores across those tests or forms.

Answer Option: The set of possible answers that test takers are given to select from with multiple choice items or other selected response items. These include both the distractors and the key.

Artifact: Any product of any process or task within ECD. Lists, charts and tables are artifacts. Written explanations are artifacts. Student work is an artifact. Anything that one can read or examine is an artifact. (An ECD term.)

Assembly Model: The requirements for assembling the assessment. This includes requirements/targets for distributions of various item features (content, format, time, passage types, etc.). Similar to a test blueprint. (An ECD term.)

Assessment Target: A form of construct definition that is written specifically to guide the development of individual standardized test items. Assessment targets often acknowledge limitations imposed by the standardized testing environment on the test’s ability to examine the full breadth and depth of a particular curricular standard.

Assessment: An instrument designed to systematically gather equivalent and comparable information about a large number of students and their performance levels on a set of constructs, to be used for a variety of purposes. An abbreviation of standardized assessment or standardized test. Usually a reference to large scale standardized assessments.

Assessment Triangle: The idea -- often depicted as a literal triangle -- that instruction must be geared towards (“aligned with”) learning objectives (e.g., state curricular standards) and that assessment must be aligned with both of them. That is, teachers teach towards the standards and the assessments report on student performance on those same standards, and therefore assessments report on what teachers have taught students.

 
 

Authentic/Authenticity: Authenticity refers to the quality of closely resembling what students might encounter in “the real world,” or even being like the work they do in their classrooms or in their learning activities. This is an aspirational attribute of an assessment or an element of assessment, in that there is no definitive goal, but rather something that should usually be strived for whenever possible. Authenticity contributes to facial validity and other forms of validity, as well.

Automaticity: The quality of fluid and less conscious application of knowledge, skills and/or abilities. Higher proficiency often leads to greater automaticity, and lower proficiency often leads to a more deliberate approach, instead. This is the central thrust of both Webb’s Depth of Knowledge and Revised Depth of Knowledge.

Balanced Assessment: A system or set of assessments that provides the variety of constituencies of assessment reports (i.e., teachers, supervisors, districts, policy-makers, parents, and students) with high quality and valid information that suits the purposes and needs of each constituency. Note that this cannot be supplied with a single assessment, regardless of the different analyses and reports that are based upon it.

Bias: Technically, bias refers to consistent error in a particular direction -- as opposed to random error. In the context of fairness, it refers to the concern (or fact) that an identifiable subgroup of students performs worse on a test (or individual item) for reasons other than their mastery of the targeted construct. This is often measurable or identifiable after the fact, though – for a variety of reasons – by then it is too late. (Note that this differs from the colloquial use of the term biased, in that the colloquial term often connotes some intent or purpose that is served by the bias.)

CCSS: The Common Core State Standards, including the CCSS-L (i.e., Common Core State Standards for Literacy) and the CCSS-M (i.e., Common Core State Standards for Mathematics).

CDP: See Content Development Professional.

Characteristic Task Feature: A required element, feature or quality of a task, laid out in a task model. These are the traits or qualities of a task that are thought most necessary to elicit evidence of the targeted skill/focal KSA. (An ECD term.)

Charge: What an item asks a test taker to do to produce a work product – be it selection of a response that answers a question or the construction of a response. The more common meaning of the term task. (An RTD term.)

Checklist: Short for Content Validity Checklist, an old RTD procedure that scaffolds a review of an item towards the goal of considering the cognitive paths that test takers might take upon reading the item. This procedure is appropriate for review panels, whereas serious CDP refinement of items relies on deeper and more sustained item review. (An RTD term.)

Claim: A particular statement about the test taker’s knowledge, skill or ability that a test aims to assess or support. Claims can be very fine-grained, but do not need to be. However, claims should be quite specific as to the precise cognition or KSA(s) that they address. A single assessment should support a wide and diverse range of claims. (An ECD term.)

Clang (Association): Repetition of a key word or phrase from a stimulus or stem of an item in an answer option that inappropriately draws attention to that answer option. This may either serve to clue that option as the key (i.e., positive clang) or to make that distractor attractive for reasons other than misunderstanding or misapplication of the targeted cognition (i.e., negative clang). May also occur when all but one of the answer options contains this word or phrase repetition.

Clash: Wearing a red shirt with pink pants. When the registers of language of different parts of an item do not match, for reasons other than assessing the targeted cognition.

Classroom Assessment: Non-standardized, small scale assessment taken by students in connection to classes in which they are enrolled. Classroom assessments may be created by teachers themselves, by textbook publishers, or otherwise found by teachers, schools or districts. Contrast with Standardized Assessment.

Client: The owner of an assessment, who prompts the development of the assessment and is responsible for ensuring the test is administered, scored and reported upon. The client selects and hires the test development vendor, and monitors its work. Clients are often state education departments or professional organizations.

Closed Stem: The actual question or final prompt in an item, which must take the syntactic form of a question or a command. That is, it must end with a question mark or a period. Contrasted with open stem.

Cloze Item: A selected response item type that replaces each blank in a fill-in-the-blank item with a set of answer options to choose from.

Clueing: When elements of an item or its presentation tip the identity of the key inappropriately. Generally refers to intra-item clueing (i.e., elements of an item tip the identity of its own key), though it may also refer to inter-item clueing.

Cognitive Complexity: Not to be confused with difficulty, the complexity of the thinking and problem solving that goes into understanding and responding to an item. First, note that some very difficult tasks are not complex, per se, but rather are just difficult. Memorizing the first 100 digits of pi is not more complex than memorizing the first 5 digits, though it is more difficult. Second, note that some response modes are inherently more complex than others. For example, generating on-demand writing (e.g., an essay) is a more cognitively complex task than selecting a conclusion about an author’s intent from a list of four options. Third, note that more cognitively complex tasks are not necessarily more difficult than less cognitively complex tasks. For example, writing a short essay may be much easier than memorizing the first 100 digits of pi.

Cognitive Path: The series of cognitive steps that a test taker takes upon reading an item as they work towards their final response. They may be more or less conscious of various steps. (An RTD term.)

Cognitive Process Dimensions: Originally, a 6-level model of cognitive complexity that is part of Revised Bloom’s Taxonomy (RBT). DLM has extended it downward, with four additional levels to ensure it includes the levels appropriate for students with significant cognitive disabilities. Its levels are i) pre-intentional (DLM), ii) attend (DLM), iii) respond (DLM), iv) replicate (DLM), v) remember, vi) understand, vii) apply, viii) analyze, ix) evaluate, x) create.

Commissioning Phase: The first of five phases in the life of a test. This phase includes the initial recognition of the need for a test, business plans, and the high level and highest leverage decisions about the nature of the assessment. It also includes the hiring of some key personnel. While tests might return to the commissioning phase, this is very unlikely. (An RTD term.)

Confidence: The rigorous practice of maintaining awareness of the degree of one’s expertise and the nature and limited degree of expertise of others. Confidence is critical to collaboration, to supporting the learning of others and one’s organization and to producing high quality work products. Confidence can be seen in those who know when to speak and to answer others’ questions. Half of one of the Pillar Practices of Rigorous Test Development work. (An RTD term.)

Commissioned Passage: A reading passage that was written for the purpose of being used in an assessment, and therefore does not require licensing. Contrast with licensed passage or permissioned passage.

Conceptual Assessment Framework (CAF): The whole set of models that make up the flow of ideas, relationships and argument in an ECD-based assessment. Everything. (An ECD term.)

Common Core State Standards: A set of grade-by-grade K-12 standards for literacy and for mathematics that was developed by the Council of Chief State School Officers (i.e., an organization of all the state superintendents of education and equivalent officials from across the entire United States) and released in 2010. The federal government was not involved in authoring the Common Core State Standards, though it later encouraged states to adopt them. These standards have been adopted by many states, though sometimes renamed and/or in a slightly modified form.

Competency-Based Assessment: An assessment designed to collect evidence regarding a student’s ability to engage in carefully designed competencies, which are often integrated skills with relatively clear authentic applications. The goal of this class of assessments is often to provide evidence for examination (e.g., by educators, by parents), rather than merely reporting on or summarizing it.

Construct Definition: The operationalization of a particular construct into clearly defined language that educators in various roles can make use of. Importantly, construct definitions prevent confusion and/or unsurfaced disagreement about the meaning of a construct. Note that different forms of construct definitions are often suitable only for particular audiences or purposes. Assessment development can depend on assessment targets (i.e., a particular form of construct definition that is distinct from curricular standards), which often include stimulus requirements, unpacked standards, archetypal examples and disambiguation among similar standards. (See also assessment target.)

Construct Irrelevance: Any factor other than the targeted content or KSA that may influence a student’s performance on an item or test. While items often have to include construct irrelevant material (e.g., written instructions for math items, arithmetic needed for many science items), one must be careful to minimize its inclusion to the extent feasible. Construct irrelevant influences are a threat to validity and are the main concern of fairness.

Construct Isolation: The idea that individual test items should each measure a small set of particular KSAs or standards. Construct isolation is important because it makes it easier to understand what to make of a test taker successfully or unsuccessfully responding to an item. Construct isolation can undermine authenticity and can thereby limit the validity of inferences that are made based upon test takers’ performances. Thus, construct isolation is apparently both entirely necessary and quite limiting.

Construct Validity: One type of validity. The degree to which the evidence elicited by a test supports valuable inferences about student mastery of a broad construct (e.g., 8th grade science). (Contrast with content validity, facial validity and predictive validity.)

Construct: The idea that is being measured or assessed (e.g., 2-digit multiplication, evaluating quality of an argument in a text). Constructs may be considered very broadly (e.g., 4th grade math), or with more precision and finer grain size (e.g., individual state curricular standards). Constructs must be defined and clarified so that they may be taught and so that they may be assessed. 

Constructed Response Item: An item that asks a question or directs the test taker to some task or charge, requiring the test taker to develop and offer their own response. Contrast with selected response item.

Consequence Phase: The final phase of five phases in the life of a test. This phase includes various things that are outside the control of test developers because they all occur as downstream consequences of an assessment. This includes the uses and purposes to which a test is put, and the consequences that follow, as well. (An RTD term.)

Construct Definition: An explanation of what a test or test item is intended to assess. It may vary in scope (e.g., the entire domain or a single standard). It may also vary in explicitness and detail. Individual standards are a type of construct definition, and even they demonstrate how construct definitions can vary in scope and in detail.

Content Development Professional: Any professional test developer who focuses on the content of test items and is in a position to edit or refine the items. Content development professionals have primary responsibility for item validity. Content development professionals often require content expertise, assessment expertise and knowledge of the range of typical takers of an assessment, but usually lack expertise in psychometrics. Contrast with psychometricians.

Content Development Professional: A test developer responsible for the contents of assessment items. CDPs own the process and work of improving items’ alignment with their standards and their item validity. CDPs must have knowledge of the content area, of assessment and item development, and of the range of takers of the assessment in question. Also known as an item specialist, a content specialist and other organization-specific terms. (An RTD term.)

Content Distribution: See Coverage Map.

Content Mistake: When an item includes or is based on a misunderstanding or misapplication of some element of the content domain. The mistake need not be with the intended standard or the targeted cognition. It could take the form of a miskeyed item, or could result in deeper problems with the item.

Content Specialist: One of many industry terms for content development professionals.

Content Specifications: A set of detailed construct definitions for use in item development. Content specifications unpack curricular standards from their technical language to explain the nature of the standard, boundaries on the standard and delineations from related standards. They may include descriptions of student work that reflect the KSAs embedded in the standard (i.e., evidence statements). Content specifications often include requirements for stimuli to be able to provide students opportunities to demonstrate the targeted KSAs. (Contrast with style guide.)

Content Validity Committee: An external review committee that reviews test items for their alignment with their purportedly aligned standards – before the items are used in operational tests. Content validity committees are primarily filled with educators with experience with the students who will be faced with the items on tests.

Content Validity: One type of validity. The degree to which the content on a test fully and proportionately represents the targeted construct. Ideally, every curricular standard is included in every test, and no standard or topic takes up a disproportionate amount of the test. In practice, this is an aspirational goal to be worked towards, given testing constraints (e.g., time). (Contrast with construct validity, facial validity and predictive validity.)

Coverage Map: The component of a test blueprint that lays out the desired breakdown of test points (or items) by curricular standard and/or reporting category. This breakdown is a target to be approached as nearly as feasible, as (for various reasons) it cannot always be met perfectly. Also referred to as content distribution.

CR: See Constructed Response Item.

Criterion-Referenced Assessment: Primarily, a form of reporting on student performance that focuses on the student’s performance relative to the content being assessed. For example, it might report that a student has advanced skills in one area, but only basic skills in another. However, this depends on the test being designed and developed to support this kind of reporting. (Usually contrasted with norm-referenced assessment.)

Depth of Knowledge: Norman Webb’s 4-level model designed to describe cognitive complexity in the full range of tasks available within a classroom context. The levels are i) recall and reproduction, ii) skills and concepts, iii) short-term strategic thinking and iv) extended thinking. DOK is the most frequently used model of cognitive complexity in the assessment field, though often in a modified form. Note that DOK is not a measure of the number of steps involved in solving a problem. Further note that even DOK’s third level (i.e., short-term strategic thinking) – if accurately applied – is difficult to reach on a standardized test. DOK’s fourth level (i.e., extended thinking) requires the application of all three of the previous levels over time, and therefore is almost invariably out of scope of any assessment administered within any (brief) fixed time period.

Design Pattern: The template for the fully laid out assessment argument for the claims made by assessments, including reasoning, KSAs, evidence descriptions and task model. (An ECD term.)

Design Phase: The second phase of five phases in the life of a test. This includes domain and construct definition, blueprint design, score report design, task model development and most of the issues addressed in Evidence Centered Design. It stops short of actual item development. (An RTD term.)

Diagnostic Assessment: Formal assessments that are intended to provide fine grained reporting on individual students regarding their status on a (potentially large) variety of knowledge, skills and abilities (i.e., KSA). Diagnostic assessments, by their very nature, require a much higher ratio of items or test length to the amount of material/KSAs. 

Difficulty: The empirical observation of the share of test takers that respond to an item successfully. Item difficulty is not merely a function of the conceptual difficulty of the underlying content, as it is strongly influenced by instruction and other experiences of test takers. That is, item difficulty can easily change as curriculum and instructional practices change.
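Because classical item difficulty is simply the proportion of test takers who respond to an item successfully, it can be computed directly from scored response data. Below is a minimal sketch in Python; the response matrix and its values are invented for illustration and are not drawn from any real assessment.

```python
# A hypothetical scored-response matrix: rows are test takers, columns are items,
# 1 = successful response, 0 = unsuccessful response.
import numpy as np

responses = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
])

# Classical difficulty (the "p-value"): the proportion of test takers
# responding successfully to each item.
difficulty = responses.mean(axis=0)
print(difficulty)  # [0.75 0.75 0.75]
```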

Discrimination: The psychometric idea that an individual item should garner successful responses from test takers above a given level and unsuccessful responses from test takers below that level. Therefore, a highly discriminating item does a good job of sorting test takers into these two groups, as opposed to implying that some test takers belong in the wrong group. Item discrimination is not related to fairness or any bias issue, despite the lay association with the term.
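One common classical way to quantify discrimination is to correlate each item’s 0/1 score with test takers’ total score on the rest of the test. The sketch below illustrates that corrected item-total correlation on an invented response matrix; operational programs rely on whatever indices and models their psychometricians specify.

```python
# A hypothetical scored-response matrix: rows are test takers, columns are items.
import numpy as np

responses = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

def item_discrimination(responses: np.ndarray, item: int) -> float:
    """Corrected item-total correlation: item score vs. total score on the other items."""
    item_scores = responses[:, item]
    rest_total = responses.sum(axis=1) - item_scores
    return float(np.corrcoef(item_scores, rest_total)[0, 1])

print([round(item_discrimination(responses, i), 2) for i in range(responses.shape[1])])
```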

Distractor: The incorrect answer options presented to test takers in multiple choice and other selected response items.

DOK: Depth of Knowledge, usually referring to Webb’s Depth of Knowledge typology.

Domain: The area or broad construct that the test is intended to assess knowledge and/or mastery of. (An ECD term.)

Domain Analysis: The large task of attempting to build lists and explanations of all of the knowledge, skills and abilities (KSAs) that make up the domain. This includes KSAs that may not appear on the test. This work is often best done with subject matter experts (SME) whose work is guided and facilitated by test development professionals who understand the ECD process. Domain analysis depends on expert judgment, but it also depends on examination of documentation and explanations from various sources. They could include curricular guides, syllabi, published works on the content domain, prior assessments and much, much more. There is no formula or set approach to conducting a domain analysis. Note that many assessments are commissioned for domains that already have domain models (or partial domain models), and therefore do not require a domain analysis. (An ECD term.)

Domain Model: The large task of organizing the KSAs that were found in the Domain Analysis. This consists of mapping the connections and links between them. Again, this work often relies on SMEs and expert ECD facilitators. Again, the domain model may well include things that will not (or cannot) be included on the eventual assessment. Again, there is no formula or set approach for creating a domain model or for what a domain model should look like. Note that many assessments are commissioned for domains that already have domain models (or partial domain models), such as state content standards. (An ECD term.)

Double Keyed: When a selected response item accidentally has two correct answer options.

Drag and Drop: A type of technology enhanced item in which test takers must drag elements on the screen to particular areas on the screen, as though placing physical objects in the correct positions.

Dragger: A component of drag and drop items, the element(s) that test takers must drag to the correct drop bay. Draggers may be single use or refillable.

Drop Bay: A component of drag and drop items, a location that draggers may be moved to. Items can include multiple drop bays. Drop bays may be single use or may receive multiple draggers.

Drop Down Item: Another term for cloze item.

ECD: See Evidence Centered Design.

End-of-Course Assessment: Formal assessments given on a schedule – at the end of a course – regardless of other considerations. A class of summative assessments. (Often contrasted with through-course assessment.)

Essay: Another term for On-Demand Writing.

Evidence: That which can be directly observed, as opposed to inferred. Test takers’ work products and elements of these products are evidence, because they can be directly observed. Aspects of student cognition are not evidence, because they must be inferred (i.e., we cannot actually get inside test takers’ heads and know exactly what is going on in there). (An ECD term.)

Evidence Centered Design: The dominant framework for thinking about different aspects and elements of test development. Evidence Centered Design collected and organized existing practice and ideas, put them together and offered a consistent terminology for understanding how they fit together. It focuses primarily on test design and secondarily on other aspects of assessment, but treats item development as a black box. Its original developers included Robert Mislevy, Russell Almond and Janice Lukas, though many others have contributed to the project.

Evidence Model: The qualities of test takers’ work products that are needed to support individual claims. (An ECD term.)

Evidence Statement: A description of a quality of student work that supports a particular claim. (An ECD term.)

Facial Validity: One type of validity. The important sense that something seems right, often made without deep investigation or careful consideration. For some individuals and constituencies, a failure at this step is sufficient to condemn a project as invalid. For some purposes, facial validity is sufficient. For other constituencies and/or purposes, facial validity is but a first step. (Contrast with content validity, construct validity and predictive validity.)

Fairness Committee: An external review committee that reviews test items for potential bias – before the items are used in operational tests. Fairness Committees are primarily filled with educators and/or others with particular expertise regarding students who will be faced with the items on tests.

Fairness: The broad topic or set of concerns that the widest range of students have equal opportunity to perform on an assessment. This is an umbrella term for a wide variety of concerns that point to different ways that issues outside of the targeted construct/KSAs can influence student performance, thereby introducing construct-irrelevant threats to validity. It includes matters of bias, accessibility and sensitivity, among others. Test development practices include a number of ways to look for and address these kinds of concerns. (See specifically: bias, accessibility and sensitivity.)

Fill-in-the-Blank Item: A constructed response item type that requires test takers to offer some number of words, phrases, symbols, numbers or other references that can be inserted into the indicated location(s) in a sentence or sentences so that they are accurate.

Focal KSA: The primary knowledge, skill and/or ability targeted by a task model (or claim or item). (An ECD term.)

Form: A set of items that acts as a complete test for test takers. Many assessments have a single form taken by all test takers, while others have multiple equivalent forms. Multiple forms may be used to guard against cheating and/or as tests are given across many administrations. Forms require linking to report scores in a standardized fashion.

Formative Assessment: Assessments designed and intended to provide feedback to a teacher regarding the success of the instruction/learning process, generally used to determine whether students need further instruction or have obtained sufficient mastery for instruction to move on to other topics. (Often contrasted with summative assessment.)

Hot Spot: A technology enhanced item type that presents some visual (e.g., a map or diagram) to test takers and requires them to select their response by pointing to, highlighting, clicking on or otherwise indicating a particular portion of the visual.

Humility: The rigorous practice of maintaining awareness of the limits of one’s expertise, the nature and degree of expertise of others, and a healthy uncertainty about conclusions. Humility is critical to collaboration, to learning from others and to producing high quality work products. Humility can be seen in those who know when to listen and to ask questions. Half of one of the Pillar Practices of Rigorous Test Development work. (An RTD term.)

IAE: See Item Alignment Examination.

Instructionally Embedded Assessment: Small assessment tasks (or sets of tasks) that are embedded as part of the activities of a lesson or an instructional plan when that lesson/plan is designed. A particular class of formative assessment (i.e., used to inform teachers as to the need for further instruction).

Instructionally Relevant: The quality of an assessment being aligned with instruction. That is, an assessment that collects and reports on evidence of student performance on material on which they received instruction. This has relevance both to opportunities-to-learn and to the appropriateness of using the results of the assessment to evaluate teachers. Note that shortcomings in this trait have different meanings, depending on whether it is the instruction or the assessment that is better aligned with the learning objectives.

Instructionally Sensitive: An attribute of an assessment on which the quality of the instruction that the student receives is reflected in the student’s performance on the test. This is a subtle and contentious issue, as while it may seem intuitive that of course better teaching leads to greater student achievement, the evidence for this is often lacking. Likely, this is because this evidence is not gathered -- as it may be prohibitively difficult or costly to do so. However, in fairness to those who raise the issue, our schools and our education policy are often based on the assumption that our assessments are instructionally sensitive, and if that assumption is incorrect, there are enormous and problematic implications. Note that instructional sensitivity does not mean that a test is useful or informative for instruction (see formative assessment).

Intended Task: The method by which content development professionals intend or anticipate test takers will respond to an item. The intended task incorporates how test takers are expected to understand the item and the broad outline of what they are expected to do. (An RTD term.)

Inter-Item Clueing: When elements of an item (e.g., presentation or contents) inappropriately tip the identity of the key of another item in the same shared stimulus set or test form. Distinct from simple clueing.

Interaction Type: A classification system for test items that describes how the software system stores and describes traits of the item related to its item type, often related to data storage needs. While there is a strong relationship between an item’s interaction type and its item type, they are not merely parallel sets of terms. For example, fill-in-the-blank, short answer and essay are three different item types, but the software stores them all as the same interaction type (i.e., text entry). (Contrast with item type.)

Item: The building block of an assessment. It is the smallest complete unit that a test taker responds to, what a lay person might refer to as a test question. However, an item includes the question, any directions, supplied answer options and/or stimulus. In the most common usage by CDPs, items do not include any shared stimulus – though this is, strictly speaking, incorrect.

Item Alignment Examination: A full framework for the involved process of reviewing an item for the multiple cognitive paths that the range of test takers might take when responding to it. This process is a rigorous approach to determining whether an item is properly aligned, and for whom. (An RTD term.)

Item Alignment: The quality of an item measuring the knowledge, skills and/or abilities that it is supposed to measure.

Item Alignment: The quality of an item’s alignment with a particular assessment target, curricular standard or KSA. Item alignment is determined by typical students’ cognitive processes as they go through the work of responding to the item. Surface features (e.g., wording of the stem) or inclusion of the KSA are not sufficient to ensure item alignment, as other skills or KSAs may either provide stumbling blocks or be the key that unlocks the correct answer for the test taker. Because of this, item alignment cannot be determined or recognized merely by reading the item, and instead requires thinking through the item as a typical student might.

Item Hygiene: Traits, qualities or elements of an item that may undermine alignment or item validity and that can be flagged without considering test takers’ cognitive paths and/or without deep knowledge of the content area. Item hygiene includes mistakes in language conventions, formatting and style. (An RTD term.)

Item Logic: See Logic of an Item.

Item Refinement: The rigorous practice of editing items to improve their item validity. Item refinement usually follows item review, though sometimes it is done simultaneously with item review. Item refinement often benefits from reflecting on and reconciling what is learned in item review, especially when item review is done by multiple individuals. One of the Pillar Practices of Rigorous Test Development work. (An RTD term.)

Item Review: The rigorous practice of examining an item to determine how test takers might respond to it and which — if any — elements of the item might lead them astray from the item’s intended task. The goals of item review are generally to determine the degree of alignment between the item and its purported standard and/or what elements of the item are undermining that alignment. One of the Pillar Practices of Rigorous Test Development work. (An RTD term.)

Item Set: A small set of items that are presented together to test takers, but that does not constitute a complete test. All of the items in an item set are usually based upon a common stimulus.

Item Skeleton: A less restrictive model for an item that provides guidance for item writers and/or CDPs when producing items. Item skeletons may include item logic and/or explanations for how an item for a particular standard or assessment target might work, while giving the item writer and/or CDP more room for creative and/or divergent work. Contrast with item template.

Item Specifications: The portion of a style guide that is particular to standardized test style guides. That is, it does not include the guidelines regarding preferred language use or conventions. Item specifications include elements such as boilerplate language, rules for how many answer options each multiple choice item should have and other test-specific issues. Note that item specifications, unlike content specifications, do not address content or individual curricular standards.

Item Template: A very restrictive model for an item that has specific words, phrases or other elements that may be adjusted or changed so that the template can be used to produce a larger number of very very similar items. Contrast with item skeleton.

Item Type: A classification system for test items that describes the mode of engagement with the test content by the student. Item types include multiple choice, drag-and-drop, fill-in-the-blank, essay, hot spot, among others. (Contrast with interaction type.)

Item Validity: The quality of an item eliciting evidence of the targeted cognition for the range of typical test takers. Closely related to alignment, but with emphasis on the fairness implications of variation across the range of typical test takers. (An RTD term.)

Item Writer: Someone who develops initial drafts for test items. Item writers require content expertise, though they often lack assessment expertise and/or knowledge of the range of typical takers of an assessment. Item writers are not responsible for editing or refining items, with their most valuable contributions coming in the creative ideas they have for items and their subject matter expertise. The term item writer is frequently misapplied to content development professionals, with some organizations mistakenly using it as the job title for content development professionals.

Item Specialist: One of many industry terms for content development professionals.

Key: The correct answer option(s) presented to test takers in multiple choice and other selected response items.

Key KSAs: The knowledge, skills and/or abilities that successful test takers use when responding to an item and unsuccessful test takers do not use. (An RTD term.)

KSA: Literally, “knowledge, skills and/or abilities.” It refers to the individual constructs that items attempt to assess student mastery of. It is intended to be a vague enough term to be useful regardless of the nature of the targeted constructs.

Licensed Passage: A reading passage that was not written for the purpose of being used in an assessment, and therefore (due to copyright law) requires licensing to secure the legal right to include it as part of the assessment. Contrast with commissioned passage or public domain passage.

Linking: The psychometric practice of connecting the scores on one assessment to another assessment. This can be done across forms, across years of administration of an assessment, across different levels of an assessment (i.e., vertical linking) or even across different assessments (e.g., to compare SAT scores to ACT scores).
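As a toy illustration of the simplest case, the sketch below places scores from one hypothetical form onto the scale of another by matching means and standard deviations (mean-sigma linear linking). The score values are invented, and operational linking designs (e.g., those using anchor items or IRT models) are considerably more involved.

```python
# Hypothetical summary scores from two forms of the same assessment.
import numpy as np

form_a_scores = np.array([12, 15, 18, 20, 22], dtype=float)  # reference form
form_b_scores = np.array([10, 14, 16, 19, 21], dtype=float)  # form to be linked

# Mean-sigma linear linking: place Form B scores on the Form A scale by
# matching the two score distributions' means and standard deviations.
slope = form_a_scores.std(ddof=1) / form_b_scores.std(ddof=1)
intercept = form_a_scores.mean() - slope * form_b_scores.mean()

linked_b = slope * form_b_scores + intercept
print(linked_b.round(1))
```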

Logic of an Item: The reasoning for how test takers’ cognitive path through an item will produce positive and negative evidence of their level of proficiency with the targeted cognition. Also internal logic of an item.

Machinery of an Item: The complex interplay of the various elements of an item – down to the level of individual words – that creates a delicate context and meaning for test takers and communicates to them what is expected of them. A metaphor based on the intricacies of a mechanical watch. (An RTD term.)

Matrix Grid: A selected response item type that is similar to a multiple choice item. Matrix grid items present a grid (i.e., multiple columns and multiple rows) of examples, ideas, categories, traits or claims. Test takers indicate which cells in the grid (i.e., an intersection of a row and a column) are true, are associated with each other or otherwise fit the criteria explained in the directions to the item. Individual rows and/or columns can be limited to one selection or may allow for multiple selections. Matrix grid items are sometimes worth more points than a single multiple choice item, and sometimes test takers can receive partial credit for their responses.

MC-MS: See Multiple Select Item.

MC: See Multiple Choice Item.

Measurement Model: Instructions that explain how test takers’ work products are turned into scores (e.g., rubrics and other scoring guides) and how scores on individual items or tasks are statistically combined into the final reported score(s) or results. (An ECD term.)

Metacognition: The act of thinking about thinking. Metacognition includes planning later work and reviewing the thinking behind earlier work. It also includes monitoring one’s thinking and work for mistakes.

MG: See Matrix Grid Item.

MS: See Multiple Select Item.

Multiple Choice: The most common item type on large scale assessments, a selected response item type. Multiple choice items include a small set of answer options, from which test takers must select the correct one (i.e., the key). The primary advantage of multiple choice items is the efficiency with which they can be scored. They also are often faster to complete than the same question presented in a constructed response format. Many question this item type’s ability to elicit high quality evidence of a variety of KSAs and standards.

Multiple Select: A selected response item type that is similar to a multiple choice item. Multiple select items have multiple correct answers among their answer options, and test takers are expected to correctly select all of them. Multiple select items usually include more answer options than are typically seen with multiple choice items. Multiple select items are sometimes worth more points than a single multiple choice item, and sometimes test takers can receive partial credit for their responses.
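Partial-credit rules for multiple select items vary by program and are ultimately defined in the assessment’s measurement model. The sketch below shows one hypothetical rule purely for illustration: credit for each correct selection, a penalty for each incorrect selection, floored at zero.

```python
# A hypothetical partial-credit rule for a multiple select item.
# `selected` is the set of options the test taker chose; `keys` is the set of correct options.
def partial_credit(selected: set, keys: set, max_points: int = 2) -> float:
    hits = len(selected & keys)          # correct options selected
    false_alarms = len(selected - keys)  # incorrect options selected
    raw = (hits - false_alarms) / len(keys)
    return round(max(0.0, raw) * max_points, 2)

print(partial_credit({"A", "C"}, {"A", "C", "E"}))  # 1.33 of 2 points
print(partial_credit({"A", "B"}, {"A", "C", "E"}))  # 0.0 of 2 points
```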

Next Generation Science Standards: A set of grade-by-grade K-12 standards for science (and engineering) that was released in 2013. These standards were developed under the auspices of Achieve (i.e., the education policy arm of the National Governors Association) based upon the National Research Council’s Framework for K–12 Science Education. The Next Generation Science Standards have been adopted by roughly half of the states and apply to more than half of US public school students. The National Science Teachers Association (NSTA) was also involved in the development of these standards.

NextGen Standards: The Common Core State Standards and the Next Generation Science Standards. NextGen standards were developed after the testing regime of the No Child Left Behind Act, were intended to emphasize important content that was absent from those tests, and have been broadly adopted by states. Political, social and cultural controversy or disagreements have prevented the development of a set of NextGen standards in social studies.

NextGen Test: Any assessment based upon NextGen standards.

NGSS: See Next Generation Science Standards.

Norm-Referenced Assessment: Primarily, a form of reporting on student performance that focuses on the student’s performance relative to a comparison group (e.g., past or predicted performance on this test by the entire population of test takers, or even this cohort’s performance). For example, it might report that a student performed at the 63rd percentile overall or on a particular skill. However, this depends on the test being designed and developed to support this kind of reporting. (Usually contrasted with criterion referenced assessment).

On-Demand Writing: A constructed response item that requires test takers to respond to a question or prompt with a series of sentences, usually in the form of a series of paragraphs. This work must be begun and completed in a single session, often taking no longer than 20-60 minutes.

Open Stem: The actual question or final prompt in an item that does not take the form of a question or statement ending with a mark of terminal punctuation (e.g., a question mark or period). Instead, it is an incomplete sentence whose final portion must be either selected or constructed by the test taker. Contrasted with closed stem.

Operational Phase: The fourth of five phases in the life of a test. This phase follows all the work of developing a test and includes delivery, administration, scoring and reporting on test results. (An RTD term.)

Original Sin: The act and the history of conflating reliability and validity. Usually takes the form of simply using methods and metrics for reliability but calling it "validity" or “accuracy.” This stands in violation of explanations of validity and reliability in statistics and/or measurement texts and in the very structure of the Standards for Educational and Psychological Testing.

Other Cognition: Additional cognition that a test taker engages in when responding to an item, almost invariably distracting from and impairing their ability to engage in whatever task they have taken on. (An RTD term.)

Paper and Pencil: The traditional form in which assessments are delivered and administered, including physical printed materials and some manner in which to capture test taker responses. Paper and pencil tests may be taken with a pen, in some cases. Though the form is called paper and pencil, test developers are usually not responsible for including the writing implement.

PARCC: Partnership for Assessment of Readiness for College and Careers, a consortium of state departments of education that was funded by the US federal Race to the Top program to develop items and tests based upon the Common Core State Standards.

PARCC’s Cognitive Complexity Measures: A pair of multi-dimensional models (for ELA/L and mathematics, respectively) of cognitive complexity, each of which can be summarized into a single Low/Medium/High summary rating. PARCC’s model was designed specifically for large scale standardized assessments. Its ELA/L model includes command of textual evidence, response mode, processing demand and text complexity as factors. Its mathematics model is made up of content complexity, practices complexity (prompting, integration, modeling, argument) and process complexity (stimulus material, response mode, linguistic demand, processing steps). One key advantage of PARCC’s model is that it acknowledges that response mode (e.g., selected response vs. constructed response vs. performance task, etc.) is a strong driver of the cognitive complexity of a task. That is, the PARCC model relies on considering the task (i.e., the work and cognitive processes in which a typical student engages as s/he works from the stem through to his/her answer), rather than just looking at the question or prompt in isolation. Although this model has a route to a single cognitive complexity rating for each item, it does highlight the difficulty of reducing such a multi-dimensional idea to such a simple score.

Passage Set: A small set of items that are presented together to test takers, but that does not constitute a complete test. All of the items in a passage set are usually based upon a common reading passage.

Performance Assessment: A type of assessment in which students produce work to be evaluated, standing in contrast with multiple-choice, fill-in-the-blank and other highly constrained forms of answers. Common examples include student writing samples (e.g., essays) and providing answers that include showing their work to mathematics problems. Depending on the task, this may be scored by an expert rater or by computer algorithm. Note that there may be disagreements (even among experts) as to whether a particular task is open-ended enough to qualify as a performance assessment.

Performance Level Descriptors: A set of profiles of the collections of skills that a typical student at each designated level of mastery should possess. Note that in practice, real students may possess some skills of higher performance levels before mastering all the skills of lower performance levels.

Permissioned Passage: Another term for Licensed Passage.

Pillar Practices: The five basic sets of skills that CDPs use in their item development work: radical empathy, unpacking standards, item review, item refinement, and balancing confidence and humility. (An RTD term.)

Placement Test: An assessment given before instruction to determine whether a student needs an intervention (e.g., assignment to an ESL class) or which assignment (e.g., reading group) is appropriate for the student. (Also known as a screener.)

Platform: See Test Delivery Platform.

Predictive Validity: One type of validity. The ability of a test to accurately predict a particular outcome. Unfortunately, predictive validity is often taken for granted because evidence can be expensive and time consuming to collect. It is important to be careful when discussing or thinking about predictive validity. For example, the SATs famously are shown to predict freshman year grades, but the College Board makes no claims about their ability to predict any other outcomes. (Contrast with content validity, construct validity, and facial validity.)

Presentation Material: Anything that is part of what test takers encounter when engaging with an item (e.g., stimulus, instructions, stem, etc.). (An ECD term.)

Presentation Model: Instructions or requirements for how items and other presentation materials are presented to test takers. They may include many elements of style guides, particular to the form and platform of delivery. (An ECD term.)

Production Phase: The third of the five phases in the life of a test. This phase includes the development of items, field testing, form construction, rangefinding and perhaps standard setting. It is called production because it is focused on the development of a high number of highly refined and developed goods — like automobile production. (An RTD term.)

Psychometrician: An expert in psychometrics or someone whose work is largely comprised of using psychometrics. Psychometricians may also be expert in other aspects of large scale test development and operational matters around large scale assessment, but usually lack any expertise in content development. Contrast with content development professional.

Psychometrics: A set of statistical and other quantitative techniques, methods and practices in the fields of educational and psychological measurement.

Public Domain Passage: A reading passage that was not written for the purpose of being used in an assessment, whose copyright has expired and therefore does not require licensing to secure the legal right to include it as part of the assessment. Contrast with commissioned passage or licensed passage.

QDR: See Qualitative Distractor Review.

Qualitative Distractor Review: The process of examining the incorrect answer options of a multiple choice item to identify the cognitive paths that might lead to their selection, and thereby the misunderstanding and/or misapplication of the targeted cognition or other cognition. QDR is vital to identifying what inferences might be appropriately made based upon unsuccessful responses from test takers. (An RTD term.)

Radical Empathy: The rigorous practice of thinking through a cognitive path in response to an item through the perspective of someone other than oneself, repeatedly. One of the Pillar Practices of Rigorous Test Development work. (An RTD term.)

Rangefinding: A process and set of formal meetings at which expert educators examine authentic student work samples elicited by the test in order to refine judgments about what constitutes work at a variety of score points. This process helps to identify exemplar work that may be used by teachers, test scorers and/or students to better understand differences between different levels of mastery. 

rDOK: See Revised Depth of Knowledge. (An RTD term.)

Reading Passage: A text or excerpt of a text that is presented as part of an item set or passage set. Test takers are expected to read the passage before responding to items that are based on the contents of the reading passage.

Reliability: A quantifiable measure of the consistency and reproducibility of various aspects of a test. Because it is quantifiable, it is more easily reported than validity, but it should never be mistaken for accuracy or a colloquial definition of “reliability.” Rather, it should be considered a necessary prerequisite for validity. Tests cannot be valid without being reliable, but being reliable does not ensure that they are valid. 
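One widely used internal-consistency estimate of reliability is Cronbach’s alpha. The sketch below computes it from a small, invented 0/1 response matrix purely for illustration; operational reliability analyses are the province of psychometricians and typically involve much more than this single statistic.

```python
# A hypothetical scored-response matrix: rows are test takers, columns are items.
import numpy as np

responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return float((k / (k - 1)) * (1 - item_variances.sum() / total_variance))

print(round(cronbach_alpha(responses), 2))  # 0.79 for this invented data
```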

Response Mode: The manner in which test takers respond to an item (e.g., selected response, constructed response, performance task). Response mode is a strong driver of the cognitive complexity of a task. (See PARCC’s Cognitive Complexity Measures.)

Revised Depth of Knowledge: A 4-level model of cognitive complexity based on Norman Webb’s Depth of Knowledge. Revised Depth of Knowledge preserves Webb’s four levels and focuses even more closely on automaticity vs. deliberation as the construct’s central thrust. It differs from Webb’s DOK in that it recognizes that different test takers may respond to items with paths of different cognitive complexities and therefore – because there is no singular typical test taker – that a single item can have multiple rDOK levels. (An RTD term.)

RIF: See Rigorous Item Feedback.

Rigorous Item Feedback: A rigorous protocol for recording and/or communicating content and/or cognition issues with an item. This protocol includes identifying i) who (i.e., the subgroup of test takers that might have the problem), ii) where (i.e., the point or element that prompts the problem), iii) how (i.e., how the test taker’s cognitive path will be disrupted) and iv) which (i.e., which knowledge, skills and/or abilities are implicated). (An RTD term.)

Rigorous Test Development: An item-centric and validity-focused approach to test development that is built upon a set of rigorous approaches, practices and procedures for developing items that better elicit evidence of the targeted cognition for the range of typical test takers. Rigorous Test Development was inspired by Evidence Centered Design, and picks up where it stops short of item development.

RTD: See Rigorous Test Development.

SBAC: See SmarterBalanced (assessment consortium).

Scenario Set: A small set of items that are presented together to test takers, but that does not constitute a complete test. All of the items in a scenario set are usually based upon a common stimulus. Usually used for science item sets, which may intentionally scaffold a path towards more complex KSAs or tasks.

Score Report: The form in which scores are reported to test takers and other interested parties. Score reports often include information about the assessment, how to interpret the scores and sometimes a profile of how the entire pool of test takers performed on the assessment.

Scoring: The complex set of procedures and tasks required to assign scores to the work of test takers, including both automated scoring and human scoring of test taker responses. Part of the operational phase of an assessment.

Screener: A more casual term for a placement test.

Selected Response Item: An item that presents test takers with a set of options to select from as their response. Contrast with constructed response item.

Sensitivity: The fairness concern or fact that an item, elements of an item or a stimulus may present particular barriers to some identifiable subgroups of students. This may stem from differences in familiarity with a topic and/or differences in emotional responses to a topic, among other reasons. While lists of topics or sensitivity concerns can be useful when considering sensitivity, adequately addressing sensitivity requires a particular mindset and approach that depends on empathy, collaboration and trust.

Shoehorn: To stretch the meaning of part of the domain model (e.g., a standard or assessment target) inappropriately to force an item alignment.

Short Answer: A constructed response item type that requires test takers to answer a question or prompt with a brief response. Short answer responses are usually expected to be a number, word, phrase or single sentence, though they can be a bit longer.

SmarterBalanced: A consortium of state departments of education that was funded by the Race to the Top program to develop items and tests based upon the Common Core State Standards.

SME: See Subject Matter Expert.

SR: See Selected Response Item.

Standard: One element of a domain model or definition. The unit to which items are generally meant to align — though standards often include a broader range of KSAs (knowledge, skills and/or abilities) or applications than a single item can assess. Standards are often written in a short/pithy form that requires unpacking to identify the included KSAs and applications.

Standard Setting: A process and set of formal meetings at which expert educators and other stakeholders determine the cut scores that divide test performance into designated performance levels. (See also performance level descriptors and rangefinding.)

Standardized Test: See Assessment. Contrast with Classroom Assessment.

Stem: The actual question or final prompt of an item, as opposed to more general directions, the stimulus or the answer options.

Stimulus: The part of an item that offers information upon which the question or prompt is based. Stimuli may contain reading passages, charts, graphs, diagrams, maps, images, videos, sound clips and/or other forms of information.

Student Model: The psychometric model for an assessment that represents test takers’ proficiency. In some cases, it may be a single unidimensional conception of mastery. It may be a multi-dimensional set of traits. It could be a linked set of masteries that depend on and feed into each other. Historically, standardized tests have often been built upon a unidimensional conception, but many tests have more complex representations. (An ECD term.)

Style Guide: Guidelines that describe preferred standard presentation of items to ensure a consistent experience for students across items and a test. The style guide is aimed at ensuring that assessments provide the clearest communication to students of what is expected of them. Style guides also take into account subjective preferences regarding language use, for a variety of reasons. Elements of a style guide include:

·      Any boilerplate language that should be used within items (e.g., instructions for common tasks).

·      Rules for formatting and laying out items and stimuli.

·      Grammatical and other language use preferences (e.g., allowances for prepositions at the end of sentences, “email” vs. “e-mail”).

·      Use of emphasis words within item stems or instructions (e.g., whether or not to instruct students to select the best answer option).

            (Contrast with content specifications.)

Subject Matter Expert: Someone who may lack assessment expertise and/or knowledge of the range of test takers of an assessment, but who has expertise in the content that the test assesses. CDPs for professional licensure tests are often not SMEs, whereas CDPs for K-12 assessments usually are SMEs.

Summative Assessment: Formal assessments given at the end of a period of time or set of instruction (e.g., a unit, a course). Summative assessments are intended to provide a report or summary of students’ proficiency with the content on the test. (Often contrasted with formative assessment.)

Targeted Cognition: The KSAs (knowledge, skills and/or abilities) that an item is intended to elicit evidence of. This is often a subset or facet of a standard, though it may be an entire standard or even cross multiple standards. (An RTD term.)

Task: In RTD, the cognitive work or path that a test taker engages in when responding to an item. Differs from the charge. (An RTD term.) Elsewhere in assessment, the charge and the task are both referred to as the task.

Task Model: A formal explanation of a particular kind or class of task or item. Because task models describe features and variables in this class of tasks or items, they are not specific enough to serve as item templates – though they can be used to generate item templates. They may offer explanations of required features to assess a KSA and/or appropriate ways to vary items. They describe what should be presented to test takers and perhaps the products that test takers will produce. There is no set format for a task model. (Note that the ECD meaning of task is not quite the same as the RTD meaning of the term, and that RTD task models explain far more about the targeted cognition and how to assess it than do ECD task models.) (An ECD and an RTD term.)

Task Model Variable: Elements of a task model that describe features of tasks that may intentionally vary from item to item (e.g., to alter difficulty). (An ECD term.)

Technology Enhanced Item: Any non-traditional presentation of an item that would not be possible with a conventional printed assessment. TEIs may be functionally or cognitively quite similar to a traditional selected response item, in spite of the different presentation or mode of interaction.

TEI: See Technology Enhanced Item.

Test Administration: The complex set of procedures and tasks required for a large number of test takers to actually take an assessment. Part of the operational phase of an assessment.

Test Delivery Platform: The client and server software that enables the administration of computer based tests. Test delivery platforms establish and limit the modes of interaction of test takers with items, item types, aspects of test delivery and sometimes even scoring and score reporting. Contrast with paper and pencil.

Test Development Vendor: An organization that specializes in developing large scale standardized assessments, employing both content development professionals and other test developers. Test development vendors generally develop tests under contract to clients, though they also may have prepared tests or items that they license to clients.

Test: Another term for assessment.

Testlet: A small set of items that are presented together to test takers, but that does not constitute a complete test.

Typical Test Taker: A fantastical, though commonly accepted, idea. RTD acknowledges that there is no one typical test taker or typical type of test taker. Rather, any given assessment will typically be taken by a range of test takers, who will vary by identity, background, experience, ability, preparation and current state (e.g., sleepiness, hunger, mood, stress level). The fact that a test taker may be closer to the median in any number of these dimensions does not make them any more typical than the kinds of test takers who are further from those medians but nonetheless still take the assessment regularly. (An RTD term.)

Variable Task Feature: Another term for task model variable. (An ECD term.)

Vertical Scaling: The psychometric practice of reporting test taker performance across many levels on a single scale. For example, reporting math performance on the same scale for the entirety of the elementary school years, such that every student is expected to move upwards through the scale every year. Vertical scales require linking in order to avoid making unsupported implications, as is seen in NAEP’s unlinked vertical scale.
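As a hedged illustration (not drawn from the glossary itself), adjacent grade levels are often linked with a simple linear transformation of the underlying proficiency scale:

$$\theta^{*} = A\,\theta + B$$

where $\theta$ is a score on the lower grade's scale, $\theta^{*}$ is its position on the common vertical scale, and the linking constants $A$ and $B$ are estimated from anchor items or common test takers (for example, via the mean/sigma method, in which $A$ is the ratio of the anchor items' difficulty standard deviations on the two scales and $B$ is the difference of their means after adjusting by $A$).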

Test Blueprint: A structured and highly detailed description of the requirements for building a test form. It explicitly addresses a wide variety of issues, including (but not limited to) the following (a rough, hypothetical sketch appears after the list):

·      The point value of items per form.

·      The distribution of item types.

·      The distribution of items per reporting category and/or curricular standard.

·      Layout/sequencing requirements for item clusters vs. discrete items.

·      Layout/sequencing requirements for operational blocks vs. field test blocks. 

·      Genre and passage length distribution requirements (for ELA tests).

·      Testing time requirements.

            (Contrast with test design and test specifications.)
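The sketch below is purely illustrative and not drawn from any actual blueprint: the field names (total_points, reporting_categories and so on) are invented, and real blueprints are far more detailed and program specific. It only shows that a blueprint's requirements amount to structured data that can be checked mechanically during form assembly.

```python
# Hypothetical, minimal sketch of a test blueprint captured as structured data.
# All field names and numbers are invented for illustration only.
blueprint = {
    "total_points": 54,
    "item_type_distribution": {          # counts of items by type
        "selected_response": 36,
        "constructed_response": 4,
        "technology_enhanced": 6,
    },
    "reporting_categories": {            # items and points per reporting category
        "Key Ideas and Details": {"items": 18, "points": 20},
        "Craft and Structure": {"items": 14, "points": 16},
        "Integration of Knowledge": {"items": 14, "points": 18},
    },
    "passages": {"literary": 2, "informational": 3},
    "field_test_blocks": 1,
    "testing_time_minutes": 90,
}

def points_reconcile(bp):
    """Check that reporting-category points sum to the form's total point value."""
    return sum(rc["points"] for rc in bp["reporting_categories"].values()) == bp["total_points"]

def item_counts_reconcile(bp):
    """Check that item counts by type match item counts by reporting category."""
    by_type = sum(bp["item_type_distribution"].values())
    by_category = sum(rc["items"] for rc in bp["reporting_categories"].values())
    return by_type == by_category

assert points_reconcile(blueprint) and item_counts_reconcile(blueprint)
```

The point of the sketch is simply that a blueprint's requirements are concrete and checkable, not aspirational.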

Test Design: The results of high-level (as opposed to fine-grained) decision making about the nature and operations of a test. Considerations are as varied as the purpose of a test, adaptive vs. linear design, targeted constructs, and performance assessment vs. traditional assessment, among others. Test design considerations should be guided by the intended use of the test. (Contrast with test blueprint and test specifications.)

Test Specifications: Another term for Test Blueprint.

Test Taker: The person taking the cognitive (e.g., educational, professional certification, psychological) test.

Through-Course Assessment: A series of formal assessments given at different points during a course, either on a preset schedule or when local educators determine that the required portion of the course has been completed. A class of summative assessments. (Often contrasted with end of course assessment.)

Unpacking (a Standard): The rigorous practice of thinking about a standard or assessment target to identify the KSAs (knowledge, skills and/or abilities) and applications that a standard may include. This may be recorded in a Task Model or elsewhere, and it may be done on the fly by CDPs. One of the Pillar Practices of Rigorous Test Development work. (An RTD term.)

Validity: At the most theoretical level, the accuracy of the inferences that are made based on test result data. More practically, the issue of whether a test (or item) provides valid evidence upon which such inferences and judgments may be made. This is the ultimate purpose of any test, but its evaluation relies on a great deal of expert/professional judgment. One may consider a test as being “on target,” so long as one also considers that it needs to be reliably on target. There are many specific types of validity, each of which highlights different aspects of this large and important set of issues. (See specifically: content validity, construct validity, facial validity and predictive validity.)

wDOK: Norman Webb’s Depth of Knowledge. (An RTD term.)

Work Product: Whatever the test taker produces in response to an item, be it the selection of an answer option or some constructed response. (An ECD and RTD term.)

Year-End Assessment: Formal assessments given on a schedule – at the end of the school year – regardless of other considerations. A class of summative assessments.