Chapter 3 Test Fairness | 2018-19 Summative Technical Report

3.1 Introduction

The Smarter Balanced Assessment Consortium (Smarter Balanced) has designed the assessment system to provide all eligible students with a fair test and equitable opportunity to participate in the assessment. Ensuring test fairness is a fundamental part of validity, starting with test design. It is an important feature built into each step of the test development process, such as item writing, test administration, and scoring. The 2014 Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014, p. 49) state, “The term fairness has no single technical meaning, and is used in many ways in public discourse.” The Standards also suggest that fairness to all individuals in the intended population is an overriding and fundamental validity concern. As indicated in the Standards (p. 63), “The central idea of fairness in testing is to identify and remove construct-irrelevant barriers to maximal performance for any examinee.”

The Smarter Balanced system is designed to provide a valid, reliable, and fair measure of student achievement based on the Common Core State Standards¹ (CCSS). The validity and fairness of student achievement measures are influenced by a multitude of factors; central among them are:

a clear definition of the construct—the knowledge, skills, and abilities—intended to be measured;
the development of items and tasks that are explicitly designed to assess the construct that is the target of measurement;
delivery of items and tasks that enable students to demonstrate their achievement on the construct; and
the capturing and scoring of responses to those items and tasks.

Smarter Balanced uses several processes to address reliability, validity, and fairness. The fairness construct is defined in the CCSS. The CCSS are a set of high-quality academic standards in mathematics and English language arts/literacy (ELA/literacy) that outline what a student should know and be able to do at the end of each grade. The standards were created to ensure that all students graduate from high school with the skills and knowledge necessary for post-secondary success. The CCSS were developed during a state-led effort launched in 2009 by state leaders. These leaders included governors and state commissioners of education from 48 states, two territories, and the District of Columbia, through their membership in the National Governors Association Center for Best Practices (NGA Center) and the Council of Chief State School Officers (CCSSO).

The CCSS have been adopted by all members of the Smarter Balanced Consortium. The Smarter Balanced content specifications (Smarter Balanced, 2017a,b) define the knowledge, skills, and abilities to be assessed and their relationship to the CCSS. In doing so, these documents describe the major constructs—identified as “claims”—within ELA/literacy and mathematics for which evidence of student achievement is gathered and that form the basis for reporting student performance.

Each claim in the Smarter Balanced content specifications is accompanied by a set of assessment targets that provide more detail about the range of content and depth of knowledge levels. The targets serve as the building blocks of test blueprints. Much of the evidence presented in this chapter pertains to fairness to students during the testing process and to design elements and procedures that serve to minimize measurement bias (i.e., Differential Item Functioning, or DIF). Fairness in item and test design processes and the design of accessibility supports (i.e., universal tools, designated supports, and accommodations) in content development are also addressed.

3.2 Definitions for Validity, Bias, Sensitivity, and Fairness

Some key concepts for the ensuing discussion concern validity, bias, sensitivity, and fairness and are described as follows.

3.2.1 Validity

Validity is the extent to which the inferences and actions based on test scores are appropriate and backed by evidence (Messick, 1989). It constitutes the central notion underlying the development, administration, and scoring of a test, as well as the uses and interpretations of test scores. Validation is the process of accumulating evidence to support each proposed score interpretation or use. Evidence in support of validity is extensively discussed in Chapter 1.

3.2.2 Bias

According to the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014), bias is “construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers and consequently affect the reliability/precision and validity of interpretations and uses of test scores” (p. 216).

3.2.3 Sensitivity

“Sensitivity” refers to an awareness of the need to avoid explicit bias in assessment. In common usage, reviews of tests for bias and sensitivity help ensure that test items and stimuli are fair for various groups of test takers (AERA, APA, & NCME, 2014, p. 64).

3.2.4 Fairness

The goal of fairness in assessment is to assure that test materials are as free as possible from unnecessary barriers to the success of diverse groups of students. Smarter Balanced developed the Bias and Sensitivity Guidelines (Smarter Balanced, 2012c) to help ensure that the assessments are fair for all groups of test takers, despite differences in characteristics that include, but are not limited to, disability status, ethnic group, gender, regional background, native language, race, religion, sexual orientation, and socioeconomic status. Unnecessary barriers can be reduced by:

measuring only knowledge or skills that are relevant to the intended construct;
not angering, offending, upsetting, or otherwise distracting test takers; and
treating all groups of people with appropriate respect in test materials.

These rules help ensure that the test content is fair for test takers and acceptable to the many stakeholders and constituent groups within Smarter Balanced member organizations. The more typical view is that bias and sensitivity guidelines apply primarily to the review of test items. However, fairness must be considered in all phases of test development and use.

3.3 Bias and Sensitivity Guidelines

Smarter Balanced strongly relied on the Bias and Sensitivity Guidelines in the development of the Smarter Balanced assessments, particularly in item writing and review. Items must comply with these guidelines in order to be included in the Smarter Balanced assessments. Use of the guidelines will help the Smarter Balanced assessments comply with Chapter 3, Standard 3.2 of the Standards for Educational and Psychological Testing. Standard 3.2 states that “test developers are responsible for developing tests that measure the intended construct and for minimizing the potential for tests being affected by construct-irrelevant characteristics such as linguistic, communicative, cognitive, cultural, physical or other characteristics” (AERA, APA, & NCME, 2014, p. 64).

Smarter Balanced assessments were developed using the principles of evidence-centered design (ECD). ECD requires a chain of evidence-based reasoning that links test performance to the claims made about test takers. Fair assessments are essential to the implementation of ECD. If test items are not fair, then the evidence they provide means different things for different groups of students. Under those circumstances, the claims cannot be equally supported for all test takers, which is a threat to validity. As part of the validation process, all items are reviewed for bias and sensitivity using the Bias and Sensitivity Guidelines prior to being presented to students. This helps ensure that item responses reflect only knowledge of the intended content domain, are free of offensive or distracting material, and portray all groups in a respectful manner. When the guidelines are followed, item responses provide evidence that supports assessment claims.

3.3.1 Item Development

Smarter Balanced has established item development practices that maximize access for all students, including, but not limited to, English Learners (ELs), students with disabilities, and ELs with disabilities. Three resources (the Smarter Balanced Item and Task Specifications (Smarter Balanced, 2015c), the General Accessibility Guidelines (Smarter Balanced, 2012a), and the Bias and Sensitivity Guidelines (Smarter Balanced, 2012c)) are used to guide the development of items and tasks to ensure that they accurately measure the targeted constructs. Recognizing the diverse characteristics and needs of students who participate in the Smarter Balanced assessments, the states worked together through the Smarter Balanced Test Administration and Student Access Work Group to incorporate research and practical lessons learned through universal design, accessibility tools, and accommodations (Thompson, Johnstone, & Thurlow, 2002).

A fundamental goal is to design an assessment that is accessible for all students, regardless of English language proficiency, disability, or other individual circumstances. The intent is to ensure that the following steps are achieved for Smarter Balanced.

Design and develop items and tasks to ensure that all students have access to the items and tasks. In addition, deliver items, tasks, and the collection of student responses in a way that maximizes validity for each student.
Adopt the conceptual model embodied in the Accessibility and Accommodations Framework (Smarter Balanced, 2016c) that describes accessibility resources of digitally delivered items/tasks and acknowledges the need for some adult-monitored accommodations. The model also characterizes accessibility resources as a continuum ranging from those available to all students, through those that are implemented under adult supervision only, to those for students with a documented need.
Implement the use of an individualized and systematic needs profile for students, or Individual Student Assessment Accessibility Profile (ISAAP), that promotes the provision of appropriate access and tools for each student. Smarter Balanced created an ISAAP process that helps education teams systematically select the most appropriate accessibility resources for each student and the ISAAP tool, which helps teams note the accessibility resources chosen.

Prior to any item development and item review, Smarter Balanced staff trains item writers and reviewers on the General Accessibility Guidelines (Smarter Balanced, 2012a) and Bias and Sensitivity Guidelines (Smarter Balanced, 2012c). As part of item review, individuals with expertise in accessibility, bias, and sensitivity review each item and compare it against a checklist for accessibility and bias and sensitivity. Items must pass each criterion on both checklists to be eligible for field testing. By relying on universal design to develop the items and requiring that individuals with expertise in bias, sensitivity, and accessibility review the items throughout the iterative process of development, Smarter Balanced ensures that the items are appropriate for a wide range of students.

3.3.2 Guidelines for General Accessibility

In addition to implementing the principles of universal design during item development, Smarter Balanced meets the needs of English Learners (ELs) by addressing language aspects during development, as described in the Guidelines for Accessibility for English Language Learners (Smarter Balanced Assessment Consortium, 2012b). ELs have not yet acquired proficiency in English. The use of language that is not fully accessible can be regarded as a source of invalidity that affects the resulting test score interpretations by introducing construct-irrelevant variance. Although there are many validity issues related to the assessment of ELs, the main threat to validity when assessing content knowledge stems from language factors that are not relevant to the construct of interest. The goal of these EL guidelines was to minimize factors that are thought to contribute to such construct-irrelevant variance. Adherence to these guidelines helped ensure that, to the greatest extent possible, the Smarter Balanced assessments administered to ELs measure the intended targets. The EL guidelines were intended primarily to inform Smarter Balanced assessment developers or other educational practitioners, including content specialists and testing coordinators.

In educational assessments, there is an important distinction between content-related language that is the target of instruction and language that is not content related. For example, words with specific technical meaning, such as “slope” when used in algebra or “population” when used in biology, should be used to assess content knowledge for all students. In contrast, greater caution should be exercised when including words that are not directly related to the domain. ELs may have had cultural and social experiences that differ from those of other students. Caution should be exercised in assuming that ELs have the same degree of familiarity with concepts or objects occurring in situational contexts. The recommendation was to use contexts or objects based on classroom or school experiences, rather than ones that are based outside of school. For example, in constructing mathematics items, it is preferable to use common school objects, such as books and pencils, rather than objects in the home, such as kitchen appliances, to reduce the potential for construct-irrelevant variance associated with a test item. When the construct of interest includes a language component, the decisions regarding the proper use of language become more nuanced. If the construct assessed is the ability to explain a mathematical concept, then the decisions depend on how the construct is defined. If the construct includes the use of specific language skills, such as the ability to explain a concept in an innovative context, then it is appropriate to assess these skills. In ELA/literacy, there is greater uncertainty as to item development approaches that faithfully reflect the construct while avoiding language inaccessible for ELs.

The decisions of what best constitutes an item can rely on the content standards, the definition of the construct, and the interpretation of the claims and assessment targets. For example, if the skill to be assessed involves interpreting meanings in a literary text, then the use of original source materials is acceptable. However, the test item itself—as distinct from the passage or stimulus—should be written so that the task presented to a student is clearly defined using accessible language. Since ELs taking Smarter Balanced content assessments likely have a range of English proficiency skills, it is also important to consider the accessibility needs across the entire spectrum of proficiency. Since ELs, by definition, have not attained complete proficiency in English, the major consideration in developing items is ensuring that the language used is as accessible as possible. The use of accessible language does not guarantee that construct-irrelevant variance will be eliminated, but it is the best strategy for helping ensure valid scores for ELs and for other students as well.

Using clear and accessible language is a key strategy that minimizes construct-irrelevant variance in items. Language that is part of the construct being measured should not be simplified. For non-content-specific text, the language of presentation should be as clear and simple as possible. The following guidelines for the use of accessible language were proposed as guidance in the development of test items. This guidance is intended to work in concert with other principles of good item construction. From the EL Guidelines (Smarter Balanced Assessment Consortium, 2012b), some general principles for the use of accessible language were proposed as follows.

Design test directions to maximize clarity and minimize the potential for confusion.
Use vocabulary widely accessible to all students, and avoid unfamiliar vocabulary not directly related to the construct (August, Carlo, Dressler, & Snow, 2005; Bailey, Huang, Shin, Farnsworth, & Butler, 2007).
Avoid the use of syntax or vocabulary that is above the test’s target grade level (Borgioli, 2008). The test item should be written at a vocabulary level no higher than the target grade level, and preferably at a slightly lower grade level, to ensure that all students understand the task presented (Young, 2008).
Keep sentence structures as simple as possible while expressing the intended meaning. In general, ELs find a series of simpler, shorter sentences to be more accessible than longer, more complex sentences (Pitoniak, Young, Martiniello, King, Buteux, & Ginsburgh, 2009).
Consider the impact of cognates (words with a common etymological origin) and false cognates (word pairs or phrases that appear to have the same meaning in two or more languages, but do not) when developing items. Spanish and English share many cognates, and because the large majority of ELs speak Spanish as their first language (nationally, more than 75%), the presence of cognates can inadvertently confuse students and alter the skills being assessed by an item. Examples of false cognates include: billion (the correct Spanish term is mil millones; not billón, which means trillion); deception (engaño; not decepción, which means disappointment); large (grande; not largo, which means long); library (biblioteca; not librería, which means bookstore).
Do not use cultural references or idiomatic expressions (such as “being on the ball”) that are not equally familiar to all students (Bernhardt, 2005).
Avoid sentence structures that may be confusing or difficult to follow, such as the use of passive voice or sentences with multiple clauses (Abedi & Lord, 2001; Forster & Olbrei, 1973; Schachter, 1983).
Do not use syntax that may be confusing or ambiguous, such as using negation or double negatives in constructing test items (Abedi, 2006; Cummins, Kintsch, Reusser, & Weimer, 1988).
Minimize the use of low-frequency, long, or morphologically complex words and long sentences (Abedi, 2006; Abedi, Lord, & Plummer, 1995).
Teachers can use multiple semiotic representations to convey meaning to students in their classrooms. Assessment developers should also consider ways to create questions using multi-semiotic methods so that students can better understand what is being asked (Kopriva, 2010). This might include greater use of graphical, schematic, or other visual representations to supplement information provided in written form.
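
Some of the surface-level principles above (long sentences, long or complex words) lend themselves to a rough automated pre-screen before human review. The following Python sketch is illustrative only: the function name and both thresholds are assumptions made for this example and do not come from the EL Guidelines, and such a check could at most supplement, not replace, expert review.

```python
import re

# Both thresholds are arbitrary illustrative assumptions,
# not values taken from the EL Guidelines.
MAX_WORDS_PER_SENTENCE = 20
MAX_WORD_LENGTH = 12


def flag_item_text(text: str) -> list[str]:
    """Return reviewer-facing flags for long sentences and long words."""
    flags = []
    # Naive sentence split on terminal punctuation; adequate for a sketch.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        if len(words) > MAX_WORDS_PER_SENTENCE:
            flags.append(f"long sentence ({len(words)} words)")
        for word in words:
            if len(word) > MAX_WORD_LENGTH:
                flags.append(f"long word: {word}")
    return flags
```

A flagged word may still be legitimate if it is part of the construct being measured (for example, a technical term), so any such output would need to be interpreted by a content specialist.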

3.4 Test Delivery

In addition to focusing on accessibility, bias, and sensitivity during item development, Smarter Balanced also maximizes accessibility through test delivery. Smarter Balanced works with members to maintain the original conceptual framework (Smarter Balanced Assessment Consortium, 2016c) that continues to serve as the basis underlying the usability, accessibility, and accommodations guidelines (Figure 3.1). This figure portrays several aspects of the Smarter Balanced assessment resources: universal tools (available for all students), designated supports (available when indicated by an adult or team), and accommodations (as documented in an Individualized Education Program (IEP) or 504 plan). It also displays the additive and sequentially inclusive nature of these three aspects.

Universal tools are available to all students, including those receiving designated supports and those receiving accommodations.
Designated supports are available only to students who have been identified as needing these resources (as well as those students for whom the need is documented as described in the following bullet).
Accommodations are available only to those students with documentation of the need through a formal plan (e.g., IEP, 504). Those students also may access designated supports and universal tools.
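
The additive, sequentially inclusive relationship among the three categories can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of the Smarter Balanced delivery system: the `Student` record, its field names, and the `available_categories` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Student:
    """Hypothetical record; field names are illustrative assumptions."""
    identified_need: bool = False  # need indicated by an educator or team
    formal_plan: bool = False      # need documented in an IEP or 504 plan


def available_categories(student: Student) -> list[str]:
    """Return the resource categories a student may access.

    Each tier includes the tiers before it.
    """
    categories = ["universal tools"]  # available to every student
    if student.identified_need or student.formal_plan:
        categories.append("designated supports")
    if student.formal_plan:
        categories.append("accommodations")  # require formal documentation
    return categories
```

Under this sketch, a student with a formal plan receives all three categories, while a student with no identified need receives universal tools only.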

A universal tool or a designated support may also be an accommodation, depending on the content or grade. This approach is consistent with the emphasis that Smarter Balanced has placed on the validity of assessment results coupled with access. Universal tools, designated supports, and accommodations are all intended to yield valid scores. Use of universal tools, designated supports, and accommodations results in scores that count toward participation in statewide assessments. Also shown in Figure 3.1 are the universal tools, designated supports, and accommodations for each category of accessibility resources. There are both embedded and non-embedded versions of the universal tools, designated supports, or accommodations, depending on whether they are provided as digitally delivered components of the test administration or provided locally separate from the test delivery system.

Figure 3.1: Conceptual Model Underlying the Smarter Balanced Usability, Accessibility, and Accommodations Guidelines

3.5 Meeting the Needs of Traditionally Underrepresented Populations

Members decided to make accessibility resources available to all students based on need rather than eligibility status or other designation. This reflects a belief among Consortium states that unnecessarily restricting access to accessibility resources threatens the validity of the assessment results and places students under undue stress and frustration. Additionally, accommodations are available for students who qualify for them. The Consortium utilizes a needs-based approach to providing accessibility resources. A description as to how this benefits ELs, students with disabilities, and ELs with disabilities is presented here.

3.5.1 Students Who Are ELs

Students who are ELs have needs that are distinct from those of students with disabilities, including language-related disabilities. The needs of ELs are not the result of a language-related disability, but instead are specific to the student’s current level of English language proficiency. The needs of students who are ELs are diverse and are influenced by the interaction of several factors, including their current level of English language proficiency, their prior exposure to academic content and language in their native language, the languages to which they are exposed outside of school, the length of time they have participated in the U.S. education system, and the language(s) in which academic content is presented in the classroom. Given the unique background and needs of each student, the conceptual framework is designed to focus on students as individuals and to provide several accessibility resources that can be combined in a variety of ways. Some of these digital tools, such as using a highlighter to highlight key information, are available to all students, including ELs. Other tools, such as the audio presentation of items or glossary definitions in English, may also be assigned to any student, including ELs. Still other tools, such as embedded glossaries that present translations of construct-irrelevant terms, are intended for those students whose prior language experiences would allow them to benefit from translations into another spoken language. Collectively, the conceptual framework for usability, accessibility, and accommodations embraces a variety of accessibility resources that have been designed to meet the needs of students at various stages in their English language development.

3.5.2 Students with Disabilities

Federal law requires that students with disabilities who have a documented need receive accommodations that address those needs and that they participate in assessments. The intent of the law is to ensure that all students have appropriate access to instructional materials and are held to the same high standards. When students are assessed, the law ensures that students receive appropriate accommodations during testing so they can demonstrate what they know and can do, so that their achievement is measured accurately.

The Accessibility and Accommodations Framework (Smarter Balanced, 2016c) addresses the needs of students with disabilities in three ways. First, it provides for the use of digital test items that are purposefully designed to contain multiple forms of the item, each developed to address a specific access need. By allowing the delivery of a given access form of an item to be tailored based on each student’s access need, the Framework fulfills the intent of federal accommodation legislation. Embedding universal accessibility digital tools, however, addresses only a portion of the access needs required by many students with disabilities. Second, by embedding accessibility resources in the digital test delivery system, additional access needs are met. This approach fulfills the intent of the law for many, but not all, students with disabilities by allowing the accessibility resources to be activated for students based on their needs. Third, by allowing for a wide variety of digital and locally provided accommodations (including physical arrangements), the Framework addresses a spectrum of accessibility resources appropriate for math and ELA/literacy assessment. Collectively, the Framework adheres to federal regulations by allowing a combination of universal design principles, universal tools, designated supports, and accommodations to be embedded in a digital delivery system and through local administration assigned and provided based on individual student needs. Therefore, a student who is both an EL and a student with a disability benefits from the system, because they may have access to resources from any of the three categories as necessary to create an assessment tailored to their individual need.

3.6 The Individual Student Assessment Accessibility Profile (ISAAP)

Typical practice has frequently required schools and educators to document, a priori, the need for specific student accommodations and then to document the use of those accommodations after the assessment. For example, most programs require schools to document a student’s need for a large-print version of a test for delivery to the school. Following the test administration, the school documents (often by bubbling in information on an answer sheet) which of the accommodations, if any, a given student received; whether the student actually used the large-print form; and whether any other accommodations, such as extended time, were provided. Traditionally, many programs have focused only on students who have received accommodations and thus may consider an accommodation report as documenting accessibility needs. The documentation of need and use establishes a student’s accessibility needs for assessment.

For most students, universal digital tools will be available by default in the Smarter Balanced test delivery system and need not be documented. These tools can be deactivated if they create an unnecessary distraction for the student. Other embedded accessibility resources that are available for any student needing them must be documented prior to assessment. The Smarter Balanced assessment system has established an Individual Student Assessment Accessibility Profile (ISAAP) to capture specific student accessibility needs. The ISAAP tool is designed to facilitate selection of the universal tools, designated supports, and accommodations that match student access needs for the Smarter Balanced assessments, as supported by the Usability, Accessibility, and Accommodations Guidelines (Smarter Balanced Assessment Consortium, 2017c). The ISAAP tool² should be used in conjunction with the Usability, Accessibility, and Accommodations Guidelines and state regulations and policies related to assessment accessibility as a part of the ISAAP process. For students requiring one or more accessibility resources, schools will be able to document this need prior to test administration. Furthermore, the ISAAP can include information about universal tools that may need to be eliminated for a given student. By documenting need prior to test administration, a digital delivery system will be able to activate the specified options when the student logs in to an assessment. In this way, the profile permits school-level personnel to focus on each individual student, documenting the accessibility resources required for valid assessment of that student in a way that is efficient to manage.
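
The profile logic described above (universal tools on by default, optionally deactivated; other resources active only when documented in advance) can be sketched as follows. This is a minimal illustration, not the ISAAP tool itself: the `AccessibilityProfile` class, the `resources_at_login` function, and the resource names are assumptions chosen for the example, not the official Smarter Balanced lists.

```python
from dataclasses import dataclass, field

# Illustrative resource names only; not the official Smarter Balanced lists.
DEFAULT_UNIVERSAL_TOOLS = frozenset({"highlighter", "digital notepad", "zoom"})


@dataclass
class AccessibilityProfile:
    """Hypothetical stand-in for one student's ISAAP record."""
    disabled_universal_tools: set[str] = field(default_factory=set)
    designated_supports: set[str] = field(default_factory=set)
    accommodations: set[str] = field(default_factory=set)


def resources_at_login(profile: AccessibilityProfile) -> set[str]:
    """Resolve which resources a delivery system would activate at login."""
    # Universal tools are on by default unless the profile turns them off.
    active = set(DEFAULT_UNIVERSAL_TOOLS) - profile.disabled_universal_tools
    # Everything else is active only if documented before administration.
    return active | profile.designated_supports | profile.accommodations
```

For example, a profile that disables `"zoom"` and documents a `"translated glossary"` designated support would resolve to the highlighter, the digital notepad, and the translated glossary.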

The conceptual framework shown in Figure 3.1 provides a structure that assists in identifying which accessibility resources should be made available for each student. In addition, the conceptual framework is designed to differentiate between universal tools available to all students and accessibility resources that must be assigned before the administration of the assessment. Consistent with recommendations from Shafer and Rivera (2011); Thurlow, Quenemoen, and Lazarus (2011); Fedorchak (2012); and Russell (2011), Smarter Balanced is encouraging school-level personnel to use a team approach to make decisions concerning each student’s ISAAP. Gaining input from individuals with multiple perspectives, including the student, will likely result in appropriate decisions about the assignment of accessibility resources. Consistent with these recommendations, one should avoid selecting too many accessibility resources for a student. The use of too many unneeded accessibility resources can decrease student performance.

The team approach encouraged by Smarter Balanced does not require the formation of a new decision-making team. The structure of teams can vary widely depending on the background and needs of a student. A locally convened student support team can potentially create the ISAAP. For most students who do not require accessibility tools or accommodations, a teacher’s initial decision may be confirmed by a second person (potentially the student). In contrast, for a student who is an English language learner and has been identified with one or more disabilities, the IEP team should include the English language development specialist who works with the student, along with other required IEP team members and the student, as appropriate. The composition of teams is not being defined by Smarter Balanced; it is under the control of each school and is subject to state and federal requirements.

3.7 Usability, Accessibility, and Accommodations Guidelines

Smarter Balanced developed the Usability, Accessibility, and Accommodations Guidelines (UAAG) for its members to guide the selection and administration of universal tools, designated supports, and accommodations. All Interim Comprehensive Assessments (ICAs) and Interim Assessment Blocks (IABs) are fully accessible and offer all accessibility resources as appropriate by grade and content area, including ASL, braille, and Spanish. The UAAG is intended for school-level personnel and decision-making teams, particularly Individualized Education Program (IEP) teams, as they prepare for and implement the Smarter Balanced summative and interim assessments. The UAAG provides information for classroom teachers, English development educators, special education teachers, and related services personnel in selecting and administering universal tools, designated supports, and accommodations for those students who need them. The UAAG is also intended for assessment staff and administrators who oversee the decisions that are made in instruction and assessment. It emphasizes an individualized approach to the implementation of assessment practices for those students who have diverse needs and participate in large-scale assessments. This document focuses on universal tools, designated supports, and accommodations for the Smarter Balanced summative and interim assessments in ELA/literacy and mathematics. At the same time, it supports important instructional decisions about accessibility for students. It recognizes the critical connection between accessibility in instruction and accessibility during assessment. The UAAG is also incorporated into the Smarter Balanced Test Administration Manual (Smarter Balanced, 2017i).

According to the UAAG (Smarter Balanced, 2014, p. 2), all eligible students (including students with disabilities, ELs, and ELs with disabilities) should participate in the assessments. In addition, the performance of all students who take the assessment is measured with the same criteria. Specifically, all students enrolled in grades 3 to 8 and high school are required to participate in the Smarter Balanced mathematics assessment, except students with the most significant cognitive disabilities who meet the criteria for the mathematics alternate assessment based on alternate achievement standards (approximately 1% or less of the student population).

All students enrolled in grades 3 to 8 and high school are required to participate in the Smarter Balanced English language arts/literacy assessment except:

students with the most significant cognitive disabilities who meet the criteria for the English language arts/literacy alternate assessment based on alternate achievement standards (approximately 1% or less of the student population), and
ELs who are enrolled for the first year in a U.S. school. These students will participate in their state’s English language proficiency assessment.

Federal laws governing student participation in statewide assessments include the Elementary and Secondary Education Act - ESEA (reauthorized as the No Child Left Behind Act - NCLB of 2001), the Individuals with Disabilities Education Improvement Act of 2004 - IDEA, and Section 504 of the Rehabilitation Act of 1973 (reauthorized in 2008).

Since the Smarter Balanced assessment is based on the CCSS, the universal tools, designated supports, and accommodations that are appropriate for the Smarter Balanced assessment may be different from those that state programs utilized previously. For the summative assessments, state participants can only make available to students the universal tools, designated supports, and accommodations consistent with the Smarter Balanced UAAG. According to the UAAG (Smarter Balanced, 2014, p. 1), when the implementation or use of a universal tool, designated support, or accommodation is in conflict with a member state’s law, regulation, or policy, a state may elect not to make it available to students.

The Smarter Balanced universal tools, designated supports, and accommodations currently available for the Smarter Balanced assessments have been prescribed. The specific universal tools, designated supports, and accommodations approved by Smarter Balanced may undergo change if additional tools, supports, or accommodations are identified for the assessment based on state experience or research findings. The Consortium has established a standing committee, including members from the Consortium and staff, that reviews suggested additional universal tools, designated supports, and accommodations to determine if changes are warranted. Proposed changes to the list of universal tools, designated supports, and accommodations are brought to Consortium members for review, input, and vote for approval. Furthermore, states may issue temporary approvals (i.e., one summative assessment administration) for individual, unique student accommodations. It is expected that states will evaluate formal requests for unique accommodations and determine whether the request poses a threat to the measurement of the construct. Upon issuing temporary approval, the petitioning state can send documentation of the approval to the Consortium. The Consortium will consider all state-approved temporary accommodations as part of the annual Consortium accommodations review process. The Consortium will provide to member states a list of the temporary accommodations issued by states that are not Consortium-approved accommodations.

3.8 Provision of Specialized Tests or Pools

Smarter Balanced provides a full item pool and a series of specialized item pools that allow eligible students to access the tests with a minimum of barriers. These accessibility resources are considered embedded accommodations or embedded designated supports. The specialized pools that were available in 2018-19 are shown in Table 3.1.

Table 3.1: SPECIALIZED TESTS AVAILABLE TO QUALIFYING STUDENTS
Subject	Test Instrument
ELA/literacy	ASL adaptive online (Listening only)
	Closed captioning adaptive online (Listening only)
	Braille adaptive online
	Braille paper pencil
Math	Translated glossaries adaptive online
	Stacked Spanish adaptive online
	ASL adaptive online
	Braille adaptive online
	Braille hybrid adaptive test (HAT)
	Spanish paper pencil
	Braille paper pencil
	Translated glossaries paper pencil

Table 3.2 and Table 3.3 show, for each subject, the number of online items in the general and accommodated pools by test segment (CAT and PT) and grade. Items in fixed forms, both online and paper and pencil, are not included in the counts shown in these tables.

Table 3.2: NUMBER OF ENGLISH LANGUAGE ARTS/LITERACY ITEMS IN GENERAL AND ACCOMMODATION POOLS BY ACCOMMODATION WITHIN TEST SEGMENT WITHIN GRADE
Grade	Online CAT General	Online CAT ASL	Online CAT Braille	Online PT General	Online PT ASL	Online PT Braille
3	867	58	260	38	NA	10
4	823	61	241	44	NA	10
5	787	47	257	50	NA	12
6	811	56	251	38	NA	10
7	735	51	209	48	NA	16
8	815	54	259	50	NA	16
11	2,612	121	466	56	NA	16

Table 3.3: NUMBER OF MATHEMATICS ITEMS IN GENERAL AND ACCOMMODATION POOLS BY ACCOMMODATION WITHIN TEST SEGMENT WITHIN GRADE
Grade	Online CAT General	Online CAT ASL	Online CAT Braille	Online CAT Glossaries	Online CAT Spanish	Online PT General	Online PT ASL	Online PT Braille	Online PT Glossaries	Online PT Spanish
3	1,234	333	370	229	388	95	35	35	30	50
4	1,325	336	317	226	384	116	41	32	29	43
5	1,268	367	332	232	389	105	41	41	28	51
6	1,147	343	347	246	378	92	23	23	28	39
7	1,047	317	323	233	335	92	20	20	17	25
8	915	298	257	208	301	79	25	19	39	34
11	2,610	660	492	370	672	70	17	17	16	28

Table 3.4 and Table 3.5 show the total score reliability and standard error of measurement (SEM) of the tests taken by students requiring the accommodated pools of items. The statistics in these tables were derived as described in Chapter 2 for students in the general population.

The measurement precision of accommodated tests is in line with that of the general population, taking into consideration the overall performance of students taking the accommodated tests and the relationship between overall performance level and measurement error. Measurement error tends to be greater at higher and lower deciles of performance, compared to deciles near the median (Table 2.29). To the extent that the average overall scale scores of students taking accommodated tests, shown in Table 3.4 and Table 3.5, fall into higher or lower deciles of performance (see tables in Section 5.4.3), one can expect the corresponding average SEMs in Table 3.4 and Table 3.5 to be larger than those for the general population (mean SEM in Table 2.29). To the extent that the average SEMs associated with the accommodated tests tend to be larger for this reason, one can also expect reliability coefficients in Table 3.4 and Table 3.5 to be smaller than those for the general population (see total score reliabilities in Table 2.4 and Table 2.5). Any comparison of reliability coefficients between general and accommodated populations must also take into account differences in variability of test scores. Even if the groups had the same average scale score and measurement error, the group having a lower standard deviation of scale scores would have a lower reliability coefficient. In some cases, such as grade 3 ASL in ELA/literacy, the standard deviation of scale scores (77.1 in Table 3.4) is much lower than that of the general grade 3 ELA/literacy test-taking population (91.4 in Table 5.7).
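The dependence of reliability on both score spread and measurement error can be illustrated with the classical approximation of reliability as one minus the ratio of error variance to observed score variance. This is only a rough sketch: the coefficients reported in Table 3.4 and Table 3.5 were derived as described in Chapter 2, so an approximation based on the average SEM is not expected to reproduce them exactly. The function name is illustrative and not part of any Smarter Balanced software.

```python
def approx_reliability(sd: float, avg_sem: float) -> float:
    """Classical approximation: 1 - (error variance / observed score variance)."""
    return 1.0 - (avg_sem ** 2) / (sd ** 2)


# Grade 3 ASL, ELA/literacy (Table 3.4): SD = 77.1, Avg.SEM = 30.2, reported Rho = 0.838.
asl_g3 = approx_reliability(sd=77.1, avg_sem=30.2)

# Same average SEM, but with the general-population SD cited in the text (91.4).
wider_sd = approx_reliability(sd=91.4, avg_sem=30.2)

print(f"SD 77.1: {asl_g3:.3f}   SD 91.4: {wider_sd:.3f}")
```

With measurement error held fixed, the group with the smaller standard deviation receives the lower approximate reliability, which is the pattern described above.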

Statistics concerned with test bias are reported for braille and Spanish tests in Chapter 2 and are based on simulation. Reliability and measurement error information for accommodated fixed forms, both paper and pencil and online, is the same as for regular fixed forms, as reported in Table 2.10 and Table 2.11, since the accommodated and regular fixed forms use the same items.

Table 3.4: STUDENT COUNTS, TEST RELIABILITY, AND STANDARD ERROR OF MEASUREMENT FOR ACCOMMODATED STUDENTS BY GRADE IN ELA/LITERACY
Type	Grade	Count	Mean	SD	Rho	Avg.SEM	SEM.Q1	SEM.Q2	SEM.Q3	SEM.Q4
Braille	3	15	2430	66.1	0.875	23.3	24.8	22.0	22.7	23.5
	4	18	2482	111.9	0.940	27.1	31.0	25.2	24.5	26.8
	5	27	2494	92.3	0.921	25.8	28.1	24.3	24.0	26.6
	6	23	2452	79.5	0.885	26.8	31.8	25.8	24.6	24.5
	7	21	2518	100.9	0.916	28.4	34.5	26.6	25.0	26.4
	8	19	2528	80.1	0.886	26.9	29.4	27.0	25.8	25.4
	HS	17	2556	122.5	0.918	33.8	43.0	30.0	29.8	30.2
ASL	3	194	2309	77.1	0.838	30.2	39.9	30.7	26.7	23.4
	4	215	2321	86.9	0.845	32.9	43.2	31.8	29.4	26.4
	5	210	2370	89.1	0.876	30.4	38.8	30.5	26.6	25.1
	6	217	2392	93.4	0.880	31.5	39.4	32.7	28.3	25.3
	7	265	2405	83.5	0.844	32.5	39.7	33.5	30.3	26.4
	8	257	2423	89.3	0.855	33.2	42.5	33.3	30.0	26.8
	HS	246	2453	100.9	0.866	36.4	45.0	37.4	33.2	29.8

Table 3.5: STUDENT COUNTS, TEST RELIABILITY, AND STANDARD ERROR OF MEASUREMENT FOR ACCOMMODATED STUDENTS BY GRADE IN MATHEMATICS
Type	Grade	Count	Mean	SD	Rho	Avg.SEM	SEM.Q1	SEM.Q2	SEM.Q3	SEM.Q4
Braille	3	12	2458	69.7	0.934	17.9	19.3	17.3	17.0	18.0
	4	18	2438	122.0	0.940	25.3	41.6	20.8	18.3	17.6
	5	25	2473	92.5	0.931	23.6	31.0	24.2	19.0	19.2
	6	20	2407	117.5	0.883	36.2	60.6	35.4	28.0	21.0
	7	20	2469	108.6	0.909	31.2	45.0	32.0	25.8	22.0
	8	16	2431	92.1	0.820	37.9	50.5	40.8	33.2	27.2
	HS	9	2461	118.0	0.825	45.4	67.0	38.5	36.0	29.5
ASL	3	188	2330	83.8	0.909	24.2	34.9	24.4	20.0	17.4
	4	212	2362	89.3	0.888	27.8	42.3	28.3	21.8	18.4
	5	209	2393	82.0	0.839	31.2	44.6	31.7	27.0	21.3
	6	210	2383	94.0	0.817	37.5	56.1	39.0	31.5	23.3
	7	254	2385	96.4	0.765	43.3	67.4	44.9	34.7	25.9
	8	254	2410	98.2	0.805	41.5	56.7	43.4	37.3	28.2
	HS	239	2450	94.8	0.733	46.3	67.4	48.5	39.1	29.9
Spanish	3	8,038	2390	83.8	0.939	20.1	26.2	19.5	17.5	17.0
	4	7,087	2422	86.2	0.929	22.0	30.4	21.1	18.7	17.9
	5	5,956	2441	91.4	0.908	26.4	36.6	27.4	22.8	18.7
	6	4,402	2426	98.1	0.878	32.0	47.9	32.5	26.0	21.4
	7	4,008	2420	95.0	0.822	37.4	56.1	37.4	31.0	24.6
	8	3,650	2428	88.9	0.801	38.2	51.5	39.4	34.6	27.3
	HS	3,262	2447	86.6	0.686	45.9	66.0	47.0	39.6	31.0
TransGloss	3	10,918	2400	80.4	0.937	19.7	24.9	19.1	17.5	17.1
	4	11,310	2432	81.8	0.929	21.1	27.5	20.2	18.5	18.0
	5	10,037	2449	85.0	0.903	25.4	34.1	26.4	22.4	18.7
	6	8,861	2449	100.3	0.904	29.2	42.7	29.2	23.9	20.8
	7	7,598	2452	106.9	0.883	34.1	50.5	34.2	28.6	22.9
	8	7,372	2466	115.3	0.897	35.5	49.1	37.2	31.0	24.5
	HS	10,777	2502	122.3	0.879	39.5	59.4	41.2	32.8	24.6

The braille statistics in Table 3.5 include both the fully adaptive online braille test and the Braille hybrid adaptive test (Braille HAT). More specific information about the Braille HAT is available online (Smarter Balanced, 2017h). Documentation concerned with blueprint fulfillment, item exposure, measurement precision, and bias of the Braille HAT is currently under review for public release.

3.9 Differential Item Functioning (DIF)

DIF analyses are used to identify items for which groups of students that are matched on overall achievement, but differ demographically (e.g., males, females), have different probabilities of success on a test item. Information about DIF and the procedures for reviewing items flagged for DIF is a component of validity evidence associated with the internal properties of the test.

3.9.1 Method of Assessing DIF

DIF analyses are performed on items using data gathered in the field test stage. In a DIF analysis, the performance on an item of two groups that are similar in achievement, but differ demographically, is compared. In general, the two groups are called the focal and reference groups. The focal group is usually a minority group (e.g., Hispanics), while the reference group is usually a contrasting majority group (e.g., Caucasian) or all students who are not part of the focal group demographic. The focal and reference groups in Smarter Balanced DIF analyses are identified in Table 3.6.

Table 3.6: DEFINITION OF FOCAL AND REFERENCE GROUPS FOR DIF ANALYSES
Group Type	Focal Group	Reference Group
Gender	Female	Male
Ethnicity	African American	White
	Asian/Pacific Islander
	Native American/Alaska Native
	Hispanic
Special Populations	Limited English Proficient (LEP)	English Proficient
	Individualized Education Program (IEP)	No IEP
	Title 1 (Economically disadvantaged)	Not Title 1

A DIF analysis asks, “Do focal group students have the same probability of success on each test item as reference group students of the same overall ability (as indicated by their performance on the full test)?” If the answer is “no,” according to the criteria described below, the item is said to exhibit DIF.

Different DIF analysis procedures and flagging criteria are used depending on the number of points the item is worth. For one-point items (also called dichotomously scored items and scored 0/1), the Mantel-Haenszel statistic (Mantel & Haenszel, 1959) is used. For items worth more than one point (also called partial-credit or polytomously scored items), the Mantel chi-square statistic (Mantel, 1963) and the standardized mean difference (SMD) procedure (Dorans & Kulick, 1983, 1986) are used.

The Mantel-Haenszel statistic is computed as described by Holland and Thayer (1988). The common odds ratio is computed first:

\[\begin{equation} \alpha_{MH}=\frac{(\sum_m\frac{R_rW_f}{N_m})}{(\sum_m\frac{R_fW_r}{N_m})}, \tag{3.1} \end{equation}\]

where
\(R_r\) = number in reference group at ability level m answering the item right;
\(W_f\) = number in focal group at ability level m answering the item wrong;
\(R_f\) = number in focal group at ability level m answering the item right;
\(W_r\) = number in reference group at ability level m answering the item wrong; and
\(N_m\) = total group at ability level m.

This value is then used to compute MH D-DIF, which expresses the common odds ratio on the delta scale, a normalized transformation of item difficulty (p-value) with a mean of 13 and a standard deviation of 4:

\[\begin{equation} MH\ D\text{-}DIF = -2.35\ln[\alpha_{MH}]. \tag{3.2} \end{equation}\]

The standard error used to test MH D-DIF for significance is equal to

\[\begin{equation} 2.35\sqrt{var[\ln(\alpha_{MH})]}, \tag{3.3} \end{equation}\]

where the variance of the log of the MH common odds ratio, \(var[\ln(\alpha_{MH})]\), is given, for example, in Appendix 1 of Michaelides (2008). For significance testing, the ratio of MH D-DIF to this SE is tested as a deviate of the normal distribution (significance level .05).

The statistical significance of MH D-DIF could alternatively be obtained by computing the chi square statistic:

\[\begin{equation} X^2_{MH} = \frac{(|\sum_m R_r-\sum_m E(R_r)|-\frac{1}{2})^2}{\sum_m Var(R_r)}, \tag{3.4} \end{equation}\]

where

\[\begin{equation} E(R_r) = \frac{N_r R_N}{N_m}, \quad Var(R_r) = \frac{N_r N_f R_N W_N}{N_m^2 (N_m-1)}, \tag{3.5} \end{equation}\]

\(N_r\) and \(N_f\) are the numbers of examinees in the reference and focal groups at ability level m, respectively, and \(R_N\) and \(W_N\) are the number of examinees who answered the item correctly and incorrectly at ability level m, respectively. Smarter Balanced uses the standard error to test the significance of MH D-DIF.
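The quantities in Equations 3.1, 3.2, 3.4, and 3.5 can be sketched in a few lines of Python. This is an illustrative sketch and not the Consortium's production code: the function name, the tuple layout of each ability level, and the counts in the example are all hypothetical. The standard error in Equation 3.3 is omitted because its variance term is defined in the cited reference and not in this chapter.

```python
import math


def mantel_haenszel_dif(strata):
    """Return (alpha_MH, MH D-DIF, MH chi-square) for one dichotomous item.

    Each element of `strata` is a tuple (R_r, W_r, R_f, W_f) holding the counts
    of right/wrong answers for the reference and focal groups at one ability
    level m. This layout is an assumption made for the sketch.
    """
    num = den = 0.0        # numerator and denominator of Equation 3.1
    obs = exp = var = 0.0  # running sums for Equations 3.4 and 3.5
    for r_r, w_r, r_f, w_f in strata:
        n_r, n_f = r_r + w_r, r_f + w_f  # group sizes at this level
        r_n, w_n = r_r + r_f, w_r + w_f  # total right / total wrong at this level
        n_m = n_r + n_f
        if n_m < 2:
            continue  # the variance term in Equation 3.5 is undefined
        num += r_r * w_f / n_m
        den += r_f * w_r / n_m
        obs += r_r
        exp += n_r * r_n / n_m
        var += n_r * n_f * r_n * w_n / (n_m ** 2 * (n_m - 1))
    alpha = num / den                  # raises ZeroDivisionError if den == 0
    d_dif = -2.35 * math.log(alpha)    # Equation 3.2
    chi2 = (abs(obs - exp) - 0.5) ** 2 / var  # Equation 3.4
    return alpha, d_dif, chi2


# Hypothetical counts at three ability levels (not real data).
example = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
alpha, d_dif, chi2 = mantel_haenszel_dif(example)
print(f"alpha_MH={alpha:.3f}  MH D-DIF={d_dif:.3f}  chi-square={chi2:.2f}")
```

In this made-up example the reference group has higher odds of success at every ability level, so the common odds ratio exceeds 1 and MH D-DIF is negative.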

The Mantel chi-square (Mantel, 1963) is an extension of the MH statistic, which, when applied to the DIF context, presumes the item response categories are ordered and compares means between reference and focal groups. The Mantel statistic is given by:

\[\begin{equation} X^2_{Mantel} = \frac{(\sum_{m=0}^T F_m - \sum_{m=0}^T E(F_m))^2}{\sum_{m=0}^T \sigma_{F_m}^2} \tag{3.6} \end{equation}\]

where \(F_m\) is the sum of item scores for the focal group at the \(m^{th}\) ability level, E() is the expected value (mean), and \(\sigma^2\) is the variance. Michaelides (2008, p. 5) provides additional detail pertaining to the computation of the mean and variance.
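A minimal sketch of Equation 3.6 follows. The expected value and variance at each ability level are computed with the standard hypergeometric expressions commonly used with the Mantel statistic; the chapter defers these to Michaelides (2008), so their use here is an assumption of the sketch. The function names, the input layout, and the example counts are hypothetical. The p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def mantel_chi_square(strata, scores):
    """Mantel chi-square for one polytomous item.

    `scores` lists the possible item scores (e.g., [0, 1, 2]). Each element of
    `strata` is a pair (focal_counts, ref_counts) giving, for one ability level,
    the number of students in each group who earned each score.
    """
    sum_f = sum_e = sum_v = 0.0
    for focal, ref in strata:
        n_f, n_r = sum(focal), sum(ref)
        n_m = n_f + n_r
        if n_f == 0 or n_r == 0 or n_m < 2:
            continue  # level carries no information about group differences
        totals = [f + r for f, r in zip(focal, ref)]
        s1 = sum(y * n for y, n in zip(scores, totals))      # sum of scores
        s2 = sum(y * y * n for y, n in zip(scores, totals))  # sum of squared scores
        sum_f += sum(y * f for y, f in zip(scores, focal))   # F_m
        sum_e += n_f * s1 / n_m                              # E(F_m)
        sum_v += n_r * n_f * (n_m * s2 - s1 ** 2) / (n_m ** 2 * (n_m - 1))
    return (sum_f - sum_e) ** 2 / sum_v


def chi_square_p_value_1df(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical single ability level with scores 0/1 (not real data).
chi2 = mantel_chi_square([([30, 20], [20, 30])], scores=[0, 1])
p_value = chi_square_p_value_1df(chi2)
print(f"chi-square={chi2:.2f}  p={p_value:.4f}")
```

For a 0/1 item the variance term reduces to the one in Equation 3.5, so the result matches the MH chi-square computed without the continuity correction.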

The standardized mean difference used for partial-credit items is defined as:

\[\begin{equation} SMD = \sum p_{Fk} m_{Fk} - \sum p_{Fk} m_{Rk}, \tag{3.7} \end{equation}\]

where \(p_{Fk}\) is the proportion of the focal group members who are at the \(k^{th}\) level of the matching variable, \(m_{Fk}\) is the mean item score for the focal group at the \(k^{th}\) level, and \(m_{Rk}\) is the mean item score for the reference group at the \(k^{th}\) level. A negative value of the standardized mean difference shows that the item is more difficult for the focal group, whereas a positive value indicates that it is more difficult for the reference group.

To obtain the effect size, the SMD is divided by the pooled standard deviation of item scores across the reference and focal groups:

\[\begin{equation} SD = \sqrt{\frac{(n_F-1)\sigma_{y_F}^2+(n_R-1)\sigma_{y_R}^2}{n_F + n_R - 2}}, \tag{3.8} \end{equation}\]

where \(n_F\) and \(n_R\) are the counts of focal and reference group members who answered the item, and \(\sigma_{y_F}^2\) and \(\sigma_{y_R}^2\) are the variances of the item responses for the focal and reference groups, respectively.
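Equations 3.7 and 3.8 can be sketched as follows. The function name, the input layout, and the item scores are hypothetical, and two choices are assumptions of the sketch: group variances are computed as sample variances (dividing by n - 1), and matching levels that lack either group are skipped.

```python
import math
from statistics import mean, variance


def smd_effect_size(levels):
    """Return (SMD, pooled SD, SMD / SD) for one partial-credit item.

    Each element of `levels` is a pair (focal_scores, ref_scores): the item
    scores earned by focal and reference students at one level k of the
    matching variable.
    """
    focal_all = [y for focal, _ in levels for y in focal]
    ref_all = [y for _, ref in levels for y in ref]
    n_f, n_r = len(focal_all), len(ref_all)

    smd = 0.0
    for focal, ref in levels:
        if not focal or not ref:
            continue  # assumption: skip levels without both groups
        p_fk = len(focal) / n_f  # share of the focal group at level k
        smd += p_fk * (mean(focal) - mean(ref))  # Equation 3.7, term by term

    # Equation 3.8: pooled standard deviation of item scores.
    pooled_sd = math.sqrt(
        ((n_f - 1) * variance(focal_all) + (n_r - 1) * variance(ref_all))
        / (n_f + n_r - 2)
    )
    return smd, pooled_sd, smd / pooled_sd


# Hypothetical item scores at two matching levels (not real data).
levels = [([0, 1, 1], [1, 1, 2, 2]), ([1, 2, 2], [2, 2, 2, 1])]
smd, pooled_sd, effect = smd_effect_size(levels)
print(f"SMD={smd:.3f}  SD={pooled_sd:.3f}  SMD/SD={effect:.3f}")
```

The negative SMD in this made-up example indicates an item that is harder for the focal group than for matched reference students.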

Items are classified into three categories of DIF, “A,” “B,” or “C,” according to the criteria shown in Table 3.7 (for dichotomously scored items) and Table 3.8 (for partial-credit items). Category A items contain negligible DIF. In subsequent tables, category A levels of DIF are not flagged because they are too small to be meaningfully interpreted. Category B items exhibit slight to moderate DIF, and category C items exhibit moderate to large DIF. Positive values favor the focal group, and negative values favor the reference group. Positive and negative values are reported for B and C levels of DIF. Negative and positive DIF at the B level are denoted, respectively, B- and B+; C-level DIF is denoted C- and C+ in the same way.

Table 3.7: DIF FLAGGING LOGIC FOR DICHOTOMOUSLY SCORED ITEMS
DIF Category	Definition
A (negligible)	|MH D-DIF| < 1
B (slight to moderate)	|MH D-DIF| \(\ge\) 1 and |MH D-DIF| < 1.5
B (slight to moderate)	Positive values are classified as “B+” and negative values as “B-”
C (moderate to large)	|MH D-DIF| \(\ge\) 1.5
C (moderate to large)	Positive values are classified as “C+” and negative values as “C-”

Table 3.8: DIF FLAGGING LOGIC FOR PARTIAL-CREDIT ITEMS
DIF Category	Definition
A (negligible)	Mantel chi-square p-value > 0.05 or |SMD/SD| \(\le\) 0.17
B (slight to moderate)	Mantel chi-square p-value < 0.05 and 0.17 < |SMD/SD| \(\le\) 0.25
C (moderate to large)	Mantel chi-square p-value < 0.05 and |SMD/SD| > 0.25
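The flagging logic in Table 3.7 and Table 3.8 can be expressed as two small functions. This is a sketch with hypothetical function names. Two details are assumptions: Table 3.8 does not say how a p-value of exactly 0.05 is handled, so it is treated as category A here, and the sign suffix for partial-credit items follows the sign of SMD/SD, in line with the convention that positive values favor the focal group.

```python
def classify_dichotomous(mh_d_dif: float) -> str:
    """DIF category for a 0/1 item, following Table 3.7."""
    size = abs(mh_d_dif)
    if size < 1.0:
        return "A"
    sign = "+" if mh_d_dif > 0 else "-"
    return ("B" if size < 1.5 else "C") + sign


def classify_partial_credit(p_value: float, effect_size: float) -> str:
    """DIF category for a partial-credit item, following Table 3.8.

    `effect_size` is SMD / SD. A p-value of exactly 0.05 is treated as
    category A, which is an assumption of this sketch.
    """
    size = abs(effect_size)
    if p_value >= 0.05 or size <= 0.17:
        return "A"
    sign = "+" if effect_size > 0 else "-"
    return ("B" if size <= 0.25 else "C") + sign


# Hypothetical statistics (not real items).
print(classify_dichotomous(0.4))             # A
print(classify_dichotomous(1.2))             # B+
print(classify_dichotomous(-2.0))            # C-
print(classify_partial_credit(0.01, -0.20))  # B-
print(classify_partial_credit(0.20, 0.50))   # A
```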

Items flagged for C-level DIF are subsequently reviewed by content experts and bias/sensitivity committees to determine the source and meaning of performance differences. An item flagged for C-level DIF may be measuring something different from the intended construct. However, it is important to recognize that DIF-flagged items might be related to actual differences in relevant knowledge and skills or may have been flagged due to chance variation in the DIF statistic (known as statistical type I error). Final decisions about the resolution of item DIF are made by the multi-disciplinary panel of content experts.

3.9.2 Item DIF in the 2018-19 Summative Assessment Pool

Table 3.9 and Table 3.10 show DIF analysis results for items in the 2018-19 ELA/literacy and mathematics summative item pools. The numbers of items with B- or C-level DIF in the summative pools were relatively small. Items classified as N/A (not assessed) were items for which sample size requirements were not met. Most of these cases occurred for the Native American/Alaska Native focal group, which comprised only about 1% of the total test-taking population.

Table 3.9: NUMBER OF DIF ITEMS IN THE CURRENT SUMMATIVE POOL FLAGGED BY CATEGORY (ELA/LITERACY, GRADES 3-8 AND 11)
Grade	DIF Category	Female Male	Asian White	Black White	Hispanic White	Native American White	IEP NonIEP	LEP NonLEP	Title1 NonTitle1
3	N/A	0	167	91	2	637	77	29	0
3	A	912	693	781	888	271	815	859	913
3	B-	8	30	38	24	12	23	29	10
3	B+	4	29	12	7	4	9	5	0
3	C-	0	2	2	3	0	0	2	1
3	C+	0	3	0	0	0	0	0	0
4	N/A	0	131	91	4	578	67	18	0
4	A	868	696	752	845	295	776	821	876
4	B-	12	19	36	35	11	36	41	11
4	B+	8	35	8	3	5	10	6	1
4	C-	1	4	1	2	0	0	3	1
4	C+	0	4	1	0	0	0	0	0
5	N/A	0	126	83	1	605	57	34	0
5	A	821	651	733	808	242	763	757	843
5	B-	22	43	28	40	12	32	55	17
5	B+	14	35	14	9	3	5	8	2
5	C-	4	2	1	4	0	5	6	0
5	C+	1	5	3	0	0	0	2	0
6	N/A	0	145	69	0	567	46	59	0
6	A	817	656	752	827	287	781	738	854
6	B-	22	32	39	29	7	30	62	11
6	B+	18	23	7	7	6	3	4	3
6	C-	3	5	0	5	1	8	5	0
6	C+	8	7	1	0	0	0	0	0
7	N/A	0	133	90	2	464	57	111	0
7	A	726	626	678	769	328	714	649	792
7	B-	30	21	28	27	8	30	40	14
7	B+	35	19	9	4	7	6	5	1
7	C-	6	1	1	5	0	0	2	0
7	C+	10	7	1	0	0	0	0	0
8	N/A	0	136	122	0	621	93	138	0
8	A	791	668	722	842	263	747	671	873
8	B-	35	42	32	29	4	43	63	16
8	B+	40	31	11	11	2	4	11	0
8	C-	7	3	2	5	0	3	7	1
8	C+	17	10	1	3	0	0	0	0
11	N/A	99	882	1,113	104	2,570	1,350	1,650	99
11	A	2,395	1,649	1,522	2,405	119	1,297	974	2,504
11	B-	117	86	34	147	1	28	51	81
11	B+	49	66	23	28	5	18	18	9
11	C-	15	3	2	11	1	1	2	3
11	C+	21	10	2	1	0	2	1	0

Table 3.10: NUMBER OF DIF ITEMS IN THE CURRENT SUMMATIVE POOL FLAGGED BY CATEGORY (MATHEMATICS, GRADES 3-8 AND 11)
Grade	DIF Category	Female Male	Asian White	Black White	Hispanic White	Native American White	IEP NonIEP	LEP NonLEP	Title1 NonTitle1
3	N/A	0	132	40	0	1,258	2	0	0
3	A	1,292	1,058	1,175	1,231	69	1,270	1,267	1,306
3	B-	16	34	48	50	0	37	29	20
3	B+	16	81	58	42	1	16	27	3
3	C-	4	4	2	3	0	2	2	0
3	C+	1	20	6	3	1	2	4	0
4	N/A	0	137	129	0	1,258	17	1	0
4	A	1,386	1,150	1,226	1,371	173	1,354	1,358	1,414
4	B-	29	32	41	39	2	63	45	23
4	B+	24	99	37	26	7	5	29	4
4	C-	1	5	3	2	0	1	6	0
4	C+	1	18	5	3	1	1	2	0
5	N/A	0	133	125	0	1,123	1	18	0
5	A	1,322	1,071	1,174	1,325	235	1,275	1,302	1,360
5	B-	29	21	28	31	4	63	29	11
5	B+	19	122	45	15	7	23	20	2
5	C-	3	5	1	2	1	10	4	0
5	C+	0	21	0	0	3	1	0	0
6	N/A	0	125	267	0	1,172	111	71	0
6	A	1,179	985	927	1,187	67	1,069	1,118	1,209
6	B-	28	22	17	31	0	41	25	21
6	B+	29	67	21	15	0	13	21	7
6	C-	2	5	3	2	0	5	2	2
6	C+	1	35	4	4	0	0	2	0
7	N/A	0	191	325	1	1,063	174	160	0
7	A	1,081	825	786	1,064	75	919	936	1,091
7	B-	35	8	5	47	0	28	18	43
7	B+	22	82	17	21	1	16	18	4
7	C-	1	3	2	6	0	1	1	1
7	C+	0	30	4	0	0	1	6	0
8	N/A	0	240	210	0	955	120	251	0
8	A	968	678	743	948	37	824	708	971
8	B-	17	19	22	28	1	30	24	20
8	B+	8	42	17	14	1	12	8	3
8	C-	1	4	1	4	0	3	1	0
8	C+	0	11	1	0	0	5	2	0
11	N/A	500	1,714	1,222	501	2,666	1,946	2,184	500
11	A	2,053	833	1,400	2,070	13	690	463	2,115
11	B-	51	15	16	60	1	20	11	40
11	B+	56	79	36	42	0	22	18	20
11	C-	12	1	3	2	0	0	1	2
11	C+	8	38	3	5	0	2	3	3

3.10 Test Fairness and Implications for Ongoing Research

The evidence presented in this chapter underscores the Smarter Balanced Consortium’s commitment to fair and equitable assessment for all students, regardless of their gender, cultural heritage, disability status, native language, and other characteristics. In addition to these proactive development activities designed to promote equitable assessments, other forms of evidence for test fairness are identified in the Standards (2014). They are described and referenced in the validity framework of Chapter 1.