Chapter 4 Test Design

4.1 Introduction

The intent of this chapter is to show how the assessment design supports the purposes of Smarter Balanced assessments. Test design entails developing a test philosophy (i.e., Theory of Action); identifying test purposes; and determining the targeted examinee populations, test specifications, item pool design, and other features (Schmeiser & Welch, 2006). The Smarter Balanced Theory of Action, test purposes, and the targeted examinee population were outlined in the Overview and Chapter 1 of this report.

4.2 Evidence-Centered Design in Constructing Smarter Balanced Assessments

Evidence-centered design (ECD) is an approach to creating educational assessments through reasoning about evidence (arguments) concerning the intended constructs. ECD begins with identification of the claims or inferences users want to make about student achievement. Evidence needed to support those claims is then specified, and finally, items/tasks capable of eliciting that information are designed (Mislevy et al., 2003). Explicit attention is paid to the potential influence of unintended constructs. ECD accomplishes this in two ways. The first is by incorporating an overarching concept of assessment as an argument from imperfect evidence. This argument makes explicit the claims (the inferences that one intends to make based on scores) and the nature of the evidence that supports those claims (Hansen & Mislevy, 2008; Mislevy & Haertel, 2006). The second is by distinguishing the activities and structures involved in the assessment enterprise so that the assessment argument can be embodied in operational processes. By making the underlying evidentiary argument more explicit, the framework makes operational elements more amenable to examination, sharing, and refinement. Making the argument more explicit also helps designers meet diverse assessment needs caused by changing technological, social, and legal environments (Hansen & Mislevy, 2008; Zhang et al., 2009).

The ECD process entails five types of activities, or layers, of assessment. The activities focus on 1) the identification of the substantive domain to be assessed; 2) the assessment argument; 3) the structure of assessment elements such as tasks, rubrics, and psychometric models; 4) the implementation of these elements; and 5) the way they function in an operational assessment, as described below.

  • Domain Analysis. In this first layer, domain analysis involves determining the specific content to be included in the assessment. Smarter Balanced uses the Common Core State Standards (CCSS) as its content domain for ELA/literacy and mathematics. Domain analysis was conducted by the developers of the CCSS, who first developed college and career readiness standards, to address what students are expected to know and be able to do by the time they graduate from high school. This was followed by development of the K–12 standards, which address expectations for students in elementary through high school.
  • Domain Modeling. In domain modeling, a high-level description of the overall components of the assessment is created and documented. For Smarter Balanced, the components include computer-adaptive summative and interim assessments in ELA/literacy and mathematics. The domain framework was developed by organizing the CCSS into domain areas that form the structure of test blueprints and reporting categories. This overall structure was created in the course of Smarter Balanced content specification development.
  • The Conceptual Assessment Framework. Next, the conceptual assessment framework is developed. For Smarter Balanced, this step was accomplished in developing the Smarter Balanced content specifications, which identify major claim structure, targets within claims, and the relationship of those elements to underlying content of the CCSS. In this step, the knowledge, skills, and abilities to be assessed (i.e., intended constructs, targets of assessment); the evidence that needs to be collected; and the features of the tasks that will elicit the evidence are specified in detail. Ancillary constructs that may be required to respond correctly to an assessment task but are not the intended target of the assessment are also specified (e.g., reading skills in a mathematics assessment). By identifying any ancillary knowledge, skills, and abilities (KSAs), construct-irrelevant variance can be identified a priori and minimized during item and task development. Potential barriers created by the ancillary KSAs can be removed or their effects minimized through the provision of appropriate access features. The item and task specifications describe the evidence required to support claims about the assessment targets and also identify any ancillary constructs.
  • Implementation. This layer involves the development of the assessment items or tasks using the specifications created in the conceptual assessment framework just described. In addition, scoring rubrics are created, and the scoring process is specified. Smarter Balanced items, performance tasks, and associated scoring rubrics were developed starting in the spring of 2012.
  • Delivery. In this final layer, the processes for administration and reporting are created. The delivery system describes the adaptive algorithm, collection of student evidence, task assembly, and presentation models required for the assessment and how they function together. The ECD elements chosen lead to the best evaluation of the construct for the intended test purposes.

4.3 Content Structure

In developing and maintaining a system of assessments, the goal of Smarter Balanced is to ensure that the assessment’s measurement properties reflect industry standards for content, rigor, and performance. A key step in this direction is to ensure that the Smarter Balanced assessments are aligned with the Common Core State Standards (CCSS). Figure 4.1 briefly encapsulates the Smarter Balanced content structure.


Figure 4.1: Components of Smarter Balanced Test Design

The Common Core State Standards are the content standards in ELA/literacy and mathematics that many states have adopted. Because the CCSS were not specifically developed for assessment, they contain extensive rationale and information concerning instruction. Therefore, adopting previous practices used by many state programs, Smarter Balanced content experts produced content specifications in ELA/literacy and mathematics that distill assessment-focused elements from the CCSS (Smarter Balanced, 2017b, 2017d). Item development specifications (https://contentexplorer.smarterbalanced.org/test-development) are then based on the content specifications. Each item is aligned to a specific claim and target and to a Common Core State Standard.

Within each of the two subject areas in grades 3-8 and high school, there are four broad claims. Within each claim, there are a number of assessment targets. The claims in ELA/literacy and mathematics are given in Table 4.1.

Table 4.1: CLAIMS FOR ELA/LITERACY AND MATHEMATICS
Claim ELA/Literacy Mathematics
1 Reading Concepts and Procedures
2 Writing Problem-Solving
3 Speaking/Listening Communicating Reasoning
4 Research Modeling and Data Analysis

Currently, only the listening part of ELA/literacy claim 3 is assessed. In mathematics, claims 2 and 4 are reported together as a single subscore, so there are only three reporting categories for mathematics, but four claims.

Because of the breadth in coverage of the individual claims, targets within each claim were needed to define more specific performance expectations. The relationship between targets and CCSS elements is made explicit in the Smarter Balanced content specifications (Smarter Balanced, 2017b, 2017d).

The Smarter Balanced item and task specifications (Smarter Balanced, 2015b) comprise many documents, all of which are based on the Smarter Balanced content specifications. These documents provide guidance for translating the Smarter Balanced content specifications into actual assessment items. In addition, guidelines for bias and sensitivity (Smarter Balanced, 2022a), accessibility and accommodations (Smarter Balanced, 2016b, 2022d), and style (Smarter Balanced, 2015c) help item developers and reviewers ensure consistency and fairness across the item bank. The specifications and guidelines were reviewed by member states, school districts, higher education representatives, and other stakeholders. The item specifications describe the evidence to be elicited and provide sample task models to guide the development of items that measure student performance relative to the target.

4.4 Summative Assessment Blueprints

Test specifications and blueprints define the knowledge, skills, and abilities intended to be measured on each student’s test event, and explain how skills are sampled from a set of content standards (i.e., the CCSS). Specifically, a test blueprint is a formal document that guides the development and assembly of an assessment by explicating the following types of essential information:

  • Content (claims and assessment targets) that is included for each assessed subject and grade
  • Relative emphasis of content standards generally indicated as the number of items or percentage of points per claim and assessment target
  • Depth of knowledge (DOK) required by test items, indicating the complexity of item types for each claim and assessment target
  • Additional rules or specifications needed to administer the test

The Smarter Balanced summative blueprints were developed with broad input from member states, partners, and stakeholders, and reflect the depth and breadth of the performance expectations of the CCSS. Some innovative features of the Smarter Balanced blueprints are: a) the inclusion of both computer adaptive (CAT) and performance task (PT) components, and b) the provision of a variety of both machine-scored and human-scored items and response types.

The use of CAT methodologies helps ensure that students across the range of proficiency have an assessment experience with items well targeted to their skill level. CAT tests are also more efficient because they provide a higher level of score precision than fixed-form tests with the same number of items. The PT is administered on a computer but is not computer adaptive. PTs are intended to measure multiple standards in a coherent task that requires the use of integrated skill sets. They measure capacities such as essay writing, research skills, and complex analysis, which are not as easy to assess with individual, discrete items.
Responses from both CAT and PT components are combined to cover the test blueprint in a grade and content area and are used to produce the overall and claim scale scores. Figure 4.2 is a conceptual diagram of how claims are distributed across the adaptive and performance task parts of the tests.


Figure 4.2: Claim Distribution in Test Blueprints

Links to the Smarter Balanced ELA/literacy and mathematics Summative Assessment Full Blueprints for grades 3-8 and high school for 2020-21 are provided:

ELA/L Full

Math Full

4.4.1 Adjusted (Shortened) Blueprint

Beginning in the 2020-21 administration year, Smarter Balanced maintains an adjusted (shortened) blueprint for ELA/literacy and mathematics in addition to the full blueprint. Members select one blueprint or the other for use in their state or territory. The adjusted blueprint covers the same skills and knowledge as the full blueprint, but the number of questions in the CAT component has been reduced by about half, with the PT component unchanged. The adjusted blueprint is used with the same pool of items for mathematics and almost exactly the same pool of items for ELA/literacy as used for the full blueprint.[1]

Links to the Smarter Balanced ELA/literacy and mathematics Summative Assessment Adjusted Blueprints for grades 3-8 and high school for 2020-21 are provided:

ELA/L Adjusted

Math Adjusted

4.5 Performance Task Design

As shown in the test blueprints, performance tasks are an integral part of the Smarter Balanced test design, and they fulfill a specific role in the test blueprint for a grade and content area. Performance tasks are intended to measure the ability to integrate knowledge and skills across multiple content standards, a key component of college and career readiness. Performance assessments give students opportunities to demonstrate their ability to find, organize, or use information to solve problems; undertake research; frame and conduct investigations; analyze and synthesize data; and/or apply learning to novel situations.

Smarter Balanced performance tasks were constructed so they can be delivered effectively in the school/classroom environment (Dana & Tippins, 1993). Requirements for task specifications included, but were not limited to, compatibility with classroom activities, materials and technology needs, and allotted time for assessment. Performance tasks adhere to specifications used by item writers to develop new tasks that focus on different content but are comparable in contribution to the blueprint.

All Smarter Balanced performance tasks consist of three basic components: stimulus presentation, information processing, and scorable product(s) or performance(s). Stimuli for Smarter Balanced performance tasks are provided in various forms (e.g., readings, video clips, data). “Information processing” means student interactions with the stimulus materials and their content. It could include note-taking, data generation, and any other activities that increase students’ understanding of the stimulus content or the assignment. All activities within a task must have a rationale for inclusion (e.g., to increase understanding, for scaffolding, as early steps in product creation, or for product creation).

In ELA/literacy, each performance task comprises a targeted research effort in which students read sources and respond to one research item, followed by an essay. During the research component, students may take notes to which they may later refer. Students then write a full essay drawing from source material and research notes. Claim-level results in writing and research are based on both CAT and performance task item responses.

In mathematics, each performance task comprises a set of stimulus materials and a follow-up item set consisting of up to six items in claims 2, 3, and 4. These are combined with CAT items in claims 2, 3, and 4 to satisfy the blueprint and create a claim 3 score and a combined claim 2 and 4 score. Performance tasks address an integrated scenario in middle and high school and a common theme in grades 3-5.

4.6 Item and Task Specifications

The item and task specifications bridge the distance from the content specifications and achievement levels to the assessment itself. While the content specifications establish the Consortium’s claims and the types of evidence that are needed to support these claims, more specificity is needed to develop items and tasks that measure the claims.

The first iteration of the item and task specifications was developed in 2011. In early 2012, the Consortium held a series of showcases where the contractors introduced the item and task specifications and collected feedback from member states. The item and task specifications were revised during the first quarter of 2012 using this feedback.

A small set of items was developed and administered in fall 2012 during a small-scale trial using the revised item and task specifications. This provided the Consortium with the first opportunity to administer and score the new item types. During the small-scale trial, the Consortium also conducted cognitive laboratories to better understand how students respond to various types of items (American Institutes for Research, 2013). The cognitive laboratories used a think-aloud methodology in which students speak their thoughts aloud while working on a test item. The item and task specifications were again revised based on the findings of the cognitive laboratories and the small-scale trial. These revised specifications were used to develop items for the 2013 pilot test, and they were again revised based on 2013 pilot test results and subsequent reviews by content experts.

The Smarter Balanced Item and Task Specifications (Smarter Balanced, 2015b) are designed to ensure that assessment items measure the assessment’s claims. Indeed, the purpose of item and task specifications is to define the characteristics of items and tasks that will provide evidence to support one or more claims. To do this, the item and task specifications delineate types of evidence that should be elicited for each claim within a grade level. Then, the specifications provide explicit guidance on how to write items in order to elicit the desired evidence.

Item and task specifications provide guidelines on how to create items specific to each claim and assessment target through the use of task models. In mathematics, a task model provides a description of an item/task’s key features. These task models describe the knowledge, skills, and processes being measured by each of the item types aligned to particular targets. In addition, task models sometimes provide examples of plausible distractors. Exemplar items are provided within every task model. In ELA/literacy, these functions are carried out through item specifications.

Task models were developed for each grade level and target to delineate the expectations of knowledge and skills to be represented through test items at each grade. In addition, both ELA/literacy and mathematics item and stimulus specifications provide guidance about grade appropriateness of task and stimulus materials (the materials a student must refer to when working on a test item). The task and stimulus models also provide information on characteristics of stimuli or activities to avoid because they are not germane to the knowledge, skill, or process being measured.

Guidelines concerning what to avoid in item writing are important because they underscore the Consortium’s efforts to use universal design principles to develop items accessible to the widest possible range of students. As the name suggests, the concept of universal design aims to create items that accurately measure the assessment target for all students. At the same time, universal design recognizes that one solution rarely works for all students. Instead, this framework acknowledges “the need for alternatives to suit many different people” (Rose & Meyer, 2000, p. 4).

To facilitate the application of universal design principles, item writers are trained to consider the full range of students who may answer a test item. A simple example of this is the use of vocabulary that is expected to be known by all third-grade students versus only those third-grade students who play basketball. Almost all third-grade students are familiar with activities (e.g., recess) that happen during their school day, while only a subset of these students will be familiar with basketball terms like “double dribble,” “layup,” “zone defense,” or “full-court press.”

Item specifications discuss accessibility issues unique to the creation of items for a particular claim and/or assessment target. Accessibility issues concern supports that various groups of students may need to access item content. By considering the supports that may be needed for each item, item writers are able to create items that can be adapted to a variety of needs.

The use of universal design principles allows the Consortium to collect evidence on the widest possible range of students. By writing items that adhere to item and task specifications, the Consortium is assured that assessments measure the claims and assessment targets established in the content specifications, as well as the knowledge, skills, and processes found in the CCSS for all students for whom the assessment is appropriate.

4.7 Item and Task Development

The Consortium’s test development cycle is iterative, involving experts from various education-related fields, and is based on assessment-related research and best practices. Each item that is used operationally on the Smarter Balanced summative assessment has been reviewed and/or written by educators. The active involvement of educators is critical to the success of the item-writing activities. Educators engage with students on a daily basis, and they understand the ways in which students can demonstrate their knowledge. Their involvement in item writing helps ensure that the items included in the assessment system are appropriate for the grade level and provide valid evidence of student learning. Section 4.7.1 describes vendor-managed item development that Smarter Balanced oversees. Section 4.7.2 describes member-managed item writing led by states. Section 4.7.3 explains the item review process that applies to all items.

4.7.1 Item Writing

The Consortium works with educators throughout the test development cycle to develop items. All K–12 participants:

  • are certified/licensed to teach ELA/literacy and/or mathematics in a K–12 public school;
  • are currently teaching in a public school within a Smarter Balanced governing state;
  • have taught ELA/literacy and/or mathematics in grades 3-8 and/or high school within the past three years (second-grade teachers are also recruited to participate in the development of grade 3 items and/or tasks);
  • have previously reviewed part or all of the CCSS for the content area for which they are writing items and/or performance tasks;
  • have submitted a statement of interest that describes their interest in developing Smarter Balanced items and/or performance tasks, along with their qualifications for doing so; and
  • have completed training and achieved qualifications through a certification process.

Qualifications for higher education faculty include:

  • current employment with, or recent retirement from, a college or university located within a Smarter Balanced member state;
  • having taught developmental and/or entry-level courses in English, composition, mathematics, statistics, or a related discipline within the last three years;
  • having previously reviewed part or all of the CCSS for the content area in which they are interested in writing items and/or performance tasks; and
  • having completed training and achieved qualifications through the certification process.

The Consortium’s staff trains contractors and educators on the item specifications, ELA/literacy stimulus specifications, and the guidelines for accessibility, bias, and sensitivity, as described in the next section.

Prior to the spring 2013 pilot test, the Consortium engaged 136 educators in K–12 and higher education from 19 member states to write items. Prior to the spring 2014 field test, 184 educators in K–12 and higher education from 16 member states participated in item writing. The items developed in this process were used in the 2014 field test and in the 2015 embedded field test. These items account for all of the items used in the 2020-21 summative assessment.

4.7.1.1 Training for Item Writers

For the development of all operational items in the 2020-21 summative assessment, educators participated in a series of facilitated online webinars in order to qualify as item writers. To facilitate participation, the Consortium scheduled multiple sessions in different time zones, including evening sessions. In addition to the facilitated sessions, the Consortium provided training modules that covered background on the Consortium, assessment design principles, and detailed information about item and performance task development. All modules were available in three formats: a PowerPoint presentation with notes, a streaming presentation with narration that could be viewed online, and a downloadable audio/video presentation.

For all item writing, including more recent processes, item writers are specifically trained on the Consortium’s content and item specifications, stimulus specifications, sensitivity and bias guidelines, and general accessibility guidelines. Training on these specifications and guidelines helps ensure that item writers write items that allow the widest possible range of students to demonstrate their knowledge, skills, and cognitive processes with regard to the content. This means that item writers need to understand the content for which they are writing items, as well as accessibility and sensitivity issues that might hinder students’ ability to answer an item. Item writers are also trained to be aware of issues that might unintentionally bias an item for or against a particular group.

4.7.2 Member-Managed Item Development

The Consortium invites member states to participate in a separate effort to write items. This voluntary effort, known as member-managed item development, is conducted to build the capacity of states to write items and to support the overall sustainability of the Consortium. To this end, three states (Hawaii, Oregon, and Washington) participated in the member-managed field test item development opportunity. During this opportunity, educators within the three states developed approximately 450 items in ELA/literacy and mathematics across grades 3-8 and high school.

4.7.3 Item Reviews

Once items are written, groups of educators review items and item stimuli prior to field testing. Item stimuli refer to the reading passages used on the ELA/literacy assessments or to the stimulus materials provided in the performance tasks in both ELA/literacy and mathematics. The reviews take into consideration accessibility, bias/sensitivity, and content.

Prior to the spring 2013 pilot test, 122 ELA/literacy educators and 106 mathematics educators reviewed items and performance tasks for accessibility, bias/sensitivity, or content, and 60 educators reviewed the ELA/literacy stimuli. Prior to the spring 2014 field test, 107 ELA/literacy educators and 157 mathematics educators from 14 states reviewed items and performance tasks, and 95 educators from 13 states reviewed the ELA/literacy stimuli.

The educator qualifications for the accessibility, bias/sensitivity, and content reviews are the same as the educator qualifications for item writing, except that participants are not required to submit a statement of interest. In addition, it is preferred (but not required) that educators have previous experience reviewing items, tasks, and/or stimuli.

During the committee reviews, educators specifically compare the items against the quality criteria for accessibility and for bias and sensitivity. The reviewers identify and resolve or reject any item, stimulus, or performance task that does not pass the criteria. This review removes any aspect that may negatively impact a student’s ability to access stimuli, items, or performance tasks, or to elicit valid evidence about an assessment target. Items flagged for accessibility, bias/sensitivity, and/or content concerns are either revised to address the issues identified by the panelists or removed from the item pool.

The committee also compares each stimulus, item, and performance task against the ELA/literacy or mathematics quality criteria. This review focuses on developmental appropriateness and alignment of stimuli, items, and performance tasks to the content specifications and appropriate depths of knowledge. Panelists in the content review also check the accuracy of the content, answer keys, and scoring materials. Items flagged for content concerns are either revised or removed from the item pool.

Details about the item development process in ELA/literacy and mathematics are found in Appendix A. These are the steps each item goes through before it can be presented to students.

4.8 Field Testing

After items pass the content, accessibility, bias, and sensitivity reviews, they become eligible for field testing. The first field test for developing the Smarter Balanced assessments was a stand-alone field test in 2014 prior to the first operational administration. Details of the 2014 field test can be found in Chapters 7, 8, and 9 of the 2014-15 Smarter Balanced Summative Technical Report (Smarter Balanced, 2016a).

Both CAT and PT items are field tested. For field testing in years subsequent to 2014, a small number of CAT items are embedded within each student’s operational CAT. These are called embedded field test (EFT) items. The number of EFT items administered per student is given in Table 4.2. CAT EFT items are administered randomly across a range of allowable positions within test segments as follows:

  • ELA/literacy: positions 5-34 (ELA/literacy has only one segment)
  • Mathematics Calculator: positions 5-15 within the calculator segment
  • Mathematics Non-Calculator: positions 5-10 within the non-calculator segment

Two EFT items are embedded in the mathematics CAT. For grades 6 and higher, one item is embedded in the calculator segment and one in the non-calculator segment. In the ELA/literacy CAT, three to four EFT items are administered to each examinee. The number of EFT items administered in ELA/literacy is a range instead of a constant because much of the ELA/literacy content is organized into passage sets and the number of items in a set varies.
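
The position assignment just described can be illustrated with a short sketch. This is a hypothetical illustration only: the position ranges come from the list above, but the function and dictionary names are invented here, and the operational delivery engine may implement the assignment differently.

```python
import random

# Allowable embedded field test (EFT) positions, as described above.
EFT_POSITION_RANGES = {
    ("ELA", "main"): range(5, 35),              # positions 5-34; single segment
    ("Math", "calculator"): range(5, 16),       # positions 5-15
    ("Math", "non-calculator"): range(5, 11),   # positions 5-10
}

def assign_eft_positions(subject: str, segment: str, n_eft_items: int) -> list:
    """Randomly choose distinct positions for EFT items within one segment."""
    allowed = list(EFT_POSITION_RANGES[(subject, segment)])
    return sorted(random.sample(allowed, n_eft_items))

# Example: one EFT item per mathematics segment (grades 6 and higher),
# and three to four EFT items in the single ELA/literacy segment.
calc_position = assign_eft_positions("Math", "calculator", 1)
noncalc_position = assign_eft_positions("Math", "non-calculator", 1)
ela_positions = assign_eft_positions("ELA", "main", random.choice([3, 4]))
```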

Table 4.2: NUMBER OF FIELD-TEST ITEMS TO BE ADMINISTERED PER STUDENT
Grade ELA/Literacy Math: Calc Math: No Calc
3 3 to 4 N/A 2
4 3 to 4 N/A 2
5 3 to 4 N/A 2
6 3 to 4 1 1
7 3 to 4 1 1
8 3 to 4 1 1
11 3 to 4 1 1

4.8.1 Field Testing of Performance Tasks

Performance tasks (PTs) are field tested as stand-alone fixed forms consisting of three to six items per task. Each PT is randomly administered to approximately 2,000 students in total across all participating states. Thus, only a small number of randomly selected students receive a field test PT. Students who take a field test PT do not take an operational PT; instead, they take a CAT that includes more operational items than the regular CAT to compensate for the lack of operational PT items. This CAT is built to the enhanced CAT blueprint, to which a link is provided.

4.9 Item Scoring

For those items that cannot be machine scored, the Consortium engages content experts in range-finding activities. Range finding improves the consistency and validity of scoring for the assessment. During range finding, educators focus on the performance tasks for ELA/literacy and mathematics. The participants review student responses against item rubrics, validate the rubrics’ accuracy, and select the anchor papers used by scorers during operational scoring of test items. In mathematics, educators also review constructed-response items for grades 7, 8, and high school. Following the 2013 pilot test, 102 participants from 20 states were engaged in range finding. After the spring 2014 field test, 104 educators participated in range finding. After the 2014–15 embedded field test, 34 educators participated in range finding.

The educator qualifications for range finding are the same as the educator qualifications for item writing. It is preferred (but not required) that educators have previous range-finding experience.

A rubric validation activity is conducted to verify correct scoring for machine-scored items. For multiple-choice items, this is a simple key check. For other item types, such as grid interaction items (drag-and-drop), matching tables, or equation entry, the procedure involves looking at a sample of raw student responses (screen coordinates or keystrokes) and assuring that the raw response was scored correctly. In the course of this process, reviewers may find unexpected responses that require adjustment of the scoring procedure to account for a wider response range. Item-scoring software is then changed accordingly.
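
The mechanics of this validation are not detailed in this report, so the following is only a hypothetical sketch of the kind of check involved: an assumed machine-scoring rule for a matching-table item is applied to a sample of raw responses, and disagreements with human judgments are surfaced for review, which may reveal an unexpected but acceptable response pattern that requires adjusting the scoring rule.

```python
def score_match_interaction(raw_response: set, key: set) -> int:
    """Assumed scoring rule: full credit only for an exact match with the key."""
    return 1 if raw_response == key else 0

def validate_rubric(sampled_responses, human_scores, key):
    """Compare machine scores for sampled raw responses against human judgments."""
    flagged = []
    for raw, human in zip(sampled_responses, human_scores):
        machine = score_match_interaction(raw, key)
        if machine != human:
            flagged.append({"response": raw, "machine": machine, "human": human})
    return flagged  # disagreements go back to reviewers

# Example: a matching-table key of (row, column) pairs and two sampled responses.
key = {("A", 1), ("B", 3)}
samples = [{("A", 1), ("B", 3)}, {("A", 2), ("B", 3)}]
human = [1, 0]
print(validate_rubric(samples, human, key))  # -> [] (machine and human agree)
```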

4.10 Item Quality Control and Data Review

After items are field tested, the Consortium carries out statistical analyses of field test data to determine the statistical quality of the items. On the basis of these results, some field-tested items are put into operational use, some are rejected from operational use, and others go through a process called data review. In a data review, items flagged based on statistical criteria are reviewed by educators in collaboration with Smarter Balanced staff, for possible content flaws, bias, and other features that might explain the statistical qualities. Items that go through data review may be subsequently revised and field-tested again in a future year, rejected, or accepted for operational use.
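
The statistical flagging criteria themselves are not listed in this chapter. Purely as a hypothetical illustration of how such screening could work, the sketch below flags field-tested items whose classical statistics fall outside assumed thresholds; the thresholds and field names are placeholders, not Smarter Balanced’s operational criteria.

```python
def flag_items_for_data_review(item_stats, p_range=(0.10, 0.95), min_item_total_corr=0.20):
    """Flag field-test items whose classical statistics fall outside assumed thresholds.

    `item_stats` maps item_id -> {"p_value": proportion correct,
                                  "item_total_corr": corrected item-total correlation}.
    Flagged items would then be reviewed by educators for content flaws, bias, or
    other features that might explain the statistics.
    """
    flagged = {}
    for item_id, stats in item_stats.items():
        reasons = []
        if not (p_range[0] <= stats["p_value"] <= p_range[1]):
            reasons.append("extreme difficulty")
        if stats["item_total_corr"] < min_item_total_corr:
            reasons.append("low discrimination")
        if reasons:
            flagged[item_id] = reasons
    return flagged
```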

4.11 CAT Algorithm

For the Smarter Balanced operational test, an item-level, fully adaptive CAT component is administered in ELA/literacy and mathematics. The adaptive component delivers the blueprint in a manner that minimizes measurement error and maximizes the information obtained about each student. Smarter Balanced members work with their service provider to adopt an algorithm that delivers the published blueprint.
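
The item-selection logic is specified in the publicly available adaptive algorithm report (Cohen & Albright, 2014) and is not reproduced here. As a minimal sketch of the core adaptive idea only, ignoring blueprint, segment, and exposure constraints and assuming a two-parameter logistic (2PL) model parameterized by the a and b values reported later in this chapter, an engine could choose each next item to maximize Fisher information at the student’s current ability estimate:

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    a: float      # discrimination
    b: float      # difficulty
    claim: int    # used by blueprint constraints (not enforced in this sketch)

def prob_correct(theta: float, item: Item) -> float:
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-item.a * (theta - item.b)))

def information(theta: float, item: Item) -> float:
    """Fisher information the item contributes at theta under the 2PL."""
    p = prob_correct(theta, item)
    return item.a ** 2 * p * (1.0 - p)

def select_next_item(theta_hat: float, pool: list, administered: set) -> Item:
    """Pick the unadministered item with maximum information at theta_hat."""
    eligible = [it for it in pool if it.item_id not in administered]
    return max(eligible, key=lambda it: information(theta_hat, it))
```

In the operational algorithm, this kind of information-based selection is balanced against the blueprint requirements (claims, targets, and DOK) described earlier in this chapter, so that every completed test covers the published blueprint.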

4.12 Content Alignment

Content alignment addresses how well individual test items, test blueprints, and the tests themselves represent the intended construct and support appropriate inferences. With a computer adaptive test, a student’s test form is a sampling of items drawn from a much larger universe of possible items and tasks. The sampling is guided by a blueprint. Alignment studies investigate how well individual tests cover the intended breadth and depth of the underlying content standards. For inferences from test results to be justifiable, the sample of items in each student’s test has to be an adequate representation of the broad domain, providing strong evidence to support claims being made from the test results.

Four alignment studies have been conducted to examine the alignment between Smarter Balanced tests and the CCSS. The Human Resources Research Organization (HumRRO, 2016) conducted the first alignment study. HumRRO’s comprehensive study, centered on the assumptions of evidence-centered design (ECD), examined the connections in the evidentiary chain underlying the development of the Smarter Balanced foundational documents (test blueprints, content specifications, and item/task specifications) and the resulting summative assessments. Among those connections were the alignment between the Smarter Balanced evidence statements and content specifications, and the alignment between the Smarter Balanced blueprint and the content specifications. Results from this study were favorable in terms of the intended breadth and depth of the alignment for each connection in the evidentiary chain.

In 2016, the Fordham Institute and HumRRO investigated the quality of the Smarter Balanced assessments relative to the Council of Chief State School Officers (CCSSO) criteria for evaluating high-quality assessments. In particular, the Smarter Balanced assessments were investigated to see if they placed strong emphasis on the most important content for college and career readiness and if they required that students demonstrate the range of thinking skills called for by those standards. The Fordham Institute reviewed grades 5 and 8 ELA/literacy and mathematics, and HumRRO reviewed high school ELA/literacy and mathematics.

  • Doorey & Polikoff (2016) rated Smarter Balanced grades 5 and 8 ELA/literacy assessments an excellent match to the CCSSO criteria for content in ELA/literacy, and a good match for depth in ELA/literacy.
  • Fordham Institute rated Smarter Balanced grades 5 and 8 mathematics assessments as a good match to the CCSSO criteria for content in mathematics, and a good match to the CCSSO criteria for depth in mathematics.
  • HumRRO (2016) rated the Smarter Balanced high school ELA/literacy assessments an excellent match to the CCSSO criteria for content in ELA/literacy, and a good to excellent match for depth in ELA/literacy.
  • HumRRO (2016) rated the Smarter Balanced high school mathematics assessments a good to excellent match to the CCSSO criteria for content in mathematics, and a good to excellent match for depth in mathematics.

An additional external alignment study, completed by WestEd Standards, Assessment, and Accountability Services Program (2017), employed a modified Webb alignment methodology to examine the summative assessments for grades 3, 4, 6, and 7 using sample test events built using 2015–16 operational data. This study provided evidence that the items within ELA/literacy and mathematics test events for grades 3, 4, 6, and 7 were well aligned to the CCSS in terms of both content and cognitive complexity.

4.13 2020-21 Summative Item Pool

This section describes the 2020-21 summative item pool.

Each grade’s item pool is large enough to support the summative blueprint. Unlike a traditional paper/pencil test where all students take the same items, students taking the CAT take items and tasks targeted to their ability level. This means that the Consortium needs to develop a large number of items to deliver tests that simultaneously meet the blueprint and are at a level of difficulty that is tailored to each student’s performance.

In addition to the items for the CAT, the Consortium also developed performance tasks. All students take performance tasks designed to measure a student’s ability to integrate knowledge and skills across multiple claims and assessment targets. Prior to 2018–19, each ELA/literacy performance task had a set of related stimuli presented with two or three research items and an essay. Beginning with the 2018–19 assessment, the performance task includes only one research item, and the reduction is compensated by including more research items in the CAT component. Each mathematics performance task continues to have four to six items relating to a central problem or stimulus. The PT items are organized into distinct sets that are delivered intact to students. The number of PT item sets per grade and subject in the 2020-21 summative assessment is shown in Table 4.3. The sets are delivered in randomized fashion to students rather than adaptively.

Table 4.3: NUMBER OF PERFORMANCE TASKS BY GRADE AND SUBJECT
Grade ELA/literacy Mathematics
3 18 16
4 21 21
5 24 19
6 18 16
7 23 16
8 24 16
11 23 13

The distributions of item parameters by grade and claim are shown in Table 4.4 (ELA/literacy) and Table 4.5 (mathematics). Note that there is a wide range of difficulty in each category. This enables the CAT algorithm (described previously in this chapter) to find the best items for each student. As a result, adaptive tests provide more precise measurement across all levels of student performance than would be provided by a fixed-form test of the same length. This is accomplished by drawing from a bank of previously calibrated items during the adaptive portion of the test. In addition, the fixed, randomly assigned performance tasks add information about student performance.
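
As a point of reference for reading Tables 4.4 and 4.5, the b parameters are item difficulties on the same theta scale used for student ability, and the a parameters are discriminations. Assuming the dichotomous items follow a two-parameter logistic (2PL) form (polytomous items are calibrated with a partial-credit extension, and any operational scaling constant is omitted here), the response and information functions are:

```latex
P_i(\theta) = \frac{1}{1 + \exp\!\left[-a_i\,(\theta - b_i)\right]},
\qquad
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)
```

An item contributes the most information for students whose ability is near its b value, and items with larger a values concentrate more information there, which is why a wide spread of b values in the pool supports precise adaptive measurement across the ability range.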

Table 4.4: ITEM DIFFICULTY (B-PARAMETER) AND DISCRIMINATION (A-PARAMETER), ELA/LITERACY
Grade Claim # of Items b parameter Mean b parameter Min b parameter Max a parameter Mean
3 1 317 -0.542 -2.725 4.693 0.70
2 239 -0.818 -2.896 4.115 0.68
3 183 -0.171 -2.920 3.815 0.54
4 140 -0.227 -2.216 1.864 0.68
Total 879 -0.490 -2.920 4.693 0.66
4 1 255 0.282 -2.529 6.233 0.63
2 246 -0.394 -3.252 2.935 0.59
3 194 0.047 -2.822 4.254 0.55
4 148 0.496 -1.939 3.727 0.56
Total 843 0.068 -3.252 6.233 0.58
5 1 273 0.669 -1.784 5.651 0.61
2 244 0.007 -2.278 3.294 0.61
3 148 0.491 -2.403 3.481 0.53
4 144 0.556 -1.494 3.832 0.66
Total 809 0.417 -2.403 5.651 0.60
6 1 247 1.096 -1.636 4.779 0.60
2 255 0.836 -2.719 5.542 0.56
3 160 0.827 -1.497 7.385 0.50
4 147 0.972 -1.305 3.609 0.56
Total 809 0.938 -2.719 7.385 0.56
7 1 251 1.383 -1.836 6.630 0.57
2 236 1.089 -2.019 5.305 0.56
3 156 0.878 -1.706 5.885 0.50
4 116 1.765 -0.815 5.613 0.56
Total 759 1.246 -2.019 6.630 0.55
8 1 248 1.548 -1.170 6.421 0.59
2 267 1.088 -3.013 4.558 0.53
3 185 0.940 -2.119 3.871 0.48
4 123 1.621 -1.788 5.188 0.57
Total 823 1.273 -3.013 6.421 0.54
11 1 874 1.943 -2.087 9.101 0.54
2 742 1.768 -1.880 9.145 0.47
3 589 1.351 -1.648 6.621 0.45
4 386 2.023 -1.197 8.941 0.47
Total 2,591 1.770 -2.087 9.145 0.49


Table 4.5: ITEM DIFFICULTY (B-PARAMETER) AND DISCRIMINATION (A-PARAMETER), MATHEMATICS
Grade Claim # of Items b parameter Mean b parameter Min b parameter Max a parameter Mean
3 1 771 -1.127 -4.338 4.163 0.84
2 123 -0.502 -2.537 1.380 1.00
3 236 -0.130 -2.424 5.116 0.73
4 151 -0.174 -2.677 3.201 0.80
Total 1,281 -0.771 -4.338 5.116 0.83
4 1 826 -0.277 -3.260 4.483 0.84
2 149 -0.069 -2.248 2.574 0.89
3 246 0.304 -2.083 5.184 0.75
4 162 0.266 -2.148 3.284 0.70
Total 1,383 -0.088 -3.260 5.184 0.81
5 1 777 0.341 -2.791 6.202 0.77
2 126 0.769 -2.208 3.939 0.92
3 247 0.910 -1.903 5.976 0.67
4 172 1.194 -1.232 4.634 0.70
Total 1,322 0.599 -2.791 6.202 0.76
6 1 737 0.865 -3.934 9.158 0.69
2 115 1.118 -2.978 5.497 0.77
3 219 1.792 -2.161 8.754 0.59
4 121 1.652 -0.853 6.439 0.78
Total 1,192 1.140 -3.934 9.158 0.69
7 1 683 1.772 -1.792 7.801 0.73
2 114 1.747 -1.085 5.071 0.83
3 171 2.223 -1.654 6.594 0.61
4 126 2.047 -0.881 4.777 0.76
Total 1,094 1.871 -1.792 7.801 0.73
8 1 624 2.047 -1.868 7.752 0.58
2 84 2.441 -1.570 5.751 0.74
3 159 2.831 -0.878 9.022 0.50
4 100 2.214 -2.844 6.476 0.68
Total 967 2.227 -2.844 9.022 0.59
11 1 1,813 2.366 -4.432 8.724 0.60
2 185 2.969 -1.101 6.680 0.63
3 423 3.049 -1.049 9.254 0.46
4 202 3.226 0.324 6.379 0.54
Total 2,623 2.585 -4.432 9.254 0.58

The Consortium develops many different types of items beyond the traditional multiple-choice item. This is done to measure claims and assessment targets with varying degrees of complexity by allowing students to respond in a variety of ways rather than simply recognizing a correct response. These different item types and their abbreviations are listed in Table 4.6. The frequencies of item types by claim within grade and subject are shown in Table 4.7 and Table 4.8. Note that each essay written is associated with two items. Essays are scored on three traits, two of which are combined, resulting in two scores for each essay.

Table 4.6: ITEM TYPES FOUND IN THE SUMMATIVE ITEM POOLS
Item Types ELA/literacy Mathematics
Multiple Choice (MC) X X
Multiple Select (MS) X X
Evidence-Based Selected Response (EBSR) X
Match Interaction (MI) X X
Hot Text (HTQ) X
Short Answer Text Response (SA) X X
Essay/Writing Extended Response (WER) X
Equation Response (EQ) X
Grid Item Response (GI) X
Table Interaction (TI) X


Table 4.7: DISTRIBUTION OF ELA/LITERACY ITEM TYPES BY GRADE AND CLAIM
Grade Claim EBSR HTQ MC MI MS SA WER Total
3 1 47 54 168 0 48 0 0 317
2 0 49 117 0 55 0 18 239
3 47 0 77 20 39 0 0 183
4 0 21 62 6 40 11 0 140
Total 94 124 424 26 182 11 18 879
4 1 48 51 107 0 49 0 0 255
2 0 50 127 0 48 0 21 246
3 48 0 88 20 38 0 0 194
4 0 19 59 5 51 14 0 148
Total 96 120 381 25 186 14 21 843
5 1 53 47 113 0 60 0 0 273
2 0 43 115 0 62 0 24 244
3 38 0 65 17 28 0 0 148
4 0 24 53 3 49 15 0 144
Total 91 114 346 20 199 15 24 809
6 1 35 58 77 0 48 29 0 247
2 0 52 91 0 67 27 18 255
3 41 0 77 18 24 0 0 160
4 0 11 68 3 52 13 0 147
Total 76 121 313 21 191 69 18 809
7 1 38 49 91 0 48 25 0 251
2 0 47 83 0 60 23 23 236
3 43 0 69 13 31 0 0 156
4 0 33 31 3 28 21 0 116
Total 81 129 274 16 167 69 23 759
8 1 42 48 77 0 50 31 0 248
2 0 44 96 0 77 26 24 267
3 25 0 118 5 37 0 0 185
4 0 29 40 4 31 19 0 123
Total 67 121 331 9 195 76 24 823
11 1 155 168 233 0 213 105 0 874
2 0 182 234 0 253 50 23 742
3 110 0 326 18 135 0 0 589
4 0 86 165 14 102 19 0 386
Total 265 436 958 32 703 174 23 2,591
All Total 770 1,165 3,027 149 1,823 428 151 7,513


Table 4.8: DISTRIBUTION OF MATHEMATICS ITEM TYPES BY GRADE AND CLAIM
Grade Claim EQ GI MC MI MS SA TI Total
3 1 482 69 116 67 4 0 33 771
2 80 16 14 6 5 2 0 123
3 13 56 86 22 34 25 0 236
4 55 17 31 10 12 9 17 151
Total 630 158 247 105 55 36 50 1,281
4 1 438 81 112 181 0 0 14 826
2 92 13 31 7 4 0 2 149
3 23 78 62 19 32 31 1 246
4 44 18 59 4 10 16 11 162
Total 597 190 264 211 46 47 28 1,383
5 1 415 47 221 94 0 0 0 777
2 92 13 11 2 3 0 5 126
3 17 64 86 20 24 33 3 247
4 61 35 27 6 6 19 18 172
Total 585 159 345 122 33 52 26 1,322
6 1 353 72 63 100 134 0 15 737
2 71 14 7 3 11 2 7 115
3 23 51 49 30 40 26 0 219
4 61 11 8 3 12 13 13 121
Total 508 148 127 136 197 41 35 1,192
7 1 377 51 54 73 128 0 0 683
2 79 7 8 6 11 0 3 114
3 23 45 34 17 32 20 0 171
4 68 26 16 2 10 1 3 126
Total 547 129 112 98 181 21 6 1,094
8 1 245 43 160 77 79 0 20 624
2 54 12 4 4 2 0 8 84
3 15 52 24 18 27 23 0 159
4 40 23 15 5 7 6 4 100
Total 354 130 203 104 115 29 32 967
11 1 666 295 423 312 110 0 7 1,813
2 92 39 24 12 13 0 5 185
3 50 141 120 50 38 23 1 423
4 93 30 43 13 12 6 5 202
Total 901 505 610 387 173 29 18 2,623
All Total 4,122 1,419 1,908 1,163 800 255 195 9,862

Although there is a wide distribution of item difficulty, pools tend to be difficult in relation to the population and to the cut score that is typically associated with proficiency (the level 3 cut score). Figure 4.3 shows mean item difficulty, level 3 cut score, and mean student achievement scores (all in theta units) by grade and subject. The mean item difficulty and student achievement plotted in this figure are based on the 2020-21 assessment.


Figure 4.3: Comparison of Item Difficulty, Mean Student Scores, and Cut Scores for ELA/Literacy and Mathematics

4.14 Blueprint Fidelity

Whether the tests students receive in Smarter Balanced assessments satisfy the blueprint requirements described earlier in this chapter depends on two basic elements of test design: 1) the computer adaptive test (CAT) algorithm and 2) the item pool. The CAT algorithm endorsed by Smarter Balanced is publicly available (Cohen & Albright, 2014) and is used by Cambium Assessment to deliver Smarter Balanced assessments in the majority of member states. Key features of the item pool are described in the preceding section and include the number of items in specific areas of the blueprint (such as claims) and their distribution in difficulty relative to the distribution of student achievement. This section presents results from blueprint fidelity analyses carried out with 2020-21 operational assessment data to examine how well Smarter Balanced assessments satisfied the full or adjusted blueprints.

Analyses were performed for both ELA/literacy and mathematics and in all the tested grade levels (3–8 and high school). For ELA/literacy, blueprint fulfillment was evaluated separately for three populations at each grade level: the general student population, the braille student population, and the American Sign Language (ASL) population. For mathematics, blueprint fulfillment was evaluated separately for five populations at each grade level: the general student population, the braille student population, the ASL student population, the Spanish student population, and the translated glossaries population. Only operational items from the computerized adaptive test (CAT) component were considered in this study; field test items and performance task (PT) component items were not included.

For each population of students within grade and content area, fulfillment of both the Smarter Balanced full and adjusted blueprints was evaluated at the following levels of detail (a schematic sketch of this type of check follows the list):

  1. Claims. The blueprint specifies the number of items per claim and, for ELA/literacy, the number of items associated with informational vs. literary texts in claim 1. In mathematics, the blueprint fidelity analysis combines claims 2 and 4 since these claims are combined for purposes of subscore reporting.
  2. Targets or target groups within claims. The number of items per target group is specified in the blueprint.
  3. Depth of Knowledge (DOK) requirements. In both ELA/literacy and mathematics, the blueprint specifies the number of items that must represent a given DOK level or higher within certain categories of the blueprint, such as claims and target groups.
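
The sketch below shows, schematically, what the claim-level portion of such a check could look like for a single delivered test. The item-count bounds are placeholders rather than the published blueprint requirements, and the operational analysis also evaluates target groups and DOK in the same fashion.

```python
from collections import Counter

# Placeholder claim-level bounds (min, max operational CAT items); the published
# blueprints specify the actual counts per claim, target group, and DOK.
CLAIM_REQUIREMENTS = {1: (16, 20), "2&4": (8, 10), 3: (8, 10)}

def check_claim_counts(delivered_claims, requirements=CLAIM_REQUIREMENTS):
    """Return claims whose delivered item counts fall outside the blueprint bounds.

    `delivered_claims` lists the claim label of each operational CAT item on one
    test; mathematics claims 2 and 4 are pooled, mirroring subscore reporting.
    """
    counts = Counter("2&4" if c in (2, 4) else c for c in delivered_claims)
    violations = {}
    for claim, (low, high) in requirements.items():
        n = counts.get(claim, 0)
        if not (low <= n <= high):
            violations[claim] = n
    return violations  # an empty dict means the claim-level blueprint was met
```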

The analyses showed that the operational tests delivered in the 2020-21 administration fulfilled the blueprint requirements very well. Virtually all tests delivered to students in the general population met blueprint requirements for the number of items per claim, target group, and DOK within claim. Specifically:

In ELA/literacy full blueprint:

  • At the level of claims within grade, at least 99.7% of tests met the blueprint across 35 evaluations.
  • At the level of target groups within grade, at least 95.7% of tests met the blueprint across 87 evaluations.
  • At the level of DOK requirements within grade, at least 99.5% of tests met the blueprint across 115 evaluations.

In ELA/literacy adjusted blueprint:

  • At the level of claims within grade, at least 98.0% of tests met the blueprint across 35 evaluations.
  • At the level of target groups within grade, at least 98.1% of tests met the blueprint across 97 evaluations.
  • At the level of DOK requirements within grade, at least 98.7% of tests met the blueprint across 34 evaluations.

In mathematics full blueprint:

  • At the level of claims within grade, at least 99.8% of tests met the blueprint across 34 evaluations.
  • At the level of target groups within grade, at least 99.5% of tests met the blueprint across 115 evaluations.
  • At the level of DOK requirements within grade, at least 99.8% of tests met the blueprint across 34 evaluations.

In mathematics adjusted blueprint:

  • At the level of claims within grade, at least 98.7% of tests met the blueprint across 34 evaluations.
  • At the level of target groups within grade, at least 98.6% of tests met the blueprint across 115 evaluations.
  • At the level of DOK requirements within grade, at least 98.7% of tests met the blueprint across 34 evaluations.

For special populations, the average percentages of blueprint met across all students, grades, and blueprint evaluations were usually over 90%. Exceptions included the mathematics full blueprint for the American Sign Language and Spanish populations, as well as the mathematics adjusted blueprint for the braille population (Table 4.9).

Table 4.9: MEAN PERCENTAGE OF BLUEPRINT MET IN SPECIAL POPULATIONS
Blueprint Content Area Group Category Evaluations Mean Blueprint Met
Full ELA/L American Sign Language Claim 35 100.0%
Target 87 100.0%
DOK 39 100.0%
Math Claim 34 89.2%
Target 115 88.8%
DOK 34 88.2%
Spanish Claim 34 85.6%
Target 115 87.5%
DOK 34 91.1%
Translated Glossaries Claim 34 90.0%
Target 115 90.6%
DOK 34 92.8%
Adjusted ELA/L American Sign Language Claim 35 100.0%
Target 97 100.0%
DOK 39 100.0%
Braille Claim 35 99.0%
Target 97 98.1%
DOK 39 99.1%
Math American Sign Language Claim 34 99.4%
Target 115 99.5%
DOK 34 99.5%
Braille Claim 34 88.6%
Target 115 93.1%
DOK 34 93.6%
Spanish Claim 34 98.1%
Target 115 98.7%
DOK 34 98.7%
Translated Glossaries Claim 34 99.5%
Target 115 96.7%
DOK 34 98.9%

Deviations from blueprint requirements, though rare, are investigated by Smarter Balanced. For purposes of future item development, Smarter Balanced notes the few combinations of requirements that were met by fewer than 90% of tests. These cases are more likely to occur for combinations of claims, targets, and DOK requirements and within certain grades and accommodations pools. They indicate the possibility of systematic shortages or surpluses of items in some areas of the blueprint that should be addressed through item development. The possibilities that the CAT algorithm should be adjusted or that the blueprint is more restrictive than necessary and should be modified are also considered. Also considered are the sample sizes for some groups, such as students who take the braille test, which can be quite small and cause blueprint fidelity percentages to fall below a certain threshold, such as 90%, by chance.

Practical, logistical constraints in test delivery that are not accommodated in the blueprint can also lead to minor deviations between the blueprints and the tests actually delivered to students. The mathematics CAT session in grades 6 and higher is a mixture of calculator and non-calculator items. For logistical reasons, the CAT session is therefore partitioned into sequential calculator and non-calculator segments. The blueprint does not specify the number of items or any other details pertaining to each segment. This lack of specificity can occasionally lead to a distribution of items in the first segment that cannot be balanced in complementary fashion in the second segment, such that both segments combined meet the blueprint. This and other similar issues will have to be addressed in the future by one or more of the following: 1) increasing the specificity of goals in item development, 2) modifying the CAT algorithm, and 3) modifying the test blueprints.

4.15 Item Exposure

Item exposure, like test blueprint fidelity, is a function of the item pool and CAT algorithm, which are basic features of test design. Hence, information about item exposure is included in this chapter on test design. Item exposure rates were obtained using online and adaptive test instances with valid scale scores for which item data were available from the 2020-21 summative administration. The exposure rate for a given item is the proportion of test instances in the grade and content area on which the item appeared.
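
As a minimal sketch of this computation (record layouts and function names are assumptions, not the operational implementation), exposure rates and the bins reported later in Tables 4.12 and 4.13 could be derived as follows:

```python
from collections import Counter

def exposure_rates(test_instances, pool_item_ids):
    """Exposure rate = proportion of test instances on which each item appeared.

    `test_instances` is an iterable of sets of item IDs administered to each
    student; `pool_item_ids` lists every operational item eligible for the grade
    and content area, so items never administered receive a rate of 0.0.
    """
    n_tests = 0
    counts = Counter()
    for items in test_instances:
        n_tests += 1
        counts.update(items)
    return {item_id: counts[item_id] / n_tests for item_id in pool_item_ids}

def bin_exposure_rates(rates, width=0.1):
    """Tally items into 'unused' plus half-open bins (0.0, 0.1], (0.1, 0.2], ..."""
    bins = Counter()
    for rate in rates.values():
        if rate == 0:
            bins["unused"] += 1
            continue
        k = 1
        while rate > k * width:
            k += 1
        bins[f"({(k - 1) * width:.1f}, {k * width:.1f}]"] += 1
    return bins
```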

Table 4.10 and Table 4.11 present a summary of the item exposure results for ELA/literacy and mathematics, respectively. Within each grade and component (CAT and PT), both tables present the number of items in the operational pool (N), along with various descriptive statistics, including the mean, standard deviation (SD), range (min, max), and median of the observed exposure rates. For example, Table 4.10 shows that, on average, each CAT item eligible for administration at grade 3 was seen by 3% of grade 3 examinees. As a rule of thumb, Smarter Balanced attempts to maintain a maximum exposure rate of 25% (i.e., no more than 25% of examinees will see the same item). Table 4.10 shows that the mean and median exposure rates for ELA/literacy items are well below 25%. Table 4.11 shows that the mean and median exposure rates for mathematics items are also well below 25%. Patterns of item exposure for PT items will differ from those for CAT items due to the fact that PT item sets are randomly selected and administered within grade, whereas CAT items are administered adaptively.

Table 4.12 and Table 4.13 provide further information about item exposure by showing the number of items in the operational pool (N) and the proportion of items with exposure rates falling into ranges of width 0.1, including items that were never administered (unused). Due to rounding and the large number of items per grade within subject, values of 0.00 in these tables do not necessarily mean that no items had exposure rates falling into the ranges indicated by the column headings. Table 4.12 and Table 4.13 show that exposure rates for the vast majority of items were between 0 and 10% (0.0–0.1].

Table 4.10: SUMMARY OF ELA/LITERACY ITEM EXPOSURE RATES BY GRADE AND COMPONENT
Grade Type N Mean SD Min Max Median
3 CAT 857 0.03 0.03 0.00 0.20 0.02
4 CAT 866 0.04 0.04 0.00 0.26 0.02
5 CAT 863 0.04 0.04 0.00 0.27 0.02
6 CAT 824 0.04 0.04 0.00 0.28 0.02
7 CAT 763 0.04 0.05 0.00 0.26 0.02
8 CAT 819 0.03 0.03 0.00 0.23 0.02
HS CAT 2510 0.01 0.01 0.00 0.08 0.01
3 PT 37 0.06 0.04 0.00 0.15 0.08
4 PT 45 0.06 0.04 0.00 0.14 0.08
5 PT 54 0.05 0.03 0.00 0.13 0.02
6 PT 40 0.06 0.04 0.00 0.14 0.06
7 PT 50 0.05 0.03 0.00 0.12 0.05
8 PT 53 0.07 0.05 0.00 0.17 0.04
HS PT 46 0.10 0.06 0.04 0.16 0.10


Table 4.11: SUMMARY OF MATHEMATICS ITEM EXPOSURE RATES BY GRADE AND COMPONENT
Grade Type N Mean SD Min Max Median
3 CAT 1192 0.02 0.02 0.00 0.11 0.01
4 CAT 1290 0.02 0.02 0.00 0.19 0.01
5 CAT 1260 0.02 0.02 0.00 0.14 0.01
6 CAT 1120 0.02 0.02 0.00 0.15 0.01
7 CAT 1021 0.02 0.03 0.00 0.19 0.01
8 CAT 902 0.02 0.04 0.00 0.29 0.01
HS CAT 2557 0.01 0.02 0.00 0.26 0.00
3 PT 90 0.06 0.03 0.04 0.13 0.04
4 PT 106 0.04 0.02 0.03 0.11 0.03
5 PT 105 0.05 0.02 0.04 0.11 0.04
6 PT 92 0.06 0.02 0.05 0.12 0.05
7 PT 82 0.05 0.02 0.04 0.13 0.04
8 PT 79 0.05 0.02 0.04 0.11 0.04
HS PT 66 0.07 0.01 0.07 0.09 0.07


Table 4.12: PROPORTION OF ELA/LITERACY ITEMS BY EXPOSURE RATES
Grade Type N Unused (0.0, 0.1] (0.1, 0.2] (0.2, 0.3] (0.3, 0.4] (0.4, 0.5] (0.5, 0.6] (0.6, 0.7] (0.7, 0.8] (0.8, 0.9] (0.9, 1.0]
3 CAT 857 0 0.93 0.07 0.00 0 0 0 0 0 0 0
4 CAT 866 0 0.91 0.08 0.01 0 0 0 0 0 0 0
5 CAT 863 0 0.93 0.06 0.01 0 0 0 0 0 0 0
6 CAT 824 0 0.90 0.09 0.01 0 0 0 0 0 0 0
7 CAT 763 0 0.89 0.10 0.01 0 0 0 0 0 0 0
8 CAT 819 0 0.94 0.06 0.00 0 0 0 0 0 0 0
HS CAT 2510 0 1.00 0.00 0.00 0 0 0 0 0 0 0
3 PT 37 0 0.92 0.08 0.00 0 0 0 0 0 0 0
4 PT 45 0 0.96 0.04 0.00 0 0 0 0 0 0 0
5 PT 54 0 0.96 0.04 0.00 0 0 0 0 0 0 0
6 PT 40 0 0.92 0.07 0.00 0 0 0 0 0 0 0
7 PT 50 0 0.96 0.04 0.00 0 0 0 0 0 0 0
8 PT 53 0 0.55 0.45 0.00 0 0 0 0 0 0 0
HS PT 46 0 0.50 0.50 0.00 0 0 0 0 0 0 0


Table 4.13: PROPORTION OF MATHEMATICS ITEMS BY EXPOSURE RATES
Grade Type N Unused (0.0, 0.1] (0.1, 0.2] (0.2, 0.3] (0.3, 0.4] (0.4, 0.5] (0.5, 0.6] (0.6, 0.7] (0.7, 0.8] (0.8, 0.9] (0.9, 1.0]
3 CAT 1192 0 1.00 0.00 0.00 0 0 0 0 0 0 0
4 CAT 1290 0 1.00 0.00 0.00 0 0 0 0 0 0 0
5 CAT 1260 0 0.99 0.01 0.00 0 0 0 0 0 0 0
6 CAT 1120 0 0.99 0.01 0.00 0 0 0 0 0 0 0
7 CAT 1021 0 0.97 0.03 0.00 0 0 0 0 0 0 0
8 CAT 902 0 0.96 0.04 0.01 0 0 0 0 0 0 0
HS CAT 2557 0 0.99 0.01 0.00 0 0 0 0 0 0 0
3 PT 90 0 0.87 0.13 0.00 0 0 0 0 0 0 0
4 PT 106 0 0.92 0.08 0.00 0 0 0 0 0 0 0
5 PT 105 0 0.89 0.11 0.00 0 0 0 0 0 0 0
6 PT 92 0 0.87 0.13 0.00 0 0 0 0 0 0 0
7 PT 82 0 0.90 0.10 0.00 0 0 0 0 0 0 0
8 PT 79 0 0.95 0.05 0.00 0 0 0 0 0 0 0
HS PT 66 0 1.00 0.00 0.00 0 0 0 0 0 0 0

4.16 Summary of Test Design

The intent of this chapter is to show how the assessment design supports the purposes of Smarter Balanced summative assessments. Content specifications were derived directly from the CCSS, expressing the standards as measurable elements that are made explicit in the Smarter Balanced structure of claims and assessment targets. Building on the content specifications, test blueprints provide appropriate proportions of CCSS content coverage. Using the blueprints, item writers wrote items and tasks in quantities that supported CAT and performance task delivery. Expansion of item and task types promoted student responses that provide more insight into proficiency than that provided by multiple-choice items alone. The use of performance tasks addresses the need to assess application and integration of skills. The method of delivery and test scoring, combining adaptive and non-adaptive elements, provides the most precise information and an enhanced student testing experience. The 27 major types of assessment design specifications are summarized in Appendix B.

The measurement properties summarized in Chapter 2 and in the sections of this chapter on item exposure and blueprint fidelity are very much functions of the item pool and CAT algorithm. The CAT algorithm has not substantially changed since it was first used in the 2014–15 summative assessment. Details of this algorithm are available in a separate report (Cohen & Albright, 2014). Details concerning the item pool are provided in this chapter. The item pool used in the 2020-21 summative assessment was large and, although relatively difficult compared to the students assessed, supported the delivery of reliable CAT tests, met blueprint requirements, and did not overexpose items. These outcomes support the conclusion that the 2020-21 summative assessment was well designed.

References

American Institutes for Research. (2013). Cognitive laboratories technical report.
Cohen, J., & Albright, L. (2014). Smarter balanced adaptive item selection algorithm design report.
Dana, T. M., & Tippins, D. J. (1993). Considering alternative assessments for middle level learners. Middle School Journal, 25(2), 3–5.
Doorey, N., & Polikoff, M. (2016). Evaluating the content and quality of next generation assessments. Thomas B. Fordham Institute. Retrieved from https://eric.ed.gov/?id=ED565742.
Hansen, E. G., & Mislevy, R. J. (2008). Design patterns for improving accessibility for test takers with disabilities (pp. i–32) [ETS Research Report]. https://doi.org/10.1002/j.2333-8504.2008.tb02135.x
HumRRO. (2016). Smarter Balanced Assessment Consortium: Alignment Study Report. Retrieved from https://portal.smarterbalanced.org/library/smarter-balanced-assessment-consortium-alignment-study-report/.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20.
Mislevy, R. J., Steinberg, L. S., & Almond, R. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–67.
Rose, D., & Meyer, A. (2000). Universal design for learning. Journal of Special Education Technology, 15, 67–70.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational measurement, 4th ed. American Council on Education/Praeger.
Smarter Balanced. (2015b). Item and task specifications. Retrieved from http://www.smarterbalanced.org/assessments/development/.
Smarter Balanced. (2015c). Style guide. Retrieved from https://portal.smarterbalanced.org/library/style-guide-for-smarter-balanced-assessments/.
Smarter Balanced. (2016a). 2013-2014 technical report. Retrieved from https://portal.smarterbalanced.org/library/2013-14-technical-report/.
Smarter Balanced. (2016b). Accessibility and accommodations framework. Retrieved from https://portal.smarterbalanced.org/library/accessibility-and-accommodations-framework/.
Smarter Balanced. (2017b). English Language Arts/Literacy Content Specifications. Retrieved from https://portal.smarterbalanced.org/library/english-language-artsliteracy-content-specifications/.
Smarter Balanced. (2017d). Mathematics content specifications. Retrieved from https://portal.smarterbalanced.org/library/mathematics-content-specifications/.
Smarter Balanced. (2022a). Bias and sensitivity guidelines. Retrieved from https://portal.smarterbalanced.org/library/bias-and-sensitivity-guidelines/.
Smarter Balanced. (2022d). Usability, accessibility, and accommodations guidelines. Version 5.2. Retrieved from change log at https://portal.smarterbalanced.org/library/usability-accessibility-and-accommodations-guidelines/.
WestEd Standards, Assessment, and Accountability Services Program. (2017). Evaluation of the alignment between the common core state standards and the smarter balanced assessment consortium summative assessments for grades 3, 4, 6, and 7 in english language arts/literacy and mathematics. Retrieved from https://portal.smarterbalanced.org/library/wested-alignment-evaluation/.
Zhang, T., Haertel, G., Javitz, H., Mislevy, R. J., & Wasson, J. (2009). A design pattern for a spelling assessment for students with disabilities. Paper presented at the annual conference of the American Psychological Association, Montreal, Canada.

  1. There are a small number of ELA/literacy passages that may be administered with the full blueprint but not the adjusted blueprint due to implementation details related to target requirements.