# Smarter Balanced Scoring Specifications for Summative and Interim Assessments

*2023-05-31*

# 1 Introduction

This document describes the scoring methods of the Smarter Balanced ELA/literacy and Mathematics interim and summative assessments for grades 3-8 and high school, designed for accountability purposes. In some instances, the document specifies options available to vendors that may differ from the approach used in the open source test scoring system.

# 2 Estimating Student Proficiency

## 2.1 Maximum Likelihood Estimation of Theta Scores

Maximum likelihood estimation (MLE) is used to construct the \(\theta\) score for examinee \(j\) based on their pattern of responses to items \(i\) where \((i = 1, 2, ..., k_{j})\). The likelihood for each examinee is:

\[\begin{equation} L_{j}(\theta_{j}|Z_{j},\mathbf{x})= \prod^{k_{j}}_{i=1} p_{i}(z_{ij}|\theta_{j},\mathbf{x}_{i}), \tag{2.1} \end{equation}\]

where \(Z_{j} = (z_{1j}, z_{2j}, ..., z_{k_{j}j})\) is the examinee’s response pattern and \(\mathbf{x} = (\mathbf{x}_{1},\mathbf{x}_{2},...,\mathbf{x}_{k_{j}})\) holds the operational item parameters. The probability of each item response, \(p_{i}\), is specified by the operational item response model.

If item \(i\) is scored in two categories, the probability of a correct response is given by the two-parameter logistic model (2PL):

\[\begin{equation} p_{i}(z_{ij}=1) = \frac{exp[1.7a_{i}(\theta_{j}-b_{i})]}{1+exp[1.7a_{i}(\theta_{j}-b_{i})]}, \tag{2.2} \end{equation}\]

where \(exp\) refers to the exponential function, and \(a_{i}\) and \(b_{i}\) are the discrimination and difficulty parameters for item \(i\), respectively.

If item \(i\) is scored in \(m\) categories where \(m>2\), the probability of responding in category \(v\) where \(v=0,1,...,m-1\) is given by the generalized partial credit model (GPC; Muraki, 1992):

\[\begin{equation} p_{iv}(z_{ij}=v) = \frac{exp[1.7a_{i}\sum_{r=0}^{v}(\theta_{j}-b_{i}+d_{ir})]}{\sum_{c=0}^{m-1}exp[1.7a_{i}\sum_{r=0}^{c}(\theta_{j}-b_{i}+d_{ir})]}, \tag{2.3} \end{equation}\]

where \(d_{ir}\) are category boundary (threshold) parameters for item \(i\) and category \(r = 0,1,...,m-1\), with constraints that \(d_{i0}=0\) and \(\sum_{r=1}^{m-1}d_{ir}=0\). To estimate both the 2PL and GPC model parameters, flexMIRT (Cai, 2020) is used. The flexMIRT parameterization of the GPC is based on the fact that the GPC model is a constrained version of the nominal response model (Thissen et al., 2010):

\[\begin{equation} p_{iv}(z_{ij}=v) = \frac{exp(v a_{i}\theta_{j}+d_{iv})}{\sum_{r=0}^{m-1}exp(r a_{i}\theta_{j}+d_{ir})}, \tag{2.4} \end{equation}\]

where \(d_{i0}=0\). Therefore, for each item scored in \(m\) categories, the number of estimated (non-zero) \(d\) parameters is \(m-1\) in both Muraki’s formulation, Equation (2.3), and flexMIRT’s formulation, Equation (2.4). Parameters from either formulation can be converted to the other (conversion equations not shown).
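As an illustration of Equations (2.1) to (2.3), the following Python sketch evaluates the item probabilities and the log-likelihood for one examinee. It is not taken from the open source scoring system. The function names and the `(a, b, d)` tuple layout are assumptions made for this example, and the sum in Equation (2.3) is read as running over the whole term \(\theta_{j}-b_{i}+d_{ir}\).

```python
import math

D = 1.7  # scaling factor used in Equations (2.2) and (2.3)


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL, Equation (2.2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def p_gpc(theta, a, b, d):
    """Category probabilities under the GPC, Equation (2.3).

    d is the list [d_0, d_1, ..., d_{m-1}] with d_0 = 0.
    Returns a list of m probabilities for categories 0..m-1.
    """
    exponents = []
    running = 0.0
    for d_r in d:
        running += theta - b + d_r  # partial sum over r = 0..v
        exponents.append(D * a * running)
    top = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(theta, items, responses):
    """Log of Equation (2.1); each item is a tuple (a, b, d)."""
    return sum(
        math.log(p_gpc(theta, a, b, d)[z])
        for (a, b, d), z in zip(items, responses)
    )
```

With two categories and \(d_{i0}=d_{i1}=0\), `p_gpc` reduces to the 2PL, so a dichotomous item can be passed to `log_likelihood` as `(a, b, [0.0, 0.0])`.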

## 2.2 All Correct and All Incorrect Cases

Finite MLE proficiency estimates for zero and perfect scores are not obtainable. To handle all correct and all incorrect cases, assign the highest obtainable theta and scale score (HOT and HOSS) or the lowest obtainable theta and scale score (LOT and LOSS), respectively. Options for HOSS/LOSS values are presented in Section 4.
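The sketch below shows one way the estimation and the extreme-score rule could fit together. It is a minimal illustration, not the operational algorithm: the ternary search, the search bounds of -10 and 10, and all function names are assumptions chosen for brevity. The search relies on the log-likelihood being concave in \(\theta\) for the 2PL and GPC models.

```python
import math

D = 1.7  # scaling factor


def gpc_probs(theta, a, b, d):
    """GPC category probabilities; d = [d_0, ..., d_{m-1}] with d_0 = 0."""
    exponents, running = [], 0.0
    for d_r in d:
        running += theta - b + d_r
        exponents.append(D * a * running)
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def mle_theta(items, responses, lot, hot, lo=-10.0, hi=10.0, tol=1e-6):
    """Return the MLE of theta, or LOT/HOT for all-incorrect/all-correct patterns.

    items: list of (a, b, d) tuples; responses: list of item scores.
    lot, hot: lowest and highest obtainable theta for the grade and subject.
    """
    max_scores = [len(d) - 1 for (_, _, d) in items]
    if all(z == 0 for z in responses):
        return lot  # no finite MLE exists for a zero score
    if all(z == m for z, m in zip(responses, max_scores)):
        return hot  # no finite MLE exists for a perfect score

    def loglik(theta):
        return sum(
            math.log(gpc_probs(theta, a, b, d)[z])
            for (a, b, d), z in zip(items, responses)
        )

    # Ternary search for the maximum of a concave function (an assumption
    # of this sketch; operational systems typically use Newton-type methods).
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0
```

For example, two identical dichotomous items with one correct and one incorrect response give an estimate equal to the common difficulty, while an all-correct pattern returns the HOT passed in by the caller.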

# 3 Participating, Attempting, and Completing

This section presents rules for when scores are reportable and how scores should be computed for incomplete tests. A student is considered to have logged into a session if a student login for the session is recorded by the test delivery system. An item is considered answered if a non-blank student response for the item is recorded by the test delivery system.

## 3.1 Participating

If a student logged into both the CAT and the Performance Task parts/sessions of the test, the student is considered to have participated, even if no items are answered. This condition is not evaluated for students taking interim assessment blocks (IABs) or focused IABs (FIABs) because these tests consist of only a single part.

Non-participating students should be included in the data file sent to Smarter Balanced, but no test scores should be reported.

## 3.2 Attempting

- A Summative test is considered attempted if at least one CAT item and one performance task item are answered.
- An Interim Comprehensive Assessment (ICA) is considered attempted if at least one item from the non-PT section and one item from the performance task (PT) section are answered.
- An interim assessment block (IAB) or focused interim assessment block (FIAB) is considered attempted if at least one item is answered.

Test scores should be reported for all attempted tests.

## 3.3 Flag for Participating and Attempting

The data file sent to Smarter Balanced should include a flag to indicate whether the student participated and attempted. Possible values for summative and ICA assessments are N, P, and Y. Possible values for IABs and FIABs are P and Y.

P = The student did not attempt the test even if they participated.

Y = The student attempted the test.
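The rules in Sections 3.1 to 3.3 can be sketched as a single function. The function name, the argument names, and the part labels `"non_pt"` and `"pt"` are hypothetical. The sketch also assumes that N denotes a student who did not participate, which this document does not state explicitly.

```python
def participation_flag(test_type, logged_into_all_parts, answered_by_part):
    """Return the participation/attempt flag: 'N', 'P', or 'Y'.

    test_type: 'summative', 'ICA', 'IAB', or 'FIAB'.
    logged_into_all_parts: True if a login was recorded for every part/session.
    answered_by_part: dict mapping a part label to the number of answered items.
    """
    if test_type in ("IAB", "FIAB"):
        # Single-part tests: participation is not evaluated, so only P or Y.
        attempted = sum(answered_by_part.values()) >= 1
        return "Y" if attempted else "P"

    # Summative and ICA: at least one answered item in each part.
    attempted = all(
        answered_by_part.get(part, 0) >= 1 for part in ("non_pt", "pt")
    )
    if attempted:
        return "Y"
    # Assumption: 'N' marks a student who did not log into both parts.
    return "P" if logged_into_all_parts else "N"
```

A student who logged into both parts of a summative test but answered nothing would receive P under this reading, and test scores would be reported only for flag Y.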

## 3.4 Completing

- Summative and ICA tests are considered “complete” if the student answered the minimum number of operational items specified in the blueprint for the CAT and all items in the performance task part.

- IABs and FIABs are considered complete if the student answered all items in the test.

If a student completes a test but does not submit it, the test delivery system (TDS) should mark the test as completed. If the TDS allowed the student to submit the test, it is considered “complete”.

## 3.5 Scoring Incomplete Tests

MLE is used to score incomplete tests, counting unanswered items as incorrect. For summative fixed form tests and ICAs, both total scores and subscores are computed.

Online Summative Tests include both the CAT and the performance task parts. The performance task part includes a fixed form test. For the performance task items, unanswered items are treated as incorrect. If the CAT part of summative tests is incomplete, only a total score should be reported. Claim scores (subscores) should not be reported^{1}. In the open source scoring system, simulated item parameters are generated with the following rules:

- The minimum of the CAT operational test length is used to determine the test length of the incomplete tests;
- It is assumed that the remainder of the CAT, following the last answered item, consists of items whose IRT item parameters are equal to the average values of the on-grade items in the summative item pool. Table 3.1, below, may be used for average discrimination and difficulty parameters. Vendors may use other equivalent methods of generating item parameters (e.g., inverse TCC; Stocking, 1996).
- All unanswered items, including the assumed items, are scored as “incorrect.”

Table 3.1: Average discrimination (a) and difficulty (b) parameters by grade and subject

Grade | ELA/L a | ELA/L b | Math a | Math b |
---|---|---|---|---|
3 | 0.67 | -0.42 | 0.85 | -0.81 |
4 | 0.59 | 0.13 | 0.81 | -0.06 |
5 | 0.61 | 0.51 | 0.77 | 0.68 |
6 | 0.54 | 1.01 | 0.70 | 1.06 |
7 | 0.54 | 1.11 | 0.71 | 1.79 |
8 | 0.53 | 1.30 | 0.61 | 2.29 |
HS | 0.50 | 1.69 | 0.53 | 2.71 |
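The padding rule for an incomplete CAT can be sketched as follows. This is an illustration only: the function name and data layout are hypothetical, the simulated items are treated as dichotomous `(a, b)` pairs, and the sketch does not distinguish pre-fetched items with known parameters from simulated ones.

```python
# Average item parameters (a, b) from Table 3.1, keyed by (subject, grade).
AVERAGE_PARAMS = {
    ("ELA", "3"): (0.67, -0.42), ("Math", "3"): (0.85, -0.81),
    ("ELA", "4"): (0.59, 0.13), ("Math", "4"): (0.81, -0.06),
    ("ELA", "5"): (0.61, 0.51), ("Math", "5"): (0.77, 0.68),
    ("ELA", "6"): (0.54, 1.01), ("Math", "6"): (0.70, 1.06),
    ("ELA", "7"): (0.54, 1.11), ("Math", "7"): (0.71, 1.79),
    ("ELA", "8"): (0.53, 1.30), ("Math", "8"): (0.61, 2.29),
    ("ELA", "HS"): (0.50, 1.69), ("Math", "HS"): (0.53, 2.71),
}


def pad_incomplete_cat(answered_items, responses, subject, grade, min_cat_length):
    """Extend an incomplete CAT to the minimum operational test length.

    answered_items: list of (a, b) pairs for the items the student answered.
    responses: item scores for those items.
    Returns (items, scores) where every appended item has the average
    parameters for the grade and subject and is scored 0 (incorrect).
    """
    n_missing = max(0, min_cat_length - len(answered_items))
    a, b = AVERAGE_PARAMS[(subject, grade)]
    items = list(answered_items) + [(a, b)] * n_missing
    scores = list(responses) + [0] * n_missing
    return items, scores
```

The padded lists can then be passed to the MLE routine to obtain the total score. A test that already meets the minimum length is returned unchanged.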

## 3.6 Hand Scoring Rules

Scoring rules for hand scored items:

Evidence/elaboration, organization/purpose, and conventions are the scoring dimensions for essays. Scores for the first two dimensions are averaged, and the average is rounded up.

All condition codes are recoded to zero for calculations.

In most cases, when an essay receives a condition code, the code is assigned to all dimensions. The exception to that is described below.

Starting in 2022-2023 a new scoring rule is in effect for ELA/literacy items. If an essay is identified with a condition code of ‘off-purpose,’ the conventions trait is scored, and the traits of evidence/elaboration and organization/purpose are not scored. Consistent with essays that do not have condition codes, item scores for conventions are included when calculating the total ELA/literacy score and the writing claim score. The rule for ‘off-purpose’ scoring of conventions is applied regardless of the writing purpose.
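A minimal sketch of the essay scoring arithmetic follows. The function name is hypothetical, and representing a condition code as a string is an assumption of this example. Under that representation, an ‘off-purpose’ essay is passed with condition codes for the first two dimensions and an integer score for conventions.

```python
import math


def essay_scores(evidence, organization, conventions):
    """Return (combined_score, conventions_score) for one essay.

    Each argument is an integer dimension score or a condition-code string.
    Condition codes are recoded to zero before any calculation.
    """
    def recode(value):
        return value if isinstance(value, int) else 0

    # Average the first two dimensions and round the average up.
    combined = math.ceil((recode(evidence) + recode(organization)) / 2)
    return combined, recode(conventions)
```

For instance, dimension scores of 3 and 2 average to 2.5, which rounds up to 3.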

# 4 Rules for Transforming Theta to Vertical Scale Scores

The IRT vertical scale is formed by linking across grades using common items in adjacent grades. The vertical scale score is the linear transformation of the post-vertically scaled IRT proficiency estimate.

\[\begin{equation} SS=a*\theta +b \tag{4.1} \end{equation}\]

The scaling constants a and b are provided by Smarter Balanced. Table 4.1 lists the scaling constants for each subject for the theta-to-scaled score linear transformation. Scale scores are rounded to an integer.

Table 4.1: Scaling constants for the theta-to-scale-score linear transformation

Subject | Grade | Slope (a) | Intercept (b) |
---|---|---|---|
ELA/Literacy | 3-8, HS | 85.8 | 2508.2 |
Mathematics | 3-8, HS | 79.3 | 2514.9 |
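Equation (4.1) with the constants in Table 4.1 can be sketched as below. The function name is hypothetical. The document says only that scale scores are rounded to an integer, so the round-half-up rule used here is an assumption. Python's built-in `round` is avoided because it rounds halves to the nearest even integer.

```python
import math

# Slope (a) and intercept (b) from Table 4.1.
SCALING = {"ELA": (85.8, 2508.2), "Math": (79.3, 2514.9)}


def scale_score(theta, subject):
    """Transform theta to an integer scale score, Equation (4.1)."""
    a, b = SCALING[subject]
    # Round half up (assumption: the tie-breaking rule is not specified).
    return math.floor(a * theta + b + 0.5)
```

As a check, the grade 3 ELA/literacy Level 3 theta cut of -0.888 maps to \(85.8 \times (-0.888) + 2508.2 = 2432.01\), which rounds to the tabled scale score of 2432.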

## 4.1 Lowest/Highest Obtainable Scale Scores (LOSS/HOSS)

**HOSS/LOSS Options**
Options for HOSS/LOSS values have been set in policy. Implementation of the option desired by each member needs to be negotiated with the test scoring contractor. Smarter Balanced members have the following options:

**Option 1:** Members may choose to retain the 2014-15 LOSS/HOSS values which are shown in Table 4.2.

Table 4.2: 2014-15 lowest and highest obtainable theta (LOT/HOT) and scale score (LOSS/HOSS) values

Subject | Grade | LOT | HOT | LOSS | HOSS |
---|---|---|---|---|---|
ELA/L | 3 | -4.5941 | 1.3374 | 2114 | 2623 |
ELA/L | 4 | -4.3962 | 1.8014 | 2131 | 2663 |
ELA/L | 5 | -3.5763 | 2.2498 | 2201 | 2701 |
ELA/L | 6 | -3.4785 | 2.5140 | 2210 | 2724 |
ELA/L | 7 | -2.9114 | 2.7547 | 2258 | 2745 |
ELA/L | 8 | -2.5677 | 3.0430 | 2288 | 2769 |
ELA/L | HS | -2.4375 | 3.3392 | 2299 | 2795 |
Math | 3 | -4.1132 | 1.3335 | 2189 | 2621 |
Math | 4 | -3.9204 | 1.8191 | 2204 | 2659 |
Math | 5 | -3.7276 | 2.3290 | 2219 | 2700 |
Math | 6 | -3.5348 | 2.9455 | 2235 | 2748 |
Math | 7 | -3.3420 | 3.3238 | 2250 | 2778 |
Math | 8 | -3.1492 | 3.6254 | 2265 | 2802 |
Math | HS | -2.9564 | 4.3804 | 2280 | 2862 |

**Option 2:** Members may choose to use other LOSS/HOSS values beginning in administration year 2015-16 as long as the revised LOSS values do not result in more than 2% of students falling below the LOSS level and the revised HOSS values do not result in more than 2% of students falling above the HOSS level.

**Option 3:** Members may choose to eliminate LOSS/HOSS altogether.

**Additional Details:**

- For all-wrong/all-right response patterns: assign the LOT/HOT, or adjust the student’s score on the item with the smallest a-parameter among all administered operational items (CAT and PT where applicable) as follows:
  - For all-incorrect tests, add 0.5 to the item score.
  - For all-correct tests, subtract 0.5 from the item score.
- Smarter Balanced will need to retain both the calculated theta score and the reported scale score for students whose scores fall into HOSS/LOSS ranges.
- If using HOSS/LOSS Option 1 or Option 2 above:
  - When the scale score corresponding to the estimated \(\theta\) is lower than the LOSS or higher than the HOSS, the scale score is assigned the associated LOSS or HOSS value. The \(\theta\) score is retained as originally computed.
  - LOSS and HOSS scale score rules are applied to all tests (Summative, ICA, IAB and FIAB) and all scores (total and subscores).
- The standard error for LOSS and HOSS is computed using \(\theta\) proficiency estimates given the administered items. For example, in Equation (5.1) and Equation (5.2), the LOSS or HOSS is plugged in for \(\theta\), and a and b are for the administered items.
- If using Option 3, the scale score is calculated directly from the estimated \(\theta\).
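The truncation rule for Options 1 and 2, and its absence under Option 3, might be sketched as follows. The function name and signature are hypothetical, and round-half-up is again an assumption. The function returns the unmodified \(\theta\) alongside the reported scale score, reflecting the requirement to retain both.

```python
import math


def reported_scale_score(theta, slope, intercept, loss=None, hoss=None):
    """Return (theta, scale_score) with optional LOSS/HOSS truncation.

    Pass loss and hoss for Options 1 and 2; leave them as None for Option 3.
    """
    ss = math.floor(slope * theta + intercept + 0.5)  # assumed round half up
    if loss is not None:
        ss = max(ss, loss)  # scores below the LOSS are reported as the LOSS
    if hoss is not None:
        ss = min(ss, hoss)  # scores above the HOSS are reported as the HOSS
    return theta, ss
```

Using the grade 3 ELA/literacy row of the Option 1 values (LOSS 2114, HOSS 2623), an estimated \(\theta\) of -5.0 has an untruncated scale score of 2079 and is reported as 2114.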

# 5 Calculating Measurement Error

## 5.1 Standard Error of Measurement

With MLE estimation, the standard error (SE) for student \(j\) is:

\[\begin{equation} SE({\theta_j}) = \frac{1}{\sqrt{I({\theta_j})}}, \tag{5.1} \end{equation}\]

where \(I(\theta_{j})\) is the test information for student \(j\), calculated as:

\[\begin{equation} \begin{split} I({\theta}_{j}) = \sum_{i=1}^{I}D^2a_{i}^2 (\frac{\sum_{l=1}^{m_{i}}l^2Exp(\sum_{k=1}^{l}Da_{i}({\theta-b_{ik}}))} {1+\sum_{l=1}^{m_{i}}Exp(\sum_{k=1}^{l}Da_{i}({\theta-b_{ik}}))} - \\ (\frac{\sum_{l=1}^{m_{i}}lExp(\sum_{k=1}^{l}Da_{i}({\theta-b_{ik}}))} {1+\sum_{l=1}^{m_{i}}Exp(\sum_{k=1}^{l}Da_{i}({\theta-b_{ik}}))})^2), \end{split} \tag{5.2} \end{equation}\]

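where \(m_{i}\) is the maximum possible score point (starting from 0) for the \(i^{th}\) item, \(b_{ik}\) is the \(k^{th}\) step parameter for item \(i\) (in the notation of Equation (2.3), \(b_{ik} = b_{i} - d_{ik}\)), and \(D\) is the scaling factor, 1.7.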

The SE is calculated based only on the answered item(s) for both complete and incomplete tests. The upper bound of SE is set to 2.5 in the \(\theta\) metric. Any value larger than 2.5 is truncated at 2.5 in the \(\theta\) metric.
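Equations (5.1) and (5.2) and the upper bound on the SE can be sketched as follows. The function names and the `(a, steps)` item layout are assumptions for this example, where `steps` holds the step parameters \(b_{i1}, ..., b_{im_{i}}\) and a dichotomous item has a single step equal to its difficulty. The sketch does not guard against overflow for extreme \(\theta\) values.

```python
import math

D = 1.7       # scaling factor
MAX_SE = 2.5  # upper bound of the SE in the theta metric


def item_information(theta, a, steps):
    """Information contributed by one item: one term of Equation (5.2)."""
    weights = []
    running = 0.0
    for b_k in steps:
        running += D * a * (theta - b_k)  # inner sum over k = 1..l
        weights.append(math.exp(running))
    denom = 1.0 + sum(weights)
    mean = sum(l * w for l, w in enumerate(weights, start=1)) / denom
    mean_sq = sum(l * l * w for l, w in enumerate(weights, start=1)) / denom
    return D ** 2 * a ** 2 * (mean_sq - mean ** 2)


def standard_error(theta, items):
    """SE of theta over the answered items, Equation (5.1), capped at 2.5."""
    info = sum(item_information(theta, a, steps) for a, steps in items)
    if info <= 0.0:
        return MAX_SE
    return min(MAX_SE, 1.0 / math.sqrt(info))
```

For a single dichotomous item with \(a = 1\) evaluated at \(\theta = b\), the information is \(1.7^{2} \times 0.25 = 0.7225\), giving an SE of about 1.18.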

## 5.2 Standard Error Transformation

Standard errors of the MLEs are transformed to be placed onto the reporting scale. This transformation is:

\[\begin{equation} SE_{ss} = a*SE_{\theta_{j}}, \tag{5.3} \end{equation}\]

where \(SE_{\theta_{j}}\) is the standard error of the proficiency estimate on the \(\theta\) scale and \(a\) is the slope from Table 4.1 that transforms \(\theta\) to the reporting scale.

# 6 Calculating Claim Scores

## 6.1 MLE Scoring for Claim Scores

For individual students taking the full blueprint, claim scores are calculated using MLE, as described in Section 2.1; however, the scores are based on the items contained in a particular claim.

In ELA/literacy, claim scores are computed for each claim. In Math, claim scores are computed for Claim 1, Claims 2 and 4 combined, and Claim 3.

Smarter Balanced is collecting validity evidence regarding composite claim scores in support of the adjusted blueprint and as needed to meet peer review requirements. For 2022-2023, states may elect to report only a total score for the adjusted blueprint with the understanding that peer review approval will necessarily be delayed until after the 2023-2024 administration.

Alternatively, for 2022-2023, states may report composite claim scores for ELA/literacy and mathematics with the understanding that validity data is going to be collected through September 2022 in support of the composites. The planned composite claim scores are as follows:

ELA/literacy:

Composite Claim 1: Reading and listening

Composite Claim 2: Writing and research

Mathematics:

Composite Claim 1: Concepts and procedures

Composite Claim 2: Problem Solving, Communicating Reasoning, Data Analysis and Modeling

## 6.2 Scoring All Correct and All Incorrect Cases

Apply the rule in Section 2.2 to each Claim, except in the case of incomplete CAT tests, where hypothetical items are scored incorrect in order to obtain a total score.

## 6.3 Performance Levels on Claims

For individual students taking the full blueprint, performance levels are reported in addition to scaled scores. If the difference between the proficiency cut score and the claim score is greater (or less) than 1.5 standard errors of the claim score, the student’s performance level is reported as ‘above’ or ‘below’ the standard, depending on the direction of the difference. A plus or minus indicator may appear on the student’s score report to indicate whether the student’s performance is above (+) or below (-) the standard. Otherwise, the student is classified as “Near” the standard.

For IAB, FIAB, ICA, and Summative, the specific rules are as follows:

- **Below Standard** (Code=1): if \(SS_{rc} + 1.5*SE(SS_{rc}) < SS_{p}\)
- **Near Standard** (Code=2): if \(SS_{rc} + 1.5*SE(SS_{rc}) \ge SS_{p}\) and \(SS_{rc} - 1.5*SE(SS_{rc}) < SS_{p}\); a strength or weakness is indeterminable
- **Above Standard** (Code=3): if \(SS_{rc} - 1.5*SE(SS_{rc}) \ge SS_{p}\)

where \(SS_{rc}\) is the student’s scale score on a reporting category; \(SS_{p}\) is the proficiency scale score cut (Level 3 cut); and \(SE(SS_{rc})\) is the standard error of the student’s scale score on the reporting category. Assign Above Standard (code=3) to HOSS and assign Below Standard (code=1) to LOSS. For each computation, round the left side to an integer before comparing to \(SS_{p}\).
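The classification rules can be sketched as one function. The function and argument names are hypothetical, and rounding half up is an assumption because the tie-breaking rule is not specified. The LOSS and HOSS checks are applied first, following the instruction to assign those cases directly.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with halves rounded up (an assumption)."""
    return math.floor(x + 0.5)


def claim_performance_level(ss_rc, se_rc, ss_p, loss=None, hoss=None):
    """Return 1 (Below), 2 (Near), or 3 (Above) for a reporting category.

    ss_rc: claim scale score; se_rc: its standard error on the scale score
    metric; ss_p: Level 3 scale score cut; loss/hoss: optional bounds.
    """
    if hoss is not None and ss_rc >= hoss:
        return 3
    if loss is not None and ss_rc <= loss:
        return 1
    upper = round_half_up(ss_rc + 1.5 * se_rc)  # rounded before comparing
    lower = round_half_up(ss_rc - 1.5 * se_rc)
    if upper < ss_p:
        return 1
    if lower >= ss_p:
        return 3
    return 2
```

With a cut of 2432 and a standard error of 20, a claim score of 2400 gives an upper bound of 2430 and is classified Below Standard, while 2420 is Near Standard and 2470 is Above Standard.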

## 6.4 Aggregated Claim Scores with the Adjusted Blueprint

The adjusted blueprint does not support reporting of three math and four ELA/literacy claim scores for individual students because the number of items per claim is too small. So, with the adjusted blueprint in administration years 2020-21 and 2021-22, reporting of claim scores should be either suppressed completely, or only reported in aggregate so that performance on claims is summarized at the grade level for schools or districts within each state or territory.

The following options for aggregated claim score reporting are recommended to Smarter Balanced members for 2020-21 and 2021-22 for the calculation of aggregated claim scores along with a 95% confidence interval. Aggregated claim scores should not be calculated for schools/districts with fewer than 20 examinees.

**Option 1:** The median of individual student scale scores for the claim, reported with a 95% confidence interval. If this option is selected, claim scale scores for individuals are calculated as described in the Smarter Balanced scoring specifications.
The confidence interval for the median may be computed based on the Binomial distribution where n = sample size and q = .5 (Conover, 1980). All observations in the data must be ranked and then the rank of the scores that constitute the lower and upper limits of the interval, respectively, are provided by the following formulas, rounded up to the nearest integer:

- Rank of lower limit = \(nq-1.96((nq(1-q))^{1/2})\),
- Rank of upper limit = \(nq+1.96((nq(1-q))^{1/2})\).
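The rank formulas for Option 1 can be sketched as follows. The function names are hypothetical, and clamping the ranks to the range 1 to n is a safeguard added for this example rather than part of the stated rule.

```python
import math
import statistics


def median_ci_ranks(n, q=0.5, z=1.96):
    """Ranks (1-based) of the lower and upper limits of the interval."""
    half_width = z * math.sqrt(n * q * (1 - q))
    lower = math.ceil(n * q - half_width)  # rounded up to an integer
    upper = math.ceil(n * q + half_width)
    return max(lower, 1), min(upper, n)


def median_with_ci(scores):
    """Return (median, lower_limit, upper_limit) for a list of scale scores."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < 20:
        raise ValueError("aggregated claim scores require at least 20 examinees")
    lo_rank, hi_rank = median_ci_ranks(n)
    return statistics.median(ordered), ordered[lo_rank - 1], ordered[hi_rank - 1]
```

For \(n = 100\), the formulas give \(50 - 9.8 = 40.2\) and \(50 + 9.8 = 59.8\), so the interval runs from the 41st to the 60th ranked score.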

**Option 2:** Scores for schools/districts may be produced from a unidimensional multilevel item response model with item parameters fixed to their operational values and students clustered within schools. This approach requires expected a posteriori estimation rather than maximum likelihood estimation of thetas but uses the same scaling constants as provided in the Smarter Balanced scoring specifications. A standard error (SE) for the school-level scores is a by-product of the estimation; thus, a 95% confidence interval for each score may be calculated in a standard way: estimate ± 1.96*SE.

For 2022-23, composite claim scores are under consideration, which would provide satisfactory reliability without sacrificing validity for individual claim score reporting with the adjusted blueprint.

# 7 Calculating Target Scores

Target scores are produced for online summative tests only, either the full or adjusted blueprint, and computed for all four of the ELA/literacy claims but only for Claim 1 of Mathematics. They are computed for attempted tests based on items with responses. Unanswered items (either CAT or PT) are ignored.

Target scores are computed either relative to student’s overall estimated proficiency (\(\theta\)) for each subject or relative to the proficiency standard (level 2v3 cut score for \(\theta\)). Equations and interpretations are written here for the case of overall estimated proficiency. To use the proficiency standard instead, substitute the cut score (see Table 8.1 and Table 8.2) into Equation (2.2) or Equation (2.3) when calculating expected responses, and interpret the results relative to the proficiency standard, not relative to the overall test.

For each item *i*, the residual between the observed and expected score for each student is defined as:

\[\begin{equation} \delta_{ij} = z_{ij} - E(z_{ij}), \tag{7.1} \end{equation}\]

where \(z_{ij}\) is the observed response from student *j* to item *i* and the expected response, \(E(z_{ij})\), is given by the model in Equation (2.2) or Equation (2.3).

Within each target, *T*, a summary statistic, \(\delta_{jT}\), is created for each student: the residuals for each student are summed over the items aligned to the target and divided by the total number of points possible for those items:

\[\begin{equation} \delta_{jT} = \frac{\sum_{i=1}^{k}\delta_{ij}}{\sum_{i=1}^{k}m_{i}}, \tag{7.2} \end{equation}\]

where *k* = the total number of items aligned to this target and \(m_{i}\) = the number of points for item *i*.

To aggregate over students for target *T* and group *g*, calculate the mean of \(\delta_{jT}\) values from Equation (7.2) and use the following formula for the standard error:

\[\begin{equation} SE(\overline{\delta}_{Tg}) = \sqrt{\frac{1}{n_{g}(n_{g}-1)}\sum_{j=1}^{n_{g}} (\delta_{jT}-\overline{\delta}_{Tg})^{2}}, \tag{7.3} \end{equation}\]

where \(n_{g}\) = sample size for this target and group and \(\overline{\delta}_{Tg}\) = the mean of \(\delta_{jT}\) values for this target and group.

Report whether the group of students, in aggregate, performed better, worse, or as expected on this target compared to the overall test. This decision is made by testing whether the mean target residual is statistically significantly smaller or larger than 0 using 1-sided confidence intervals (\(\alpha\) = .16). In some cases, insufficient information is available to make a categorization.

Specifically:

- If \(\overline{\delta}_{Tg} \ge (1)(SE(\overline{\delta}_{Tg}))\) then performance is **better** on the target than on the overall test.
- If \(\overline{\delta}_{Tg} \le (-1)(SE(\overline{\delta}_{Tg}))\) then performance is **worse** on the target than on the overall test.
- Otherwise, performance on the target is similar to performance on the overall test.
- If \(SE(\overline{\delta}_{Tg}) > 0.2\) the information is insufficient.
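Equations (7.2) and (7.3) and the decision rules can be sketched as below. The function names and category labels are hypothetical. The sketch assumes the insufficient-information check takes precedence over the other three outcomes, and it treats groups of fewer than two students as insufficient because Equation (7.3) is undefined for them.

```python
import math


def target_residual(observed, expected, max_points):
    """Equation (7.2) for one student and one target.

    observed, expected: item scores and model-expected scores for the
    answered items aligned to the target; max_points: points per item.
    """
    return sum(z - e for z, e in zip(observed, expected)) / sum(max_points)


def classify_target(deltas, max_se=0.2):
    """Classify group performance on a target from student-level residuals."""
    n = len(deltas)
    if n < 2:
        return "insufficient"  # Equation (7.3) needs at least two students
    mean = sum(deltas) / n
    se = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (n * (n - 1)))
    if se > max_se:
        return "insufficient"  # assumed to override the other categories
    if mean >= se:
        return "better"
    if mean <= -se:
        return "worse"
    return "similar"
```

The expected scores passed to `target_residual` would come from Equation (2.2) or Equation (2.3), evaluated at either the student's estimated \(\theta\) or the proficiency cut.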

# 8 Calculating Achievement Levels

Overall scale scores for Smarter Balanced are mapped into four achievement levels per grade/content area. The achievement level designations are Level 1, Level 2, Level 3, and Level 4. These levels were defined after achievement level setting.

## 8.1 Threshold Scores for Four Achievement Levels

Table 8.1 and Table 8.2 show the theta cut scores and reported scaled scores (SS) for the ELA/literacy assessments and the Mathematics assessments, respectively.

Table 8.1: ELA/literacy theta cut scores and reported scale scores (SS)

Grade | Theta 1v2 | SS 1v2 | Theta 2v3 | SS 2v3 | Theta 3v4 | SS 3v4 |
---|---|---|---|---|---|---|
3 | -1.646 | 2367 | -0.888 | 2432 | -0.212 | 2490 |
4 | -1.075 | 2416 | -0.410 | 2473 | 0.289 | 2533 |
5 | -0.772 | 2442 | -0.072 | 2502 | 0.860 | 2582 |
6 | -0.597 | 2457 | 0.266 | 2531 | 1.280 | 2618 |
7 | -0.340 | 2479 | 0.510 | 2552 | 1.641 | 2649 |
8 | -0.247 | 2487 | 0.685 | 2567 | 1.862 | 2668 |
9 | -0.224 | 2489 | 0.732 | 2571 | 1.909 | 2672 |
10 | -0.200 | 2491 | 0.802 | 2577 | 1.979 | 2678 |
11 | -0.177 | 2493 | 0.872 | 2583 | 2.026 | 2682 |

Table 8.2: Mathematics theta cut scores and reported scale scores (SS)

Grade | Theta 1v2 | SS 1v2 | Theta 2v3 | SS 2v3 | Theta 3v4 | SS 3v4 |
---|---|---|---|---|---|---|
3 | -1.689 | 2381 | -0.995 | 2436 | -0.175 | 2501 |
4 | -1.310 | 2411 | -0.377 | 2485 | 0.430 | 2549 |
5 | -0.755 | 2455 | 0.165 | 2528 | 0.808 | 2579 |
6 | -0.528 | 2473 | 0.468 | 2552 | 1.199 | 2610 |
7 | -0.390 | 2484 | 0.657 | 2567 | 1.515 | 2635 |
8 | -0.137 | 2504 | 0.897 | 2586 | 1.741 | 2653 |
9 | 0.026 | 2517 | 1.086 | 2601 | 2.032 | 2676 |
10 | 0.228 | 2533 | 1.250 | 2614 | 2.296 | 2697 |
11 | 0.354 | 2543 | 1.426 | 2628 | 2.561 | 2718 |
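Mapping an overall scale score to an achievement level can be sketched with a sorted lookup. The function name is hypothetical, only two grade 3 rows of the cut score tables are reproduced, and the sketch assumes that a score exactly equal to a cut falls in the higher level.

```python
from bisect import bisect_right

# Scale score cuts (1v2, 2v3, 3v4) for grade 3, taken from the tables above.
CUTS = {
    ("ELA", "3"): (2367, 2432, 2490),
    ("Math", "3"): (2381, 2436, 2501),
}


def achievement_level(scale_score, cuts):
    """Return the achievement level (1 to 4) for an overall scale score.

    cuts: the three scale score cuts in ascending order.
    Assumption: a score equal to a cut belongs to the higher level.
    """
    return bisect_right(cuts, scale_score) + 1
```

For grade 3 ELA/literacy, a scale score of 2431 falls in Level 2 and a score of 2432 falls in Level 3 under this convention.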

# 9 References

Conover, W.J. (1980). *Practical Nonparametric Statistics.* John Wiley and Sons, New York.

Cai, L. (2020). flexMIRT version 3.62: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. *Applied Psychological Measurement,* 16, 159-176.

Stocking, M. L. (1996). An alternative method for scoring adaptive tests. *Journal of Educational and Behavioral Statistics,* 21, 365-389.

Thissen, D., Cai, L., & Bock, R. D. (2010). The nominal categories item response model. In M. Nering & R. Ostini (Eds.), *Handbook of Polytomous Item Response Theory Models*. Routledge.

^{1} For the CAT items, the identity of most of the specific unanswered items is unknown. If items have been lined up for administration (through the pre-fetch process), their parameters are known and the items are scored as incorrect. That is, they are treated in the same manner as known items in interim tests, paper/pencil tests, and performance tasks. For the remaining items, simulated parameters are used in place of administered items.