Smarter Balanced CAT Algorithm

Hotaka Maeda and Shumin Jing

June 05, 2024

Introduction

This is the first document that comprehensively describes the computer adaptive testing (CAT) algorithm used by Smarter Balanced (SB). This document focuses on the current algorithm we use in practice for most of our member states. Cambium’s simulator and SB’s simulator under development use a slightly different algorithm; the differences are noted near the end of the document. In general, the algorithm logic makes decisions in the order presented below.

6/5/2024 updated sections: 5. EFT CAT Items and 6. Off-Grade CAT OP Items.

11/2/2023 updated sections: 3.2. Selection within Testlets, 5. EFT CAT Items, and 7. CAT Segment 1 End.

1 Blueprint Selection

Key Points: For participating states, a random 2 to 4% of students will receive EFT PT items instead of the OP PT items, and receive additional CAT items to compensate for the lack of OP PT items.

First, the CAT algorithm must decide whether the student will receive Embedded Field-Test (EFT) performance task (PT) items. EFT PT items are administered to only about a randomly sampled 2 to 4% of the students who have registered for the assessment. The exact percentage varies by administration year and exam. Not all states participate in field-testing.

If the student is selected to receive EFT PT items, the operational (OP) PT items will be replaced by a random EFT PT item set. Furthermore, the CAT algorithm will be based on the enhanced blueprint. The enhanced blueprint includes additional CAT items compared to the non-enhanced blueprint to compensate for the loss of test information from having no OP PT items. Therefore, these students will have a longer CAT exam than others. In terms of the blueprint, the enhanced and non-enhanced blueprints are different exams, so many algorithm parameters may change when a student is selected to receive EFT PT items.

2 CAT Segment 1 Begin

Key Points: Exams are separated into two or more segments, including at least one CAT segment.

Exams are separated into two or more segments, depending on the blueprint. The first segment is typically a CAT segment. This can be Math calculator CAT, Math non-calculator CAT, or ELA CAT segment. The order of the segments is indicated in the blueprint as segment_position. The order of segments is important for CAT segments as the interim \(\hat{\theta}\) from the previous segments will be carried over.

3 CAT Segment 1 OP Item Group 1

Key Points: Items are initially selected at the item group level, which is a single item with no stimulus, or a group of items with the same stimulus.

In the item pool, create OP item groups. An item group is either a single item with no stimulus (i.e., a passage) or a group of items that share the same stimulus. Then, eliminate individual items that would violate any blueprint element maximum max_op_items with isstrictmax==TRUE. The initial ability \(\hat{\theta}\) is the previous year’s mean score for the exam, calculated by Cambium. The initial test information startinfo (the reciprocal of the prior variance) is 0.2.

3.1 Item Selection

Key Points: Selection of the 1st CAT item is nearly completely random. This helps spread out the item exposure, and prevent students from seeing the same items administered in a similar order throughout the entire segment. The algorithm tries to emphasize increasing test information (score accuracy) in the beginning of the segment, and emphasize meeting the blueprint near the end.

The 1st OP item group is selected using the cset2initialrandom variable. Often, the ratio of cset2initialrandom to the number of item groups in the pool (this ratio is cset2initialp) is 1.0, in which case the item selection is random with equal selection probability for every item group in the pool.

There are exceptions. For example, cset2initialp is 0.8 for ELA grades 3-6, so one random item group is administered out of the top 80% of item groups with the highest content value (i.e., blueprint value). First, calculate content value \(c_{ijt}\) for examinee \(i\), item \(j\), and selection \(t\):

\[c_{ijt}=\frac{1}{\sum_{r=1}^Rd_{rj}} \sum_{r=1}^RS_{rit} p_r d_{rj}\]

and

\[S_{rit} = \begin{cases} \left( \frac{T}{T-t}\right)\left(2-\frac{z_{rit}}{Min_r}\right), & \text{if } z_{rit}<Min_r\\ 1-\frac{z_{rit}-Min_r}{Max_r-Min_r}, & \text{if } Min_r \leq z_{rit} < Max_r \\ Max_r-z_{rit}-1, & \text{if } Max_r \leq z_{rit} \end{cases}\]

where

  • \(d_{rj}=1\) when item \(j\) contributes to blueprint element \(r\), but is otherwise 0.

  • \(T\) is the minimum segment length min_op_items. Only operational items count towards \(T\).

  • \(t\) is the selection index, which begins with 0. Therefore, \(\frac{T}{T-t}=\frac{T}{1}\) when the last item required to meet \(T\) is being selected. Only operational items count towards \(t\). \(t\) lets the algorithm focus on increasing test information near the beginning of the segment, and emphasize meeting the blueprint near the end.

  • \(z_{rit}\) represents the number of items with blueprint element \(r\) already administered to examinee \(i\) at selection \(t\).

  • \(Min_r\) and \(Max_r\) are the minimum and maximum number of items to be administered for blueprint element \(r\).

  • \(p_r\) is the local blueprint weight (bpweight).

Overall, \(S_{rit}\) exists to increase test information as quickly as possible in the beginning of the exam, then emphasize meeting the blueprint as the segment ends. It also allows isstrictmax==FALSE blueprint categories to be violated when necessary.
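The two formulas above can be sketched in Python as follows. This is an illustrative sketch only (function and variable names are our own, not Smarter Balanced's), assuming the blueprint data are available as simple lists:

```python
def s_value(z, min_r, max_r, t, T):
    """Blueprint-element score S_rit at selection t (0-indexed)."""
    if z < min_r:
        return (T / (T - t)) * (2 - z / min_r)
    elif z < max_r:
        return 1 - (z - min_r) / (max_r - min_r)
    else:
        return max_r - z - 1


def content_value(d_j, z, mins, maxs, p, t, T):
    """Content value c_ijt for one item j: the bpweight-weighted sum of
    S_rit over the blueprint elements the item contributes to, divided by
    the number of elements it contributes to.

    d_j  -- 0/1 indicators: d_j[r] == 1 if item j contributes to element r
    z    -- items already administered per blueprint element
    p    -- local blueprint weights (bpweight)
    """
    total = sum(s_value(z[r], mins[r], maxs[r], t, T) * p[r] * d_j[r]
                for r in range(len(d_j)))
    return total / sum(d_j)
```

For example, at \(t=0\) an unfilled element (\(z\) below its minimum) yields a large positive \(S\), while an element already at its maximum yields a negative \(S\), steering selection away from items that would overfill it.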

Then, \(c_{ijt}\) is averaged over all items in the item group \(k\) to evaluate items by groups:

\[c_{ikt}= \frac{1}{length(k)}\sum_{j\in k}{c_{ijt}}\]

where

  • \(length(k)\) is the number of items in item group \(k\)

Sort all item groups based on \(c_{ikt}\) (from largest to smallest), then select the top cset2initialrandom item groups. Then, randomly select one item group to administer.

Random or near-random selection of the first item group is useful for somewhat randomizing the algorithm decisions for the following item selections. Without this special treatment of the first item group, many students may see the same or similar sets of items administered in a similar order throughout the entire segment.
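The group-level averaging and the near-random first pick can be sketched as follows (illustrative names; a real implementation operates on the full pool structures):

```python
import random


def group_content_value(c_ij, group):
    """c_ikt: mean content value over the items j in item group k."""
    return sum(c_ij[j] for j in group) / len(group)


def select_first_group(groups, c_ij, cset2initialrandom, rng=random):
    """Rank groups by c_ikt (descending), keep the top cset2initialrandom,
    then pick one of them uniformly at random."""
    ranked = sorted(groups, key=lambda g: group_content_value(c_ij, g),
                    reverse=True)
    return rng.choice(ranked[:cset2initialrandom])
```

When cset2initialrandom equals the pool size, the sort has no effect and the pick is fully random, matching the common case described above.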

3.2 Selection within Testlets

Key Points: When a stimulus is selected, the algorithm will administer a pre-specified number of items from that stimulus.

If the selected item group is a testlet that contains multiple items (only ELA has CAT testlets), there are additional steps.

  1. Select an item within the testlet with the highest \(c_{ijt}\). If there are ties, select the item with the highest item information \(u_{ijt}\) (explained in the next section). If there are still ties, select randomly.

  2. Re-evaluate the blueprint by removing any remaining items within the testlet that would exceed max_op_items in blueprint categories with isstrictmax==TRUE, and update \(c_{ijt}\).

  3. Repeat this process until maxitems for the item group is met. The maxitems variable indicates the exact number of items required to be administered from a stimulus; it does not indicate the upper bound of a range. If maxitems items cannot be administered from the stimulus, then do not administer the stimulus at all. This can happen when no items that meet the blueprint remain in the stimulus, or when the maximum segment length is met.

The number of items allowed for each stimulus is specified in the public blueprint as a range, but the operational blueprint specifies it as a fixed number (maxitems).

Testlet items associated with the same stimulus are shown on a single page. Item-browsing within the same page is allowed: students are able to skip ahead or change answers on a previous item as long as they do not go on to the next page. The interim \(\hat{\theta}\) is not updated until all items in the testlet are completed. The order in which items within a stimulus are administered is configurable in the test packages.

Ideally, there should be an efficient way to remove testlets from the item pool that cannot meet the maxitems requirement. However, this may be difficult.
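The within-testlet steps above can be sketched as follows. This is illustrative only; violates_strict_max stands in for the blueprint re-evaluation in step 2, and all names are our own:

```python
import random


def fill_testlet(testlet_items, c, u, maxitems, violates_strict_max, rng=random):
    """Pick items from a selected testlet one at a time: highest c_ijt,
    ties broken by highest item information u_ijt, remaining ties broken
    at random. Items that would now violate a strict blueprint maximum
    are dropped before each pick. Returns [] (skip the stimulus) if
    maxitems cannot be reached."""
    pool = list(testlet_items)
    chosen = []
    while len(chosen) < maxitems:
        pool = [j for j in pool if not violates_strict_max(j, chosen)]
        if not pool:
            return []          # cannot meet maxitems: do not administer
        best = max(pool, key=lambda j: (c[j], u[j], rng.random()))
        chosen.append(best)
        pool.remove(best)
    return chosen
```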

3.3 Item Information

Key Points: Item information is usually only considered for the 2nd and later CAT items in the segment. Item discrimination is ignored in the item selection decision, but item difficulty is considered.

Item information \(u_{ijt}\) is calculated for the 1st item group of the segment only to break ties within a selected testlet, which may be rare. Item information is used more frequently later in the exam. For binary items,

\[u_{ijt} = p_j(\theta)(1-p_j(\theta))\]

where

  • \(p_j(\theta)\) is the probability of obtaining the correct response on item \(j\) based on the 2-parameter logistic model.

Because \(\theta\) is unknown, the most updated ability estimate \(\hat{\theta}\) is used instead. For binary items, \(u_{ijt}\) does not take the estimation error of \(\theta\) into account. This is different for polytomous items, where the generalized partial credit (GPC) item response model is used and \(u_{ijt}\) is the expected item information, which captures the measurement error associated with the estimation of \(\theta\):

\[u_{ijt} = \int \sum^V_{v=1} I_{jv}(q) p_j(v|q) \phi(q;\hat{\theta},\sigma_\hat{\theta})dq\] where

  • \(V\) is the maximum score.

  • \(I_{jv}(q)\) is the item information for response category \(v\) for item \(j\) when \(\theta=q\), based on the GPC model.

  • \(p_j(v|q)\) is the probability of obtaining response category \(v\) for item \(j\) when \(\theta=q\), based on the GPC model.

  • \(\phi(q;\hat{\theta},\sigma_\hat{\theta})\) is the normal probability density function at \(q\) when the mean is the updated \(\hat{\theta}\), and standard deviation is its standard error \(\sigma_\hat{\theta}\).

Given the difficulties of integration, a Gauss-Hermite quadrature with five quadrature points is used to approximate \(u_{ijt}\). For both the 2PL and GPC models, the item discrimination parameter is set to 1.0 when calculating \(u_{ijt}\) in order to avoid overexposure of highly discriminating items. For more details on the item response models used, see the Scoring Specifications Report.
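A sketch of both calculations, assuming numpy is available. Note a simplifying assumption: the GPC information below is the standard Var(score | q) form with \(a=1\); the exact category-level term \(I_{jv}(q)\) used operationally is not reproduced here. All names are illustrative:

```python
import math

import numpy as np


def info_2pl_unit_slope(theta_hat, b):
    """u_ijt for a binary item: p(1-p) under the 2PL with the
    discrimination parameter forced to 1.0 during selection."""
    p = 1.0 / (1.0 + math.exp(-(theta_hat - b)))
    return p * (1 - p)


def gpc_probs(q, steps):
    """GPC category probabilities with a = 1; steps are step difficulties."""
    zs = np.cumsum(np.concatenate(([0.0], q - np.asarray(steps))))
    ez = np.exp(zs - zs.max())          # stabilized softmax
    return ez / ez.sum()


def expected_gpc_info(theta_hat, se, steps, n_nodes=5):
    """Approximate u_ijt = integral of I_j(q) * phi(q; theta_hat, se) dq
    with n-point Gauss-Hermite quadrature (5 nodes, as in the document)."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    total = 0.0
    for xi, wi in zip(x, w):
        q = theta_hat + math.sqrt(2.0) * se * xi   # change of variables
        p = gpc_probs(q, steps)
        v = np.arange(len(p))
        info = (p * v * v).sum() - (p * v).sum() ** 2   # Var(score | q)
        total += wi * info
    return total / math.sqrt(math.pi)
```

With a near-zero standard error, the expected information collapses to the point information at \(\hat{\theta}\); with larger uncertainty, information is averaged over plausible abilities and is therefore smaller at the peak.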

3.4 Administer Item Group

Key Points: Machine-scored items are scored to update the current estimate of the student’s performance level, which is used to select the next item.

Once the item group is selected, administer it. If the item is machine-scored, score it and update ability \(\hat{\theta}\) using a 1-step Newton-Raphson maximum a posteriori (MAP) update, with the previous \(\hat{\theta}\) as the starting ability. This interim \(\hat{\theta}\) is bounded between -4 and 4. Hand-scored items do not affect the interim \(\hat{\theta}\). For testlets, \(\hat{\theta}\) is not updated until all items in the item group are answered. Use the updated \(\hat{\theta}\) to select the next item.
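The update can be sketched as a single Newton-Raphson step on the log-posterior under the 2PL with a normal prior (GPC items would add analogous terms). Names and the mu_prior default are illustrative, not Cambium's code:

```python
import math


def map_update_one_step(theta_prev, responses, items,
                        mu_prior=0.0, startinfo=0.2, bounds=(-4.0, 4.0)):
    """One Newton-Raphson step toward the MAP estimate of theta.

    responses -- 0/1 scores on the machine-scored items so far
    items     -- matching (a, b) parameter pairs under the 2PL
    startinfo -- prior precision (0.2, i.e., prior variance 5)
    """
    d1 = -startinfo * (theta_prev - mu_prior)   # prior gradient
    d2 = -startinfo                             # prior curvature
    for x, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta_prev - b)))
        d1 += a * (x - p)                       # log-likelihood gradient
        d2 -= a * a * p * (1 - p)               # log-likelihood curvature
    theta = theta_prev - d1 / d2                # one Newton-Raphson step
    return max(bounds[0], min(bounds[1], theta))  # clamp to [-4, 4]
```

A correct response pushes the interim \(\hat{\theta}\) up, an incorrect response pushes it down, and the prior pulls it toward mu_prior.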

To prepare the item pool for the next item selection, eliminate item groups that were already administered (including all items associated with the stimulus), and all individual items that would violate any blueprint element max_op_items with isstrictmax==TRUE. In rare cases, an administered OP stimulus can also be in the item pool as EFT, and vice versa. In this case, remove both OP and EFT items in the remaining pool associated with that stimulus.

4 CAT Segment 1 OP Item Group 2+

Key Points: For the 2nd CAT item group and beyond, the item selection becomes very adaptive based on the student’s performance. The blueprint includes parameters used in the algorithm to improve processing speed and spread out item exposure.

For selecting the 2nd item group and beyond, cset1size (candidate set 1) and cset2random (candidate set 2) parameters are used. cset1size exists to improve processing speed. cset2random exists to control item exposure.

4.1 Candidate Set 1

Sort all item groups in the item pool based on content value \(c_{ikt}\) from largest to smallest, then select the top cset1size item groups. The main purpose of cset1size is to avoid calculating item information for every item in the pool at every selection \(t\), which would slow the program. cset1size is often around 5 to 50 items.

4.2 Candidate Set 2

For item groups in cset1size, calculate the objective function \(f_{ikt}\), which is a weighted combination of normalized content value and item information:

\[f_{ikt} = w_2c^\prime_{ikt} + w_0u^\prime_{ikt}\]

\[c^\prime_{ikt}= n(c_{ikt})\]

\[u^\prime_{ikt}= n(u_{ikt})\]

\[u_{ikt}= \frac{1}{length(k)}\sum_{j\in k}{u_{ijt}}\]

\[n(x)= \begin{cases} 1, & \text{if } \min(x)=\max(x)\\ \frac{x-\min(x)}{\max(x)-\min(x)}, & \text{else} \end{cases}\]

where

  • \(w_2\) is the global blueprint weight (bpweight)

  • \(w_0\) is the global ability weight (abilityweight)

  • \(c^\prime_{ikt}\) is the normalized average item group content value

  • \(u_{ikt}\) is the average item group information

  • \(u^\prime_{ikt}\) is the normalized average item group information

  • \(n(x)\) is the normalization function used to place both \(c_{ikt}\) and \(u_{ikt}\) on the same scale so they can be compared (number between 0 to 1).

  • \(\min(x)\) is the minimum of \(x\) across all items \(j\) within examinee \(i\) and selection \(t\)

  • \(\max(x)\) is the maximum of \(x\) across all items \(j\) within examinee \(i\) and selection \(t\)

Sort the cset1size item groups based on \(f_{ikt}\) (from largest to smallest), then select the top cset2random item groups. The main purpose of cset2random is to control item exposure. Note that cset2initialrandom used for the 1st item group is often hundreds, while cset2random is often around 1 to 5. So the selection becomes much more precise from the 2nd item group and beyond.
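The normalization and the objective function can be sketched as follows (illustrative names; the real engine operates on the cset1size candidate groups):

```python
def normalize(xs):
    """n(x): rescale to [0, 1]; if all values are equal, every value maps to 1."""
    lo, hi = min(xs), max(xs)
    if lo == hi:
        return [1.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def objective(c_groups, u_groups, w2, w0):
    """f_ikt = w2 * c'_ikt + w0 * u'_ikt for each candidate item group."""
    cn, un = normalize(c_groups), normalize(u_groups)
    return [w2 * c + w0 * u for c, u in zip(cn, un)]
```

Because both inputs are rescaled to [0, 1], w2 and w0 directly control the trade-off between blueprint fit and measurement precision.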

4.3 Select 1 Item Group

From the cset2random candidate item groups, randomly select one item group to administer. If the selected item group is a testlet that contains more items than required by the blueprint (only ELA has CAT testlets), the process for selecting items within the testlet is the same as for the 1st item group (see section 3.2).

4.4 Administer Item Group

The process of administering the 2nd item group and beyond is the same as for the 1st item group (see above). Unlike during item selection, the item discrimination parameter is not artificially changed to 1.0 when estimating ability.

5 EFT CAT Items

Key Points: All students receive EFT CAT items if the blueprint specifies so. EFT CAT item selection is not affected by student performance. EFT CAT items do not influence the OP item selection, nor are they influenced by them. Students that receive EFT PT items can also receive EFT CAT items. EFT CAT items are administered in random positions in the segment.

All students receive EFT CAT items if the blueprint specifies a non-zero minimum number of EFT items (min_ft_items). EFT CAT item selection is not affected by student performance. They do not influence the OP item selection, nor are they influenced by them. Students that receive EFT PT items can also receive EFT CAT items.

If the student needs to receive EFT CAT items in the segment, first, find the subset of EFT items that can belong in the current CAT Segment. Then, create EFT blocks by grouping item_id based on stim_id and block_id. Items without a stimulus (i.e., passage) are EFT blocks containing one item. If items have a stimulus, items with the same stim_id and block_id are included in the same EFT block. The block_id exists to separate stimuli with many items into smaller subgroups. Currently, only ELA exams contain EFT CAT items with stimuli. When EFT CAT items are administered, the entire EFT block is administered.

5.1 EFT CAT Item Position

At every item position, whether to administer an EFT CAT block is determined with probability \(p\):

\[p= \begin{cases} 0, & \text{if } g < \text{ftstartpos}\\ 0, & \text{else if } h \geq \text{min_ft_items}\\ 1, & \text{else if } m*(\text{ftendpos} - g + 1) \leq \text{min_ft_items}-h \\ \frac{\text{min_ft_items}-h}{m*(\text{ftendpos} - g + 1)}, & \text{else} \end{cases}\]

where

  • \(p\) is the probability of administering an EFT CAT block instead of an OP item group at item position \(g\). The \(p\) can reach 1 near the end of the test segment.

  • \(g\) is the current item position in the segment including both OP and EFT items

  • \(h\) is the number of EFT CAT items already administered in the segment

  • \(m\) is the shortest EFT CAT block length remaining in the pool (usually, \(m=1\))

Note that EFT CAT items cannot be administered in the middle of an OP testlet.
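The probability above can be written directly (names mirror the document's variables; this is a sketch, not Cambium's code):

```python
def eft_block_probability(g, h, m, ftstartpos, ftendpos, min_ft_items):
    """p: probability of administering an EFT CAT block at item position g.

    g -- current item position (OP and EFT items both count)
    h -- EFT CAT items already administered in the segment
    m -- shortest EFT CAT block length remaining in the pool
    """
    if g < ftstartpos:
        return 0.0                      # too early for field-test items
    if h >= min_ft_items:
        return 0.0                      # EFT quota already met
    needed = min_ft_items - h
    capacity = m * (ftendpos - g + 1)   # items deliverable in remaining slots
    if capacity <= needed:
        return 1.0                      # must administer EFT now to meet quota
    return needed / capacity
```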

5.2 EFT CAT Item Block Selection

If an EFT CAT block must be administered at the current item position, select one from the item pool:

  1. Remove all EFT blocks from the item pool that would exceed the maximum EFT items (max_ft_items) if selected.

  2. Remove all EFT blocks containing stim_id that have already been administered.

  3. Select a random EFT block from the pool to administer, weighted by ftweight. Every EFT block is associated with a single ftweight. For example, an EFT block with an ftweight of 2 is twice as likely to be selected as blocks with ftweight of 1. Administer the entire EFT block beginning from the current item position \(g\).

The ftweight is configured so that the final exposure rates for all EFT CAT items are roughly equal. Usually, ftweight should be positively correlated with the size of the block: ftweight is 1 for all items without passages, and largest for EFT CAT blocks with the highest number of items. This is because individual items are never removed in step 1 above, while large blocks are thrown out most frequently. A higher ftweight for large blocks increases the likelihood that they are selected as the first EFT items on the exam, before they can be removed from the pool.
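The three selection steps can be sketched with Python's weighted sampling (the block fields are illustrative, not the actual pool schema):

```python
import random


def pick_eft_block(blocks, administered_stims, h, max_ft_items, rng=random):
    """Filter out blocks that would exceed max_ft_items or reuse an
    administered stim_id, then draw one block with probability
    proportional to its ftweight."""
    eligible = [b for b in blocks
                if h + len(b["items"]) <= max_ft_items
                and b["stim_id"] not in administered_stims]
    if not eligible:
        return None
    weights = [b["ftweight"] for b in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0]
```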

6 Off-Grade CAT OP Items

Key Points: Off-grade items up to 2 grades above or below can be administered in the last 1/3 of the entire CAT exam for students performing at extremely high or low levels. Off-grade items exist mainly to address item pool deficiency.

At or after exactly 2/3 of the rounded sum of min_op_items across all CAT segments (i.e., offGradeMinItemsAdministered), the corresponding off-grade items are added to the item pool if the probability that the student can still end up at the standard middle (level 3) achievement cutoff (i.e., proficientplevel) is \(p < 0.0000001\) (i.e., offGradeProbAffectProficiency). If the examinee is performing at a lower level than the achievement cutoff, the lower-grade items are made available. Conversely, if the examinee is performing well, the higher-grade items are triggered. To decide whether to add the off-grade pool to the main item pool, we assume that

  • the average information for the items to be administered is the same as those administered so far

  • the final score at the end of the test will be the information-weighted average of the score based on the items administered so far, and the items administered subsequently.

Given these, we can say that

\[\frac{1}{\bar I K}\left[\bar I k\hat\theta + \bar I (K-k)\theta^*\right] = \text{proficientplevel},\] simplified as

\[\frac{k}{K}\hat\theta + \frac{K-k}{K}\theta^* = \text{proficientplevel},\]

where

  • \(K\) is the CAT test length (sum of min_op_items of all CAT segments)

  • \(k\) is the number of CAT OP items administered so far

  • \(\bar I\) is the average information of items administered so far

  • \(\hat\theta\) is an ability estimate based on the first \(k\) items (not including hand-scored items)

  • \(\theta^*\) is a synthetic ability estimate based on items to be administered as \((k + 1)\)th to \(K\)th items

The left side of the equation is a weighted composite of \(\hat\theta\) and \(\theta^*\), with the weights proportional to the items administered so far, \(\frac{k}{K}\), and the number of items to be administered \(\frac{K-k}{K}\). Then, we can solve for \(\theta^*\) to find the synthetic ability estimate required for the composite to hit proficientplevel as

\[\theta^* = \frac{K}{K-k} \left(\text{proficientplevel} - \frac{k}{K} \hat\theta \right)\] The p-value for triggering the below-grade item pool is based on the cumulative normal distribution with mean \(\hat\theta\) and variance \(\sigma_{\hat\theta}^2\), which is the squared current standard error

\[p(\theta > \theta^*) = 1 - CDF(\theta^*; \hat\theta, \sigma_{\hat\theta}^2)\] The p-value for triggering the above-grade item pool is

\[p(\theta < \theta^*) = CDF(\theta^*; \hat\theta, \sigma_{\hat\theta}^2)\] If \(p < 0.0000001\), then add the appropriate off-grade pool to the main item pool. This decision to trigger the off-grade is checked at every item position after 2/3 of the CAT exam, until an off-grade pool is triggered or the test ends. Once an off-grade pool is triggered, the pool is added to the main item pool for the rest of the CAT exam.

  • offGradeMinItemsAdministered = round(N*offGradeMinItemsAdministeredP), where N is the sum of min_op_items for all CAT segments, and offGradeMinItemsAdministeredP is always 2/3.

  • Math can have 2 CAT segments. Sometimes, the off-grade items can be triggered in the 1st segment or the 2nd, depending on the exam. If they are triggered in the 1st CAT segment, off-grade items remain eligible for selection during the entire 2nd CAT segment as well.

  • The two initial assumptions are strong assumptions that are not always met.

  • Off-grade items can be up to 2 grades above or below the exam grade.

  • In general, off-grade items are in the algorithm to accommodate the lack of easy items in the pool.
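Putting the derivation together, the trigger check can be sketched as follows (names are ours; math.erf supplies the normal CDF):

```python
import math


def off_grade_triggered(theta_hat, se, k, K, proficientplevel, p_cut=1e-7):
    """Return 'below', 'above', or None after k of K CAT OP items.

    theta_star is the synthetic ability the remaining K-k items would
    need for the final composite score to land exactly on
    proficientplevel; trigger an off-grade pool when reaching it is
    all but impossible (p < p_cut) given theta_hat and its SE."""
    theta_star = (K / (K - k)) * (proficientplevel - (k / K) * theta_hat)
    # Normal CDF(theta_star; theta_hat, se^2)
    cdf = 0.5 * (1.0 + math.erf((theta_star - theta_hat) / (se * math.sqrt(2.0))))
    if 1.0 - cdf < p_cut:       # p(theta > theta_star): below-grade trigger
        return "below"
    if cdf < p_cut:             # p(theta < theta_star): above-grade trigger
        return "above"
    return None
```

A low performer late in the test needs an implausibly high \(\theta^*\) to reach the cutoff, so the below-grade pool is triggered, and symmetrically for a high performer.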

7 CAT Segment 1 End

Key Points: The first CAT segment ends if the blueprint is met at or after reaching the minimum segment length. The segment can never be longer than its maximum length.

When the minimum segment length is met based on the operational items, check the following:

  1. If the maximum segment length max_op_items is met based on the operational items, end the segment.

  2. If the min_op_items of all blueprint element values are met, end the segment.

  3. If not, administer another item group and repeat steps 1 and 2 above.

Minimum and maximum segment lengths are the same number in Math CAT segments. For ELA, maximum segment length is often higher than the minimum in order to provide flexibility for ELA CAT testlets. Maximum segment length is never allowed to be violated, even when testlets are involved.
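The termination rule can be sketched as (illustrative names):

```python
def segment_done(n_op, min_op_items, max_op_items, all_bp_minimums_met):
    """CAT segment termination check, applied after each OP item group."""
    if n_op < min_op_items:
        return False                 # minimum segment length not yet met
    if n_op >= max_op_items:
        return True                  # maximum length can never be exceeded
    return all_bp_minimums_met       # otherwise end only if blueprint is met
```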

8 CAT Segment 2

Key Points: For Math, there may be a 2nd CAT segment. Information about the student’s performance is carried over from the previous CAT segment. If the off-grade pool has been triggered in the 1st segment, keep them included in the item pool. Otherwise the 2nd CAT segment proceeds just like the 1st.

If there is a 2nd CAT segment (Math calculator or Non-calculator), then begin the 2nd segment. If the blueprint specifies a non-zero min_ft_items, follow the same logic from Segment 1 to administer EFT CAT items to every student.

If the off-grade pool has been included in the item pool, keep them included. Use the blueprint associated with the 2nd segment_position, which can have a variety of parameters that differ from segment 1, such as cset2initialrandom, cset1size, cset2random, \(w_2\), and \(w_0\).

The most updated \(\hat{\theta}\) and test information is carried over from the previous CAT segment. However, the 1st item group in the 2nd CAT segment uses cset2initialrandom. This item group selection is, again, often random with equal item group selection probability.

Otherwise, the item selection, administration, and termination process is the same as in Segment 1.

9 PT Segments

Key Points: PT items are in their own segment. PT items are selected randomly, without regard to the student’s performance. ELA PT items are separated into 2 segments so students can be provided breaks between each part.

PT items are in their own segment. The segment order does not matter because PT items are neither influenced by the CAT exam performance \(\hat{\theta}\), nor do they influence the CAT exam item selection. Simply administer the PT item group that has the lowest exposure rate (random selection for ties). All ELA and Math PT items are in one testlet. All items associated with the selected PT testlet are administered. The PT exposure rate is usually extremely balanced because of these item selection rules.

If the student is selected to receive EFT PT items (i.e., the exam is based on the enhanced blueprint), randomly select and administer an EFT PT item set instead of the OP PT items. Unlike min_ft_items for EFT CAT items, this is not explicitly specified in the blueprint.

ELA PT items are separated into 2 segments so students can be provided breaks between each part. Once students move on to Part 2, they will not be able to review or revise items in Part 1. For this reason, Smarter Balanced recommends that students complete Part 1 in one test session, and Part 2 on the next school day. Note that the items in the two parts must come from the same stim_id and item group. This is the only situation in which one item group is administered in two separate segments.

10 PRN Segments

Key Points: PRN items in the Braille-HAT exams are in their own segment. The item selection is predetermined. Every student will receive the same set of PRN items.

Printer output file (PRN) items in Braille Hybrid Adaptive Test (HAT) exams are in their own segment. For Math, they can be split into calculator and non-calculator segments. PRN items are always OP items. The item selection is fixed (predetermined, with no randomness involved), so all PRN items in the item pool are administered. The blueprint often indicates that PRN segments are administered at the end of the exam, but the order does not matter because PRN items are neither influenced by the CAT exam performance \(\hat{\theta}\), nor do they influence the CAT exam item selection.

11 Final Ability Estimation

Key Points: After the exam is complete, the test scores are calculated with all items included.

After all exam segments are complete, \(\theta\) must be estimated again with all items included. This is because the most updated \(\hat{\theta}\) excludes hand-scored items, PT items, and PRN items. Also, we use maximum-likelihood estimation rather than MAP for the final estimation. The final \(\hat{\theta}\) must be range-restricted using the highest obtainable theta (HOT) and lowest obtainable theta (LOT), then converted into scaled scores (SS) using slope \(a\) and intercept \(b\):

\[SS=a*\hat{\theta}+b\]

where \(a\) and \(b\) are different values for ELA and Math, but are otherwise consistent across all exams and grades. The final score is also categorized into four achievement level categories based on three cutoff values. For more details on scoring, see the Scoring Specifications Report.
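The conversion can be sketched as follows. The slope, intercept, HOT, and LOT values in the test call are made up for illustration, not actual SB constants:

```python
def scale_score(theta_hat, a, b, lot, hot):
    """Clamp the final theta to [LOT, HOT], then apply SS = a*theta + b."""
    t = max(lot, min(hot, theta_hat))
    return a * t + b
```

Clamping first guarantees the scaled score stays inside [LOSS, HOSS], since those bounds are just the same transformation applied to LOT and HOT.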

Parameters Summary

  • Listed below are the CAT algorithm parameters worth noting because they can vary between tests, segments, blueprint elements, or administration years. The exceptions are offGradeMinItemsAdministeredP, offGradeProbAffectProficiency, and proficientplevel, which are used to trigger the off-grade item pool. These are currently always fixed to a single value, but are included because we may change them in the future.

  • There are many additional parameters in the test admin packages that are always fixed to the same value. These are excluded from the list below.

  • Identifier variables needed to link each item to the segments and blueprint elements are not shown.

  • All parameters are integer or numeric unless otherwise noted.

Test Parameters

Test-level parameters are values that are fixed for an entire test. These are stored in the control file for the Smarter Balanced simulator, but are either in the segment or scoring file in the test admin packages.

table name symbol definition
segment intercept Used to convert theta to Scale Score: Scale Score = Intercept + Slope * theta
segment slope Used to convert theta to Scale Score: Scale Score = Intercept + Slope * theta
segment offGradeMinItemsAdministered Number of items that need to be administered before offgrade items are added to the item pool. In the Smarter Balanced simulator, offGradeMinItemsAdministeredP is used instead, which is a proportion variant. This is useful because offGradeMinItemsAdministeredP is currently always 2/3
segment offGradeMinItemsAdministeredP Proportion of items that need to be administered before offgrade items are added to the item pool. Used only in the Smarter Balanced simulator. Currently always fixed to 2/3.
segment offGradeProbAffectProficiency Null hypothesis that true theta equals the proficiency cut off needs to be p < offgradeprobaffectproficiency for offgrade items to trigger. Currently always 0.0000001
segment proficientplevel Determines the proficiency cutoff and how off-grade items are triggered. Currently always 3, which indicates the middle cutoff.
segment startability \(\theta\) mu_prior. mean of the prior theta, which is also the starting theta. This is typically the average of the previous year’s scores.
segment startinfo \(1/\sigma^2\) information of the prior theta. Currently always 0.2. The variance of the prior theta (sigma_prior^2) is 5 because 1/startinfo = 5
scoring HOT Highest obtainable theta
scoring LOT Lowest obtainable theta
scoring HOSS Highest obtainable scale score. This is not in the test admin packages, but can be calculated: Intercept + Slope * HOT
scoring LOSS Lowest obtainable scale score. This is not in the test admin packages, but can be calculated: Intercept + Slope * LOT
scoring plevel cut_score_theta. Multiple cutoff values that divide theta into 4 achievement categories. In the scoring XML, the min and max theta range is provided for every category.

Segment Parameters

For PT and PRN segments, only the min_op_items and max_op_items variables are used in the CAT algorithm. Most of the other variables are missing.

name symbol definition
segment_position Order of the segment. The order does not matter if NA, which happens for PT and PRN segments
min_op_items \(T\) minimum number of operational items in the segment
max_op_items maximum number of operational items in the segment
min_ft_items minimum number of CAT field test items in the segment
max_ft_items maximum number of CAT field test items in the segment
abilityweight \(w_0\) global ability or item information weight. Higher values emphasize higher test information
bpweight \(w_2\) global blueprint weight. Higher values emphasize the importance of meeting the blueprint
cset2initialrandom The first item group in the segment is selected from the best cset2initialrandom item groups based on the blueprint.
cset2initialp The first item group in the segment is selected from the best cset2initialp proportion of item groups based on the blueprint. Instead of cset2initialrandom, the Smarter Balanced simulator uses the more convenient cset2initialp because it is currently always 1 or 0.8. Only used in simulations.
cset1size set of item groups selected based on their blueprint. cset2random will be selected from cset1size
cset2random For item groups in cset1size, calculate the objective function. Then select 1 random item group out of the best cset2random item groups.
ftendpos CAT field test items should be placed between positions ftstartpos and ftendpos in the segment
ftstartpos CAT field test items should be placed between positions ftstartpos and ftendpos in the segment

Blueprint Element Parameters

For PT and PRN segments, only the min_op_items and max_op_items variables are used in the CAT algorithm. Most of the other variables are missing.

name symbol definition
min_op_items \(Min_r\) minimum number of operational items in the blueprint element
max_op_items \(Max_r\) maximum number of operational items in the blueprint element
bpweight \(p_r\) blueprint element weight. Higher values emphasize the importance of meeting the blueprint
isstrictmax Often TRUE. If FALSE, the max_op_items is allowed to be violated in some circumstances
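As an illustration of how \(Min_r\), \(Max_r\), \(p_r\), and isstrictmax could interact, the marginal blueprint value of administering one more item from element \(r\) might look like the sketch below. The actual objective function is defined in Cohen & Albright (2014); the function name and return values here are assumptions for illustration only.

```python
def bp_element_value(n_admin, min_r, max_r, p_r, isstrictmax=True):
    """Hypothetical marginal blueprint value of one more item from
    element r, given that n_admin items from r were already given."""
    if n_admin < min_r:
        return p_r  # rewards progress toward an unmet minimum
    if n_admin >= max_r:
        # At or above the maximum: forbid if strict, otherwise penalize.
        return float("-inf") if isstrictmax else -p_r
    return 0.0  # between the minimum and maximum: neutral
```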

Future Algorithm Considerations

  • In general, off-grade items are in the algorithm to accommodate the lack of easy items in the pool. The selection logic is not ideal: it triggers the off-grade pool using arbitrarily set values that may not always work well (p < .0000001 of meeting the proficiency standard after 2/3 of the exam is complete). Simply including all off-grade items in the main pool (except perhaps for the first item) could make the algorithm more elegant and efficient in achieving maximum test information. The off-grade item pool can be removed entirely if the main pool contains a sufficient number and variety of items.

  • PT items are not selected adaptively and are not part of the CAT segments. There is no strong reason for this. PT items used to be administered after a classroom activity, which required them to be administered separately from CAT items, but this is no longer the case. PT items have no distinct characteristics that set them apart from CAT items: both can have stimulus sets (testlets), can be polytomous or dichotomous, and can be handscored or machine-scored. Especially if we develop more PT items in the future, we could consider making PT item selection adaptive to optimize test information. Currently, there may not be enough PT items for adaptive item selection to be useful.

  • Interim \(\hat{\theta}\) is estimated with a 1-step Newton-Raphson MAP, with the previous \(\hat{\theta}\) as the starting ability. This may be ideal for processing speed, but the accuracy may be poor. Adding more iterations may improve the item selection quality.

  • The CAT exam termination rule only considers the blueprint. We may consider other termination rules, such as meeting a minimum standard error of measurement.
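The one-step Newton-Raphson MAP update for the interim \(\hat{\theta}\) can be illustrated as follows, assuming a 2PL model and a standard normal prior (both are assumptions; the operational model and prior may differ):

```python
import math

def map_one_step(theta0, responses, a, b, prior_mean=0.0, prior_sd=1.0):
    """One Newton-Raphson step toward the MAP estimate of theta.

    theta0: previous interim theta-hat, used as the starting value
    responses: 0/1 item scores; a, b: 2PL discriminations and difficulties
    """
    # First and second derivatives of the log-posterior at theta0.
    d1 = -(theta0 - prior_mean) / prior_sd**2
    d2 = -1.0 / prior_sd**2
    for u, ai, bi in zip(responses, a, b):
        p = 1.0 / (1.0 + math.exp(-ai * (theta0 - bi)))  # 2PL P(correct)
        d1 += ai * (u - p)           # score (gradient) contribution
        d2 -= ai**2 * p * (1 - p)    # negative information contribution
    return theta0 - d1 / d2          # single Newton-Raphson update
```

Iterating this update until d1 is near zero would give the full MAP estimate; stopping after one step trades accuracy for processing speed.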

Implications for Item Development

  • The maximum number of items allowed for each ELA CAT stimulus is not directly specified in the blueprint. When a stimulus is selected, the algorithm will administer as many items as the blueprint allows. Therefore, SB must manually ensure that each stimulus set in the item pool has the minimum number of appropriate items.

  • Items with high and low discrimination are administered equally frequently. Students who, by chance, receive many low-discrimination items take the same number of items as students who receive highly discriminating ones, even though their ability is measured less precisely.

  • If Math CAT passage sets are introduced in the future, the blueprint as well as the other non-passage CAT items in the pool will need considerable manipulation.

  • Enemy items cannot be handled by the current algorithm.

CAT Simulators

Cambium Simulations

The CAT algorithm used in practice in real exams is slightly different from the one used in Cambium simulations. The initial starting theta is drawn from within +/- 1 of the student's true theta, rather than set to the average score from the previous year.

The Cambium simulator skips the selection decision between enhanced and non-enhanced blueprint. Only one blueprint is simulated at a time.

Through multiple simulations, Cambium selects various operational blueprint parameter values iteratively, manually, and somewhat subjectively. They consider 3 factors: meeting the blueprint requirements, raising test information, and balancing exposure (including field test item exposure). Above all, meeting the blueprint is the most important. Example parameters selected by Cambium:

  • Global bpweight (defaults to 1). The purpose of bpweight is to meet the blueprint requirements and to even out item exposure as much as possible. Cambium begins with a value of 1 and may adjust it later based on results, for example lowering the weight if the blueprint is not met, or increasing it to emphasize certain claims.

  • Global abilityweight (defaults to 1)

  • cset2initialrandom

  • cset1size

  • cset2random

  • Local bpweight (defaults to 1)

  • isstrictmax (claim-level categories are prioritized to be set to TRUE)

  • Field test weights \(a_j\). Cambium uses internally generated \(a_j\) to balance sample size.

  • Some blueprint categories not specified in the public blueprint, such as Math content domain. When domains have no items in the pool and SB has no intention of developing items in them, these categories may be removed to improve the blueprint.

Note that SB provided Cambium the public blueprint, and Cambium operationalized it with SB’s approval, but SB often did not seem to fully understand all the parameters. Regardless, the approval and responsibility are SB’s. For example, Math content domain and isstrictmax in the operational blueprint were selected by Cambium.

Smarter Balanced Simulations

The Smarter Balanced CAT Simulator is currently under development as an R package. It attempts to replicate Cambium’s simulator while providing extra flexibility in the test configurations. The simulator is designed around the Test Construction Database.

The Smarter Balanced CAT Simulator skips the selection decision between the enhanced and non-enhanced blueprint; only one blueprint is simulated at a time. Also, exposure control for PT items is done randomly in order to leverage parallel processing, rather than by adjusting the exposure after every administration as in real exams.

Smarter Balanced CAT Simulator ignores the order in which items within stimulus sets are administered. This ordering is configurable in the test packages.

References

The information in this document comes from a combination of the references below; other documents are often unhelpful or redundant. Note that the EFT PT sampling percentage can vary greatly from year to year; it is documented separately.

  1. For an algorithm overview, see Smarter Balanced Summative Assessments Simulation Results: 2021-2022

  2. For operational blueprint and parameters we use, see the historical test admin packages

  3. For equations, see Cohen, J., & Albright, L. (2014). Smarter Balanced Adaptive Item Selection Algorithm Design Report. Washington, D.C.

  4. For the off-grade administration rule, see Cohen, J., & Albright, L. (2014). Talking Points for Out of Grade Level Testing V1, July 14, 2014.

  5. Identifying patterns in the post-administration item response data helped us understand which items students actually receive; this supplemented and confirmed the other findings.

  6. Other undocumented details about the algorithm were provided by Cambium.