The Art and Science of SBA Standard Setting: Beyond the Arbitrary Cut Score

Matt Bowker · Mar 9, 2025 · 7 min read

Have you ever wondered how examiners decide where to place the pass/fail boundary on your exam? It's not (or at least shouldn't be) a matter of simply declaring "70% is a pass" or drawing an arbitrary line based on how difficult they think the test was. There's actually a whole field dedicated to this process, known as standard setting.

Why Standard Setting Matters

Standard setting is fundamental to assessment validity. Without a defensible method for determining the pass mark, exams lose credibility and fairness. Consider these scenarios:

  1. An arbitrary 70% pass mark on an unusually difficult exam could fail competent candidates
  2. A norm-referenced approach (passing the top 80%) might fail competent students in a strong cohort
  3. A subjective "this feels about right" approach lacks transparency and defensibility

The goal of proper standard setting is to establish a cut score that accurately separates those who are minimally competent from those who aren't—regardless of how others performed or an arbitrary percentage.

Criterion-Referenced vs. Norm-Referenced Standards

Before diving into methods, it's worth understanding two fundamental approaches:

Criterion-referenced standards focus on what candidates should know or be able to do. They're absolute standards based on content mastery, not relative performance. These are generally preferred in healthcare education because they maintain consistent expectations of competence regardless of cohort strength.

Norm-referenced standards compare candidates against each other. They rank examinees and determine passing based on relative performance (e.g., passing the top 70%). While this ensures a predictable pass rate, it means competence requirements fluctuate depending on cohort performance.
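To make the contrast concrete, here is a minimal Python sketch. The two cohorts, the fixed cut score of 60 and the 70% pass quota are all invented for illustration, and the helper names are hypothetical.

```python
def criterion_pass(scores, cut_score):
    """Criterion-referenced: pass everyone at or above a fixed standard."""
    return [s >= cut_score for s in scores]


def norm_referenced_cut(scores, pass_fraction):
    """Norm-referenced: the lowest score that still lands in the top `pass_fraction`."""
    ranked = sorted(scores, reverse=True)
    n_pass = max(1, round(len(ranked) * pass_fraction))
    return ranked[n_pass - 1]


# Hypothetical cohorts (percent scores), invented for illustration.
weak_cohort = [42, 48, 51, 55, 58, 61, 63, 66, 70, 74]
strong_cohort = [62, 65, 68, 71, 74, 77, 80, 83, 86, 90]

# Passing the top 70% moves the bar with the cohort.
weak_cut = norm_referenced_cut(weak_cohort, 0.70)
strong_cut = norm_referenced_cut(strong_cohort, 0.70)

# A fixed standard of 60 keeps the bar still and lets the pass rate move instead.
weak_passes = sum(criterion_pass(weak_cohort, 60))
strong_passes = sum(criterion_pass(strong_cohort, 60))

print(weak_cut, strong_cut, weak_passes, strong_passes)
```

With these made-up numbers the norm-referenced bar sits at 55 for the weaker cohort and 71 for the stronger one, while the fixed standard passes 5 and 10 candidates respectively.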

Healthcare education has largely embraced criterion-referenced methods because they align with the concept that anyone who achieves minimal competence should pass. The question becomes: how do we determine what minimal competence looks like?

Common Standard Setting Methods for SBAs

The Angoff Method

The Angoff method is perhaps the most widely used standard setting approach for SBA exams. Here's how it works:

  1. A panel of subject matter experts (judges) reviews each exam question
  2. For each question, judges estimate the probability that a "minimally competent" candidate would answer correctly
  3. These probabilities are averaged across judges for each item
  4. The sum of these averages becomes the cut score

For example, if the estimates from ten judges for a particular question average 70%, the Angoff value for that item would be 0.7. If the exam has 100 one-mark questions and the Angoff values sum to 65 (an average of 0.65 per item), then the cut score would be 65 marks, or 65%.
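The arithmetic is easy to sketch in Python. The ratings below are invented (three judges, four items), and `angoff_cut_score` is a hypothetical helper name rather than anything from a particular psychometrics library.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return (cut score in marks, per-item Angoff values).

    ratings[j][i] is judge j's estimated probability that a borderline
    candidate answers item i correctly.
    """
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    # Average across judges for each item, then sum across items.
    item_values = [mean(judge[i] for judge in ratings) for i in range(n_items)]
    return sum(item_values), item_values


# Hypothetical panel: 3 judges x 4 one-mark items.
ratings = [
    [0.80, 0.60, 0.50, 0.70],
    [0.70, 0.50, 0.40, 0.80],
    [0.90, 0.70, 0.60, 0.60],
]

cut_marks, item_values = angoff_cut_score(ratings)
cut_percent = 100 * cut_marks / len(item_values)
print(f"cut score: {cut_marks:.2f} of {len(item_values)} marks ({cut_percent:.0f}%)")
```

Here the item values come out at roughly 0.8, 0.6, 0.5 and 0.7, giving a cut score of about 2.6 marks out of 4, or 65%.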

The strength of Angoff lies in its conceptual simplicity and direct connection to content. It focuses precisely on minimal competence for each specific question. However, it does require judges to make somewhat abstract probability judgements, which can be challenging.

The Ebel Method

The Ebel method adds another dimension to standard setting by considering both difficulty and relevance:

  1. Judges classify each question into a grid based on:
    • Difficulty level (easy, moderate, hard)
    • Importance (essential, important, acceptable, nice-to-know)
  2. For each cell in the grid (e.g., "easy-essential"), judges estimate the percentage of borderline candidates who would answer correctly
  3. The number of questions in each cell is multiplied by the expected performance
  4. These products are summed to determine the cut score

Ebel's approach acknowledges that not all content carries equal weight. A difficult question on essential material might be more important than an easy question on peripheral content. This nuance can create a more content-valid standard, though the process is more complex and time-consuming than Angoff.
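As a rough sketch of the Ebel arithmetic in Python: the grid below uses only four cells, and both the item counts and the expected proportions are invented for illustration.

```python
# (difficulty, importance) -> number of items the judges placed in that cell.
# All figures are hypothetical.
item_counts = {
    ("easy", "essential"): 20,
    ("moderate", "essential"): 15,
    ("hard", "important"): 10,
    ("moderate", "acceptable"): 5,
}

# Judges' estimate of the proportion of borderline candidates
# who would answer an item in each cell correctly.
expected_correct = {
    ("easy", "essential"): 0.90,
    ("moderate", "essential"): 0.70,
    ("hard", "important"): 0.40,
    ("moderate", "acceptable"): 0.50,
}


def ebel_cut_score(item_counts, expected_correct):
    """Sum of (items in cell x expected proportion correct) over all cells."""
    return sum(n * expected_correct[cell] for cell, n in item_counts.items())


cut_marks = ebel_cut_score(item_counts, expected_correct)
total_items = sum(item_counts.values())
cut_percent = 100 * cut_marks / total_items
print(f"cut score: {cut_marks:.1f} of {total_items} marks ({cut_percent:.0f}%)")
```

With these numbers the cells contribute 18, 10.5, 4 and 2.5 marks, for a cut score of 35 out of 50 (70%).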

The Hofstee Method

The Hofstee method offers a compromise between criterion-referenced and norm-referenced approaches:

  1. Judges are asked four key questions:
    • What is the maximum acceptable cut score? (even if most fail)
    • What is the minimum acceptable cut score? (even if most pass)
    • What is the maximum acceptable fail rate?
    • What is the minimum acceptable fail rate?
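  2. These boundaries define a rectangular "acceptable zone" on a graph of fail rate against cut score
  3. The actual performance distribution is plotted as a curve showing the fail rate that each possible cut score would produce
  4. The cut score is taken where that curve crosses the diagonal of the acceptable zone, drawn from the corner at (minimum cut score, maximum fail rate) to the corner at (maximum cut score, minimum fail rate)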

Hofstee is often used as a reality check alongside other methods. It acknowledges both content standards and practical considerations. For instance, if an Angoff cut score would result in failing 80% of candidates, Hofstee might suggest moderating that extreme outcome.
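One common way to locate the Hofstee intersection is to draw the diagonal of the acceptable zone, from (minimum cut score, maximum fail rate) to (maximum cut score, minimum fail rate), and find where the observed fail-rate curve meets it. The Python sketch below does this on whole-percent cut scores. The scores and the judges' four limits are invented, and the fallback to the nearest boundary when the curve misses the zone is an assumption, since panels handle that case in different ways.

```python
def hofstee_cut_score(scores, min_cut, max_cut, min_fail, max_fail):
    """Whole-number cut score where the observed fail rate meets the diagonal.

    Cut scores are in the same units as `scores`; fail rates are proportions.
    """
    if max_cut <= min_cut:
        raise ValueError("max_cut must be greater than min_cut")
    n = len(scores)
    for cut in range(min_cut, max_cut + 1):
        # Candidates scoring below the cut would fail.
        observed_fail = sum(s < cut for s in scores) / n
        # The diagonal falls from max_fail at min_cut to min_fail at max_cut.
        allowed_fail = max_fail - (max_fail - min_fail) * (cut - min_cut) / (max_cut - min_cut)
        if observed_fail >= allowed_fail:
            return cut
    # Assumption: if the curve never reaches the diagonal, fall back to max_cut.
    return max_cut


# Hypothetical cohort of 20 candidates (percent scores).
scores = [45, 50, 52, 55, 58, 60, 62, 63, 65, 66,
          68, 70, 71, 73, 75, 78, 80, 82, 85, 90]

cut = hofstee_cut_score(scores, min_cut=55, max_cut=70, min_fail=0.05, max_fail=0.40)
print(cut)
```

For this made-up cohort the curve first reaches the diagonal at a cut score of 61, where 6 of the 20 candidates (30%) would fail.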

The Reality of Standard Setting in Practice

In the real world, standard setting is rarely as clean as textbooks suggest. Examination boards often use multiple methods in parallel and compare results. They might start with Angoff, check against Hofstee, and use their professional judgement to reconcile differences.

Some practical considerations that complicate standard setting:

  1. Varying judge expertise - Some judges may be more familiar with certain content areas
  2. Judge severity/leniency - Individual judges may systematically estimate higher or lower probabilities
  3. Conceptualising borderline candidates - Different judges may envision "minimal competence" differently
  4. Real-world consequences - High-stakes examinations must balance rigour with reasonable pass rates
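Judge severity, at least, is easy to inspect. A simple check (one of several possible, and not a formal rater model) is to compare each judge's mean Angoff rating with the panel mean. The judge labels and ratings in this Python sketch are invented.

```python
from statistics import mean


def judge_offsets(ratings):
    """Each judge's mean rating minus the panel mean of judge means.

    A positive offset means the judge expects more of borderline candidates,
    which pushes the cut score up (a stricter judge); a negative offset
    pulls it down (a more lenient judge).
    """
    judge_means = {judge: mean(values) for judge, values in ratings.items()}
    panel_mean = mean(list(judge_means.values()))
    return {judge: m - panel_mean for judge, m in judge_means.items()}


# Hypothetical Angoff ratings from three judges on the same four items.
ratings = {
    "judge_a": [0.80, 0.70, 0.90, 0.60],
    "judge_b": [0.60, 0.50, 0.70, 0.40],
    "judge_c": [0.70, 0.60, 0.80, 0.50],
}

offsets = judge_offsets(ratings)
for judge, offset in offsets.items():
    print(f"{judge}: {offset:+.2f}")
```

Large offsets are a prompt for discussion during panel deliberation rather than grounds for discarding a judge's ratings.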

The best standard setting exercises acknowledge these complexities and incorporate deliberation, iterative processes, and multiple methodologies.

What Happens After Setting Standards? Item Analysis

Standard setting doesn't end with establishing a cut score. After an exam, psychometric analysis helps identify potentially problematic questions through several key metrics:

Item Difficulty (p-value)

The p-value simply indicates the proportion of candidates who answered correctly. A p-value of 0.75 means 75% of test-takers got the question right.

Questions with extremely high p-values (>0.95) might be too easy to discriminate effectively, while those with very low p-values (<0.30) might be too difficult or poorly taught. However, these boundaries aren't absolute—a very easy question might be appropriate if testing essential knowledge everyone should have.
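Computing p-values from a scored response matrix takes only a few lines of Python. The matrix below is invented, and the 0.95 and 0.30 thresholds are simply the rough guides mentioned above, exposed as parameters so they can be changed.

```python
def item_difficulty(responses):
    """responses[c][i] is 1 if candidate c answered item i correctly, else 0."""
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_candidates for i in range(n_items)]


def flag_by_difficulty(p_values, too_easy=0.95, too_hard=0.30):
    """Map item index -> label for items outside the chosen p-value range."""
    flags = {}
    for i, p in enumerate(p_values):
        if p > too_easy:
            flags[i] = "very easy"
        elif p < too_hard:
            flags[i] = "very hard"
    return flags


# Hypothetical scored responses: 5 candidates x 4 items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

p_values = item_difficulty(responses)
flags = flag_by_difficulty(p_values)
print(p_values, flags)
```

Here the p-values are 1.0, 0.6, 0.2 and 0.8, so the first item is flagged as very easy and the third as very hard. A flag is a reason to review the item, not a verdict on it.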

Discrimination Index

The discrimination index (often measured as point-biserial correlation) indicates how well a question differentiates between high and low performers. It correlates performance on the specific item with overall test performance.

A positive discrimination value means stronger students tended to answer correctly while weaker students didn't. Values above +0.30 are generally considered good. Values near zero suggest the question isn't discriminating effectively. Negative values are red flags—they suggest something fundamentally wrong, like a miskeyed answer.
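A minimal Python sketch of the point-biserial calculation follows. It is just the Pearson correlation between 0/1 item scores and total scores; the data are invented, and the total here includes the item itself (some analyses exclude it, which this sketch does not do).

```python
from statistics import mean, pstdev


def point_biserial(item, totals):
    """Pearson correlation between 0/1 item scores and total test scores."""
    sd_item, sd_total = pstdev(item), pstdev(totals)
    if sd_item == 0 or sd_total == 0:
        # Undefined when everyone got the same item score or the same total.
        return float("nan")
    mean_item, mean_total = mean(item), mean(totals)
    covariance = mean((x - mean_item) * (t - mean_total) for x, t in zip(item, totals))
    return covariance / (sd_item * sd_total)


# Hypothetical data: 8 candidates, ordered from highest to lowest total score.
totals = [38, 35, 33, 30, 28, 25, 22, 20]
good_item = [1, 1, 1, 1, 0, 1, 0, 0]      # mostly answered correctly by high scorers
miskeyed_item = [0, 0, 0, 1, 0, 1, 1, 1]  # mostly answered "correctly" by low scorers

r_good = point_biserial(good_item, totals)
r_miskeyed = point_biserial(miskeyed_item, totals)
print(f"good item: {r_good:+.2f}, suspect item: {r_miskeyed:+.2f}")
```

The first item comes out strongly positive and the second strongly negative, which is the pattern that should trigger a check of the answer key.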

Distractor Analysis

In SBAs, examining how incorrect options performed can reveal much about question quality:

  • Non-functioning distractors (options no one selected) effectively reduce the question to fewer options
  • Distractors that attract strong students might be partially correct or ambiguous
  • If more high-performing students choose an incorrect option than the intended answer, the question likely has serious issues
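These checks can be sketched in Python as below. The responses are invented, the helper names are hypothetical, and defining "strong students" as the top 27% by total score is a conventional choice rather than a rule, so it is left as a parameter.

```python
from collections import Counter


def distractor_table(choices, totals, options, top_fraction=0.27):
    """Count how often each option was chosen overall and by the top scorers."""
    ranked = sorted(zip(totals, choices), reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    overall = Counter(choices)
    top = Counter(choice for _, choice in ranked[:n_top])
    return {opt: {"all": overall[opt], "top": top[opt]} for opt in options}


def flag_distractors(table, key):
    """List distractors nobody chose, or that beat the key among top scorers."""
    flags = []
    for opt, counts in table.items():
        if opt == key:
            continue
        if counts["all"] == 0:
            flags.append(f"{opt}: non-functioning distractor")
        elif counts["top"] > table[key]["top"]:
            flags.append(f"{opt}: chosen by more top scorers than the key")
    return flags


# Hypothetical item with key "B", answered by 10 candidates.
totals = [95, 90, 88, 80, 75, 70, 65, 60, 55, 50]
choices = ["C", "C", "B", "B", "B", "A", "B", "A", "B", "C"]

table = distractor_table(choices, totals, options="ABCD")
flags = flag_distractors(table, key="B")
print(table)
print(flags)
```

In this made-up item nobody chose D, and two of the three top scorers chose C rather than the keyed answer B, so both options are flagged for review.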

When Standard Setting Goes Wrong

Poor standard setting processes can undermine assessment validity in numerous ways:

  1. Insufficient judge training - Judges must understand what "minimal competence" means in context
  2. Inadequate range of experts - Panels should include diverse expertise and perspectives
  3. Over-reliance on statistics - Numbers should inform judgement, not replace it
  4. Ignoring item flaws - Problematic questions identified through psychometric analysis should be addressed
  5. Failing to document process - Standard setting decisions should be transparent and defensible

The consequences of flawed standard setting can be severe: qualified candidates might fail, unqualified ones might pass, and the entire examination system loses credibility.

Implications for Candidates

Understanding how standard setting works has practical implications for exam preparation:

  1. Focus on competence, not competition - In criterion-referenced exams, you're measured against a standard, not your peers
  2. Target core content - Items classified as "essential" likely carry more weight in standard setting
  3. Know what "good enough" looks like - Understanding minimal competence helps calibrate your preparation
  4. Trust the process (mostly) - Well-designed standard setting processes should identify and address problematic questions

Conclusion

Standard setting is both art and science—a blend of expert judgement, statistical analysis, and educational philosophy. When done properly, it ensures that pass/fail decisions reflect genuine competence rather than arbitrary thresholds or relative ranking.

For those creating assessments, investing in proper standard setting is essential for credibility and fairness. For those taking exams, understanding these processes provides context for how your performance is evaluated.

The next time you see a cut score of 67% rather than a neat 70%, remember there's likely a thoughtful, evidence-based process behind that seemingly odd number—one designed to separate those who are competent from those who aren't, as fairly and accurately as possible.


Written by Matt Bowker

Dr. Matt Bowker is a simulation educator with over a decade of experience in healthcare simulation across multiple continents and student groups.