The Art and Science of SBA Standard Setting: Beyond the Arbitrary Cut Score

Matt Bowker · Mar 9, 2025 · 7 min read

Have you ever wondered how examiners decide where to place the pass/fail boundary on your exam? It's not (or at least shouldn't be) a matter of simply declaring "70% is a pass" or drawing an arbitrary line based on how difficult they think the test was. There's actually a whole field dedicated to this process, known as standard setting.

Why Standard Setting Matters

Standard setting is fundamental to assessment validity. Without a defensible method for determining the pass mark, exams lose credibility and fairness. Consider these scenarios:

An arbitrary 70% pass mark on an unusually difficult exam could fail competent candidates
A norm-referenced approach (passing the top 80%) might fail competent students in a strong cohort
A subjective "this feels about right" approach lacks transparency and defensibility

The goal of proper standard setting is to establish a cut score that accurately separates those who are minimally competent from those who aren't, independent of how other candidates performed and of any arbitrary percentage.

Criterion-Referenced vs. Norm-Referenced Standards

Before diving into methods, it's worth understanding two fundamental approaches:

Criterion-referenced standards focus on what candidates should know or be able to do. They're absolute standards based on content mastery, not relative performance. These are generally preferred in healthcare education because they maintain consistent expectations of competence regardless of cohort strength.

Norm-referenced standards compare candidates against each other. They rank examinees and determine passing based on relative performance (e.g., passing the top 70%). While this ensures a predictable pass rate, it means competence requirements fluctuate depending on cohort performance.
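To make the contrast concrete, here is a minimal Python sketch that applies both approaches to two invented cohorts. The scores, the fixed cut of 60, and the 70% pass fraction are illustrative assumptions, not figures from any real exam.

```python
def criterion_pass_count(scores, cut):
    """Criterion-referenced: everyone at or above a fixed cut passes."""
    return sum(s >= cut for s in scores)


def norm_referenced_cut(scores, pass_fraction):
    """Norm-referenced: the cut is whatever score the last passing candidate got."""
    ranked = sorted(scores, reverse=True)
    n_pass = max(1, round(len(ranked) * pass_fraction))
    return ranked[n_pass - 1]


# Two hypothetical cohorts; the strong cohort scores 15 points higher throughout.
weak_cohort = [40, 45, 50, 55, 60, 62, 64, 66, 68, 70]
strong_cohort = [s + 15 for s in weak_cohort]

for name, cohort in [("weak", weak_cohort), ("strong", strong_cohort)]:
    fixed = criterion_pass_count(cohort, cut=60)
    moving_cut = norm_referenced_cut(cohort, pass_fraction=0.7)
    print(f"{name}: {fixed} pass at a fixed cut of 60; top-70% rule puts the cut at {moving_cut}")
```

With the fixed cut, the number of passes changes with cohort strength while the required score stays put. With the top-70% rule, the pass count is fixed and the required score moves instead.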

Healthcare education has largely embraced criterion-referenced methods because they align with the concept that anyone who achieves minimal competence should pass. The question becomes: how do we determine what minimal competence looks like?

Common Standard Setting Methods for SBAs

The Angoff Method

The Angoff method is perhaps the most widely used standard setting approach for SBA exams. Here's how it works:

A panel of subject matter experts (judges) reviews each exam question
For each question, judges estimate the probability that a "minimally competent" candidate would answer correctly
These probabilities are averaged across judges for each item
The sum of these averages becomes the cut score

For example, if ten judges estimate, on average, that a borderline candidate has a 70% chance of answering a particular question correctly, the Angoff value for that item is 0.7. If the exam has 100 questions and the average Angoff value across all items is 0.65, the item values sum to 65, so the cut score is 65 marks out of 100 (65%).
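The arithmetic can be sketched in a few lines of Python. This is a minimal illustration with three hypothetical judges and four items; the ratings are invented, and real panels usually add discussion rounds that this sketch ignores.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """ratings[j][i] is judge j's estimate (0 to 1) that a borderline
    candidate answers item i correctly."""
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    # Average across judges for each item, then sum across items.
    item_values = [mean(judge[i] for judge in ratings) for i in range(n_items)]
    return sum(item_values), item_values


# Hypothetical ratings: one row per judge, one column per item.
ratings = [
    [0.70, 0.50, 0.90, 0.60],  # judge A
    [0.80, 0.40, 0.85, 0.55],  # judge B
    [0.60, 0.60, 0.95, 0.65],  # judge C
]

cut, item_values = angoff_cut_score(ratings)
print(f"Cut score: {cut:.2f} of {len(item_values)} marks ({cut / len(item_values):.1%})")
```

Here the item values come out at about 0.70, 0.50, 0.90 and 0.60, so the cut score is roughly 2.7 marks out of 4.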

The strength of Angoff lies in its conceptual simplicity and direct connection to content. It focuses precisely on minimal competence for each specific question. However, it does require judges to make somewhat abstract probability judgements, which can be challenging.

The Ebel Method

The Ebel method adds another dimension to standard setting by considering both difficulty and relevance:

Judges classify each question into a grid based on:
- Difficulty level (easy, moderate, hard)
- Importance (essential, important, acceptable, nice-to-know)

For each cell in the grid (e.g., "easy-essential"), judges estimate the percentage of borderline candidates who would answer correctly
The number of questions in each cell is multiplied by the expected proportion correct for that cell
These products are summed to determine the cut score

Ebel's approach acknowledges that not all content carries equal weight. A difficult question on essential material might be more important than an easy question on peripheral content. This nuance can create a more content-valid standard, though the process is more complex and time-consuming than Angoff.
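A rough Python sketch of the grid calculation follows. The cell counts and expected proportions are made up for a hypothetical 100-item paper, and only the cells that actually contain items are listed.

```python
def ebel_cut_score(grid):
    """grid maps (difficulty, importance) to a pair:
    (number of items in the cell, expected proportion correct for
    borderline candidates)."""
    total_items = sum(n for n, _ in grid.values())
    # Multiply item count by expected proportion in each cell, then sum.
    expected_marks = sum(n * p for n, p in grid.values())
    return expected_marks, expected_marks / total_items


# Hypothetical panel output for a 100-item paper.
grid = {
    ("easy", "essential"): (20, 0.90),
    ("moderate", "essential"): (30, 0.70),
    ("hard", "essential"): (10, 0.50),
    ("easy", "important"): (15, 0.80),
    ("moderate", "important"): (15, 0.60),
    ("hard", "acceptable"): (10, 0.30),
}

marks, proportion = ebel_cut_score(grid)
print(f"Cut score: {marks:.1f} marks ({proportion:.0%})")
```

With these invented figures the products are 18, 21, 5, 12, 9 and 3, giving a cut score of 68 marks out of 100.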

The Hofstee Method

The Hofstee method offers a compromise between criterion-referenced and norm-referenced approaches:

Judges are asked four key questions:
- What is the maximum acceptable cut score? (even if most fail)
- What is the minimum acceptable cut score? (even if most pass)
- What is the maximum acceptable fail rate?
- What is the minimum acceptable fail rate?

These boundaries create an "acceptable zone" on a graph
The actual performance distribution is plotted
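The cut score is set where the cumulative score distribution (the fail rate at each possible cut score) crosses the diagonal of the acceptable zone, drawn from the minimum cut score at the maximum fail rate to the maximum cut score at the minimum fail rate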

Hofstee is often used as a reality check alongside other methods. It acknowledges both content standards and practical considerations. For instance, if an Angoff cut score would result in failing 80% of candidates, Hofstee might suggest moderating that extreme outcome.
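As a sketch of how that intersection might be found numerically, the Python below scans candidate cut scores inside the acceptable zone. It assumes the usual convention of drawing a diagonal from (minimum cut, maximum fail rate) to (maximum cut, minimum fail rate), treats a candidate scoring exactly the cut as a pass, and uses an invented cohort and invented judge limits. Returning `None` when the curve misses the zone is a choice made for this sketch; boards handle that case in different ways.

```python
def hofstee_cut_score(scores, c_min, c_max, f_min, f_max, step=0.1):
    """All arguments are percentages (0 to 100). Returns the first cut score
    at which the observed fail rate reaches the diagonal, or None."""
    if c_max <= c_min:
        raise ValueError("c_max must be greater than c_min")
    n = len(scores)

    def fail_rate(cut):
        # Percentage of candidates scoring below the cut.
        return 100.0 * sum(s < cut for s in scores) / n

    def diagonal(cut):
        # Straight line from (c_min, f_max) down to (c_max, f_min).
        return f_max + (f_min - f_max) * (cut - c_min) / (c_max - c_min)

    n_steps = int(round((c_max - c_min) / step))
    for k in range(n_steps + 1):
        cut = c_min + k * step
        if fail_rate(cut) >= diagonal(cut):
            return round(cut, 1)
    return None  # the curve never reaches the diagonal inside the zone


# Hypothetical cohort of 20 candidates (percentage scores).
scores = [50, 55, 58, 60, 62, 64, 66, 68, 70, 72,
          74, 76, 78, 80, 82, 84, 86, 88, 90, 92]

cut = hofstee_cut_score(scores, c_min=55, c_max=70, f_min=5, f_max=32)
print(f"Hofstee cut score: {cut}%")  # 61.7 for this toy cohort
```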

The Reality of Standard Setting in Practice

In the real world, standard setting is rarely as clean as textbooks suggest. Examination boards often use multiple methods in parallel and compare results. They might start with Angoff, check against Hofstee, and use their professional judgement to reconcile differences.

Some practical considerations that complicate standard setting:

Varying judge expertise - Some judges may be more familiar with certain content areas
Judge severity/leniency - Individual judges may systematically estimate higher or lower probabilities
Conceptualising borderline candidates - Different judges may envision "minimal competence" differently
Real-world consequences - High-stakes examinations must balance rigour with reasonable pass rates

The best standard setting exercises acknowledge these complexities and incorporate deliberation, iterative processes, and multiple methodologies.
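One simple way to surface judge severity and disagreement before the panel discusses its ratings is sketched below in Python. The judge names, ratings, and the 0.15 spread threshold are all hypothetical; a real exercise would pick its own criteria.

```python
from statistics import mean, pstdev


def judge_offsets(ratings):
    """ratings[name] is that judge's list of Angoff estimates, one per item.
    Positive offset: the judge rates higher than the panel average, which
    pushes the cut score up."""
    panel_mean = mean(mean(r) for r in ratings.values())
    return {name: mean(r) - panel_mean for name, r in ratings.items()}


def items_to_discuss(ratings, max_sd=0.15):
    """Indices of items where the judges' estimates are widely spread."""
    per_item = zip(*ratings.values())
    return [i for i, estimates in enumerate(per_item) if pstdev(estimates) > max_sd]


# Hypothetical first-round ratings for three items.
ratings = {
    "A": [0.70, 0.50, 0.90],
    "B": [0.80, 0.60, 0.90],
    "C": [0.50, 0.20, 0.80],
}

offsets = judge_offsets(ratings)
flagged = items_to_discuss(ratings)
for name, offset in offsets.items():
    print(f"judge {name}: {offset:+.2f} relative to the panel mean")
print("items worth discussing:", flagged)
```

In this toy data judge C sits well below the other two, and the second item (index 1) shows the widest spread, so both would be natural starting points for deliberation.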

What Happens After Setting Standards? Item Analysis

Standard setting doesn't end with establishing a cut score. After an exam, psychometric analysis helps identify potentially problematic questions through several key metrics:

Item Difficulty (p-value)

The p-value simply indicates the proportion of candidates who answered correctly. A p-value of 0.75 means 75% of test-takers got the question right.

Questions with extremely high p-values (>0.95) might be too easy to discriminate effectively, while those with very low p-values (<0.30) might be too difficult or poorly taught. However, these boundaries aren't absolute—a very easy question might be appropriate if testing essential knowledge everyone should have.
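Computing p-values from a scored response matrix is straightforward. The sketch below uses a tiny invented dataset and the thresholds quoted above; the flag labels are only prompts for review, not verdicts.

```python
def item_difficulty(responses):
    """responses[c][i] is 1 if candidate c answered item i correctly, else 0.
    Returns one p-value per item."""
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_candidates for i in range(n_items)]


def flag_items(p_values, too_easy=0.95, too_hard=0.30):
    """Map item index to a review note for items outside the usual range."""
    flags = {}
    for i, p in enumerate(p_values):
        if p > too_easy:
            flags[i] = "very easy"
        elif p < too_hard:
            flags[i] = "very hard"
    return flags


# Hypothetical scored responses: four candidates, three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

p_values = item_difficulty(responses)
print(p_values)              # [1.0, 0.75, 0.25]
print(flag_items(p_values))  # {0: 'very easy', 2: 'very hard'}
```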

Discrimination Index

The discrimination index (often measured as the point-biserial correlation) indicates how well a question differentiates between high and low performers. It correlates performance on the specific item (correct or incorrect) with overall test performance.

A positive discrimination value means stronger students tended to answer correctly while weaker students didn't. Values above +0.30 are generally considered good. Values near zero suggest the question isn't discriminating effectively. Negative values are red flags—they suggest something fundamentally wrong, like a miskeyed answer.
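For readers who want to see the calculation, here is a minimal Python sketch of the point-biserial correlation using only the standard library. The item responses and total scores are invented. The sketch correlates the item with the full total score; some analysts prefer to remove the item from the total first, which you could do by passing in rest scores instead.

```python
from statistics import mean, pstdev


def point_biserial(item, totals):
    """item[c] is 1 (correct) or 0 (incorrect) for candidate c;
    totals[c] is that candidate's total test score.
    Returns None when the correlation is undefined."""
    p = mean(item)
    sd = pstdev(totals)
    if p in (0, 1) or sd == 0:
        return None  # everyone right, everyone wrong, or no score variation
    mean_correct = mean(t for x, t in zip(item, totals) if x == 1)
    mean_incorrect = mean(t for x, t in zip(item, totals) if x == 0)
    return (mean_correct - mean_incorrect) / sd * (p * (1 - p)) ** 0.5


# Hypothetical data for six candidates.
totals = [9, 8, 7, 5, 4, 3]
well_keyed = [1, 1, 1, 0, 0, 0]  # stronger candidates got it right
miskeyed = [0, 0, 0, 1, 1, 1]    # same item scored against the wrong key

print(f"{point_biserial(well_keyed, totals):+.2f}")
print(f"{point_biserial(miskeyed, totals):+.2f}")
```

The two results are equal in size and opposite in sign, which is the pattern a miskeyed answer tends to produce.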

Distractor Analysis

In SBAs, examining how incorrect options performed can reveal much about question quality:

Non-functioning distractors (options no one selected) effectively reduce the question to fewer options
Distractors that attract strong students might be partially correct or ambiguous
If more high-performing students choose an incorrect option than the intended answer, the question likely has serious issues
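These checks can be sketched as a small Python routine that tallies each option overall and within the top and bottom scoring groups. The 27% group size is a common convention that this sketch assumes rather than one stated above, and the responses are invented.

```python
from collections import Counter


def distractor_table(choices, totals, options="ABCD", group_fraction=0.27):
    """choices[c] is the option candidate c picked; totals[c] is their total score.
    Returns counts per option: overall, in the top group, in the bottom group."""
    ranked = sorted(range(len(choices)), key=lambda c: totals[c], reverse=True)
    k = max(1, round(len(ranked) * group_fraction))
    overall = Counter(choices)
    top = Counter(choices[c] for c in ranked[:k])
    bottom = Counter(choices[c] for c in ranked[-k:])
    return {o: {"all": overall[o], "top": top[o], "bottom": bottom[o]} for o in options}


def review_flags(table, key):
    """List distractors that deserve a second look."""
    flags = []
    for option, counts in table.items():
        if option == key:
            continue
        if counts["all"] == 0:
            flags.append(f"{option}: non-functioning (never selected)")
        elif counts["top"] > table[key]["top"]:
            flags.append(f"{option}: chosen by more top scorers than the key")
        elif counts["top"] > counts["bottom"]:
            flags.append(f"{option}: attracts stronger candidates")
    return flags


# Hypothetical item with key B, answered by ten candidates.
totals = [95, 90, 85, 80, 70, 65, 60, 50, 45, 40]
choices = ["B", "B", "C", "B", "B", "C", "B", "A", "A", "A"]

table = distractor_table(choices, totals)
flags = review_flags(table, key="B")
for flag in flags:
    print(flag)
```

In this toy example option D is never chosen and option C draws a top-group candidate but no bottom-group candidates, so both are flagged for review.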

When Standard Setting Goes Wrong

Poor standard setting processes can undermine assessment validity in numerous ways:

Insufficient judge training - Judges must understand what "minimal competence" means in context
Inadequate range of experts - Panels should include diverse expertise and perspectives
Over-reliance on statistics - Numbers should inform judgement, not replace it
Ignoring item flaws - Problematic questions identified through psychometric analysis should be addressed
Failing to document process - Standard setting decisions should be transparent and defensible

The consequences of flawed standard setting can be severe: qualified candidates might fail, unqualified ones might pass, and the entire examination system loses credibility.

Implications for Candidates

Understanding how standard setting works has practical implications for exam preparation:

Focus on competence, not competition - In criterion-referenced exams, you're measured against a standard, not your peers
Target core content - Items classified as "essential" likely carry more weight in standard setting
Know what "good enough" looks like - Understanding minimal competence helps calibrate your preparation
Trust the process (mostly) - Well-designed standard setting processes should identify and address problematic questions

Conclusion

Standard setting is both art and science—a blend of expert judgement, statistical analysis, and educational philosophy. When done properly, it ensures that pass/fail decisions reflect genuine competence rather than arbitrary thresholds or relative ranking.

For those creating assessments, investing in proper standard setting is essential for credibility and fairness. For those taking exams, understanding these processes provides context for how your performance is evaluated.

The next time you see a cut score of 67% rather than a neat 70%, remember there's likely a thoughtful, evidence-based process behind that seemingly odd number—one designed to separate those who are competent from those who aren't, as fairly and accurately as possible.

Did you find this post helpful? Check out our question banks at PrepForCHSE.com for targeted exam preparation.

APA

Bowker, M. (2025). The Art and Science of SBA Standard Setting: Beyond the Arbitrary Cut Score. https://prepforchse.com/blog/the-art-and-science-of-sba-standard-setting-beyond-the-arbitrary-cut-score

MLA

Bowker, Matt. "The Art and Science of SBA Standard Setting: Beyond the Arbitrary Cut Score." 09 Mar 2025, https://prepforchse.com/blog/the-art-and-science-of-sba-standard-setting-beyond-the-arbitrary-cut-score

Written by Matt Bowker

Dr. Matt Bowker is a simulation educator with over a decade of experience in healthcare simulation across multiple continents and student groups.
