5.5 · D515 questions · 5 free

Human review workflows & confidence calibration

Design human review workflows and confidence calibration.

This subtopic (5.5) sits in Context Management & Reliability (D5) on Anthropic's Claude Certified Architect — Foundations (CCA-F) exam. The bank holds 15 practice questions here — 4 easy, 8 medium, and 3 hard — with 5 free to try, answers and explanations included. 3 of the free questions are below; the rest are in the practice stream.

What the exam tests here

stratified random sampling for measuring high-confidence extraction error rates
field-level confidence scores calibrated on labeled validation sets
routing low-confidence extractions to human review
aggregate accuracy masks poor performance on specific document types or fields

Practice this subtopic — 5 free Free questions with answers ↓

Free practice questions: Human review workflows & confidence calibration

Question 1 of 3 · free · medium

Your support agent shows 84% overall accuracy, meeting the 80% target. A sample audit reveals process_refund succeeds on standard returns but fails 60% of the time on billing disputes. What metric change best surfaces this gap?

Show answer & explanation

Correct answer: A. Track accuracy per request category so billing disputes and standard returns are scored separately

Aggregate accuracy hides category-level failures; segmenting by request type exposes the billing dispute gap directly. 'Raise the overall threshold' keeps measurement aggregate, not segmented. 'Add a confidence score threshold' is an escalation control, not a measurement fix. 'Increase the sample size' improves aggregate precision but still masks per-category variance.

Try it interactively →

Question 2 of 3 · free · medium

Your support agent reports 82% first-contact resolution overall. Stakeholders are satisfied. A spot check shows escalate_to_human is triggered on 70% of account-issue requests compared to 15% for return requests. What does this indicate?

Show answer & explanation

Correct answer: A. The aggregate metric masks a performance gap on account-issue requests specifically

An 82% aggregate can hide a severe gap in a specific request category — account issues are resolved at far lower rates than returns. 'The overall rate is meeting target' ignores the per-category signal entirely. 'The escalate_to_human tool is misconfigured' assumes a bug rather than a performance gap on a specific case type. 'The 80% target should be raised' is a governance decision unrelated to diagnosing the segmented failure.

Try it interactively →

Question 3 of 3 · free · medium

Claude Code's CLAUDE.md-configured refactoring workflow reports 92% overall correctness across 500 refactoring tasks. Your team wants to measure whether this accuracy holds specifically for high-confidence refactorings. Which approach is correct?

Show answer & explanation

Correct answer: B. Apply stratified random sampling within the high-confidence stratum to estimate its error rate separately

Stratified sampling within the high-confidence stratum produces a reliable estimate of that tier's error rate without reviewing all tasks. 'Sample randomly across all 500...' mixes confidence tiers and dilutes the high-confidence signal. 'Review every high-confidence refactoring...' is exhaustive and impractical at scale. 'Use 92% overall as a proxy...' assumes high-confidence tasks have the same error rate as the full population, which stratified sampling exists to disprove.

Try it interactively →

2 more free questions on this subtopic in the practice stream, plus 10 in the full bank. Keep practicing →

Human review workflows & confidence calibration

What the exam tests here

Free practice questions: Human review workflows & confidence calibration

Your support agent shows 84% overall accuracy, meeting the 80% target. A sample audit reveals process_refund succeeds on standard returns but fails 60% of the time on billing disputes. What metric change best surfaces this gap?

Your support agent reports 82% first-contact resolution overall. Stakeholders are satisfied. A spot check shows escalate_to_human is triggered on 70% of account-issue requests compared to 15% for return requests. What does this indicate?

Claude Code's CLAUDE.md-configured refactoring workflow reports 92% overall correctness across 500 refactoring tasks. Your team wants to measure whether this accuracy holds specifically for high-confidence refactorings. Which approach is correct?

Related reading (Anthropic docs)