Human review workflows & confidence calibration
Design human review workflows and confidence calibration.
This subtopic (5.5) sits in Context Management & Reliability (D5) on Anthropic's Claude Certified Architect — Foundations (CCA-F) exam. The bank holds 15 practice questions here (4 easy, 8 medium, and 3 hard), with 5 free to try, answers and explanations included. Three of the free questions are below; the other two are in the practice stream.
What the exam tests here
- stratified random sampling for measuring high-confidence extraction error rates
- field-level confidence scores calibrated on labeled validation sets
- routing low-confidence extractions to human review
- recognizing when aggregate accuracy masks poor performance on specific document types or fields
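The calibration and routing items above can be sketched in a few lines. This is a minimal illustration, not an exam-endorsed implementation: the function names `calibrate_thresholds` and `route`, the `(field, confidence, is_correct)` record shape, and the 0.95 target are all assumptions made for the example. The idea is to pick, per field, the lowest confidence cutoff that still meets a target accuracy on a labeled validation set, then send anything below that cutoff to a human.

```python
from collections import defaultdict

def calibrate_thresholds(validation, target_accuracy=0.95):
    """Per field, find the lowest confidence cutoff whose accepted
    predictions meet target_accuracy on the labeled validation set.

    validation: iterable of (field, confidence, is_correct) tuples (assumed shape).
    """
    by_field = defaultdict(list)
    for field, confidence, is_correct in validation:
        by_field[field].append((confidence, is_correct))

    thresholds = {}
    for field, rows in by_field.items():
        rows.sort(key=lambda r: r[0], reverse=True)  # highest confidence first
        correct = 0
        best = None
        for i, (confidence, is_correct) in enumerate(rows, start=1):
            correct += int(is_correct)
            # Only evaluate a cutoff once every row sharing that score is counted.
            end_of_tie = i == len(rows) or rows[i][0] != confidence
            if end_of_tie and correct / i >= target_accuracy:
                best = confidence
        # If no cutoff meets the target, every value of this field goes to review.
        thresholds[field] = best if best is not None else float("inf")
    return thresholds

def route(extraction, thresholds):
    """Return the fields of one extraction that need human review.

    extraction: dict mapping field -> (value, confidence) (assumed shape).
    Fields with no calibrated threshold are always reviewed.
    """
    return [
        field
        for field, (_value, confidence) in extraction.items()
        if confidence < thresholds.get(field, float("inf"))
    ]

# Made-up validation labels, for illustration only.
validation = [
    ("total", 0.99, True), ("total", 0.95, True),
    ("total", 0.90, True), ("total", 0.60, False),
    ("date", 0.90, False), ("date", 0.80, True),
]
thresholds = calibrate_thresholds(validation)
needs_review = route(
    {"total": ("12.50", 0.93), "date": ("2024-01-01", 0.99)}, thresholds
)
```

Here the `date` field never reaches the target on the validation set, so it is routed to review even at a raw confidence of 0.99. That is the reason for calibrating per field rather than trusting one global cutoff.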
Free practice questions: Human review workflows & confidence calibration
Your support agent shows 84% overall accuracy, meeting the 80% target. A sample audit reveals process_refund succeeds on standard returns but fails 60% of the time on billing disputes. What metric change best surfaces this gap?
Correct answer: A. Track accuracy per request category so billing disputes and standard returns are scored separately
Aggregate accuracy hides category-level failures; segmenting by request type exposes the billing dispute gap directly. 'Raise the overall threshold' keeps measurement aggregate, not segmented. 'Add a confidence score threshold' is an escalation control, not a measurement fix. 'Increase the sample size' improves aggregate precision but still masks per-category variance.
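To make the segmentation concrete, here is a small sketch of per-category scoring. The counts are invented to match the question's headline figures (84% overall, billing disputes failing 60% of the time); the function name `accuracy_by_category` and the `(category, is_correct)` record shape are assumptions for illustration.

```python
from collections import Counter

def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}) for labeled outcomes.

    records: iterable of (category, is_correct) pairs (assumed shape).
    """
    totals, correct = Counter(), Counter()
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_category = {c: correct[c] / totals[c] for c in totals}
    return overall, per_category

# Hypothetical audit: 80 standard returns (76 correct), 20 billing disputes (8 correct).
records = (
    [("standard_return", True)] * 76 + [("standard_return", False)] * 4
    + [("billing_dispute", True)] * 8 + [("billing_dispute", False)] * 12
)
overall, per_category = accuracy_by_category(records)
# overall -> 0.84
# per_category -> {"standard_return": 0.95, "billing_dispute": 0.4}
```

The aggregate clears an 80% target only because standard returns dominate the volume; the billing-dispute figure is visible only once the two categories are scored separately.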
Your support agent reports 82% first-contact resolution overall. Stakeholders are satisfied. A spot check shows escalate_to_human is triggered on 70% of account-issue requests compared to 15% for return requests. What does this indicate?
Correct answer: A. The aggregate metric masks a performance gap on account-issue requests specifically
An 82% aggregate can hide a severe gap in one request category: account issues are escalated to a human far more often than returns (70% versus 15%), so far fewer of them are resolved on first contact. 'The overall rate is meeting target' ignores the per-category signal entirely. 'The escalate_to_human tool is misconfigured' assumes a bug rather than a performance gap on a specific case type. 'The 80% target should be raised' is a governance decision unrelated to diagnosing the segmented failure.
Claude Code's CLAUDE.md-configured refactoring workflow reports 92% overall correctness across 500 refactoring tasks. Your team wants to measure whether this accuracy holds specifically for high-confidence refactorings. Which approach is correct?
Correct answer: B. Apply stratified random sampling within the high-confidence stratum to estimate its error rate separately
Random sampling within the high-confidence stratum produces a reliable estimate of that tier's error rate without reviewing every task. 'Sample randomly across all 500...' mixes confidence tiers and dilutes the high-confidence signal. 'Review every high-confidence refactoring...' is exhaustive and impractical at scale. 'Use 92% overall as a proxy...' assumes high-confidence tasks have the same error rate as the full population, which is exactly the assumption that stratified sampling is meant to test.
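A rough sketch of that approach follows, under stated assumptions: the 0.9 cutoff that defines the "high" tier, the task records, the sample size of 30, and the count of 2 errors found are all hypothetical, and the helper names are invented for the example. It groups tasks by confidence tier, draws a reproducible random sample from each tier, and reports a Wilson score interval so the estimate comes with its uncertainty.

```python
import math
import random

def sample_by_stratum(tasks, stratum_of, per_stratum, seed=0):
    """Draw a simple random sample of up to per_stratum tasks from each stratum."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata = {}
    for task in tasks:
        strata.setdefault(stratum_of(task), []).append(task)
    return {
        name: rng.sample(members, min(per_stratum, len(members)))
        for name, members in strata.items()
    }

def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error rate of errors/n."""
    if n == 0:
        raise ValueError("cannot estimate an error rate from an empty sample")
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def confidence_tier(task):
    """Hypothetical tiering rule: 0.9 and above counts as high confidence."""
    return "high" if task["confidence"] >= 0.9 else "other"

# Hypothetical data: 500 tasks with made-up confidence scores.
tasks = [{"id": i, "confidence": (i % 100) / 100} for i in range(500)]
samples = sample_by_stratum(tasks, confidence_tier, per_stratum=30)

# Reviewers would label samples["high"]; suppose they find 2 errors in those 30.
low, high = wilson_interval(errors=2, n=30)
```

With a sample this small the interval is wide, which is useful information in itself: it shows how many more high-confidence tasks need review before the tier's error rate can be trusted enough to skip human checks.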
Two more free questions on this subtopic are in the practice stream, plus 10 in the full bank.