Methodology Brief
Re-rating concordance protocol: blinded sub-sample re-evaluation for single-rater validation studies
Background
Dietary assessment validation frequently depends on a single trained rater who codes reference data (for example, matching photographed foods to database entries, assigning portion sizes, or classifying stratum membership). A single rater is often the only practical choice, but the design leaves the measurement reliability of the reference undocumented and offers no way to detect rater drift over the course of data collection.
A pragmatic alternative to full double-coding is a blinded re-rating protocol: the original rater (or, better, a second trained rater) re-codes a random sub-sample of items under conditions that prevent recall of the original coding, and agreement metrics are computed on the sub-sample. The Initiative’s convention is that single-rater validation studies must include such a re-rating sub-sample and must report the concordance results.
The Method
Sub-sample size. A minimum of 10% of items or 30 items, whichever is greater, is drawn at random from the full evaluation set. The random draw is performed after initial coding is complete, with a documented random seed.
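A minimal sketch of the draw, assuming item identifiers are held in a list; the function name, the seed value, and the 400-item example are illustrative, not prescribed:

```python
import numpy as np

def draw_rerating_subsample(item_ids, fraction=0.10, minimum=30, seed=20240115):
    """Draw max(10% of items, 30 items) at random, without replacement.
    The seed is recorded so the draw can be reproduced and audited."""
    n = max(int(np.ceil(fraction * len(item_ids))), minimum)
    rng = np.random.default_rng(seed)  # documented random seed
    return sorted(rng.choice(item_ids, size=n, replace=False))

# 400 coded items -> 40 re-rated (10% exceeds the 30-item floor)
subsample = draw_rerating_subsample(list(range(1, 401)))
```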
Blinding. Re-rating is performed at least 30 days after initial coding for the same rater, or by a different trained rater. The re-rater works from the same source material (photograph, diary entry) with all prior coding stripped from their view. The re-rater is not told which items are from the re-rating sub-sample.
Concordance metrics. For continuous outcomes (energy estimate, portion weight), the two-way random-effects, absolute-agreement intraclass correlation coefficient ICC(2,1) with 95% CI is computed. For categorical outcomes (stratum assignment, food identity), Cohen’s kappa with 95% CI is computed; for ordered categorical outcomes, weighted kappa with linear or quadratic weights as pre-specified.
Acceptance thresholds. Initiative default thresholds, applied to the point estimate, are ICC $\geq$ 0.80 for continuous outcomes and kappa $\geq$ 0.70 for categorical outcomes. Results below threshold trigger a full double-coding of the dataset before the primary analysis proceeds.
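A sketch of the metric computation and threshold gate, assuming paired codings are held in a long-format pandas DataFrame; `pingouin.intraclass_corr` reports the ICC2 row (two-way random effects, absolute agreement, single rater) with its 95% CI, and `sklearn.metrics.cohen_kappa_score` supplies plain and weighted kappa as point estimates (a bootstrap CI sketch follows the worked example). Column names and the frame layout are assumptions:

```python
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

ICC_THRESHOLD, KAPPA_THRESHOLD = 0.80, 0.70  # Initiative defaults

def continuous_concordance(long_df):
    """ICC(2,1) with 95% CI. `long_df` is long-format: one row per
    (item, rater) pair, with columns item_id, rater, energy_kcal."""
    icc = pg.intraclass_corr(data=long_df, targets="item_id",
                             raters="rater", ratings="energy_kcal")
    row = icc.loc[icc["Type"] == "ICC2"].iloc[0]  # two-way random, absolute agreement, single rater
    return row["ICC"], row["CI95%"], row["ICC"] >= ICC_THRESHOLD

def categorical_concordance(original, rerated, ordered=False):
    """Cohen's kappa; linear weights when the categories are ordered."""
    kappa = cohen_kappa_score(original, rerated,
                              weights="linear" if ordered else None)
    return kappa, kappa >= KAPPA_THRESHOLD

# A metric below its threshold triggers full double-coding of the
# dataset before the primary analysis proceeds.
```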
Drift check. The re-rating sub-sample is stratified in time across the data-collection window so that within-study drift can be detected. A formal test for drift compares the first third of re-rated items against the last third, using Fisher's exact test for categorical agreement or a suitable continuous analogue (for example, a two-sample comparison of the rater differences).
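A minimal sketch of the categorical drift test, assuming re-rated items arrive ordered by original coding date; the 2×2 table pits agree/disagree counts in the first third against the last:

```python
import numpy as np
from scipy.stats import fisher_exact

def drift_check(agree):
    """Agreement in the first vs last third of re-rated items, with items
    ordered by original coding date. `agree`: 1 = raters agreed, 0 = not."""
    agree = np.asarray(agree)
    k = len(agree) // 3
    table = [[agree[:k].sum(), k - agree[:k].sum()],
             [agree[-k:].sum(), k - agree[-k:].sum()]]
    return fisher_exact(table)  # the result's pvalue is the two-sided p

# For continuous outcomes, an analogous check is a two-sample test
# (e.g. Mann-Whitney) on the rater differences in the two thirds.
```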
Worked example
Consider a study in which a single rater coded 400 eating-occasion photographs for (a) estimated energy and (b) stratum assignment. A 10% re-rating sub-sample ($n = 40$) was drawn and re-coded by a second rater after an appropriate delay.
| Outcome | Metric | Value | 95% CI | Threshold | Pass? |
|---|---|---|---|---|---|
| Energy estimate (kcal) | ICC(2,1) | 0.89 | 0.81 to 0.94 | $\geq$ 0.80 | yes |
| Stratum assignment (6-level) | Cohen’s kappa | 0.82 | 0.68 to 0.93 | $\geq$ 0.70 | yes |
| Food identity (top-1 match) | Cohen’s kappa | 0.74 | 0.60 to 0.86 | $\geq$ 0.70 | marginal |
| Drift (first vs. last third, energy) | - | ns | - | - | no drift |
Food-identity concordance is close to threshold; the Initiative convention is to proceed but to flag this in the results and discussion, and to consider a supplementary analysis restricted to items where both raters agreed.
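Because `cohen_kappa_score` returns only a point estimate, kappa intervals like those in the table are typically obtained by resampling. A minimal percentile-bootstrap sketch over the 40 re-rated items (function name and bootstrap settings are illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_kappa_ci(a, b, n_boot=5000, alpha=0.05, seed=1):
    """Percentile-bootstrap 95% CI for Cohen's kappa on paired codings."""
    a, b = np.asarray(a), np.asarray(b)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(a), len(a))  # resample items with replacement
        stats.append(cohen_kappa_score(a[idx], b[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return cohen_kappa_score(a, b), (lo, hi)
```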
Common pitfalls
- Re-rating without a delay, so the original rater recalls the prior coding and inflates apparent agreement.
- Computing Pearson correlation instead of ICC for continuous outcomes. Pearson does not detect systematic bias between raters (a worked demonstration follows this list).
- Using unweighted kappa on ordered categorical outcomes, which penalises near-misses as severely as large disagreements.
- Failing to stratify the re-rating sub-sample in time, missing drift.
- Drawing the sub-sample non-randomly (“I’ll re-check the ones that looked tricky”). Selected sub-samples do not support generalisable reliability inference.
- Under-reporting: presenting only a point estimate without the confidence interval, or without the sub-sample size.
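To make the Pearson-versus-ICC pitfall concrete: the toy data below give one rater a constant +150 kcal offset. Pearson's r stays near 1 because it ignores the offset, while the absolute-agreement ICC(2,1) drops toward the 0.80 threshold, since it treats the offset as disagreement. All numbers here are illustrative:

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(7)
true_kcal = rng.normal(500, 150, 40)
rater_a = true_kcal + rng.normal(0, 10, 40)
rater_b = true_kcal + 150 + rng.normal(0, 10, 40)  # constant +150 kcal bias

print(np.corrcoef(rater_a, rater_b)[0, 1])  # ~0.99: Pearson is blind to the bias

long_df = pd.DataFrame({
    "item_id": np.tile(np.arange(40), 2),
    "rater":   ["a"] * 40 + ["b"] * 40,
    "kcal":    np.concatenate([rater_a, rater_b]),
})
icc = pg.intraclass_corr(long_df, targets="item_id", raters="rater", ratings="kcal")
print(icc.loc[icc["Type"] == "ICC2", "ICC"])  # ~0.8: absolute agreement penalises the bias
```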
Recommended reporting
- Sub-sample size and random-draw seed.
- Re-rater identity (same rater with delay, or different rater) and the length of the delay between codings.
- Concordance metric(s) with 95% CI and the acceptance threshold in force.
- Result of the drift check.
- Action taken if thresholds were not met (for example, full double-coding, restricted analysis).
- A supplementary table listing re-rated items with both codings, to permit re-analysis.
Keywords
re-rating; inter-rater reliability; concordance; ICC; blinding; validation; reference coding; quality control
License
This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).