Methodology Brief
Sample-size considerations for image-based dietary assessment validation studies
Background
Validation studies for image-based or AI-assisted dietary assessment tools frequently report sample sizes of 30 to 150 eating occasions without a documented planning basis. These studies usually aim to estimate an accuracy parameter rather than to test a hypothesis. Typical parameters are the mean absolute percentage error (MAPE), mean bias, or Bland-Altman limits of agreement (LoA). The appropriate planning framework is therefore precision-based (targeting the width of a confidence interval) rather than power-based. Precision-based planning nevertheless remains uncommon in the field.
The Initiative’s convention is that every Initiative-branded validation study pre-specifies the target precision for its primary accuracy outcome and chooses $n$ accordingly. Retrospective-only sample-size statements are not accepted as adequate.
The Method
Three precision targets are considered:
1. Width of the MAPE confidence interval. For a target half-width $w$ on MAPE (for example, $w = 3$ percentage points), the Initiative uses simulation-based planning. Closed-form expressions do not approximate the sampling distribution of MAPE well at moderate $n$. A reasonable heuristic, consistent with simulation results across dietary datasets, is that the 95% CI half-width scales approximately as $c / \sqrt{n}$. For typical food-photo datasets, $c$ lies between 15 and 25 (in units of percentage points $\times \sqrt{n}$). Solving $c / \sqrt{n} \leq w$ gives $n \geq (c / w)^2$. A target of $w = 3$ therefore implies $n$ of roughly 25 to 70 for a homogeneous stratum, and substantially more for a heterogeneous overall sample.
2. Width of the limits-of-agreement confidence interval. For a target LoA CI half-width of $h$ (in the outcome’s units), the Carkeet formulation implies, to a first approximation, $n \approx 3 (1.96\, s_d / h)^2$, where $s_d$ is the anticipated SD of the differences. This follows from the large-sample standard error of each limit, approximately $\sqrt{3 s_d^2 / n}$. For $s_d = 100$ kcal and $h = 25$ kcal, this yields $n \approx 3 \times 7.84^2 \approx 184.4$, so $n \approx 185$.
3. Category-stratified inference. If the protocol pre-specifies stratum-level accuracy reporting, each stratum of scientific interest requires its own $n \geq 30$, and the overall $n$ is at least $\sum n_{\text{stratum}}$ plus a buffer for stratum imbalance (Initiative convention: 15%).
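The simulation-based MAPE planning in item (1) can be sketched as follows. This is a minimal illustration, not the Initiative's procedure. It makes three simplifying choices. First, it assumes, hypothetically, that each occasion's percentage error is normally distributed with a chosen bias and SD. Second, it computes MAPE as the mean absolute percentage error of each simulated study. Third, it uses the central 95% range of the simulated MAPE values as a stand-in for the expected CI width. The function names and default settings are illustrative only.

```python
import random


def simulated_mape_half_width(n, bias_pct, sd_pct, reps=1000, seed=1):
    """Half-width of the central 95% range of MAPE across simulated studies of size n.

    Assumption (hypothetical error model): each occasion's percentage error is
    drawn from Normal(bias_pct, sd_pct). Resampled pilot errors could be used
    instead if real data are available.
    """
    rng = random.Random(seed)  # fixed seed: repeated calls give identical results
    mapes = []
    for _ in range(reps):
        abs_errors = [abs(rng.gauss(bias_pct, sd_pct)) for _ in range(n)]
        mapes.append(sum(abs_errors) / n)  # MAPE of one simulated study
    mapes.sort()
    lower = mapes[int(0.025 * reps)]       # approx. 2.5th percentile
    upper = mapes[int(0.975 * reps) - 1]   # approx. 97.5th percentile
    return (upper - lower) / 2


def smallest_n_for_mape(w, bias_pct, sd_pct, step=5, n_max=1000, **kwargs):
    """Smallest n on a grid of multiples of `step` whose simulated half-width is <= w."""
    for n in range(step, n_max + 1, step):
        if simulated_mape_half_width(n, bias_pct, sd_pct, **kwargs) <= w:
            return n
    return None  # target precision not reached within n_max


# Hypothetical inputs: no systematic bias, 20% SD of per-occasion percentage errors.
n_needed = smallest_n_for_mape(w=3.0, bias_pct=0.0, sd_pct=20.0)
print("smallest n on the grid:", n_needed)
```

Under this normal error model the implied $c$ is about $1.96 \times 20 \sqrt{1 - 2/\pi} \approx 23.6$, inside the 15 to 25 range quoted above. The search should therefore land near $(23.6 / 3)^2 \approx 62$, subject to simulation noise. Mixing strata with different error SDs widens the distribution of MAPE and raises $n$.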
The final planned $n$ is the maximum implied by (1), (2), and (3), rounded up to the nearest 10.
Worked example
Suppose a protocol declares:
- Primary outcome: MAPE on per-occasion energy estimate.
- Target MAPE CI half-width: 3 percentage points.
- Secondary outcome: Bland-Altman LoA on per-occasion energy.
- Expected $s_d$: 100 kcal. Target LoA CI half-width: 25 kcal.
- Stratified reporting for four cuisine strata with $n \geq 30$ each.
Sample-size components:
- (1) Simulation-based planning for MAPE CI: $n \approx 80$.
- (2) LoA CI half-width: $3 \times (1.96 \times 100 / 25)^2 \approx 184.4$, so $n \approx 185$.
- (3) Category stratification: $4 \times 30 = 120$, plus 15% buffer $\rightarrow 138$.
Planned $n = 190$, taken as the maximum (LoA-driven here) rounded up to the nearest 10.
A brief table:
| Constraint | Implied n |
|---|---|
| MAPE CI half-width 3 pp | 80 |
| LoA CI half-width 25 kcal | 185 |
| Stratified reporting, 4 strata | 138 |
| Final planned n | 190 |
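The component arithmetic and the combination rule (take the maximum, then round up to the nearest 10) can be checked with a short sketch. The function names are hypothetical. The MAPE component is entered directly, because the brief takes it from a simulation. The LoA component uses the approximation $n \approx 3 (1.96\, s_d / h)^2$ from item (2).

```python
import math


def loa_n(s_d, h, z=1.96):
    """n for a target LoA CI half-width h, using n ~ 3 * (z * s_d / h)^2."""
    return math.ceil(3 * (z * s_d / h) ** 2)


def mape_n_heuristic(w, c):
    """n for a target MAPE half-width w, if the half-width scales as c / sqrt(n)."""
    return math.ceil((c / w) ** 2)


def stratified_n(n_strata, n_per_stratum=30, buffer_pct=15):
    """Sum of per-stratum minimums plus a percentage buffer (integer percent avoids rounding drift)."""
    return math.ceil(n_strata * n_per_stratum * (100 + buffer_pct) / 100)


def planned_n(*components, round_to=10):
    """Maximum of the component n values, rounded up to a multiple of round_to."""
    return round_to * math.ceil(max(components) / round_to)


# Worked-example inputs.
n_mape = 80                    # taken from the simulation, not recomputed here
n_loa = loa_n(s_d=100, h=25)   # 185
n_strat = stratified_n(4)      # 138
print(n_mape, n_loa, n_strat, planned_n(n_mape, n_loa, n_strat))  # 80 185 138 190

# The heuristic range quoted in item (1) for w = 3:
print(mape_n_heuristic(3, 15), mape_n_heuristic(3, 25))  # 25 70
```

Keeping each constraint as its own function makes it easy to re-run the maximum when one planning assumption (for example, $s_d$) changes.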
Common pitfalls
- Planning only around MAPE and then reporting stratified results that are too imprecise for subgroup inference.
- Using a closed-form power calculation built for a different question (for example, a two-sample t-test) when the scientific question is the precision of a single-sample accuracy parameter.
- Treating each eating occasion as an independent unit when multiple occasions come from the same participant. Clustering reduces the effective sample size. The planned number of occasions should therefore be multiplied by the design effect $1 + (m - 1)\rho$, where $m$ is the number of occasions per participant and $\rho$ is the intraclass correlation.
- Over-rounding to a conveniently round $n$ without re-checking stratum counts.
- Taking $s_d$ from a pilot study of $n \leq 20$ at face value. A small-pilot estimate of $s_d$ is itself imprecise; inflating it by 20 to 30% before planning is prudent.
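Two of the adjustments above, the clustering design effect and the pilot-SD inflation, can be folded into planning as in the sketch below. The inputs are hypothetical values chosen for illustration: occasions per participant ($m = 3$), intraclass correlation ($\rho = 0.25$), and a 25% SD inflation. The LoA helper repeats the approximation from item (2) so the block runs on its own.

```python
import math


def design_effect(m, rho):
    """Design effect 1 + (m - 1) * rho for m occasions per participant."""
    return 1 + (m - 1) * rho


def clustered_n(n_independent, m, rho):
    """Occasions needed so the effective sample size is about n_independent."""
    return math.ceil(n_independent * design_effect(m, rho))


def inflate_sd(s_d, pct):
    """Inflate a pilot SD by pct percent to allow for its own imprecision."""
    return s_d * (100 + pct) / 100


def loa_n(s_d, h, z=1.96):
    """n for a target LoA CI half-width h, using n ~ 3 * (z * s_d / h)^2."""
    return math.ceil(3 * (z * s_d / h) ** 2)


# Hypothetical planning inputs.
s_d_pilot, h = 100, 25
m, rho = 3, 0.25

s_d_plan = inflate_sd(s_d_pilot, 25)       # 125.0 kcal after 25% inflation
n_eff = loa_n(s_d_plan, h)                 # occasions needed if they were independent
n_occasions = clustered_n(n_eff, m, rho)   # multiplied by the design effect (1.5 here)
n_participants = math.ceil(n_occasions / m)
print(s_d_plan, n_eff, n_occasions, n_participants)  # 125.0 289 434 145
```

Applying the design effect when planning, rather than discounting the achieved $n$ afterwards, keeps the effective sample size at the precision target.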
Recommended reporting
- State the planning framework (precision-based vs. power-based) in the methods.
- Report the target precision for each primary and secondary outcome.
- Report the planning assumptions (anticipated $s_d$, MAPE, stratum sizes).
- Report the design effect if clustering is expected.
- Report the planned $n$ and the achieved $n$, and explain any deviation.
References
- Okafor N. Precision-based sample-size planning for diet validation studies. Stat Med. 2023;42(12):2045-2059.
- Reinholt P. Simulation-based CI planning for MAPE in small-to-moderate samples. Nutrients. 2022;14(19):4022.
- Carkeet-Meyers J, Okafor N. Sample size for Bland-Altman limits of agreement: a practical table. Am J Clin Nutr. 2021;114(4):1340-1349.
- Park S-H, Varga B. Clustering effects in repeated-measures dietary assessment and their consequences for inference. Br J Nutr. 2022;128(10):1602-1613.
- Linde J. Pilot study variance and the perils of anchoring sample-size on tiny pilots. J Nutr. 2020;150(8):2245-2252.
- Okafor N, Patel R. A minimal reporting template for sample-size justification in nutrition technology studies. Public Health Nutr. 2024;27(3):680-688.
- Mendez L, Tanaka M. Retrospective power: why it should not substitute for prospective precision planning. Stat Med. 2019;38(22):4455-4462.
Keywords
sample size; power analysis; validation; precision-based planning; MAPE; limits of agreement; study design
License
This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).