Methodology Brief

Sample-size considerations for image-based dietary assessment validation studies

Daniel Okafor, PhD, MS
Published May 14, 2025

Background

Validation studies for image-based or AI-assisted dietary assessment tools frequently report sample sizes in the range of 30 to 150 eating occasions without a documented planning basis. Because these studies usually aim to estimate an accuracy parameter (mean absolute percentage error, MAPE; mean bias; or Bland-Altman limits of agreement, LoA) rather than to test a hypothesis, the appropriate planning framework is precision-based (targeting the width of the confidence interval, CI) rather than power-based. Yet precision-based planning remains uncommon in the field.

The Initiative’s convention is that every Initiative-branded validation study pre-specifies its target precision for its primary accuracy outcome, and chooses $n$ accordingly. Retrospective-only sample-size statements are not accepted as adequate.

The Method

Three precision targets are considered:

1. Width of the MAPE confidence interval. For a target half-width $w$ on MAPE (for example, $w = 3$ percentage points), the Initiative uses a simulation-based planning approach because MAPE’s sampling distribution is not well approximated by closed-form expressions for moderate $n$. A reasonable heuristic, consistent with simulation results across dietary datasets, is that the 95% CI half-width scales approximately as $c / \sqrt{n}$, with $c$ between 15 and 25 percentage points for typical food-photo datasets. Solving $w = c / \sqrt{n}$ gives $n \approx (c / w)^2$, so a target of $w = 3$ implies $n$ between 25 and about 70 for a homogeneous stratum, and substantially more for a heterogeneous overall sample.

2. Width of the limits-of-agreement confidence interval. For a target LoA CI half-width of $h$ (in the outcome’s units), the Carkeet formulation implies $n \approx 3 \cdot (1.96 s_d / h)^2$, where $s_d$ is the anticipated SD of differences. For $s_d = 100$ kcal and $h = 25$ kcal, this yields $n \approx 3 \cdot 7.84^2 \approx 184$.

3. Category-stratified inference. If the protocol pre-specifies stratum-level accuracy reporting, each stratum of scientific interest requires its own $n \geq 30$, and the overall $n$ is at least $\sum n_{\text{stratum}}$ plus a buffer for stratum imbalance (Initiative convention: 15%).

The final planned $n$ is the maximum implied by (1), (2), and (3), rounded up to the next multiple of 10.
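The simulation-based planning in item (1) can be sketched as follows. This is a minimal illustration, not the Initiative’s actual procedure: it assumes absolute percentage errors follow a gamma distribution with a hypothetical mean of 20 and SD of 10 percentage points, and the function names and defaults are invented for the example. It estimates the 95% half-width from the simulated sampling distribution of MAPE and searches for the smallest $n$ that meets a target $w$; the closed-form heuristic $n \approx (c / w)^2$ is included for comparison.

```python
import math
import random
import statistics


def n_from_heuristic(c, w):
    """Heuristic from the text: half-width ~ c / sqrt(n), so n ~ (c / w)^2."""
    return math.ceil((c / w) ** 2)


def simulated_half_width(n, n_sims=2000, mean_ape=20.0, sd_ape=10.0, seed=0):
    """Monte Carlo estimate of the 95% half-width of MAPE at sample size n.

    Assumption (hypothetical): absolute percentage errors are gamma
    distributed with the given mean and SD, in percentage points.
    """
    rng = random.Random(seed)
    shape = (mean_ape / sd_ape) ** 2   # gamma shape parameter
    scale = sd_ape ** 2 / mean_ape     # gamma scale parameter
    # Simulate the sampling distribution of MAPE: one mean per replicate.
    mapes = sorted(
        statistics.fmean(rng.gammavariate(shape, scale) for _ in range(n))
        for _ in range(n_sims)
    )
    lower = mapes[int(0.025 * n_sims)]
    upper = mapes[int(0.975 * n_sims) - 1]
    return (upper - lower) / 2


def smallest_n_for_half_width(target_w, start=10, step=5, max_n=500, **kwargs):
    """Smallest n on a coarse grid whose simulated half-width meets target_w."""
    for n in range(start, max_n + 1, step):
        if simulated_half_width(n, **kwargs) <= target_w:
            return n
    return None  # target not reachable within max_n


if __name__ == "__main__":
    print("heuristic n for c=15, w=3:", n_from_heuristic(15, 3))
    print("heuristic n for c=25, w=3:", n_from_heuristic(25, 3))
    print("simulated n for w=3:", smallest_n_for_half_width(3.0))
```

In practice the gamma assumption would be replaced by resampling from pilot or historical error data, and a heterogeneous sample would be simulated as a mixture of strata rather than a single distribution.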

Worked example

Suppose a protocol declares:

Primary outcome: MAPE on per-occasion energy estimate.
Target MAPE CI half-width: 3 percentage points.
Secondary outcome: Bland-Altman LoA on per-occasion energy.
Expected $s_d$: 100 kcal. Target LoA CI half-width: 25 kcal.
Stratified reporting for four cuisine strata with $n \geq 30$ each.

Sample-size components:

(1) Simulation-based planning for MAPE CI: $n \approx 80$.
(2) LoA CI half-width: $n \approx 3 \cdot (1.96 \cdot 100 / 25)^2 \approx 184$.
(3) Category stratification: $4 \times 30 = 120$, plus 15% buffer $\rightarrow 138$.

Planned $n = 190$, taken as the maximum (LoA-driven here) rounded up to the next multiple of 10.

A brief table:

Constraint	Implied n
MAPE CI half-width 3 pp	80
LoA CI half-width 25 kcal	184
Stratified reporting, 4 strata	138
Final planned n	190
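
The arithmetic of the worked example can be checked with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, the MAPE component ($n \approx 80$) is taken as given from the simulation step rather than recomputed, and the LoA component applies the formula $3 \cdot (1.96 s_d / h)^2$ exactly as written in the Method section, which evaluates to roughly 184 for $s_d = 100$ and $h = 25$ (185 once rounded up to a whole occasion).

```python
import math


def n_for_loa(s_d, h, z=1.96):
    """LoA CI half-width constraint: n ~ 3 * (z * s_d / h)^2, rounded up."""
    return math.ceil(3 * (z * s_d / h) ** 2)


def n_for_strata(n_strata, per_stratum=30, buffer_pct=15):
    """Stratified-reporting constraint with a percentage buffer for imbalance."""
    base = n_strata * per_stratum
    return math.ceil(base * (100 + buffer_pct) / 100)


def round_up_to_10(n):
    """Round up to the next multiple of 10."""
    return math.ceil(n / 10) * 10


def planned_n(*components):
    """Final planned n: the largest component, rounded up to a multiple of 10."""
    return round_up_to_10(max(components))


if __name__ == "__main__":
    n_mape = 80                          # given: result of simulation-based planning
    n_loa = n_for_loa(s_d=100, h=25)     # s_d and h in kcal
    n_strata = n_for_strata(n_strata=4)  # four cuisine strata, n >= 30 each
    print("MAPE:", n_mape, "LoA:", n_loa, "Strata:", n_strata)
    print("Planned n:", planned_n(n_mape, n_loa, n_strata))
```

Keeping each constraint as its own function makes it easy to see which one drives the final $n$ when a planning assumption changes.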

Common pitfalls

Planning only around MAPE and then reporting stratified results from strata that are too small to support sub-group inference.
Using a closed-form power expression from a different metric (for example, a two-sample t-test power calculation) when the scientific question is the precision of a single-sample accuracy parameter.
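Treating each eating occasion as an independent unit when multiple occasions come from the same participant. Clustering reduces the effective sample size, so the planned $n$ should be multiplied by the design effect $1 + (m - 1)\rho$, where $m$ is the average number of occasions per participant and $\rho$ is the intraclass correlation.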
Over-rounding to a conveniently round $n$ without re-checking stratum counts.
Treating $s_d$ from a pilot study of $n \leq 20$ as if it were known exactly. A small-pilot $s_d$ is itself imprecise; a 20 to 30% upward adjustment is prudent.
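
Two of the adjustments above, the clustering design effect and the upward adjustment of a small-pilot $s_d$, are simple enough to sketch in code. The function names and the example inputs (3 occasions per participant, intraclass correlation 0.25, a 25% inflation) are hypothetical values chosen for illustration, not recommendations from this brief.

```python
import math


def design_effect(m, rho):
    """Design effect 1 + (m - 1) * rho for m occasions per participant
    and intraclass correlation rho."""
    return 1 + (m - 1) * rho


def occasions_needed(n_independent, m, rho):
    """Inflate an n planned under independence to account for clustering."""
    return math.ceil(n_independent * design_effect(m, rho))


def inflate_pilot_sd(pilot_sd, inflation=0.25):
    """Apply an upward adjustment (20 to 30% per the text) to a small-pilot SD."""
    return pilot_sd * (1 + inflation)


if __name__ == "__main__":
    # Hypothetical: 100 occasions planned under independence,
    # 3 occasions per participant, intraclass correlation 0.25.
    print("Design effect:", design_effect(3, 0.25))
    print("Occasions needed:", occasions_needed(100, 3, 0.25))
    # Hypothetical: pilot s_d of 100 kcal inflated by 25% before planning.
    print("Planning s_d (kcal):", inflate_pilot_sd(100, 0.25))
```

Because the LoA constraint scales with $s_d^2$, even a modest inflation of the pilot SD can raise the implied $n$ noticeably, so the adjusted value should be fed back into the planning formulas rather than applied after the fact.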

Recommended reporting

State the planning framework (precision-based vs. power-based) in the methods.
Report the target precision for each primary and secondary outcome.
Report the planning assumptions (anticipated $s_d$, MAPE, stratum sizes).
Report the design effect if clustering is expected.
Report the planned $n$ and the achieved $n$, and explain any deviation.

References

Okafor N. Precision-based sample-size planning for diet validation studies. Stat Med. 2023;42(12):2045-2059.
Reinholt P. Simulation-based CI planning for MAPE in small-to-moderate samples. Nutrients. 2022;14(19):4022.
Carkeet-Meyers J, Okafor N. Sample size for Bland-Altman limits of agreement: a practical table. Am J Clin Nutr. 2021;114(4):1340-1349.
Park S-H, Varga B. Clustering effects in repeated-measures dietary assessment and their consequences for inference. Br J Nutr. 2022;128(10):1602-1613.
Linde J. Pilot study variance and the perils of anchoring sample-size on tiny pilots. J Nutr. 2020;150(8):2245-2252.
Okafor N, Patel R. A minimal reporting template for sample-size justification in nutrition technology studies. Public Health Nutr. 2024;27(3):680-688.
Mendez L, Tanaka M. Retrospective power: why it should not substitute for prospective precision planning. Stat Med. 2019;38(22):4455-4462.

Keywords

sample size; power analysis; validation; precision-based planning; MAPE; limits of agreement; study design

License

This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).