Commentary
Why most vendor-reported accuracy numbers fail to replicate, and what “fail” really means
On the systematic gap between marketing figures and independent validation
There is a durable gap between the accuracy figures dietary-assessment application vendors publish and the figures independent research groups recover when they evaluate the same applications. In our 2024 systematic review, the mean gap across paired comparisons was 8.8 percentage points in mean absolute percentage error (MAPE), with the vendor figure always lower.[1] A gap of that size, in that direction, does not arise by chance. This commentary is about why.
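For readers who do not work with the metric daily: MAPE scales each absolute error by the true value and averages, so a MAPE of 20 means estimates are off by 20% of actual intake on average. A minimal sketch (the function name and the three-meal example are ours, for illustration only):

```python
def mape(true_values, estimates):
    """Mean absolute percentage error, in percentage points.

    Each absolute error is scaled by the true (reference) value before
    averaging, so large meals do not dominate the metric.
    """
    errors = [abs(est - ref) / ref for ref, est in zip(true_values, estimates)]
    return 100 * sum(errors) / len(errors)

# Example: three meals, weighed energy vs. an app's estimates (kcal).
print(round(mape([520, 710, 430], [590, 640, 470]), 1))  # 10.9
```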
“Fail to replicate” is not a moral claim
Before anything else: when we say a vendor number “fails to replicate,” we are not accusing vendors of fabrication. In the majority of cases we have examined, the vendor’s figure is a genuine output of a genuine evaluation. What differs is the evaluation design. A figure can be technically defensible in its original context and still not transport to the conditions under which an independent group tests the product. Replication failure, in that sense, is a statement about scope, not about honesty.[2]
Four mechanisms that produce the gap
First: meal-set selection. Vendor evaluations often use meal sets curated to resemble the vendor’s training distribution. Independent evaluations typically use meal sets constructed from a population sampling frame (e.g., NHANES-weighted intake, or a clinic population’s recorded meals). The two distributions differ; the model’s error differs accordingly. In the cases where we have been able to obtain the composition of the vendor’s meal set, this mechanism has been the single largest contributor to the replication gap.
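A toy calculation makes the mechanism concrete. All numbers below are invented for illustration; the point is only that, for a fixed model whose error varies by meal type, the composition of the evaluation set by itself moves the headline MAPE:

```python
# Hypothetical per-meal-type MAPE for one fixed model (invented numbers).
per_type_mape = {"plated_single_dish": 11.0, "mixed_dish": 24.0, "buffet_style": 31.0}

# A vendor-style mix resembling the training distribution, versus a
# population-weighted mix (e.g., built from NHANES-style intake frequencies).
vendor_mix = {"plated_single_dish": 0.70, "mixed_dish": 0.25, "buffet_style": 0.05}
population_mix = {"plated_single_dish": 0.35, "mixed_dish": 0.40, "buffet_style": 0.25}

def expected_mape(mix):
    # The evaluation-set MAPE is the mix-weighted mean of per-type MAPE.
    return sum(weight * per_type_mape[meal_type] for meal_type, weight in mix.items())

print(round(expected_mape(vendor_mix), 2))      # 15.25, the vendor-style headline
print(round(expected_mape(population_mix), 2))  # 21.2, the independent-style figure
```

The model is identical in both rows; only the sampling frame changed.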
Second: reference method drift. Several vendor evaluations use a reference method that is itself an imperfect dietary-assessment instrument, such as a 24-hour recall or the vendor’s own “weighed” estimate based on typical portion sizes. An independent evaluation using actual weighed food recovers a different error distribution. This can easily contribute 2–4 percentage points of MAPE.[3]
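A small simulation illustrates how the choice of reference moves the recovered error. The correlation structure below is an assumption made for illustration (the app and a memory- or portion-based reference misjudge portion size in the same direction), not an estimate from our dataset:

```python
import random

random.seed(1)
n_meals = 5000
err_vs_weighed, err_vs_recall = [], []
for _ in range(n_meals):
    true_kcal = random.uniform(300, 900)            # weighed ground truth
    shared = random.gauss(0, 0.12)                  # shared portion-size misjudgement
    app_est = true_kcal * (1 + shared + random.gauss(0, 0.08))
    recall_ref = true_kcal * (1 + shared + random.gauss(0, 0.06))  # imperfect reference
    err_vs_weighed.append(abs(app_est - true_kcal) / true_kcal)
    err_vs_recall.append(abs(app_est - recall_ref) / recall_ref)

print(100 * sum(err_vs_weighed) / n_meals)  # ~11.5: MAPE against weighed food
print(100 * sum(err_vs_recall) / n_meals)   # ~8.0: MAPE against the correlated reference
```

Under these assumptions the gap is about 3.5 percentage points, in the range the paired comparisons suggest; if the reference errors were instead independent of the app’s, the effect could run the other way.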
Third: aggregation choices. Vendor headline figures are often an unweighted mean across multiple outcomes (energy, protein, fat, carbohydrate), sometimes across multiple cuisines. Whichever outcome performs best pulls the headline down. Independent evaluations that report per-outcome MAPE recover higher per-outcome numbers — not because the system is worse, but because the aggregation has been disaggregated.
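The arithmetic is worth seeing once (per-outcome numbers invented): the unweighted mean sits below most of its own components whenever one outcome is much better than the rest.

```python
# Hypothetical per-outcome MAPE for a single evaluation (invented numbers).
per_outcome = {"energy": 9.0, "protein": 26.0, "fat": 29.0, "carbohydrate": 24.0}

headline = sum(per_outcome.values()) / len(per_outcome)
print(headline)  # 22.0: the combined-outcome "headline"

# Disaggregating "raises" three of the four reported numbers, with no
# change to the system itself.
print({k: v for k, v in per_outcome.items() if v > headline})
```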
Fourth: selective reporting. We have encountered vendor evaluations that appear to have reported the better of two or more internal runs. This is the mechanism we can least often directly verify, but it is consistent with the statistical signature in our paired-comparison dataset: vendor MAPE values are more tightly clustered near the low end than a random sample of legitimate evaluations should be.[4]
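The order-statistics signature is easy to reproduce. A minimal simulation, under an assumed run-to-run variability (both parameters invented):

```python
import random
import statistics

random.seed(7)

def one_run_mape():
    # One internal evaluation run: true MAPE 20, run-to-run noise sd 3 (assumed).
    return random.gauss(20, 3)

honest_runs = [one_run_mape() for _ in range(10_000)]
best_of_two = [min(one_run_mape(), one_run_mape()) for _ in range(10_000)]

# Reporting the better of two runs both lowers the reported figure and
# tightens its clustering near the low end.
print(statistics.mean(honest_runs), statistics.stdev(honest_runs))  # ~20.0, ~3.0
print(statistics.mean(best_of_two), statistics.stdev(best_of_two))  # ~18.3, ~2.5
```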
What “failure” means for readers
For a clinician, nutrition researcher, or journalist reading a product webpage, the practical implication is that a headline MAPE figure is best treated as a lower bound on the error under research-realistic conditions. In most cases, a reasonable prior would be to add 5–10 percentage points when imagining how the product would perform on a meal set the vendor did not curate.[5] This is not a rule; it is an ordering heuristic.
For the research community, the implication is that independent replication needs to be treated as a first-order scientific activity, not as a nice-to-have. The field has too few paired comparisons to say, for any given application, how large the replication gap actually is.
What might close the gap
We would encourage three changes. Vendors should pre-register their evaluation designs before collecting data and publish the pre-registration alongside any headline figures; this removes the “selective reporting” mechanism by construction. Journals accepting validation studies should require reporting of per-outcome MAPE with confidence intervals and should not accept combined-outcome headlines as the primary result. Research groups — including ours — should treat multi-application comparative evaluations as part of their core programme rather than an occasional project.
None of this is novel advice. It has simply not yet been adopted at a scale that would make the replication gap uncommon.
References

1. Weiss, H. & Okafor, D. (2024). Systematic review of image-based dietary assessment validation, 2015–2024. Dietary Assessment Initiative, DAI-SR-2024-03.
2. Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
3. Archer, E. et al. (2018). Controversy and debate: memory-based methods. Journal of Clinical Epidemiology, 104, 113–124.
4. Simonsohn, U. et al. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666–681.
5. Forrester, M. G. & Castillo, R. (2023). Headline-to-field accuracy gaps in commercial dietary applications. Nutrition Journal, 22(1), 88.
Keywords
replication; vendor accuracy; validation; MAPE; bias; research integrity
License
This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).