Systematic Review
Vendor-reported accuracy claims for image-based dietary assessment applications: a systematic review of methodology gaps
DAI-SR-2024-03
Abstract
Image-based dietary assessment applications are increasingly marketed with quantitative accuracy claims, commonly expressed as mean absolute percentage error (MAPE) on energy and macronutrient estimation. The reliability of such vendor-reported figures has not been systematically examined. This review identified 41 consumer-facing applications marketed between January 2019 and June 2024 that published a numeric accuracy claim on product websites, in press releases, or in white papers hosted by the vendor. Claims were extracted and coded against a 14-item methodological checklist derived from STARD-2015 and the 2022 extension for artificial-intelligence-based diagnostic studies. Only 9 of 41 applications (22.0%, 95% CI 10.6-37.6) disclosed sample size; 6 (14.6%, 95% CI 5.6-29.2) disclosed the reference standard used; and 2 (4.9%, 95% CI 0.6-16.5) reported 95% confidence intervals around the headline accuracy metric. Evaluation sets were bespoke and non-public, or could not be identified at all, in 34 of the 41 claims (82.9%, 95% CI 68.0-92.7). Eight applications re-stated a single accuracy figure across multiple product generations without indication of re-validation. The review concludes that the current evidence base supporting vendor accuracy claims is substantially weaker than standard reporting conventions would require, that inter-product comparison on the basis of marketed MAPE is not defensible, and that independent replication on shared, publicly described evaluation sets is needed before any application's numeric accuracy claim can be treated as a reliable summary of its real-world performance. The review does not name individual applications; the objective is to describe the field.
Keywords: dietary assessment; systematic review; reporting standards; MAPE; vendor claims; STARD; image-based assessment; evidence synthesis
1. Background
The past decade has seen rapid growth in consumer-facing dietary assessment applications that estimate energy and macronutrient intake from photographs of food. Vendors increasingly accompany these products with quantitative accuracy claims — most commonly mean absolute percentage error (MAPE) on estimated kilocalories — that are used in marketing material, investor communications, and occasionally in submissions to regulators. Clinicians, researchers, and patients who encounter these figures are rarely in a position to evaluate how they were produced.
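Vendors rarely state which definition of MAPE they use (see Section 3.2). Under the conventional definition, the metric averages each observation's absolute error expressed as a percentage of the reference value. A minimal sketch, with illustrative kilocalorie figures:

```python
def mape(reference_kcal, estimated_kcal):
    """Mean absolute percentage error, in percent, relative to the reference values."""
    return 100 * sum(abs(e - r) / r
                     for r, e in zip(reference_kcal, estimated_kcal)) / len(reference_kcal)

# Illustrative values: weighed-food reference vs. app estimates for three meals.
mape([500, 650, 420], [550, 600, 470])   # ≈ 9.9
```

Even this simple formula leaves aggregation choices open (per item, per meal, per day), which is why the granularity item in the checklist matters.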
A growing methodological literature on the reporting of diagnostic and predictive studies in health — summarised most recently in STARD-2015 and the 2022 extension for artificial-intelligence-based diagnostic studies — has established minimum descriptive elements that allow a reader to judge the validity of a reported accuracy figure. These include a pre-specified reference standard, a pre-specified patient or item population, sample size justification, and interval estimates around headline metrics. Whether vendor-reported accuracy claims for image-based dietary assessment applications approach these norms has not previously been examined systematically.
The present review therefore asks: among consumer-facing image-based dietary assessment applications that publish a quantitative accuracy claim, how frequently are core methodological elements disclosed, and how comparable are the resulting claims across products?
2. Methods
2.1 Protocol and registration
The review protocol was registered on the Open Science Framework on 3 March 2024 (preregistration available at the DOI above) and followed PRISMA 2020 guidance. No amendments were made after registration.
2.2 Eligibility criteria
Applications were eligible if they (i) were commercially available to consumers in at least one market between 1 January 2019 and 30 June 2024, (ii) used photographs of food as the primary input modality for energy or macronutrient estimation, and (iii) published at least one numeric accuracy claim in English on a vendor-controlled surface (product website, help centre, press release, white paper, regulatory filing, or investor deck). Research prototypes without consumer release were excluded.
2.3 Information sources and search
The Internet Archive (archive.org), Google, and the Apple App Store and Google Play Store were searched between 15 March and 30 April 2024 using the string (“food” OR “nutrition” OR “calorie” OR “meal”) AND (“accuracy” OR “MAPE” OR “error” OR “validation”). Archived vendor pages were retrieved where live versions had changed.
2.4 Data extraction
Two reviewers (H.W., L.H.) independently extracted data into a pre-specified sheet. A third reviewer (D.O.) arbitrated disagreements. Extracted fields covered the 14 items listed in Table 1.
2.5 Synthesis
Because the claims were heterogeneous in metric, reference standard, and dataset, no quantitative pooling was attempted. Proportions of claims reporting each methodological element are presented with exact (Clopper-Pearson) 95% confidence intervals.
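The tabulated intervals are consistent with the exact (Clopper-Pearson) binomial construction rather than a normal approximation. A minimal stdlib-only sketch, locating the interval limits by bisection on the binomial tail probabilities (the helper names and iteration count are illustrative):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion k/n."""
    def bisect(pred):
        # pred is True on [0, root) and False above it; return the root.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if pred(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # Lower limit: the p at which P(X >= k | p) rises to alpha/2.
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    # Upper limit: the p at which P(X <= k | p) falls to alpha/2.
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper
```

For example, `clopper_pearson(9, 41)` reproduces the 22.0% (10.6-37.6) row of Table 1 to rounding.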
3. Results
3.1 Included applications
Of 112 applications initially screened, 41 met inclusion criteria. Thirty-one were headquartered in North America or Europe; ten in other regions. Twenty-six described themselves as consumer products; fifteen as clinical-adjacent or research tools.
3.2 Reporting of methodological elements
Table 1 summarises the prevalence of each checklist item.
| Methodological element | Disclosed, n/41 | Proportion (95% CI) |
|---|---|---|
| Sample size | 9 | 22.0% (10.6-37.6) |
| Reference standard named | 6 | 14.6% (5.6-29.2) |
| Reference standard procedure described | 3 | 7.3% (1.5-19.9) |
| Evaluation set publicly available | 7 | 17.1% (7.2-32.1) |
| Cuisine / population composition described | 4 | 9.8% (2.7-23.1) |
| Per-item vs per-meal granularity specified | 11 | 26.8% (14.2-42.9) |
| Error metric definition given | 19 | 46.3% (30.7-62.6) |
| 95% confidence interval reported | 2 | 4.9% (0.6-16.5) |
| Independent (non-vendor) replication cited | 1 | 2.4% (0.1-12.9) |
| Date of evaluation disclosed | 12 | 29.3% (16.1-45.5) |
| Model version evaluated specified | 8 | 19.5% (8.8-34.9) |
| Pre-registration | 0 | 0.0% (0.0-8.6) |
| Mixed-dish performance reported separately | 3 | 7.3% (1.5-19.9) |
| Failure-case analysis | 2 | 4.9% (0.6-16.5) |
3.3 Headline accuracy figures
The distribution of headline MAPE claims is summarised in Table 2. Reported values ranged from 1.8% to 23.4%. Twelve claims were expressed as accuracy (100 − MAPE) rather than error, which in several cases obscured the underlying metric.
| Claim band (MAPE equivalent) | n | % of 41 |
|---|---|---|
| <5% | 6 | 14.6% |
| 5 to <10% | 11 | 26.8% |
| 10 to <15% | 9 | 22.0% |
| 15 to <25% | 7 | 17.1% |
| Not interpretable as MAPE | 8 | 19.5% |
3.4 Persistence of claims across product generations
Eight applications (19.5%) displayed a single accuracy figure that persisted across at least two product releases in the archived record, without indication that the figure had been re-derived against the newer model. In four cases the figure persisted across three or more releases.
4. Discussion
The central finding is not that vendor-reported accuracy claims are necessarily wrong, but that they are presented in a form that does not allow readers to judge whether they are right. Sample size, reference standard, and interval estimates — the three elements most central to interpreting an accuracy figure — were disclosed in under a quarter of cases. Evaluation sets were overwhelmingly bespoke and non-public, which precludes inter-product comparison by construction.
Several patterns warrant comment. First, where MAPE figures were reported without a per-item versus per-meal distinction, the same underlying performance can correspond to very different marketed numbers depending on the aggregation, because over- and under-estimates cancel when items are summed into meals. Second, reporting accuracy as 100 − MAPE, while arithmetically straightforward, makes modest errors read as high accuracy (a 15% MAPE becomes "85% accurate") and was almost never accompanied by an absolute kilocalorie error. Third, the persistence of single figures across product generations suggests either that re-validation is being performed and not reported, or that it is not being performed at all; both interpretations carry reporting implications.
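The first point can be made concrete. In the sketch below (invented kilocalorie values), every item is estimated with a 20% error, yet scoring the same estimates per meal yields a MAPE near 3%:

```python
# Each meal is a list of (reference_kcal, estimated_kcal) item pairs.
meals = [
    [(400, 480), (300, 240)],   # +20% and -20% item errors
    [(250, 200), (350, 420)],   # -20% and +20% item errors
]

def mape(pairs):
    """MAPE in percent over (reference, estimate) pairs."""
    return 100 * sum(abs(e - r) / r for r, e in pairs) / len(pairs)

per_item = mape([pair for meal in meals for pair in meal])    # ≈ 20.0
per_meal = mape([(sum(r for r, _ in m), sum(e for _, e in m))
                 for m in meals])                             # ≈ 3.1
```

A vendor could truthfully market either number, which is why the granularity item in Table 1 is not a pedantic detail.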
Limitations of this review include its reliance on vendor-controlled surfaces, which may under-represent claims made in private clinical or regulatory settings, and the restriction to English-language material. The 14-item checklist is a simplification of STARD-2015 and may have missed nuances; however, it was applied consistently across included applications.
5. Conclusions
The evidence base supporting vendor-reported accuracy claims for image-based dietary assessment applications is substantially weaker than standard diagnostic-reporting conventions require. Inter-product comparison on the basis of marketed MAPE is not currently defensible. Independent replication on shared, publicly described evaluation sets, with full reporting of the STARD-2015 minimum elements, is needed before any individual figure can be treated as a reliable summary of real-world performance. The Initiative recommends that future reviews treat vendor figures as hypotheses to be tested rather than as evidence.
References
- Beaumont S, Clarkson H. Reporting of diagnostic accuracy studies: a decade after STARD. JMIR mHealth Uhealth. 2022;10(4):e28841.
- Durand P, Oyelaran M. Mobile dietary assessment: a landscape review. Nutrients. 2021;13(11):4012.
- Fentiman R, Szabo I. The MAPE problem in consumer nutrition technology. Appetite. 2023;186:106523.
- Grieg A, Halvorsen T. Reference standards for dietary recall validation: a scoping review. Br J Nutr. 2020;124(3):290-302.
- Kleinberg J, Moreno-Leal R. Reporting bias in AI medical device marketing. J Acad Nutr Diet. 2023;123(9):1402-1414.
- Liu X, Schmidt A. STARD-AI: extending diagnostic reporting standards to machine-learning devices. Nat Med Methods. 2022;4:112-119.
- Macready A, Thwaites B. Evidence standards for digital health tools: a survey of national regulators. Public Health Nutr. 2021;24(14):4498-4507.
- Ohlsson E, Pettersen K. Systematic review methods for non-indexed grey literature. Obes Rev. 2023;24(2):e13521.
- Pritchard S, Kalyan R. The drift of accuracy claims across product generations. JMIR mHealth Uhealth. 2023;11:e44192.
- Ruiz-Delgado A, Weiss W. Why confidence intervals matter for calorie-tracker marketing claims. Nutrients. 2024;16(3):455.
- Sato H, Anderssen L. Photograph-based portion estimation: a methodological review. Br J Nutr. 2022;128(8):1401-1412.
- Tennant J, Wu F. Pre-registration of diagnostic accuracy studies outside medicine. PLOS Digit Health. 2023;2(6):e0000187.
Funding
No external funding was received for this work.
Competing interests
The authors declare no competing interests.
Data availability
The extraction sheet, coding rubric, and PRISMA flow diagram are archived with the DOI above.
How to cite
Weiss H, Henriksen L, Okafor D. (2024). Vendor-reported accuracy claims for image-based dietary assessment applications: a systematic review of methodology gaps. The Dietary Assessment Initiative — Research Publications. https://doi.org/10.5281/zenodo.dai-2024-03
License
This article is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).