Preprint

Cuisine distribution shift in photo-based dietary assessment: a re-analysis of three publicly described evaluation sets

DAI-PRE-2025-01

This is a preprint

This article is a preprint and has not undergone external peer review. The Dietary Assessment Initiative releases preprints to invite methodological critique prior to or alongside formal publication. Comments and correction notes are welcomed via the Initiative’s contact address and will be acknowledged in any subsequent version.

1. Background

Photo-based dietary assessment has matured considerably over the past decade. A number of consumer and research applications now accept a single photograph of a plated meal and return an estimate of energy and, in some cases, macronutrient content. The most commonly cited accuracy figures for such applications derive from evaluation studies in which the application’s estimate is compared against a weighed reference or a trained dietitian’s estimate on a curated set of meals.

A less frequently examined feature of these evaluation studies is the composition of the meal set against which the application is tested. Validation manuscripts typically describe the set in broad terms — the number of meals, the range of energy content, whether meals were prepared in a metabolic kitchen or captured in the field — but rarely report a structured breakdown of the cuisine traditions represented. If, as we suspected, the cuisine distribution of commonly used evaluation sets skews heavily toward one or two traditions, the published accuracy of an application may not transfer to users whose everyday meals sit outside that distribution.

This preprint reports a re-analysis of three publicly described evaluation sets used in peer-reviewed validation studies of photo-based applications published between 2019 and 2024. Our goal is descriptive rather than inferential: we are not attempting to rank applications, and we do not re-score any application against any new data. We are asking a simpler question — how culinarily representative are the reference sets?

2. Methods

2.1 Set selection

We identified candidate evaluation sets through a structured search of PubMed and JMIR for validation studies of photo-based dietary assessment applications published between January 2019 and December 2024. From an initial pool of 34 candidate manuscripts, we retained the three that met all of the following: (a) a per-meal descriptor table was provided in the manuscript or supplementary materials, (b) at least one accuracy metric was reported per meal or per category, and (c) the set was explicitly described as a “validation set” or “reference set.” The three retained sets contained 184, 247, and 181 meals, respectively (combined N = 612). Set identifiers are kept generic (Set A, Set B, Set C) in this preprint; full citations appear in the references.

2.2 Cuisine coding

We developed a six-bucket cuisine codebook (Western, Mediterranean, East Asian, South Asian, Latin American, Other/Mixed) drawing on existing food-culture literature. Two raters independently assigned each meal to a bucket, working from the published descriptor only. Disagreements were adjudicated by a third rater. Cohen’s kappa is reported for the primary rater pair.
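For readers reproducing the agreement analysis, the kappa statistic used in Section 2.2 can be sketched in a few lines; the rater labels below are illustrative and are not drawn from our coding data.

```python
from collections import Counter

def cohens_kappa(ratings_1, ratings_2):
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    assert len(ratings_1) == len(ratings_2)
    n = len(ratings_1)
    # Observed agreement: proportion of items on which the raters match.
    p_o = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    c1, c2 = Counter(ratings_1), Counter(ratings_2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# e.g. cohens_kappa(["W", "W", "EA", "SA"], ["W", "EA", "EA", "SA"])
```

In practice one would use a library implementation (e.g. scikit-learn's cohen_kappa_score) that also supports confidence intervals via bootstrapping; the sketch above only shows the point estimate.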

2.3 Re-expression of published accuracy

Where a source manuscript reported per-meal or per-category accuracy, we re-expressed the published mean absolute percentage error (MAPE) conditional on cuisine bucket. We did not re-score any application ourselves.

3. Results

3.1 Cuisine distribution

Across the combined 612-meal pool, the Western bucket accounted for 378 meals (61.8%). Mediterranean accounted for a further 50 meals (8.2%). East Asian accounted for 74 (12.1%), South Asian for 23 (3.8%), Latin American for 41 (6.7%), and Other/Mixed for 46 (7.5%). The two largest sets (Set A, Set B) each had fewer than 10 South Asian meals. Inter-rater agreement was substantial (kappa = 0.79, 95% CI 0.74–0.83).
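The bucket shares reported above follow directly from the counts; the arithmetic can be reproduced as:

```python
# Combined 612-meal pool, counts as reported in Section 3.1.
counts = {
    "Western": 378, "Mediterranean": 50, "East Asian": 74,
    "South Asian": 23, "Latin American": 41, "Other/Mixed": 46,
}
total = sum(counts.values())  # 612
# Percentage share of each bucket, rounded to one decimal place.
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}
```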

3.2 Conditional accuracy

For the two source manuscripts that permitted re-expression, per-cuisine MAPE within a single application ranged from 14.2% (Western) to 33.8% (South Asian) in the first study and from 18.1% (Western) to 29.1% (East Asian) in the second. The ratio of worst-to-best MAPE across cuisine buckets was 2.4x in the first case and 1.6x in the second.

3.3 Descriptor completeness

None of the three source manuscripts reported cuisine composition in a structured table. Two reported portion-size ranges; one did not. All three reported energy ranges.

4. Discussion

Our re-analysis suggests that the cuisine composition of commonly used evaluation sets is heavily skewed toward Western meals, and that within-application accuracy varies substantially by cuisine bucket when it can be estimated. The practical implication is that a headline accuracy figure derived from a Western-skewed set is likely to overstate expected accuracy for users whose meals fall in underrepresented buckets, particularly South Asian cuisine.

We do not interpret this as a failure of any particular application. Rather, it is a feature of the evaluation infrastructure. Curating a diverse reference set is expensive, and early validation work understandably leaned on meals that the research team could reliably weigh and photograph. But the downstream effect is that published accuracy is, to a first approximation, Western accuracy.

We recommend that: (1) future evaluation sets adopt a minimum-reporting checklist (cuisine bucket, portion range, photo capture conditions, preparation venue); (2) validation manuscripts publish stratified accuracy figures where sample size allows; and (3) the community invest in reference meal sets that deliberately oversample underrepresented cuisines. The Initiative is developing a reference set along these lines.
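Recommendation (1) could be operationalized as a small per-meal metadata record attached to each entry in an evaluation set. The field names below are illustrative, not a published standard:

```python
# Hypothetical per-meal metadata record covering the proposed
# minimum-reporting checklist; all field names are illustrative.
meal_record = {
    "meal_id": "A-017",
    "cuisine_bucket": "South Asian",  # from a fixed codebook
    "portion_grams_range": (250, 400),
    "capture_conditions": "field, overhead, natural light",
    "preparation_venue": "home kitchen",
}

# A set is checklist-complete when every record carries these fields.
required = {"cuisine_bucket", "portion_grams_range",
            "capture_conditions", "preparation_venue"}
missing = required - meal_record.keys()  # empty set if compliant
```

A structured record of this kind would also make the stratified reporting in recommendation (2) mechanical rather than a manual re-coding exercise of the sort undertaken here.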

Limitations: we worked from published descriptors rather than the original imagery, to which we did not have access. Cuisine bucket assignment from a text descriptor is imperfect; we report inter-rater agreement to quantify this. Our re-expression of accuracy is constrained by what the source manuscripts reported and is not a re-scoring.

References

  1. Albright T, Nakamura Y. Curated meal image sets for validation of photo-based dietary assessment: a scoping review. Nutrients. 2023;15(8):1842.
  2. Chen K, Osei-Assibey P. Cuisine as a moderator of accuracy in automated food recognition. J Med Internet Res. 2022;24(4):e31204.
  3. Dimitriou L, Faroqi N. Structured reporting for dietary assessment validation studies: a proposal. J Acad Nutr Diet. 2024;124(2):301-309.
  4. Esposito R, Park S. Western bias in consumer nutrition technology evaluation. Public Health Nutr. 2023;26(7):1450-1458.
  5. Fielder J, Rao A. Reference meal sets for mobile dietary assessment: a comparative analysis. JMIR mHealth uHealth. 2021;9(11):e29887.
  6. Gutierrez M, Hoffman K. Inter-rater reliability in cuisine coding for nutrition research. Appetite. 2022;172:105958.
  7. Hayes L, Ibrahim O. Stratified reporting in mHealth validation: why and how. npj Digit Med. 2024;7(1):44.
  8. Iyer P, Johansson E. Energy estimation error by cuisine tradition in a photo-based app: a secondary analysis. Nutrients. 2023;15(18):3987.
  9. Kaur H, Lindgren A. Representation of South Asian cuisine in digital nutrition tools. Br J Nutr. 2024;131(4):612-621.
  10. McDonnell S, Park J-Y. Minimum reporting standards for food image datasets. JMIR mHealth uHealth. 2022;10(9):e37712.
  11. Ogundimu F, Pritchard E. External validity in consumer health technology: a framework. BMJ Digit Health. 2023;1(3):e000041.
  12. Yamamoto R, Zeller C. Evaluation set transparency in dietary assessment research. J Acad Nutr Diet. 2024;124(9):1120-1128.

Keywords

photo-based dietary assessment; evaluation set; cuisine stratification; external validity; reference meals; stratified reporting; methodology

License

This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).