Methodology Brief
Cuisine stratification in evaluation sets: definitions, allocations, and minimum N for inference
Background
Image-based dietary assessment tools frequently exhibit non-uniform accuracy across cuisines: a system trained primarily on Western plated meals may perform noticeably less well on mixed rice dishes, communal stews, or composed street-food items. Evaluation sets that do not stratify by cuisine, or that stratify inconsistently, give misleading global accuracy estimates and obscure real gaps in tool readiness.
However, “cuisine” itself is a contested category. The Initiative’s position is that for evaluation purposes a pragmatic, structurally defined taxonomy is preferable to one based on geographic or cultural labels alone, and that stratum definitions should be stated operationally enough to permit replication.
The Method
Stratum definition. The Initiative uses a six-stratum operational taxonomy for evaluation sets, defined primarily by structural characteristics of the meal rather than national origin:
- Plated single-component - an identifiable protein plus sides on a single plate.
- Mixed rice or grain dishes - dishes in which a grain matrix incorporates multiple other components (for example, pilafs, biryanis, fried rice, paella).
- Composed bowls and stews - liquid- or semi-liquid-matrix dishes with multiple components partially submerged.
- Layered or stacked items - sandwiches, wraps, burgers, tacos, and analogous items where components are stacked and partly occluded.
- Beverages and soups - predominantly liquid items.
- Discrete-piece items - fruits, baked goods, confectionery served as identifiable whole units.
Each evaluation item is assigned to exactly one stratum. The taxonomy is intentionally structural because these categories correspond to visually distinct estimation problems (occlusion, ingredient overlap, portion cue availability), which is the relevant axis for image-based evaluation.
Allocation. For an evaluation set of total size $N$, the Initiative’s default allocation is proportional to the deployment population’s consumption mix, with a floor of 10% for each stratum that is scientifically relevant to the claim being made. Strata judged not relevant may be excluded with justification, and the exclusion is reported.
Minimum stratum size for inference. The Initiative requires $n_{\text{stratum}} \geq 30$ before stratum-level accuracy is reported quantitatively. Strata with $15 \leq n < 30$ are reported descriptively only. Strata with $n < 15$ are pooled into a “miscellaneous” category.
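The allocation and reporting rules above can be sketched in code. This is a minimal illustration, not the Initiative's reference implementation: the function names, the consumption mix, and the floor value passed in the usage lines are all hypothetical, and the way the floor is applied (pin below-floor strata at the floor, rescale the rest proportionally, round by largest remainder) is one reasonable reading of "proportional with a floor" rather than a rule stated in this brief.

```python
import math

def allocate(total_n, mix, floor):
    """Planned n per stratum: proportional to `mix`, each share at least `floor`.

    Assumption: below-floor strata are pinned at the floor and the remaining
    strata share the rest in proportion to their consumption weights.
    """
    if floor * len(mix) > 1:
        raise ValueError("floor is too high for this number of strata")
    total = sum(mix.values())
    shares = {s: w / total for s, w in mix.items()}
    pinned = set()
    while True:
        low = {s for s, p in shares.items() if s not in pinned and p < floor}
        if not low:
            break
        pinned |= low
        free_mass = 1.0 - floor * len(pinned)
        free_total = sum(w for s, w in mix.items() if s not in pinned)
        shares = {s: floor if s in pinned else free_mass * mix[s] / free_total
                  for s in mix}
    # Largest-remainder rounding so the counts sum exactly to total_n.
    raw = {s: round(p * total_n, 9) for s, p in shares.items()}
    counts = {s: math.floor(r) for s, r in raw.items()}
    leftover = total_n - sum(counts.values())
    by_remainder = sorted(raw, key=lambda s: raw[s] - counts[s], reverse=True)
    for s in by_remainder[:leftover]:
        counts[s] += 1
    return counts

def reporting_tier(n):
    """Reporting rule from the text: >= 30 quantitative, 15-29 descriptive, < 15 pooled."""
    if n >= 30:
        return "quantitative"
    if n >= 15:
        return "descriptive"
    return "pooled"

# Hypothetical consumption mix and floor, for illustration only.
example_mix = {
    "plated": 0.30, "mixed_rice_grain": 0.22, "bowls_stews": 0.15,
    "layered_stacked": 0.23, "beverages_soups": 0.06, "discrete_piece": 0.04,
}
planned = allocate(180, example_mix, floor=0.10)
for stratum, n in planned.items():
    print(f"{stratum}: n={n} ({reporting_tier(n)})")
```

Running the tier check at the planning stage shows which strata will support quantitative reporting before any data are collected.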
Worked example
A validation targeting a general-purpose tool for an urban US adult deployment might set $N = 180$ with the following allocation.
| Stratum | Planned share | Planned n | Rationale |
|---|---|---|---|
| Plated single-component | 25% | 45 | Common in target population |
| Mixed rice/grain | 20% | 36 | Common, known difficult for image methods |
| Composed bowls/stews | 15% | 27 | Relevant, diverse |
| Layered/stacked | 20% | 36 | Very common (sandwiches, wraps) |
| Beverages/soups | 10% | 18 | Minimum-floor adjusted |
| Discrete-piece | 10% | 18 | Minimum-floor adjusted |
In this plan, Composed bowls/stews ($n = 27$), Beverages/soups ($n = 18$), and Discrete-piece ($n = 18$) would be reported descriptively only, since their planned $n$ falls below the 30-item threshold for stratum-level quantitative inference. If stratum-level inference is required for any of these, the overall $N$ must be raised.
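As a rough guide to how far $N$ would need to rise, assume the planned shares in the table are held fixed and write $p_s$ for the planned share of stratum $s$ (a symbol introduced here for illustration). Every stratum reaches the 30-item threshold once

$$
N \;\geq\; N_{\min} = \left\lceil \frac{30}{\min_s p_s} \right\rceil = \left\lceil \frac{30}{0.10} \right\rceil = 300 .
$$

At $N = 300$ the two 10% strata reach $n = 30$ and the 15% stratum reaches $n = 45$. Shifting shares toward the smaller strata instead would require less total data, at the cost of departing from the proportional allocation.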
Common pitfalls
- Using national-cuisine labels (“Italian”, “Thai”) that are not operationally stable and that conflate structurally different dishes. A “Thai” stratum may contain both plated single-component and composed-stew items, which are different estimation problems.
- Letting an evaluation set’s stratum mix drift toward whatever items were easiest to photograph, producing a systematic bias away from dishes that are typically difficult to estimate.
- Reporting an overall accuracy figure without disclosing stratum mix. Two studies with the same overall mean absolute percentage error (MAPE) but different mixes are not comparable.
- Pooling strata post hoc to rescue underpowered sub-samples. Pooling must be pre-specified.
- Failing to tag individual items in the evaluation set with their stratum assignment, which prevents re-analysis.
Recommended reporting
- Report the stratum taxonomy used and any deviations from a standard one.
- Report planned and achieved $n$ per stratum.
- Report overall accuracy, and report stratum-level accuracy for each stratum with $n \geq 30$; describe strata with $15 \leq n < 30$ qualitatively.
- Report the consumption-mix assumptions behind the allocation.
- Release stratum tags alongside the item-level evaluation data in the supplement.
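A stratum-tagged item table makes these reporting items straightforward to produce. The sketch below is illustrative only: the field names (`stratum`, `estimate`, `reference`) and the function names are hypothetical, MAPE is used as the accuracy metric because it is the one named in this brief, and reference values are assumed to be strictly positive.

```python
from collections import defaultdict

def mape(pairs):
    """Mean absolute percentage error over (estimate, reference) pairs."""
    return 100.0 * sum(abs(est - ref) / ref for est, ref in pairs) / len(pairs)

def stratum_report(items, min_n=30):
    """Overall MAPE plus per-stratum n; stratum MAPE only where n >= min_n."""
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[item["stratum"]].append((item["estimate"], item["reference"]))
    all_pairs = [p for pairs in by_stratum.values() for p in pairs]
    report = {"overall": {"n": len(all_pairs), "mape": mape(all_pairs)}}
    for stratum, pairs in by_stratum.items():
        n = len(pairs)
        # Below the threshold, n is still reported but the figure is withheld.
        report[stratum] = {"n": n, "mape": mape(pairs) if n >= min_n else None}
    return report

# Hypothetical item-level data: energy estimates against reference values (kcal).
items = (
    [{"stratum": "plated", "estimate": 110.0, "reference": 100.0}] * 30
    + [{"stratum": "beverages_soups", "estimate": 150.0, "reference": 100.0}] * 5
)
for name, row in stratum_report(items).items():
    print(name, row)
```

Returning `None` rather than a number for small strata keeps an underpowered figure from being quoted by accident while still disclosing the achieved $n$.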
Keywords
cuisine; stratification; evaluation set; dietary assessment; sampling design; generalisability; image-based
License
This piece is distributed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).