How We Caught GPT-5.1 Fabricating Health Data by InfiniteUp AgencyHow We Caught GPT-5.1 Fabricating Health Data by InfiniteUp Agency
Built with FlutterFlow

How We Caught GPT-5.1 Fabricating Health Data

InfiniteUp Agency

InfiniteUp Agency

How We Caught GPT-5.1 Fabricating Health Data

Client: Health & wellness startup (anonymized at client's request) Industry: HealthTech / Consumer Safety Tech: OpenAI (GPT-4o, GPT-4o-mini, GPT-5.1), Firebase, FlutterFlow Status: Architecture delivered; client paused product development

The Problem

A health and wellness startup hired InfiniteUp to build an AI-powered cosmetic ingredient analyzer. Users would scan a product label or search by name, and the app would return a safety rating on a 1-to-5 star scale based on the ingredient list.
The stakes were real. A wrong rating could lead someone to trust a product containing ingredients they're sensitive or allergic to. The AI couldn't just be fast and cheap; it had to be right.

The Hallucination Catch

When OpenAI released GPT-5.1, we ran our standard model evaluation before upgrading. The results were alarming.
GPT-5.1 returned fabricated ingredient lists for products it didn't recognize. Instead of saying "I don't know," it confidently invented plausible-sounding ingredients. It also generated fake image URLs that returned 404s, which would have rendered as broken images in the app.
In a social media app, a hallucinated response is embarrassing. In a health app, it's dangerous.
We made the call to revert to the older, slower, but more accurate model and documented the failure mode for the client. This single decision protected every future user of the app from receiving fabricated safety data.

The Architecture Fix: Decoupling AI from Safety Logic

The hallucination catch exposed a deeper architectural risk. The original design let the AI handle both ingredient extraction and safety scoring. If the AI hallucinated an ingredient list, the safety rating would be wrong too, and there would be no way to detect it.
We redesigned the architecture with a hard boundary:
AI handles text extraction and summarization only. It reads the label, identifies ingredients, and structures the data.
The safety rating is calculated by deterministic, hard-coded math. A rules engine scores each ingredient against a known database and computes the 1-to-5 star rating. No LLM involved.
This means even if the AI makes an extraction error, the safety logic never compounds it with a confident but wrong rating. The two systems check each other.

The Multi-Gate Fallback: Reducing Cost and Error Surface

We also identified that sending every product lookup to the LLM was wasteful and risky. Most popular products already have verified ingredient data available. So we built a three-tier fallback:
Gate 1: Internal database. Check if the product already exists in the app's verified product cache. If yes, return instantly. Zero API cost, zero hallucination risk.
Gate 2: OpenFoodFacts API. If the internal DB misses, query the open-source product database. Still no LLM involved.
Gate 3: User photo + AI OCR extraction. Only if both previous gates fail does the app fall back to the user's photo and AI-powered text extraction.
This architecture reduced LLM API calls by an estimated 70-80% for common products while concentrating AI usage on the long-tail cases where it's actually needed.

Ground-Truth Benchmarking

To validate the entire pipeline, we required the client to provide a ground-truth data set: a curated list of known products with verified ingredient lists and pre-calculated safety scores.
We ran every model candidate against this data set and measured:
Ingredient extraction accuracy (did the AI correctly identify all ingredients?)
False positive rate (did it invent ingredients that weren't there?)
False negative rate (did it miss ingredients that were?)
Latency and token cost per extraction
This is the same structured evaluation methodology we apply across all InfiniteUp projects through our AI Iron Triangle framework (Fast, Cheap, Good) and the same approach we use as accepted members of the Contra Labs network for frontier AI evaluation research.

Key Takeaways

Newer models aren't always better. GPT-5.1 was faster and cheaper but fabricated data in a safety-critical context. Always benchmark before upgrading.
Never let an LLM own both the input and the scoring. Decouple extraction from evaluation so errors don't compound silently.
Gate your AI calls. Most lookups don't need an LLM. Use deterministic sources first and reserve AI for the long tail.
Ground-truth testing isn't optional. If you can't measure accuracy against known-good data, you can't trust the system in production.

Why This Matters for Your Product

If you're running AI in a product where accuracy matters (health, finance, legal, education, e-commerce), the question isn't whether your model will hallucinate. It's whether your architecture will catch it when it does.
This is exactly what our AI Model Evaluation & Benchmarking service is built to uncover.
Like this project

Posted Jun 28, 2026

Caught GPT-5.1 fabricating ingredient lists in a health app. Redesigned the architecture to decouple AI from safety-critical logic, built a multi-gate fallback to reduce LLM costs by 70-80%, and benchmarked against ground-truth data.