Sugar, Revisor, and Pepper: the agentic cycle that maintains iFood’s catalog scale and consistency
DATA & AI · Apr 23


Living catalogs require living systems. And in a catalog with millions of items, consistency is the most important factor.


In a catalog with millions of items, product classification faces challenges that go beyond simple categorization. Inconsistent descriptions, ambiguous names, and the enormous variety of items make it difficult for traditional models to correctly interpret what each item represents within the catalog’s taxonomy.

When this interpretation fails, classification may end up placing an item in the wrong place in the taxonomic tree. This is a small error in the data structure, but it has a direct impact on search, recommendation, and curation.

And here comes the real challenge: how do you maintain semantic consistency in a catalog with millions of items, a long tail, and ambiguous descriptions, without turning this into endless manual work?

In recent months, the answer to this question has evolved significantly here. What started as an initiative to automate data labeling became something bigger: an ecosystem of services that classifies and audits at scale, while continuously learning from its own exceptions.

Keep reading to learn more about this evolution and about the three services that sustain the cycle today: Sugar, Revisor, and Pepper.

A brief context: from SAL to the next stage

Before discussing current services, it’s worth providing quick context.

SAL (Smart Automated Labeling) was born as an internal library to automate label generation for training and evaluating machine learning models. The central premise was to use multiple models and combine their responses via consensus methods to increase process reliability. This helped accelerate labeling and raise quality in important cases.
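
To give a feel for the consensus idea, here is a minimal sketch of one possible scheme: a simple majority vote with a minimum agreement threshold. The threshold and labels are illustrative assumptions, not SAL's actual implementation.

```python
from collections import Counter

def consensus_label(labels: list[str], min_agreement: float = 0.6) -> str | None:
    """Majority-vote consensus: return a label only when enough models agree."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None

# Three hypothetical model outputs for the same item:
print(consensus_label(["Pizza", "Pizza", "Calzone"]))   # "Pizza" (2/3 agree)
print(consensus_label(["Pizza", "Calzone", "Burger"]))  # None (no consensus)
```

The "no consensus" case is exactly what generated operational cost later on: every None either gets discarded or needs a human look.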

But over time, the scale and type of problem changed.

From the moment the need shifted from just generating labels to improving catalog classification in production, it became clear we would need to evolve our approach. iFood was already using classifiers based on traditional machine learning, and SAL initially emerged as an attempt to expand the quality of these classifications through automated labeling.

Despite bringing important gains in generating labeled data and semantic capacity, SAL didn’t scale well for the volumes required by the catalog.

In internal tests, the contrast was stark: what Sugar does in about two hours took SAL six to seven hours in an equivalent flow. And as volume grew, the time simply exceeded the acceptable window for data submission pipelines. On top of that, SAL had classic limitations for this use case: significant rates of invalid responses or lack of consensus among LLMs, plus the high operational cost of maintaining descriptions and definitions that demanded constant construction, correction, and validation.

The shift, then, wasn’t just “replacing one tool with another.” This change represented a broader evolution in how we treat catalog classification. We expanded the previous approach, based on traditional machine learning models and techniques like TF-IDF, to LLM-based solutions, capable of better handling semantic ambiguity and the diversity of descriptions present in the catalog.

At the same time, we stopped treating labeling as an isolated stage and began operating a continuous quality cycle, capable of classifying, auditing, measuring, and feeding back into the system.

This is where Sugar, Revisor, and Pepper come into play.

What are we building, after all?

If you need a short definition, Sugar + Revisor + Pepper form an intelligent catalog classification system with audit and continuous improvement.

Or, in more practical terms:

Sugar classifies items at scale (with hybrid model strategy);

Revisor judges the quality of what was classified, corrects exceptions, and transforms errors into learning;

Pepper generates validation labels (ground truth) with reliability and scale, to measure precision/recall and accelerate system evolution.

The result is a cycle that functions like an organism: executes, verifies, learns, and comes back better.

Why this matters more than it seems

Classification and attributes don’t live in isolation; they directly feed three fronts that everyone feels in the product:

Search: when users search for something, classification helps understand intent and filter the right offer.

Recommendation: models need consistent signals to suggest what makes sense in that context.

Curation and lists: those thematic and promotional lists depend on taxonomy/attributes to avoid becoming a “random mix”.

If the catalog doesn’t make sense, the app doesn’t make sense. And for this and other reasons, the bar is high here: precision and recall need to be strong enough not to compromise user experience. In other words: we can’t accept systems and services that “sometimes get it right”.

Sugar: the classification engine at scale

Sugar was born with a clear goal: to classify catalog items in volume and at speed, while maintaining semantic understanding.

In practice, this means having enough throughput to handle high volume. Today the service can process thousands of items in a few minutes, allowing classification to keep pace with the rhythm of catalog updates.
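
As an illustration of what that throughput looks like in code, here is a minimal sketch of bounded-concurrency batch classification. The concurrency limit and the stub model call are assumptions for the example, not Sugar's real internals.

```python
import asyncio

async def classify_item(item: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real model call
    return "some-category"

async def classify_batch(items: list[str], concurrency: int = 50) -> list[str]:
    # Bound the number of in-flight model calls with a semaphore.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(item: str) -> str:
        async with sem:
            return await classify_item(item)

    return await asyncio.gather(*(bounded(i) for i in items))

labels = asyncio.run(classify_batch([f"item-{n}" for n in range(1000)]))
print(len(labels))  # 1000 items classified concurrently
```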

What Sugar solves

Sugar answers the base question:

“What is this item and where does it fit in our taxonomy?”

This “where it fits” isn’t trivial. Taxonomy is a tree. Classifying an item means choosing the path in the tree, and this becomes especially difficult in the long tail, when the name is ambiguous or the context is incomplete.
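
To make "choosing a path in the tree" concrete, here is a minimal sketch of a taxonomy tree; the node structure and categories are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    name: str
    children: dict[str, "TaxonomyNode"] = field(default_factory=dict)

    def add_path(self, path: list[str]) -> None:
        node = self
        for name in path:
            node = node.children.setdefault(name, TaxonomyNode(name))

# A tiny, made-up slice of a food taxonomy:
root = TaxonomyNode("root")
root.add_path(["Meals", "Sandwiches", "Burgers"])
root.add_path(["Meals", "Sandwiches", "Wraps"])
root.add_path(["Drinks", "Juices"])

# Classifying an item means committing to one full path in this tree:
classification = ["Meals", "Sandwiches", "Burgers"]
```

An ambiguous name ("chicken roll", say) can plausibly match more than one path, which is exactly where semantic understanding earns its keep.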

The challenge: performance vs cost

Sugar, in its early phases, was built using frontier LLMs to guarantee semantic quality. It works, of course, but scaling this across millions of items can get expensive very quickly.

The solution was to evolve to a hybrid architecture, where the system decides when to use a proprietary model and when to trigger frontier models as fallback.

Today, the team is rolling out a smaller model trained on catalog data (the logic behind what is internally called 'Item Profile', derived from our work with proprietary models). In practical terms, 50% to 70% of inferences have already migrated to the proprietary model (an LLM trained on catalog data), while the rest falls back to frontier models whenever necessary.

This combination solves two problems at once: maintains quality where needed and reduces cost where possible.
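
A minimal sketch of what this routing logic could look like, assuming the proprietary model returns a confidence score (the threshold, interfaces, and stand-in models are illustrative):

```python
def proprietary_model(item: str) -> tuple[str, float]:
    # Illustrative stand-in for the cheaper in-house LLM: label + confidence.
    return ("Burgers", 0.91) if "burger" in item.lower() else ("Unknown", 0.30)

def frontier_model(item: str) -> str:
    # Placeholder for an expensive frontier-LLM call.
    return "Wraps"

def classify(item: str, threshold: float = 0.85) -> str:
    label, confidence = proprietary_model(item)
    if confidence >= threshold:
        return label                # cheap path: proprietary model is confident
    return frontier_model(item)     # escalation: fall back to a frontier model

print(classify("Classic cheese burger"))  # resolved by the proprietary model
print(classify("Chicken caesar wrap"))    # escalated to the frontier fallback
```

The threshold is the economic lever: raise it and quality-sensitive cases escalate more often; lower it and the cheap path absorbs more volume.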

Our team estimates a cost drop between 60% and 80% compared to depending exclusively on frontier LLMs, with the additional benefit of reducing maintenance cost by replacing multiple legacy models with a more unified inference.

Revisor: “LLM as a Judge”, with human help

When classifying items at scale, having quality governance is non-negotiable.

The classic approach is: classify everything → take a sample → send for human review.

It works, but it doesn't scale well: in practice, it means piling ever more volume onto an expensive and limited bottleneck.

Revisor, in turn, was born to act as an automatic, agentic audit layer before the human.

How the flow works

Revisor selects a daily sample of what Sugar classified and puts it inside a multi-agent system that evaluates different dimensions: item context, hallucination risk, business rules, and decision consistency.

These evaluations go to an agent that acts as a “judge” and gives the verdict:

If it agrees with Sugar, that becomes a reliable label for quality control.

If it disagrees, Revisor tries to reclassify (using a specialist agent, in multiple attempts).

If it still can’t resolve with confidence, the case goes to HITL (human-in-the-loop, or the human who will be the last validation layer). In other words, human review is triggered whenever there’s an exception.

This design makes an important adjustment to the system: the human stops being the first filter and becomes the last resource. With this, human review focuses only on the most complex cases, not on samples sent without prioritization criteria.
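
As a sketch, the verdict flow can be thought of as a routing function over three outcomes. The agent interfaces below are hypothetical stand-ins, not Revisor's actual implementation.

```python
from enum import Enum, auto
from typing import Callable, Optional

class Outcome(Enum):
    TRUSTED_LABEL = auto()   # judge agrees with Sugar
    RECLASSIFIED = auto()    # specialist agent fixed the label
    HUMAN_REVIEW = auto()    # exception escalated to HITL

def review(
    item: str,
    sugar_label: str,
    judge: Callable[[str, str], bool],
    reclassifier: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> tuple[Outcome, Optional[str]]:
    if judge(item, sugar_label):
        return Outcome.TRUSTED_LABEL, sugar_label
    for _ in range(max_attempts):
        candidate = reclassifier(item)
        if candidate is not None and judge(item, candidate):
            return Outcome.RECLASSIFIED, candidate
    return Outcome.HUMAN_REVIEW, None

# Toy agents: the judge rejects "Wraps" for burgers; the specialist proposes "Burgers".
outcome, label = review(
    "X-burger", "Wraps",
    judge=lambda i, l: not ("burger" in i.lower() and l == "Wraps"),
    reclassifier=lambda i: "Burgers",
)
print(outcome, label)  # Outcome.RECLASSIFIED Burgers
```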

Transforming error into learning

Beyond judging, Revisor also learns and feeds back into the cycle.

When identifying a failure (like a “hamburger” classified as “wrap”, for example), it generates an insight: a kind of rule/observation that will guide Sugar when similar cases happen.

In practice, this becomes an operational memory of the system:

Revisor detects the error;

extracts an insight from what happened;

associates this insight with similar items (via similarity);

and reinjects guidance into the classification flow.

This is the opposite of “fix today and forget tomorrow”. Our ambition is to prevent errors from repeating when the context is equivalent.
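
A minimal sketch of how such an operational memory could work, assuming items are compared via embedding similarity (the embeddings, store, and threshold are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical insight store: (embedding of the failing item, guidance text).
insights = [
    ([0.9, 0.1, 0.0], "Items named 'X burger' belong under Burgers, not Wraps."),
]

def guidance_for(item_embedding: list[float], min_similarity: float = 0.8) -> list[str]:
    """Attach past insights to a new item when its embedding is close enough."""
    return [text for emb, text in insights if cosine(item_embedding, emb) >= min_similarity]

# A new, similar item picks up the stored guidance before classification.
print(guidance_for([0.88, 0.12, 0.01]))
```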

Even in cases where Revisor agrees with Sugar, a small portion (for example, 1%) can go to human validation as additional control. This is a calibrated trust mechanism, which doesn’t blindly trust automations.

Pepper: measuring quality without collapsing the team

Up to this point, we’ve shown how the system can classify and audit data, but we still need to mention an essential ingredient for any system that wants to improve: reliable measurement.

To claim that a class performs well, you need to move past guesswork and measure it on a proper sample.

To validate quality with statistical significance, for example, it may be necessary to have something like 300–500 labels per class to measure recall and precision decently. Now, multiply this by hundreds of classes.
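
Where does a number like 300–500 come from? The classic sample-size formula for estimating a proportion, n = z² · p(1−p) / e², gives a feel for it. This is a back-of-the-envelope sketch, not necessarily the exact methodology used internally.

```python
import math

def sample_size(z: float = 1.96, p: float = 0.5, margin: float = 0.05) -> int:
    """Sample size for estimating a proportion: n = z^2 * p(1-p) / e^2."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size())             # ~385 labels for a 95% CI with a ±5% margin
print(sample_size(margin=0.04))  # ~601 labels if you tighten the margin to ±4%
```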

If the process depends on human labeling, the math will never work.

This is where Pepper comes into action: as a multi-agent system focused on validation labels, with reliability and scale.

What Pepper solves

Pepper eliminates a classic bottleneck: before, the data scientist needed to select data with criteria, run labeling, guarantee reliability, and repeat this whenever a class evolved.

This is a heavy, slow process and, depending on volume, unfeasible.

With Pepper, the idea is for data selection to be automated, with labeling agents receiving the ideal volume for each class. This way, the output can feed internal ground truth tables and continuous metrics.
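
A minimal sketch of that automated selection step, assuming each item already carries a (possibly noisy) class label; the names and volumes are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(items: list[dict], per_class: int, seed: int = 42) -> list[dict]:
    # Group items by class, then take up to `per_class` from each
    # group to hand to the labeling agents.
    rng = random.Random(seed)
    by_class: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        by_class[item["class"]].append(item)
    sample: list[dict] = []
    for group in by_class.values():
        rng.shuffle(group)
        sample.extend(group[:per_class])
    return sample

catalog = [{"id": i, "class": c} for i, c in enumerate(["Burgers", "Wraps", "Juices"] * 200)]
batch = stratified_sample(catalog, per_class=5)
print(len(batch))  # 15 items: 5 per class, ready for the labeling agents
```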

Today, Pepper correctly labels about 91% of cases compared to humans. And the best part: with less than 1% hallucination.

In other words, it makes validation at scale feasible without needing a human army to label samples all the time.

The complete cycle

If it’s still not clear, we can put it all together:

Input: a batch of catalog items enters.

Classification: Sugar classifies at scale.

When possible, it uses the proprietary model (cheaper and faster).

When the case is new or complex, it triggers the fallback to frontier models.

Review: Revisor evaluates a daily sample.

If it agrees, it consolidates confidence.

If it disagrees, it tries to reclassify.

If it can't resolve the case, it triggers HITL (human review) for the exceptions.

Learning: corrections and insights return to Sugar as guidance.

Validation: Pepper generates ground truth to measure recall/precision and sustain metrics at scale.

Continuous improvement: with metrics + insights + data, the system evolves faster, with less rework.

This makes it easier to visualize that we’re not talking about a “better model”, but a better process.

What changed in practice

The transition to an integrated cycle of classification and continuous improvement brought structural impacts to catalog operation. With this as a starting point, we consolidated an approach oriented toward scale, quality control, and operational efficiency.

We can feel effects of this change across different dimensions:

Speed and scale

When the need became “classify the entire catalog in a short time”, the drop from many hours to a few hours made all the difference.

The old approach wasn't sustainable. What we used to do in up to seven hours now takes about two.

Flexibility

In legacy models, improving a node could take two weeks to a month, depending on complexity.

With the current architecture, adjustments can happen in days and, in many cases, in a few hours.

This is critical because the catalog changes all the time: new items appear, descriptions vary, trends explode (and disappear) quickly. If the system doesn’t evolve at the same pace, it ages.

Less fragmentation

Before, extracting attributes (like ‘industrialized’, ‘frozen’, or ‘vegan’, for example) could depend on multiple separate legacy models that were difficult to maintain.

With the evolution to proprietary models and the 'Item Profile' logic, the vision is to concentrate inferences in a more unified layer. This makes maintenance more sustainable.

Cost under control, without sacrificing quality

Frontier LLMs are great. And expensive. The hybrid architecture reduces dependence on these models where unnecessary and preserves fallback where needed.

We estimate a cost drop of around 60% with increasing use of the proprietary model, plus indirect gains from reducing the maintenance of a collection of legacy models.

Quality with governance

Revisor allows the system to have a more mature posture. It can automate triage and review, involve humans when it makes sense, and transform exceptions into learning.

In the end, the goal isn’t to make the system “less human”. It’s to create an ecosystem where humans are strategic and work less with operational labeling and more with interventions in complex cases.

Where we’re going

If SAL represented a stage where the focus was accelerating labeling, this new phase shows another ambition: transforming offer qualification into a living, semantic, and self-correcting system.

Today, Sugar, Revisor, and Pepper function as a layer that enables other iFood fronts (especially Search and Recommendation) with more consistent and reliable signals.

The natural next step, speaking of the problem and not necessarily the solution, is to gain even more flexibility.

We want to reduce dependence on human intervention in operational cases and evolve the system’s ability to handle new classes and taxonomy changes more fluidly. And beyond that, make the learning cycle increasingly continuous.

Because, in the end, the catalog is alive. So the system that makes sense of the catalog also needs to be. This way, we can ensure that users always find exactly what they’re looking for.

Special thanks to Jeferson Santos and Ronald Pereira.

Osvaldo Berg

AI & DS Lead at iFood. Passionate about cinema and sci-fi books.

