Cost Optimization at Scale: The Impact of Kafka Topic Compression at iFood

In iFood’s technology ecosystem, real-time data processing underpins everything from live driver location tracking to fraud detection in our marketplace. As data volume has grown exponentially, operational efficiency and financial sustainability have become strategic priorities for our streaming infrastructure.

Recently, we faced the challenge of optimizing Kafka costs without compromising the performance of our applications. This article details how a battery of control-group tests, designed to falsify a hypothesis, guided us to a compression strategy that has already reduced the effective network cost of Kafka by 16% and achieved ROIs exceeding 11,000% in specific production scenarios.

The Main Challenge: Data Growth vs. Financial Sustainability

Kafka is a distributed event streaming platform, and its operational cost is directly driven by data throughput. Without an optimized compression strategy, growth in message volume puts heavy pressure on data transfer, network, and storage costs. Our goal was to find the optimal balance between the additional CPU (Central Processing Unit) usage that compression requires, that is, its computational overhead, and the savings generated by reducing data volume.
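The trade-off described above can be sketched as simple arithmetic. This is a minimal illustration, not the model used by the team: the function names, parameters, and the assumption that savings scale linearly with the reduction in data volume are all hypothetical.

```python
def net_monthly_benefit(transfer_cost: float, reduction: float, extra_cpu_cost: float) -> float:
    """Savings from a smaller data volume minus the added CPU spend.

    transfer_cost: monthly cost without compression (hypothetical input)
    reduction: fraction of that cost removed by compression, between 0 and 1
    extra_cpu_cost: monthly cost of the additional CPU used to compress
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return transfer_cost * reduction - extra_cpu_cost


def break_even_reduction(transfer_cost: float, extra_cpu_cost: float) -> float:
    """Smallest cost reduction (as a fraction) at which compression pays for itself."""
    if transfer_cost <= 0:
        raise ValueError("transfer_cost must be positive")
    return extra_cpu_cost / transfer_cost
```

With these assumptions, compression is worth enabling whenever the measured reduction is above the break-even fraction; the control-group tests exist to measure that reduction instead of guessing it.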

It seems obvious, doesn’t it? In reality, the conclusion took time to reach. In a complex system, such as Kafka operating in a high-volume environment, isolating variables is not trivial. We started by trying to falsify a hypothesis: to us, compression did not seem worth the CPU investment. Fortunately, we were wrong, and willing to test it. To this end, we developed a dedicated methodology for the project.

Methodology: The Science Behind the Tests

To ensure that our decisions were evidence-based, we established a rigorous testing methodology using control groups. The data-store-stream team, responsible for all of iFood’s data streaming platforms, dedicated more than a week of multidisciplinary work to arrive at the results presented here.

Details about the environment configuration

Platform: Kubernetes (K8s) integrated with Confluent Cloud;
Load tools: K6, simulating virtual users (VUs) with an average payload of 2 KB;
Monitoring: Real-time metrics via Datadog to collect CPU, memory, and latency data.

We tested the two libraries most widely used internally by our developers: Sarama, written in Go, and the Confluent Kafka Client for Java (JVM). The focus was to compare the behavior of the LZ4 (Lempel-Ziv 4) and ZSTD (Zstandard) algorithms.
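On the producer side, the codec is usually a single setting. The sketch below builds a property map using the standard Kafka producer keys `bootstrap.servers` and `compression.type`; the helper name and the broker address in the usage note are hypothetical, and how the map is handed to a client depends on the library in use.

```python
# Codec names accepted by the Kafka producer setting `compression.type`.
VALID_CODECS = {"none", "gzip", "snappy", "lz4", "zstd"}


def producer_properties(bootstrap_servers: str, codec: str = "zstd") -> dict:
    """Return producer properties with compression enabled (hypothetical helper)."""
    if codec not in VALID_CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    return {
        "bootstrap.servers": bootstrap_servers,
        "compression.type": codec,
    }
```

For example, `producer_properties("broker.example:9092", "lz4")` yields the configuration for an LZ4 test group, and the default yields the ZSTD group, so the two groups differ in exactly one property.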

Control Group Results: Go vs. Java

The initial analysis in the control groups revealed interesting disparities between the technologies. Applications written in Go showed massive optimization potential, and the ZSTD algorithm stood out as the most efficient:

Monthly savings: 81.5%;
ROI: 6,683% per month;
CPU overhead: Only 7% increase, compared to 41% for LZ4.
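The article does not spell out how ROI was computed. A common definition, assumed here, is net gain divided by investment, where the investment is the cost of the extra CPU and the gain is the savings on the Kafka bill. The function name and the figures in the usage note are illustrative only.

```python
def roi_percent(savings: float, investment: float) -> float:
    """Return on investment as a percentage: (savings - investment) / investment.

    savings: money saved over the period thanks to compression
    investment: extra CPU cost paid over the same period
    """
    if investment <= 0:
        raise ValueError("investment must be positive")
    return (savings - investment) / investment * 100.0
```

As a hypothetical example, spending 100 on additional CPU to save 200 gives `roi_percent(200.0, 100.0)`, an ROI of 100%.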

In the controlled test scenario, Java applications showed initial savings of 7.6%. Although this seemed modest compared to the Go applications, the stability and efficiency of the JVM in handling compression without major changes in CPU usage indicated that the gain at scale would be significant.

From the Laboratory to Production: Main Results Obtained

Large differences between a controlled environment and production are common. In our case, the production results not only confirmed our hypotheses but exceeded the most optimistic projections. We highlight some cases:

Case 1: LZ4 to ZSTD Transition in Java

In a tracing service, switching from the LZ4 algorithm to ZSTD resulted in:

Operational cost reduction: 27.43%;
CPU increase: 5%;
12-month ROI: 742%.

Case 2: The Power of ZSTD Where There Was No Compression

The most surprising result came from Java producers that operated without compression. When we turned on ZSTD, we observed:

Operational cost reduction: 74.5%;
CPU increase: Only 3.2%;
12-month ROI: An incredible 11,085%.

Case 3: Efficiency in Go

In search topics, migration to ZSTD confirmed the efficiency observed in the laboratory:

Operational cost reduction: 67.9%;
CPU increase: 26.8%;
12-month ROI: 1,710%.

Learnings and Technical Recommendations

The implementation journey taught us that compression is not a “silver bullet,” but a precision tool. Our main takeaways:

The CPU trade-off pays off: The investment in a few additional cores in Kubernetes is more than paid for by savings in Confluent Cloud. In Go, every $1 invested in CPU returned $67.90 in savings;
ZSTD is the clear winner: In almost all scenarios, ZSTD offered the best compression ratio with lower CPU overhead than LZ4;
Pay attention to K8s limits: Before turning on compression, make sure that the CPU requests and limits of your applications are not too tight. The processing increase is real, and the deployment needs to be ready to scale.
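The K8s limit check can be done on paper before any rollout. The sketch below is one possible way to do it, assuming CPU usage grows by the overhead percentage measured in testing; the function names and the 20% safety margin are hypothetical choices, not values from our environment.

```python
def projected_cpu_millicores(current_millicores: float, overhead_pct: float) -> float:
    """Expected CPU usage after enabling compression, in millicores."""
    return current_millicores * (1.0 + overhead_pct / 100.0)


def fits_within_limit(
    current_millicores: float,
    limit_millicores: float,
    overhead_pct: float,
    safety_margin: float = 0.2,  # hypothetical: keep 20% of the limit free
) -> bool:
    """True if projected usage stays below the CPU limit minus the safety margin."""
    projected = projected_cpu_millicores(current_millicores, overhead_pct)
    return projected <= limit_millicores * (1.0 - safety_margin)
```

For instance, a pod using 700m against a 1000m limit would not pass this check with a 26.8% overhead, which signals that the limit should be raised or replicas added before compression is turned on.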

Applied Science in Production

The implementation of Kafka topic compression is a clear victory for our FinOps strategy. By reducing the cost per network request by 16%, we not only saved resources but created a more sustainable infrastructure for iFood’s growth.

The success of this initiative reinforces the importance of control groups: they gave us the necessary confidence to make structural changes in production with calculated risks and financial predictability.