

In iFood’s technology ecosystem, real-time data processing sustains everything from live driver location tracking to fraud detection in our marketplace. With the exponential growth in data volume, operational efficiency and financial sustainability have become strategic priorities for our streaming infrastructure.
Recently, we faced the challenge of optimizing Kafka costs without compromising the performance of our applications. This article details how a battery of control-group tests, designed to falsify a hypothesis, guided us to a compression strategy that has already reduced the effective network cost of Kafka by 16% and achieved ROIs exceeding 11,000% in specific production scenarios.
The operational cost of Kafka, a distributed event streaming platform, is directly driven by data throughput. Without an optimized compression strategy, growth in message volume puts heavy pressure on data transfer, network, and storage costs. Our goal was to find the optimal point between the additional CPU usage (the computational overhead of compressing messages) and the savings generated by reducing data volume.
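As a rough illustration of this trade-off, the sketch below compares the extra CPU spend with the transfer savings for a single workload. The type `CompressionTradeOff`, its fields, and the dollar figures are hypothetical and are not iFood's numbers. ROI is computed here as net savings divided by the extra CPU cost, which is one common definition; the article does not state which definition was used for its own figures.

```go
package main

import "fmt"

// CompressionTradeOff holds hypothetical monthly figures for one workload.
type CompressionTradeOff struct {
	ExtraCPUCost    float64 // added compute spend caused by compression (USD)
	TransferSavings float64 // reduction in network and storage spend (USD)
}

// NetSavings is what remains after paying for the extra CPU.
func (c CompressionTradeOff) NetSavings() float64 {
	return c.TransferSavings - c.ExtraCPUCost
}

// ROIPercent expresses net savings relative to the extra CPU cost.
// It returns an error when there is no positive cost to compare against.
func (c CompressionTradeOff) ROIPercent() (float64, error) {
	if c.ExtraCPUCost <= 0 {
		return 0, fmt.Errorf("extra CPU cost must be positive, got %v", c.ExtraCPUCost)
	}
	return c.NetSavings() / c.ExtraCPUCost * 100, nil
}

func main() {
	// Made-up example: $100 of extra CPU buys $1,500 of transfer savings.
	c := CompressionTradeOff{ExtraCPUCost: 100, TransferSavings: 1500}
	roi, err := c.ROIPercent()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("net savings: $%.2f, ROI: %.0f%%\n", c.NetSavings(), roi)
}
```

Compression is worth enabling for a workload only while `NetSavings` stays positive; a negative value means the CPU bill grew faster than the transfer bill shrank.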
It seems obvious, doesn’t it? But the conclusion took time to reach. In a complex system, such as Kafka running in a high-volume environment, it is not trivial to isolate variables. We started from an attempt to falsify a hypothesis: to us, compression did not seem worth the CPU investment. Fortunately, we were wrong, and willing to test it. To that end, we developed a methodology specifically for this project.
To ensure that our decisions were evidence-based, we established a rigorous testing methodology using control groups. The data-store-stream team, responsible for all of iFood’s data streaming platforms, dedicated more than one week to producing the results presented here, in a multidisciplinary effort of high technical quality.
We tested the two client libraries most widely used internally by our developers: Sarama, written in Go, and the Confluent Kafka Client for Java/JVM. The focus was to compare the behavior of the LZ4 (Lempel-Ziv 4) and ZSTD (Zstandard) algorithms.
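To give a sense of what such a comparison measures, here is a minimal sketch that compresses a sample payload and reports the bytes saved and the time taken. It is not the team's test harness. Because neither LZ4 nor ZSTD is part of the Go standard library, it uses `compress/gzip` purely as a stand-in codec. The `Measurement` type, the `MeasureGzip` function, and the sample event are all hypothetical. In a real producer the codec is selected through client configuration (the `compression.type` setting in the Java client, or `Producer.Compression` in a Sarama config) rather than called directly.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
	"time"
)

// Measurement records the outcome of compressing one payload.
type Measurement struct {
	OriginalBytes   int
	CompressedBytes int
	Elapsed         time.Duration // wall-clock time, only a rough proxy for CPU cost
}

// SavedPercent is the share of bytes that no longer needs to cross the network.
func (m Measurement) SavedPercent() float64 {
	if m.OriginalBytes == 0 {
		return 0
	}
	return (1 - float64(m.CompressedBytes)/float64(m.OriginalBytes)) * 100
}

// MeasureGzip compresses payload in memory and reports sizes and elapsed time.
// gzip stands in for LZ4/ZSTD, which are not available in the standard library.
func MeasureGzip(payload []byte) (Measurement, error) {
	var buf bytes.Buffer
	start := time.Now()
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(payload); err != nil {
		return Measurement{}, err
	}
	// Close flushes the remaining compressed data into buf.
	if err := w.Close(); err != nil {
		return Measurement{}, err
	}
	return Measurement{
		OriginalBytes:   len(payload),
		CompressedBytes: buf.Len(),
		Elapsed:         time.Since(start),
	}, nil
}

func main() {
	// Hypothetical batch of near-identical JSON events.
	event := `{"event":"location","driver_id":12345,"lat":-19.92,"lng":-43.94}`
	payload := []byte(strings.Repeat(event, 500))

	m, err := MeasureGzip(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d -> %d bytes (%.1f%% saved) in %s\n",
		m.OriginalBytes, m.CompressedBytes, m.SavedPercent(), m.Elapsed)
}
```

Running the same measurement on payloads sampled from real topics, once per codec, gives the two numbers the trade-off depends on: bytes saved and compute spent.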
The initial analysis in the control groups revealed interesting disparities between the technologies. For example, applications written in Go showed massive optimization potential. The ZSTD algorithm stood out as the most efficient:
In the controlled test scenario, Java applications showed initial savings of 7.6%. Although this seemed modest compared to the Go applications, the stability and efficiency of the JVM in handling compression without major changes in CPU usage indicated that the gain at scale would be significant.
Large differences between a controlled environment and production are common. In our case, the production results not only confirmed what the tests had shown but exceeded our most optimistic projections. We highlight some cases:
In a tracing service, switching from the LZ4 algorithm to ZSTD resulted in:
The most surprising result came from Java producers that operated without compression. When we turned on ZSTD, we observed:
In search topics, migration to ZSTD confirmed the efficiency observed in the laboratory:
The implementation journey taught us that compression is not a “silver bullet,” but a precision tool.
The CPU trade-off pays off. The investment in a few additional cores in Kubernetes is more than paid for by the savings in Confluent Cloud. In Go, every $1 invested in CPU returned $67.90 in savings.
ZSTD is the clear winner. In almost all scenarios, ZSTD offered the best compression ratio while using CPU more efficiently than LZ4.
Pay attention to K8s limits. Before turning on compression, make sure that the CPU requests and limits of your applications are not too tight. The processing increase is real, and the deployment needs to be ready to scale.
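The advice about K8s limits can be turned into a simple pre-flight check. The sketch below is one possible way to frame it, not a procedure from the article. The `CPUBudget` type, the assumed 25% CPU overhead, and the 80% safety threshold are hypothetical values chosen for illustration; the real overhead has to come from your own measurements.

```go
package main

import "fmt"

// CPUBudget describes one container's CPU situation in Kubernetes millicores.
type CPUBudget struct {
	CurrentMilli int     // observed peak usage before compression
	LimitMilli   int     // the container's CPU limit
	Overhead     float64 // assumed relative CPU increase from compression (0.25 = +25%)
	MaxUtil      float64 // highest share of the limit considered safe (0.8 = 80%)
}

// ProjectedMilli estimates peak usage once compression is turned on.
func (b CPUBudget) ProjectedMilli() float64 {
	return float64(b.CurrentMilli) * (1 + b.Overhead)
}

// HasHeadroom reports whether the projection stays within the safe share of the limit.
func (b CPUBudget) HasHeadroom() bool {
	return b.ProjectedMilli() <= float64(b.LimitMilli)*b.MaxUtil
}

func main() {
	// Two made-up containers sharing the same limit and assumptions.
	budgets := []CPUBudget{
		{CurrentMilli: 600, LimitMilli: 1000, Overhead: 0.25, MaxUtil: 0.8},
		{CurrentMilli: 700, LimitMilli: 1000, Overhead: 0.25, MaxUtil: 0.8},
	}
	for _, b := range budgets {
		fmt.Printf("current=%dm projected=%.0fm limit=%dm headroom=%t\n",
			b.CurrentMilli, b.ProjectedMilli(), b.LimitMilli, b.HasHeadroom())
	}
}
```

A workload that fails the check is a candidate for raising its CPU request and limit before compression is enabled, rather than after throttling shows up in production.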
The implementation of Kafka topic compression is a clear victory for our FinOps strategy. By reducing the cost per network request by 16%, we not only saved resources but created a more sustainable infrastructure for iFood’s growth.
The success of this initiative reinforces the importance of control groups: they gave us the necessary confidence to make structural changes in production with calculated risks and financial predictability.

Site Reliability Engineer
Holds a bachelor’s degree in Economics and has worked as an SRE since 2020. A native of Belo Horizonte, Minas Gerais, he believes that economics and SRE go hand in hand. He spends his weekends in the worlds of Magic: The Gathering and Dungeons & Dragons.