Consuming over 1 billion Kafka messages per day at Ifood

Quick recap

We have a microservice that stores Ifood’s customer metadata (internally we call it account-metadata), and during peak hours it reaches over 2 million requests per minute. This system is called by the web and mobile apps and by many internal teams to fetch customers’ data. The data is stored in a single DynamoDB table (1.3 billion items using 757.3 GB).

Main problem

The data team’s pipeline, which exported the data from the data lake, processed it using Databricks and Airflow, and finally sent it to account-metadata, consisted of lots of steps and could fail very easily. Something often broke along the way, and we didn’t always have good monitoring of every single piece, which meant the process was not reliable. This was a big problem for us as the User Profile team, given that the quality of our data is at the core of our concerns and interests.

Our savior: Feature Store

While we were looking for alternatives and trying to figure out how to replace the entire ingestion process, one of the ML teams inside Ifood was building an awesome new tool: a Feature Store (FS) project. In summary, a Feature Store is a way to provide and share data very easily to empower ML applications, model training, and real-time predictions. On one side, the FS reads data from somewhere (a data lake, a data warehouse, Kafka topics, etc.), aggregates it, does some kind of processing or calculation, and then exports the results on the other side (an API, a database, Kafka topics, etc.).

Consuming data from Feature Store

As I said, the FS would export the data into a Kafka topic. We defined the schema like this:

account_id: string
feature_name: string
feature_value: string
namespace: string
timestamp: int
| account_id | namespace | columns and their values… |
| 283a385e-8822-4e6e-a694-eafe62ea3efb | orders | total_orders: 3, total_orders_lunch: 2 |
| c26d796a-38f9-481f-87eb-283e9254530f | rewards | segmentation: A |
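To make the mapping concrete, here is a minimal sketch of how one event with that schema could be turned into a DynamoDB item. The item layout (account_id as the partition key, namespace as the sort key, one attribute per feature) is my assumption for illustration, not the exact production layout:

```python
import json

def message_to_item(message: dict) -> dict:
    """Map one Feature Store event to a hypothetical DynamoDB item:
    partition key = account_id, sort key = namespace, plus one
    attribute named after the feature."""
    return {
        "account_id": message["account_id"],  # partition key
        "namespace": message["namespace"],    # sort key
        message["feature_name"]: message["feature_value"],
        "timestamp": message["timestamp"],
    }

# Example event, matching the schema above
event = json.loads(
    '{"account_id": "user1", "feature_name": "total_orders", '
    '"feature_value": "3", "namespace": "orders", "timestamp": 1}'
)
item = message_to_item(event)
```

In a real consumer, `item` would then be passed to a DynamoDB write call.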

Testing it

Once the consumer was done, we had to make sure that we weren’t losing messages, didn’t have race condition issues, and that the data was being saved properly. To test that, we added a hardcoded prefix to the namespace name in the consumer to save, let’s say, the data from the orders namespace into a testing-orders namespace. With that in place, we wrote a couple of scripts to scan the database and compare the values between the namespaces (orders vs. testing-orders) to ensure they were the same.
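The core of those comparison scripts can be sketched as a simple diff over two in-memory snapshots. This is an illustrative stand-in for the real database scans, assuming each snapshot maps account_id to its feature values:

```python
def diff_namespaces(original: dict, testing: dict) -> list:
    """Return the account_ids whose features differ between two
    namespace snapshots (e.g. orders vs. testing-orders).
    Each argument maps account_id -> {feature_name: feature_value}."""
    mismatched = []
    for account_id in original.keys() | testing.keys():
        # Accounts missing from one side, or with different values, count as mismatches
        if original.get(account_id) != testing.get(account_id):
            mismatched.append(account_id)
    return sorted(mismatched)

# Example: user2 has a different value, user3 only exists in testing
orders = {"user1": {"total_orders": "3"}, "user2": {"segmentation": "A"}}
testing_orders = {"user1": {"total_orders": "3"},
                  "user2": {"segmentation": "B"},
                  "user3": {"total_orders": "1"}}
mismatches = diff_namespaces(orders, testing_orders)
```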

Making it better and faster

Well, we had our first version. It was working properly, consuming events really fast and saving them correctly in DynamoDB. However, the database costs were really high, and this first approach was quite inefficient: we were doing one write to DynamoDB per message, even though messages for the same account_id often arrived close together in the Kafka topic, since we were using account_id as the partition key. Something like this:

| account_id | feature_name | feature_value | namespace | timestamp |
| user1 | total_orders | 3 | orders | ts1 |
| user1 | total_orders_lunch | 1 | orders | ts3 |
| user1 | segmentation | foo | rewards | ts2 |
| user2 | segmentation | foo | rewards | ts2 |
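One way to collapse writes like these is to batch each poll by (account_id, namespace) and keep only the newest value per feature before writing. This is a hedged sketch of that idea, with numeric timestamps standing in for ts1/ts2/ts3, not the exact production code:

```python
from collections import defaultdict

def batch_writes(messages: list) -> dict:
    """Collapse a batch of consumed messages into one pending write per
    (account_id, namespace) pair, keeping the newest value per feature
    (higher timestamp wins)."""
    latest = {}  # (account_id, namespace, feature_name) -> (timestamp, value)
    for m in messages:
        key = (m["account_id"], m["namespace"], m["feature_name"])
        if key not in latest or m["timestamp"] > latest[key][0]:
            latest[key] = (m["timestamp"], m["feature_value"])
    writes = defaultdict(dict)
    for (account_id, namespace, feature), (_, value) in latest.items():
        writes[(account_id, namespace)][feature] = value
    return dict(writes)

# The four messages from the table above become three writes instead of four
messages = [
    {"account_id": "user1", "feature_name": "total_orders", "feature_value": "3", "namespace": "orders", "timestamp": 1},
    {"account_id": "user1", "feature_name": "total_orders_lunch", "feature_value": "1", "namespace": "orders", "timestamp": 3},
    {"account_id": "user1", "feature_name": "segmentation", "feature_value": "foo", "namespace": "rewards", "timestamp": 2},
    {"account_id": "user2", "feature_name": "segmentation", "feature_value": "foo", "namespace": "rewards", "timestamp": 2},
]
writes = batch_writes(messages)
```

The payoff grows with how hot each account_id is: since account_id is the partition key, repeated updates for one customer land on the same consumer and fold into a single write.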


To monitor the workflow, we invested a lot of time in getting very good observability of the process. We used Prometheus/Grafana to create custom metrics for the consumers and to collect pod metrics; Datadog to collect metrics from the Kafka topic, build dashboards on consumer group lag, and monitor the cluster in general; and New Relic to collect errors from the consumers and gather a bit more data about them.

Main takeaways

  • You most likely won’t get the best result on the first attempt. Take notes of what works and what doesn’t, and do good research into what could be better.
  • Processing this amount of data can be intimidating, but it’s not the end of the world. It requires focus and attention to detail, but you can do it.
  • While processing this amount of data, it’s also hard to make sure you’re not losing data, not running into race conditions, and so on. Invest a lot of time in that.
  • It does require some tuning to make Kafka consumers very fast. My main advice is to read the documentation carefully and play around with the parameters. The main ones for us were “fetch.min.bytes”, auto-commit, and the max poll interval settings.
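As a starting point, those tuning knobs map to standard Kafka consumer configuration keys. The values below are illustrative examples, not the ones used in production, and the broker address and group name are placeholders:

```python
# Illustrative consumer settings (confluent-kafka style config dict).
# All values here are examples for experimentation, not production numbers.
consumer_config = {
    "bootstrap.servers": "kafka:9092",        # placeholder broker address
    "group.id": "account-metadata-consumer",  # hypothetical group name
    "fetch.min.bytes": 1_000_000,     # let the broker batch ~1 MB per fetch
    "enable.auto.commit": False,      # commit offsets manually after writing
    "max.poll.interval.ms": 300_000,  # allow slow batches before a rebalance
}
```

Turning auto-commit off and committing only after a successful write is what keeps a crash from silently dropping messages; the fetch and poll-interval settings trade latency for throughput.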



felipe volpone

I’m into distributed systems and how we can make them easier to develop and maintain. Writing code to scale to millions of users @ Ifood, formerly Red Hat.