Commit Graph

9 Commits

Author SHA1 Message Date
Vincent Bernat
a66ce7cc3e outlet/kafka: make lag test more robust
The consumer may not have started when testing initial lag. Just try a
bit more.
2025-09-14 12:04:04 +02:00
François HORTA
e3a778552d outlet/kafka: expose consumer lag as a prometheus metric
Monitoring consumer lag is useful to troubleshoot performance/scaling
issues. It can currenctly be seen through kafka-ui, but a proper metric
is more practical.

Unfortunately, JMX metrics on the broker don't expose this. It seems
that people usually resort to monitoring from the consumer side, or
through other external exporters like Burrow or kafka_exporter.

franz-go/kadm provides a function to compute the consumer lag so let's
do it from the consumer side (the outlet)
2025-08-30 19:17:32 +02:00
Vincent Bernat
866658bc70 outlet/kafka: fix crash when scaling down and up the workers
The same metrics cannot be registered twice. Introduce a new method in
reporter to unregister a previously registered collector.

Fix #1908
2025-08-27 08:28:14 +02:00
Vincent Bernat
c2ef22b3fc outlet/kafka: add more debugging to the worker scaling test 2025-08-24 08:38:32 +02:00
Vincent Bernat
8d9d323710 Revert "outlet/kafka: pace a bit the worker test"
This reverts commit 0130119bdc. The timing
is a bit sensitive.
2025-08-23 17:32:07 +02:00
Vincent Bernat
0130119bdc outlet/kafka: pace a bit the worker test 2025-08-23 16:21:42 +02:00
Vincent Bernat
e5a625aecf outlet: make the number of Kafka workers dynamic
Inserting into ClickHouse should be done in large batches to minimize
the number of parts created. This would require the user to tune the
number of Kafka workers to match a target of around 50k-100k rows. Instead,
we dynamically tune the number of workers depending on the load to reach
this target.

We keep using async if we are too low in number of flows.

It is still possible to do better by consolidating batches from various
workers, but that's something I wanted to avoid.

Also, increase the maximum wait time to 5 seconds. It should be good
enough for most people.

Fix #1885
2025-08-09 15:58:25 +02:00
Vincent Bernat
756e4a8fbd */kafka: switch to franz-go
The concurrency of this library is easier to handle than Sarama.
Notably, it is more compatible with the new model of "almost share
nothing" we use for the inlet and the outlet. The lock for workers in
outlet is removed. We can now use sync.Pool to allocate slice of bytes
in inlet.

It may also be more performant.

In the future, we may want to commit only when pushing data to
ClickHouse. However, this does not seem easy when there is a rebalance.
In case of rebalance, we need to do something when a partition is
revoked to avoid duplicating data. For example, we could flush the
current batch to ClickHouse. Have a look at the
`example/mark_offsets/main.go` file in franz-go repository for a
possible approach. In the meantime, we rely on autocommit.

Another contender could be https://github.com/segmentio/kafka-go. Also
see https://github.com/twmb/franz-go/pull/1064.
2025-07-27 21:44:28 +02:00
Vincent Bernat
ac68c5970e inlet: split inlet into new inlet and outlet
This change split the inlet component into a simpler inlet and a new
outlet component. The new inlet component receive flows and put them in
Kafka, unparsed. The outlet component takes them from Kafka and resume
the processing from here (flow parsing, enrichment) and puts them in
ClickHouse.

The main goal is to ensure the inlet does a minimal work to not be late
when processing packets (and restart faster). It also brings some
simplification as the number of knobs to tune everything is reduced: for
inlet, we only need to tune the queue size for UDP, the number of
workers and a few Kafka parameters; for outlet, we need to tune a few
Kafka parameters, the number of workers and a few ClickHouse parameters.

The outlet component features a simple Kafka input component. The core
component becomes just a callback function. There is also a new
ClickHouse component to push data to ClickHouse using the low-level
ch-go library with batch inserts.

This processing has an impact on the internal representation of a
FlowMessage. Previously, it was tailored to dynamically build the
protobuf message to be put in Kafka. Now, it builds the batch request to
be sent to ClickHouse. This makes the FlowMessage structure hides the
content of the next batch request and therefore, it should be reused.
This also changes the way we decode flows as they don't output
FlowMessage anymore, they reuse one that is provided to each worker.

The ClickHouse tables are slightly updated. Instead of using Kafka
engine, the Null engine is used instead.

Fix #1122
2025-07-27 21:44:28 +02:00