10 Commits

Author SHA1 Message Date
Vincent Bernat
1322d42549 outlet/kafka: fix scaler hysteresis
Previously, the scaler handled scale-up and scale-down requests
independently. Because Kafka rebalances the topic whenever the number of
workers changes, we temporarily get scale-down requests, and the rate
limiter does not stop them as it is independent from the scale-up rate
limiter. Instead, the rate limit for increase now acts as a grace time
during which everything is ignored. Between that and the rate limit for
decrease, we only consider increasing the number of workers. Past that,
we scale down as long as we have a majority of scale-down requests
(compared to steady ones).

Fix #2080 (hopefully)
2025-11-11 21:26:05 +01:00
Vincent Bernat
894485c3ac outlet/clickhouse: give more time for ClickHouse component to flush data
Not only on shutdown, but also on finalization.
2025-11-10 16:17:21 +01:00
Vincent Bernat
10dc05d05c outlet/core: give more time for ClickHouse component to flush data
10 seconds may be a bit low. This is part of #2079, but this is more of a stopgap.
2025-11-10 16:15:34 +01:00
Gregor Düster
15d9f2531a outlet/core: fix typo
2025-09-29 05:37:19 +02:00
Vincent Bernat
e5a625aecf outlet: make the number of Kafka workers dynamic
Inserting into ClickHouse should be done in large batches to minimize
the number of parts created. This would require the user to tune the
number of Kafka workers to match a target of around 50k-100k rows. Instead,
we dynamically tune the number of workers depending on the load to reach
this target.
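
As a rough sketch of the idea, a worker could compare the size of the batch it just flushed with the target range and ask for more or fewer workers. The function name `advise`, the `Advice` type and the exact thresholds are assumptions made for illustration (the message only says "around 50k-100k rows"). Fewer workers means each one consumes more flows and therefore builds larger batches.

```go
package main

import "fmt"

// Advice is a hypothetical hint sent by a worker after flushing a batch.
type Advice string

const (
	FewerWorkers Advice = "fewer workers"
	Keep         Advice = "keep"
	MoreWorkers  Advice = "more workers"
)

// Assumed bounds for the target batch size, in rows.
const (
	minRows = 50_000
	maxRows = 100_000
)

// advise compares the size of the last batch with the target range.
func advise(rowsInBatch int) Advice {
	switch {
	case rowsInBatch < minRows:
		// Batches are too small: fewer workers means bigger batches.
		return FewerWorkers
	case rowsInBatch > maxRows:
		// Batches are too large: spread the load over more workers.
		return MoreWorkers
	default:
		return Keep
	}
}

func main() {
	for _, rows := range []int{8_000, 70_000, 250_000} {
		fmt.Printf("%d rows: %s\n", rows, advise(rows))
	}
}
```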

We keep using async when the number of flows is too low.

It is still possible to do better by consolidating batches from various
workers, but that's something I wanted to avoid.

Also, increase the maximum wait time to 5 seconds. It should be good
enough for most people.

Fix #1885
2025-08-09 15:58:25 +02:00
Vincent Bernat
98eb1bdba5 chore: make a run of gofumpt
2025-08-05 06:21:34 +02:00
Vincent Bernat
17a272d0ba docs: update troubleshooting documentation
2025-07-27 21:44:28 +02:00
Vincent Bernat
8b580fd26b outlet/core: reuse the same RawFlow object when decoding
Each worker gets one and works on only that one object. Something similar
could be done in inlet/flows. We could use a sync.Pool, as described in
https://github.com/IBM/sarama/issues/1302. However, it is not 100%
confirmed this is safe.
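
A minimal sketch of the per-worker reuse pattern, assuming a simplified `RawFlow` with a single `Payload` field and a hypothetical `decodeInto` helper (neither reflects the real generated type). It uses one object owned by each worker rather than a `sync.Pool`, so no object is ever shared between goroutines.

```go
package main

import "fmt"

// RawFlow is a hypothetical stand-in for the real message type.
type RawFlow struct {
	Payload []byte
}

// Reset clears the content but keeps the allocated memory for reuse.
func (r *RawFlow) Reset() { r.Payload = r.Payload[:0] }

// decodeInto is a hypothetical decoder filling dst from msg. It reuses
// the buffer already owned by dst instead of allocating a new object.
func decodeInto(dst *RawFlow, msg []byte) {
	dst.Reset()
	dst.Payload = append(dst.Payload, msg...)
}

// worker owns a single RawFlow and reuses it for every message.
// handle must not keep a reference to the flow after it returns.
func worker(msgs [][]byte, handle func(*RawFlow)) {
	var flow RawFlow
	for _, m := range msgs {
		decodeInto(&flow, m)
		handle(&flow)
	}
}

func main() {
	msgs := [][]byte{[]byte("first"), []byte("second")}
	worker(msgs, func(f *RawFlow) {
		fmt.Printf("%d bytes: %s\n", len(f.Payload), f.Payload)
	})
}
```

The constraint is that the callback has to be done with the object before the next message is decoded, since its content is overwritten.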
2025-07-27 21:44:28 +02:00
Vincent Bernat
e49a744a6d build: use vtprotobuf to speedup protobuf marshal/unmarshal
There is still room for improvement. For inlet, it would require knowing
when Kafka has sent the message (that is, enabling the return of
successes). For outlet, it should be possible to reuse the same flow
(with a ResetVT between each use).
2025-07-27 21:44:28 +02:00
Vincent Bernat
ac68c5970e inlet: split inlet into new inlet and outlet
This change splits the inlet component into a simpler inlet and a new
outlet component. The new inlet component receives flows and puts them in
Kafka, unparsed. The outlet component takes them from Kafka, resumes
the processing from there (flow parsing, enrichment) and puts them in
ClickHouse.
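
The division of work can be pictured with the toy pipeline below. It is only a sketch: a Go channel stands in for the Kafka topic, a slice stands in for the ClickHouse batch, and the `exporter|payload` wire format, the `Flow` type and the `inlet`/`outlet` functions are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Flow is a hypothetical parsed flow.
type Flow struct {
	Exporter string
	Payload  string
}

// inlet forwards raw packets to the topic without parsing them.
func inlet(packets [][]byte, topic chan<- []byte) {
	for _, p := range packets {
		topic <- p
	}
	close(topic)
}

// outlet reads raw packets from the topic, parses them and collects
// the result in a batch (standing in for the ClickHouse insert).
func outlet(topic <-chan []byte) []Flow {
	var batch []Flow
	for raw := range topic {
		exporter, payload, ok := strings.Cut(string(raw), "|")
		if !ok {
			continue // parsing errors are handled here, not in the inlet
		}
		batch = append(batch, Flow{Exporter: exporter, Payload: payload})
	}
	return batch
}

func main() {
	topic := make(chan []byte, 16)
	go inlet([][]byte{[]byte("router1|abc"), []byte("router2|def")}, topic)
	batch := outlet(topic)
	fmt.Println(len(batch), "flows ready for insertion")
}
```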

The main goal is to ensure the inlet does minimal work so it does not fall
behind when processing packets (and restarts faster). It also brings some
simplification as the number of knobs to tune everything is reduced: for
inlet, we only need to tune the queue size for UDP, the number of
workers and a few Kafka parameters; for outlet, we need to tune a few
Kafka parameters, the number of workers and a few ClickHouse parameters.

The outlet component features a simple Kafka input component. The core
component becomes just a callback function. There is also a new
ClickHouse component to push data to ClickHouse using the low-level
ch-go library with batch inserts.

This processing has an impact on the internal representation of a
FlowMessage. Previously, it was tailored to dynamically build the
protobuf message to be put in Kafka. Now, it builds the batch request to
be sent to ClickHouse. This means the FlowMessage structure hides the
content of the next batch request and, therefore, it should be reused.
This also changes the way we decode flows: decoders no longer output a
FlowMessage, they reuse the one that is provided to each worker.
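
To make the consequence concrete, here is a heavily simplified sketch of a message object that accumulates the columns of the next batch. The two columns and the `AppendFlow`, `Rows` and `Flush` methods are assumptions made for the example; the real structure and the ch-go calls are not shown.

```go
package main

import "fmt"

// FlowMessage is a hypothetical, heavily simplified stand-in. It hides
// the columns of the next batch instead of describing a single flow.
type FlowMessage struct {
	srcAddr []string
	bytes   []uint64
}

// AppendFlow adds one decoded flow as a new row in the pending batch.
func (m *FlowMessage) AppendFlow(srcAddr string, bytes uint64) {
	m.srcAddr = append(m.srcAddr, srcAddr)
	m.bytes = append(m.bytes, bytes)
}

// Rows returns the number of rows waiting to be sent.
func (m *FlowMessage) Rows() int { return len(m.srcAddr) }

// Flush hands the pending columns to send, then empties them while
// keeping the allocated memory so the same object can be reused.
func (m *FlowMessage) Flush(send func(srcAddr []string, bytes []uint64)) {
	send(m.srcAddr, m.bytes)
	m.srcAddr = m.srcAddr[:0]
	m.bytes = m.bytes[:0]
}

func main() {
	var msg FlowMessage // one per worker, reused across batches
	msg.AppendFlow("192.0.2.1", 1500)
	msg.AppendFlow("192.0.2.2", 64)
	msg.Flush(func(srcAddr []string, bytes []uint64) {
		fmt.Println("sending", len(srcAddr), "rows")
	})
	fmt.Println("rows left:", msg.Rows())
}
```

Creating a fresh object for every flow would throw away the rows already accumulated, which is why each worker keeps and reuses a single one.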

The ClickHouse tables are slightly updated. Instead of the Kafka
engine, the Null engine is used.

Fix #1122
2025-07-27 21:44:28 +02:00