akvorado

mirror of https://github.com/akvorado/akvorado.git synced 2025-12-11 22:14:02 +01:00

Author	SHA1	Message	Date
Vincent Bernat	1322d42549	outlet/kafka: fix scaler hysteresis Previously, the scaler was scaling up and down independently. Because Kafka rebalances the topic when scaling up or down, we temporarily get scale-down requests, and the rate limiter won't stop them as it is independent from the scale-up rate limiter. Instead, the rate limit for increase acts as a grace period during which everything is ignored; then, between that and the rate limit for decrease, we only consider increasing the number of workers; past that, we scale down as long as we have a majority of scale-down requests (compared to steady ones). Fix #2080 (hopefully)	2025-11-11 21:26:05 +01:00
Vincent Bernat	7cfcb5fab3	outlet/kafka: change scaling strategy to scale up fast But scale down slowly. Moreover, after the initial period, don't try to scale up fast anymore.	2025-11-11 17:00:15 +01:00
Vincent Bernat	737b56ed77	outlet/kafka: decouple scaler logic This way, it is easier to test, notably with synctest package. The functional test is kept to a minimum.	2025-11-10 21:28:38 +01:00
Vincent Bernat	da6e73cba4	outlet/kafka: make lag test more reliable We need to ensure we are ready to receive from Kafka before producing the first message.	2025-11-05 23:34:16 +01:00
Vincent Bernat	ee6e197e8e	chore: switch to math/rand/v2	2025-10-26 12:14:20 +01:00
Vincent Bernat	eca3ff01b8	outlet/kafka: make scaling test more reliable Send only one message to request a decrease. That's enough and there is less risk that the messages are split in two groups (there may be some rebalancing).	2025-10-22 20:43:44 +02:00
Vincent Bernat	54822458a8	outlet/kafka: set linger for Kafka producer to 0ms in tests	2025-10-22 20:43:44 +02:00
Vincent Bernat	3a34495b70	outlet/kafka: be more aggressive when scaling up/down workers Use dichotomy to quickly reach the optimal. This avoids too many Kafka rebalances on big setups.	2025-10-19 21:27:42 +02:00
Vincent Bernat	3374da6693	outlet/kafka: fix off-by-one scaling logic And add a test for it.	2025-10-19 19:26:54 +02:00
Vincent Bernat	59ee4a4749	outlet/kafka: add a metric for min and max workers	2025-10-15 15:59:07 +02:00
Vincent Bernat	8ea795c13d	outlet/kafka: cap the number of workers to the number of partitions	2025-10-15 13:58:14 +02:00
Vincent Bernat	801f3f1676	common/kafka: also log output of kfake cluster	2025-09-23 07:06:58 +02:00
Vincent Bernat	a66ce7cc3e	outlet/kafka: make lag test more robust The consumer may not have started when testing initial lag. Just try a bit more.	2025-09-14 12:04:04 +02:00
François HORTA	e3a778552d	outlet/kafka: expose consumer lag as a Prometheus metric Monitoring consumer lag is useful to troubleshoot performance/scaling issues. It can currently be seen through kafka-ui, but a proper metric is more practical. Unfortunately, JMX metrics on the broker don't expose this. It seems that people usually resort to monitoring from the consumer side, or through other external exporters like Burrow or kafka_exporter. franz-go/kadm provides a function to compute the consumer lag, so let's do it from the consumer side (the outlet).	2025-08-30 19:17:32 +02:00
Vincent Bernat	866658bc70	outlet/kafka: fix crash when scaling down and up the workers The same metrics cannot be registered twice. Introduce a new method in reporter to unregister a previously registered collector. Fix #1908	2025-08-27 08:28:14 +02:00
Vincent Bernat	c2ef22b3fc	outlet/kafka: add more debugging to the worker scaling test	2025-08-24 08:38:32 +02:00
Vincent Bernat	8d9d323710	Revert "outlet/kafka: pace a bit the worker test" This reverts commit `0130119bdc`. The timing is a bit sensitive.	2025-08-23 17:32:07 +02:00
Vincent Bernat	0130119bdc	outlet/kafka: pace a bit the worker test	2025-08-23 16:21:42 +02:00
Vincent Bernat	e5a625aecf	outlet: make the number of Kafka workers dynamic Inserting into ClickHouse should be done in large batches to minimize the number of parts created. This would require the user to tune the number of Kafka workers to match a target of around 50k-100k rows. Instead, we dynamically tune the number of workers depending on the load to reach this target. We keep using async if we are too low in number of flows. It is still possible to do better by consolidating batches from various workers, but that's something I wanted to avoid. Also, increase the maximum wait time to 5 seconds. It should be good enough for most people. Fix #1885	2025-08-09 15:58:25 +02:00
Vincent Bernat	756e4a8fbd	*/kafka: switch to franz-go The concurrency of this library is easier to handle than Sarama. Notably, it is more compatible with the new model of "almost share nothing" we use for the inlet and the outlet. The lock for workers in outlet is removed. We can now use sync.Pool to allocate slice of bytes in inlet. It may also be more performant. In the future, we may want to commit only when pushing data to ClickHouse. However, this does not seem easy when there is a rebalance. In case of rebalance, we need to do something when a partition is revoked to avoid duplicating data. For example, we could flush the current batch to ClickHouse. Have a look at the `example/mark_offsets/main.go` file in franz-go repository for a possible approach. In the meantime, we rely on autocommit. Another contender could be https://github.com/segmentio/kafka-go. Also see https://github.com/twmb/franz-go/pull/1064.	2025-07-27 21:44:28 +02:00
Vincent Bernat	ac68c5970e	inlet: split inlet into new inlet and outlet This change splits the inlet component into a simpler inlet and a new outlet component. The new inlet component receives flows and puts them in Kafka, unparsed. The outlet component takes them from Kafka, resumes the processing from there (flow parsing, enrichment), and puts them in ClickHouse. The main goal is to ensure the inlet does minimal work so that it is not late when processing packets (and restarts faster). It also brings some simplification as the number of knobs to tune everything is reduced: for inlet, we only need to tune the queue size for UDP, the number of workers and a few Kafka parameters; for outlet, we need to tune a few Kafka parameters, the number of workers and a few ClickHouse parameters. The outlet component features a simple Kafka input component. The core component becomes just a callback function. There is also a new ClickHouse component to push data to ClickHouse using the low-level ch-go library with batch inserts. This processing has an impact on the internal representation of a FlowMessage. Previously, it was tailored to dynamically build the protobuf message to be put in Kafka. Now, it builds the batch request to be sent to ClickHouse. This makes the FlowMessage structure hide the content of the next batch request and therefore, it should be reused. This also changes the way we decode flows as they don't output FlowMessage anymore, they reuse one that is provided to each worker. The ClickHouse tables are slightly updated. Instead of the Kafka engine, the Null engine is used. Fix #1122	2025-07-27 21:44:28 +02:00

21 Commits
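
The scaler hysteresis described in commit 1322d42549 can be illustrated with a minimal sketch. This is not the project's code: it is written in Python for brevity, and every name (`Decision`, `Requests`, `decide`, `increase_limit`, `decrease_limit`) is hypothetical. It also assumes that a single scale-up request is enough to scale up once the grace period is over, which the commit message does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NONE = "none"
    UP = "up"
    DOWN = "down"


@dataclass
class Requests:
    """Requests received from workers since the last scaling change (hypothetical)."""
    up: int = 0
    down: int = 0
    steady: int = 0


def decide(elapsed: float, increase_limit: float, decrease_limit: float,
           requests: Requests) -> Decision:
    """Pick a scaling action `elapsed` seconds after the last scaling change.

    Assumes increase_limit <= decrease_limit, both in seconds.
    """
    # Grace period: right after a change Kafka rebalances the topic,
    # so every request is ignored.
    if elapsed < increase_limit:
        return Decision.NONE
    # Past the grace period, scale-up requests are honored.
    # Assumption: one request is enough.
    if requests.up > 0:
        return Decision.UP
    # Between the two limits, only increases are considered.
    if elapsed < decrease_limit:
        return Decision.NONE
    # Past the decrease limit, scale down only when scale-down requests
    # outnumber steady ones.
    if requests.down > requests.steady:
        return Decision.DOWN
    return Decision.NONE
```

With a 10-second increase limit and a 60-second decrease limit, `decide(20, 10, 60, Requests(down=5))` does nothing because only increases are considered at that point, while the same requests at 90 seconds would trigger a scale-down.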