Fix #605
All MergeTree tables are now replicated.
For some tables, a `_local` variant is added and the non-`_local`
variant is now distributed. The distributed tables are the `flows`
table, the `flows_DDDD` tables (where `DDDD` is a duration), as well as
the `flows_raw_errors` table. The `exporters` table is not distributed
and stays local.
The data follows this path:
- data comes from the `flows_HHHH_raw` table, which uses the Kafka engine
- the `flows_HHHH_raw_consumer` reads data from `flows_HHHH_raw` (local)
and sends it to `flows` (distributed) when there is no error
- the `flows_raw_errors_consumer` reads data from
`flows_HHHH_raw` (local) and sends it to
`flows_raw_errors` (distributed)
- the `flows_DDDD_consumer` reads data from `flows_local` (local) and
  sends it to `flows_DDDD_local` (local)
- the `exporters_consumer` reads data from `flows` (distributed) and
sends it to `exporters` (local)
The reason `flows_HHHH_raw_consumer` sends data to the distributed
`flows` table, and not the local one, is to ensure flows are
balanced (for example, when there are not enough Kafka partitions).
Sending them to `flows_local` would nevertheless have been possible.
On the other hand, it is important for `flows_DDDD_consumer` to read
from the local table to avoid duplication. It could have written to the
distributed table, but since the data is already balanced correctly, we
write to the local one instead for better performance.
The `exporters_consumer` is allowed to read from the distributed `flows`
table because it writes the result to the local `exporters` table.
Instantiate the ClickHouse component earlier to reduce the verbosity of
a test when skipped. Maybe this behavior changed with tests in Go 1.22,
as I don't remember it.
Recent versions of ClickHouse do not execute the provided entrypoint
script when the database already exists. Work around this by using our
own entrypoint script in place of the official one.
See https://github.com/ClickHouse/ClickHouse/pull/50724.
This is a first step towards making it accept configuration. Most of the
changes are quite trivial, but I also ran into some difficulties with
query columns and filters. They need the schema for parsing, but parsing
happens before dependencies are instantiated (and even if it was not the
case, parsing is stateless). Therefore, I have added a `Validate()`
method that must be called after instantiation. Various bits `panic()`
if not validated to ensure we catch all cases.
The alternative of making the component manage a global state would have
been simpler, but it would break once we add the ability to add or
disable columns.
This is a huge change to make the various subcomponents of the inlet use
the schema to generate the protobuf. For it to make sense, we also
modify the way we parse flows to directly serialize non-essential fields
to Protobuf.
The performance is mostly on par with the previous commit. We are a bit
less efficient because we don't have a fixed structure, but we avoid
losing too much performance by not relying on reflection and keeping
the production of messages as code. We use less of Goflow2: raw flow
parsing is still done by Goflow2, but we don't use the producer part
anymore. This helps a bit with the performance as we parse less.
Overall, we are 20% faster than the previous commit and twice as fast as
1.6.4!
```
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
BenchmarkDecodeEncodeNetflow
BenchmarkDecodeEncodeNetflow/with_encoding
BenchmarkDecodeEncodeNetflow/with_encoding-12 151484 7789 ns/op 8272 B/op 143 allocs/op
BenchmarkDecodeEncodeNetflow/without_encoding
BenchmarkDecodeEncodeNetflow/without_encoding-12 162550 7133 ns/op 8272 B/op 143 allocs/op
BenchmarkDecodeEncodeSflow
BenchmarkDecodeEncodeSflow/with_encoding
BenchmarkDecodeEncodeSflow/with_encoding-12 94844 13193 ns/op 9816 B/op 295 allocs/op
BenchmarkDecodeEncodeSflow/without_encoding
BenchmarkDecodeEncodeSflow/without_encoding-12 92569 12456 ns/op 9816 B/op 295 allocs/op
```
There was an attempt to parse sFlow packets with gopacket, but the
ad-hoc parser used here performs better.