The last two fields of `sk_reuseport_md` were added in Linux 5.14. We
don't use them, so it shouldn't matter. I remove them from `vmlinux.h`
to ensure compatibility. Also, adding
`__attribute__((preserve_access_index))` should make the program more
portable (BPF CO-RE).
The counter is per-CPU and it should be more performant than using a
random number. The test may be flaky if the test process migrates from
one CPU to another. Let's see how it goes.
By default, the 5-tuple is used to load balance flows. Exporters with
many flows are bound to a specific worker. Use eBPF to do per-packet
load balancing.
Currently, this is done randomly, but we will use a percpu counter in
the next commit. This will make the test easier too, maybe?
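For context, the kernel mechanism here is `SO_ATTACH_REUSEPORT_EBPF`: a
`BPF_PROG_TYPE_SK_REUSEPORT` program attached to the socket group picks the
receiving socket for each packet (that is what the `sk_reuseport_md` context
above is for). A rough Go sketch of the user-space side, assuming the program
is already loaded and we only have its file descriptor `progFD`; this is
illustrative, not the actual code of this commit:
```go
// Sketch: bind several UDP sockets on the same address with SO_REUSEPORT,
// then attach an already-loaded SK_REUSEPORT eBPF program to the group.
package example

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

func listenWithReuseportBPF(ctx context.Context, addr string, progFD int) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				// Allow several workers to bind the same address/port.
				serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return serr
		},
	}
	pc, err := lc.ListenPacket(ctx, "udp", addr)
	if err != nil {
		return nil, err
	}
	// After bind, the socket is part of a reuseport group: attach the eBPF
	// program that selects the destination socket for each packet.
	raw, err := pc.(*net.UDPConn).SyscallConn()
	if err != nil {
		pc.Close()
		return nil, err
	}
	var serr error
	if err := raw.Control(func(fd uintptr) {
		serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_ATTACH_REUSEPORT_EBPF, progFD)
	}); err != nil {
		pc.Close()
		return nil, err
	}
	if serr != nil {
		pc.Close()
		return nil, serr
	}
	return pc, nil
}
```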
This should also enable graceful restarts, but not with the current
Docker Compose setup: we would need to use host network mode or spawn a
new one in the same network namespace as the old one. This does not look
very complex:
- spawn a new inlet in the same network namespace, but listening to a
different HTTP port
- stop the previous inlet
- spawn a new inlet in the same network namespace
- stop the previous inlet
Alternatively, we could use SO_REUSEPORT for the HTTP socket too!
If the kernel is too old for timestamping, it should not be fatal. I
prefer not to accept SO_TIMESTAMP_OLD, as the size of the timestamp is
arch-dependent.
Fix #1978
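To illustrate the intent (not the actual code), enabling timestamping could be
made non-fatal along these lines; the option used is presumably
SO_TIMESTAMPNS_NEW given the nanosecond part mentioned below, and the logger
and function names are assumptions:
```go
// Sketch: request kernel receive timestamps, but do not fail if the kernel
// is too old to support the y2038-safe option (assumed SO_TIMESTAMPNS_NEW).
package example

import (
	"log/slog"
	"net"

	"golang.org/x/sys/unix"
)

func enableTimestamping(conn *net.UDPConn, logger *slog.Logger) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return
	}
	_ = raw.Control(func(fd uintptr) {
		if err := unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_TIMESTAMPNS_NEW, 1); err != nil {
			// Not fatal: we simply won't get kernel timestamps on this system.
			logger.Warn("kernel does not support SO_TIMESTAMPNS_NEW, disabling timestamping",
				"error", err)
		}
	})
}
```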
We don't need to use NativeEndian, we can just cast. The alignment is
ensured by the CMSG_DATA macro, so it's safe even on archs not allowing
unaligned data access.
This way of doing things was one of the main reasons Go took so much time
to get binary.NativeEndian.
This requires Linux 5.0+. Below that, we would just get no timestamp.
Doing it this way is more correct, even if most people would run this on
64-bit Linux and already get a 64-bit timestamp.
We also don't use the nanosecond part: as it is a `long long` and should
be 64-bit on virtually all archs, this is not totally correct.
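Concretely, reading the timestamp back could look like this; a sketch only,
assuming the control message is SCM_TIMESTAMPNS_NEW carrying a
`__kernel_timespec` (two 64-bit fields on every arch), with `oob` being the
out-of-band buffer filled by `conn.ReadMsgUDP`:
```go
// Sketch: pull the seconds of the kernel receive timestamp out of the
// control messages. The payload is a __kernel_timespec (tv_sec and tv_nsec,
// both 64-bit), so the first eight bytes can be read with a cast; alignment
// is guaranteed by CMSG_DATA.
package example

import (
	"time"
	"unsafe"

	"golang.org/x/sys/unix"
)

func rxTimestamp(oob []byte) (time.Time, bool) {
	msgs, err := unix.ParseSocketControlMessage(oob)
	if err != nil {
		return time.Time{}, false
	}
	for _, msg := range msgs {
		if msg.Header.Level == unix.SOL_SOCKET &&
			msg.Header.Type == unix.SCM_TIMESTAMPNS_NEW &&
			len(msg.Data) >= 16 {
			sec := *(*int64)(unsafe.Pointer(&msg.Data[0])) // tv_sec; tv_nsec is ignored
			return time.Unix(sec, 0), true
		}
	}
	return time.Time{}, false
}
```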
The default value is quite low. This is a bit of a stopgap. The
alternative would be to maintain a circular buffer of the same size
inside the outlet for each connection and ensure there is no lock in the
path. But doing it in the kernel means almost no code, even if it is a
bit complex for the user.
Fix #1461
For example:
```
17:35 ❱ curl -s 127.0.0.1:8080/api/v0/outlet/metrics | promtool check metrics
akvorado_outlet_core_classifier_exporter_cache_size_items counter metrics should have "_total" suffix
akvorado_outlet_core_classifier_interface_cache_size_items counter metrics should have "_total" suffix
akvorado_outlet_flow_decoder_netflow_flowset_records_sum counter metrics should have "_total" suffix
akvorado_outlet_flow_decoder_netflow_flowset_records_sum non-histogram and non-summary metrics should not have "_sum" suffix
akvorado_outlet_flow_decoder_netflow_flowset_sum counter metrics should have "_total" suffix
akvorado_outlet_flow_decoder_netflow_flowset_sum non-histogram and non-summary metrics should not have "_sum" suffix
akvorado_outlet_kafka_buffered_fetch_records_total non-counter metrics should not have "_total" suffix
akvorado_outlet_kafka_buffered_produce_records_total non-counter metrics should not have "_total" suffix
akvorado_outlet_metadata_cache_refreshs counter metrics should have "_total" suffix
akvorado_outlet_routing_provider_bmp_peers_total non-counter metrics should not have "_total" suffix
akvorado_outlet_routing_provider_bmp_routes_total non-counter metrics should not have "_total" suffix
```
Also ensure metrics using errors as labels don't have too great a
cardinality by using constants for the error messages.
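Something like this (metric and constant names are made up for the example):
```go
// Sketch: a counter named with the _total suffix, and error reasons reduced
// to a small set of constants so the "error" label stays low-cardinality.
package example

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

const (
	errReasonParse   = "cannot parse packet"  // hypothetical reason
	errReasonVersion = "unsupported version"  // hypothetical reason
)

var decodeErrors = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "akvorado_outlet_flow_decoder_errors_total", // illustrative name
		Help: "Number of packets that could not be decoded.",
	},
	[]string{"error"},
)

func onParseError() {
	// Use a constant instead of err.Error(), which may embed addresses or
	// other variable data and blow up the label cardinality.
	decodeErrors.WithLabelValues(errReasonParse).Inc()
}
```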
The concurrency of this library is easier to handle than Sarama's.
Notably, it is more compatible with the new model of "almost share
nothing" we use for the inlet and the outlet. The lock for workers in
outlet is removed. We can now use sync.Pool to allocate slice of bytes
in inlet.
It may also be more performant.
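For instance, the inlet receive loop can now look roughly like this (the buffer
size and the hand-off to the Kafka producer are assumptions, not the actual
code):
```go
// Sketch: UDP workers take their receive buffers from a sync.Pool and give
// them back once the Kafka client has finished with the payload.
package example

import (
	"net"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any { return make([]byte, 9000) }, // assumed maximum datagram size
}

// produce hands the raw payload to the Kafka producer and calls done once
// the client no longer references the buffer (hypothetical signature).
func worker(conn *net.UDPConn, produce func(payload []byte, done func())) error {
	for {
		buf := bufPool.Get().([]byte)
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			bufPool.Put(buf)
			return err
		}
		produce(buf[:n], func() { bufPool.Put(buf) })
	}
}
```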
In the future, we may want to commit only when pushing data to
ClickHouse. However, this does not seem easy when there is a rebalance.
In case of a rebalance, we need to do something when a partition is
revoked to avoid duplicating data. For example, we could flush the
current batch to ClickHouse. Have a look at the
`example/mark_offsets/main.go` file in the franz-go repository for a
possible approach. In the meantime, we rely on autocommit.
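If we go down that road later, a possible shape with franz-go could be the
following; just a sketch, where `flushBatch`, `addToBatch` and `batchIsFull`
are hypothetical stand-ins for the outlet's batching logic:
```go
// Sketch: disable autocommit and only commit offsets once the pending
// ClickHouse batch has been flushed, including when partitions are revoked
// during a rebalance.
package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

// Hypothetical helpers standing in for the outlet's batching logic.
func addToBatch(payload []byte)      {}
func batchIsFull() bool              { return false }
func flushBatch(ctx context.Context) {}

func main() {
	ctx := context.Background()
	client, err := kgo.NewClient(
		kgo.SeedBrokers("kafka:9092"),
		kgo.ConsumerGroup("akvorado-outlet"),
		kgo.ConsumeTopics("flows"),
		kgo.DisableAutoCommit(),
		kgo.OnPartitionsRevoked(func(ctx context.Context, cl *kgo.Client, _ map[string][]int32) {
			flushBatch(ctx)                  // push what we already have to ClickHouse
			cl.CommitUncommittedOffsets(ctx) // then record our position
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	for {
		fetches := client.PollFetches(ctx)
		fetches.EachError(func(topic string, partition int32, err error) {
			log.Printf("fetch error on %s/%d: %v", topic, partition, err)
		})
		fetches.EachRecord(func(r *kgo.Record) { addToBatch(r.Value) })
		if batchIsFull() {
			flushBatch(ctx)
			client.CommitUncommittedOffsets(ctx)
		}
	}
}
```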
Another contender could be https://github.com/segmentio/kafka-go. Also
see https://github.com/twmb/franz-go/pull/1064.
There is still room for improvement. For the inlet, it would require
knowing when Kafka has sent the message (so, enabling the return of
successes). For the outlet, it should be possible to reuse the same flow
(with a ResetVT
between each use).
This change splits the inlet component into a simpler inlet and a new
outlet component. The new inlet component receives flows and puts them in
Kafka, unparsed. The outlet component takes them from Kafka, resumes the
processing from there (flow parsing, enrichment), and puts them in
ClickHouse.
The main goal is to ensure the inlet does minimal work, so it does not
fall behind when processing packets (and restarts faster). It also brings
some simplification, as the number of knobs to tune is reduced: for the
inlet, we only need to tune the UDP queue size, the number of workers,
and a few Kafka parameters; for the outlet, a few Kafka parameters, the
number of workers, and a few ClickHouse parameters.
The outlet component features a simple Kafka input component. The core
component becomes just a callback function. There is also a new
ClickHouse component to push data to ClickHouse using the low-level
ch-go library with batch inserts.
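For the curious, a batch insert with ch-go looks roughly like this; a minimal
sketch where the table name, columns, and address are placeholders, not the
real schema:
```go
// Sketch: columnar batch insert with the low-level ch-go client.
package main

import (
	"context"
	"log"
	"time"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	conn, err := ch.Dial(ctx, ch.Options{Address: "clickhouse:9000"})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Columns are appended to in memory, then sent as one block.
	var (
		timeReceived proto.ColDateTime
		bytes        proto.ColUInt64
	)
	timeReceived.Append(time.Now())
	bytes.Append(1500)

	if err := conn.Do(ctx, ch.Query{
		Body: "INSERT INTO flows VALUES", // placeholder table
		Input: proto.Input{
			{Name: "TimeReceived", Data: &timeReceived},
			{Name: "Bytes", Data: &bytes},
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```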
This processing has an impact on the internal representation of a
FlowMessage. Previously, it was tailored to dynamically build the
protobuf message to be put in Kafka. Now, it builds the batch request to
be sent to ClickHouse. This means the FlowMessage structure now hides the
content of the next batch request and, therefore, should be reused.
This also changes the way we decode flows: decoders don't output a
FlowMessage anymore, they reuse one that is provided to each worker.
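Schematically, the reuse looks like this (all names are hypothetical, only the
pattern matters):
```go
// Sketch: one FlowMessage per worker, cleared and refilled for each incoming
// flow, then appended to the pending ClickHouse batch (hypothetical types).
package example

type FlowMessage struct {
	TimeReceived uint32
	Bytes        uint64
	// …the other columns of the next batch request
}

func (fm *FlowMessage) reset() { *fm = FlowMessage{} }

type worker struct {
	fm FlowMessage // reused for every flow handled by this worker
}

func (w *worker) handle(
	payload []byte,
	decode func([]byte, *FlowMessage) error,
	appendToBatch func(*FlowMessage),
) error {
	w.fm.reset()
	if err := decode(payload, &w.fm); err != nil {
		return err
	}
	appendToBatch(&w.fm) // copies the columns into the pending batch
	return nil
}
```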
The ClickHouse tables are slightly updated. Instead of the Kafka engine,
the Null engine is now used.
Fix #1122
Done with:
```
git grep -l 'for.*:= 0.*++' \
| xargs sed -i -E 's/for (.*) := 0; \1 < (.*); \1\+\+/for \1 := range \2/'
```
And a few manual fixes due to unused variables. There is something fishy
in the BMP rib test. Add a comment about that. The conversion is not
equivalent there (with range, the random bound is evaluated once, while
in the original loop, it is evaluated at each iteration). I believe the
intent was to behave like
with range.
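The difference in a nutshell (illustrative, not the test's actual code):
```go
// Sketch: with Go 1.22 range-over-int, the bound is computed once; with the
// classic three-clause loop, the condition is re-evaluated at each iteration.
package main

import "math/rand/v2"

func main() {
	for i := range rand.IntN(10) { // rand.IntN is called a single time
		_ = i
	}
	for i := 0; i < rand.IntN(10); i++ { // rand.IntN is called on every pass
		_ = i
	}
}
```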
This is useless, as it also needs to be enabled with the SIOCSHWTSTAMP
ioctl. That requires CAP_NET_ADMIN, and we would need to guess the
physical interface. Too much trouble.