I did not benchmark it myself, but it was benchmarked here:
https://github.com/osrg/gobgp/issues/1414#issuecomment-3067255941
Of course, no guarantee that this benchmark matches our use cases.
Moreover, SubnetMap has been optimized to avoid parsing keys all the
time. Also, the interface is a bit nicer and uses netip.Prefix directly.
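As a rough illustration (the real SubnetMap is more elaborate; the
method names here are hypothetical and a linear scan stands in for the
real data structure just to keep the example runnable):
```
package main

import (
	"fmt"
	"net/netip"
)

// Hypothetical sketch of the new interface: callers pass parsed
// netip.Prefix values, so nothing re-parses string keys on the hot path.
type SubnetMap[V any] struct {
	prefixes []netip.Prefix
	values   []V
}

func (m *SubnetMap[V]) Update(p netip.Prefix, v V) {
	m.prefixes = append(m.prefixes, p)
	m.values = append(m.values, v)
}

// Lookup returns the value attached to the most specific prefix
// containing addr.
func (m *SubnetMap[V]) Lookup(addr netip.Addr) (best V, ok bool) {
	bits := -1
	for i, p := range m.prefixes {
		if p.Contains(addr) && p.Bits() > bits {
			best, bits, ok = m.values[i], p.Bits(), true
		}
	}
	return
}

func main() {
	var m SubnetMap[string]
	m.Update(netip.MustParsePrefix("2001:db8::/32"), "doc")
	m.Update(netip.MustParsePrefix("2001:db8:1::/48"), "doc-1")
	fmt.Println(m.Lookup(netip.MustParseAddr("2001:db8:1::2"))) // doc-1 true
}
```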
The next step is to convert outlet/routing/provider/bmp.
As we are not constrained that much by time in the outlet, we can
simplify the fetching of metadata by doing it synchronously. We still
keep the breaker design to avoid continuously polling a source that is
not responsive, so we can still lose some data if we are not able to
poll metadata. We also keep the background cache refresh, and we
introduce a grace time of 1 minute to avoid losing data during startup.
For the static provider, we wait for the remote data sources to be
ready. For the gNMI provider, there are target windows of availability
during which the cached data can be polled. The SNMP provider loses its
ability to coalesce requests.
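A minimal sketch of the synchronous lookup with the breaker and the
grace time, assuming hypothetical names and eliding locking:
```
package metadata

import (
	"context"
	"errors"
	"time"
)

var errBreakerOpen = errors.New("metadata source not responsive")

type fetcher struct {
	started   time.Time // when the component started (for the grace time)
	openUntil time.Time // breaker state: closed if in the past
	fetch     func(context.Context, string) (any, error)
}

func (f *fetcher) Lookup(ctx context.Context, key string) (any, error) {
	if time.Now().Before(f.openUntil) {
		// Breaker open: do not continuously poll an unresponsive
		// source. The flow is processed without metadata (data loss).
		return nil, errBreakerOpen
	}
	meta, err := f.fetch(ctx, key)
	if err != nil && time.Since(f.started) > time.Minute {
		// Past the grace time, open the breaker for a while.
		f.openUntil = time.Now().Add(30 * time.Second)
	}
	return meta, err
}
```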
For example:
```
17:35 ❱ curl -s 127.0.0.1:8080/api/v0/outlet/metrics | promtool check metrics
akvorado_outlet_core_classifier_exporter_cache_size_items counter metrics should have "_total" suffix
akvorado_outlet_core_classifier_interface_cache_size_items counter metrics should have "_total" suffix
akvorado_outlet_flow_decoder_netflow_flowset_records_sum counter metrics should have "_total" suffix
akvorado_outlet_flow_decoder_netflow_flowset_records_sum non-histogram and non-summary metrics should not have "_sum" suffix
akvorado_outlet_flow_decoder_netflow_flowset_sum counter metrics should have "_total" suffix
akvorado_outlet_flow_decoder_netflow_flowset_sum non-histogram and non-summary metrics should not have "_sum" suffix
akvorado_outlet_kafka_buffered_fetch_records_total non-counter metrics should not have "_total" suffix
akvorado_outlet_kafka_buffered_produce_records_total non-counter metrics should not have "_total" suffix
akvorado_outlet_metadata_cache_refreshs counter metrics should have "_total" suffix
akvorado_outlet_routing_provider_bmp_peers_total non-counter metrics should not have "_total" suffix
akvorado_outlet_routing_provider_bmp_routes_total non-counter metrics should not have "_total" suffix
```
Also ensure metrics using errors as a label don't have too high a
cardinality, by using constants for the error messages.
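For example, something along these lines (names are illustrative):
```
package metrics

import (
	"context"
	"errors"
)

// Illustrative only: collapse errors into a fixed set of constant label
// values instead of using err.Error(), whose cardinality is unbounded.
const (
	errLabelTimeout  = "timeout"
	errLabelCanceled = "canceled"
	errLabelOther    = "other"
)

func errorLabel(err error) string {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return errLabelTimeout
	case errors.Is(err, context.Canceled):
		return errLabelCanceled
	default:
		return errLabelOther
	}
}
```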
When using `mapstructure:",squash"`, most structure-specific hooks don't
dive into the structure as they are provided with the parent structure.
Add a helper to make them work on the embedded structure as well and use
it for the generic "deprecated fields" hook, but also for the hook for
the common Kafka configuration.
This is a bit brittle. There are other use cases, but they may not need
this change.
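A sketch of what such a helper could look like (the real helper may
differ; only mapstructure's DecodeHookFuncType signature is taken from
the library):
```
package config

import (
	"reflect"

	"github.com/mitchellh/mapstructure"
)

// Sketch of the helper: wrap a hook written for a given struct type so
// that it also fires when that type is embedded (squashed) in the
// destination structure.
func alsoOnEmbedded(target reflect.Type, hook mapstructure.DecodeHookFuncType) mapstructure.DecodeHookFuncType {
	return func(from, to reflect.Type, data any) (any, error) {
		if to == target {
			return hook(from, to, data)
		}
		if to.Kind() == reflect.Struct {
			for i := 0; i < to.NumField(); i++ {
				if f := to.Field(i); f.Anonymous && f.Type == target {
					// The parent embeds the target: apply the hook anyway.
					return hook(from, to, data)
				}
			}
		}
		return data, nil
	}
}
```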
The concurrency model of this library is easier to handle than
Sarama's. Notably, it is more compatible with the new "almost share
nothing" model we use for the inlet and the outlet. The lock for workers
in the outlet is removed. We can now use sync.Pool to allocate slices of
bytes in the inlet, as sketched below.
It may also be more performant.
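A sketch of the sync.Pool usage: buffers are handed to franz-go and
recycled from the produce promise (topic name and buffer size are made
up):
```
package inlet

import (
	"context"
	"sync"

	"github.com/twmb/franz-go/pkg/kgo"
)

// Sketch: raw packet buffers come from a sync.Pool and go back to it
// once franz-go is done with the record, i.e. in the produce promise.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 9000) // made-up, jumbo-frame-sized capacity
		return &b
	},
}

func forward(ctx context.Context, cl *kgo.Client, payload []byte) {
	bp := bufPool.Get().(*[]byte)
	*bp = append((*bp)[:0], payload...)
	cl.Produce(ctx, &kgo.Record{Topic: "flows", Value: *bp},
		func(_ *kgo.Record, _ error) {
			bufPool.Put(bp) // the record no longer references the buffer
		})
}
```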
In the future, we may want to commit only when pushing data to
ClickHouse. However, this does not seem easy when there is a rebalance.
In case of rebalance, we need to do something when a partition is
revoked to avoid duplicating data. For example, we could flush the
current batch to ClickHouse. Have a look at the
`example/mark_offsets/main.go` file in the franz-go repository for a
possible approach. In the meantime, we rely on autocommit.
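A possible shape, loosely following that example (group and topic names
are made up):
```
package outlet

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

// Sketch only: commit marked offsets, and flush+commit when partitions
// are revoked so nothing is duplicated after the rebalance.
func newConsumer(flush func(context.Context)) (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers("127.0.0.1:9092"),
		kgo.ConsumerGroup("akvorado-outlet"), // made-up group name
		kgo.ConsumeTopics("flows"),           // made-up topic name
		kgo.AutoCommitMarks(), // only marked records get committed
		kgo.OnPartitionsRevoked(func(ctx context.Context, cl *kgo.Client, _ map[string][]int32) {
			flush(ctx)                  // push the pending batch to ClickHouse
			cl.CommitMarkedOffsets(ctx) // then commit before losing the partitions
		}),
	)
}
```
The consume loop would then call MarkCommitRecords on each batch's
records once that batch has been written to ClickHouse.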
Another contender could be https://github.com/segmentio/kafka-go. Also
see https://github.com/twmb/franz-go/pull/1064.
This change splits the inlet component into a simpler inlet and a new
outlet component. The new inlet component receives flows and puts them
into Kafka, unparsed. The outlet component takes them from Kafka,
resumes the processing from there (flow parsing, enrichment), and puts
them into ClickHouse.
The main goal is to ensure the inlet does minimal work, so that it does
not fall behind when processing packets (and restarts faster). It also
brings some simplification as the number of knobs to tune is reduced:
for the inlet, we only need to tune the UDP queue size, the number of
workers, and a few Kafka parameters; for the outlet, we need to tune a
few Kafka parameters, the number of workers, and a few ClickHouse
parameters.
The outlet component features a simple Kafka input component. The core
component becomes just a callback function. There is also a new
ClickHouse component to push data to ClickHouse using the low-level
ch-go library with batch inserts.
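For reference, a minimal batch insert with ch-go looks roughly like
this (table and column names are made up):
```
package main

import (
	"context"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	conn, err := ch.Dial(ctx, ch.Options{Address: "127.0.0.1:9000"})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Columnar buffers: one per column, appended row by row.
	var (
		bytesCol    proto.ColUInt64
		exporterCol proto.ColStr
	)
	bytesCol.Append(1500)
	exporterCol.Append("192.0.2.1")

	input := proto.Input{
		{Name: "Bytes", Data: &bytesCol},
		{Name: "ExporterAddress", Data: &exporterCol},
	}
	// Made-up table name; Into generates "INSERT INTO flows (...) VALUES".
	if err := conn.Do(ctx, ch.Query{
		Body:  input.Into("flows"),
		Input: input,
	}); err != nil {
		panic(err)
	}
}
```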
This processing has an impact on the internal representation of a
FlowMessage. Previously, it was tailored to dynamically build the
protobuf message to be put into Kafka. Now, it builds the batch request
to be sent to ClickHouse. The FlowMessage structure therefore hides the
content of the next batch request and should be reused. This also
changes the way we decode flows: decoders no longer output a
FlowMessage; they reuse one that is provided to each worker.
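Schematically, the new contract looks like this (all names are
hypothetical; the real types differ):
```
package decoder

// Hypothetical sketch: decoders fill a FlowMessage owned by the worker
// instead of returning freshly allocated ones.
type FlowMessage struct {
	// exported flow fields, plus the hidden per-batch column buffers
}

func (fm *FlowMessage) Reset() {
	// clear for reuse, keeping allocated capacity
}

type Decoder interface {
	// Decode parses one raw payload into fm, appending the resulting
	// rows to the pending ClickHouse batch.
	Decode(payload []byte, fm *FlowMessage) error
}

// Worker loop, sketched: one FlowMessage per worker, never shared.
func worker(d Decoder, in <-chan []byte) {
	fm := new(FlowMessage)
	for payload := range in {
		fm.Reset()
		_ = d.Decode(payload, fm)
	}
}
```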
The ClickHouse tables are slightly updated: the Kafka engine is replaced
by the Null engine.
Fix #1122
The support is still pretty basic. Notably, scopes are not configurable
(waiting for someone to request them) and maybe the client ID and secret
should not be provided as username/password.
Fix #1714
Otherwise, if the map contains "cafe", we may think this is a subnet
while it is obviously not. But we want to catch user errors like
"2o01:db8::/64" to provide a better error message.
Done with:
```
git grep -l 'for.*:= 0.*++' \
| xargs sed -i -E 's/for (.*) := 0; \1 < (.*); \1\+\+/for \1 := range \2/'
```
And a few manual fixes due to unused variables. There is something fishy
in the BMP RIB test; add a comment about that. The rewrite is not
strictly equivalent there: with range, the bound (random) is evaluated
once, while in the original loop it was evaluated at each iteration. I
believe the intent was to behave like with range.
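For illustration, the two forms differ when the bound has side effects:
```
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// Original form: the condition, including rand.Intn, is re-evaluated
	// at every iteration, so the effective bound changes as the loop runs.
	a := 0
	for i := 0; i < rand.Intn(10); i++ {
		a++
	}

	// Rewritten form (Go 1.22 "range over int"): the bound is evaluated
	// exactly once, before the loop starts.
	b := 0
	for range rand.Intn(10) {
		b++
	}
	fmt.Println(a, b)
}
```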