This will serve as a base for moving to a one-step conversion to
Protobuf. The main goal is not to be faster, but we don't want to be
slower; being faster would be a nice bonus.
```
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
BenchmarkDecodeEncodeNetflow
BenchmarkDecodeEncodeNetflow-12 39586 29199 ns/op
BenchmarkDecodeEncodeSflow
BenchmarkDecodeEncodeSflow-12 24349 48914 ns/op
ok akvorado/inlet/flow 3.167s
DONE 0 tests in 3.636s
```
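For reference, such a round-trip benchmark could be structured as
below. This is a minimal sketch: `decoder`, `flowMessage` and
`samplePayload` are stand-ins, not the actual akvorado API.
```go
package flow

import "testing"

// decoder, flowMessage and samplePayload are placeholders: the real
// benchmark decodes a captured NetFlow/sFlow packet and encodes the
// resulting flows.
type decoder struct{}
type flowMessage struct{}

func (d *decoder) Decode(payload []byte) []flowMessage { return []flowMessage{{}} }
func (f flowMessage) Encode() ([]byte, error)          { return nil, nil }

var samplePayload []byte

// BenchmarkDecodeEncode measures a full decode+encode round trip,
// which is the metric reported above.
func BenchmarkDecodeEncode(b *testing.B) {
	d := &decoder{}
	for i := 0; i < b.N; i++ {
		for _, f := range d.Decode(samplePayload) {
			if _, err := f.Encode(); err != nil {
				b.Fatal(err)
			}
		}
	}
}
```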
This is a bit less type-safe. We could keep type safety by redefining
all the consts from `query_consts.go` in `common/schema`, but this is
pointless as the goal is to support arbitrary dimensions at some point.
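For illustration, the tradeoff looks roughly like this (the names are
hypothetical, not the actual schema API):
```go
package schema

import "fmt"

// Before: a closed, compile-time-checked set of dimensions.
type ColumnKey int

const (
	ColumnSrcAddr ColumnKey = iota
	ColumnDstAddr
)

// After: columns are looked up by name in the shared schema, which
// allows arbitrary dimensions but moves the check to runtime.
var columns = map[string]ColumnKey{
	"SrcAddr": ColumnSrcAddr,
	"DstAddr": ColumnDstAddr,
}

// LookupColumn returns the column matching a name, failing at runtime
// rather than at compile time.
func LookupColumn(name string) (ColumnKey, error) {
	key, ok := columns[name]
	if !ok {
		return 0, fmt.Errorf("unknown column %q", name)
	}
	return key, nil
}
```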
We introduce a leaky abstraction for the flow schema and use it for
migrations as a first step.
For views and dictionaries, we stop relying on a hash to know if they
need to be recreated and instead compare the current SELECT statement
with our target statement. This is a bit fragile, but strictly better
than the hash.
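The comparison is roughly the following; the normalization and the way
the current statement is retrieved are simplified for illustration:
```go
package migrations

import "strings"

// normalize removes differences that should not trigger a recreation
// (case and extra whitespace), making the comparison less fragile.
func normalize(statement string) string {
	return strings.Join(strings.Fields(strings.ToLower(statement)), " ")
}

// needRecreate tells whether a view or dictionary must be dropped and
// recreated: instead of a stored hash, we compare the SELECT statement
// currently in ClickHouse with the one we want.
func needRecreate(current, target string) bool {
	return normalize(current) != normalize(target)
}
```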
For data tables, we add the missing columns.
We give up on the abstraction of a migration step and just rely on
helper functions to get the same result. The migration code is now
shorter and we don't need to update it when adding new columns.
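As an example of such a helper, here is a sketch using `database/sql`;
the real helpers and their ClickHouse client may differ:
```go
package migrations

import (
	"context"
	"database/sql"
	"fmt"
)

// addColumnIfMissing adds a column to an existing data table when it
// is not there yet. With a generic helper like this, adding a new
// column does not require a dedicated migration step.
func addColumnIfMissing(ctx context.Context, db *sql.DB, table, column, columnType string) error {
	_, err := db.ExecContext(ctx, fmt.Sprintf(
		"ALTER TABLE %s ADD COLUMN IF NOT EXISTS %s %s",
		table, column, columnType))
	return err
}
```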
This is a preparatory work for #211 to allow a user to specify
additional fields to collect.
In our case, this shouldn't matter. However, the performance hit should
be low, and maybe at some point this middleware could be used for more
sensitive stuff.
validate is only able to validate non-struct types (or recurse inside
structs). So, if we want to use "required" on some of them, we need a
custom type.
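A sketch of the idea with go-playground/validator: register a custom
type function so a struct-typed field is seen as a basic value and
"required" applies to it. The `Configuration` type here is
hypothetical.
```go
package main

import (
	"fmt"
	"net/netip"
	"reflect"

	"github.com/go-playground/validator/v10"
)

// Configuration has a struct-typed field. By default the validator
// would recurse into it instead of applying "required" to the field
// itself.
type Configuration struct {
	Listen netip.Addr `validate:"required"`
}

func main() {
	validate := validator.New()
	// Expose netip.Addr as a basic value: nil for the zero address,
	// its string representation otherwise.
	validate.RegisterCustomTypeFunc(func(field reflect.Value) interface{} {
		if addr, ok := field.Interface().(netip.Addr); ok && addr.IsValid() {
			return addr.String()
		}
		return nil
	}, netip.Addr{})

	// The zero address should now trigger a "required" error.
	fmt.Println(validate.Struct(Configuration{}))
}
```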
Fix #263
RIB updates are handled by a single goroutine accepting update requests
through a channel receiving functions to execute on the state (RIB +
peer state).
For lookups, we have 3 options (ordered from better to worse
performance, and from higher to lower memory usage):
1. have a read-only copy updated "atomically" at regular interval,
2. have a read-only copy updated behind a lock at regular interval,
3. handle lookups by the worker through a high priority channel.
This commit implements option 3. It may be a regression in latency
compared to the previous design because long updates (flushing peers)
may prevent answering lookup requests. This will be addressed in the
next commit.
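A minimal sketch of this design, with simplified types (the real worker
manages more than a counter):
```go
package main

import "fmt"

// rib is the state owned exclusively by the worker goroutine.
type rib struct {
	routes int
}

// A single worker owns the RIB and executes functions received on two
// channels. Lookups go through a dedicated channel that is tried
// first, so they take priority over updates when both are pending.
func worker(updates, lookups <-chan func(*rib)) {
	state := &rib{}
	for {
		// Serve pending lookups first.
		select {
		case fn := <-lookups:
			fn(state)
			continue
		default:
		}
		// Otherwise take whatever comes next.
		select {
		case fn := <-lookups:
			fn(state)
		case fn := <-updates:
			fn(state)
		}
	}
}

func main() {
	updates := make(chan func(*rib))
	lookups := make(chan func(*rib))
	go worker(updates, lookups)

	updates <- func(r *rib) { r.routes++ }

	done := make(chan int)
	lookups <- func(r *rib) { done <- r.routes }
	fmt.Println(<-done)
}
```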
This could help performance as we will skip removing a prefix if we
don't have the associated NLRI. However, this is an unlikely corner
case (all routes we have should have been added first).
When the RIB is locked for too long, the inlet hangs. Try to give the
inlet a bit of time to move forward between two flushes of the RIB.
There are various knobs not documented yet, until we get better defaults:
- `inlet.bmp.peer-removal-max-time`: how long to keep the lock
- `inlet.bmp.peer-removal-sleep-interval`: how long to sleep between two
runs if we were unable to flush the whole peer
- `inlet.bmp.peer-removal-max-queue`: maximum number of flush requests
- `inlet.bmp.peer-removal-min-routes`: minimum number of routes to flush
before yielding
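A sketch of how these knobs could interact during a flush; the exact
semantics are still being tuned, so this is illustrative only:
```go
package bmp

import "time"

// flushPeer removes a peer's routes in bounded batches. A batch stops
// once at least minRoutes have been removed and maxTime has elapsed;
// the worker then sleeps for sleepInterval so the inlet can keep doing
// lookups. removeSome stands in for the code removing up to max routes
// and reporting whether the peer is fully flushed.
func flushPeer(removeSome func(max int) (removed int, done bool),
	maxTime, sleepInterval time.Duration, minRoutes int) {
	for {
		deadline := time.Now().Add(maxTime)
		removed := 0
		done := false
		for !done && (removed < minRoutes || time.Now().Before(deadline)) {
			var n int
			n, done = removeSome(minRoutes)
			removed += n
		}
		if done {
			return
		}
		time.Sleep(sleepInterval) // yield so lookups can progress
	}
}
```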
May fix #253
At first, there was an attempt to use the BMP collector implementation
from bio-rd. However, the current implementation uses GoBGP
instead:
- BMP is very simple from a protocol point of view. The hard work is
mostly around decoding. Both bio-rd and GoBGP can decode, but for
testing, GoBGP is able to generate messages as well (this is its
primary purpose; I suppose parsing was added for testing purposes).
Using only one library is always better. An alternative would be
GoBMP, but it also only does parsing.
- Logging and metrics can be customized easily (but the work was done
for bio-rd, so not a real argument).
- bio-rd is an application and there is no API stability (and I did
that too)
- GoBGP supports FlowSpec, which may be useful in the future for the
DDoS part. Again, one library for everything is better (but
honestly, GoBGP as a lib is not the best part of it, maybe
github.com/jwhited/corebgp would be a better fit while keeping GoBGP
for decoding/encoding).
There was a huge effort to get a RIB that is memory-efficient (data is
interned to save memory) and performant during reads, while staying
decent during insertions. We rely on a patched version of Kentik's
Patricia tree to be able to apply mutations to the tree.
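The interning idea, in a simplified form (the actual RIB structures
differ; a free list and fixed-size indexes are omitted here):
```go
package rib

// internTable deduplicates route attributes: identical values are
// stored once and referenced by a small integer, with reference
// counting so unused entries can be released.
type internTable[T comparable] struct {
	values []T
	refs   []int
	index  map[T]int
}

func newInternTable[T comparable]() *internTable[T] {
	return &internTable[T]{index: map[T]int{}}
}

// Put returns the index for v, adding it if needed.
func (t *internTable[T]) Put(v T) int {
	if i, ok := t.index[v]; ok {
		t.refs[i]++
		return i
	}
	i := len(t.values)
	t.values = append(t.values, v)
	t.refs = append(t.refs, 1)
	t.index[v] = i
	return i
}

// Take releases one reference to the value at index i.
func (t *internTable[T]) Take(i int) {
	t.refs[i]--
	if t.refs[i] == 0 {
		delete(t.index, t.values[i])
		// Slot i could now be reused; a free list is omitted here.
	}
}

// Get returns the interned value at index i.
func (t *internTable[T]) Get(i int) T {
	return t.values[i]
}
```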
There were several attempts to implement some kind of graceful
restart, but ultimately, the design is kept simple: when a BMP
connection goes down, routes will be removed after a configurable
time. If the connection comes back up, it is just considered new.
It would have been ideal to rely on EoR markers, but the RFC is
unclear about them, and they are likely to be per peer, making it
difficult to know what to do if one peer is back but not the other.
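The resulting behavior on disconnect is little more than arming a
timer (a sketch with simplified plumbing):
```go
package bmp

import "time"

// onConnectionDown arms a timer that flushes the routes learned over
// the lost BMP connection after a configurable delay. There is no
// graceful-restart logic: if the exporter reconnects, it is simply a
// new session and repopulates its routes from scratch.
func onConnectionDown(keepDuration time.Duration, flushRoutes func()) *time.Timer {
	return time.AfterFunc(keepDuration, flushRoutes)
}
```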
Remaining tasks:
- [ ] Confirm support for LocRIB
- [ ] Import data in ClickHouse
- [ ] Make data available in the frontend
Fix #52
Raw data files can be converted with Scapy:
```python
from scapy.all import *
wrpcap("data-1140.pcap",
       Ether(src="00:53:00:11:22:33", dst="00:53:00:44:55:66") /
       IP(src="192.0.2.100", dst="192.0.2.101") /
       UDP(sport=47873, dport=6343) /
       open("data-1140.data", "rb").read())
```
While a panic carries more helpful information, it is confusing to the
user. With the amount of code using reflection, it seems better to
have clearer messages to help the user find the faulty section, if
any.
For each case, we test from a native map and from YAML. This should
capture all the cases we are interested in.
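The shape of such a test, sketched with `mapstructure` and `yaml`
(the helpers and configuration types here are not the repository's
actual ones):
```go
package config_test

import (
	"reflect"
	"testing"

	"github.com/mitchellh/mapstructure"
	"gopkg.in/yaml.v3"
)

// Configuration stands in for the structure being decoded; the real
// tests exercise the actual configuration types.
type Configuration struct {
	Listen  string
	Workers int
}

// Each case is decoded twice: once from a native map, as tests build
// it, and once from YAML, as users write it.
func TestDecode(t *testing.T) {
	cases := []struct {
		description string
		fromMap     map[string]interface{}
		fromYAML    string
		expected    Configuration
	}{
		{
			description: "simple",
			fromMap:     map[string]interface{}{"listen": "127.0.0.1:8080", "workers": 2},
			fromYAML:    "listen: 127.0.0.1:8080\nworkers: 2",
			expected:    Configuration{Listen: "127.0.0.1:8080", Workers: 2},
		},
	}
	for _, tc := range cases {
		t.Run(tc.description, func(t *testing.T) {
			var fromYAML map[string]interface{}
			if err := yaml.Unmarshal([]byte(tc.fromYAML), &fromYAML); err != nil {
				t.Fatalf("yaml.Unmarshal() error:\n%+v", err)
			}
			for name, input := range map[string]interface{}{"map": tc.fromMap, "yaml": fromYAML} {
				var got Configuration
				if err := mapstructure.Decode(input, &got); err != nil {
					t.Fatalf("Decode() from %s error:\n%+v", name, err)
				}
				if !reflect.DeepEqual(got, tc.expected) {
					t.Errorf("Decode() from %s == %+v, expected %+v", name, got, tc.expected)
				}
			}
		})
	}
}
```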
Also, simplify pretty diff by using stringer for everything. I don't
remember why this wasn't the case. Maybe IP addresses? It's possible
to opt out by overriding formatters.
But we are not using it, as some linters are either plainly incorrect
(the one suggesting not to use nil for `c.t.Context()`) or just
debatable (checking the err value is a good practice, but there are
good reasons to opt out in some cases).
We did not handle all cases, notably the case where default-community
was not set explicitly by the user. This seems like a lot of code for
little gain; let's keep things simple.