This change splits the inlet component into a simpler inlet and a new
outlet component. The new inlet component receives flows and puts them
into Kafka, unparsed. The outlet component takes them from Kafka,
resumes the processing from there (flow parsing, enrichment), and puts
them into ClickHouse.
The main goal is to ensure the inlet does minimal work so it does not
fall behind when processing packets (and restarts faster). It also
brings some simplification, as the number of knobs to tune everything is
reduced: for the inlet, we only need to tune the UDP queue size, the
number of workers, and a few Kafka parameters; for the outlet, we need
to tune a few Kafka parameters, the number of workers, and a few
ClickHouse parameters.
The outlet component features a simple Kafka input component. The core
component becomes just a callback function. There is also a new
ClickHouse component to push data to ClickHouse using the low-level
ch-go library with batch inserts.
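For illustration, here is a minimal sketch of such a batch insert with
ch-go, assuming a hypothetical `flows_raw` table with only two columns
(the real schema has many more):

```go
package main

import (
	"context"
	"time"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	conn, err := ch.Dial(ctx, ch.Options{Address: "127.0.0.1:9000"})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Column buffers are filled by the workers, then sent as a single block.
	var (
		timeReceived proto.ColDateTime
		bytes        proto.ColUInt64
	)
	timeReceived.Append(time.Now())
	bytes.Append(1500)

	input := proto.Input{
		{Name: "TimeReceived", Data: &timeReceived},
		{Name: "Bytes", Data: &bytes},
	}
	if err := conn.Do(ctx, ch.Query{
		Body:  input.Into("flows_raw"), // INSERT INTO flows_raw (...) VALUES
		Input: input,
	}); err != nil {
		panic(err)
	}
}
```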
This processing has an impact on the internal representation of a
FlowMessage. Previously, it was tailored to dynamically building the
protobuf message to be put in Kafka. Now, it builds the batch request to
be sent to ClickHouse. This makes the FlowMessage structure hide the
content of the next batch request and, therefore, it should be reused.
This also changes the way we decode flows: decoders no longer output a
FlowMessage, they reuse the one provided to each worker.
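A rough sketch of the idea, with made-up names (not the actual akvorado
types): the structure hides the column buffers of the next batch request
and is reset and reused by each worker instead of being allocated per
flow.

```go
package outlet

import (
	"time"

	"github.com/ClickHouse/ch-go/proto"
)

// FlowMessage is a hypothetical, simplified stand-in for the real structure.
type FlowMessage struct {
	timeReceived proto.ColDateTime
	bytes        proto.ColUInt64
	exporter     proto.ColStr
}

// AppendFlow is called by a decoder for each decoded flow; the same
// FlowMessage is reused across flows by a given worker.
func (fm *FlowMessage) AppendFlow(t time.Time, bytes uint64, exporter string) {
	fm.timeReceived.Append(t)
	fm.bytes.Append(bytes)
	fm.exporter.Append(exporter)
}

// Input exposes the accumulated columns as the next batch request.
func (fm *FlowMessage) Input() proto.Input {
	return proto.Input{
		{Name: "TimeReceived", Data: &fm.timeReceived},
		{Name: "Bytes", Data: &fm.bytes},
		{Name: "ExporterName", Data: &fm.exporter},
	}
}

// Reset clears the buffers once the batch has been sent, so the same
// structure can be reused for the next one.
func (fm *FlowMessage) Reset() {
	fm.timeReceived.Reset()
	fm.bytes.Reset()
	fm.exporter.Reset()
}
```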
The ClickHouse tables are slightly updated. Instead of the Kafka engine,
the Null engine is used.
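A typical pattern with the Null engine, sketched with placeholder table
and column names (the actual tables differ): the outlet inserts into a
Null-engine table, which stores nothing by itself, and a materialized
view forwards the rows to the real table.

```go
package migrations

const (
	// Receives the inserts from the outlet; does not store anything.
	createRawTable = `
CREATE TABLE flows_raw (
  TimeReceived DateTime,
  Bytes UInt64
) ENGINE = Null`
	// Forwards every inserted row to the target table.
	createRawConsumer = `
CREATE MATERIALIZED VIEW flows_raw_consumer TO flows
AS SELECT * FROM flows_raw`
)
```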
Fix #1122
* fix: generation of protocols.csv file
* feat: generation of ports-tcp.csv and ports-udp.csv files
* build: add rules for creating udp and tcp csv files
* feat: create TCP and UDP dictionaries
* refactor: add replaceRegexpOne
* test: transform src port and dest port columns in SQL
* test: add TCP and UDP dictionaries for migration testing
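The generated dictionaries and the column transform listed above could
look roughly like the sketch below; the dictionary name, CSV path,
layout, and the `dictGetOrDefault` expression are assumptions for
illustration, not the actual migration code.

```go
package migrations

const (
	// TCP port-name dictionary backed by the generated CSV; UDP is analogous.
	createTCPDictionary = `
CREATE DICTIONARY tcp (
  port UInt64,
  name String
)
PRIMARY KEY port
SOURCE(FILE(path '/var/lib/clickhouse/user_files/ports-tcp.csv' format 'CSVWithNames'))
LIFETIME(0)
LAYOUT(HASHED())`
	// Possible transform for a source-port column in a SELECT.
	srcPortExpression = `dictGetOrDefault('tcp', 'name', toUInt64(SrcPort), toString(SrcPort))`
)
```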
Fix #605
All MergeTree tables are now replicated.
For some tables, a `_local` variant is added and the non-`_local`
variant is now distributed. The distributed tables are the `flows`
table, the `flows_DDDD` tables (where `DDDD` is a duration), as well as
the `flows_raw_errors` table. The `exporters` table is not distributed
and stays local.
The data follows this schema:
- data is coming from `flows_HHHH_raw` table, using the Kafka engine
- the `flows_HHHH_raw_consumer` reads data from `flows_HHHH_raw` (local)
and sends it to `flows` (distributed) when there is no error
- the `flows_raw_errors_consumer` reads data from
`flows_HHHH_raw` (local) and sends it to
`flows_raw_errors` (distributed)
- the `flows_DDDD_consumer` reads data from `flows_local` (local) and
  sends it to `flows_DDDD_local` (local)
- the `exporters_consumer` reads data from `flows` (distributed) and
sends it to `exporters` (local)
The reason for `flows_HHHH_raw_consumer` to send data to the distributed
`flows` table, and not the local one, is to ensure flows are
balanced (for example, when there are not enough Kafka partitions).
Sending them to `flows_local` would have been possible though.
On the other hand, it is important for `flows_DDDD_consumer` to read
from the local table to avoid duplication. It could have sent data to
the distributed table, but the data is already balanced correctly at
this point, so we just send it to the local one for better performance.
The `exporters_consumer` is allowed to read from the distributed `flows`
table because it writes the result to the local `exporters` table.
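A minimal sketch of how a distributed table wraps its replicated
`_local` variant; the cluster name, ZooKeeper path, macros, and columns
below are placeholders, not the actual akvorado schema.

```go
package migrations

const (
	// Replicated local table: each shard stores and replicates its own data.
	createFlowsLocal = `
CREATE TABLE flows_local ON CLUSTER akvorado (
  TimeReceived DateTime,
  Bytes UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/flows_local', '{replica}')
PARTITION BY toDate(TimeReceived)
ORDER BY TimeReceived`
	// Distributed table: same structure, spreads inserts across flows_local.
	createFlows = `
CREATE TABLE flows ON CLUSTER akvorado AS flows_local
ENGINE = Distributed(akvorado, currentDatabase(), flows_local, rand())`
)
```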
This does not seem to survive a restart. There is no indication in the
documentation that this is the right way: one should modify the settings
directly. I need to investigate how to do this properly with Docker.
We introduce a leaky abstraction for the flow schema and use it for
migrations as a first step.
For views and dictionaries, we stop relying on a hash to know if they
need to be recreated; instead, we compare the current SELECT statements
with our target statement. This is a bit fragile, but strictly better
than the hash.
For data tables, we add the missing columns.
We give up on the abstraction of a migration step and just rely on
helper functions to get the same result. The migration code is now
shorter and we don't need to update it when adding new columns.
This is preparatory work for #211, allowing a user to specify
additional fields to collect.
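A rough sketch of the SELECT comparison described above, assuming a
plain database/sql connection to ClickHouse and a naive whitespace
normalization (the helper and its behaviour are made up, not the actual
implementation):

```go
package migrations

import (
	"context"
	"database/sql"
	"strings"
)

// viewNeedsUpdate reports whether a materialized view has to be recreated by
// comparing its current SELECT statement with the target one.
func viewNeedsUpdate(ctx context.Context, db *sql.DB, view, targetSelect string) (bool, error) {
	row := db.QueryRowContext(ctx,
		"SELECT as_select FROM system.tables WHERE database = currentDatabase() AND name = ?",
		view)
	var currentSelect string
	if err := row.Scan(&currentSelect); err != nil {
		if err == sql.ErrNoRows {
			return true, nil // the view does not exist yet
		}
		return false, err
	}
	// Naive normalization: collapse all whitespace before comparing.
	normalize := func(s string) string { return strings.Join(strings.Fields(s), " ") }
	return normalize(currentSelect) != normalize(targetSelect), nil
}
```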
In aggregated tables, these columns were missing from the ORDER BY
clause. This means they were set to some random values. It is not
possible to fix that after their creation (see #60 for an attempt);
therefore, we have to drop and recreate the columns. This only affects
aggregated tables, not the main table, but nonetheless, unless you only
look at the last hour, the data is lost.
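A heavily hedged sketch of the drop-and-recreate step on one aggregated
table; the table name, column name, and existing sorting key are
invented for illustration.

```go
package migrations

const (
	// Drop the misplaced column first (it is not part of the sorting key yet).
	dropColumn = `ALTER TABLE flows_1h0m0s DROP COLUMN DstNetName`
	// Re-add it and extend the sorting key in the same ALTER: MODIFY ORDER BY
	// only accepts columns added by the same query. Existing rows get the
	// default value, hence the data loss mentioned above.
	addColumn = `
ALTER TABLE flows_1h0m0s
  ADD COLUMN DstNetName LowCardinality(String),
  MODIFY ORDER BY (TimeReceived, ExporterAddress, DstNetName)`
)
```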
ClickHouse does not allow more consumers than the number of physical
CPUs. Unless configured otherwise, the number of threads matches the
number of physical CPUs. We bound the number of consumers to this
number.
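A sketch of the bound, assuming the limit is read from the server's
max_threads setting via system.settings (how the real code obtains it
may differ):

```go
package migrations

import (
	"context"
	"database/sql"
	"strconv"
)

// maxKafkaConsumers caps the configured number of Kafka-engine consumers to
// the server's max_threads setting, which defaults to the number of physical
// CPUs.
func maxKafkaConsumers(ctx context.Context, db *sql.DB, configured int) (int, error) {
	var value string
	err := db.QueryRowContext(ctx,
		"SELECT value FROM system.settings WHERE name = 'max_threads'").Scan(&value)
	if err != nil {
		return 0, err
	}
	limit, err := strconv.Atoi(value)
	if err != nil || limit <= 0 {
		return configured, nil // "auto" or unparsable: keep the configured value
	}
	if configured > limit {
		return limit, nil
	}
	return configured, nil
}
```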
Fix #13
This is an attempt to downsample flows. However, we can only group by
a prefix of the primary key. Therefore, all downsampling intervals
would have to be encoded in the ORDER BY clause, which is not what we
want.
The idea is to not query the flows table unless absolutely necessary.
It would have been nice not to have this Date field, but rebuilding
the table is costly; we'll do that later when the table is smaller. We
will also need to use a small PARTITION BY.
Also remove some migrations not needed anymore.