As we are not as constrained by time in the outlet, we can simplify the fetching of metadata by doing it synchronously. We still keep the breaker design to avoid continuously polling a source that is not responsive, so we can still lose some data if we are not able to poll metadata. We also keep the background cache refresh. We also introduce a grace time of 1 minute to avoid losing data during startup. For the static provider, we wait for the remote data sources to be ready. For the gNMI provider, there are target windows of availability during which the cached data can be polled. The SNMP provider loses its ability to coalesce requests.
Internal design
Akvorado is written in Go. Each service has its code in a distinct directory
(inlet/, outlet/, orchestrator/ and console/). The common/ directory
contains components common to several services. The cmd/ directory contains
the main entry points.
Each service is split into several components. This is heavily inspired by the Component framework in Clojure. A component is a piece of software with its configuration, its state and its dependencies on other components.
Each component features the following pieces of code:
- A `Component` structure containing its state.
- A `Configuration` structure containing the configuration of the component. It maps to a section of the Akvorado configuration file.
- A `DefaultConfiguration` function with the default values for the configuration.
- A `New()` function instantiating the component. This method takes the configuration and the dependencies. It is inert.
- Optionally, a `Start()` method to start the routines associated with the component.
- Optionally, a `Stop()` method to stop the component.
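To make this concrete, here is a minimal sketch of what such a component could look like. Only the convention (`Component`, `Configuration`, `DefaultConfiguration()`, `New()`, `Start()`, `Stop()`) comes from the code base; the field names and the `Dependencies` contents are hypothetical.

```go
package example

import "gopkg.in/tomb.v2"

// Configuration maps to a section of Akvorado's configuration file.
type Configuration struct {
	Workers int
}

// DefaultConfiguration returns the default values for the configuration.
func DefaultConfiguration() Configuration {
	return Configuration{Workers: 1}
}

// Dependencies lists the other components this one relies on (hypothetical).
type Dependencies struct{}

// Component holds the state of the component.
type Component struct {
	config Configuration
	deps   Dependencies
	t      tomb.Tomb
}

// New instantiates the component. It is inert: nothing runs yet.
func New(config Configuration, deps Dependencies) (*Component, error) {
	return &Component{config: config, deps: deps}, nil
}

// Start spawns the goroutines associated with the component.
func (c *Component) Start() error {
	for i := 0; i < c.config.Workers; i++ {
		c.t.Go(func() error {
			<-c.t.Dying() // a real worker would do something useful here
			return nil
		})
	}
	return nil
}

// Stop asks the goroutines to terminate and waits for them.
func (c *Component) Stop() error {
	c.t.Kill(nil)
	return c.t.Wait()
}
```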
Each component is tested independently. If a component is complex, a
NewMock() function can create a component with a compatible
interface to be used in place of the real component. In this case, it
takes a testing.T struct as first argument and starts the component
immediately. It could return the real component or a mocked version.
For example, the Kafka component returns a component using a mocked
Kafka producer.
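As a rough illustration of this convention (the real mocks differ per component), a `NewMock()` extending the sketch above could look like this; it assumes `import "testing"`:

```go
// NewMock builds and starts a component for tests. This is a sketch of the
// convention described above, not the actual implementation.
func NewMock(t *testing.T, config Configuration) *Component {
	t.Helper()
	c, err := New(config, Dependencies{ /* mocked dependencies would go here */ })
	if err != nil {
		t.Fatalf("New() error:\n%+v", err)
	}
	if err := c.Start(); err != nil {
		t.Fatalf("Start() error:\n%+v", err)
	}
	t.Cleanup(func() { c.Stop() })
	return c
}
```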
Dependencies are handled manually, unlike more complex component-based solutions like Uber Fx.
Reporter
The reporter is a special component handling logs and metrics for all the other components. In the future, this could also be the place to handle crash reports.
For logs, it is mostly a façade to github.com/rs/zerolog with some additional code to append the module name to the logs.
For metrics, it is a façade to the Prometheus instrumentation library. It provides a registry which automatically adds the module name to metric names.
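As an illustration only (the actual reporter API is richer), such a façade can be built on top of the Prometheus client library. The `akvorado_` prefix, the helper name, and the module name below are assumptions:

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// moduleRegisterer is a hypothetical helper: it returns a registerer that
// automatically adds the module name to every metric registered through it.
func moduleRegisterer(reg prometheus.Registerer, module string) prometheus.Registerer {
	return prometheus.WrapRegistererWithPrefix("akvorado_"+module+"_", reg)
}

// receivedFlows ends up exported as akvorado_inlet_flow_received_flows_total.
var receivedFlows = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "received_flows_total",
	Help: "Number of flows received.",
})

func init() {
	moduleRegisterer(prometheus.DefaultRegisterer, "inlet_flow").MustRegister(receivedFlows)
}
```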
It also exposes a simple way to report healthchecks from various
components. While it could be used to kill the application
proactively, it is currently only exposed through HTTP. Not all
components have healthchecks. For example, for the flow component,
it is difficult to watch for a check while reading from UDP. For the
http component, the healthcheck would be too trivial (it would not
exercise the routine handling the heavy work). For kafka, the hard
work is hidden by the underlying library, and checking broker states
manually could get us declared unhealthy because of a transient
problem. The daemon component tracks the important goroutines, so a
healthcheck is not vital there.
The general idea is to give good visibility to an operator. Everything that moves should get a counter; errors should either be fatal, or rate-limited and accounted for in a metric.
CLI
The CLI (not a component) is handled by Cobra. The configuration file is handled by mapstructure. Handling backward compatibility is done by registering hooks to transform the configuration.
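A hedged sketch of what such a compatibility hook could look like; the key names below are hypothetical, not actual Akvorado settings:

```go
package example

import (
	"reflect"
	"time"

	"github.com/mitchellh/mapstructure"
)

type Configuration struct {
	ListenAddress string        `mapstructure:"listen-address"`
	CacheDuration time.Duration `mapstructure:"cache-duration"`
}

// renameDeprecatedKeys maps an old key to its new name before decoding.
func renameDeprecatedKeys(from, to reflect.Type, data interface{}) (interface{}, error) {
	if from.Kind() != reflect.Map || to != reflect.TypeOf(Configuration{}) {
		return data, nil
	}
	m, ok := data.(map[string]interface{})
	if !ok {
		return data, nil
	}
	if v, ok := m["listen"]; ok { // hypothetical deprecated key
		m["listen-address"] = v
		delete(m, "listen")
	}
	return m, nil
}

func decode(raw map[string]interface{}) (Configuration, error) {
	var config Configuration
	decoder, err := mapstructure.NewDecoder(&mapstructure.DecoderConfig{
		Result: &config,
		DecodeHook: mapstructure.ComposeDecodeHookFunc(
			renameDeprecatedKeys,
			mapstructure.StringToTimeDurationHookFunc(),
		),
	})
	if err != nil {
		return config, err
	}
	return config, decoder.Decode(raw)
}
```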
Flow processing
Flow processing is split between inlet and outlet services:
Inlet flow reception
The inlet service receives flows. The design prioritizes speed and minimal processing: flows are encapsulated into protobuf messages and sent to Kafka without parsing. The design scales by creating a socket for each worker instead of distributing incoming flows using a channel.
NetFlow v5, NetFlow v9, IPFIX, and sFlow are currently supported for reception.
The design of this component is modular. It is possible to "plug" new inputs easily. Most buffering is implemented at this level by the input modules that require it. Additional buffering happens in the Kafka module.
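To illustrate the one-socket-per-worker approach, each worker can bind its own UDP socket on the same port with `SO_REUSEPORT`, letting the kernel spread incoming datagrams across workers. This sketch assumes Linux and is not the actual input code:

```go
package example

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenWorker opens one UDP socket per worker on the same address,
// relying on SO_REUSEPORT to let the kernel balance incoming flows.
func listenWorker(address string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var soerr error
			if err := c.Control(func(fd uintptr) {
				soerr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return soerr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", address)
}
```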
Outlet flow decoding
The outlet service consumes flows from Kafka and performs the actual decoding using GoFlow2. This is where flow parsing, enrichment with metadata and routing information, and classification happen before writing to ClickHouse.
Kafka
The Kafka component relies on franz-go. If a
real broker is available under the DNS name kafka or at localhost on port
9092, it will be used for a quick functional test.
This library has not been benchmarked. Previously, we were using Sarama, but its documentation is quite poor, it relies heavily on pointers (putting pressure on the garbage collector), and its concurrency model is difficult to understand. Another contender could be kafka-go.
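For illustration, consuming flow messages with franz-go looks roughly like this; the topic and consumer group names are made up, not Akvorado's actual ones:

```go
package example

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func consume(ctx context.Context) {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("kafka:9092"),
		kgo.ConsumeTopics("flows"),           // hypothetical topic name
		kgo.ConsumerGroup("akvorado-outlet"), // hypothetical group name
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	for {
		fetches := client.PollFetches(ctx)
		if fetches.IsClientClosed() {
			return
		}
		fetches.EachRecord(func(record *kgo.Record) {
			// record.Value holds the protobuf-encapsulated raw flow
			// produced by the inlet; decoding happens from here.
		})
	}
}
```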
ClickHouse
For this OLAP database, migrations are done with a simple loop checking if a step is needed using a custom query and executing it with Go code. Database migration systems exist in Go, notably migrate, but as the table schemas depend on user configuration, it is preferred to use code to check if the existing tables are up-to-date and to update them. For example, we may want to check if the Kafka settings of a table or the source URL of a dictionary are current.
When inserting into ClickHouse, we rely on the low-level ch-go library. Decoded flows are batched directly into the wire format used by ClickHouse.
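A minimal sketch of a batched insert with ch-go; the table and column names are simplified stand-ins for the real schema:

```go
package example

import (
	"context"
	"time"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func insertBatch(ctx context.Context) error {
	client, err := ch.Dial(ctx, ch.Options{Address: "clickhouse:9000"})
	if err != nil {
		return err
	}
	defer client.Close()

	// Columns are appended in ClickHouse's native columnar format.
	var (
		timeReceived proto.ColDateTime
		exporter     proto.ColStr
		bytes        proto.ColUInt64
	)
	timeReceived.Append(time.Now())
	exporter.Append("192.0.2.1")
	bytes.Append(1500)

	input := proto.Input{
		{Name: "TimeReceived", Data: &timeReceived},
		{Name: "ExporterAddress", Data: &exporter},
		{Name: "Bytes", Data: &bytes},
	}
	return client.Do(ctx, ch.Query{
		Body:  input.Into("flows"), // INSERT INTO flows (...) VALUES
		Input: input,
	})
}
```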
Functional tests are run when a ClickHouse server is available under
the name clickhouse or on localhost.
SNMP
SNMP polling is accomplished with GoSNMP. The cache layer is tailored specifically for our needs. Cached information can expire if not accessed or refreshed periodically. If an exporter fails to answer too frequently, a backoff will be triggered for a minute to ensure it does not eat up all the workers' resources.
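This backoff can be sketched with the breaker pattern from github.com/eapache/go-resiliency (listed among the dependencies below). The one-minute timeout comes from the paragraph above; the error threshold and helper names are assumptions:

```go
package example

import (
	"time"

	"github.com/eapache/go-resiliency/breaker"
)

// In practice there would be one breaker per exporter. After 3 consecutive
// errors the breaker opens for one minute, and polling attempts are skipped
// instead of tying up a worker.
var exporterBreaker = breaker.New(3, 1, time.Minute)

func poll(exporter string) error {
	err := exporterBreaker.Run(func() error {
		return snmpGet(exporter) // placeholder for an actual GoSNMP request
	})
	if err == breaker.ErrBreakerOpen {
		return err // the exporter is in backoff; the request is dropped
	}
	return err
}

func snmpGet(exporter string) error { return nil }
```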
Testing is done against a separate SNMP agent implementation.
BMP
The BMP server uses GoBGP's implementation. GoBGP does not have a BMP collector, but it's just a simple TCP connection receiving BMP messages and we use GoBGP to parse them. The data we need is stored in a Patricia tree.
github.com/kentik/patricia implements a fast Patricia tree for IP lookups in a tree of subnets. It leverages Go generics to make the code type-safe. It is used both for configuring subnet-dependent settings (e.g. SNMP communities) and for storing data received over BMP.
To save memory, Akvorado "interns" next hops, origin AS numbers, AS paths, and communities. Each unique combination is associated with a reference-counted 32-bit integer, which is used in the RIB in place of the original information.
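A simplified sketch of this interning scheme; the type and method names are hypothetical, and the real implementation also reuses freed references:

```go
package example

// internTable maps each unique value to a 32-bit reference stored in the RIB,
// with a reference count so entries can be freed when no route uses them.
type internTable[T comparable] struct {
	toRef  map[T]uint32
	values []T
	refcnt []uint32
}

func newInternTable[T comparable]() *internTable[T] {
	return &internTable[T]{toRef: map[T]uint32{}}
}

// Put returns the reference for a value, creating it if needed.
func (t *internTable[T]) Put(value T) uint32 {
	if ref, ok := t.toRef[value]; ok {
		t.refcnt[ref]++
		return ref
	}
	ref := uint32(len(t.values))
	t.toRef[value] = ref
	t.values = append(t.values, value)
	t.refcnt = append(t.refcnt, 1)
	return ref
}

// Get returns the value behind a reference.
func (t *internTable[T]) Get(ref uint32) T { return t.values[ref] }

// Take decrements the reference count, freeing the value when it reaches 0.
func (t *internTable[T]) Take(ref uint32) {
	t.refcnt[ref]--
	if t.refcnt[ref] == 0 {
		delete(t.toRef, t.values[ref])
	}
}
```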
Schema
The Akvorado schema is somewhat dynamic: columns can be added or removed, but everything needs to be predefined in the code. To add a new column, follow these steps:
- Add its symbol to `common/schema/definition.go`.
- Add it to the `flow()` function in `common/schema/definition.go`. Be sure to specify the right/smallest ClickHouse type. If the column is prefixed with `Src` or `InIf`, don't add the opposite direction, this is done automatically. Use `ClickHouseMainOnly` if the column is expected to take a lot of space. Add the column to the end and set the `Disabled` field to `true`. If you add several fields, create a group and use it on decoding to keep decoding/encoding fast for people not enabling them.
- Make it usable in the filters by adding it to `console/filter/parser.peg`. Don't forget to add a test in `console/filter/parser_test.go`.
- Modify `console/query/column.go` to alter the display of the column (it should be a string).
- If it does not have a proper type in ClickHouse to be displayed as is (like a MAC address stored as a 64-bit integer), also modify `widgetFlowLastHandlerFunc()` in `console/widgets.go`.
- Modify `inlet/flow/decoder/netflow/decode.go` and `inlet/flow/decoder/sflow/decode.go` to extract the data from the flows.
- If useful, add a completion in `filterCompleteHandlerFunc()` in `akvorado/console/filter.go`.
Web console
The web console is built as a REST API with a single page application on top of it.
REST API
The REST API is mostly built using the Gin framework which removes some boilerplate compared to using pure Go. Also, it uses the validator package which implements value validations based on tags. The validation options are quite rich.
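For example, a handler with tag-based validation could look like this; the endpoint and field names are illustrative, not the real API:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// graphQuery is validated by the validator library through `binding` tags.
type graphQuery struct {
	Start     string `json:"start" binding:"required"`
	End       string `json:"end" binding:"required"`
	Dimension string `json:"dimension" binding:"omitempty,oneof=SrcAS DstAS ExporterName"`
	Limit     int    `json:"limit" binding:"omitempty,min=1,max=50"`
}

func main() {
	r := gin.New()
	r.POST("/api/v0/console/graph", func(c *gin.Context) {
		var q graphQuery
		if err := c.ShouldBindJSON(&q); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"message": err.Error()})
			return
		}
		c.JSON(http.StatusOK, gin.H{"rows": []any{}})
	})
	r.Run(":8080")
}
```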
Single page application
The SPA is built using mostly the following components:
- TypeScript instead of JavaScript,
- Vite as a builder,
- Vue as the reactive JavaScript framework,
- TailwindCSS for styling pages directly inside HTML,
- Headless UI for some unstyled UI components,
- ECharts to plot charts,
- CodeMirror to edit filter expressions.
There is no full-blown component library despite the existence of many candidates:
- Vuetify is only compatible with Vue 2.
- BootstrapVue is only compatible with Vue 2.
- PrimeVue is quite heavyweight and much of it is not open source.
- VueTailwind would be a perfect match but it is not compatible with Vue 3.
- Naive UI may be a future option; its styling does not use TailwindCSS, which is annoying for responsive design, but we can stay away from the proposed layout.
So, currently, components are mostly copied from Flowbite, or taken from Headless UI and styled like Flowbite.
The use of TailwindCSS is also a strong choice. Its documentation explains the reasoning. It makes sense but is sometimes a burden. Many components are scattered around the web, and when there is no need for JavaScript, it is just a matter of copy/pasting and customizing.
Other components
The core component is the main processing component in the outlet service. It takes metadata, routing, and other components as dependencies and orchestrates the flow enrichment and classification process.
The HTTP component exposes a web server. Its main role is to manage the lifecycle of the HTTP server and to provide a method to add handlers. The web component provides the web interface of Akvorado. Currently, this is only the documentation. Other components may expose various endpoints. They are documented in the usage section.
The daemon component handles the lifecycle of the whole application. It watches for the various goroutines (through tombs, see below) spawned by the other components and waits for signals to terminate. If Akvorado had a systemd integration, it would take place here too.
Other interesting dependencies
- gopkg.in/tomb.v2 handles clean goroutine tracking and termination. Like contexts, it allows signaling termination of a group of goroutines. Unlike contexts, it also enables us to catch errors in goroutines and react to them (most of the time by dying). See the sketch after this list.
- github.com/benbjohnson/clock is used in place of the `time` module when we want to be able to mock the clock. This is used for example to test the cache of the SNMP poller.
- github.com/cenkalti/backoff/v4 provides an exponential backoff algorithm for retries.
- github.com/eapache/go-resiliency implements several resiliency patterns, including the breaker pattern.
- github.com/go-playground/validator implements struct validation using tags. We use it to have better validation on configuration structures.
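To illustrate the tomb pattern mentioned above, here is a generic sketch, not actual Akvorado code:

```go
package main

import (
	"time"

	"gopkg.in/tomb.v2"
)

func main() {
	var t tomb.Tomb
	// Each tracked goroutine is started with t.Go; a returned error kills
	// the tomb and, in Akvorado, would make the daemon shut down.
	t.Go(func() error {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// periodic work
			case <-t.Dying():
				return nil
			}
		}
	})

	// On shutdown (for example after receiving SIGTERM):
	t.Kill(nil)
	if err := t.Wait(); err != nil {
		panic(err) // a goroutine failed
	}
}
```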