Troubleshooting

Warning

Please read this page carefully before you open an issue or start a discussion.

Tip

This guide assumes that you use the Docker Compose setup. If you use a different setup, adapt the commands as needed.

As explained in the introduction, Akvorado has several components. To troubleshoot an issue, inspect each component.

Functional view

Your routers send flows to the inlet, which sends them to Kafka. The outlet takes flows from Kafka, decodes and processes them, and then sends them to ClickHouse. The orchestrator configures Kafka and ClickHouse and provides the configuration for the inlet and outlet. The console (not shown here) queries ClickHouse to display flows to users.

Basic checks

First, check that you have enough space. This is a common cause of failure:

$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          7         7         1.819GB   7.834MB (0%)
Containers      15        15        2.752GB   0B (0%)
Local Volumes   16        9         69.24GB   8.594GB (12%)
Build Cache     4         0         5.291MB   5.291MB

You can recover space with docker system prune or get more details with docker system df -v. See the documentation about operations on how to check space usage for ClickHouse.

Caution

Do not use docker system prune -a unless you are sure that all your containers are up and running. It is important to understand that this command removes anything that is not currently used.
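
If the ClickHouse volume is what takes most of the space, you can get a per-table breakdown from clickhouse-client (a quick sketch using the standard system.parts table; see the ClickHouse section below for how to open a client):

SELECT table, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC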

Check that all components are running and healthy:

$ docker compose ps --format "table {{.Service}}\t{{.Status}}"
SERVICE                    STATUS
akvorado-conntrack-fixer   Up 28 minutes
akvorado-console           Up 27 minutes (healthy)
akvorado-inlet             Up 27 minutes (healthy)
akvorado-orchestrator      Up 27 minutes (healthy)
akvorado-outlet            Up 27 minutes (healthy)
clickhouse                 Up 28 minutes (healthy)
geoip                      Up 28 minutes (healthy)
kafka                      Up 28 minutes (healthy)
kafka-ui                   Up 28 minutes
redis                      Up 28 minutes (healthy)
traefik                    Up 28 minutes

Make sure that all components are present. If a component is missing, restarting, unhealthy, or not working correctly, check its logs:

$ docker compose logs akvorado-inlet

The inlet, outlet, orchestrator, and console expose metrics. Get them with this command:

$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics
# HELP akvorado_cmd_info Akvorado build information
# TYPE akvorado_cmd_info gauge
akvorado_cmd_info{compiler="go1.24.4",version="v1.11.5-134-gaf3869cd701c"} 1
[...]

Caution

Run the curl command on the same host that runs Akvorado, and change inlet to the name of the component that you are interested in.

To see only error metrics, filter them:

$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_.*_error'

Tip

To follow this guide on a working system, replace http://127.0.0.1:8080 with https://demo.akvorado.net.

Inlet service

The inlet service receives NetFlow/IPFIX/sFlow packets and sends them to Kafka. First, check if you are receiving packets from exporters (your routers):

$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_inlet_flow_input_udp_packets'
# HELP akvorado_inlet_flow_input_udp_packets_total Packets received by the application.
# TYPE akvorado_inlet_flow_input_udp_packets_total counter
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.12",listener=":2055",worker="2"} 6769
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.13",listener=":2055",worker="1"} 6794
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.14",listener=":2055",worker="2"} 6765
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.15",listener=":2055",worker="0"} 6782

If your exporters are not listed, check their configuration. You can also use tcpdump to verify that they are sending packets. Replace the IP with the IP address of the exporter and the port with the correct port (2055 for NetFlow, 4739 for IPFIX and 6343 for sFlow).

# tcpdump -c3 -pni any host 241.107.1.12 and port 2055
09:11:08.729738 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 624
09:11:08.729787 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 1060
09:11:08.729799 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 1060
3 packets captured
3 packets received by filter
0 packets dropped by kernel

Next, check if flows are sent to Kafka correctly:

$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_inlet_kafka_sent_messages'
# HELP akvorado_inlet_kafka_sent_messages_total Number of messages sent from a given exporter.
# TYPE akvorado_inlet_kafka_sent_messages_total counter
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.12"} 8108
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.13"} 8117
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.14"} 8090
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.15"} 8123

If no messages appear here, there may be a problem with Kafka.

Kafka

The inlet sends messages to Kafka, and the outlet takes them from Kafka. The Docker Compose setup ships with UI for Apache Kafka, a web interface for the broker. You can access it at http://127.0.0.1:8080/kafka-ui.

Tip

For security reasons, this UI is not exposed on anything other than the host that is running Akvorado. If you need to access it remotely, the easiest way is to use SSH port forwarding: ssh -L 8080:127.0.0.1:8080 akvorado. Then, you can use http://127.0.0.1:8080/kafka-ui directly from your workstation.

Check the various tabs (brokers, topics, and consumers) to make sure that everything is green. In “brokers”, you should see one broker. In “topics”, you should see flows-v5 with an increasing number of messages. This means that the inlet is pushing messages. In “consumers”, you should have akvorado-outlet, with at least one member. The consumer lag should be stable (and low). This is the number of messages that the outlet has not yet processed.
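
If you prefer the command line, you can also inspect the consumer group with Kafka's own tools (a sketch assuming the CLI scripts are available in the kafka container and that the broker listens on kafka:9092; adjust to your image):

$ docker compose exec kafka kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group akvorado-outlet

The LAG column should stay low and should not grow continuously.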

Outlet

The outlet is the most complex component. Check if it works correctly with this command (it should show one processed flow):

$ curl -s http://127.0.0.1:8080/api/v0/outlet/flows\?limit\=1
{"TimeReceived":1753631373,"SamplingRate":100000,"ExporterAddress":"::ffff:241.107.1.15","InIf":10,"OutIf":21,"SrcVlan":0,"DstVlan":0,"SrcAddr":"::ffff:216.58.206.244","DstAddr":"::ffff:192.0.2.144","NextHop":"","SrcAS":15169,"DstAS":64501,"SrcNetMask":24,"DstNetMask":24,"OtherColumns":null}

Check these important metrics. First, the outlet should receive flows from Kafka:

$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_kafka_received_messages'
# HELP akvorado_outlet_kafka_received_messages_total Number of messages received for a given worker.
# TYPE akvorado_outlet_kafka_received_messages_total counter
akvorado_outlet_kafka_received_messages_total{worker="0"} 5561
akvorado_outlet_kafka_received_messages_total{worker="1"} 5456
akvorado_outlet_kafka_received_messages_total{worker="2"} 5583
akvorado_outlet_kafka_received_messages_total{worker="3"} 11068
akvorado_outlet_kafka_received_messages_total{worker="4"} 11151
akvorado_outlet_kafka_received_messages_total{worker="5"} 5588

If these numbers are not increasing, there is a problem with receiving from Kafka. If everything is OK, check if the flow processing pipeline works correctly:

$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_core_(received|forwarded)'
# HELP akvorado_outlet_core_forwarded_flows_total Number of flows forwarded to Kafka.
# TYPE akvorado_outlet_core_forwarded_flows_total counter
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.12"} 182512
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.13"} 182366
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.14"} 182278
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.15"} 182900
# HELP akvorado_outlet_core_received_flows_total Number of incoming flows.
# TYPE akvorado_outlet_core_received_flows_total counter
akvorado_outlet_core_received_flows_total{exporter="241.107.1.12"} 182512
akvorado_outlet_core_received_flows_total{exporter="241.107.1.13"} 182366
akvorado_outlet_core_received_flows_total{exporter="241.107.1.14"} 182278
akvorado_outlet_core_received_flows_total{exporter="241.107.1.15"} 182900
# HELP akvorado_outlet_core_received_raw_flows_total Number of incoming raw flows (proto).
# TYPE akvorado_outlet_core_received_raw_flows_total counter
akvorado_outlet_core_received_raw_flows_total 45812

Notably, akvorado_outlet_core_received_raw_flows_total is incremented by one for each message that is received from Kafka. The message is then decoded, and the flows are extracted. For each extracted flow, akvorado_outlet_core_received_flows_total is incremented by one. The flows are then enriched, and before they are sent to ClickHouse, akvorado_outlet_core_forwarded_flows_total is incremented.

If akvorado_outlet_core_received_raw_flows_total increases but akvorado_outlet_core_received_flows_total does not, there is an error decoding the flows. If akvorado_outlet_core_received_flows_total increases but akvorado_outlet_core_forwarded_flows_total does not, there is an error enriching the flows.

For the first case, use this command to find clues:

$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_flow.*errors'

For the second case, use this one:

$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_core.*errors'

Here is a list of errors that you may find:

  • metadata cache miss means that interface information is missing from the metadata cache. The most likely cause is that the exporter does not accept SNMP requests or the SNMP community is configured incorrectly.
  • sampling rate missing means that the sampling rate information is not present. This is normal when Akvorado starts, but it should not keep increasing. With NetFlow, the sampling rate is sent in an options data packet. Make sure that your exporter sends them (look for sampler-table in the documentation). Alternatively, you can work around this issue by setting default-sampling-rate in the outlet's core configuration section (see the sketch after this list).
  • input and output interfaces missing means that the flow does not contain input and output interface indexes. Fix this on the exporter.
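
For the sampling-rate workaround mentioned above, the configuration would look roughly like this (an illustrative sketch: the value 1000 is only an example and must match what your exporters actually use; check the configuration guide for the exact syntax):

outlet:
  core:
    default-sampling-rate: 1000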

A convenient way to check if the SNMP configuration is correct is to use tcpdump.

# nsenter -t $(pgrep -fo "akvorado outlet") -n tcpdump -c3 -pni eth0 port 161
20:46:44.812243 IP 240.0.2.11.34554 > 240.0.2.13.161: C="private" GetRequest(95) .1.3.6.1.2.1.1.5.0 .1.3.6.1.2.1.2.2.1.2.11 .1.3.6.1.2.1.31.1.1.1.1.11 .1.3.6.1.2.1.31.1.1.1.18.11 .1.3.6.1.2.1.31.1.1.1.15.11
20:46:45.144567 IP 240.0.2.13.161 > 240.0.2.11.34554: C="private" GetResponse(153) .1.3.6.1.2.1.1.5.0="dc3-edge1.example.com" .1.3.6.1.2.1.2.2.1.2.11="Gi0/0/0/11" .1.3.6.1.2.1.31.1.1.1.1.11="Gi0/0/0/11" .1.3.6.1.2.1.31.1.1.1.18.11="Transit: Lumen" .1.3.6.1.2.1.31.1.1.1.15.11=10000
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel

If you do not get an answer, there may be several causes:

  • the community is incorrect, and you need to fix it
  • the exporter is not configured to answer SNMP requests

Finally, check if flows are sent to ClickHouse successfully. Use this command:

$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_clickhouse_(errors|flow)'
# HELP akvorado_outlet_clickhouse_errors_total Errors while inserting into ClickHouse
# TYPE akvorado_outlet_clickhouse_errors_total counter
akvorado_outlet_clickhouse_errors_total{error="send"} 7
# HELP akvorado_outlet_clickhouse_flow_per_batch Number of flow per batch sent to ClickHouse
# TYPE akvorado_outlet_clickhouse_flow_per_batch summary
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.5"} 250
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.9"} 480
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.99"} 950
akvorado_outlet_clickhouse_flow_per_batch_sum 45892
akvorado_outlet_clickhouse_flow_per_batch_count 163

If the errors are not increasing and flow_per_batch_sum is increasing, everything is working correctly.
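
As a cross-check from the ClickHouse side, you can count how many flows were written during the last five minutes (run from clickhouse-client, as explained in the next section; the flows table is the default destination):

SELECT count()
FROM flows
WHERE TimeReceived > now() - INTERVAL 5 MINUTE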

ClickHouse

The last component to check is ClickHouse. Connect to it with this command:

$ docker compose exec clickhouse clickhouse-client

First, check if all the tables are present:

$ SHOW TABLES
    ┌─name────────────────────────────────────────────┐
 1. │ asns                                            │
 2. │ exporters                                       │
 3. │ exporters_consumer                              │
 4. │ flows                                           │
 5. │ flows_1h0m0s                                    │
 6. │ flows_1h0m0s_consumer                           │
 7. │ flows_1m0s                                      │
 8. │ flows_1m0s_consumer                             │
 9. │ flows_5m0s                                      │
10. │ flows_5m0s_consumer                             │
11. │ flows_I6D3KDQCRUBCNCGF4BSOWTRMVIv5_raw          │
12. │ flows_I6D3KDQCRUBCNCGF4BSOWTRMVIv5_raw_consumer │
13. │ icmp                                            │
14. │ networks                                        │
15. │ protocols                                       │
16. │ tcp                                             │
17. │ udp                                             │
    └─────────────────────────────────────────────────┘

Check if the various dictionaries are populated:

$ SELECT name, element_count FROM system.dictionaries
   ┌─name──────┬─element_count─┐
1. │ networks  │       5963224 │
2. │ udp       │          5495 │
3. │ icmp      │            58 │
4. │ protocols │           129 │
5. │ asns      │         99598 │
6. │ tcp       │          5883 │
   └───────────┴───────────────┘

If you have not used the console yet, some dictionaries may be empty.

To check if ClickHouse is lagging behind, use this SQL query with clickhouse-client to get the lag in minutes:

SELECT (now()-max(TimeReceived))/60
FROM flows

If you still have problems, check the errors that are reported by ClickHouse:

SELECT last_error_time, last_error_message
FROM system.errors
ORDER BY last_error_time LIMIT 10
FORMAT Vertical

Console

The most common console problems are empty widgets or no flows shown in the “visualize” tab. Both problems indicate that interface classification is not working correctly.

Interface classification marks interfaces as either “internal” or “external”. If you have not configured interface classification, see the configuration guide. This step is required.
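
As a rough illustration only (the exact option names and helper functions are described in the configuration guide and may differ from this sketch), a minimal set of rules could mark transit and IX-facing ports as external and everything else as internal:

outlet:
  core:
    interface-classifiers:
      - Interface.Description startsWith "Transit:" && ClassifyExternal()
      - Interface.Description startsWith "IX:" && ClassifyExternal()
      - ClassifyInternal()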

Scaling

Various bottlenecks can cause dropped packets. This is problematic because the reported sampling rate becomes incorrect, and you cannot reliably calculate the number of bytes and packets. Both the exporters and the inlet need to be tuned for this kind of problem.

The outlet can also be a bottleneck. In this case, the flows may appear on the console with a delay.

Exporters

The first problem may be that the exporter is dropping flows. Usually, counters can detect this situation, and you can solve it by reducing the exporter rate.

NCS5500 routers

NetFlow, Sampling-Interval and the Mythical Internet Packet Size contains a lot of information about the limits of this platform. The first bottleneck is a 133 Mbps shaper between an NPU and the LC CPU for the sampled packets (144 bytes each). For example, on a NC55-36X100G line card, there are 6 NPUs, and each one manages 6 ports. If we consider an average packet size of 1000 bytes, the maximum sampling rate when all ports are full is about 1:700 (the formula is Total-BW × 144 / ( Avg-Pkt-Size × 133 Mbps ), with Total-BW in bits per second and Avg-Pkt-Size in bytes).
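
Plugging in the numbers for one NPU of a NC55-36X100G line card (6 × 100 Gbps of traffic, 144-byte sampled copies, 1000-byte average packets):

600 Gbps / (1000 bytes × 8)   ≈ 75,000,000 packets/s at line rate
133 Mbps / (144 bytes × 8)    ≈ 115,000 sampled packets/s through the shaper
75,000,000 / 115,000          ≈ 650, so a sampling rate of roughly 1:700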

You can check for drops with sh controllers npu stats voq base 24 instance 0 location 0/0/CPU0 by looking at the COS2 line.

The second bottleneck is the size of the flow cache. If it is too small, it may overflow. For example:

# show flow monitor monitor1 cache internal location 0/1/CPU0 | i Cache
Cache summary for Flow Monitor :
Cache size:                         100000
Cache Hits:                            202938943
Cache Misses:                         1789836407
Cache Overflows:                         2166590
Cache above hi water:                       1704

When this happens, either increase the cache timeout rate-limit or increase the cache entries directive; the latter can be raised to up to 1 million entries per monitor-map.
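
On IOS-XR, both settings live under the monitor-map (an illustrative sketch; the map name and values are examples only and must respect the platform limits):

flow monitor-map monitor1
 cache entries 1000000
 cache timeout rate-limit 10000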

Other routers

Other routers are likely to have the same limitations. Note that sFlow and IPFIX 315 do not have a flow cache and are therefore less likely to have scaling problems.

Inlet

When the inlet has scaling issues, the kernel may drop incoming packets. Each listening socket has a fixed amount of receive buffer space (212992 bytes by default) to hold packets until the application reads them. When this buffer is full, packets are dropped.

Akvorado reports the number of drops for each listening socket with the akvorado_inlet_flow_input_udp_in_dropped_packets_total counter. This should be compared to akvorado_inlet_flow_input_udp_packets_total. Another way to get the same information is to use ss -lunepm and look at the drop counter:

$ nsenter -t $(pgrep -fo "akvorado inlet") -n ss -lunepm
State            Recv-Q           Send-Q                       Local Address:Port                        Peer Address:Port           Process
UNCONN           0                0                                        *:2055                                   *:*               users:(("akvorado",pid=2710961,fd=16)) ino:67643151 sk:89c v6only:0 <->
         skmem:(r0,rb212992,t0,tb212992,f4096,w0,o0,bl0,d486525)

In the example above, there were 486525 drops. You can solve this in three ways:

  • increase the number of workers for the UDP input,
  • increase the value of the net.core.rmem_max sysctl (on the host) and increase the receive-buffer setting for the input to the same value (see the sketch after this list),
  • add more inlet instances and shard the exporters among the configured ones.
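
For the second option, a possible combination is to raise the kernel limit on the Docker host and mirror it in the UDP input configuration (an illustrative sketch: 8 MiB is an arbitrary example value, and the exact nesting of the input options may differ from your configuration file):

$ sysctl -w net.core.rmem_max=8388608

Then, in the Akvorado configuration:

inlet:
  flow:
    inputs:
      - type: udp
        decoder: netflow
        listen: :2055
        workers: 6
        receive-buffer: 8388608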

The value of the receive buffer is also available as a metric:

$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep -P 'akvorado_inlet_flow_input_udp_buffer'
# HELP akvorado_inlet_flow_input_udp_buffer_size_bytes Size of the in-kernel buffer for this worker.
# TYPE akvorado_inlet_flow_input_udp_buffer_size_bytes gauge
akvorado_inlet_flow_input_udp_buffer_size_bytes{listener=":2055",worker="2"} 212992

Outlet

The outlet is expected to scale the number of workers automatically so that data is delivered efficiently to ClickHouse. Increasing the maximum number of Kafka workers (max-workers) past the default value of 8 may put more pressure on ClickHouse; to compensate, you can also increase maximum-batch-size.

The BMP routing component may have some challenges with scaling, especially when peers disappear, as it requires cleaning up many entries in the routing tables. BMP is a one-way protocol, and the sender may declare the receiver station “stuck” if it does not accept more data. To avoid this, you may need to tune the TCP receive buffer. First, check the current situation:

$ nsenter -t $(pgrep -fo "akvorado outlet") -n ss -tnepm sport = :10179
State         Recv-Q          Send-Q                         Local Address:Port                           Peer Address:Port          Process
ESTAB         0               0                        [::ffff:240.0.2.10]:10179                   [::ffff:240.0.2.15]:46656          users:(("akvorado",pid=1117049,fd=13)) timer:(keepalive,55sec,0) ino:40752297 sk:11 <->
         skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB         0               0                        [::ffff:240.0.2.10]:10179                   [::ffff:240.0.2.14]:44318          users:(("akvorado",pid=1117049,fd=12)) timer:(keepalive,55sec,0) ino:40751196 sk:12 <->
         skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB         0               0                        [::ffff:240.0.2.10]:10179                   [::ffff:240.0.2.13]:47586          users:(("akvorado",pid=1117049,fd=11)) timer:(keepalive,55sec,0) ino:40751097 sk:13 <->
         skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB         0               0                        [::ffff:240.0.2.10]:10179                   [::ffff:240.0.2.12]:51352          users:(("akvorado",pid=1117049,fd=14)) timer:(keepalive,55sec,0) ino:40752299 sk:14 <->
         skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)

Here, the receive buffer for each connection is 131,072 bytes. Linux exposes net.ipv4.tcp_rmem to tune this value:

$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096        131072  6291456

The middle value is the default one, and the last value is the maximum one. When net.ipv4.tcp_moderate_rcvbuf is set to 1 (the default), Linux automatically tunes the size of the buffer for the application to maximize the throughput depending on the latency. This mechanism will not help here. To increase the receive buffer size, you need to:

  • set the receive-buffer value in the BMP provider configuration (for example, to 33554432 for 32 MiB)
  • increase the last value of net.ipv4.tcp_rmem to the same value
  • increase the value of net.core.rmem_max to the same value (see the commands after this list)
  • optionally, increase the last value of net.ipv4.tcp_mem by the value of tcp_rmem[2] multiplied by the maximum number of BMP peers you expect, divided by 4096 bytes per page (check with getconf PAGESIZE)
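
For example, to allow 32 MiB buffers on the host (matching the 33554432 value above; make the settings persistent in /etc/sysctl.d so they survive a reboot):

$ sysctl -w net.core.rmem_max=33554432
$ sysctl -w net.ipv4.tcp_rmem="4096 131072 33554432"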

The value of the receive buffer is also available as a metric:

$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_routing_provider_bmp_buffer'
# HELP akvorado_outlet_routing_provider_bmp_buffer_size_bytes Size of the in-kernel buffer for this connection.
# TYPE akvorado_outlet_routing_provider_bmp_buffer_size_bytes gauge
akvorado_outlet_routing_provider_bmp_buffer_size_bytes{exporter="241.107.1.12"} 425984

Profiling

On a large-scale installation, you may want to check if Akvorado is using too much CPU or memory. You can do this with pprof, the Go profiler. You need a working Go installation on your workstation.

When running on Docker, use docker inspect to get the IP address of the service that you want to profile (inlet or outlet):

$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' akvorado_akvorado-inlet_1
240.0.4.8
$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' akvorado_akvorado-outlet_1
240.0.4.9

Then, use one of these two commands:

$ go tool pprof http://240.0.4.8:8080/debug/pprof/profile
$ go tool pprof http://240.0.4.8:8080/debug/pprof/heap

If your Docker host is remote, you also need to use SSH forwarding to expose the HTTP port to your workstation:

$ ssh -L 6060:240.0.4.8:8080 dockerhost.example.com

Then, use one of these two commands:

$ go tool pprof http://127.0.0.1:6060/debug/pprof/profile
$ go tool pprof http://127.0.0.1:6060/debug/pprof/heap

The first one provides a CPU profile. The second one provides a memory profile. On the command line, you can type web to visualize the result in the browser or svg to get an SVG file that you can attach to a bug report if needed.
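
Alternatively, pprof can serve its interactive web interface directly instead of the command prompt (assuming port 8081 is free on your workstation):

$ go tool pprof -http=:8081 http://127.0.0.1:6060/debug/pprof/heap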