Troubleshooting
Warning
Please read this page carefully before you open an issue or start a discussion.
Tip
This guide assumes that you use the Docker Compose setup. If you use a different setup, adapt the commands as needed.
As explained in the introduction, Akvorado has several components. To troubleshoot an issue, inspect each component.
Your routers send flows to the inlet, which sends them to Kafka. The outlet takes flows from Kafka, decodes and processes them, and then sends them to ClickHouse. The orchestrator configures Kafka and ClickHouse and provides the configuration for the inlet and the outlet. The console queries ClickHouse to display flows to users.
Basic checks
First, check that you have enough space. This is a common cause of failure:
$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 7 7 1.819GB 7.834MB (0%)
Containers 15 15 2.752GB 0B (0%)
Local Volumes 16 9 69.24GB 8.594GB (12%)
Build Cache 4 0 5.291MB 5.291MB
You can recover space with docker system prune or get more details with docker system df -v. See the documentation about operations for how to check space usage for ClickHouse.
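As a quick check, you can also ask ClickHouse directly which tables use the most disk space. This is a sketch querying the standard system.parts table:
$ docker compose exec clickhouse clickhouse-client --query "SELECT table, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active GROUP BY table ORDER BY sum(bytes_on_disk) DESC"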
Caution
Do not use docker system prune -a unless you are sure that all your containers are up and running. This command removes anything that is not currently in use.
Check that all components are running and healthy:
$ docker compose ps --format "table {{.Service}}\t{{.Status}}"
SERVICE STATUS
akvorado-conntrack-fixer Up 28 minutes
akvorado-console Up 27 minutes (healthy)
akvorado-inlet Up 27 minutes (healthy)
akvorado-orchestrator Up 27 minutes (healthy)
akvorado-outlet Up 27 minutes (healthy)
clickhouse Up 28 minutes (healthy)
geoip Up 28 minutes (healthy)
kafka Up 28 minutes (healthy)
kafka-ui Up 28 minutes
redis Up 28 minutes (healthy)
traefik Up 28 minutes
Make sure that all components are present. If a component is missing, restarting, unhealthy, or not working correctly, check its logs:
$ docker compose logs akvorado-inlet
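To keep only the most recent messages and follow new ones as they arrive, something like this works with Docker Compose:
$ docker compose logs --tail 100 -f akvorado-outlet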
The inlet, outlet, orchestrator, and console expose metrics. Get them with this command:
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics
# HELP akvorado_cmd_info Akvorado build information
# TYPE akvorado_cmd_info gauge
akvorado_cmd_info{compiler="go1.24.4",version="v1.11.5-134-gaf3869cd701c"} 1
[...]
Caution
Run the curl command on the same host that runs Akvorado, and change inlet to the name of the component that you are interested in.
To see only error metrics, filter them:
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_.*_error'
Tip
To follow this guide on a working system, replace http://127.0.0.1:8080 with https://demo.akvorado.net.
Inlet service
The inlet service receives NetFlow/IPFIX/sFlow packets and sends them to Kafka. First, check if you are receiving packets from exporters (your routers):
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_inlet_flow_input_udp_packets'
# HELP akvorado_inlet_flow_input_udp_packets_total Packets received by the application.
# TYPE akvorado_inlet_flow_input_udp_packets_total counter
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.12",listener=":2055",worker="2"} 6769
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.13",listener=":2055",worker="1"} 6794
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.14",listener=":2055",worker="2"} 6765
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.15",listener=":2055",worker="0"} 6782
If your exporters are not listed, check their configuration. You can also use
tcpdump to verify that they are sending packets. Replace the IP with the IP
address of the exporter and the port with the correct port (2055 for NetFlow,
4739 for IPFIX and 6343 for sFlow).
# tcpdump -c3 -pni any host 241.107.1.12 and port 2055
09:11:08.729738 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 624
09:11:08.729787 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 1060
09:11:08.729799 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 1060
3 packets captured
3 packets received by filter
0 packets dropped by kernel
Next, check if flows are sent to Kafka correctly:
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_inlet_kafka_sent_messages'
# HELP akvorado_inlet_kafka_sent_messages_total Number of messages sent from a given exporter.
# TYPE akvorado_inlet_kafka_sent_messages_total counter
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.12"} 8108
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.13"} 8117
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.14"} 8090
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.15"} 8123
If no messages appear here, there may be a problem with Kafka.
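If that is the case, the inlet may also report errors while talking to Kafka. The exact metric names depend on the Akvorado version, but a pattern-based filter like this should surface them if there are any:
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_inlet_kafka.*error'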
Kafka
The inlet sends messages to Kafka, and the outlet takes them from
Kafka. The Docker Compose setup comes with UI for Apache
Kafka. You can access it at
http://127.0.0.1:8080/kafka-ui.
Tip
For security reasons, this UI is not exposed on anything other than the host that is running Akvorado. If you need to access it remotely, the easiest way is to use SSH port forwarding:
ssh -L 8080:127.0.0.1:8080 akvorado. Then, you can use http://127.0.0.1:8080/kafka-ui directly from your workstation.
Check the various tabs (brokers, topics, and consumers) to make sure that everything is
green. In “brokers”, you should see one broker. In “topics”, you should see
flows-v5 with an increasing number of messages. This means that the inlet is
pushing messages. In “consumers”, you should have akvorado-outlet, with at
least one member. The consumer lag should be stable (and low). This is the
number of messages that the outlet has not yet processed.
Outlet
The outlet is the most complex component. Check if it works correctly with this command (it should show one processed flow):
$ curl -s http://127.0.0.1:8080/api/v0/outlet/flows\?limit\=1
{"TimeReceived":1753631373,"SamplingRate":100000,"ExporterAddress":"::ffff:241.107.1.15","InIf":10,"OutIf":21,"SrcVlan":0,"DstVlan":0,"SrcAddr":"::ffff:216.58.206.244","DstAddr":"::ffff:192.0.2.144","NextHop":"","SrcAS":15169,"DstAS":64501,"SrcNetMask":24,"DstNetMask":24,"OtherColumns":null}
If you get a flow, you can skip this section. Otherwise, we need to check some metrics. First, the outlet should receive flows from Kafka:
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_kafka_received_messages'
# HELP akvorado_outlet_kafka_received_messages_total Number of messages received for a given worker.
# TYPE akvorado_outlet_kafka_received_messages_total counter
akvorado_outlet_kafka_received_messages_total{worker="0"} 5561
akvorado_outlet_kafka_received_messages_total{worker="1"} 5456
akvorado_outlet_kafka_received_messages_total{worker="2"} 5583
akvorado_outlet_kafka_received_messages_total{worker="3"} 11068
akvorado_outlet_kafka_received_messages_total{worker="4"} 11151
akvorado_outlet_kafka_received_messages_total{worker="5"} 5588
If these numbers are not increasing, there is a problem with receiving from Kafka. If everything is OK, check if the flow processing pipeline works correctly:
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_core_(received|forwarded)'
# HELP akvorado_outlet_core_forwarded_flows_total Number of flows forwarded to Kafka.
# TYPE akvorado_outlet_core_forwarded_flows_total counter
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.12"} 182512
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.13"} 182366
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.14"} 182278
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.15"} 182900
# HELP akvorado_outlet_core_received_flows_total Number of incoming flows.
# TYPE akvorado_outlet_core_received_flows_total counter
akvorado_outlet_core_received_flows_total{exporter="241.107.1.12"} 182512
akvorado_outlet_core_received_flows_total{exporter="241.107.1.13"} 182366
akvorado_outlet_core_received_flows_total{exporter="241.107.1.14"} 182278
akvorado_outlet_core_received_flows_total{exporter="241.107.1.15"} 182900
# HELP akvorado_outlet_core_received_raw_flows_total Number of incoming raw flows (proto).
# TYPE akvorado_outlet_core_received_raw_flows_total counter
akvorado_outlet_core_received_raw_flows_total 45812
Notably, akvorado_outlet_core_received_raw_flows_total is incremented by one
for each message that is received from Kafka. The message is then decoded, and the flows
are extracted. For each extracted flow,
akvorado_outlet_core_received_flows_total is incremented by one. The flows are
then enriched, and before they are sent to ClickHouse,
akvorado_outlet_core_forwarded_flows_total is incremented.
If akvorado_outlet_core_received_raw_flows_total increases but
akvorado_outlet_core_received_flows_total does not, there is an error
decoding the flows. If akvorado_outlet_core_received_flows_total increases
but akvorado_outlet_core_forwarded_flows_total does not, there is an error
enriching the flows.
For the first case, use this command to find clues:
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_flow.*errors'
For the second case, use this one:
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_core.*errors'
Here is a list of errors that you may find:
- metadata cache miss means that interface information is missing from the metadata cache. The most likely cause is that the exporter does not accept SNMP requests or the SNMP community is configured incorrectly.
- sampling rate missing means that the sampling rate information is not present. This is normal when Akvorado starts, but it should not keep increasing. With NetFlow, the sampling rate is sent in an options data packet. Make sure that your exporter sends them (look for sampler-table in the documentation). Alternatively, you can configure outlet→core→default-sampling-rate to work around this issue.
- input and output interfaces missing means that the flow does not contain input and output interface indexes. Fix this on the exporter.
A convenient way to check if the SNMP configuration is correct is to use
tcpdump.
# nsenter -t $(pgrep -fo "akvorado outlet") -n tcpdump -c3 -pni eth0 port 161
20:46:44.812243 IP 240.0.2.11.34554 > 240.0.2.13.161: C="private" GetRequest(95) .1.3.6.1.2.1.1.5.0 .1.3.6.1.2.1.2.2.1.2.11 .1.3.6.1.2.1.31.1.1.1.1.11 .1.3.6.1.2.1.31.1.1.1.18.11 .1.3.6.1.2.1.31.1.1.1.15.11
20:46:45.144567 IP 240.0.2.13.161 > 240.0.2.11.34554: C="private" GetResponse(153) .1.3.6.1.2.1.1.5.0="dc3-edge1.example.com" .1.3.6.1.2.1.2.2.1.2.11="Gi0/0/0/11" .1.3.6.1.2.1.31.1.1.1.1.11="Gi0/0/0/11" .1.3.6.1.2.1.31.1.1.1.18.11="Transit: Lumen" .1.3.6.1.2.1.31.1.1.1.15.11=10000
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel
If you do not get an answer, there may be several causes:
- the community is incorrect, and you need to fix it
- the exporter is not configured to answer SNMP requests
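If snmpget (from Net-SNMP) is installed on the host running Akvorado, you can also test the community directly. This is a sketch using the community and the SNMP peer address from the tcpdump example above; it requests sysName and should return the exporter's hostname (dc3-edge1.example.com in the example):
$ snmpget -v2c -c private 240.0.2.13 1.3.6.1.2.1.1.5.0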
Finally, check if flows are sent to ClickHouse successfully. Use this command:
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_clickhouse_(errors|flow)'
# HELP akvorado_outlet_clickhouse_errors_total Errors while inserting into ClickHouse
# TYPE akvorado_outlet_clickhouse_errors_total counter
akvorado_outlet_clickhouse_errors_total{error="send"} 7
# HELP akvorado_outlet_clickhouse_flow_per_batch Number of flow per batch sent to ClickHouse
# TYPE akvorado_outlet_clickhouse_flow_per_batch summary
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.5"} 250
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.9"} 480
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.99"} 950
akvorado_outlet_clickhouse_flow_per_batch_sum 45892
akvorado_outlet_clickhouse_flow_per_batch_count 163
If the errors are not increasing and flow_per_batch_sum is increasing,
everything is working correctly.
ClickHouse
The last component to check is ClickHouse. Connect to it with this command:
$ docker compose exec clickhouse clickhouse-client
First, check if all the tables are present:
$ SHOW TABLES
┌─name────────────────────────────────────────────┐
1. │ asns │
2. │ exporters │
3. │ exporters_consumer │
4. │ flows │
5. │ flows_1h0m0s │
6. │ flows_1h0m0s_consumer │
7. │ flows_1m0s │
8. │ flows_1m0s_consumer │
9. │ flows_5m0s │
10. │ flows_5m0s_consumer │
11. │ flows_I6D3KDQCRUBCNCGF4BSOWTRMVIv5_raw │
12. │ flows_I6D3KDQCRUBCNCGF4BSOWTRMVIv5_raw_consumer │
13. │ icmp │
14. │ networks │
15. │ protocols │
16. │ tcp │
17. │ udp │
└─────────────────────────────────────────────────┘
Check if the various dictionaries are populated:
$ SELECT name, element_count FROM system.dictionaries
┌─name──────┬─element_count─┐
1. │ networks │ 5963224 │
2. │ udp │ 5495 │
3. │ icmp │ 58 │
4. │ protocols │ 129 │
5. │ asns │ 99598 │
6. │ tcp │ 5883 │
└───────────┴───────────────┘
If you have not used the console yet, some dictionaries may be empty.
To check if ClickHouse is behind, use this SQL query with clickhouse-client to get the lag in minutes:
SELECT (now()-max(TimeReceived))/60
FROM flows
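If you prefer a one-liner from the Docker host instead of an interactive session, the same query can be run like this:
$ docker compose exec clickhouse clickhouse-client --query 'SELECT (now()-max(TimeReceived))/60 FROM flows'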
If you still have problems, check the errors that are reported by ClickHouse:
SELECT last_error_time, last_error_message
FROM system.errors
ORDER BY last_error_time LIMIT 10
FORMAT Vertical
Console
The most common console problems are empty widgets or no flows shown in the “visualize” tab. Both problems indicate that interface classification is not working correctly.
Interface classification marks interfaces as either “internal” or “external”. If you have not configured interface classification, see the configuration guide. This step is required.
Scaling
Various bottlenecks can cause dropped packets. This is problematic because the reported sampling rate becomes incorrect, and you cannot reliably calculate the number of bytes and packets. Both the exporters and the inlet need to be tuned for this kind of problem.
The outlet can also be a bottleneck. In this case, the flows may appear on the console with a delay.
Exporters
The first problem may be that the exporter is dropping flows. Usually, counters can detect this situation, and you can solve it by reducing the exporter rate.
NCS5500 routers
NetFlow, Sampling-Interval and the Mythical Internet Packet Size contains
a lot of information about the limits of this platform. The first bottleneck is a 133
Mbps shaper between an NPU and the LC CPU for the sampled packets (144 bytes
each). For example, on a NC55-36X100G line card, there are 6 NPUs, and each one
manages 6 ports. If we consider an average packet size of 1000 bytes, the maximum sampling rate when all ports are full is about 1:700 (the formula is Total-BW / (Avg-Pkt-Size × 8) / (133 Mbps / (144 × 8))).
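To make the arithmetic concrete: each NPU on this line card handles 6 × 100 Gbps = 600 Gbps. At 1000 bytes per packet, that is 75 Mpps, while the shaper forwards at most 133 Mbps / (144 bytes × 8) ≈ 115 kpps of sampled packets, so you cannot sample more often than about 75 Mpps / 115 kpps ≈ 1:650, rounded to 1:700 above.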
You can check for drops with sh controllers npu stats voq base 24 instance 0 location 0/0/CPU0 by looking at the COS2 line.
The second bottleneck is the size of the flow cache. If it is too small, it may overflow. For example:
# show flow monitor monitor1 cache internal location 0/1/CPU0 | i Cache
Cache summary for Flow Monitor :
Cache size: 100000
Cache Hits: 202938943
Cache Misses: 1789836407
Cache Overflows: 2166590
Cache above hi water: 1704
When this happens, increase either the cache timeout rate-limit or the cache entries directive. The latter can be raised to 1 million entries per monitor-map.
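As an illustration only, the corresponding IOS-XR configuration could look like the following sketch, assuming a monitor-map named monitor1 as in the example above; the values are arbitrary, so check the documentation for your platform and release for the exact directives and supported ranges:
flow monitor-map monitor1
 cache entries 1000000
 cache timeout rate-limit 2000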
Other routers
Other routers are likely to have the same limitations. Note that sFlow and IPFIX 315 do not have a flow cache and are therefore less likely to have scaling problems.
Inlet
When the inlet has scaling issues, the kernel may drop packets because the receive buffers are full. Each listening socket has a fixed-size receive buffer (212992 bytes by default) to hold packets until the application handles them. When this buffer is full, packets are dropped.
Akvorado reports the number of drops for each listening socket with the
akvorado_inlet_flow_input_udp_in_dropped_packets_total counter. This should be
compared to akvorado_inlet_flow_input_udp_packets_total. Another way to get
the same information is to use ss -lunepm and look at the drop counter:
$ nsenter -t $(pgrep -fo "akvorado inlet") -n ss -lunepm
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 *:2055 *:* users:(("akvorado",pid=2710961,fd=16)) ino:67643151 sk:89c v6only:0 <->
skmem:(r0,rb212992,t0,tb212992,f4096,w0,o0,bl0,d486525)
In the example above, there were 486525 drops. You can solve this in different ways:
- Check the akvorado_inlet_kafka_buffered_produce_records_total gauge. If it is often near the value of kafka→queue-size, increase this value. This setting has the most impact.
- Increase the number of workers for the UDP input.
- Enable the eBPF load balancer on Linux (check docker/docker-compose-local.yml).
- Increase the value of the net.core.rmem_max sysctl on the host and increase the receive-buffer setting for the input to the same value (see the example after this list).
- Increase the number of Kafka brokers.
- Add more inlet instances and shard the exporters among the configured ones.
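For the receive buffer item, the host-side change could look like the following sketch, using 8 MiB as an arbitrary example value; remember that sysctl -w does not persist across reboots, and that the receive-buffer setting of the UDP input must be raised to the same value:
# sysctl -w net.core.rmem_max=8388608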
The value of the receive buffer is also available as a metric:
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep -P 'akvorado_inlet_flow_input_udp_buffer'
# HELP akvorado_inlet_flow_input_udp_buffer_size_bytes Size of the in-kernel buffer for this worker.
# TYPE akvorado_inlet_flow_input_udp_buffer_size_bytes gauge
akvorado_inlet_flow_input_udp_buffer_size_bytes{listener=":2055",worker="2"} 212992
Outlet
The outlet is expected to automatically scale the number of workers to ensure
that the data is delivered efficiently to ClickHouse. Increasing the maximum
number of Kafka workers (max-workers) past the default value of 8 may put more
pressure on ClickHouse. However, you can increase maximum-batch-size.
Moreover, there cannot be more workers than the number of partitions for the
topic. This part is configurable on the orchestrator
(kafka→topic-configuration→num-partitions).
The BMP routing component may have some challenges with scaling, especially when peers disappear, as it requires cleaning up many entries in the routing tables. BMP is a one-way protocol, and the sender may declare the receiver station “stuck” if it does not accept more data. To avoid this, you may need to tune the TCP receive buffer. First, check the current situation:
$ nsenter -t $(pgrep -fo "akvorado outlet") -n ss -tnepm sport = :10179
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.15]:46656 users:(("akvorado",pid=1117049,fd=13)) timer:(keepalive,55sec,0) ino:40752297 sk:11 <->
skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB 0 0 [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.14]:44318 users:(("akvorado",pid=1117049,fd=12)) timer:(keepalive,55sec,0) ino:40751196 sk:12 <->
skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB 0 0 [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.13]:47586 users:(("akvorado",pid=1117049,fd=11)) timer:(keepalive,55sec,0) ino:40751097 sk:13 <->
skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB 0 0 [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.12]:51352 users:(("akvorado",pid=1117049,fd=14)) timer:(keepalive,55sec,0) ino:40752299 sk:14 <->
skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
Here, the receive buffer for each connection is 131,072 bytes. Linux exposes net.ipv4.tcp_rmem to tune this value:
$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456
The middle value is the default one, and the last value is the maximum one. When
net.ipv4.tcp_moderate_rcvbuf is set to 1 (the default), Linux automatically tunes the
size of the buffer for the application to maximize the throughput depending
on the latency. This mechanism will not help here. To increase the receive buffer
size, you need to:
- set the receive-buffer value in the BMP provider configuration (for example, to 33554432 for 32 MiB)
- increase the last value of net.ipv4.tcp_rmem to the same value
- increase the value of net.core.rmem_max to the same value (see the example after this list)
- optionally, increase the last value of net.ipv4.tcp_mem by the value of tcp_rmem[2] multiplied by the maximum number of BMP peers you expect, divided by 4096 bytes per page (check with getconf PAGESIZE)
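Using the 32 MiB example above, the host-side part could look like the following sketch (persist the values in /etc/sysctl.d if you want them to survive a reboot):
# sysctl -w net.core.rmem_max=33554432
# sysctl -w net.ipv4.tcp_rmem="4096 131072 33554432"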
The value of the receive buffer is also available as a metric:
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_routing_provider_bmp_buffer'
# HELP akvorado_outlet_routing_provider_bmp_buffer_size_bytes Size of the in-kernel buffer for this connection.
# TYPE akvorado_outlet_routing_provider_bmp_buffer_size_bytes gauge
akvorado_outlet_routing_provider_bmp_buffer_size_bytes{exporter="241.107.1.12"} 425984
Profiling
On a large-scale installation, you may want to check if Akvorado is using too
much CPU or memory. You can do this with pprof, the Go
profiler. You need a working Go
installation on your workstation.
When running on Docker, use docker inspect to get the IP address of the service
that you want to profile (inlet or outlet):
$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' akvorado-akvorado-inlet-1
240.0.4.8
$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' akvorado-akvorado-outlet-1
240.0.4.9
Then, use one of these two commands:
$ go tool pprof http://240.0.4.8:8080/debug/pprof/profile
$ go tool pprof http://240.0.4.8:8080/debug/pprof/heap
If your Docker host is remote, you also need to use SSH forwarding to expose the HTTP port to your workstation:
$ ssh -L 6060:240.0.4.8:8080 dockerhost.example.com
Then, use one of these two commands:
$ go tool pprof http://127.0.0.1:6060/debug/pprof/profile
$ go tool pprof http://127.0.0.1:6060/debug/pprof/heap
The first one provides a CPU profile. The second one provides a memory profile. On the
command line, you can type web to visualize the result in the browser or svg
to get an SVG file that you can attach to a bug report if needed.
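Alternatively, go tool pprof can serve the interactive visualization over HTTP; pick any free local port:
$ go tool pprof -http=:8081 http://127.0.0.1:6060/debug/pprof/heap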