# Troubleshooting

> [!WARNING]
> Please read this page carefully before you open an issue or start a discussion.

> [!TIP]
> This guide assumes that you use the *Docker Compose* setup. If you use a different setup, adapt the commands as needed.

As explained in the [introduction](00-intro#big-picture), Akvorado has several components. To troubleshoot an issue, inspect each component.

![Functional view](troubleshoot.svg)

Your routers send flows to the *inlet*, which sends them to *Kafka*. The *outlet* takes flows from Kafka, decodes and processes them, and then sends them to *ClickHouse*. The *orchestrator* configures *Kafka* and *ClickHouse* and provides the configuration for the *inlet* and *outlet*. The *console* (not shown here) queries *ClickHouse* to display flows to users.

## Basic checks

First, check that you have enough space. This is a common cause of failure:

```console
$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          7         7         1.819GB   7.834MB (0%)
Containers      15        15        2.752GB   0B (0%)
Local Volumes   16        9         69.24GB   8.594GB (12%)
Build Cache     4         0         5.291MB   5.291MB
```

You can recover space with `docker system prune` or get more details with `docker system df -v`. See the documentation about [operations](04-operations.md#clickhouse) on how to check space usage for ClickHouse.

> [!CAUTION]
> Do **not** use `docker system prune -a` unless you are sure that all your containers are up and running. It is important to understand that this command removes anything that is not currently used.

Check that all components are running and healthy:

```console
$ docker compose ps --format "table {{.Service}}\t{{.Status}}"
SERVICE                    STATUS
akvorado-conntrack-fixer   Up 28 minutes
akvorado-console           Up 27 minutes (healthy)
akvorado-inlet             Up 27 minutes (healthy)
akvorado-orchestrator      Up 27 minutes (healthy)
akvorado-outlet            Up 27 minutes (healthy)
clickhouse                 Up 28 minutes (healthy)
geoip                      Up 28 minutes (healthy)
kafka                      Up 28 minutes (healthy)
kafka-ui                   Up 28 minutes
redis                      Up 28 minutes (healthy)
traefik                    Up 28 minutes
```

Make sure that all components are present. If a component is missing, restarting, unhealthy, or not working correctly, check its logs:

```console
$ docker compose logs akvorado-inlet
```

The *inlet*, *outlet*, *orchestrator*, and *console* expose metrics. Get them with this command:

```console
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics
# HELP akvorado_cmd_info Akvorado build information
# TYPE akvorado_cmd_info gauge
akvorado_cmd_info{compiler="go1.24.4",version="v1.11.5-134-gaf3869cd701c"} 1
[...]
```

> [!CAUTION]
> Run the `curl` command on the same host that runs Akvorado, and change `inlet` to the name of the component that you are interested in.

To see only error metrics, filter them:

```console
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_.*_error'
```
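To run the same check against every component at once, you can loop over them. This is a minimal sketch; it assumes the default endpoint of the Docker Compose setup shown above:

```sh
# Show the error counters of each Akvorado component.
for component in inlet outlet orchestrator console; do
  echo "== $component =="
  curl -s "http://127.0.0.1:8080/api/v0/$component/metrics" \
    | grep 'akvorado_.*_error' \
    || echo "(no error metrics)"
done
```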
> [!TIP]
> To follow this guide on a working system, replace `http://127.0.0.1:8080` with `https://demo.akvorado.net`.

### Inlet service

The inlet service receives NetFlow/IPFIX/sFlow packets and sends them to Kafka. First, check if you are receiving packets from exporters (your routers):

```console
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_inlet_flow_input_udp_packets'
# HELP akvorado_inlet_flow_input_udp_packets_total Packets received by the application.
# TYPE akvorado_inlet_flow_input_udp_packets_total counter
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.12",listener=":2055",worker="2"} 6769
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.13",listener=":2055",worker="1"} 6794
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.14",listener=":2055",worker="2"} 6765
akvorado_inlet_flow_input_udp_packets_total{exporter="241.107.1.15",listener=":2055",worker="0"} 6782
```

If your exporters are not listed, check their configuration. You can also use `tcpdump` to verify that they are sending packets. Replace the IP with the IP address of the exporter and the port with the correct port (2055 for NetFlow, 4739 for IPFIX and 6343 for sFlow).

```console
# tcpdump -c3 -pni any host 241.107.1.12 and port 2055
09:11:08.729738 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 624
09:11:08.729787 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 1060
09:11:08.729799 IP 241.107.1.12.44026 > 240.0.2.9.2055: UDP, length 1060
3 packets captured
3 packets received by filter
0 packets dropped by kernel
```

Next, check if flows are sent to Kafka correctly:

```console
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep 'akvorado_inlet_kafka_sent_messages'
# HELP akvorado_inlet_kafka_sent_messages_total Number of messages sent from a given exporter.
# TYPE akvorado_inlet_kafka_sent_messages_total counter
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.12"} 8108
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.13"} 8117
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.14"} 8090
akvorado_inlet_kafka_sent_messages_total{exporter="241.107.1.15"} 8123
```

If no messages appear here, there may be a problem with Kafka.

### Kafka

The *inlet* sends messages to Kafka, and the *outlet* takes them from Kafka. The Docker Compose setup comes with [UI for Apache Kafka](https://github.com/provectus/kafka-ui). You can access it at `http://127.0.0.1:8080/kafka-ui`.

> [!TIP]
> For security reasons, this UI is not exposed on anything other than the host that is running Akvorado. If you need to access it remotely, the easiest way is to use [SSH port forwarding](https://www.digitalocean.com/community/tutorials/ssh-port-forwarding): `ssh -L 8080:127.0.0.1:8080 akvorado`. Then, you can use `http://127.0.0.1:8080/kafka-ui` directly from your workstation.

Check the various tabs (brokers, topics, and consumers) to make sure that everything is green. In “brokers”, you should see one broker. In “topics”, you should see `flows-v5` with an increasing number of messages. This means that the *inlet* is pushing messages. In “consumers”, you should have `akvorado-outlet`, with at least one member. The consumer lag should be stable (and low). This is the number of messages that the *outlet* has not yet processed.

### Outlet

The *outlet* is the most complex component. Check if it works correctly with this command (it should show one processed flow):

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/flows\?limit\=1
{"TimeReceived":1753631373,"SamplingRate":100000,"ExporterAddress":"::ffff:241.107.1.15","InIf":10,"OutIf":21,"SrcVlan":0,"DstVlan":0,"SrcAddr":"::ffff:216.58.206.244","DstAddr":"::ffff:192.0.2.144","NextHop":"","SrcAS":15169,"DstAS":64501,"SrcNetMask":24,"DstNetMask":24,"OtherColumns":null}
```
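If you want to look at a few more flows, or only at those from a specific exporter, you can pipe the output to `jq`. This is a minimal sketch: it assumes the endpoint returns one JSON object per flow, as in the output above, and the exporter address and selected fields are only examples:

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/flows\?limit\=10 \
    | jq -c 'select(.ExporterAddress == "::ffff:241.107.1.15")
             | {TimeReceived, ExporterAddress, SrcAS, DstAS, SamplingRate}'
```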
Check these important metrics. First, the outlet should receive flows from Kafka:

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_kafka_received_messages'
# HELP akvorado_outlet_kafka_received_messages_total Number of messages received for a given worker.
# TYPE akvorado_outlet_kafka_received_messages_total counter
akvorado_outlet_kafka_received_messages_total{worker="0"} 5561
akvorado_outlet_kafka_received_messages_total{worker="1"} 5456
akvorado_outlet_kafka_received_messages_total{worker="2"} 5583
akvorado_outlet_kafka_received_messages_total{worker="3"} 11068
akvorado_outlet_kafka_received_messages_total{worker="4"} 11151
akvorado_outlet_kafka_received_messages_total{worker="5"} 5588
```

If these numbers are not increasing, there is a problem with receiving from Kafka. If everything is OK, check if the flow processing pipeline works correctly:

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_core_(received|forwarded)'
# HELP akvorado_outlet_core_forwarded_flows_total Number of flows forwarded to Kafka.
# TYPE akvorado_outlet_core_forwarded_flows_total counter
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.12"} 182512
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.13"} 182366
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.14"} 182278
akvorado_outlet_core_forwarded_flows_total{exporter="241.107.1.15"} 182900
# HELP akvorado_outlet_core_received_flows_total Number of incoming flows.
# TYPE akvorado_outlet_core_received_flows_total counter
akvorado_outlet_core_received_flows_total{exporter="241.107.1.12"} 182512
akvorado_outlet_core_received_flows_total{exporter="241.107.1.13"} 182366
akvorado_outlet_core_received_flows_total{exporter="241.107.1.14"} 182278
akvorado_outlet_core_received_flows_total{exporter="241.107.1.15"} 182900
# HELP akvorado_outlet_core_received_raw_flows_total Number of incoming raw flows (proto).
# TYPE akvorado_outlet_core_received_raw_flows_total counter
akvorado_outlet_core_received_raw_flows_total 45812
```

Notably, `akvorado_outlet_core_received_raw_flows_total` is incremented by one for each message that is received from Kafka. The message is then decoded, and the flows are extracted. For each extracted flow, `akvorado_outlet_core_received_flows_total` is incremented by one. The flows are then enriched, and before they are sent to ClickHouse, `akvorado_outlet_core_forwarded_flows_total` is incremented.

If `akvorado_outlet_core_received_raw_flows_total` increases but `akvorado_outlet_core_received_flows_total` does not, there is an error **decoding the flows**. If `akvorado_outlet_core_received_flows_total` increases but `akvorado_outlet_core_forwarded_flows_total` does not, there is an error **enriching the flows**.

For the first case, use this command to find clues:

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_flow.*errors'
```

For the second case, use this one:

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep 'akvorado_outlet_core.*errors'
```

Here is a list of errors that you may find:

- `metadata cache miss` means that interface information is missing from the metadata cache. The most likely cause is that the exporter does not accept SNMP requests or the SNMP community is configured incorrectly.
- `sampling rate missing` means that the sampling rate information is not present. This is normal when Akvorado starts, but it should not keep increasing. With NetFlow, the sampling rate is sent in an options data packet. Make sure that your exporter sends them (look for `sampler-table` in the documentation). Alternatively, you can configure `outlet`→`core`→`default-sampling-rate` to work around this issue.
- `input and output interfaces missing` means that the flow does not contain input and output interface indexes. Fix this on the exporter.
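These counters only increase while the outlet runs, so a non-zero value does not necessarily point to a current problem. To see which counters are still increasing, you can take two samples one minute apart and compare them (a minimal sketch):

```sh
# Compare two samples of the outlet error counters taken one minute apart.
url=http://127.0.0.1:8080/api/v0/outlet/metrics
curl -s "$url" | grep -P 'akvorado_outlet_(flow|core).*errors' > /tmp/errors.before
sleep 60
curl -s "$url" | grep -P 'akvorado_outlet_(flow|core).*errors' > /tmp/errors.after
diff /tmp/errors.before /tmp/errors.after  # counters that differ are still increasing
```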
A convenient way to check if the SNMP configuration is correct is to use `tcpdump` from the outlet's network namespace:

```console
# nsenter -t $(pgrep -fo "akvorado outlet") -n tcpdump -c3 -pni eth0 port 161
20:46:44.812243 IP 240.0.2.11.34554 > 240.0.2.13.161: C="private" GetRequest(95) .1.3.6.1.2.1.1.5.0 .1.3.6.1.2.1.2.2.1.2.11 .1.3.6.1.2.1.31.1.1.1.1.11 .1.3.6.1.2.1.31.1.1.1.18.11 .1.3.6.1.2.1.31.1.1.1.15.11
20:46:45.144567 IP 240.0.2.13.161 > 240.0.2.11.34554: C="private" GetResponse(153) .1.3.6.1.2.1.1.5.0="dc3-edge1.example.com" .1.3.6.1.2.1.2.2.1.2.11="Gi0/0/0/11" .1.3.6.1.2.1.31.1.1.1.1.11="Gi0/0/0/11" .1.3.6.1.2.1.31.1.1.1.18.11="Transit: Lumen" .1.3.6.1.2.1.31.1.1.1.15.11=10000
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel
```

If you do not get an answer, there may be several causes:

- the community is incorrect, and you need to fix it
- the exporter is not configured to answer SNMP requests

Finally, check if flows are sent to ClickHouse successfully. Use this command:

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_clickhouse_(errors|flow)'
# HELP akvorado_outlet_clickhouse_errors_total Errors while inserting into ClickHouse
# TYPE akvorado_outlet_clickhouse_errors_total counter
akvorado_outlet_clickhouse_errors_total{error="send"} 7
# HELP akvorado_outlet_clickhouse_flow_per_batch Number of flow per batch sent to ClickHouse
# TYPE akvorado_outlet_clickhouse_flow_per_batch summary
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.5"} 250
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.9"} 480
akvorado_outlet_clickhouse_flow_per_batch{quantile="0.99"} 950
akvorado_outlet_clickhouse_flow_per_batch_sum 45892
akvorado_outlet_clickhouse_flow_per_batch_count 163
```

If the errors are not increasing and `flow_per_batch_sum` is increasing, everything is working correctly.

### ClickHouse

The last component to check is ClickHouse. Connect to it with this command:

```console
$ docker compose exec clickhouse clickhouse-client
```

First, check if all the tables are present:

```console
$ SHOW TABLES
    ┌─name────────────────────────────────────────────┐
 1. │ asns                                             │
 2. │ exporters                                        │
 3. │ exporters_consumer                               │
 4. │ flows                                            │
 5. │ flows_1h0m0s                                     │
 6. │ flows_1h0m0s_consumer                            │
 7. │ flows_1m0s                                       │
 8. │ flows_1m0s_consumer                              │
 9. │ flows_5m0s                                       │
10. │ flows_5m0s_consumer                              │
11. │ flows_I6D3KDQCRUBCNCGF4BSOWTRMVIv5_raw           │
12. │ flows_I6D3KDQCRUBCNCGF4BSOWTRMVIv5_raw_consumer  │
13. │ icmp                                             │
14. │ networks                                         │
15. │ protocols                                        │
16. │ tcp                                              │
17. │ udp                                              │
    └──────────────────────────────────────────────────┘
```

Check if the various dictionaries are populated:

```console
$ SELECT name, element_count FROM system.dictionaries
   ┌─name──────┬─element_count─┐
1. │ networks  │       5963224 │
2. │ udp       │          5495 │
3. │ icmp      │            58 │
4. │ protocols │           129 │
5. │ asns      │         99598 │
6. │ tcp       │          5883 │
   └───────────┴───────────────┘
```

If you have not used the console yet, some dictionaries may be empty.
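You can also verify that recent flows reach the `flows` table for each exporter. This is a minimal sketch run from the host with `clickhouse-client`; the `ExporterAddress` and `TimeReceived` columns are the ones shown in the outlet output above:

```console
$ docker compose exec clickhouse clickhouse-client --query \
    "SELECT ExporterAddress, count() AS flows, max(TimeReceived) AS last
     FROM flows
     WHERE TimeReceived > now() - INTERVAL 5 MINUTE
     GROUP BY ExporterAddress"
```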
To check if ClickHouse is behind, use this SQL query with `clickhouse-client` to get the lag in minutes:

```sql
SELECT (now()-max(TimeReceived))/60 FROM flows
```

If you still have problems, check the errors that are reported by ClickHouse:

```sql
SELECT last_error_time, last_error_message
FROM system.errors
ORDER BY last_error_time
LIMIT 10
FORMAT Vertical
```

### Console

The most common console problems are empty widgets or no flows shown in the “visualize” tab. Both problems indicate that interface classification is not working correctly. Interface classification marks interfaces as either “internal” or “external”. If you have not configured interface classification, see the [configuration guide](02-configuration.md#classification). This step is required.

## Scaling

Various bottlenecks can cause dropped packets. This is problematic because the reported sampling rate becomes incorrect, and you cannot reliably calculate the number of bytes and packets. Both the exporters and the inlet may need to be tuned to avoid this kind of problem. The outlet can also be a bottleneck. In this case, the flows may appear on the console with a delay.

### Exporters

The first problem may be that the exporter is dropping flows. Usually, counters can detect this situation, and you can solve it by reducing the export rate.

#### NCS5500 routers

[NetFlow, Sampling-Interval and the Mythical Internet Packet Size][1] contains a lot of information about the limits of this platform. The first bottleneck is a 133 Mbps shaper between an NPU and the LC CPU for the sampled packets (144 bytes each). For example, on a NC55-36X100G line card, there are 6 NPUs, and each one manages 6 ports. If we consider an average packet size of 1000 bytes, the maximum sampling rate when all ports are full is 1:700 (the formula is `Total-BW / ( Avg-Pkt-Size x 133Mbps ) x ( 144 x 8 )`).

[1]: https://xrdocs.io/ncs5500/tutorials/2018-02-19-netflow-sampling-interval-and-the-mythical-internet-packet-size/

You can check if there are drops with `sh controllers npu stats voq base 24 instance 0 location 0/0/CPU0` by looking at the `COS2` line.

The second bottleneck is the size of the flow cache. If it is too small, it may overflow. For example:

```console
# show flow monitor monitor1 cache internal location 0/1/CPU0 | i Cache
Cache summary for Flow Monitor :
Cache size:                         100000
Cache Hits:                      202938943
Cache Misses:                   1789836407
Cache Overflows:                   2166590
Cache above hi water:                 1704
```

When this happens, increase either the `cache timeout rate-limit` or the `cache entries` directive. The latter can be increased to 1 million entries per monitor-map.

#### Other routers

Other routers are likely to have similar limitations. Note that sFlow and IPFIX 315 do not use a flow cache and are therefore less likely to have scaling problems.

### Inlet

When the inlet has scaling issues, the kernel's receive buffers may drop packets. Each listening queue has a fixed amount of receive buffer space (212992 bytes by default) to keep packets before they are handled by the application. When this buffer is full, packets are dropped.

*Akvorado* reports the number of drops for each listening socket with the `akvorado_inlet_flow_input_udp_in_dropped_packets_total` counter. This should be compared to `akvorado_inlet_flow_input_udp_packets_total`.
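To compare the two counters quickly, you can sum them with `awk`. This is a minimal sketch; it reports drops as a share of received packets and assumes at least one packet has been received:

```console
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics \
    | awk '/^akvorado_inlet_flow_input_udp(_in_dropped)?_packets_total/ { sum[$1 ~ /dropped/] += $2 }
           END { printf "received: %d, dropped: %d (%.3f%%)\n", sum[0], sum[1], 100 * sum[1] / sum[0] }'
```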
Another way to get the same information is to use `ss -lunepm` and look at the drop counter:

```console
$ nsenter -t $(pgrep -fo "akvorado inlet") -n ss -lunepm
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0      0                 *:2055            *:*     users:(("akvorado",pid=2710961,fd=16)) ino:67643151 sk:89c v6only:0 <->
	skmem:(r0,rb212992,t0,tb212992,f4096,w0,o0,bl0,d486525)
```

In the example above, there were 486525 drops (the `d` counter in `skmem`). You can solve this in three ways:

- increase the number of workers for the UDP input,
- increase the value of the `net.core.rmem_max` sysctl (on the host) and increase the `receive-buffer` setting for the input to the same value,
- add more inlet instances and shard the exporters among the configured ones.

The value of the receive buffer is also available as a metric:

```console
$ curl -s http://127.0.0.1:8080/api/v0/inlet/metrics | grep -P 'akvorado_inlet_flow_input_udp_buffer'
# HELP akvorado_inlet_flow_input_udp_buffer_size_bytes Size of the in-kernel buffer for this worker.
# TYPE akvorado_inlet_flow_input_udp_buffer_size_bytes gauge
akvorado_inlet_flow_input_udp_buffer_size_bytes{listener=":2055",worker="2"} 212992
```

### Outlet

The outlet is expected to automatically scale the number of workers to ensure that the data is delivered efficiently to ClickHouse. Increasing the maximum number of Kafka workers (`max-workers`) past the default value of 8 may put more pressure on ClickHouse. Instead, you can increase `maximum-batch-size`.

The BMP routing component may have some challenges with scaling, especially when peers disappear, as this requires cleaning up many entries in the routing tables. BMP is a one-way protocol, and the sender may declare the receiving station “stuck” if it does not accept more data. To avoid this, you may need to tune the TCP receive buffer. First, check the current situation:

```console
$ nsenter -t $(pgrep -fo "akvorado outlet") -n ss -tnepm sport = :10179
State Recv-Q Send-Q       Local Address:Port        Peer Address:Port  Process
ESTAB 0      0      [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.15]:46656 users:(("akvorado",pid=1117049,fd=13)) timer:(keepalive,55sec,0) ino:40752297 sk:11 <->
	skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB 0      0      [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.14]:44318 users:(("akvorado",pid=1117049,fd=12)) timer:(keepalive,55sec,0) ino:40751196 sk:12 <->
	skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB 0      0      [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.13]:47586 users:(("akvorado",pid=1117049,fd=11)) timer:(keepalive,55sec,0) ino:40751097 sk:13 <->
	skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
ESTAB 0      0      [::ffff:240.0.2.10]:10179 [::ffff:240.0.2.12]:51352 users:(("akvorado",pid=1117049,fd=14)) timer:(keepalive,55sec,0) ino:40752299 sk:14 <->
	skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d2976)
```

Here, the receive buffer for each connection is 131,072 bytes (`rb131072`). Linux exposes `net.ipv4.tcp_rmem` to tune this value:

```console
$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456
```

The middle value is the default one, and the last value is the maximum one. When `net.ipv4.tcp_moderate_rcvbuf` is set to 1 (the default), Linux automatically tunes the size of the buffer for the application to maximize the throughput depending on the latency. This mechanism will not help here.

To increase the receive buffer size, you need to:

- set the `receive-buffer` value in the BMP provider configuration (for example, to 33554432 for 32 MiB),
- increase the last value of `net.ipv4.tcp_rmem` to the same value (see the sysctl sketch after this list),
- increase the value of `net.core.rmem_max` to the same value,
- optionally, increase the last value of `net.ipv4.tcp_mem` by the last value of `net.ipv4.tcp_rmem` multiplied by the maximum number of BMP peers you expect, divided by the page size in bytes (usually 4096, check with `getconf PAGESIZE`).
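For example, for a 32 MiB receive buffer, the host-side sysctls could look like this. This is a sketch: the file name is arbitrary, the first two `net.ipv4.tcp_rmem` values are the defaults shown above, and you still need to set `receive-buffer` in the BMP provider configuration to the same value:

```console
$ cat /etc/sysctl.d/90-akvorado-bmp.conf
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 4096 131072 33554432
$ sysctl -p /etc/sysctl.d/90-akvorado-bmp.conf
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 4096 131072 33554432
```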
The value of the receive buffer is also available as a metric:

```console
$ curl -s http://127.0.0.1:8080/api/v0/outlet/metrics | grep -P 'akvorado_outlet_routing_provider_bmp_buffer'
# HELP akvorado_outlet_routing_provider_bmp_buffer_size_bytes Size of the in-kernel buffer for this connection.
# TYPE akvorado_outlet_routing_provider_bmp_buffer_size_bytes gauge
akvorado_outlet_routing_provider_bmp_buffer_size_bytes{exporter="241.107.1.12"} 425984
```

### Profiling

On a large-scale installation, you may want to check if *Akvorado* is using too much CPU or memory. You can do this with `pprof`, the [Go profiler](https://go.dev/blog/pprof). You need a working [Go installation](https://go.dev/doc/install) on your workstation.

When running on Docker, use `docker inspect` to get the IP address of the service that you want to profile (inlet or outlet):

```console
$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' akvorado_akvorado-inlet_1
240.0.4.8
$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' akvorado_akvorado-outlet_1
240.0.4.9
```

Then, use one of these two commands:

```console
$ go tool pprof http://240.0.4.8:8080/debug/pprof/profile
$ go tool pprof http://240.0.4.8:8080/debug/pprof/heap
```

If your Docker host is remote, you also need to use SSH forwarding to expose the HTTP port to your workstation:

```console
$ ssh -L 6060:240.0.4.8:8080 dockerhost.example.com
```

Then, use one of these two commands:

```console
$ go tool pprof http://127.0.0.1:6060/debug/pprof/profile
$ go tool pprof http://127.0.0.1:6060/debug/pprof/heap
```

The first one provides a CPU profile. The second one provides a memory profile. On the command line, you can type `web` to visualize the result in the browser or `svg` to get an SVG file that you can attach to a bug report if needed.
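If you prefer a graphical view right away, `go tool pprof` can also serve its report over HTTP, and the CPU profile duration can be adjusted with the `seconds` parameter. A sketch, reusing the forwarded port from above:

```console
$ go tool pprof -http 127.0.0.1:8081 "http://127.0.0.1:6060/debug/pprof/profile?seconds=30"
```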