Over the last few years, across multiple managed clients, we've settled on a short list of metrics we actually watch on their UniFi networks. The controller shows hundreds of numbers, graphs and indicators. That's the vendor being honest about everything it can measure. But not every one of those numbers is worth watching, and most of them never change a decision.
This is the list of what we actually look at — and a short note on why we turned alerts off for the rest.
What we watch
- WAN uptime and latency. Real ISP reachability, not just 'WAN port up'. We ping 1.1.1.1 and 8.8.8.8 every 30 seconds. Three failures in a row = alert. This covers roughly 90% of all alerts that ever require action.
- AP uplink negotiation speed. If an access point that should be on 1 Gbps suddenly reports as 100 Mbps, something physical is wrong — loose cable, dying SFP module, moisture. Negotiation-change alerts have caught three cases we'd otherwise have found only after the client complained.
- DHCP pool utilisation. At 80% full we have two weeks to extend the pool. At 100% full we have a phone call from the director.
- Top 10 clients by traffic. We don't alert on this — we look at it on Monday morning. Changes usually mean something: a new device, a guest pulling 4K Netflix on the guest network, an IoT sensor that decided to sync with something unusual.
- CPU and memory on the gateway. Not on switches. Not on access points. Just the gateway. The one place that, when it falls over, takes everything else down with it.
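The "three failures in a row" rule for the WAN check can be sketched as a small streak counter. This is a minimal illustration, not our production check: the `FailureStreak` type and its `Observe` method are hypothetical names, the probe results are simulated rather than real pings, and how results from the two ping targets are combined is left open here.

```go
package main

import "fmt"

// FailureStreak tracks consecutive failed probes for one target.
// Hypothetical helper: the threshold of 3 comes from the rule in the text,
// everything else is an assumption made for illustration.
type FailureStreak struct {
	Threshold int  // consecutive failures required before alerting
	failures  int  // current run of failed probes
	alerting  bool // true once the current outage has already alerted
}

// Observe records one probe result. It returns true only on the probe that
// brings the streak up to the threshold, so one outage raises one alert.
func (s *FailureStreak) Observe(ok bool) bool {
	if ok {
		// Any successful probe ends the streak and re-arms the alert.
		s.failures = 0
		s.alerting = false
		return false
	}
	s.failures++
	if s.failures >= s.Threshold && !s.alerting {
		s.alerting = true
		return true
	}
	return false
}

func main() {
	streak := &FailureStreak{Threshold: 3}

	// Simulated probe results, one per 30-second interval (made-up data).
	results := []bool{true, false, false, true, false, false, false, false, true}
	for i, ok := range results {
		if streak.Observe(ok) {
			fmt.Printf("probe %d: alert, WAN target unreachable\n", i)
		}
	}
}
```

The two isolated failures early in the simulated run never alert, because the success between them resets the counter. The longer outage alerts once, on its third failed probe, and stays quiet until a success re-arms it.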
What we deliberately ignore
- Per-client RSSI graphs. UniFi will draw them, but they'll never tell you anything actionable. Wi-Fi is physics. A client in the corner of the office will have weaker signal than a client next to the AP. This does not need an alert.
- Wireless retries. Mostly a function of environmental physics — microwave, neighbouring network, glass wall. Worth seeing on a graph, not worth alerting on.
- Firmware update notifications. We do those manually once a month, on a Wednesday evening, after a quick changelog read. We don't run auto-updates on production infrastructure as a rule.
- 'Port saturation' on switches. We'd rather alert on full link drops. 90% saturation isn't a problem. 100% link loss is.
The principle is simple: a dashboard nobody opens at three in the morning is a dashboard that doesn't exist. Pretty graphs are for showing the wider network-health picture during a review. Actionable alerts have to be the kind that, when they ring, someone gets up from their desk.
The alert we wish UniFi sent on its own
UniFi tells you when an access point goes offline. It doesn't tell you when an access point has quietly fallen to half its expected throughput. We had a client where one AP looked online and reported all-green but, sitting at an 18-degree angle to the ceiling, was actually delivering 40 Mbps instead of the expected 300. The clients felt it. The controller said nothing.
We wrote a small Go script that runs a DNS lookup through each AP once an hour and measures the response time. The result goes to Datadog as a custom metric. Alert: any AP whose latency for the same query is more than 3× its 24-hour median. Since then we know about slowly-degrading APs two or three days before the client starts raising tickets.
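The comparison behind that alert is small enough to sketch. This is not the script itself, only an illustration of the "3× the median" rule: the function names `median` and `IsDegraded` are hypothetical, latencies are plain milliseconds, and the sample history is made up. How the lookup is routed through a specific AP and how the metric reaches Datadog are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of the samples, or 0 for an empty slice.
// It sorts a copy so the caller's slice is left untouched.
func median(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid]
	}
	return (sorted[mid-1] + sorted[mid]) / 2
}

// IsDegraded reports whether the latest latency (in ms) exceeds factor times
// the median of the history window. With no usable history it returns false
// rather than alerting on nothing.
func IsDegraded(historyMs []float64, latestMs, factor float64) bool {
	m := median(historyMs)
	if m <= 0 {
		return false
	}
	return latestMs > factor*m
}

func main() {
	// Made-up hourly DNS latencies for one AP, in milliseconds.
	history := []float64{20, 22, 19, 25, 21}

	for _, latest := range []float64{24, 70} {
		fmt.Printf("latest=%.0fms median=%.0fms degraded=%v\n",
			latest, median(history), IsDegraded(history, latest, 3))
	}
}
```

A median is used here rather than a mean so that one slow lookup in the window does not drag the baseline up and hide a real degradation.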
The best monitoring is invisible. It sends a message when something actually needs you. Otherwise it stays quiet.
This list isn't final. We add a metric when we notice a new pattern; we remove an alert when it hasn't led to actionable work in a month. If we manage your network, this setup turns up the first Monday after handover. If you run it yourself and want to know which five metrics will actually save you, drop us a line. We'll happily walk through it on an hour-long call.