Doing Deadman Checks With Prometheus Metrics in MetricsQL

The lack of a good way to do deadman checks is one of my biggest gripes with PromQL. Fortunately, MetricsQL (a superset of PromQL) adds this functionality, and it’s very easy to use. If you’re not familiar with MetricsQL, it’s the built-in query language of VictoriaMetrics, a horizontally scalable time series database. VictoriaMetrics is my go-to database for storing Prometheus metrics because of how performant and easy to use it is compared to alternatives like Mimir or Thanos.

Using the ‘lag’ function

The lag function in MetricsQL (https://docs.victoriametrics.com/victoriametrics/metricsql/#lag) is what lets us do this. It returns the duration in seconds between the last sample in the given lookbehind window and the timestamp of the current point. It is calculated independently for each time series returned by the given series selector. Here’s an example:

lag(metric_name{label="value"}[1h]) > 600

Let’s break apart what this query is doing. First we’re getting all of the series that match the selector metric_name{label="value"}. In PromQL you usually have to do deadman checks by tracking a single specific series and checking whether it exists or not with a function such as absent or count, but that doesn’t apply here. Our series selector can return as many series as we want. Then we wrap the selector in the lag function with a lookbehind window of [1h]. This says we’re only looking at series that have reported within the last hour. Finally, we check whether the lag of each series (the time since its last datapoint) is greater than 600 seconds, which leaves only the series that have stopped reporting data.

To describe it in one sentence, the above expression is getting all of the series within that series selector that have reported a datapoint within the past hour, but haven’t reported a datapoint within the past 10 minutes (600 seconds).
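
To make the semantics concrete, here is a rough Python sketch of what the query does at a single evaluation time. It is not how VictoriaMetrics implements lag; the function names (lag, deadman), the series names, and the exact handling of the window boundary are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def lag(timestamps: List[float], now: float, window: float) -> Optional[float]:
    """Seconds between `now` and the newest sample inside the lookbehind window.

    Returns None when the window holds no samples, mirroring a series that
    simply does not show up in the query result. (Illustrative helper only.)
    """
    in_window = [t for t in timestamps if now - window < t <= now]
    if not in_window:
        return None
    return now - max(in_window)


def deadman(
    series: Dict[str, List[float]],
    now: float,
    window: float = 3600,    # the [1h] lookbehind window
    threshold: float = 600,  # the "> 600" comparison
) -> Dict[str, float]:
    """Keep only series whose lag exceeds the threshold, like `lag(...) > 600`."""
    result = {}
    for name, timestamps in series.items():
        value = lag(timestamps, now, window)
        if value is not None and value > threshold:
            result[name] = value
    return result


now = 10_000.0
series = {
    "healthy": [now - 30],      # reported 30 seconds ago
    "silent": [now - 900],      # last reported 15 minutes ago
    "long_gone": [now - 7200],  # last reported 2 hours ago, outside the window
}

stalled = deadman(series, now)
print(stalled)  # {'silent': 900.0}
```

In this sketch only the "silent" series is returned: "healthy" is under the 600 second threshold, and "long_gone" has no sample inside the one-hour window, so it is not considered at all. Widening the window means a stalled series keeps showing up for longer after it stops reporting.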