Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Enterprise Edition
-
- Affects all currently-shipping Observium versions. The buggy query has been unchanged in `poller-wrapper.py` for at least 5 years (CE 19.8 → present, including the Feb 2026 CE 26.1 release).
- Manifests on **MariaDB 10.3+** (verified on 10.6.5; perOBS-5164likely also 10.11.6).
- Requires distributed polling. Single-poller deployments are unaffected.- Affects all currently-shipping Observium versions. The buggy query has been unchanged in `poller-wrapper.py` for at least 5 years (CE 19.8 → present, including the Feb 2026 CE 26.1 release). - Manifests on **MariaDB 10.3+** (verified on 10.6.5; per OBS-5164 likely also 10.11.6). - Requires distributed polling. Single-poller deployments are unaffected.
Description
I used AI to generate this ticket so if you need more information from me, let me know.
TL;DR
Observium's distributed-poller wrapper splits devices across multiple poller
hosts using a clever-but-fragile SQL trick that is only guaranteed to work
on MySQL by accident. On MariaDB, the same trick gives different answers on
different runs. The real-world consequence:
- Each cron tick, every poller host asks the database "which devices are
mine to poll?" - On MariaDB, the answer randomly changes between runs.
- Each device ends up polled on a random schedule instead of every 5 minutes.
- When the gap exceeds 10 minutes, RRDtool records `NaN` (no value).
- Graphs in the web UI show holes.
Summary
When using distributed pollers (`observium-wrapper poller -i N -n i` with
`N >= 2`) against a MariaDB 10.3+ master database, devices get randomly
reshuffled across poller instances on every cron tick. Many devices end up
polled at intervals exceeding the RRD `minimal_heartbeat` (default 600s),
causing `-nan` rows in the resulting RRDs. The same configuration on MySQL
8/9 happens to work because of an optimizer detail that masks the underlying
SQL bug; on MariaDB it is exposed.
Symptom
`rrdtool fetch` on a recent RRD shows alternating real values and `-nan` rows
on a regular cadence:
1777343700: 1.7722e+08 1.7764e+08 ...
|
1777344000: -nan -nan -nan ...
|
1777344300: -nan -nan -nan ...
|
1777344600: -nan -nan -nan ...
|
1777344900: 1.7732e+08 1.7748e+08 ...
|
1777345200: -nan -nan -nan ...
|
In the `devices` table, `last_polled` for individual devices drifts to 15-45
minutes ago even though `observium-wrapper poller` reports "processed N
devices in X seconds" every cron tick on every distributed poller. Each cron
tick the wrapper processes a different random subset of the devices it
should own.
Root cause
`poller-wrapper.py` (around lines 793-815 in CE 26.1) uses this SQL to
compute the per-instance shard:
SELECT `device_id` FROM (SELECT @rownum := 0) r, |
(SELECT @rownum := @rownum + 1 AS rownum, device_id |
FROM `devices` |
WHERE ... |
ORDER BY `device_id` ASC) temp |
WHERE MOD(temp.rownum, %s) = %s |
The MySQL/MariaDB manuals both state that "the order of evaluation for
expressions involving user variables is undefined" (see Sources, claims 1-2).
On MariaDB the optimizer typically chooses a secondary index (e.g.
`disabled_lastpolled`) for this query and adds `Using temporary;
Using filesort` to satisfy the inner `ORDER BY device_id`. The
`@rownum := @rownum + 1` assignment runs during the secondary-index scan
{}before{} the filesort, so the rownum baked into each row reflects
secondary-index scan order, not sorted-device_id position. Combined with
InnoDB stats auto-recalc, the chosen plan can shift between executions,
producing different `MOD(rownum, N) = i` row sets across cron ticks.
`EXPLAIN` on MariaDB 10.6:
key: disabled_lastpolled
|
type: range
|
Extra: Using index condition; Using where; Using temporary; Using filesort
|
`EXPLAIN` on MySQL 8
key: PRIMARY
|
type: index
|
Extra: (none)
|
MySQL's optimizer is generally more biased than MariaDB's toward picking an
index that already satisfies `ORDER BY` so a filesort can be skipped.
`prefer_ordering_index=on` is one example of this bias (specifically scoped
in the docs to queries with `LIMIT`, which our query does not have); a
broader pattern of similar preferences leads MySQL to typically pick the
PRIMARY index for our query and MariaDB to typically pick a secondary index.
With the PRIMARY scan, rows emerge already in `device_id` order, no filesort
is needed, and rownum stays correlated with the final result order. This is
what masks the underlying user-variable-evaluation issue on MySQL. The SQL
is still technically undefined per the manual; the deprecation label has
been on it since MySQL 8.0 (see Sources, claim 2).
Reproduction
On any MariaDB 10.3+ Observium master with distributed polling enabled and
at least a few hundred devices:
Run the wrapper's exact shard query 20 times, hash the result each time.
for i in $(seq 1 20); do |
mysql observium -N -e "
|
SELECT \`device_id\` FROM (SELECT @rownum := 0) r,
|
(SELECT @rownum := @rownum + 1 AS rownum, device_id
|
FROM \`devices\`
|
WHERE disabled != 1 AND poller_id = 1
|
ORDER BY \`device_id\` ASC) temp
|
WHERE MOD(temp.rownum, 4) = 0
|
ORDER BY device_id;" \
|
|md5sum
|
done |
Observed on MariaDB 10.6.5 with 313 active devices (`-i 4 -n 0`):
- distinct MD5 hashes across 20 runs{} — the shard membership is
changing between executions.
Observed on MySQL 8.0.41 with the same dataset:
- MD5 hash across 20 runs{} — stable. (The resulting partition is
technically not what the code's surface form claims — `@rownum` reflects
PK scan position, not sorted-device_id rank — but it is at least a stable
bijection on the active-device set, which is what the wrapper depends on.)
Proposed fix
Replace both the `bills` and `devices` distributed-shard subqueries with the
SQL-standard `ROW_NUMBER() OVER (ORDER BY id ASC)` window function.
`ROW_NUMBER()` is supported on MariaDB 10.2+ and MySQL 8+, both of which
Observium already requires (see Sources, claims 4-5).
— a/poller-wrapper.py
|
+++ b/poller-wrapper.py
|
@@ -793,21 +793,17 @@
|
# Use distributed wrapper
|
if process == 'billing':
|
# billing
|
|
|
- query = """SELECT `bill_id` FROM (SELECT @rownum :=0) r,
|
- (
|
- SELECT @rownum := @rownum +1 AS rownum, bill_id
|
+ query = """SELECT `bill_id` FROM (
|
+ SELECT ROW_NUMBER() OVER (ORDER BY `bill_id` ASC) AS rownum, `bill_id`
|
FROM `bills`
|
- ORDER BY `bill_id` ASC
|
) temp
|
WHERE MOD(temp.rownum, %s) = %s""" % (instances_count, instance_number)
|
else:
|
|
|
# poller or discovery
|
|
|
- query = """SELECT `device_id` FROM (SELECT @rownum :=0) r,
|
- (
|
- SELECT @rownum := @rownum +1 AS rownum, device_id
|
+ query = """SELECT `device_id` FROM (
|
+ SELECT ROW_NUMBER() OVER (ORDER BY `device_id` ASC) AS rownum, `device_id`
|
FROM `devices`
|
%s
|
- ORDER BY `device_id` ASC
|
) temp
|
WHERE MOD(temp.rownum, %s) = %s""" % (where_devices, instances_count, instance_number)
|
After the fix, the query is:
- Standards-conformant - no reliance on undefined user-variable evaluation
order). - Deterministic across runs and engines.
- Forward-compatible with MySQL 9.x, where the `@var := ...` inline
assignment is deprecated (see Sources, claim 2).
Possibly related public ticket
OBS-5164 ("Issue with multiple pollers alerts while they can reach
Observium server and the polled device", filed 2026-01-26, currently Pending
Response):
- Observium Enterprise 24.8.13606
- MariaDB 10.11.6
- 9 distributed pollers
- Symptom: random `device_down` alerts with no underlying network or device
problem - Reporter blamed DB connection timeouts but couldn't pin down a real cause
The symptoms are consistent with this bug. When the shard membership
reshuffles each cron tick, devices that "should have been" polled by
instance N this tick get skipped; an alert checker then sees stale
`last_polled` and fires `device_down` until the next reshuffle happens to
include that device. Worth investigating whether OBS-5164 closes once this
fix is in.
Notes
- The bug is silent: `observium-wrapper poller` reports "processed N devices
in X seconds" successfully every tick, and the wrapper's own RRDs
(`poller-wrapper.rrd`) look healthy. The corruption is downstream in the
per-device RRDs. - It scales with deployment size: a deployment small enough that MariaDB
picks the PRIMARY index for the shard query may not see the bug. The
threshold is determined by the optimizer's selectivity estimate of the
`disabled` column versus PK scan cost. - The wrapper's `billing` shard query has the same bug. The patch above
covers both.
Sources
Claim 1 — User-variable evaluation order is officially undefined
- MySQL 9.6 Reference Manual, §11.4 "User-Defined Variables"
https://dev.mysql.com/doc/refman/9.6/en/user-variables.html>
> "The order of evaluation for expressions involving user variables is
> undefined. For example, there is no guarantee that
> `SELECT @a, @a:=@a+1` evaluates `@a` first and then performs the
> assignment."
- MySQL 8.0 Reference Manual, §11.4 "User-Defined Variables"
https://dev.mysql.com/doc/refman/8.0/en/user-variables.html>
Same paragraph — "The order of evaluation for expressions involving user variables is undefined." (identical wording across 8.0 / 9.0-9.6)
- MariaDB Knowledge Base, "User-Defined Variables"
https://mariadb.com/kb/en/user-defined-variables/>
> "It is unsafe to read a user-defined variable and set its value in the
> same statement (unless the command is `SET`), because the order of
> these actions is undefined."
This is the engine-agnostic foundation: both vendors document that the
`@rownum := @rownum + 1` pattern has no guaranteed evaluation order. A
working result on any given engine/version is incidental, not contractual.
Claim 2 — The inline `@var := ...` assignment style is deprecated in MySQL
Cited in: "Notes on engine portability" / "Proposed fix" sections, where the
ticket says the `@rownum := ...` pattern is "deprecated since MySQL 8.0 and
subject to removal in a future release."
- MySQL 8.0 Reference Manual, §11.4 "User-Defined Variables"
https://dev.mysql.com/doc/refman/8.0/en/user-variables.html
> "Previous releases of MySQL made it possible to assign a value to a user
> variable in statements other than `SET`. **This functionality is
> supported in MySQL 8.0 for backward compatibility but is subject to
> removal in a future release of MySQL.**"
- MySQL 9.6 Reference Manual, §11.4 "User-Defined Variables"
https://dev.mysql.com/doc/refman/9.6/en/user-variables.html
> Same paragraph, with "MySQL 8.0" replaced by "MySQL 9.6". Oracle has
> preserved the deprecation warning for ≥ 5 release cycles.
Claim 3 — `prefer_ordering_index` exists only on MySQL, defaults `on`, and is part of why MySQL tends to mask the bug
Cited in: "Root cause" section, where the ticket says MySQL's optimizer is
biased toward picking an index that already satisfies `ORDER BY`.
- MySQL 8.0 Reference Manual, §10.9.2 "Switchable Optimizations"
<https://dev.mysql.com/doc/refman/8.0/en/switchable-optimizations.html>
> "`prefer_ordering_index` (default `on`) — Controls whether, in the case
> of a query having an `ORDER BY` or `GROUP BY` with a `LIMIT` clause,
> the optimizer tries to use an ordered index instead of an unordered
> index, a filesort, or some other optimization. This optimization is
> performed by default whenever the optimizer determines that using it
> would allow for faster execution of the query. [...] Prior to MySQL
> 8.0.21, it was not possible to disable this optimization, but in
> MySQL 8.0.21 and later, while it remains the default behavior, it can
> be disabled by setting the `prefer_ordering_index` flag to `off`."
Important precision. The flag's documented scope is `ORDER BY` /
`GROUP BY` with a `LIMIT`. Our buggy query does not have `LIMIT`, so the
flag is not strictly the cause of MySQL picking the PRIMARY index in our
case. The MySQL optimizer separately also prefers an index that already
satisfies `ORDER BY` (so it can avoid a filesort). The combined practical
effect is that MySQL tends to pick the PRIMARY index for our query and
MariaDB tends not to. The MariaDB optimizer has no comparable "prefer
ordering index" toggle.
- MariaDB does not have an equivalent flag.{} MariaDB's `optimizer_switch`
documentation does not list `prefer_ordering_index`:
https://mariadb.com/kb/en/optimizer-switch/
Claim 4 — `ROW_NUMBER()` is supported on MariaDB 10.2+ and MySQL 8.0+
Cited in: "Proposed fix" section.
- MariaDB Knowledge Base, "ROW_NUMBER"
<https://mariadb.com/kb/en/row_number/>
> Documents `ROW_NUMBER() OVER (...)` syntax and behavior. Window functions
> were added in MariaDB 10.2 — confirmed at:
> <https://mariadb.com/kb/en/window-functions-overview/>
> which links to "Introduction to Window Functions in MariaDB Server 10.2".
- MySQL 8.0 Reference Manual, §14.20.1 "Window Function Descriptions"
<https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html>
> Lists `ROW_NUMBER()` as one of the standard nonaggregate window
> functions. MySQL added window functions in 8.0.
Claim 5 — Observium's own minimum versions already require ROW_NUMBER()-capable engines
Cited in: "Proposed fix" section, where the ticket says ROW_NUMBER() is
within Observium's already-required engine versions.
- Observium docs, "Software Requirements"
<https://docs.observium.org/software_requirements/>
> MySQL minimum 5.7.8, recommended > 8.0
> MariaDB minimum 10.2.7, recommended > 10.3
MariaDB 10.2.7 ≥ 10.2 → has `ROW_NUMBER()`.
MySQL 5.7.8 lacks `ROW_NUMBER()` (window functions were added in 8.0). So
there is a corner case: a Community Edition deployment running
Observium-minimum MySQL 5.7.8 would not have `ROW_NUMBER()` available.
However, Distributed Polling — the only feature that exercises this query
path — is Enterprise-only, and Enterprise installs are virtually always on
recent MySQL 8.x or MariaDB 10.5+.
Claim 76— DBA community has independently called the rownum trick out as undefined
Useful as third-party (non-vendor) corroboration:
- DBA Stack Exchange, question 244353
<https://dba.stackexchange.com/questions/244353/order-by-is-ignored-if-inside-a-derived-table-of-a-update-only-in-mysql-5-7>
> Accepted-answer quote: "A derived table in SQL is a set, and a set by
> definition has no row order. Your attempt to use an imperative
> calculation (`@rownum := @rownum + 1`) within a set was wrong to begin
> with, and while it may have worked for a while, as you now realized,
> there is no guarantee for that."
- **MySQL official blog archive, "Row numbering, ranking: how to use less
user variables in MySQL queries"**
<https://dev.mysql.com/blog-archive/row-numbering-ranking-how-to-use-less-user-variables-in-mysql-queries/>
> Discusses why the `@rownum` trick happens to work on a single-table
> query when the optimizer pre-orders rows: "MySQL sees that there is
> only one table and that the ordering expression is a column of this
> table, so it makes an optimization: it first orders the rows of people,
> then reads this ordered result and, for each row, calculates selected
> expressions (including `@rownum`). So `@rownum` increases in order, as
> desired." The blog post is explicit that this is an *optimization
> side-effect*, not a guarantee.
Attachments
Issue Links
- mentions
-
OBS-5164 Issue with multiple pollers alerts while tthey can reach Observium server and the polled device
-
- Resolved
-