Mirrored queues sync issues
Recently, I experienced an issue with synchronization of mirrored queues in RabbitMQ. Although the root cause of my issue was quickly identified, I thought it would be a good idea to describe how mirrored queues work, as they are sometimes misunderstood.
Disclaimer: I wrote this article based on my current experience with RabbitMQ. It might contain mistakes. Should you notice any, feel free to correct me 🙋🏻♂️
How do mirrored queues work?
A very important thing to remember about mirrored queues is that they work in a master/mirror mode. Only one node (= the master) is in charge of handling the messages. Consumers are transparently redirected to the master, no matter which node they connect to. Every time a message is received on a mirrored queue, the master replicates that message to the configured mirror nodes.
Replication consumes bandwidth
Please be aware that this replication consumes network bandwidth. For example: let’s say you receive messages that are 10MB in size, and you have 2 mirror nodes. Each 10MB message will be replicated to 2 nodes, so an additional 20MB of network bandwidth is consumed per message. Imagine you get a few hundred of those messages per minute, and you’ll see there is quite some additional bandwidth overhead when using mirrored queues.
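To put a number on that overhead, here is a minimal sketch of the arithmetic. The function name is my own, and the rate of 300 messages per minute is just an assumed stand-in for "a few hundred"; it also ignores protocol overhead and only counts the message payload.

```python
def replication_overhead_mb(message_size_mb, mirror_count, messages_per_minute=1):
    """Extra network traffic (in MB) caused by copying messages to the mirrors."""
    # Every message is sent once to each mirror node.
    return message_size_mb * mirror_count * messages_per_minute


# One 10MB message with 2 mirror nodes, as in the example above.
per_message = replication_overhead_mb(10, 2)

# Assumed rate: 300 of those messages per minute.
per_minute = replication_overhead_mb(10, 2, messages_per_minute=300)

print(per_message, per_minute)
```

With these assumed numbers, the mirrors alone account for 6000MB of extra traffic per minute, on top of the traffic from the publishers and to the consumers.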
As noted earlier, mirrored queues are master/mirror by design. They provide redundancy for messages. However, as only the master node is in charge of processing the messages and distributing them to consumers, mirrored queues do not distribute the workload across nodes.
This means that when a message is received on the master node, it is replicated to the mirror nodes. When the message is consumed, the master node also replicates this information to the mirror nodes. Once a message has been consumed, there is no need for the mirrors to keep their copy of it around.
What happens when the master node goes down?
When the master node goes down, another node will be promoted to master. How this happens depends on the following queue policy settings:
Queue policy settings
ha-promote-on-failure: this setting matters during an unexpected/unannounced downtime of the master node.
ha-promote-on-shutdown: this setting matters during an expected/announced downtime of the master node (that is, the RabbitMQ server is brought down via “systemctl stop rabbitmq-server” or another way to perform a controlled shutdown).
These two settings can have one of the following values:
always: this means any mirror node can become a master, whether it is in sync or not
when-synced: this means that only a mirror that is in sync with the master will be promoted
By default, the following settings apply:
ha-promote-on-shutdown: when-synced
ha-promote-on-failure: always
This means that when we stop RabbitMQ via systemctl, only a synced mirror node will be promoted, whereas during an unexpected downtime a non-synced mirror might be promoted.
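The difference between the two values can be sketched as a small filter. This is only an illustration of the rule described above, not RabbitMQ's actual election code: the function name and the node names are made up, and which of the eligible mirrors actually gets promoted is left out.

```python
def promotion_candidates(mirrors, promote_setting):
    """Return the mirrors that may be promoted to master.

    mirrors maps a (hypothetical) node name to True if that mirror is in sync.
    promote_setting is the value of ha-promote-on-failure or
    ha-promote-on-shutdown: "always" or "when-synced".
    """
    if promote_setting == "always":
        # Any mirror is eligible, even one that misses messages.
        return sorted(mirrors)
    if promote_setting == "when-synced":
        # Only mirrors that hold everything the master had are eligible.
        return sorted(name for name, synced in mirrors.items() if synced)
    raise ValueError(f"unknown setting: {promote_setting!r}")


# Hypothetical cluster state: one synced and one unsynced mirror.
mirrors = {"rabbit@node2": True, "rabbit@node3": False}

on_failure = promotion_candidates(mirrors, "always")
on_shutdown = promotion_candidates(mirrors, "when-synced")
print(on_failure, on_shutdown)
```

Note that with "when-synced" the list can be empty, in which case no mirror is promoted: that is the availability you give up in exchange for consistency.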
PROTIP: When consistency is preferred above availability, it’s a good idea to set “ha-promote-on-failure: when-synced” as well.
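As a sketch of what such a policy could look like, the snippet below builds a policy definition and prints a matching rabbitmqctl set_policy command. The policy name, the queue pattern, and the choice of "exactly 2" replicas are assumptions of mine for the sake of the example; only the two ha-promote-* keys come from the settings discussed above. The command is printed, not executed.

```python
import json
import shlex

# Policy definition: consistency preferred over availability.
policy_definition = {
    "ha-mode": "exactly",          # assumed mirroring mode for this example
    "ha-params": 2,                # assumed number of replicas (master included)
    "ha-promote-on-shutdown": "when-synced",
    "ha-promote-on-failure": "when-synced",
}

policy_name = "ha-consistent"      # hypothetical policy name
queue_pattern = r"^orders\."       # hypothetical pattern for the queue names

command = [
    "rabbitmqctl", "set_policy",
    "--apply-to", "queues",
    policy_name,
    queue_pattern,
    json.dumps(policy_definition),
]

# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(command))
```

Review the printed command before running it, and adjust the pattern to your own queue names.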
What happens when a mirror node goes down?
When a mirror node goes down, it will no longer receive replicated messages from the master node. When the mirror node comes back up again, it will detect that it has missed messages from the master node and will put itself in the “unsynchronized” state.
This raises the question: how and when will the mirror node sync again?
Synchronization happens in one of two ways, manual or automatic, and each
has its own particularities that you must be aware of!
You can force a sync via the rabbitmqctl command or the management UI. However, during this kind of sync, the queue will be unavailable: no new messages will be accepted or processed until the sync has completed. This is not recommended, as it impacts availability.
rabbitmqctl sync_queue -p <vhost> <queue>
Automatic synchronization requires no action on your part. Once the mirror
node is back up, it becomes “synchronized” again when all the messages that
the master node held before the mirror rejoined have been processed.
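To illustrate this with an example:
Imagine a queue with 3 messages, Message1 through Message3, that are present on both the master node and the mirror node.
Now, let’s say we shut down the mirror node. In the meantime, 2 more messages are
received at the master node. When the mirror node comes up, the new situation
will be this: the master node holds Message1 through Message5, while the mirror node is empty.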
You might wonder: why are Message1, Message2 and Message3 gone on the mirror node?
Durable vs non-durable queue
If the queue is non-durable, the messages will be lost after a restart, as they are not persisted on disk.
If the queue is durable, the manual mentions the following:
when a mirror rejoins a mirrored queue, it throws away any durable local contents it already has and starts empty. Its behavior is at this point the same as if it were a new node joining the cluster.
The mirror node will detect that the master node is “ahead”, and
it will be put in the “unsynchronized” state. The mirrored queue will
only become “synchronized” again after a consumer has processed
Message1 through Message5.
As soon as those messages are processed by a consumer, the mirror is in
sync again. Note that the mirror already receives newly published messages
from the moment it rejoins; it is only the older messages that it lacks. For
example, let’s say we receive Message6 and Message7 AFTER Message1 through
Message5 have been processed by a consumer. The new situation is then this:
both the master node and the mirror node hold Message6 and Message7.
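The whole scenario can be replayed with a small toy model. To be clear, this is not how RabbitMQ is implemented: the class and its method names are invented for this article, there is only one mirror, and "synchronized" is simplified to "the mirror holds exactly what the master holds".

```python
from collections import deque


class MirroredQueueModel:
    """Toy model of one master and one mirror; not RabbitMQ code."""

    def __init__(self):
        self.master = deque()
        self.mirror = deque()
        self.mirror_up = True

    def publish(self, message):
        # The master always stores the message; a running mirror gets a copy.
        self.master.append(message)
        if self.mirror_up:
            self.mirror.append(message)

    def consume(self):
        # The master hands out the oldest message and a running mirror
        # drops its copy, if it has one.
        message = self.master.popleft()
        if self.mirror_up and self.mirror and self.mirror[0] == message:
            self.mirror.popleft()
        return message

    def stop_mirror(self):
        self.mirror_up = False

    def start_mirror(self):
        # A rejoining mirror throws away its old contents and starts empty.
        self.mirror.clear()
        self.mirror_up = True

    def is_synchronized(self):
        return self.mirror_up and list(self.master) == list(self.mirror)


queue = MirroredQueueModel()
for n in (1, 2, 3):
    queue.publish(f"Message{n}")
synced_at_start = queue.is_synchronized()

# The mirror goes down, Message4 and Message5 arrive, the mirror rejoins.
queue.stop_mirror()
for n in (4, 5):
    queue.publish(f"Message{n}")
queue.start_mirror()
synced_after_rejoin = queue.is_synchronized()
mirror_after_rejoin = list(queue.mirror)

# A consumer processes Message1 through Message5.
consumed = [queue.consume() for _ in range(5)]
synced_after_consume = queue.is_synchronized()

# New messages now land on both nodes.
for n in (6, 7):
    queue.publish(f"Message{n}")
print(list(queue.master), list(queue.mirror), queue.is_synchronized())
```

If you remove the line that consumes the five messages, the model never reports the queue as synchronized again, which is exactly the pitfall described in the next section.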
Pitfall: Queues without active consumers
The auto-sync occurs naturally, as long as consumers are actively processing
messages. The pitfall here is queues without active consumers: when a mirror
node goes down and new messages are received by the master node, those
messages will never be processed. Since they are not being processed, the
mirror will never catch up. The only way to fix this is a forced synchronization.
However, best practice dictates that every queue should have an active consumer
to prevent this problem.
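To spot such queues, one option is to look at the queue listing of the management HTTP API. The sketch below works on already-fetched data and assumes each queue object carries the fields "vhost", "name", "consumers", "messages", "slave_nodes" and "synchronised_slave_nodes"; please verify those field names against your RabbitMQ version. The sample data and the function names are made up.

```python
def queues_needing_manual_sync(queues):
    """Return (vhost, name) for queues that cannot catch up on their own."""
    flagged = []
    for q in queues:
        mirrors = set(q.get("slave_nodes", []))
        synced = set(q.get("synchronised_slave_nodes", []))
        has_unsynced_mirror = bool(mirrors - synced)
        # No consumer means the backlog is never drained, so an
        # unsynchronized mirror stays unsynchronized.
        if q.get("consumers", 0) == 0 and q.get("messages", 0) > 0 and has_unsynced_mirror:
            flagged.append((q.get("vhost", "/"), q["name"]))
    return flagged


def sync_command(vhost, queue):
    """Build (but do not run) the forced-sync command for one queue."""
    return ["rabbitmqctl", "sync_queue", "-p", vhost, queue]


# Hypothetical sample of what the queue listing might contain.
sample_queues = [
    {"vhost": "/", "name": "orders", "consumers": 2, "messages": 5,
     "slave_nodes": ["rabbit@node2"], "synchronised_slave_nodes": []},
    {"vhost": "/", "name": "dead-letters", "consumers": 0, "messages": 40,
     "slave_nodes": ["rabbit@node2"], "synchronised_slave_nodes": []},
    {"vhost": "/", "name": "audit", "consumers": 0, "messages": 3,
     "slave_nodes": ["rabbit@node2"], "synchronised_slave_nodes": ["rabbit@node2"]},
]

flagged = queues_needing_manual_sync(sample_queues)
commands = [sync_command(vhost, name) for vhost, name in flagged]
print(flagged)
```

Keep in mind that running the generated commands blocks the queue for the duration of the sync, so schedule them accordingly.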
- Take Control of your queues
- Indicate which queue’s must be replicated and how
- How to lose messages on a RabbitMQ Cluster
- vFabric RabbitMQ Documentation