
RabbitMQ mirrored queues troubleshooting sync issues

Troubleshooting RabbitMQ mirrored queues made easy.

RabbitMQ mirrored queues master/slave graphic illustration

 


Troubleshooting mirrored queues sync issues

Recently, I experienced an issue with synchronization of mirrored queues in RabbitMQ. Although the root cause of my issue was quickly identified, I thought it would be a good idea to describe how mirrored queues work, as they are sometimes misunderstood.

Disclaimer: I wrote this article based on my current experience with RabbitMQ. It might contain mistakes. Should you notice any, feel free to correct me 🙋🏻‍♂️ 

How do mirrored queues work?

Master/mirror mode

A very important thing to remember about mirrored queues is that they work in a master/mirror mode. Only one node (the master) is in charge of handling the messages. Consumers are transparently redirected to the master, no matter which node they connect to. Every time a message is received on a mirrored queue, the master replicates that message to the configured mirror nodes.

Replication consumes bandwidth

Please be aware that this replication consumes network bandwidth. For example, let’s say you receive messages that are 10MB in size, and you have 2 mirror nodes. Each 10MB message will be replicated to 2 nodes, so an additional 20MB of network bandwidth is consumed per message. Imagine you receive a few hundred of those messages per minute, and you’ll see that mirrored queues add quite some bandwidth overhead.
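To put a number on this, here is a minimal sketch of the arithmetic. The function name is made up for this example, and 300 messages per minute is only an assumed value for “a few hundred”.

```python
def replication_overhead_mb(message_mb, mirrors, messages_per_minute):
    """Extra network traffic in MB per minute caused by copying
    every message from the master to each mirror node."""
    return message_mb * mirrors * messages_per_minute


# One 10MB message with 2 mirrors: 20MB of extra traffic.
per_message = replication_overhead_mb(10, 2, 1)

# Assumed rate of 300 messages per minute (not a figure from the article).
per_minute = replication_overhead_mb(10, 2, 300)

print(per_message)       # 20
print(per_minute)        # 6000
print(per_minute / 60)   # 100.0 (MB per second)
```

At that assumed rate, replication alone adds roughly 100MB per second of traffic between the nodes.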

Redundancy

As noted earlier, mirrored queues are master/mirror by design. They provide redundancy for messages. However, since only the master node processes the messages and distributes them to consumers, mirrored queues do not spread the workload across nodes.

This means that when a message is received on the master node, it is replicated to the mirror nodes. When the message is consumed, the master node replicates this information to the mirror nodes as well. Once a message has been consumed, there is no need for the mirrors to keep their copy of it around.

What happens when the master node goes down?

When the master node goes down, another node will be promoted to master. How this happens depends on the following queue policy settings:

Queue policy settings


ha-promote-on-failure: this setting matters during an unexpected/unannounced downtime of the master node.

ha-promote-on-shutdown: this setting matters during an expected/announced downtime of the master node, meaning the RabbitMQ server is brought down via “systemctl stop rabbitmq-server” or another kind of controlled shutdown.

These two settings can have one of the following values:

always: this means any mirror node can become the master, whether or not it is in sync


when-synced: this means that only a mirror that is in sync with the master will be promoted

By default, the following settings apply:

ha-promote-on-failure: always

ha-promote-on-shutdown: when-synced

This means that when we stop RabbitMQ via systemctl, only a synced mirror node will be promoted, whereas during an unexpected downtime a non-synced mirror might be promoted.
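The rules above can be sketched as a small function. This is a toy model, not RabbitMQ's code: the function name, the node names and the dictionary layout are invented for the example. It only answers which mirrors are eligible for promotion, not which one RabbitMQ would actually pick.

```python
# Default values as described above.
DEFAULTS = {
    "ha-promote-on-failure": "always",
    "ha-promote-on-shutdown": "when-synced",
}


def promotable_mirrors(mirrors, event, policy=None):
    """Return the mirrors that may be promoted to master.

    mirrors: dict mapping node name -> True if the mirror is in sync
    event:   "failure" (unexpected) or "shutdown" (controlled)
    policy:  optional overrides for the two ha-promote-on-* settings
    """
    settings = {**DEFAULTS, **(policy or {})}
    mode = settings[f"ha-promote-on-{event}"]
    if mode == "always":
        return sorted(mirrors)
    if mode == "when-synced":
        return sorted(node for node, synced in mirrors.items() if synced)
    raise ValueError(f"unknown value: {mode}")


# Hypothetical cluster: one synced and one unsynced mirror.
mirrors = {"rabbit@node2": True, "rabbit@node3": False}

print(promotable_mirrors(mirrors, "failure"))   # both mirrors
print(promotable_mirrors(mirrors, "shutdown"))  # only the synced mirror
```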

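PROTIP: When consistency is preferred over availability, it’s a good idea to set “ha-promote-on-failure: when-synced” as well. The trade-off is that the queue stays unavailable when the master fails and no synced mirror is left to promote.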
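As a sketch of what such a policy could look like, the snippet below builds a policy definition and prints a matching “rabbitmqctl set_policy” command. The policy name “ha-consistent”, the queue name pattern and the “ha-mode: all” value are assumptions for this example. Adapt them to your own setup before running the printed command.

```python
import json
import shlex

# Policy definition that prefers consistency for both kinds of downtime.
definition = {
    "ha-mode": "all",  # assumed: mirror to all nodes in the cluster
    "ha-promote-on-failure": "when-synced",
    "ha-promote-on-shutdown": "when-synced",
}


def set_policy_command(name, pattern, definition, vhost="/"):
    """Build a shell-safe 'rabbitmqctl set_policy' command line."""
    argv = [
        "rabbitmqctl", "set_policy",
        "-p", vhost,
        "--apply-to", "queues",
        name, pattern, json.dumps(definition),
    ]
    return shlex.join(argv)


# Hypothetical policy name and pattern (queues whose name starts with "ha.").
command = set_policy_command("ha-consistent", r"^ha\.", definition)
print(command)
```

The script only prints the command. It does not talk to RabbitMQ itself, so you can review the output before applying it.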

What happens when a mirror node goes down?

When a mirror node goes down, it will no longer receive replicated messages
from the master node. When the mirror node comes back up again, it will detect
that it has missed messages from the master node and will put itself in the
“unsynchronized” state.

This raises the question: How and when will the mirror node sync again?

Synchronization happens in two ways. Each has its own particularities, and you must be aware of them!

Forced synchronization

You can force a sync via the rabbitmqctl command or the management UI. However, during this kind of sync, the queue will be unavailable: no new messages will be accepted or processed until the sync has completed. This is not recommended, as it impacts availability.

rabbitmqctl sync_queue -p <vhost> <queue>

Automatic synchronization

Automatic synchronization starts as soon as the mirror node is back up. The
mirror node is “synchronized” again once every message that the master holds
but the mirror lacks has been consumed.

Example

To illustrate this with an example:

Imagine the following queue, with 3 messages:

Master      Mirror
Message1    Message1
Message2    Message2
Message3    Message3

Now, let’s say we shut down the mirror node. In the meantime, 2 more messages are
received at the master node. When the mirror node comes back up, the new situation
will be this:

Master      Mirror
Message1    -
Message2    -
Message3    -
Message4    -
Message5    -

You might wonder: why are Message1, 2 and 3 gone on the mirror node?

Durable vs non-durable queue

If the queue is non-durable, the messages are lost after a restart, as they are not persisted to disk.

If the queue is durable, the manual mentions the following:

“when a mirror rejoins a mirrored queue, it throws away any durable local contents it already has and starts empty. Its behavior is at this point the same as if it were a new node joining the cluster”

vFabric RabbitMQ Documentation

The mirror node will detect that the master node is “ahead”, and
it will be put in the “unsynchronized” state. The mirrored queue will
only become “synchronized” again after a consumer has processed
Message1 through Message5. 

The mirror does not have to wait for that before it receives messages again:
once it has rejoined, every newly published message is replicated to it. What it
lacks are the older messages that are still waiting on the master. For example,
let’s say we receive Message6 and Message7 while Message1 through Message5 have
not yet been processed by a consumer. The situation is then this:

Master      Mirror
Message1    -
Message2    -
Message3    -
Message4    -
Message5    -
Message6    Message6
Message7    Message7
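The whole example can be replayed with a toy model. This is a deliberately simplified sketch with one master and one mirror. The class and method names are invented for illustration and have nothing to do with RabbitMQ's internals.

```python
from collections import deque


class MirroredQueue:
    """Toy model of a mirrored queue with one master and one mirror."""

    def __init__(self):
        self.master = deque()
        self.mirror = deque()
        self.mirror_up = True

    def publish(self, message):
        self.master.append(message)
        if self.mirror_up:
            # The master replicates new messages to a running mirror.
            self.mirror.append(message)

    def consume(self):
        message = self.master.popleft()
        # The mirror drops its copy, if it has one.
        if self.mirror_up and self.mirror and self.mirror[0] == message:
            self.mirror.popleft()
        return message

    def stop_mirror(self):
        self.mirror_up = False

    def start_mirror(self):
        # A rejoining mirror throws away its local contents and starts empty.
        self.mirror.clear()
        self.mirror_up = True

    @property
    def synchronized(self):
        return self.mirror_up and list(self.master) == list(self.mirror)


q = MirroredQueue()
for n in (1, 2, 3):
    q.publish(f"Message{n}")
q.stop_mirror()
for n in (4, 5):
    q.publish(f"Message{n}")
q.start_mirror()
print(list(q.mirror), q.synchronized)  # [] False

for n in (6, 7):
    q.publish(f"Message{n}")
print(list(q.mirror), q.synchronized)  # ['Message6', 'Message7'] False

for _ in range(5):  # a consumer processes Message1 through Message5
    q.consume()
print(list(q.master), q.synchronized)  # ['Message6', 'Message7'] True
```

The mirror only reports “synchronized” once the consumer has drained the messages that the mirror never had.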

Pitfall: Queues without consumers

The auto-sync occurs naturally, as long as consumers are actively processing
messages.

The pitfall here is a queue without active consumers. When a mirror node
goes down while the master node keeps receiving new messages, those messages
are never processed. Since they are never processed, the mirror never catches
up. The only way to fix this is a forced synchronization. Best practice,
however, is to give every queue an active consumer, which prevents this
problem in the first place.
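A check for this pitfall could look roughly like the sketch below. It assumes queue data shaped like the output of the management API (“/api/queues”), with the fields “slave_nodes”, “synchronised_slave_nodes” and “consumers”. Treat those field names as an assumption and verify them against your RabbitMQ version. The sample data is made up.

```python
import shlex


def at_risk_queues(queues):
    """Return queues with an unsynchronized mirror and no consumers."""
    risky = []
    for queue in queues:
        mirrors = set(queue.get("slave_nodes", []))
        synced = set(queue.get("synchronised_slave_nodes", []))
        has_unsynced_mirror = bool(mirrors - synced)
        if has_unsynced_mirror and queue.get("consumers", 0) == 0:
            risky.append(queue)
    return risky


def sync_command(queue):
    """Build the forced-sync command for one queue (it blocks the queue!)."""
    return shlex.join(
        ["rabbitmqctl", "sync_queue", "-p", queue["vhost"], queue["name"]]
    )


# Made-up sample data for illustration.
queues = [
    {"name": "orders", "vhost": "/", "consumers": 0,
     "slave_nodes": ["rabbit@node2"], "synchronised_slave_nodes": []},
    {"name": "invoices", "vhost": "/", "consumers": 2,
     "slave_nodes": ["rabbit@node2"], "synchronised_slave_nodes": []},
    {"name": "audit", "vhost": "/", "consumers": 0,
     "slave_nodes": ["rabbit@node2"],
     "synchronised_slave_nodes": ["rabbit@node2"]},
]

for queue in at_risk_queues(queues):
    print(sync_command(queue))
```

Only “orders” is flagged here: “invoices” will catch up on its own because it has consumers, and “audit” is already in sync.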



Jeroen Jacobs

Jeroen is Evolutionary Architect at ToThePoint NV and is one of the AWS experts at OnTheSpot, the cloud enabling competence center within Cronos Groep.
