Gluster: accidental peer "takeover" by "peer probe"


Post by peter_b »

[PROBLEM]
Imagine the following Gluster setup:
  • 2 Servers (aka "node" or "peer").
  • Let's call them "A" and "B".
  • A and B are set up independently, but completely identical (=same "volume name").
  • A is the production pool, B is the asynchronously sync'd backup.
  • A contains 71 TB on 3 bricks. B is not fully sync'd and only contains around 65 TB.
Everything was running perfectly until I tried to prepare for adding new bricks on a new node to "A":
I thought "peer probe" was just "probing" a node (=checking status, availability, etc.), so I tried to probe A on A:
peer probe: success: on localhost not needed
Okay, so I tried to probe A from B.
Big mistake...
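
For the record: if all you want is a read-only look at a node, commands like these should do the job without touching any peer or volume configuration:

Code: Select all

$ gluster peer status
$ gluster volume info
$ gluster volume status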


This caused B to overwrite A's (production-use!) configuration. "/var/lib/glusterd" was now pointing to bricks on B.
Since B was a copy of A, it wasn't obvious what had happened: user access over Samba to A's mountpoint silently retrieved the data from B's bricks.

Actually, this is great! It shows how nicely Gluster does its job. Seriously: this is exactly what we wanted Gluster for - but my "peer probe" caused an uncomfortable situation:
The test volume "B" now acted as the production pool, but was being referenced through Samba on "A". Ouch! :shock:
The fact that A and B had identical volume names led to A's brick configuration being overwritten by B's config.
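
A quick way to check which bricks a node currently believes belong to a volume - and to peek at the on-disk config under "/var/lib/glusterd" that got rewritten - should be something like:

Code: Select all

$ gluster volume info
$ ls /var/lib/glusterd/vols/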


[SOLUTION]
Luckily, it was rather straightforward:

1) Detach the wrongly connected peer:
On server B, detach A:

Code: Select all

$ gluster peer detach A
You can't do it the other way around, or A will complain that bricks on B are still in use.

B never needed A in this setup anyway, so B continues to run fine - standalone.
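
To double-check that the two nodes really are separated again, the peer list on both A and B should now be empty:

Code: Select all

$ gluster peer status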

2) Create the volume again on A:
A and B had identical volume names, so detaching A removed any reference to its previous volume configuration.
Therefore, create a new one. Same name as before. In our case "dlp-storage":

Code: Select all

$ gluster volume create dlp-storage
Now, we need to re-assign this volume's previously owned bricks.

3) "Free" the bricks on A:
In order to re-add the bricks on A to A, we must clear some extended file attributes (xattrs) first.
See "{path} or a prefix of it is already part of a volume" for instructions how to do so.
In my case, I used "mv" rather than "rm" for the ".glusterfs" folder - to be able to roll back the previous config. Just in case...

Here it is in a nutshell:
Make a backup of the xattrs you're about to clear:

Code: Select all

$ getfattr --absolute-names -d -m - /path/of/brick{1,2,3}/data >> backup_xattr.txt
Now, clear the attributes and move the ".glusterfs" folder out of the way (but keep it for recovery reasons):

Code: Select all

$ setfattr -x trusted.glusterfs.volume-id $brick_path
$ setfattr -x trusted.gfid $brick_path
$ mv $brick_path/.glusterfs $brick_path/.glusterfs-DELME
NOTE: You must run this on each brick's data path.
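
Since this has to be repeated for every brick, a small loop saves some typing (a sketch, assuming the same three data paths as in the getfattr example above):

Code: Select all

$ for b in /path/of/brick{1,2,3}/data; do setfattr -x trusted.glusterfs.volume-id "$b"; setfattr -x trusted.gfid "$b"; mv "$b/.glusterfs" "$b/.glusterfs-DELME"; done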


4) Add the bricks to the volume on A:

Code: Select all

$ gluster volume add-brick dlp-storage hostname_a:/path/of/brick1/data hostname_a:/path/of/brick2/data hostname_a:/path/of/brick3/data
Guess what? That's actually almost it!
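
Before starting the volume, it doesn't hurt to verify that all three bricks are now listed under A's hostname again:

Code: Select all

$ gluster volume info dlp-storage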


5) Start the volume and mount it:

Code: Select all

$ gluster volume start dlp-storage
Then, mount it under a new mountpoint. In our case we had "/mnt/dlp-storage" as the regular one (=now temporarily mounted "B"), and "/mnt/dlp-test" for the recovered "A".
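
With the native GlusterFS (FUSE) client, such a mount should look roughly like this ("hostname_a" as in the add-brick step above):

Code: Select all

$ mkdir -p /mnt/dlp-test
$ mount -t glusterfs hostname_a:/dlp-storage /mnt/dlp-test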

"df -h" showed that the previous data on the bricks was instantly found and available in the gluster volume: 71 TB all happily there! :D
There wasn't even any noticeable waiting time. The ".glusterfs" folders on each brick were also immediately rebuilt.

Finally, I rsync'd the delta that B acquired during this "misconfiguration phase" back to A:

Code: Select all

$ rsync -avP /mnt/dlp-storage/ /mnt/dlp-test/
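
If you'd rather see the delta before writing anything, the same command with rsync's dry-run flag ("-n") gives a preview first:

Code: Select all

$ rsync -avPn /mnt/dlp-storage/ /mnt/dlp-test/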

Debriefing:
I would have expected a warning when doing "gluster peer probe" on a node that already has a volume with an identical name - preventing this from happening in the first place. (Reminder to myself: add it to their bugtracker.) But nevertheless, this showed how the folder/file-based design allows you to quickly switch between nodes/bricks - and to re-integrate existing data.