Galera Cluster - all nodes down, restart query
We are seeking a replacement for standard replication. One possibility that came up was Galera Cluster in MariaDB, and I have been working through the Linux Academy online course on it to get a feel for how it works.
I have two test servers to try it out on and have created a two-node cluster on them.
What I found was that if BOTH servers were down when I started them, neither would start MariaDB, because the gcomm connection could not be made, because the other node was not running MariaDB. But that second node couldn't start because the first node wasn't up, and the first node was in turn waiting on the second. In effect, it seems that once all nodes in a cluster are down, manual intervention is needed to get it working again.
I checked this with Linux Academy, and they are asking their own experts, but on the (understandable) premise that they can only really comment on course content, not on feasibility in a production environment, though I did stress this was a very basic, general query.
So, having found these forums, I'll ask here also...
If all nodes of a cluster stop, is it really true that getting the cluster working again (or even just getting MariaDB to start and serve schemas) requires manual intervention, and that it cannot just restart itself automagically? Naturally, in the real world one would have the servers separated on different subnets if possible, and so on. Unfortunately, this being the real world, anything that can go wrong eventually will, and I can't suggest a solution that has a major gotcha like this in it. Not only will it go wrong at some point down the road, it is bound to do so at 0200 on a Sunday over a bank holiday weekend!
cheers :-)
ian
Answer (by Ian Gilfillan)
Three nodes is the minimum recommended size for a functional cluster, but in your example, if the cluster is entirely down, yes, manual restart is by design. You want to check which node is furthest ahead, as you don't want to start a node that's behind first. See Restarting the Cluster.
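As a rough illustration of "check which node is furthest ahead": each Galera node keeps a grastate.dat file in its data directory that records, among other things, the last committed sequence number (seqno). The sketch below is only that, a sketch. It assumes the contents of each node's grastate.dat have already been collected as text, and the node names, UUID and seqno values are made up for the example. It simply picks the node with the highest seqno, and refuses to guess if any node reports -1, which usually means the node did not shut down cleanly and its position has to be recovered first.

```python
# Made-up sample contents of grastate.dat from two nodes (illustrative values only).
NODE1_STATE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   1042
safe_to_bootstrap: 0
"""

NODE2_STATE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   1057
safe_to_bootstrap: 0
"""


def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states):
    """Return the name of the node with the highest seqno.

    `states` maps a node name (a hypothetical label chosen by the caller)
    to the text of that node's grastate.dat.
    """
    seqnos = {}
    for node, text in states.items():
        seqno = int(parse_grastate(text)["seqno"])
        if seqno < 0:
            # seqno -1: position unknown, so this comparison cannot be trusted.
            raise ValueError(f"{node} has seqno {seqno}; recover its position first")
        seqnos[node] = seqno
    return max(seqnos, key=seqnos.get)


if __name__ == "__main__":
    print(pick_bootstrap_node({"node1": NODE1_STATE, "node2": NODE2_STATE}))
```

With the sample values above, node2 is the more advanced node. On systemd-based MariaDB installs, the chosen node is typically the one bootstrapped with galera_new_cluster, after which the remaining nodes are started normally and rejoin; the Restarting the Cluster page mentioned above has the authoritative steps.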