yum -y install gdisk
yum -y install util-linux
gdisk /dev/sdc    # interactively create /dev/sdc1 for LVM (GPT type code 8e00)
vgextend vgVAR /dev/sdc1
lvextend --extents +100%FREE /dev/vgVAR/lvVAR /dev/sdc1
xfs_growfs /dev/mapper/vgVAR-lvVAR
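To confirm the extension took effect, standard LVM and coreutils checks can be used (volume group and logical volume names from the commands above; this assumes lvVAR is mounted at /var):

vgs vgVAR    # volume group should show the space added from /dev/sdc1
lvs vgVAR    # lvVAR should now span the extra extents
df -h /var   # filesystem size should reflect the xfs_growfs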
/var/lib/dcos/mesos-resources -- the Mesos agent's current resource state
/var/lib/mesos/slave/meta/slaves/latest -- the Mesos agent's checkpoint state
Both are removed below so that the agent re-registers its resources, including the new volume, on restart.
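They can be inspected beforehand to see what the agent currently advertises (read-only checks; paths as above):

cat /var/lib/dcos/mesos-resources                # resources the agent was started with
ls -l /var/lib/mesos/slave/meta/slaves/latest    # symlink to the latest checkpointed agent run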
mkdir -p /dcos/volume0 && mkdir -p /var/local/ImageSilo
dd if=/dev/zero of=/var/local/ImageSilo/volume0.img bs=1M count=200000
losetup /dev/loop0 /var/local/ImageSilo/volume0.img
mkfs -t xfs /dev/loop0
losetup -d /dev/loop0
systemctl stop dcos-mesos-slave.service
rm -f /var/lib/dcos/mesos-resources
rm -f /var/lib/mesos/slave/meta/slaves/latest
echo "/var/local/ImageSilo/volume0.img /dcos/volume0 auto loop 0 2" \| sudo tee -a /etc/fstab
mount /dcos/volume0
shutdown -r now
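After the reboot, a quick sanity check that the loop-backed volume came up and the agent restarted (paths, size, and unit name from the steps above):

df -h /dcos/volume0                         # should show the ~200 GB XFS filesystem
losetup -a                                  # the backing image should be attached to a loop device
systemctl status dcos-mesos-slave.service   # agent should be active again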
Destroy HDFS Service
docker run mesosphere/janitor /janitor.py -r hdfs-role -p hdfs-principal -z hdfs
(run on the leader node; the framework ID can be looked up as shown below) curl -d 'frameworkId=149b8e15-6168-42fb-a2d8-785f4159eaa9-1649' -X POST http://192.168.2.210:5050/master/teardown
docker run mesosphere/janitor /janitor.py -z dcos-service-hdfs
docker run mesosphere/janitor /janitor.py -z hadoop-ha
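The framework ID passed to the teardown endpoint (here and in the Kafka section below) can be looked up from the Mesos master state; for example, assuming jq is installed on the node:

curl -s http://192.168.2.210:5050/master/state | jq '.frameworks[] | {id, name}'   # list active frameworks and their IDs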
Destroy Kafka Service
(run on the leader node) curl -d 'frameworkId=4e99a619-30d4-4d10-a640-78d45a3aab38-0067' -X POST http://192.168.2.210:5050/master/teardown
docker run mesosphere/janitor /janitor.py -z dcos-service-kafka
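To confirm a teardown took effect, the framework should no longer appear in the master's list of active frameworks (same jq assumption as above):

curl -s http://192.168.2.210:5050/master/state | jq '[.frameworks[].name]'   # hdfs/kafka should be gone from this list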
HDFS resiliency model for an HA deployment
A quick summary of the HDFS resiliency model for an HA deployment like yours:
The two NameNodes form an active/standby pair. If the machine hosting the active NameNode restarts, the system detects the failure and the standby takes over as the new active. Once the machine completes its restart, the NameNode process runs again and becomes the new standby. There is no downtime unless both NameNodes are down simultaneously. The data on the host (e.g. the fsimage metadata file) is typically preserved between restarts; if that is not the case in your environment, additional recovery steps are needed to re-establish the standby, such as running the
hdfs namenode -bootstrapStandby
command.
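To check which NameNode currently holds the active role, hdfs haadmin can be queried; nn1 and nn2 below are placeholder service IDs, use the ones defined in your hdfs-site.xml:

hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -getServiceState nn2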
The 3 JournalNodes form a quorum. If one of their machines restarts, the NameNode can continue writing its edit log transactions to the remaining 2 JournalNodes. Once the machine completes its restart, the JournalNode process runs again, catches up on the transactions it missed, and the NameNode resumes writing to all 3. There is no downtime unless 2 or more JournalNodes are down simultaneously. If the data (e.g. the edits files) is not preserved across restarts, the restarted JournalNode catches up by copying from a running JournalNode.
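A restarted JournalNode can be spot-checked over its HTTP interface; the hostname below is a placeholder, and 8480 is the default JournalNode HTTP port, which may differ in your configuration:

curl -s http://journalnode-host:8480/jmx | grep -i journalnode   # JournalNode JMX beans appear once the daemon is back up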
DataNodes are mostly disposable. If a machine restarts, clients are rerouted to other running DataNodes for their reads and writes (assuming the typical replication factor of 3). Once the machine completes its restart, the DataNode process runs again and starts serving read/write traffic from clients. There is no downtime unless a mass simultaneous failure (extremely unlikely and probably correlated with bigger data center problems) takes down every DataNode hosting a replica of a particular block at the same time. If the data (the block file directory) is not preserved across restarts, the restarted process looks like a brand-new DataNode coming online. If that leaves the cluster unbalanced, it can be remedied by running the HDFS Balancer.
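If a restarted DataNode comes back empty and the cluster ends up unbalanced, the report and balancer commands cover those last two points; the 10% threshold is just an example value:

hdfs dfsadmin -report         # confirms the DataNode re-registered and shows per-node usage
hdfs balancer -threshold 10   # moves blocks until nodes are within 10% of average utilization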