I am getting confused between high availability of HDFS and of the NameNode. Are these two things one and the same, or different?
More or less. In a standard cluster, when the NameNode is down (it is a single point of failure), the whole HDFS cluster is down, because no other role/node can take over its job.
So when we say HDFS High Availability, we mean running another, standby NameNode that can replace the active one once it goes down.
So to answer your question: yes, whether you call it 'HDFS NameNode High Availability', 'HDFS HA', or 'NameNode HA', you are pointing to the same thing: keeping the HDFS cluster working when the NameNode host is down.
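For what it's worth, once HA is set up you can check and switch the roles from the command line. A minimal sketch, assuming the two NameNodes were registered with the IDs nn1 and nn2 (placeholders):

```
# Ask each NameNode which role it currently holds (active or standby)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually fail over from nn1 to nn2 (with automatic failover enabled,
# the ZooKeeper Failover Controllers do this for you when nn1 dies)
hdfs haadmin -failover nn1 nn2
```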
HDFS is the distributed file system of the Hadoop project.
HDFS handles distributed storage, i.e., it stores data as blocks across the cluster nodes.
HDFS has a master-slave architecture. It has one or more masters, i.e., NameNode(s), and one or more slave nodes, i.e., DataNodes.
HDFS has two types of data:
Metadata - Managed by NameNode(s)
Data - Managed by DataNodes
In HDFS, metadata plays an important role in the storage and retrieval of the actual data.
So the availability of the NameNode is critical to the health of the entire cluster.
To make the NameNode highly available, HDFS introduces HDFS High Availability, also called NameNode High Availability.
Note: HDFS HA and NameNode HA refer to the same topic.
HDFS High Availability provides the option of running two NameNodes in the same cluster, in an active/passive configuration.
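To make the active/passive pair concrete, here is a minimal sketch of the hdfs-site.xml properties involved, assuming a nameservice named mycluster, NameNode IDs nn1 and nn2, and hosts nn-host1/nn-host2 (all placeholders); the shared-edits (JournalNode) and fencing settings are omitted for brevity:

```xml
<!-- Logical name for the pair of NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>

<!-- The two NameNode IDs behind the nameservice -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>

<!-- RPC address of each NameNode -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn-host1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn-host2:8020</value>
</property>

<!-- Lets clients discover which NameNode is currently active -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- Optional: automatic failover via ZooKeeper Failover Controllers -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

Clients then address the filesystem as hdfs://mycluster, and the failover proxy provider resolves whichever NameNode is currently active.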
My understanding is that both would refer to the same phenomenon.
You could get a better understanding by referring to the Cloudera documentation here.
I want to set up HA for Airflow (2.3.1) on CentOS 7. Messaging queue - RabbitMQ, metadata DB - Postgres. Does anybody know how to set it up?
Your question is very broad, because high availability has multiple levels and definitions:
Airflow availability: multiple schedulers, multiple workers, auto scaling to absorb load spikes, enough storage volume, ...
The databases: an HA cluster for RabbitMQ and an HA cluster for Postgres
Even if you have the first two levels, how many nodes do you want to use? You cannot put everything on the same node; you need to run one service replica per node.
Suppose you did that, and now you have 3 different nodes running in the same data center. What if there is a fire in the data center? So you need to use nodes in different regions.
After doing all of the above, is there still a risk of network problems? Of course there is.
If you just want to run Airflow in HA mode, you have multiple options to do that on any OS:
docker compose: usually used for development, but it can work in production too; you can create multiple scheduler instances with multiple workers, which improves the availability of your service (see the sketch after this list)
docker swarm: similar to docker compose with additional features (scaling, multiple nodes, ...); you will not find many resources on installing Airflow with it, but you can reuse the compose files with a few changes
kubernetes: the best solution; K8s helps you ensure the availability of your services, and it is easy to install Airflow with Helm
or just running the different services on your host: not recommended, because of the manual work involved, and applying HA is complicated
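As a rough illustration of the docker compose option, here is a trimmed-down sketch (not a complete production file) that runs two scheduler replicas and a Celery worker against the RabbitMQ broker and Postgres metadata DB you mentioned; hostnames and credentials are placeholders, and the usual init step (airflow db upgrade, user creation) is omitted:

```yaml
version: "3"

x-airflow-common: &airflow-common
  image: apache/airflow:2.3.1
  environment:
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: amqp://guest:guest@rabbitmq:5672//
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  rabbitmq:
    image: rabbitmq:3-management

  # Two schedulers: Airflow 2.x supports several running in parallel for HA
  scheduler-1:
    <<: *airflow-common
    command: scheduler

  scheduler-2:
    <<: *airflow-common
    command: scheduler

  # Celery worker; add more worker services (or scale this one) for capacity
  worker:
    <<: *airflow-common
    command: celery worker

  webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
```

Airflow 2.x schedulers coordinate through the metadata database, so running two of them gives you an active/active pair, and you can add workers for throughput.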
You can use ceph-volume lvm create --filestore --data example_vg/data_lv --journal example_vg/journal_lv to create a Ceph volume, but I want to know how many volumes Ceph can support. Can it be infinite?
Ceph can serve an infinite number of volumes to clients, but that is not really your question.
ceph-volume is used to prepare a disk to be consumed by Ceph for serving capacity to users. The prepared volume will be served by an OSD and will join the RADOS cluster, adding its capacity to the cluster's.
If your question is how many disks you can attach to a single cluster today, the sensible answer is “a few thousand”. You can push farther using a few tricks. Scale increases over time, but I would say 2,500-5,000 OSDs is a reasonable limit today.
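If the practical question is how many OSDs your cluster is serving right now and how full they are, a quick check looks like this (output formats vary slightly by release):

```
ceph osd stat   # total OSD count and how many are up/in
ceph osd tree   # how OSDs are spread across hosts/racks in the CRUSH map
ceph df         # raw and per-pool capacity
```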
I have three workloads.
DATACENTER1: sharing data via REST services - streaming ingest
DATACENTER2: bulk load - analysis
DATACENTER3: research
I want to isolate the workloads, so I am going to create one datacenter for each workload.
The objective is to prevent a heavy process from consuming all the resources and to guarantee high availability of the data.
Has anyone already tried this?
During a bulk load on DATACENTER2, is data availability still good on DATACENTER1?
The short answer is that one workload won't disrupt load across datacenters. How it works is as follows:
Conceptually, when you create a keyspace, Cassandra creates a virtual data center (VDC). Nodes with similar workloads must be assigned to the same VDC. Segregating workloads ensures that exactly one workload is ever executed in a given VDC. As long as you follow this pattern, it works.
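In practice the segregation shows up in the keyspace's replication settings. A minimal sketch using the data center names from the question (the keyspace name and replication factors are placeholders to tune):

```
-- Replicate the keyspace into each (virtual) data center so every
-- workload can read and write locally within its own DC.
CREATE KEYSPACE app_data WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DATACENTER1': 3,
  'DATACENTER2': 3,
  'DATACENTER3': 2
};
```

Each workload's clients then use a DC-aware load-balancing policy with LOCAL_QUORUM (or LOCAL_ONE) consistency, so a bulk load coordinated in DATACENTER2 does not drag coordination work into DATACENTER1.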
Data sync needs to be monitored under load on busy nodes, but that's a normal concern in any Cassandra deployment.
DataStax Enterprise also supports this model, as can be seen here:
https://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/deploy/deployWkLdSep.html#deployWkLdSep__srchWkLdSegreg
I am a newbie to both Hadoop and Virtual Machines (VMs). I would like to have a Hadoop cluster with 4-5 nodes. What I understand is that each node is commodity hardware (a PC running Unix). My thought is: is it possible to create 4-5 VMs on an external HDD, use them as nodes for a Hadoop cluster, and host big data applications on them? If so, what are the general steps I would take to achieve this VM-based Hadoop cluster?
That would be plain wrong.
The idea of clustering is to increase the available computational power by using multiple physical machines and let them communicate in a manner that allows the overall problem to be split among them.
Now, if you just use four or five VMs on the same physical PC, you're not getting more CPU power than what you'd get if you'd just let your stuff run locally with only a single node -- you're getting less.
I am working with Flume to ingest a ton of data into HDFS (on the order of petabytes). I would like to know how Flume makes use of its distributed architecture. I have over 200 servers, and I have installed Flume on one of them, which is where I get the data from (i.e., the data source); the sink is HDFS. (Hadoop is running over Serengeti on these servers.) I am not sure whether Flume distributes itself over the cluster or whether I have installed it incorrectly. I followed Apache's user guide for the Flume installation and this SO post.
How to install and configure apache flume?
http://flume.apache.org/FlumeUserGuide.html#setup
I am a newbie to Flume and trying to understand more about it. Any help would be greatly appreciated. Thanks!
I'm not going to speak to Cloudera's specific recommendations but instead to Apache Flume itself.
It's distributed however you decide to distribute it. Decide on your own topology and implement it.
You should think of Flume as a durable pipe. It has a source (you can choose from a number), a channel (you can choose from a number) and a sink (again, you can choose from a number). It is pretty typical to use an Avro sink in one agent to connect to an Avro source in another.
Assume you are installing Flume to gather Apache webserver logs. The common architecture would be to install Flume on each Apache webserver machine. You would probably use the Spooling Directory Source to get the Apache logs and the Syslog Source to get syslog. You would use the memory channel for speed and so as not to affect the server (at the cost of durability) and use the Avro sink.
That Avro sink would be connected, via Flume load balancing, to 2 or more collectors. The collectors would be Avro source, File channel and whatever you wanted (elasticsearch?, hdfs?) as your sink. You may even add another tier of agents to handle the final output.
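To make that topology concrete, here is a minimal sketch of the two agent configuration files (hostnames, ports and paths are placeholders, and the sink group needed for load balancing across collectors is omitted for brevity):

```
# --- webserver tier, e.g. weblog-agent.conf (one per Apache host) ---
a1.sources  = apache
a1.channels = mem
a1.sinks    = avro-out

a1.sources.apache.type     = spooldir
a1.sources.apache.spoolDir = /var/log/apache2/completed
a1.sources.apache.channels = mem

a1.channels.mem.type     = memory
a1.channels.mem.capacity = 10000

a1.sinks.avro-out.type     = avro
a1.sinks.avro-out.hostname = collector01.example.com
a1.sinks.avro-out.port     = 4141
a1.sinks.avro-out.channel  = mem

# --- collector tier, e.g. collector-agent.conf (run 2 or more of these) ---
c1.sources  = avro-in
c1.channels = fc
c1.sinks    = hdfs-out

c1.sources.avro-in.type     = avro
c1.sources.avro-in.bind     = 0.0.0.0
c1.sources.avro-in.port     = 4141
c1.sources.avro-in.channels = fc

c1.channels.fc.type = file

c1.sinks.hdfs-out.type                   = hdfs
c1.sinks.hdfs-out.hdfs.path              = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
c1.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
c1.sinks.hdfs-out.channel                = fc
```

Each agent is started with flume-ng agent --conf conf --conf-file <its file> --name a1 (or c1).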
In current versions, Apache Flume no longer follows a master-slave architecture; that design was dropped as of Flume 1.x.
There is no longer a Master and no ZooKeeper dependency. Flume now runs with a simple file-based configuration system.
If we want it to scale, we need to install it on multiple physical nodes and run our own topology. As far as a single node is concerned:
Say we hook into a JMS server that produces 2000 XML events per second, and I need two Flume agents to consume that data; then I have two distribution options (a launch sketch follows the list):
Two Flume agents started and running to consume the JMS data on the same physical node.
Two Flume agents started and running to consume the JMS data on two different physical nodes.
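For illustration, the single-node option might be launched like this (the config file names and agent names are made up; the JMS connection details would live inside those files):

```
# Two independent agents on the same host, each reading from the JMS source
flume-ng agent --conf conf --conf-file jms-agent-1.conf --name agent1 &
flume-ng agent --conf conf --conf-file jms-agent-2.conf --name agent2 &
```

The two-node option is the same two commands run on two different hosts, which also removes the single point of failure that a shared host would be.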