Using Storm in Cloudera

I have been looking to use Storm, which ships with the Hortonworks 2.1 installation, but to avoid installing Hortonworks in addition to a Cloudera installation (which already includes Spark), I tried to find a way to use Storm on Cloudera.
If both Storm and Spark can run on a single platform, it will save the additional resources required to keep both Cloudera and Hortonworks installations on a machine.

You can use Storm with a Cloudera installation. You will have to install it yourself and maintain it on your own. It will not be part of the Cloudera stack, but that should not stop you from using it alongside Hadoop if you need it.

You can use Storm on any vendor's platform. However, Storm cluster management is something you have to consider. Storm is not part of the CDH distribution, so Cloudera Manager does not manage the lifecycle of the Storm services and configurations, nor does it monitor the Storm cluster, unless you are willing to write a Cloudera Manager extension yourself. By contrast, if you choose a vendor such as HDP, the Ambari management tool on HDP provides all of these management features.
If you have a streaming project on CDH, you should strongly consider Apache Spark first, as it provides the same programming model for both batch and stream processing, so you do not need to learn a new API. However, Spark Streaming is micro-batch, so in use cases that require sub-second, low-latency real-time processing, Storm is more suitable.
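To illustrate the shared programming model and the micro-batch nature of Spark Streaming, here is a minimal sketch in Java (assuming a Spark 2.x-era Java API; the socket source, host, port, and 5-second batch interval are placeholders):

```java
// Minimal Spark Streaming word count: the same RDD-style operators used in
// batch jobs are applied to each micro-batch of the stream.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");

        // Each batch covers 5 seconds: this is the micro-batch granularity and
        // also the lower bound on end-to-end latency.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        words.mapToPair(w -> new Tuple2<>(w, 1))
             .reduceByKey(Integer::sum)
             .print();   // prints the counts for each micro-batch

        ssc.start();
        ssc.awaitTermination();
    }
}
```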

You can use Storm alongside Cloudera.

All of the above is true, but why would you?
Spark includes Spark Streaming, which allows you to handle both batch data processing and stream/event processing workloads with a single API. Spark and Spark Streaming are already inside CDH.
So why burden yourself with two different APIs?

You can install Apache Storm on Cloudera VM.
For a basic setup and test run, follow the link below:
https://github.com/vrmorusu/StormOnClouderaVM/wiki/Apache-Storm-on-Cloudera-VM
This should get you started on developing Storm applications on Cloudera VM.

Related

Difference between MapR and Cloudera?

Cloudera has a free edition and an enterprise edition, but MapR is almost entirely enterprise. Why? Is there any major difference between them?
Basically, Cloudera and MapR are both big data platforms. Cloudera has three editions: a free edition, an enterprise trial for up to 60 days, and the full enterprise edition. The free edition lacks some services compared with the enterprise edition, and there is no security by default.
http://commandstech.com/mapr-vs-cloudera-vs-hortonworks/
MapR, by contrast, is essentially enterprise-only: it has its own security and built-in services, and it is used mostly in finance domains. It also offers more high availability than Cloudera.
Cloudera is basically just Apache Hadoop including Spark and Hive with some management tools. It is largely limited to HDFS operation.
MapR is a much more versatile system. It supports Apache software like Hadoop, Spark, Hive and Drill, but it goes far beyond that as well. Support for Kubernetes is excellent (including very conventional software like postgres or mySQL) and you can mix and match conventional software with big data software freely. You can also mix in machine learning and AI software without having to copy data around to specialist clusters.
In addition, you can run various HPC (high performance computing) systems directly on MapR without having to convert them to use big data APIs.
Cloudera runs on HDFS, whereas MapR runs on MapR-FS. HDFS is append-only, whereas MapR-FS allows random reads and writes, making it highly efficient. This effectively means MapR can provide the same performance with a much smaller memory requirement than HDFS, and the smallest unit of read/write is much smaller in MapR-FS. HDFS is a distributed file system, but underneath it uses the Linux file system to write data to the actual disk, so there is little control over optimization during the actual writes to the raw disk; MapR, by contrast, has native code that writes directly to the disks in an optimized way. That alone is a big reason for its improved writes. And since that code is written in C, there is no JVM garbage collection involved.
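To make the append-only point concrete, here is a small sketch against the HDFS Java FileSystem API (the path is a placeholder): you can create a file or append to its end, but the API offers no call for in-place random writes.

```java
// Sketch illustrating the append-only write model of HDFS via its Java API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendOnly {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/events.log");

        // Writing a new file: data is streamed out sequentially.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("first record\n");
        }

        // The only way to add data later is to append at the end of the file.
        try (FSDataOutputStream out = fs.append(path)) {
            out.writeBytes("second record\n");
        }

        // There is no "write at offset" style call: random, in-place updates
        // are not part of the HDFS write model.
        fs.close();
    }
}
```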
For further details, see:
https://mapr.com/blog/database-comparison-an-in-depth-look-at-mapr-db/

Where to find IBM WebSphere WMQ 6.0 jar files

I am trying to implement code that sends and receives SOAP messages to IBM MQ. To my knowledge, jar files are required for my code to work, but I could not find anywhere to download the files or to do a full setup of WebSphere MQ 6.0.
Does anyone have any idea how I can get them?
Please be aware that grabbing the jar files from an MQ Server or other installation is not supported by IBM and never has been. However, because it is one of the most commonly used methods to install the MQ client for Java or JMS and fairly common in Java developer culture, IBM has provided a Java-only install option. Please see the Redistributable Clients page in the Knowledge Center for details.
As the name suggests, this install provides an MQ Client package that can be redistributed with independently developed MQ applications. While that is helpful, the main reason IBM provides it is to offer a lightweight install package that:
* Contains the correct and complete set of jar files as packaged by IBM.
* Is intact and verifiable against a known specification and inventory.
* Can reliably be expected to perform as per the documentation set for that version.
* Contains all of IBM's diagnostic utilities, both in the compiled binaries and in the Java classes.
* Contains additional utilities such as GSKit for managing certificates.
* Can be patched using IBM's standard Fix Pack install media so that the integrity of the installed classes and libraries is preserved.
When using IBM's install media and procedure, the result is far more stable, but in the event something does go wrong, the presence of the diagnostic utilities and conformance to a standard install procedure can dramatically reduce outage durations.
Also, there are occasional instances in which a customer with full support entitlements is told that their non-standard installation is not supported and they need to correct it before continuing the PMR. Though this doesn't happen often, in most cases the problem is resolved when the MQ client is installed according to spec. When that doesn't fix it, at least diagnostics can proceed at a faster pace.
The link above has all the details, including links to the client downloads, and is highly recommended reading. You can also go directly to Fix Central for the downloads. Fix Central offers all supported MQ client versions and the relocatable clients come in v8.0 and up. In the download list, look for the "All Java" package.
As Tim noted, mixing client and server versions is supported, provided both client and server are currently in service. Generally you want to develop against the latest version of MQ client because it has the most recent client-side features and will have the longest service life before a version upgrade is required.
Assuming you're on a Unix platform for your queue manager, the client will be found at:
/opt/mqm/java/lib
However, all MQ clients are compatible with all queue manager versions. I strongly recommend you use a client which is still supported, which means 7.1, 7.5, 8.0, or 9.0 at the time of writing. These are freely downloadable from the SupportPac website.
The SupportPacs of interest are those starting 'MQC'. SupportPac MQC8 for example contains the MQ V8.0 client.
Thanks everyone. Just an update to the answer above: in my case, I asked the WebSphere administrator to provide me with the lib folder, which contains all the required MQ jar files.
I asked him for the following files from the C:\Program Files (x86)\IBM\WebSphere MQ\Java\lib\ folder:
* com.ibm.mq.jar
* connector.jar
* com.ibm.mq.jmqi.jar
* com.ibm.mq.headers.jar
* com.ibm.mq.commonservices.jar
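For reference, here is a minimal sketch of how those jars are typically used from the classic com.ibm.mq Java API to put a message; the host, port, channel, queue manager, and queue names below are placeholders:

```java
// Minimal put using the classic com.ibm.mq API shipped in the jars listed above.
import com.ibm.mq.MQEnvironment;
import com.ibm.mq.MQException;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class MqPutExample {
    public static void main(String[] args) throws MQException, java.io.IOException {
        // Client-mode connection settings, read from MQEnvironment before connecting.
        MQEnvironment.hostname = "mqhost.example.com";
        MQEnvironment.port = 1414;
        MQEnvironment.channel = "SYSTEM.DEF.SVRCONN";

        MQQueueManager qmgr = new MQQueueManager("QM1");
        MQQueue queue = qmgr.accessQueue("REQUEST.QUEUE", CMQC.MQOO_OUTPUT);

        MQMessage message = new MQMessage();
        message.writeString("<soap:Envelope>...</soap:Envelope>"); // SOAP payload as text
        queue.put(message);

        queue.close();
        qmgr.disconnect();
    }
}
```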

Can Ambari blueprints be used for creating a Cloudera cluster, or are Puppet/Chef/Ansible-type tools more suitable for a CDH cluster?

I am looking at configuration management options for Cloudera clusters. It looks like Ambari blueprints are suitable for this on an HDP cluster, but there is nothing similar for Cloudera. Is it better to use infrastructure configuration management tools like Puppet, Chef, etc.? I am after a script-based solution rather than a UI wizard. Thanks!
It sounds like you're looking for the Cloudera Manager REST API:
http://www.cloudera.com/documentation/enterprise/latest/topics/cm_intro_java_api.html
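As an example of driving it from a script, the sketch below issues a plain HTTP GET against the REST endpoint to list the clusters a Cloudera Manager instance manages; the host, credentials, and API version (v10) are placeholders, and the default CM port 7180 is assumed:

```java
// Listing clusters via the Cloudera Manager REST API with plain HTTP.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class CmApiExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://cm-host.example.com:7180/api/v10/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Cloudera Manager uses HTTP basic authentication.
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON description of the managed clusters
            }
        }
    }
}
```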

Run multiple instances of IBM BPM

I have IBM Business Process Manager Advanced 7.5 installed.
Question:
Is it possible to install and run a newer version, IBM BPM 8.5, on the same machine?
I am worried about port conflicts (for example, port 9043 for the IBM Console).
Maybe I should ask how to change the default port configuration?
Please help.
Technically it may be possible, but I suggest you do not do this: IBM BPM requires a lot of system resources to run, and installing two versions of it can make the system slower than ever.
However, I have seen multiple instances of the same IBM BPM version running in a single cluster on a server VM. That setup is stable in practice and has been in use for a considerable time.
PS: I have administered a large IBM BPM infrastructure containing 80+ IBM BPM servers.
As Gas already commented, in theory this is possible. But you have to be aware that IBM BPM does not only use the specified ports for web access; it also uses ports for internal communication. In my opinion, this is not an easy task to get right.
On the other hand, the system requirements for IBM BPM are quite demanding for the server. If you want to run both instances in parallel, make sure your server is up to it; WebSphere is kind of greedy and not really designed to share its resources ;)
Yes, you can run multiple versions of BPM on the same system. The primary concerns are going to be port conflict and OS system resources. Use the BPMConfig to create a new profile and installation that is on different ports. On my lab machines with VMs, I install all the BPM installs with the default ports and only have one (1) running at a time. If I need 2, I just spin up a new VM from the base template and go from there.
By default, port conflicts are handled by the WebSphere Application Server code. If needed, you can specify "initialPortAssignment" for the Dmgr, nodes, and cluster members while creating the environment using the BPMConfig command. You can even specify exact port numbers using the sample configuration properties file:
https://www.ibm.com/support/knowledgecenter/en/SSFPJS_8.6.0/com.ibm.wbpm.ref.doc/topics/samplecfgprops.html
You can also provide WebSphere options such as "-startingPort starting_port | -portsFile ports_file_path | -defaultPorts" for the Dmgr (bpm.dmgr.profileOptions=) and for nodes (bpm.de.node.#.profileOptions) in the BPMConfig properties file. Cluster members only have an option to indicate the starting port.
Ref: https://www.ibm.com/support/knowledgecenter/cs/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/rxml_manageprofiles.html
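As a rough illustration, and with placeholder values, the port-related entries in the BPMConfig properties file might look like the fragment below (see the sample configuration properties file linked above for the authoritative key names):

```
# Illustrative fragment only; values are placeholders.
# Pass WebSphere profile options for the deployment manager and for a node:
bpm.dmgr.profileOptions=-startingPort 10200
bpm.de.node.1.profileOptions=-startingPort 10400
# Cluster members take only a starting/initial port assignment, as noted above.
```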
I would not advise on changing the port numbers once you start using the BPM environment.
As indicated by others make sure you have enough resources if you are planning to run both environments at the same time.
Yes, I am using two versions for evaluation. Port conflicts can be handled using the server console (WebSphere Integrated Solutions Console) or the BPMConfig utility.

What is a recommended Storm distribution?

I want to try installing Storm.
Does Storm have distributions like Hadoop does (Cloudera, MapR, etc.)?
Or should I install everything myself (ZeroMQ, JZMQ, etc.)?
What about versions? Where can I find which versions to use?
I see that Storm is at 0.8.1, while ZeroMQ is already at version 3.2.2.
The Storm-starter project on GitHub is a good place to start. You can easily deploy and run local topologies (entirely on your own machine). It is useful for getting your first topology up and running.
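If you just want to see a topology running locally, here is a minimal, self-contained sketch against the 0.8.x-era backtype.storm API (component and class names are illustrative; newer releases moved the packages to org.apache.storm):

```java
// A trivial topology run entirely in-process with LocalCluster.
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class LocalTopologyExample {

    // Spout that emits a fixed greeting once per second.
    public static class GreetingSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello storm"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt that simply prints whatever it receives.
    public static class PrinterBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("sentence"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // No downstream output.
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("greeting", new GreetingSpout());
        builder.setBolt("printer", new PrinterBolt()).shuffleGrouping("greeting");

        LocalCluster cluster = new LocalCluster();   // runs entirely in-process
        cluster.submitTopology("local-demo", new Config(), builder.createTopology());

        Utils.sleep(10000);                          // let it run for a few seconds
        cluster.shutdown();
    }
}
```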
If you want to deploy Storm to Amazon AWS you should take a look at the Storm-deploy project. This will take care of the installation of the correct dependencies on AWS (Zookeeper, etc.).
There's a steep enough learning curve, but if you work through the online documentation you should be able to get the sample topology deployed to AWS pretty quickly.
The Storm wiki is the primary source of Storm documentation.
