What is a recommended Storm distribution?

I want to try installing Storm.
Does Storm have distributions the way Hadoop does (Cloudera, MapR, etc.)?
Or should I install everything myself (ZeroMQ, JZMQ, etc.)?
Which versions should I use, and where can I find them?
I see that Storm is at 0.8.1, while ZeroMQ is already at version 3.2.2.

The storm-starter project on GitHub is a good place to start. You can easily run topologies in local mode (entirely on your own machine), which is useful for getting your first topology up and running.
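For a sense of what local mode looks like, here is a minimal sketch in the style of the storm-starter examples, assuming the 0.8.x-era API (backtype.storm.* packages; later releases moved to org.apache.storm.*). TestWordSpout ships with Storm, and ExclaimBolt is just an illustrative stand-in for your own bolt.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class LocalTopologyExample {

    // A trivial bolt that appends "!!!" to every word it receives.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0) + "!!!"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("exclaim", new ExclaimBolt(), 2).shuffleGrouping("words");

        Config conf = new Config();
        conf.setDebug(true);

        // Local mode: the whole "cluster" runs inside this JVM, no installation needed.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("exclamation-local", conf, builder.createTopology());
        Thread.sleep(10000); // let it run for a few seconds, then clean up
        cluster.killTopology("exclamation-local");
        cluster.shutdown();
    }
}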
If you want to deploy Storm to Amazon AWS, take a look at the storm-deploy project; it takes care of installing the correct dependencies (ZooKeeper, etc.) on AWS.
The learning curve is fairly steep, but if you work through the online documentation you should be able to get the sample topology deployed to AWS fairly quickly.
The Storm wiki is the primary source of Storm documentation.

Related

What is the difference between GitHub Enterprise 'Cloud' and 'On Premise'?

We are investigating how to integrate our app with GitHub Enterprise.
There are two different deployment models: 'Cloud' and 'On Premise'.
I have been looking around but couldn't find the differences between the two.
Maybe there is no such difference?
The basic difference is that GitHub Enterprise Server is software you deploy on a virtual machine you provision and control (on-premise here is a bit of a misnomer since your VM could be in AWS).
GitHub Enterprise Cloud, on the other hand, is an enterprise-level tier of service hosted at GitHub.com.
You'll find more here.

Is there a package to connect R to AWS SSM?

Python has an SSM client in the Boto 3 package: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ssm.html. Is there something similar for R? If not, any recommendations for how to make something similar? Thanks!
At the moment there is a package called paws on GitHub:
Access over 150 AWS services, including Machine Learning, Translation, Natural Language Processing, Databases, and File Storage.
Or the cloudyr project:
Welcome to the cloudyr project! The goal of this initiative is to make cloud computing with R easier, starting with robust tools for working with cloud computing platforms. The project's initial work is with Amazon Web Services, various crowdsourcing platforms, and popular continuous integration services for R package development. Tools for Google Cloud Services and Microsoft Azure are also on the long-term agenda.
I have only checked paws, and it does provide SSM functionality according to the paws documentation on SSM. The cloudyr project has many AWS packages on CRAN; I am not sure whether any of them includes SSM functionality.

How to setup the infrastructure with blade servers for OpenStack

We have 24 Huawei CH242 V3 blade servers and want to set up a private cloud with OpenStack, but we're very new to OpenStack and have very little experience with infrastructure. Could somebody kindly give us some useful information about the following questions:
What kind of OS is most suitable for these blade servers? Is a Linux distribution like CentOS a good choice?
Is it OK (or encouraged) to use the blade servers directly as OpenStack controller/compute/storage nodes? Or do we need to use a hypervisor to create many VMs and install the OpenStack services on top of the VMs?
What best practices or suggestions would you give to beginners?
Some of these questions may be very basic, but we're really stuck on the first step. Thanks in advance for any information.
Below are my suggestions; there may be other good answers too.
What kind of OS is most suitable for these blade servers? Is a Linux distribution like CentOS a good choice?
You can try any of the Linux flavours (openSUSE/CentOS/Ubuntu) mentioned on the official OpenStack site. I personally used Ubuntu for installing OpenStack.
There are openly available Juju charms that work on Ubuntu for installing OpenStack services, so it will be easy for you to edit the charms and deploy them.
Is it OK (or encouraged) to use the blade servers directly as OpenStack controller/compute/storage nodes? Or do we need to use a hypervisor to create many VMs and install the OpenStack services on top of the VMs?
From your list of choices I would prefer a VM-based installation, though I personally suggest deploying your OpenStack services in containers for better performance.
For the compute service you can go with a bare-metal installation, but that is up to you.
What best practices or suggestions would you give to beginners?
a. Try installing the same topology/setup as described in the OpenStack documentation.
b. Use the recommended databases and AMQP brokers.
What kind of OS is most suitable for these blade servers? Is a Linux distribution like CentOS a good choice?
I use CentOS 7.2, which is very stable for OpenStack; Ubuntu, which I have also tried, is stable as well.
Is it OK (or encouraged) to use the blade servers directly as OpenStack controller/compute/storage nodes? Or do we need to use a hypervisor to create many VMs and install the OpenStack services on top of the VMs?
Yes, I do it like this: I use bare-metal machines as controller/compute/storage nodes, and the performance is good for me. I did not use containers like Docker.
What best practices or suggestions would you give to beginners?
Because you are new to OpenStack, I recommend you begin by installing OpenStack and reading the logs closely as you install it. Reading the official documentation is necessary, but be aware that there are some errors in the docs and the configuration is not optimized; it is just for experimenting with a private cloud.
Once you are comfortable installing OpenStack, you can read the source code on GitHub and try to contribute, starting with fixing typos in the docs.

RStudio and git workflow on a closed network

Is anyone aware of an R-based data analysis setup that works well in a research data centre with no internet access? I would like to follow good reproducible-analysis practices, but I do not have permission to upload files to a repository, for example. Also, RStudio preferences (such as the path to a local package repository) are not saved. So far I know:
The miniCRAN package helps with gathering all R package dependencies.
Local git version control does not require internet access. (Plus, the data centre technician may be willing to use it to release results and R scripts if the learning curve is outweighed by time savings in the long run.)
I am considering writing up a proposal for a pilot project, but I don't want to reinvent the wheel, especially if a more comprehensive platform already exists. I have looked but haven't found one, unless Microsoft R Open essentially does this (it's not clear to me).
Thanks for your consideration!

Using Storm in Cloudera

I have been looking to use Storm, which is available with the Hortonworks 2.1 installation, but to avoid installing Hortonworks in addition to a Cloudera installation (which includes Spark), I tried to find a way to use Storm in Cloudera.
If one can use both Storm and Spark on a single platform, it will save the additional resources required to run both Cloudera and Hortonworks installations on one machine.
You can use Storm with a Cloudera installation. You will have to install it on your own and maintain it yourself; it will not be part of the Cloudera stack, but that should not stop you from using it alongside Hadoop if you need it.
You can use Storm on any vendor's platform. However, Storm cluster management is something you have to consider. Storm is not part of the CDH distribution, so Cloudera Manager does not manage the lifecycle of the Storm services and configurations, nor does it monitor the Storm cluster, unless you are willing to write a Cloudera Manager extension yourself. In contrast, if you choose a vendor such as HDP, the Ambari management tool on HDP provides all of these management features.
If you have a streaming project on CDH, you should strongly consider Apache Spark first, as it provides the same programming model for both batch and streaming processing, so you do not need to learn a new API. However, Spark Streaming is micro-batch, so for use cases that require sub-second, low-latency real-time processing, Storm is more suitable.
You can use Storm alongside Cloudera.
All the above are true, but why would you?
Spark includes Spark Streaming, which lets you handle batch data processing and stream/event processing workloads with a single API. Spark Streaming is already included in CDH.
So, why burden yourself with two different APIs?
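To illustrate the single-API point, here is a minimal Spark Streaming word count in Java; it is only a sketch, assuming a Spark 2.x-era API, and the socket source, local master, and 5-second batch interval are purely illustrative.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        // local[2] is only for trying this out on a laptop; drop setMaster when
        // submitting to a real cluster with spark-submit.
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");

        // Micro-batch interval: Spark Streaming processes data in small batches
        // (here every 5 seconds) rather than tuple-at-a-time like Storm.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Read lines from a socket (e.g. `nc -lk 9999`), purely for illustration.
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.print(); // print the counts for each micro-batch
        ssc.start();
        ssc.awaitTermination();
    }
}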
You can install Apache Storm on the Cloudera VM.
For a basic setup and test run, follow the link below:
https://github.com/vrmorusu/StormOnClouderaVM/wiki/Apache-Storm-on-Cloudera-VM
This should get you started on developing Storm applications on the Cloudera VM.
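Once Storm is running on the VM, moving a topology from local testing to the cluster is mostly a matter of swapping LocalCluster for StormSubmitter and packaging the code as a jar for the storm jar command. A minimal sketch, again assuming the 0.8.x-era API and using the built-in TestWordSpout purely for illustration:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.TopologyBuilder;

public class SubmitToCluster {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout-only topology to keep the sketch short; wire in your own bolts here.
        builder.setSpout("words", new TestWordSpout(), 2);

        Config conf = new Config();
        conf.setNumWorkers(2); // worker processes spread across the cluster

        // Submits to the Nimbus host configured in storm.yaml. Package this class
        // into a jar and run: storm jar my-topology.jar SubmitToCluster
        StormSubmitter.submitTopology("word-topology", conf, builder.createTopology());
    }
}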
