Difference between MapR and Cloudera?

Cloudera comes in a free edition and an enterprise edition, but MapR is almost entirely an enterprise product. Why? Is there any major difference between them?

Basically, Cloudera and MapR are both big data platforms. Cloudera has three editions: a free edition, an enterprise edition that runs as a trial for up to 60 days, and the full enterprise edition. The free edition lacks some services that the enterprise edition includes, and security is not enabled by default.
http://commandstech.com/mapr-vs-cloudera-vs-hortonworks/
MapR is sold almost entirely as an enterprise edition because it ships with its own security and built-in services, and it is used mostly in the finance domain. Its high availability is also better than Cloudera's.

Cloudera is basically just Apache Hadoop including Spark and Hive with some management tools. It is largely limited to HDFS operation.
MapR is a much more versatile system. It supports Apache software like Hadoop, Spark, Hive and Drill, but it goes far beyond that as well. Support for Kubernetes is excellent (including very conventional software like PostgreSQL or MySQL) and you can mix and match conventional software with big data software freely. You can also mix in machine learning and AI software without having to copy data around to specialist clusters.
In addition, you can run various HPC (high performance computing) systems directly on MapR without having to convert them to use big data APIs.

Cloudera runs on HDFS whereas MapR runs on MapR-FS. HDFS is append-only whereas MapR-FS allows random reads and writes, which makes it highly efficient. This effectively means MapR can provide the same performance with a much smaller memory requirement than HDFS. The smallest unit of read/write is also much smaller in MapR-FS. HDFS is a distributed file system, but underneath it uses the Linux file system to write data to the actual disk, so it has little control over how writes to the raw disk are optimized. MapR has native code that writes directly to the disks in an optimized way. This by itself is a big reason for the improved write performance. Since the code is written in C, there is also no JVM garbage collection to deal with.
For further details, see this link:
https://mapr.com/blog/database-comparison-an-in-depth-look-at-mapr-db/
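To make the append-only versus random-write distinction concrete, here is a minimal sketch on an ordinary local file. It is only an analogy: it does not use the HDFS or MapR-FS APIs, and the function names are made up for illustration. It compares how many bytes have to be written to change five bytes in the middle of a 1 KB file under each model.

```python
import os
import tempfile


def update_in_place(path, offset, data):
    """Random write: seek to the offset and overwrite only those bytes."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return len(data)  # bytes written


def update_append_only(path, offset, data):
    """Append-only model: existing bytes cannot be modified, so the whole
    file is rewritten into a new file that then replaces the old one."""
    with open(path, "rb") as f:
        old = f.read()
    new = old[:offset] + data + old[offset + len(data):]
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(new)
    os.replace(tmp, path)
    return len(new)  # bytes written


def demo():
    """Apply the same 5-byte change with both strategies and compare the cost."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.bin")
        b = os.path.join(d, "b.bin")
        for p in (a, b):
            with open(p, "wb") as f:
                f.write(b"x" * 1024)
        wrote_a = update_in_place(a, 100, b"HELLO")
        wrote_b = update_append_only(b, 100, b"HELLO")
        with open(a, "rb") as fa, open(b, "rb") as fb:
            same = fa.read() == fb.read()
        return wrote_a, wrote_b, same
```

Both strategies end with identical file contents, but the in-place update writes 5 bytes while the rewrite touches all 1024. Real distributed file systems add replication and block management on top, so treat this only as an illustration of the idea.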

Related

Open Stack and Development Stack

I am really confused about the main difference between Open Stack and Development Stack.
Are they different versions or different platforms?
First of all, let's get the terminology correct. You are talking about OpenStack (one word) and DevStack (one word).
OpenStack is a suite of services for managing computing and related resources. You typically use it for managing "cloud computing" resources, consisting of multiple physical compute and storage servers. Setting up a system like this for production use on "real" hardware takes a lot of time and expertise.
DevStack is essentially a set of scripts that will set up a simple OpenStack test environment running as virtual machines on (typically) a Linux PC with a few GB of RAM and a chunk of free disk space. The purpose is to allow a developer to set up a tiny cloud for trying things out, and doing service development work.
OpenStack and DevStack are neither different versions nor different platforms. They are different parts of a larger whole. (You could say that DevStack is part of the OpenStack suite of software. Development of OpenStack and DevStack is handled by the same umbrella organization under the same governance, etc, etc.)

Using Storm in Cloudera

I have been looking to use Storm which is available with Hortonworks 2.1 installation but in order to avoid installing Hortonworks in addition to a Cloudera installation (which has Spark in it), I tried to find a way to use Storm in Cloudera.
If one can use both Storm and Spark on a single platform then it will save additional resources required to have both Cloudera and Hortonworks installations on a machine.
You can use Storm with a Cloudera installation. You will have to install it on your own and maintain it as such. It will not be part of the Cloudera stack but that should not stop you from using it along with Hadoop if you need it.
You can use Storm on any of the vendor platforms. However, Storm cluster management is something you have to consider. Storm is not part of the CDH distribution. Cloudera Manager does not manage the lifecycle of the Storm services and configurations, nor does it monitor the Storm cluster, unless you are willing to write a Cloudera Manager extension yourself. By contrast, if you choose a vendor such as HDP, the Ambari management tool on HDP provides all of the above management features.
If you have a streaming project on CDH, you should strongly consider Apache Spark first, as it provides the same programming model for both batch and streaming processing. You do not need to learn a new API. However, Apache Spark Streaming is micro-batch. Thus, in use cases that require sub-second, low-latency real-time processing, Storm is more suitable.
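The latency point can be illustrated with a toy model. This is not Spark or Storm code; the function names and the numbers are assumptions chosen for illustration. Times are in milliseconds. In the micro-batch model an event is only handled when its batch window closes, so its wait depends on where in the window it arrived. In the per-event model each event is handled on arrival and only pays the processing time.

```python
def micro_batch_latencies(arrivals_ms, batch_interval_ms):
    """Wait time per event when events are handled at the end of their batch window.

    Processing time inside the batch is ignored to keep the model small.
    """
    latencies = []
    for t in arrivals_ms:
        window = t // batch_interval_ms           # index of the window containing t
        window_end = (window + 1) * batch_interval_ms
        latencies.append(window_end - t)
    return latencies


def per_event_latencies(arrivals_ms, processing_ms):
    """Wait time per event when each event is handled as soon as it arrives."""
    return [processing_ms for _ in arrivals_ms]


arrivals = [0, 250, 900, 1000, 1999]              # hypothetical arrival times
batch = micro_batch_latencies(arrivals, 1000)     # 1-second batch interval (assumed)
single = per_event_latencies(arrivals, 5)         # 5 ms per event (assumed)
```

With a 1-second interval, an event arriving right at the start of a window waits the full second, while one arriving just before the window closes waits about a millisecond. Shrinking the interval reduces that wait but cannot remove it, which is why a per-event engine fits sub-second requirements better.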
You can use Storm alongside Cloudera.
All the above are true, but why would you?
Spark includes Spark Streaming, which allows you to handle data processing and stream/event processing workloads using a single API. Spark/Streaming is already inside CDH.
So, why burden yourself with two different APIs?
You can install Apache Storm on the Cloudera VM.
For a basic setup and test run, follow the link below:
https://github.com/vrmorusu/StormOnClouderaVM/wiki/Apache-Storm-on-Cloudera-VM
This should get you started on developing Storm applications on Cloudera VM.

Neo4j replication alternative to Neo4j Enterprise edition?

It seems Neo4j High Availability is only available in the Enterprise edition, which is paid. Is there another way to achieve replication without that module (i.e. without cost)? Thanks for any help!
Update:
This answer has changed. Neo4j is now open core, so the Enterprise code is no longer dual-licensed - only the commercial license option remains.
You can find more details here: https://neo4j.com/open-core-and-neo4j/
Original Answer:
Enterprise is available as quid-pro-quo - if you put your code out under an open source license, then you get access to the open source Neo4j Enterprise free of charge. However, if you are closed source, Neo Tech charges a license fee. This fee is determined by your needs and your ability to pay - if you are a small outfit with no venture capital, it's still free, and then the licensing cost increases as your ability to pay back to the development of Neo4j increases.
If your application is open-source as you mention, then you are free to use Neo4j Enterprise without paying for it, simply download it at neo4j.org.
Actually Neo4j Enterprise is free under the open source AGPLv3 license.
Neo4j Inc can't modify the terms and still call it AGPL.
If you use Neo4j Enterprise as a server (like most people do) and communicate with it via its REST API or any of the official BOLT drivers then you never trigger AGPL's copyleft requirements.
In other words - the software that connects to it does not have to be open sourced.
You can download Neo4j Enterprise open source licensed binaries up to version 3.2.x from dist.neo4j.org. The links for the Windows and Unix packages are below. (Replace the version number for specific versions.)
http://dist.neo4j.org/neo4j-enterprise-3.2.8-windows.zip
http://dist.neo4j.org/neo4j-enterprise-3.2.8-unix.tar.gz
If you want Neo4j Enterprise 3.3.0 and later under its free open source license, then you can build it from source like we do for our US government clients, or just grab it from our free distribution site.
Check out the blog post if you want to understand why this has happened.
https://blog.igovsol.com/2017/11/14/Neo4j-330-is-out-but-where-are-the-open-source-enterprise-binaries.html

Alfresco Community Enterprise Feature Comparison

I've seen this question but the answers are simply not good enough. I've searched the web and could not find a clear listing of the main differences.
I am particularly surprised to see contradictions in the above link, that holds only 4 short answers.
So the question is, beyond support, what are (all) the differences between Alfresco Community and Enterprise editions (for the current versions of course)?
Are there functional or technical features that are available in the Enterprise edition but not in the Community edition?
I find it strange that it's so difficult to get a clear list. Looking at the forums to find this answer is not a serious option from a business perspective.
Until now, I found this link to be useful, but it's from 2009.
In particular, I find the platform support interesting, with the community edition supporting only lamp stuff:
Linux
MySQL
Tomcat
OpenLDAP
Firefox
And the enterprise edition supporting:
Windows
SQL Server
WebLogic, WebSphere
AD/Kerberos
IE and Safari
Apparently, these features are only available in the enterprise edition:
JMX monitoring
Runtime administration: What's that exactly? And what's in the community edition then?
Runtime indexing consistency check and update: What's in the community edition then?
High performance and availability: How is that implemented and what's in the community edition then?
Storage policies
Open source and proprietary technology stack support: which ones exactly? Which ones are supported in the community edition?
If anyone could guide me towards serious documentation about these differences, that would be great.
I also went through the wiki but could not find an answer to my questions in there.
Differences between Enterprise and Community vary in detail from version to version and are mainly visible to administrators. We see or maintain both flavors of Alfresco in midsize to very large environments, and I would say the best edition for you is more or less a question of taste and budget. Excellent skills in infrastructure and Java are highly advisable for running either edition of Alfresco in production.
The technical differences are not so dramatic that you could not provide very similar functionality to users with either edition. So if you are actually making a decision, you should focus on a good technical partner, the support services, and maybe the fact that you only get official patches with the Enterprise subscription, not with Community. BTW, Alfresco Enterprise is not open source, but this is not a real point of interest for most end users. You can access the code as a subscription customer, but it is not publicly available.
The main differences in features are already named more or less:
Administration
Enterprise has more views and settings in the admin web GUI. In Community you can access most configuration only from the command line. This may be a restriction, but in real life administrators prefer the command line and scripted automation.
Enterprise lets you change some Alfresco settings at runtime (most settings still require a restart). Some can be changed in the GUI and more through the JMX interface. You are also able to stop and start subsystems like the CIFS protocol server. We use this feature to switch a system into read-only mode. This is what is meant by "runtime administration". Community requires a restart of the service for most configuration changes. It is possible to work around this with advanced scripting (for example in Groovy) or by implementing modules.
Indexing
Runtime indexing consistency check and update is not the self-healing functionality you might expect. You will have to learn (at least for now) that you need to recreate the Alfresco index from time to time even in Enterprise environments, and that it is better to focus on good strategies for speeding up recreation or setting up standby indexes than to hunt for failed indexing transactions using the check and update methods. For major document model changes you need to recreate the index anyway.
High performance and availability
This is mainly the cluster and replication functionality, which is no longer available in Community. It's similar to MS clusters: it's a lot of work for very little additional availability, since some concepts are missing. The price is high in terms of complexity and can end up in a loss of robustness. Even with Enterprise support it's a hard job to keep an Alfresco cluster running, so you need very good arguments for going this way. But of course, it's possible and available!
High performance: There shouldn't be any difference, and if there is, I'm very curious about the explanation.
Technology stack
The main difference is the database support. In Community you can only choose between MySQL and Postgres (no Oracle or MS SQL for Community). All other technologies are independent of Enterprise or Community (AD, Kerberos, OS, browser, ...).
Java Container: I believe over 95% of all Alfresco installations run in Tomcat. That's the configuration which is documented, tested and scales. Using WebLogic or WebSphere gives you no added value, only new challenges. Quite the contrary: you have to solve most issues yourself and can't benefit from others' experience.
Storage policies: I'm not sure, and should check in 4.2.x, whether the Content Store Selector / storage policies feature is no longer available in Community, but it was there in the 3.x versions.
[Edit]: storage policies have been removed in Community 4.2.x:
NoSuchBeanDefinitionException: No bean named 'storeSelectorContentStoreBase' is defined
If there is a real need for this functionality, someone could re-enable the feature by coding a module for Community.
Regards
This page explains the difference between the editions:
https://wiki.alfresco.com/wiki/Enterprise_Edition
This page is the canonical, comprehensive list of the differences.
If you are considering an Enterprise Subscription and you have a question that isn't answered by what you can find on that page, you should talk to your account rep.
Well, regarding JMX monitoring:
Runtime administration: Alfresco Enterprise allows you to perform certain actions on Alfresco subsystems without restarting the server. This lets you move very fast while debugging or developing, and also when making changes in a production environment. You can also access the JMX interface, which supports JMX Remoting.
In Community there is no consistency check or update until you restart the server (during startup you have to validate/check/rebuild your indexes). There is an option in alfresco-global.properties (or the original repository.properties config file) for that. If you have some inconsistencies in the Alfresco Community index, you're gonna have a bad time xD.
Alfresco Enterprise has a specific license for clustering your architecture; the Community edition doesn't support those setups. Replicating and clustering Alfresco is one of the main improvements in performance/scalability/availability you can achieve.
The storage policies allow you to use Content Store selectors in Alfresco Enterprise. You can manage a primary and a secondary file store, and map/connect these stores in your architecture. The Community Edition allows you only to use one content store at a time.
The supported technology stacks include everything inside Alfresco (Spring Framework, Apache Lucene/Solr, Tomcat, and so on), because with the Enterprise license you also get full support for everything inside the Alfresco package. The difference is that Community is based on daily builds, supported by the community, and therefore not guaranteed. Enterprise support helps you resolve many problems that you might encounter during development and in production, not only Alfresco-related ones but also ones with configurations on supported platforms (Windows/Linux), your web application servers, and so on.
Hope it helps.

Build Server Hardware Configuration

So I've seen this question, but I'm looking for some more general advice: How do you spec out a build server? Specifically what steps should I take to decide exactly what processor, HD, RAM, etc. to use for a new build server. What factors should I consider to decide whether to use virtualization?
I'm looking for general steps I need to take to come to the decision of what hardware to buy. Steps that lead me to specific conclusions - think "I will need 4 gigs of ram" instead of "As much RAM as you can afford"
P.S. I'm deliberately not giving specifics because I'm looking for the teach-a-man-to-fish answer, not an answer that will only apply to my situation.
The answer is what requirements will the machine need in order to "build" your code. That is entirely dependent on the code you're talking about.
If it's a few thousand lines of code then just pull that old desktop out of the closet. If it's a few billion lines of code then speak to the bank manager about giving you a loan for a blade enclosure!
I think the best place to start with a build server though is buy yourself a new developer machine and then rebuild your old one to be your build server.
I would start by collecting some performance metrics on the build on whatever system you currently use to build. I would specifically look at CPU and memory utilization, the amount of data read and written from disk, and the amount of network traffic (if any) generated. On Windows you can use perfmon to get all of this data; on Linux, you can use tools like vmstat, iostat and top. Figure out where the bottlenecks are -- is your build CPU bound? Disk bound? Starved for RAM? The answers to these questions will guide your purchase decision -- if your build hammers the CPU but generates relatively little data, putting in a screaming SCSI-based RAID disk is a waste of money.
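As a rough first pass before reaching for perfmon or vmstat, you can compare the CPU time a build consumes with its wall-clock time. The sketch below is one way to do that with the Python standard library; the function names, the 0.7 threshold and the default core count are assumptions for illustration, not established rules. Note that `os.times()` only reports child CPU time on Unix-like systems (on Windows those fields are zero).

```python
import os
import subprocess
import time


def profile_command(cmd):
    """Run cmd and return (wall_seconds, cpu_seconds) for the child process tree."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # CPU time spent by waited-for children, user plus system.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


def classify(wall, cpu, cores=1, threshold=0.7):
    """Heuristic: CPU time close to wall time * cores suggests a CPU-bound build.

    The threshold is an arbitrary starting point, not a measured constant.
    """
    if wall <= 0:
        return "unknown"
    utilization = cpu / (wall * cores)
    if utilization >= threshold:
        return "cpu-bound"
    return "waiting on disk, network or memory"
```

For example, `classify(*profile_command(["make", "all"]), cores=4)` would give a first hint for a make-based build on a four-core machine (the `all` target is a placeholder). A low ratio only tells you the CPU was idle; you still need iostat or similar to see whether disk, network or swapping is the actual cause.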
You may want to try running your build with varying levels of parallelism as you collect these metrics as well. If you're using GNU make, run your build with -j 2, -j 4 and -j 8. This will help you see if the build is CPU or disk limited.
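A small script can automate that sweep. This is a minimal sketch, assuming a GNU make build; the `all` target and the optional clean command are placeholders you would replace with whatever your project uses.

```python
import subprocess
import time


def make_command(jobs, target="all"):
    """argv for GNU make at a given parallelism ('all' is an assumed target name)."""
    return ["make", "-j", str(jobs), target]


def sweep(command_for, levels=(2, 4, 8), clean_cmd=None):
    """Time one build per parallelism level and return {jobs: wall_seconds}."""
    results = {}
    for jobs in levels:
        if clean_cmd is not None:
            # Start every run from the same state so the timings are comparable.
            subprocess.run(clean_cmd, check=True)
        start = time.perf_counter()
        subprocess.run(command_for(jobs), check=True)
        results[jobs] = time.perf_counter() - start
    return results
```

A call such as `sweep(make_command, clean_cmd=["make", "clean"])` returns one timing per level. If the time keeps dropping as the job count rises, the build is probably CPU limited; if it flattens out or gets worse early, look at the disk. Run each level more than once, since the file system cache can make the first run look slower than the rest.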
Also consider the possibility that the right build server for your needs might actually be a cluster of cheap systems rather than a single massive box -- there are lots of distributed build systems out there (gmake/distcc, pvmgmake, ElectricAccelerator, etc) that can help you leverage an array of cheap computers better than you could a single big system.
Things to consider:
How many projects are going to be expected to build simultaneously? Is it acceptable for one project to wait while another finishes?
Are you going to do CI or scheduled builds?
How long do your builds normally take?
What build software are you using?
Most web projects are small enough (build times under 5 minutes) that buying a large server just doesn't make sense.
As an example,
We have about 20 devs actively working on 6 different projects. We are using a single TFS Build server running CI for all of the projects. They are set to build on every check in.
All of our projects build in under 3 minutes.
The build server is a single quad core with 4GB of RAM. The primary reason we use it is to perform dev and staging builds for QA. Once a build completes, that application is auto-deployed to the appropriate server(s). It is also responsible for running unit and web tests against those projects.
The type of build software you use is very important. TFS can take advantage of each core to parallel build projects within a solution. If your build software can't do that, then you might investigate having multiple build servers depending on your needs.
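The questions above about simultaneous builds and acceptable waiting can be turned into a back-of-the-envelope calculation. The sketch below estimates how busy a CI build server would be; the check-in rate is a made-up assumption (the example above does not state one), so substitute your own numbers.

```python
def build_utilization(checkins_per_hour, build_minutes, concurrent_builds=1):
    """Fraction of the hour the build server spends building.

    Values near or above 1.0 mean builds will queue up behind each other.
    """
    busy_minutes = checkins_per_hour * build_minutes
    return busy_minutes / (60 * concurrent_builds)


# Hypothetical load: 20 developers, each checking in 4 times per 8-hour day.
checkins_per_hour = 20 * 4 / 8          # 10 check-ins per hour (assumed)
utilization = build_utilization(checkins_per_hour, build_minutes=3)
```

Under those assumed numbers the server is busy about half the time, so an occasional build waits but the queue does not grow. This ignores bursts (everyone checking in before lunch), which is where the question of whether one project can wait for another really matters.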
Our shop supports 16 products that range from a few thousand lines of code to hundreds of thousands of lines (maybe a million+ at this point). We use 3 HP servers (about 5 years old), dual quad core with 10GB of RAM. The disks are 7200 RPM SCSI drives. Everything is compiled via msbuild on the command line with parallel compilation enabled.
With that setup, our biggest bottleneck by far is the disk I/O. We will completely wipe our source code and re-checkout on every build, and the delete and checkout times are really slow. The compilation and publishing times are slow as well. The CPU and RAM are not remotely taxed.
I am in the process of refreshing these servers, so I am going the route of workstation-class machines, going with 4 instead of 3, and replacing the SCSI drives with the best/fastest SSDs I can afford. If you have a setup similar to this, then disk I/O should be a consideration.
