Big data solution as a service - bigdata

Is there a big data solution as a service?
I've been looking for some possible solutions like Microsoft HD Insight, but they only leave you the ambari for you to manage it.
Does anyone know of any solution that provides all the Big Data services (HDFS, Kafka, Spark, Yarn, etc.) as a service?

Related

How does the DBMS affect application performance? And Informix GUI tools?

I use Informix DBMS in all my web applications. My question has two parts:
Does the DBMS have a big effect on the performance of my applications
and if the answer is yes what about Informix and `MS SQL Server in this
issue?
I want some GUI tools to facilitate my job when writing queries,
creating database, relationships, ERD, etc. The Informix client
is so bad. There are no facilities at all. I want some tools
like SQL Server Management Studio
As a GUI tool for Informix you can use Aqua Data Studio from Aquafold. It also supports MS SQL Server.
As of the performance: it depends. How well is your Database design. Do you use indexes, is your query well-written, etc. etc. Very hard to answer your question, we just don't know enough.
To design a solution that would perform the best, one needs to know the nature of the application you are building. For example, if you are building a system that needs to process and compute large volume of data and computations can be distributed, a "traditional" relational database is not a good option no matter what vendor you choose. You would be better off with an option that supports sharding, Hadoop and will likely be based on some kind of NoSQL solution.
If you are sticking with RDMS and building something that has a lot of reads and not a lot of writes, go for a database that supports Snapshot Isolation which will allow your readers to not be blocked by writers.
Cost also plays into this - some RDMS systems are free, some are not. Your question is way to general to be answered specifically.
Aqua Data Studio is good but quite expensive. An open source tool SQL Workbench/J is also an effective tool for informix.
Informix have its own charm but i guess it should not be said that MS-Sql Server is slower or not good in performance. You may decide DBMS according to your nature of application. There are many techniques to optimize Database performance like, Applying Indexes/ Not too many Joins/ Queries can be optimize too/ Stored Procedure can also be used/ Multi-DBs level etc.
Once i need to develop Social Media site, i used MySQL in this project but only for POSTs i installed MongoDB.
Regards,
Salik

What about OpenEJB? Is it worth it? Any opinions?

I whould like to know some opinions about OpenEJB: we are considering to use it on a new project, but really didn't found many opinions about it.
So, here is my question: how about it? Does it perform well? Is it stable enough for a production environment?
We switched to OpenEJB (deployed embedded in our app on Tomcat). Performance tests showed better or not worse results processing our transactions compared to JBoss (transactions include data access, JMS, and servlets). We use ActiveMQ within OpenEJB for JMS. There are no stability problems as of yet - we are still in staging (pre-production) environment though. The documentation is definitely lacking, but not as poor as other embedded choices. Overall, we consider this as a good choice if you run on Tomcat. Deploying it on other application servers turned out to be much more difficult (JBoss, Weblogic, Websphere) but there are not many reasons for this usually (we had few but dropped this after several attempts basically failed).
And as in all open source products: expect lack of support (documentation, troubleshooting, bugs, etc.) to be compensated by free access to sources.
We've had experience with Oracle OAS and JBOSS before. We decided to give OpenEJB a try. We've found out that it is not only very fast but it also much easier to setup and configure, and it has much better defaults.
Currently we implement our own failure measures in the client, so we don't know how they compare for clustering, or other advanced features that we don't use.
We we have to go back and deal with JBOSS in the developer side, we see a drop on productivity, because it takes too long to bootstrap.

Cassandra and asp.net (C#)

I am interested to create portal on cassandra services, since I faced some performance and scale issues starting from 1 million of records.
Definitely, it could be solved, but I am interested on other options.
My main issues is cost of updating all necessary indexes, to make reading fast.
First, is cassandra is good way for asp.net programmers? I mean, maybe there is some other projects, which worth to take a look
And second, can you provide any documentation samples on how to start with cassandra programming from C#?
since I faced performance and scale issues starting from 1 million of records.
Maybe your design was not that good, NoSQL is not a magic bullet for bad design. I have multi billion row tables and 95% of the response is sub second. Also what do you mean by updating indexes, do you mean updating statistics or rebuilding indexes?
since I faced performance and scale
issues starting from 1 million of
records.
You know, the one million mark for modern databases is where it is not something "totally ridiculously small" where you can ignore actually knowing what you do. Below one million is "tiny". I have a 800 million row table and get a LOT of sql running through with it - no problem at all.
First, is cassandra is good way for
asp.net programmers?
I would more suggest a basic book about SQL, reading the documentation and POSSIBLY throwing some hardware on the problem. As in: having totally bad hardware will kill all data management systems.
If you are using Cassandra for your .NET Application take a look at Aquiles. I developed it based on my company needs. If you find it useful or need any help let me know.
You can't really speak of Cassandra documentation. There's a myriad of partial tutorials on the web.
You may want to setup Linux in a virtual machine, because the windows build process is quite challenging, to say the least. (http://www.virtualbox.org, http://www.ubuntu.com)
Here's the howto:
http://www.ridgway.co.za/archive/2009/11/06/net-developers-guide-to-getting-started-with-cassandra.aspx
Note that the cassandra SVN url and the code sample have changed since the writing of this tutorial.
Here's another C# client:
http://github.com/mattvv/hectorsharp
And here some sample code:
http://www.copypastecode.com/26752/
Note that you need to download the latest Java Development Kit (JDK) from Sun for Linux.
It's not in the repositories of Ubuntu 10.04.
Then you need to type
export JAVA_HOME="/path/to/jdk"
in order for Cassandra to find your Java installation.
You might also want to take a look at:
http://en.wikipedia.org/wiki/NoSQL
Especially the taxonomy section is interesting.
Make sure Cassandra is the right type of NoSQL solution for your problem, e.g. use Neo4J if your problem actually is a graph problem.
Also, you need to make sure your NoSQL solution is ACID-compliant.
For example, Neo4J is the only ACID-compliant NoSQL graph engine.
Edit: Here's a jumpstart guide for Windows, without compiling:
http://coderjournal.com/2010/03/cassandra-jump-start-for-the-windows-developer/
http://www.ronaldwidha.net/2010/06/23/running-cassandra-on-windows-first-attempt/
http://www.yafla.com/dforbes/Getting_Started_with_Apache_Cassandra_a_NoSQL_frontrunner_on_Windows/
Instead of cassandra you might take a look at: ravendb. Supposedly it is a document store made with and created for .Net. It has Linq integration, and is (again supposedly) very fast.
As with any new technology, read if it helps you with your specific case, and check if it is proven technology (Do they have mainstream clients using it).
Before you go into this route see if you can't optimize your current solution first. Check if your queries are fast, if the indexes are done correctly, and if you can't remove load by adding caching.
Last nut not least, if adding some processors to your SQL machine might fix issues, it is typically a much cheaper solution.
If you want to do something new, then instead of going for noSQL, you might want to consider trying a database cluster.
The idea is when two machines each search half of the original database at the same time, you have half the search time without totally redesigning your existing database.

What are you using for Distributed Caching in web farms running ASP.NET?

I am curious as to what others are using in this situation. I know a couple of the options that are out there like a memcached port or ScaleOutSoftware. The memcached ports don't seem to be actively worked on (correct me if I'm wrong). ScaleOutSoftware is too expensive for me (I don't doubt it is worth it). This is not to say that I don't want to hear about people using memcached or ScaleOutSoftware. I'm just stating what I "know" at this point.
So my question is basically this: for those of you ACTIVELY using distributed caching, what are you using, are you happy with it, and what should I look out for?
I am moving to two servers very soon...both will be at the same location. I use caching fairly heavily (but carefully) to reduce the load on my database server.
Edit: I downloaded Scaleout Software's solution. I've coded for it and it seems to work real well. I just have to decide if my wallet will part with the cash for it. :) Anyone have experiences good or bad with ScaleoutSoftware?
Edit Again: It's been a little while since I asked this? Any more thoughts on it? We ended up buying the solution from ScaleOutSoftware and have been happy with it, but I'm curious what others are doing.
Microsoft has a product pending code-named Velocity. It's still in CTP, and is moving slowly, but looks like it will be pretty good. We'll be beating it up in the near future to see how it handles what we want it to do (> 2 million read/writes per hour). Will post back with results.
There is a 100% native .NET, well documented open source (LGPL) project called Shared Cache. Looks like it is not yet mentioned on SO, but it's promising and should be able to do what most people expect from a distributed cache. It even supports different strategies like distributed or replicated caching etc.
I will update this post with more details as soon as I had a chance to try it on a real project.
We're currently using an incredibly simple cache that I wrote in a couple of hours, based on re-hosting the ASP.NET cache in a Windows Service (more info and source code here). I won't pretend it's anywhere near as optimised as something like Memcached but we were just looking for something simple and free until Velocity came along, and it's held up extremely well even under fairly heavy load.
It comes down to our personal preference for core components - i.e. ones that affect whether the site is available or not - that they are either (a) supported by a vendor with a history of rapid and high quality support, or (b) written by us so that if something goes wrong we can fix it quickly. Open source is all well and good, and indeed we do use some OSS, but if your site is offline then unfortunately newsgroups et al don't have a 1 hour SLA, and just because it's OSS doesn't mean you have the necessary understanding or ability to fix it yourself.
We are using the memcached port for Windows and we are very pleased with it. The enyim.com memcached client API is great and easy to work with. It's also open source, which is a big advantage, if you ask me.
We are now using this setup in a production web-app and it has helped a lot in improving its performance.
There's a great .NET wrapper/port found here on Codeplex. Awesomesauce!
We use memcached with the enyim library in a production environment (www.funda.nl). Works fine, very pleased with it, but we did notice a substantial raise in CPU use on the clients. Presumably due to the serializing/deserializing going on. We do around 1000 reads per second.
One tried and tested product by 100's of customers worldwide is NCache. Its
a feature rich product that lets you store session state in a redundant and highly available manner, lets you share data
within the enterprise as well as bridging for WAN communication essentially acting as a data fabric and lastly it lets you build an elastic caching tier so that when
your application scales, you can add servers to the cache and actually boost performance further.

Any experiences with Websphere Integration Developer (WID)?

My company (a large organization) is developing a "road-map" for evolving their rather old, tangled confederation of systems to an SOA model. A few people are pushing hard for using Websphere Integration Developer and Websphere Process Server as the defacto platform for developing future applications...because they feel IBM is a stable vendor, the tools are made for the enterprise, they drank the "business agility" BPEL kool-aid, etc.
Does anyone have positive or negative thoughts on this platform? Do the GUI tools help eliminate monotonous/redundant coding...or just obscure things and make things harder to maintain? Basically, do the benefits justify the complexity?
My experience with the IBM Java tool set is pure pain. Days to install lots of different versions of different components all incompatible with each other, discover a bug in component A get told to update to see if it fixes, updating component A breaks component B and C, get told to update these etc.
I find Eclipse with out the IBM extensions far more stable and quicker and provides more features (as its stable versions are a couple releases ahead of WID/RAD).
I would advise against going the IBM way for development tools. As for process server I have less experience but the people in my team using it seemed to enjoy it as much as I enjoyed WID. not a lot.
So far I havent been impressed by any tools with the "SOA" and/or "BPM" labels on them. My "roadmap" would be very very iterative to see some results with the archetecture as fast as possible while trying to grab some of the easy fruits. That way you gain your feel for what works for you and your people.
I would never let any vendor push me anywhere in the "scuplturing" of the architecture.
I agree with other users complaining about WID. The only reason we are using WID is that a decision was made a while back to use IBM products across the board by our sales department.
That's right, our sales department made the decision to use IBM products.
Development has been painful and frustrating. We have lots of stability problems with Process Server, sometimes it doesn't want to start or shutdown properly. Yeah you can easily draw processes in the IDE, but most any toolset provides that functionality these days. It is nothing special or unique to WID or IBM. IBM is a few iterations behind mainstream.
There are plenty of open source implementations out there that offer great support. Checkout JBoss or RedHat, they are pretty good. If that doesn't float your boat, you can always use Apache tools.
Walter
Developers don't choose WID, WMB, or WPS. Managers do, because IBM is a "stable vendor".
Look at JBoss, or K.I.S.S.
WID/WPS is actually pretty simple. The original intention was for analysts and business people to "compose" services (DO NOT LET THEM DO THIS!) so the UI is simple and easy.
Most of the work will be in defineing and implementing the back end services which depending on the platform will mostly involve wrapping existing code in SOA service.
The most important thing to bear in mind is that SOAP is technoligy and SOA is an architecture and a state of mind.
There is a zen to a succesful SOA implementation. Its all about "business services", if you have a service that you cannot describe to a business user in less than six words you have done it wrong! Ideally the service name alone should be enough to describe the functionality of the service.
If you end up with a service called "MyApp.GetContactData" described as "get name, addresses tel fax etc." then you are there. If You have a service called MyAppGetFaxNoFromOldSys" described as "Retrieve current-fax-nmbr from telephony table in legacy system" you are doomed!
Incidently most of the Websphere tooling for WS* is pretty nice. But I would recommend the very wonderful SAOPUI tool from http://www.eviware.com which is very good for compsing/reading WSDL based messages and also function as a useful test client or server.
Do the GUI tools help eliminate monotonous/redundant coding...or just obscure things and make things harder to maintain? Basically, do the benefits justify the complexity?
As a Developer, I find the tools at varying levels of being bug free. 6.0.1 was a pain, 6.2 is so much better. But once you develop with the tool, there is minimal effort to maintain it. I develop in hours what java developers take days to do. It is also easy to maintain as changes can be made very quickly. I cannot answer your question from the perspective of an architect or a Manager but i would agree with comments of some others here.

Resources