Why is Cloudera's Impala still "incubating"?

We are using Impala in my company and we have used it in my previous one without
problems.
Is there something that affects production use (for example, it breaks under heavy usage, it leaks memory, or concurrent access is not recommended)?

Quoting from http://incubator.apache.org/:
The Apache Incubator has two primary goals:
Ensure all donations are in accordance with the ASF legal standards
Develop new communities that adhere to our guiding principles
According to the above, the Incubator has nothing to do with performance or stability. It is about attracting an active, diverse community to a given project and reviewing the licenses involved.

Related

What is the prescriptive approach to supporting multiple RDBMS's with Flyway?

I have an application that supports multiple RDBMS's. The SQL needed to build the data model is different between each of the RDBMS's that I need to support. The differences aren't small either: they stem from the fact that one of the supported systems is intended for light use (development, small installations) and another for heavy use. Simply standardizing on a single supported RDBMS is not an option.
As it stands I need to be able to apply migrations to my application in all of the supported RDBMS's. Where possible I'd like to be able to share migration scripts to reduce the amount of duplication involved but I imagine that isn't entirely possible.
The only approach I can come up with so far is to keep separate directories in source control for each of the supported environments. Then at runtime, pick the appropriate directory for the RDBMS that the system is connected to.
Is having one directory per supported RDBMS the prescriptive approach or is there a better way?
Right from the FAQ: What is the best strategy for handling database-specific sql?
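The directory-per-RDBMS idea from the question, combined with a shared directory for scripts that work everywhere, can be sketched as below. This is only an illustration under assumptions: the `sql/common` and `sql/<vendor>` layout and both function names are hypothetical, and the code does not call Flyway itself. It just derives the list of script directories from a JDBC URL, which could then be handed to Flyway's `locations` setting.

```python
from __future__ import annotations

# Hypothetical layout: shared scripts in sql/common, vendor-specific ones in sql/<vendor>.
COMMON_LOCATION = "sql/common"


def vendor_from_jdbc_url(jdbc_url: str) -> str:
    """Return the vendor token of a URL such as jdbc:postgresql://host/db."""
    parts = jdbc_url.split(":")
    if len(parts) < 3 or parts[0] != "jdbc" or not parts[1]:
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")
    return parts[1].lower()


def migration_locations(jdbc_url: str) -> list[str]:
    """Shared scripts first, then the directory for the connected RDBMS."""
    vendor = vendor_from_jdbc_url(jdbc_url)
    return [COMMON_LOCATION, f"sql/{vendor}"]


if __name__ == "__main__":
    # Join with commas to get a value suitable for a locations-style setting.
    print(",".join(migration_locations("jdbc:postgresql://localhost/app")))
```

Keeping the shared directory first means only the scripts that genuinely differ need to be duplicated per vendor.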

Which Publish method is most efficient at maintaining a large website?

I'm using VS2010 and TFS to build a complex, medium-sized website.
Which protocol is most efficient and secure? I think I can enable any protocol I need since I own the IIS server and control all aspects of it.
My choices are:
Webdeploy
FTP
FileSystem
FPSE
There is also a hint at something called "one click"... not sure what that is, or if it relates to any of the above.
OK.. I'm sorry, but I'm not sure where to even start, and I'm not sure the question is answerable as-is. I'd probably put this as a note if there weren't a limit on the number of characters.
So much depends on the type of data in this app, your financial resources, etc. This is one of those subjects that seems like a simple question, but the more you learn, the more you realize you don't know. What you're talking about is release management, which is just one piece of the puzzle in an overall Application Life-cycle Management strategy.
(Hint: start at the link posted, and be prepared to spend months learning.)
Some of the factors you need to be aware of are regulatory ones that you may not even have thought of. Certain data is protected, and different standards require you to have formalized risk and release management built into your processes. For example, credit card data, medical records, etc. all have different regulations (some actual laws, some imposed by the Payment Card Industry) that you need to be aware of.
If your site contains ANY sensitive data, you first need to find out whether any of these rules apply to you, and if so, which ones. Do any of them require audit trails for how code goes from development to deployment? (PCI does, for example. We take credit card payments, and to do that you need to be PCI certified or face heavy fines.)
If your site contains NO sensitive information at all, then your question could be answered as-is, and the question becomes a matter of what you're comfortable with.
If your application DOES contain sensitive info that makes it subject to rules that mandate a documented, secure ALM process, then the question becomes more complex, because doing deployments manually in such a situation is a PAIN IN THE BUTT. It doesn't take long before you start looking at tools to help automate some of the processes. (Build servers, tools such as Aldon for deployment, etc. There is a whole host of commercial and open source software to choose from.)
(We're using Atlassian for most of our ALM, but Team Foundation Server is also excellent, and there are a TON of other options.)

What is the reason large sites don't use MySQL with ASP.NET?

I have read this article from High Scalability about Stack Overflow and other large websites. Many large high traffic .NET sites such as plentyoffish.com, MySpace and Stack Overflow all use .NET technologies and use SQL Server for their database. In the article it says a source in Stack Overflow said:
As you add more and more database servers the SQL Server license costs can be outrageous. So by starting scale up and gradually going scale out with non-open source software you can be in a world of financial hurt.
Why don't these sites use MySQL instead of SQL Server?
Adding into what AJ said... Remember Facebook also pays C programmers to hack up MySQL code and also PHP code to get things to really work "well" for the amount of traffic they get.
Facebook already made statements in the past and this year about having wished they made a better choice.
As a matter of fact, for coding they're now compiling their PHP down to C++ code using HipHop for PHP, and about 90% of their servers are running the C++ binaries instead of the PHP scripts.
Their MySQL database might save them a dime or two, but the cost of maintaining it, scaling it, etc. is extremely high.
A product like Oracle however would really allow you to scale seamlessly compared to MySQL.
I have a site right now that uses a lot of bandwidth on my database, large number of queries, and the truth is, scaling is a pain in the neck with MySQL and their Clustering product isn't that great and requires a license. Oracle right now has the best "grid" database setup but the costs are insane there...
Also, I code C# as well.. Let me tell you it's MUCH easier to integrate enterprise level sites with SQL Server compared to MySQL.
I would guess that it's probably because it's really really easy to get started making a site with ASP.NET hooked into SQL Server. And for the sites you mentioned, speed to market was probably more important than getting the architecture "right" (not to say that SQL Server is or isn't the right choice - just that speed to market is the priority). Remember that a developer's job is to release software.
So long as one avoids using too many database specific features, it will be relatively straightforward to switch to a different database with moderate effort. But why bother unless your site becomes super-popular?
Edit: And if you become super-popular, you may even want to venture into the land of NoSQL.
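One way to keep a later database switch cheap, as suggested above, is to confine all SQL to a small data-access class so that nothing else in the site knows which engine sits underneath. The following is a minimal sketch, assuming a hypothetical `users` table and a hypothetical `UserRepository` class. It uses Python's built-in `sqlite3` driver purely so that it runs on its own; a MySQL or SQL Server driver would be swapped in behind the same methods.

```python
import sqlite3


class UserRepository:
    """Hypothetical data-access class: all SQL for users lives here."""

    def __init__(self, connection):
        self._conn = connection

    def create_schema(self):
        cur = self._conn.cursor()
        # Plain, widely supported SQL; no vendor-specific column options.
        cur.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL)"
        )
        self._conn.commit()

    def add(self, user_id, name):
        cur = self._conn.cursor()
        # "?" is sqlite3's placeholder style; other drivers may use "%s" instead,
        # which is one more reason to keep the SQL in a single place.
        cur.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
        self._conn.commit()

    def find_name(self, user_id):
        cur = self._conn.cursor()
        cur.execute("SELECT name FROM users WHERE id = ?", (user_id,))
        row = cur.fetchone()
        return row[0] if row else None


if __name__ == "__main__":
    repo = UserRepository(sqlite3.connect(":memory:"))
    repo.create_schema()
    repo.add(1, "ada")
    print(repo.find_name(1))
```

The rest of the application calls `add` and `find_name` only, so moving to another engine means rewriting this one class rather than every page.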
While this doesn't directly answer your questions I really have to refute your comment about outrageous licensing costs. ALL ENTERPRISE grade commercial software comes with a high price tag because it has the VALUE for it. If it doesn't have that value, it wouldn't be a successful product.
SQL Server's pricing is extremely competitive and has a substantially lower TCO than Oracle. Another reason to choose MS SQL Server is that most shops that develop on the Microsoft stack are Windows Server shops. MS SQL Server is built specifically for Windows Server, so it can integrate as flawlessly as possible with the operating system. Many other products are not primarily and solely developed for Windows Server, which results in feature differences and environmental bugs.
These environmental issues are further compounded by the fact that large-scale shops primarily employ system administrators with long backgrounds in that specific stack, so in a .NET shop most system administrators are most fluent in Windows Server. Having to support multiple operating systems becomes a large cost, especially on the risk management side, when you're a large-scale business.
To repeat what others have said: I work in a corporation, and money, so to speak, does not matter that much when it comes to these matters. Decisions are made on the basis of "What kind of support can we get from the vendor?", "How many skilled people are in the market?", "What is the vendor's reputation?", etc.
I think there are two distinct groups of adopters of MySQL or SQL Server.
Large websites that are privately owned and do not have additional financial resources. These websites will typically run MySQL. Naturally.
Large websites built by corporations. These sites will run whatever is the accepted database technology within the corporation. Money does not dictate this decision; what matters more is who can support the software and its development.
No Microsoft SQL Server Management Studio. For real. A lot of stuff is done there instead of with the raw SQL that is the norm in the open source software world.
Things are much easier to deal with when the technology stack is homogeneous.
If you want MySQL support for LINQ to SQL, good luck. It's still very immature. With SQL Server, it's a matter of drag and drop. Literally.
You can also run database queries from within Visual Studio for SQL Server. I've never tried it for any other database, but I'm not convinced you'd be able to.
It's great to say 'Oh, MySQL is so much cheaper than SQL Server.' Yes, it is. But I'm not sure the integration costs are worth it; not to mention having to rely on Yet Another Vendor to provide support if something goes wrong.
You use what you know...
(IMHO) The Microsoft tool stack is brilliant. It works well, we learn with it and grow with it, as the technology grows. It becomes easier to use as you become accustomed to it (its quirks and idiosyncrasies).
MySQL is also a brilliant tool. It works, and works well. We could all have religious wars as to what tool is best, but remember it is just a tool to get a job done.
Now let's factor in the cost of the software. Plenty of Fish made $7M two years ago; do you really think they care how much their database/server software costs? SO is on BizSpark, at $0 cost for 3 years (that's got to hurt).
For the sceptics, Facebook runs MySQL on/for 30K servers, and MySQL Enterprise Unlimited licences cost $40k, so this is not necessarily cheap either.
I don't know about you, but for me when I make a ton of cash, I really won't care how much it "costs", because I am making more with it, than without it!
I would say because of the following:
Microsoft products are very well integrated with other Microsoft products;
SQL Server has a free Express edition that can be used to host sites;
With the arrival of the .NET Framework, Microsoft gained a lot of ground over its competitors in schools and so on, which made SQL Server a well-known database engine;
Microsoft products work better with other Microsoft products;
There are two ways of licensing SQL Server: per client (CAL), and per server processor or something like that. For site hosts, there is perhaps an advantage to licensing SQL Server this way;
Other database engines such as MySQL, PostgreSQL, Firebird, etc. all have their syntactic differences, which makes SQL Server's T-SQL a sensible choice given the number of people who can work with it easily;
There might be some other political reasons for using SQL Server over less costly solutions.
I would like to mention that some do use SQL Server, but the Express Edition. Whether they are aware of it or not, publishing or commercialising a solution built on SQL Server Express Edition makes, according to Microsoft's EULA for this product, your solution a free solution as well: the EULA states that you need to provide your solution to your customer, and your customer is free to share your commercial solution with whomever they wish because it sits on SQL Server Express. Even so, some continue to use SQL Server Express without informing their customers about this. Most ordinary clients won't know about it, and they will respect their contract with the solution's supplier.
Furthermore, as I mentioned above, some don't care about the price but have political reasons for using commercial products such as SQL Server and other software products. There are places where money isn't the most important factor, but after-sales service is. They want specialized engineers or support teams directly, not necessarily what communities like MySQL's offer.
Hope this sheds some light.
It's just culture. People group themselves. It's natural. People who prefer open-source, will naturally choose LAMP (Linux, Apache, MySQL and PHP) for the same kind of project that people who prefer corporate support choose Microsoft IIS, Microsoft SQL Server and Microsoft .NET for. There is a good deal of human psychology involved in this, make no mistake about it. There is nothing prohibiting one from using IIS with PHP and MySQL, or Apache with Microsoft SQL Server, but the way it goes is as described above.
In short, large sites do use either, but yes, not often the two you mentioned together.
I believe George has it on the mark: "homogeneous".
Most of Microsoft's technologies are built to work together. There are direct hooks between .NET and SQL Server to provide additional functionality like cache management that just don't exist between .NET and MySQL.
IIRC, MySQL doesn't have built in cache management which is why Ehcache and memcached exist.
re Joshua's comment: "A product like Oracle however would really allow you to scale seamlessly compared to MySQL." Years ago, Sabre picked MySQL over Oracle for some high scale projects based on cost and feature set. AFAIK, it's still picked over Oracle unless you can prove through cost/benefit analysis why Oracle is the better choice for a project.
I think it really boils down to functionality, user knowledge base and interoperability.
Sometimes SQL Server is the better match, sometimes it's MySQL, sometimes it's Oracle.
Fewer compatibility issues when you single source.
MS SQL Server is the "default" database for ASP.NET apps (see LINQ to SQL, ADO.NET, ApplicationServices, etc.).
Immature .NET tools for other databases. For example, you don't have to worry about a feature or functionality not being supported if you stick with MS SQL Server, whereas other databases might not have full support (e.g. DbLinq, etc.).
MS SQL Server is also free to get started with (SQL Server Express), and once you're ready to go public, it's hard to change the data layer.
I'm in the process of writing an ASP.NET MVC2 site with MySQL as the backend (mainly due to licensing costs). I've implemented DbLinq, but it also means writing a custom Membership/Role provider and general tweaking of the data layer. It's definitely doable, but it's not as simple as sticking with MS SQL Server. I'm also hoping to move the site over to Mono 2.7 (once it's released) running on a Linux server to sidestep the server licensing issues as well.
The real reason is that people usually go with MS SQL Server because .NET comes from the same brand. In the same way, PHP people always prefer MySQL over other databases. It's all mindset, and people don't want to take any risk.
Any large enterprise site isn't going to care about licensing costs that much. What they want is fast, reliable data access and access to company technical support. They also want something that can easily be partitioned to scale and that is designed for huge databases. They also want the easy availability of performance tuning specialists, data warehousing and Business Intelligence specialists, database developers, and database administrators. SQL Server and Oracle both meet these criteria. I really don't see MySQL as having as many people qualified to design and monitor large systems. I am not sure how it stacks up on partitioning and scalability, though.
Well, for one thing there are other, better, free databases (e.g., PostgreSQL). For another, the Microsoft ecosystem is designed to suck you in, getting you to spend more and more with the guys from Redmond.

Open Source Identity vs. Real Life Identity

I maintain two identities: one for open source development, which doesn't really contain any personal information, and, obviously, my real one.
This may be community wiki - but my question is programming related in that when you put software out there, you publish it with some name as the author, and that choice may have real life consequences.
I am considering merging my identities. What are the pros and cons of this? Is it a good idea, or do privacy concerns outweigh the convenience of maintaining a single identity?
(By the way, this second identity was created out of my World of Warcraft addon development, and I have just continued using it for my open source projects)
Edit: I am considering this, because I am thinking of changing jobs, and I want to refer to my open source work without it looking unprofessional due to the author naming.
Well, as a part-time open-source hacker, I've recently discovered that Ohloh can help you "professionalize" your identity by allowing you to claim all the commits you've made in projects known to this engine (and they're numerous).
As a consequence, instead of merging your identities, I would suggest you give them some weight by marketing the contributions you've made.
Besides, I've never accepted the idea that committing to a game plugin as an open-source activity is unprofessional. It is code, and code used by non-developers, which is worth noting.
In many professions, publishing works under a pseudonym has a long tradition that continues today: artists, writers, etc.
Is it really unprofessional for a software developer to do the same?
If you are good, why not get a little famous? Who knows whether the person hiring you uses or participates in an open source project? If so, you'll be valued more from the start.

Scalability Case Studies

I'm starting to build a community website from the ground up, and my web framework will be ASP.NET with MySQL.
I want to start planning some scalability into the infrastructure early because I'm anticipating high traffic when the site goes live.
Are there any case studies you recommend reading in which ASP.NET or MySQL has been scaled and which demonstrate good scaling techniques?
I think it could be a challenge to find reference materials for that particular combination. Many .NET shops stick to SQL Server, and fewer use MySQL (at least at scale).
In general it would be appropriate to:
Follow general .NET practices for scalability. Weed out what is not appropriate for you.
Learn about database performance and implications of various design strategies such as denormalisation (when and why).
Consider out-of-process caching like memcached.
Review books on MySQL performance. Most of these are focused on UNIX platforms. Windows users may have problems applying some of these practices.
Read up on how other people are scaling their sites (Building Scalable Sites and The Art of Capacity Planning)
Consider how you might optimise your web design to be more scalable. Are you using AJAX? Work out what the impact of excessive polling may be etc.
Learn how to measure the performance of your application and database (starting points ASP.NET and MySQL).
Develop a plan for scaling your architecture (1 server to 2 servers, to multiple servers etc) so that you have some frame of reference for making decisions about building things in your system.
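The out-of-process caching item in the list above usually comes down to a "cache-aside" read path: look in the cache first, and only hit the database on a miss. Here is a minimal sketch of that pattern. `InMemoryCache` is a hypothetical stand-in that lives inside the process so the example runs without a server; in a real deployment a memcached client would take its place. `cached_fetch` and the loader function are likewise illustrative names, not part of any library.

```python
import time


class InMemoryCache:
    """Hypothetical stand-in for an out-of-process cache client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def cached_fetch(cache, key, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the database, then store."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value, ttl_seconds)
    return value


if __name__ == "__main__":
    cache = InMemoryCache()

    def load_user(key):
        # Placeholder for a real database query.
        return {"id": key}

    print(cached_fetch(cache, "user:1", load_user))
```

The time-to-live bounds how stale a cached row can get, which is the main trade-off to decide per type of data.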
I only know of one really good resource for case studies about scalability techniques, and I am really surprised no one has mentioned it: High Scalability.
There are so many examples of "out of the box" thinking and different techniques for scaling that I think it makes a good read for anyone who is interested in the topic.
BrianLy said it best here:
"Develop a plan for scaling your architecture (1 server to 2 servers, to multiple servers etc) so that you have some frame of reference for making decisions about building things in your system."
As a forum I frequent says, 'quoted for truth'. All of his points are excellent, but this one is a key point that many people overlook. It doesn't matter how scalable your code and database are if you are running on a creaky old server. The hardware may not be as important as your code, and improving it beyond a certain point will give diminishing returns VERY quickly, but do NOT forget to get your hardware to that point. If you have crap hardware, or even good hardware but not enough of it, your site will bomb out.
For MySQL scaling, you may find this interesting: danga livejournal

Resources