Is NAS / SAN + HTTP server a good match? - http

Is NAS / SAN + HTTP server a good solution for serving large number of static files over the internet?

Add some memory caching on your server, and you should be good. Apache has a couple of modules that do that.
You could also take a look at static distributed caching services, if you want to improve latency for your users and reduce your bw costs, like Akamai and PantherExpress. The latter can be a good investment, depending on your bw costs.

This really depends on the overall problem you are solving. SANs are incredibly complicated and are just a problem waiting to happen. The complexity of the solution adds huge numbers of failure points, maintenance difficulty, possibly nonstandard drivers on every system, interoperability problems between versions of every component.
Most NAS solutions are overengineered problems waiting to happen. They only add value when you need to share one data set in real-time between clients. Think about whether your problem really calls for this. Netapp is really the only NAS vendor that I consider acceptably reliable.
If you can avoid a SAN or NAS, avoid it. Internal hard drives are usually cheaper and faster. They also have less performance confusion when there is an issue. Maintenance is easier. Scalability is easier (i.e. you add performance as you add capacity, if you are replicating the data across every server).
Think about how easy it is to get a large amount of fast storage in a server. A HP DL380 G5 can comfortably have over 1.5TB in one 2U server. Expect the storage to be faster than most SAN or NAS solutions. You won't have controller redundancy, but if you have redundant servers anyway, you increase the overall reliability of the solution vs having one copy of the data with redundant paths to it.
If you need to instantly change the data across multiple servers, I would still consider whether a NAS is the correct solution. Depending on your definition of instantly, and whether you can point requests for updated files to the servers with the current data during synchronization.
I can only imagine a SAN being the correct solution when the data set is huge and there is no time to create a software solution. My experience is that the vast majority of SANs are set up more based on political requirements than technical ones.
I can only imagine a NAS being the correct solution when the NAS server is a Netapp, the data set is very large, and the solution needs to be deployed too quickly to allow for a software solution to spreading the data across multiple servers internal storage. A good NAS server is very expensive, certainly more expensive than paying for development of a software solution to avoid one. But it can possibly be deployed more quickly.
If there are political considerations, SANs and NASes can help to push blame for problems/failures to other groups or to vendors. This is usually the most important consideration when I see a SAN or NAS solution chosen.

Related

How to store huge amount of data in database

I have a simple basic question. Assume i have a large website like facebook, gmail and so on. this site probably save hundreds of gigabytes information every day. My question is how these sites save this large information in their database(Because of database capacity). Is there only one database? Is there only one server for this site? If there is another server and database, how they can communicate with each others?
They are clearly not using one computer...
The system behind such large sites are very complex, and distributed across datacenters. See - http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/
Take a look at this site for info on various architectures employed by those sites (and this site): http://highscalability.com/all-time-favorites/
Most of these sites have gone with a strategy called NoSQL - that is they don't use traditional RDBM databases, but instead have created their own object relationship frameworks which have the ability to be persisted. This strategy works well at large scale as it drops a number of constraints which would seriously impact performance of traditional DB methods. However this generally comes at the cost of a lowering of reliability, which is generally considered acceptable for those sites' scenarios.
ps. if your question's general interest then no worries. If you're trying to build a highly scalable application hold off and consider it for a moment - are you going to be serving a significant percentage of the population of the world, or are you writing a site for maybe a few thousand users. If it's the latter you don't need Facebook style scaling; invest your effort and resources elsewhere. If it's the former start small then evolve your system, bringing in investment and expertise as your user base grows.

to rewrite or not to rewrite

so I need to maintain this old legacy project, where one part of it is amaturely written with Wordpess, lots of crappy custom plugins, lots of ducktaped scripts, that solve one or other problem, database is designed very very purely too, there is also other part written with Zend, which is thightly connected with WP part, there is also this other "masterpiece" project connected to first project data. Main table contains around 1.5M records and needs to be normalized too. now this big ball of nails "works", also it has lots of LOCs, which are result of bad foundations, so it is a huge pain to maintain. Thing is, the way I see this, by not rewriting we are loosing on the long term, because we lose flexibility both from technology and busi
ness perspective,plus it starts not to scale, but rewriting is a risk, plus we would need to convert old data to new data structures. Hacker part of me wants to break this, take a risk and do it right, but at the same time I am having a feeling that my immaturity tries to take a too big bite at once. So what do you think?
In situations like yours, 9 out of 10 times people suggest a rewrite and they are wrong.
Unless you have great application level knowledge about what the system is doing you will not be able to rewrite it successfully, quickly.
If the system is working today, but is crappy in many ways, and you have management buy-in (they own the software) to "fix stuffs", then I suggest an incremental approach will often be better than a full-on rewrite.
I suspect that the database is giving you the most headaches, so that may be the best place to start. Start by understanding the problem that it is currently solving and write that down. If there is no layer between the software and the db (other than jdbc or the like) add a layer. Once there is a layer separating the db from the application, it will be easier to change the db (and the layer) while minimizing the impact on the application.
At some point you will be happy with the first thing you changed. At that point, fix some other part. Repeat until the system is "more better".
Concerning risk: Taking risks is not bad, but being careless is terrible. Understand the risks and plan to mitigate them.
In situations like yours, 9 out of 10 times, I suggest to rewrite. Rationally, the situation a) won't get better, and b) will certainly get worse. You should bite the bullet before it's too late.
And by too late I mean something breaks completely and not only will you have to rewrite the whole thing, but your service will also be offline (ergo you may be also losing users/customers).
A good strategy in this situations is to "strangle" your application as described by Martin Fowler:
http://martinfowler.com/bliki/StranglerApplication.html
The strategy is to gradually create a new system around the edges of the old, letting it grow slowly over several years until the old system is strangled.
I already strangled a legacy application using this approach with great results and practically no offline time

Are there any scalability best practices specifically for sites with huge audiences?

While this question has been asked in a variety of contexts before, I can't find any information pertaining specifically to sites targeting very large audiences - for example on the scale of hundreds of thousands or even millions of users.
When writing sites that target smaller audiences (such as intranet hosted data driven sites that handle from a few to a few thousand users) we only tend to follow best practices within the confines of our project budgets/deadlines - i.e. developer costs, rollout schedules and maintainability have a far bigger impact than we would often like on how we code things.
Some things are also negligible (to a point), for instance delivery time, image compression/size, bandwidth because the nature of a LAN hosted application tends to mean that there is a relatively small amount of financial cost that (within reason) we don't need to worry about too much.
However, when looking to target a much broader audience for instance an audience of (hopefully) millions of users:
Are there any best practices that no longer need to be worried about (i.e. become more negligible the larger the audience)?
Are there any practices that should be adhered to even more tightly?
Also, are there any practices that only really come into play as your audience achieves some critical mass [and what would that critical mass be]? i.e. applying artificial constraints that wouldn't begin to concern you on a private network
Examples I've come across so far are:
Host codebases such as jQuery on Google as it's delivered from Google's CDN and can be served much faster than from your own servers. This will also help keep bandwidth costs down for delivery of your site.
Host images on a CDN for the same reason as hosting your javascript code elsewhere.
I guess it depends on what one aims for on the "triangle" of pressures: CAP (Consistency, Availability & Tolerance to Partition). E.g. one can only have so much "C" when faced with network disruptions which incur "P".
Nowadays, it would appear that the accent is put more on delivering "good user experience" which seems to hinge on "Time to Result" (e.g. having a complete web page on the user's desktop): this translate to investing (amongst other things) more on the "A" and "P" sides then the "C" one.
More concretely: spend some time deciding when to perform data aggregation for the presentation layer to your users e.g. can I aggregate this data over a longer time period before recomputing another view to push?
Of course, I am only barely scratching the surface of the problem.
I think there are three big things to keep in mind here:
a) You aren't going to write the next twitter/youtube/facebook/ebay/amazon/whatever. It don't happen too often so it is a big case of YAGNI.
b) If you do happen to write one of those, chances are you'll have the opportunity to rewrite the application more than a few times.
c) Only object lesson from any of the architecture types who have spoken publicly about those apps is that scaling horizontally is the way to go. Vertical maxes out real, real quick.
Also, I'd argue that process improvements become much bigger at these lofty scales. You will have legions of developers, strict deployment windows and lots of boxes to worry about. It had better be real scripted, automated and repeatable.
I would check out YSlow and follow their reccomendations with regards to improving performance.
#jldupont - Just looked at the presentation that you have linked to. One thing that I didn't get is that how come "Distributed Databases" is an example scenario when you lose Availability to gain Consistency and Partitioning.
I think for distributed databases you lose Consistency.

Easy way to measure LAN network speed of a partcular computer?

I want to measure the network speed between two PCs on Local Area Network. (I'm thinking of getting a Network Area Storage (NAS) device and I want to see how fast the current setup is to get an idea of how fast the NAS needs to be.
I'm thinking I'll just copy some files and look at how long that takes, but I thought there might be a more precise way to measure that.
To measure the network performance, iperf is the best tool I have seen.
I do not believe this will be helpful in solving your problem, however. You aren't going to be able to size a NAS solution based on the network performance.
Transferring a file using the same protocol the NAS would is certainly more relevant for this case. Still, not helpful.
You can assume that any modern machine can transfer data at over 90% of the link speed. (whether involving the local storage or not, unless there's a problem or significant other load).
A dozen reasonably modern clients should be able to crush a fairly powerful NAS. You really need to simulate real-world load. That's basically imposssible.
You need to approach sizing your NAS solution from a different perspective. The capabilities of the clients really aren't relevant.
This might work if you do the same process with another computer. It might not be precise, but could give an idea.
Each NAS device tends to produce as part of its spec a throughput number. Generally speaking if speed is of concern, you would be better off which choosing SAN over NAS.

How to Convince Programming Team to Let Go of Old Ways?

This is more of a business-oriented programming question that I can't seem to figure out how to resolve. I work with a team of programmers who have been working with BASIC for over 20 years. I was brought in to help write the same software in .NET, only with updates and modern practices. The problem is that I can't seem to get any of the other 3 team members(all BASIC programmers, though one does .NET now as well) to understand how to correctly do a relational database. Here's the thing they won't understand:
We basically have a transaction that keeps track of a customer's tag information. We need to be able to track current transactions and past transactions. In the old system, a flat-file database was used that had one table that contained records with the basic current transaction of the customer, and another transaction that contained all the previous transactions of the customer along with important money information. To prevent redundancy, they would overwrite the current transaction with the history transactions-(the history file was updated first, then the current one.) It's totally unneccessary since you only need one transaction table, but my supervisor or any of my other two co-workers can't seem to understand this. How exactly can I convince them to see the light so that we won't have to do ridiculous amounts of work and end up hitting the datatabse too many times? Thanks for the input!
Firstly I must admit it's not absolutely clear to me from your description what the data structures and logic flows in the existing structures actually are. This does imply to me that perhaps you are not making yourself clear to your co-workers either, so one of your priorities must be to be able explain, either verbally or preferably in writing and diagrams, the current situation and the proposed replacement. Please take this as an observation rather than any criticism of your question.
Secondly I do find it quite remarkable that programmers of 20 years experience do not understand relational databases and transactions. Flat file coding went out of the mainstream a very long time ago - I first handled relational databases in a commercial setting back in 1988 and they were pretty commonplace by the mid-90s. What sector and product type are you working on? It sounds possible to me that you might be dealing with some sort of embedded or otherwise 'unusual' system, in which case you do need to make sure that you don't have some sort of communication issue and you're overlooking a large elephant that hasn't been pointed out to you - you wouldn't be the first 'consultant' brought into a team who has been set up in some manner by not being fed the appropriate information. That said such archaic shops do still exist - one of my current clients systems interfaces to a flat-file based system coded in COBOL, and yes, it is hell to manage ;-)
Finally, if you are completely sure of your ground and you are faced with a team who won't take on board your recommendations - and demonstration code is a good idea if you can spare the time -then you'll probably have to accept the decision gracefully and move one. Myself in this position I would attempt to abstract out the issue - can the database updates be moved into stored procedures for example so the code to update both tables is in the SP and can be modified at a later date to move to your schema without a corresponding application change? Make sure your arguments are well documented and recorded so you can revisit them later should the opportunity arise.
You will not be the first coder who's had to implement a sub-optimal solution because of office politics - use it as a learning experience for your own personal development about handling such situations and commiserate yourself with the thought you'll get paid for the additional work. Often the deciding factor in such arguments is not the logic, but the 'weight of reputation' you yourself bring to the table - it sounds like having been brought in you don't have much of that sort of leverage with your team, so you may have to work on gaining a reputation by exceling at implementing what they do agree to do before you have sufficient reputation in subsequent cases - you need to be modded up first!
Sometimes you can't.
If you read some XP books, they often say that one of your biggest hurdles will be convincing your team to abandon what they have always done.
Generally they will recommend letting people who can't adapt go to other projects (Or just letting them go).
Code reviews might help in your case. Mandatory code reviews of every line of code is not unheard of.
Sometime the best argument is an example. I'd write a prototype (or a replacement if not too much work). With an example to examine it will be easier to see the pros and cons of a relational database.
As an aside, flat-file databases have their places since they are so much easier to "administer" than a true relational database. Keep an open mind. ;-)
I think you may have to lead by example - when people see that the "new" way is less work they will adopt it (as long as you don't rub their noses in it).
I would also ask yourself whether the old design is actually causing a problem or whether it is just aesthetically annoying. It's important to pick your battles - if the old design isn't causing a performance problem or making the system hard to maintain you may want to leave the old design alone.
Finally, if you do leave the old design in place, try and abstract the interface between your new code and the old database so if you do persuade your co-workers to improve the design later you can drop the new schema in without having to change anything else.
It is difficult to extract a whole lot except general frustration from the original question.
Yes, there are a lot of techniques and habits long-timers pick up over time that can be useless and even costly in light of technology changes. Some things that made sense when processing power, memory, and even disk was expensive can be foolish attempts at optimization now. It is also very much the case that people accumulate bad habits and bad programming patterns over time.
You have to be careful though.
Sometimes there are good reasons for the things those old timers do. Sadly, they may not even be able to verbalize the "why" - if they even know why anymore.
I see a lot of this sort of frustration when newbies come into an enterprise software development shop. It can be bad even when the environment is all fairly modern technology and tools. If most of your experience is in writing small-community desktop and Web applications a lot of what you "know" may be wrong.
Often there are requirements for transaction journaling at a level above what your DBMS may do. Quite often it can be necessary to go beyond DB transaction semantics in order to ensure time-sequence correctness, once and only once updating, resiliancy, and non-repudiation.
And this doesn't even begin to address the issues involved in enterprise or inter-enterprise scalability. When you begin to approach half a million complex transactions a day you will find that RDBMS technology fails you. Because relational databases are not designed to handle high transaction volumes you must often break with standard paradigms for normalization and updating. Conventional RDBMS locking techniques can destroy scalability no matter how much hardware you throw at the problem.
It is easy to dismiss all of it as stodginess or general wrong-headedness - even incompetence. But be careful because this isn't always the case.
And by the way: There are other models besides the RDBMS, and the alternative to an RDBMS is not necessarily "flat files" - contrary to the experience of of most coders today. There are transactional hierarchical DBMSs that can handle much higher throughput than an RDBMS. IMS is still very much alive in large IBM shops, for example. Other vendors offer similar software for different platforms.
Of course in a 4-man shop maybe none of this applies.
Sign them up for some decent trainings and then it's up to you to convince them that with new technologies a lot more is possible (or at least easier!).
But I think the most important thing here is that professional, certified trainers teach them the basics first. They will be more impressed by that instead of just one of their colleagues telling them: "hey, why not use this?"
Related post here.
The following may not apply in yr situation, but you make very little mention of technical details, so I thought I'd mention it...
Sometimes, if the access patterns are very different for current data than for historical data (I'm making this example up, but say that Current data is accessed 1000s of times per second, and accesses a small subset of columns, and all current data fits in less than 1 GB, whereas, say, historical data uses 1000s of GBs, is accessed only 100s of times per day, and access is to all columns),
then, what your co-workers are doing would make perfect sense, for performance optimization. By separating the current data (albiet redundantly) you can optimize the indices and data structures in that table, for the higher frequency access paterns that you could not do in the historical table.
Not everything that is "academically", or "technically" correct from a purely relational perspective makes sense when applied in an actual practical situation.

Resources