so I need to maintain this old legacy project, where one part of it is amaturely written with Wordpess, lots of crappy custom plugins, lots of ducktaped scripts, that solve one or other problem, database is designed very very purely too, there is also other part written with Zend, which is thightly connected with WP part, there is also this other "masterpiece" project connected to first project data. Main table contains around 1.5M records and needs to be normalized too. now this big ball of nails "works", also it has lots of LOCs, which are result of bad foundations, so it is a huge pain to maintain. Thing is, the way I see this, by not rewriting we are loosing on the long term, because we lose flexibility both from technology and busi
ness perspective,plus it starts not to scale, but rewriting is a risk, plus we would need to convert old data to new data structures. Hacker part of me wants to break this, take a risk and do it right, but at the same time I am having a feeling that my immaturity tries to take a too big bite at once. So what do you think?
In situations like yours, 9 out of 10 times people suggest a rewrite and they are wrong.
Unless you have great application level knowledge about what the system is doing you will not be able to rewrite it successfully, quickly.
If the system is working today, but is crappy in many ways, and you have management buy-in (they own the software) to "fix stuffs", then I suggest an incremental approach will often be better than a full-on rewrite.
I suspect that the database is giving you the most headaches, so that may be the best place to start. Start by understanding the problem that it is currently solving and write that down. If there is no layer between the software and the db (other than jdbc or the like) add a layer. Once there is a layer separating the db from the application, it will be easier to change the db (and the layer) while minimizing the impact on the application.
At some point you will be happy with the first thing you changed. At that point, fix some other part. Repeat until the system is "more better".
Concerning risk: Taking risks is not bad, but being careless is terrible. Understand the risks and plan to mitigate them.
In situations like yours, 9 out of 10 times, I suggest to rewrite. Rationally, the situation a) won't get better, and b) will certainly get worse. You should bite the bullet before it's too late.
And by too late I mean something breaks completely and not only will you have to rewrite the whole thing, but your service will also be offline (ergo you may be also losing users/customers).
A good strategy in this situations is to "strangle" your application as described by Martin Fowler:
http://martinfowler.com/bliki/StranglerApplication.html
The strategy is to gradually create a new system around the edges of the old, letting it grow slowly over several years until the old system is strangled.
I already strangled a legacy application using this approach with great results and practically no offline time
I want to write an application that removes data from a hard drive. Are there any standards that I need to adhere to which will ensure that my software removes at least the bare minimum, or should I just use off the shelf software? If so any advice?
I think any "standard" you may encounter won't be any less science fiction or science mysticism than anything you come up with yourself. Basically, as long as you physically overwrite the data (even just once), there's no commercial forensic service that - even in the face of any amount of money you throw at them - will claim to be able to recover your data.
(Any "overwrite 35 times with rotating bit patterns" advice may have been true for coarsely spaced magnetic tapes in the 1970s, but it is entirely irrelevant for contemporary hard disks).
The far more important problem you have to solve is how to overwrite data physically. This is essentially impossible through any sort of application or even OS programming, and you'll have to find a way to talk to the hardware properly and get a reliable confirmation that the location you intended to write to has indeed be written to, and that there aren't any relocations of the clusters in question to other parts of the disk that might leak the data.
So in essence this is a very low-level question that'll probably have you pouring over your hard disk manufacturer's manuals quite a bit if you want a genuine solution.
Please define "data removal". Is this scrubbing in order to make undeletions impossible; or simply deletion of data ?
It is common to write over a file several times with a random bitpattern, if one wants to make sure it cannot be recovered. Due to the analog nature of the magnetic bit patterns, it might be possible to recover overwritten data in some circumstances.
Under all circumstances a normal file system delete operation will be revertable in most cases. When you delete a file (using a normal file system delete operation), you remove the file allocation table entry, not the data.
There are standards... see http://en.wikipedia.org/wiki/Data_erasure
You don't give any details so it is hard to tell whether they apply to your situation... Deleting a file with OS built-in file deletion can be almost always reverted... OTOH formatting a drive (NOT quick format) is usually ok except when you deal with sensitive data (like data from clients, patients, finance etc. or some security relevant stuff) then the above mentioned standards which usually use differents amounts/rounds/patterns of overwriting the data so make it nearly impossible to revert the deletion... in really really sensitive cases you first use the best of these methods, then format the drive, then use that method again and then destroy the drive physically (which in fact means real destruction, not only removing the electronics or similar!).
The best way to avoid all this hassle is to plan for this kind of thing and to use strong proven full-disk-encryption (with a key NOT stored on the drive electronics or media!)... this way you can easily just format the drive (NOT quick) and then sell it for example... since any strong encryption will look like "random data" is (if implemented correctly) absolutely useless without the key(s).
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 13 years ago.
I have been working for two years in software industry. Some things that have puzzled me are as follows:
There is lack of application of mathematics in current software industry.
e.g.: When a mechanical engineer designs an electricity pole , he computes the stress on the foundation by using stress analysis techniques(read mathematical equations) to determine exactly what kind and what grade of steel should be used, but when a software developer deploys a web server application he just guesses on the estimated load on his server and leaves the rest on luck and god, there is nothing that he can use to simulate mathematically to answer his problem (my observation).
Great softwares (wind tunnel simulators etc) and computing programs(like matlab etc) are there to simulate real world problems (because they have their mathematical equations) but we in software industry still are clueless about how much actual resources in terms of memory , computing resources, clock speed , RAM etc would be needed when our server side application would actually be deployed. we just keep on guessing about the solution and solve such problem's by more or less 'hit and trial' (my observation).
Programming is done on API's, whether in c, C#, java etc. We are never able to exactly check the complexity of our code and hence efficiency because somewhere we are using an abstraction written by someone else whose source code we either don't have or we didn't have the time to check it.
e.g. If I write a simple client server app in C# or java, I am never able to calculate beforehand how much the efficiency and complexity of this code is going to be or what would be the minimum this whole client server app will require (my observation).
Load balancing and scalability analysis are just too vague and are merely solved by adding more nodes if requests on the server are increasing (my observation).
Please post answers to any of my above puzzling observations.
Please post relevant references also.
I would be happy if someone proves me wrong and shows the right way.
Thanks in advance
Ashish
I think there are a few reasons for this. One is that in many cases, simply getting the job done is more important than making it perform as well as possible. A lot of software that I write is stuff that will only be run on occasion on small data sets, or stuff where the performance implications are pretty trivial (it's a loop that does a fixed computation on each element, so it's trivially O(n)). For most of this software, it would be silly to spend time analyzing the running time in detail.
Another reason is that software is very easy to change later on. Once you've built a bridge, any fixes can be incredibly expensive, so it's good to be very sure of your design before you do it. In software, unless you've made a horrible architectural choice early on, you can generally find and optimize performance hot spots once you have some more real-world data about how it performs. In order to avoid those horrible architectural choices, you can generally do approximate, back-of-the-envelope calculations (make sure you're not using an O(2^n) algorithm on a large data set, and estimate within a factor of 10 or so how many resources you'll need for the heaviest load you expect). These do require some analysis, but usually it can be pretty quick and off the cuff.
And then there are cases in which you really, really do need to squeeze the ultimate performance out of a system. In these case, people frequently do actually sit down, work out the performance characteristics of the systems they are working with, and do very detailed analyses. See, for instance, Ulrich Drepper's very impressive paper What Every Programmer Should Know About Memory (pdf).
Think about the engineering sciences, they all have very well defined laws that are applicable to the design, and building of physical items, things like gravity, strength of materials, etc. Whereas in Computer science, there are not many well defined laws when it comes to building an application against.
I can think of many different ways to write a simple hello world program that would satisfy the requirment. However, if I have to build an electricity pole, I am severely constrained by the physical world, and the requirements of the pole.
Point by point
An electricity pole has to withstand the weather, a load, corrosion etc and these can be quantified and modelled. I can't quantify my website launch success, or how my database will grow.
Premature optimisation? Good enough is exactly that, fix it when needed. If you're a vendor, you've no idea what will be running your code in real life or how it's configured. Again you can't quantify it.
Premature optimisation
See point 1. I can add as needed.
Carrying on... even engineers bollix up. Collapsing bridges, blackout, car safety recalls, "wrong kind of snow" etc etc. Shall we change the question to "why don't engineers use more empirical observations?"
The answer to most of these is in order to have meaningful measurements (and accepted equations, limits, tolerances etc) that you have in real-world engineering you first need a way of measuring what it is that you are looking at.
Most of these things simply can't be measured easily - Software complexity is a classic, what is "complex"? How do you look at source code and decide if it is complex or not? McCabe's Cyclomatic Complexity is the closest standard we have for this but it's still basically just counting branch instructions in methods.
There is little math in software programs because the programs themselves are the equation. It is not possible to figure out the equation before it is actually run. Engineers use simple (and very complex) programs to simulate what happens in the real world. It is very difficult to simulate a simulator. additionally, many problems in computer science don't even have an answer mathematically: see traveling salesman.
Much of the mathematics is also built into languages and libraries. If you use a hash table to store data, you know to find any element can be done in constant time O(1), no matter how many elements are in the hash table. If you store it in a binary tree, it will take longer depending on the number of elements [0(n^2) if i remember correctly].
The problem is that software talks with other software, written by humans. The engineering examples you describe deal with physical phenomenon, which are constant. If I develop an electrical simulator, everyone in the world can use it. If I develop a protocol X simulator for my server, it will help me, but probably won't be worth the work.
No one can design a system from scratch and people that write semi-common libraries generally have plenty of enhancements and extensions to work on rather than writing a simulator for their library.
If you want a network traffic simulator you can find one, but it will tell you little about your server load because the traffic won't be using the protocol your server understands. Every server is going to see completely different sets of traffic.
There is lack of application of mathematics in current software industry.
e.g.: When a mechanical engineer designs an electricity pole , he computes the stress on the foundation by using stress analysis techniques(read mathematical equations) to determine exactly what kind and what grade of steel should be used, but when a software developer deploys a web server application he just guesses on the estimated load on his server and leaves the rest on luck and god, there is nothing that he can use to simulate mathematically to answer his problem (my observation).
I wouldn't say that luck or god are always the basis for load estimation. Often realistic data can be had.
It's also not true that there are no mathematical techniques to answer the question. Operations research and queuing theory can be applied to good advantage.
The real problem is that mechanical engineering is based on laws of physics and a foundation of thousands of years worth of empirical and scientific investigation. Computer science is only as old as me. Computer science will be much further along by the time your children and grandchildren apply the best practices of their day.
An MIT EE grad would not have this problem ;)
My thoughts:
Some people do actually apply math to estimate server load. The equations are very complex for many applications and many people resort to rules of thumb, guess and adjust or similar strategies. Some applications (real time applications with a high penalty for failure... weapons systems, powerplant control applications, avionics) carefully compute the required resources and ensure that they will be available at runtime.
Same as 1.
Engineers also use components provided by others, with a published interface. Think of electrical engineering. You don't usually care about the internals of a transistor, just it's interface and operating specifications. If you wanted to examine every component you use in all of it's complexity, you would be limited to what one single person can accomplish.
I have written fairly complex algorithms that determine what to scale when based on various factors such as memory consumption, CPU load, and IO. However, the most efficient solution is sometimes to measure and adjust. This is especially true if the application is complex and evolves over time. The effort invested in modeling the application mathematically (and updating that model over time) may be more than the cost of lost efficiency by try and correct approaches. Eventually, I could envision a better understanding of the correlation between code and the environment it executes in could lead to systems that predict resource usage ahead of time. Since we don't have that today, many organizations load test code under a wide range of conditions to empirically gather that information.
Software engineering are very different from the typical fields of engineering. Where "normal" engineering are bound to the context of our physical universe and the laws in it we've identified, there's no such boundary in the software world.
Producing software are usually an attempt to mirror a subset of the real-life world into a virtual reality. Here we define the laws ourselves, by only picking the ones we need and by making them just as complex as we need. Because of this fundamental difference, you need to look at the problem-solving from a different perspective. We try to make abstractions to make complex parts less complex, just like we teach kids that yellow + blue = green, when it's really the wavelength of the light that bounces on the paper that changes.
Once in a while we are bound by different laws though. Stuff like Big-O, Test-coverage, complexity-measurements, UI-measurements and the likes are all models of mathematic laws. If you look into digital signal processing, realtime programming and functional programming, you'll often find that the programmers use equations to figure out a way to do what they want. - but these techniques aren't really (to some extend) useful to create a virtual domain, that can solve complex logic, branching and interact with a user.
The reasons why wind tunnels, simulations, etc.. are needed in the engineering world is that it's much cheaper to build a scaled down prototype, than to build the full thing and then test it. Also, a failed test on a full scale bridge is destructive - you have to build a new one for each test.
In software, once you have a prototype that passes the requirements, you have the full-blown solution. there is no need to build the full-scale version. You should be running load simulations against your server apps before going live with them, but since loads are variable and often unpredictable, you're better off building the app to be able to scale to any size by adding more hardware than to target a certain load. Bridge builders have a given target load they need to handle. If they had a predicted usage of 10 cars at any given time, and then a year later the bridge's popularity soared to 1,000,000 cars per day, nobody would be surprised if it failed. But with web applications, that's the kind of scaling that has to happen.
1) Most business logic is usually broken down into decision trees. This is the "equation" that should be proofed with unit tests. If you put in x then you should get y, I don't see any issue there.
2,3) Profiling can provide some insight as to where performance issues lie. For the most part you can't say that software will take x cycles because that will change over time (ie database becomes larger, OS starts going funky, etc). Bridges for instance require constant maintenance, you can't slap one up and expect it to last 50 years without spending time and money on it. Using libraries is like not trying to figure out pi every time you want to find the circumference of a circle. It has already been proven (and is cost effective) so there is no need to reinvent the wheel.
4) For the most part web applications scale well horizontally (multiple machines). Vertical (multithreading/multiprocess) scaling tends to be much more complex. Adding machines is usually relatively easy and cost effective and avoid some bottlenecks that become limited rather easily (disk I/O). Also load balancing can eliminate the possibility of one machine being a central point of failure.
It isn't exactly rocket science as you never know how many consumers will come to the serving line. Generally it is better to have too much capacity then to have errors, pissed of customers and someone (generally your boss) chewing your hide out.
While this question has been asked in a variety of contexts before, I can't find any information pertaining specifically to sites targeting very large audiences - for example on the scale of hundreds of thousands or even millions of users.
When writing sites that target smaller audiences (such as intranet hosted data driven sites that handle from a few to a few thousand users) we only tend to follow best practices within the confines of our project budgets/deadlines - i.e. developer costs, rollout schedules and maintainability have a far bigger impact than we would often like on how we code things.
Some things are also negligible (to a point), for instance delivery time, image compression/size, bandwidth because the nature of a LAN hosted application tends to mean that there is a relatively small amount of financial cost that (within reason) we don't need to worry about too much.
However, when looking to target a much broader audience for instance an audience of (hopefully) millions of users:
Are there any best practices that no longer need to be worried about (i.e. become more negligible the larger the audience)?
Are there any practices that should be adhered to even more tightly?
Also, are there any practices that only really come into play as your audience achieves some critical mass [and what would that critical mass be]? i.e. applying artificial constraints that wouldn't begin to concern you on a private network
Examples I've come across so far are:
Host codebases such as jQuery on Google as it's delivered from Google's CDN and can be served much faster than from your own servers. This will also help keep bandwidth costs down for delivery of your site.
Host images on a CDN for the same reason as hosting your javascript code elsewhere.
I guess it depends on what one aims for on the "triangle" of pressures: CAP (Consistency, Availability & Tolerance to Partition). E.g. one can only have so much "C" when faced with network disruptions which incur "P".
Nowadays, it would appear that the accent is put more on delivering "good user experience" which seems to hinge on "Time to Result" (e.g. having a complete web page on the user's desktop): this translate to investing (amongst other things) more on the "A" and "P" sides then the "C" one.
More concretely: spend some time deciding when to perform data aggregation for the presentation layer to your users e.g. can I aggregate this data over a longer time period before recomputing another view to push?
Of course, I am only barely scratching the surface of the problem.
I think there are three big things to keep in mind here:
a) You aren't going to write the next twitter/youtube/facebook/ebay/amazon/whatever. It don't happen too often so it is a big case of YAGNI.
b) If you do happen to write one of those, chances are you'll have the opportunity to rewrite the application more than a few times.
c) Only object lesson from any of the architecture types who have spoken publicly about those apps is that scaling horizontally is the way to go. Vertical maxes out real, real quick.
Also, I'd argue that process improvements become much bigger at these lofty scales. You will have legions of developers, strict deployment windows and lots of boxes to worry about. It had better be real scripted, automated and repeatable.
I would check out YSlow and follow their reccomendations with regards to improving performance.
#jldupont - Just looked at the presentation that you have linked to. One thing that I didn't get is that how come "Distributed Databases" is an example scenario when you lose Availability to gain Consistency and Partitioning.
I think for distributed databases you lose Consistency.
Is NAS / SAN + HTTP server a good solution for serving large number of static files over the internet?
Add some memory caching on your server, and you should be good. Apache has a couple of modules that do that.
You could also take a look at static distributed caching services, if you want to improve latency for your users and reduce your bw costs, like Akamai and PantherExpress. The latter can be a good investment, depending on your bw costs.
This really depends on the overall problem you are solving. SANs are incredibly complicated and are just a problem waiting to happen. The complexity of the solution adds huge numbers of failure points, maintenance difficulty, possibly nonstandard drivers on every system, interoperability problems between versions of every component.
Most NAS solutions are overengineered problems waiting to happen. They only add value when you need to share one data set in real-time between clients. Think about whether your problem really calls for this. Netapp is really the only NAS vendor that I consider acceptably reliable.
If you can avoid a SAN or NAS, avoid it. Internal hard drives are usually cheaper and faster. They also have less performance confusion when there is an issue. Maintenance is easier. Scalability is easier (i.e. you add performance as you add capacity, if you are replicating the data across every server).
Think about how easy it is to get a large amount of fast storage in a server. A HP DL380 G5 can comfortably have over 1.5TB in one 2U server. Expect the storage to be faster than most SAN or NAS solutions. You won't have controller redundancy, but if you have redundant servers anyway, you increase the overall reliability of the solution vs having one copy of the data with redundant paths to it.
If you need to instantly change the data across multiple servers, I would still consider whether a NAS is the correct solution. Depending on your definition of instantly, and whether you can point requests for updated files to the servers with the current data during synchronization.
I can only imagine a SAN being the correct solution when the data set is huge and there is no time to create a software solution. My experience is that the vast majority of SANs are set up more based on political requirements than technical ones.
I can only imagine a NAS being the correct solution when the NAS server is a Netapp, the data set is very large, and the solution needs to be deployed too quickly to allow for a software solution to spreading the data across multiple servers internal storage. A good NAS server is very expensive, certainly more expensive than paying for development of a software solution to avoid one. But it can possibly be deployed more quickly.
If there are political considerations, SANs and NASes can help to push blame for problems/failures to other groups or to vendors. This is usually the most important consideration when I see a SAN or NAS solution chosen.