Data Removal Standards

I want to write an application that removes data from a hard drive. Are there any standards I need to adhere to that will ensure my software removes at least the bare minimum, or should I just use off-the-shelf software? If so, any advice?

I think any "standard" you may encounter won't be any less science fiction or science mysticism than anything you come up with yourself. Basically, as long as you physically overwrite the data (even just once), there's no commercial forensic service that - even in the face of any amount of money you throw at them - will claim to be able to recover your data.
(Any "overwrite 35 times with rotating bit patterns" advice may have been true for coarsely spaced magnetic tapes in the 1970s, but it is entirely irrelevant for contemporary hard disks).
The far more important problem you have to solve is how to overwrite data physically. This is essentially impossible through any sort of application or even OS programming; you'll have to find a way to talk to the hardware properly and get a reliable confirmation that the location you intended to write to has indeed been written to, and that the clusters in question haven't been relocated to other parts of the disk in a way that might leak the data.
So in essence this is a very low-level question that'll probably have you poring over your hard disk manufacturer's manuals quite a bit if you want a genuine solution.

Please define "data removal". Is this scrubbing in order to make undeletion impossible, or simply deletion of data?
It is common to write over a file several times with a random bitpattern, if one wants to make sure it cannot be recovered. Due to the analog nature of the magnetic bit patterns, it might be possible to recover overwritten data in some circumstances.
A normal file system delete operation will be reversible in most cases. When you delete a file (using a normal delete operation), you remove the file allocation table entry, not the data.
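To illustrate the overwrite-with-random-data idea from the answers above, here is a minimal sketch in Python (not from the original posts; the function name and details are mine). It works at the file-system level only, so it inherits every caveat raised above: the OS or the drive firmware may relocate blocks, and there is no hardware-level confirmation of the write.

    import os

    def shred_file(path, passes=1):
        """Overwrite a file with random bytes, then delete it.

        File-system level only: cannot defeat block relocation,
        wear levelling, or anything the drive does below the OS.
        """
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))  # one pass of random data
                f.flush()
                os.fsync(f.fileno())       # force the write to the device
        os.remove(path)                    # only now drop the directory entry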

There are standards... see http://en.wikipedia.org/wiki/Data_erasure
You don't give any details, so it is hard to tell whether they apply to your situation. Deleting a file with the OS's built-in file deletion can almost always be reverted. OTOH, formatting a drive (NOT a quick format) is usually OK, except when you deal with sensitive data (like data from clients, patients, finance etc., or some security-relevant stuff); then the above-mentioned standards, which use different amounts/rounds/patterns of overwriting, make it nearly impossible to revert the deletion. In really, really sensitive cases you first use the best of these methods, then format the drive, then use that method again, and then destroy the drive physically (which means real destruction, not just removing the electronics or similar!).
The best way to avoid all this hassle is to plan for this kind of thing and to use strong, proven full-disk encryption (with a key NOT stored on the drive electronics or media!). This way you can simply format the drive (NOT quick) and then sell it, for example, since strongly encrypted data looks like random data and is (if implemented correctly) absolutely useless without the key(s).

Related

to rewrite or not to rewrite

So I need to maintain this old legacy project, where one part is amateurishly written with Wordpress, lots of crappy custom plugins, and lots of duct-taped scripts that each solve one problem or another, and the database is designed very, very poorly too. There is another part written with Zend, which is tightly coupled to the WP part, and yet another "masterpiece" project connected to the first project's data. The main table contains around 1.5M records and needs to be normalized too. Now, this big ball of nails "works", but it has lots of LOCs which are the result of bad foundations, so it is a huge pain to maintain. The way I see it, by not rewriting we are losing in the long term, because we lose flexibility from both a technology and a business perspective, plus it is starting not to scale; but rewriting is a risk, plus we would need to convert old data to new data structures. The hacker part of me wants to break this, take the risk and do it right, but at the same time I have a feeling that my immaturity is trying to take too big a bite at once. So what do you think?
In situations like yours, 9 out of 10 times people suggest a rewrite and they are wrong.
Unless you have great application level knowledge about what the system is doing you will not be able to rewrite it successfully, quickly.
If the system is working today, but is crappy in many ways, and you have management buy-in (they own the software) to "fix stuffs", then I suggest an incremental approach will often be better than a full-on rewrite.
I suspect that the database is giving you the most headaches, so that may be the best place to start. Start by understanding the problem it is currently solving and write that down. If there is no layer between the software and the db (other than JDBC or the like), add a layer, as in the sketch below. Once there is a layer separating the db from the application, it will be easier to change the db (and the layer) while minimizing the impact on the application.
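A minimal sketch of such a layer, in Python for brevity (the class name, schema, and sqlite3 backend are illustrative assumptions, not from the original post): the application talks only to the repository, so the schema behind it can change later without touching call sites.

    import sqlite3

    class CustomerRepository:
        """The only place in the application that knows the schema."""

        def __init__(self, db_path):
            self._conn = sqlite3.connect(db_path)

        def find_by_id(self, customer_id):
            return self._conn.execute(
                "SELECT id, name, email FROM customers WHERE id = ?",
                (customer_id,),
            ).fetchone()  # callers never see table or column names

        def rename(self, customer_id, new_name):
            with self._conn:  # implicit transaction
                self._conn.execute(
                    "UPDATE customers SET name = ? WHERE id = ?",
                    (new_name, customer_id),
                )

    # If the customers table is later normalized or renamed, only this
    # class changes; the rest of the application is untouched.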
At some point you will be happy with the first thing you changed. At that point, fix some other part. Repeat until the system is "more better".
Concerning risk: Taking risks is not bad, but being careless is terrible. Understand the risks and plan to mitigate them.
In situations like yours, 9 out of 10 times, I suggest to rewrite. Rationally, the situation a) won't get better, and b) will certainly get worse. You should bite the bullet before it's too late.
And by too late I mean something breaks completely and not only will you have to rewrite the whole thing, but your service will also be offline (ergo you may be also losing users/customers).
A good strategy in situations like this is to "strangle" your application, as described by Martin Fowler:
http://martinfowler.com/bliki/StranglerApplication.html
The strategy is to gradually create a new system around the edges of the old, letting it grow slowly over several years until the old system is strangled.
I have already strangled a legacy application using this approach, with great results and practically no offline time.
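To make the strangler idea concrete, here is a hedged sketch (hypothetical names and paths; Python chosen only for illustration) of the routing shim that typically sits in front of both systems: each request goes to the new code if its path has been migrated, and falls through to the legacy app otherwise.

    # A strangler facade: routes migrated paths to the new system and
    # everything else to the legacy one (handlers are illustrative stubs).

    def legacy_system_handle(path, request):
        return f"legacy handled {path}"

    def new_system_handle(path, request):
        return f"new system handled {path}"

    MIGRATED = {"/reports", "/customers"}  # grows as pieces are rewritten

    def handle_request(path, request):
        if path in MIGRATED:
            return new_system_handle(path, request)
        return legacy_system_handle(path, request)

    # When MIGRATED covers every path, the old system is "strangled"
    # and can be switched off.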

Abstraction or not?

The other day I stumbled onto a rather old usenet post by Linus Torvalds. It is the infamous "You are full of bull****" post where he defends his choice of plain C for Git over something more modern.
In particular this post made me think about the enormous amount of abstraction layers that accumulate one over the other where I work. Mine is a Windows .Net environment. I must say that I like C# and the .Net environment, it really makes most things easy.
Now, I come from a very different background made of Unix technologies like C and a plethora of scripting languages; to me, OOP is just one programming paradigm, and not always the best one. I often struggle (in a working kind of way, of course!) with my colleagues (one in particular), because they appear to belong to the "any problem can be solved with an additional level of abstraction" church, while I'm more of the "keep it simple" school. I think there is a very different mental approach to problems that maybe comes from exposure to different cultures.
As a very simple example, for the first project I did here I needed some configuration for an application. I wrote a 10-line class to load and parse a txt file located in the program's root dir, containing colon-separated key/value pairs, one per row. It worked (something like the sketch below).
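The original class was presumably C#; this is a hedged reconstruction of the same idea in Python (the file name and comment handling are assumptions), just to show how little code the simple approach needs.

    def load_config(path="config.txt"):
        """Parse colon-separated key/value pairs, one per line."""
        config = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                key, _, value = line.partition(":")
                config[key.strip()] = value.strip()
        return config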
In the end, to standardize the approach to the configuration problem, we now have a library, installed on every machine that runs a configured program, which calls a service that, at startup, loads an xml containing references to other xmls, one per application, which in turn contain the configurations themselves.
Now, it is extensible and made of fancy reusable abstractions, providers and all, but I still think that if we ever actually reuse part of it, the time it took to build would have been enough to write the needed code from scratch, or to copy/paste the old code and modify it.
What are your thoughts about it? Can you point out some interesting reference dealing with the problem?
Thanks
Abstraction makes it easier to construct software and understand how it is put together, but it complicates fully understanding certain issues around performance and security, because the abstraction layers introduce certain kinds of complexity.
Torvalds' position is not absurd, but he is an extremist.
Simple answer: programming languages provide data structures and ways to combine them. Use these directly at first, do not abstract. If you find you have representation invariants to maintain that are at a high risk of being broken due to a large number of usage sites possibly outside your control, then consider abstraction.
To implement this, first provide functions and convert the call sites to use them without hiding the representation. Hide the data representation only when you're satisfied your functional representation is sufficient. Make sure at this time to document the invariant being protected.
An "extreme programming" version of this: do not abstract until you have test cases that break your program. If you think the invariant can be breached, write the case that breaks it first.
Here's a similar question: https://stackoverflow.com/questions/1992279/abstraction-in-todays-languages-excited-or-sad.
I agree with @Steve Emmerson - 'Coders at Work' would give you some excellent perspective on this issue.

How to prevent the user from decrypting a data file while the program can read it freely?

I want to ask for a mature model to do this.
Suppose I want to deliver a program and a sensitive data file to a user. The program is able to read any data stored in the data file, but the user is not allowed to easily break into the data file. The data file will be encrypted with a standard algorithm such as AES. Now the problem turns into how to manage the key. Putting the key in the program seems to be a bad idea, but what else can I do? Apparently I can't give the key to the user directly.
There is no way to do this securely, i.e. to really prevent the user from reading the data. As long as they have the data and the program that can read it, a competent disassembler will be able to figure out how the program reads it and do the same thing. Or, even easier, they could let the program do it and then pull the decrypted version out of its memory.
Having said that, if you just want to prevent the average user from doing it, hardcoding the key in your source code should be fine. :) Just be realistic about the level of protection this provides.
Does it have to be pure software? If not, you could look at a solution which does decryption and storage of the key on a hardware device, e.g. a USB dongle.
You can also potentially sidestep the whole problem by having your software retrieve the data from a web service instead of a data file. In that case you can control access to the data much more tightly (i.e. who gets how much of what, and when). This might or might not work for your application.
Otherwise, as others have pointed out, there is no known good pure-software solution.
There is no 100% safe solution to this because at some point you have to have the key loaded into memory so that de/encryption can take place and a savvy-enough hacker will be able to capture it. The best you can do is to make it very difficult to capture and to mitigate exposure to data (by limiting access as much as possible) if the key is compromised.
As far as how to make it safer: you could have a combined key, made up of something stored in the program and something derived from the user's passcode - a sketch of that idea follows.
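A hedged sketch of that combined-key idea in Python (the embedded secret, salt handling, and parameters are illustrative assumptions): a key-derivation function mixes an application-embedded secret with the user's passcode, so neither alone is enough to decrypt the file.

    import hashlib

    # Embedded in (and ultimately recoverable from) the binary - this
    # part is obfuscation, not real secrecy.
    APP_SECRET = bytes.fromhex("8f3a1c9b5d7e2f40aa11bb22cc33dd44")

    def derive_key(passcode, salt):
        """Derive a 256-bit AES key from the app secret + user passcode."""
        material = APP_SECRET + passcode.encode("utf-8")
        return hashlib.pbkdf2_hmac("sha256", material, salt, 200_000)

    # The salt is not secret; store it alongside the encrypted file.
    key = derive_key("correct horse battery staple", salt=b"\x00" * 16)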
Is your perceived user determined? Are they skilled enough to reverse the application or recover the key? If the user is just a generic desktop user, you could probably implement a partial key using some general encryption just to make the key non-obvious; beyond that, a determined individual will be able to reverse most simple means of encrypting keys and data.
A DVD Jon conundrum, eh? Why is having a key in the program bad? You could have a super-obscured function which computes it reliably once. Someone with a disassembler and debugger can break your key given enough time, IMO.

MUD Programming questions

I used to play a MUD based on the Smaug Codebase. It was highly customized, but was the same at the core. I have the source code for this MUD, and am interested in writing my own (Just for a fun project). I've got some questions though, mostly about design aspects. Maybe someone can give me a hand?
What language should I use? Interpreted or compiled? Does it make a difference? SMAUG is written in C. I am comfortable with a lot of languages, and have no problem learning more.
Is there a particular approach I should follow to not hinder performance? Object Oriented, functional, etc?
What medium should I use for storing data? Flat files (This is what SMAUG uses), or something like SQLite. What are the performance pros/cons of both?
Are there any guides that anyone knows of on how to get started on a project like this?
I want it to scale to allow 50 players online at a time with no decrease in performance. If I used Ruby 1.8 (very slow), would it make a difference compared to using Python 3.1 (faster) or compiled C/C++?
If anyone can lend a hand and give some info or advice, I'd be eternally grateful.
I'll give this a shot:
In 2009, for a 50-player game, it doesn't matter. You may want to pick a language you have good profiling tools for if you want to grow it further, but since RAM is so cheap nowadays, the constraints that drove the early LPMUD (which I have experience with) and DikuMUD (which your Smaug is derived from) don't apply. (LPMUD could handle ~10-15 players on a machine with 8MB of RAM.)
The programming style doesn't necessarily lead to performance difficulties: large sites like Amazon's 'obidos' webserver are written in C, but just-as-large sites like the original Yahoo Stores were written in Lisp, StackOverflow is written in ASP.NET, etc. I'd /personally/ use C, but many people would call me a sadist.
Flat files are kind of pointless in this day and age for most data storage, though there are specific exceptions (large mailservers sometimes use 'maildir', which is structured flat files, for example). At your game's size you likely won't run into slowness driven by data-retrieval delays, but data integrity in case of a crash is probably the most convincing argument.
I don't know of any guide, but what I'd do is start the game as a dumb chat server: make sure users can log in and do something (take their input and dump it to all other users). Then build that up to specific logins, so you start facing the challenges of username/password handling and user option setting/storage/retrieval. Then start adding the gamedriver elements (get tic-tac-toe games working in game), then go a little more complex (get a 5-room setup working with objects you can pick up / drop / bash each other with), then add some non-player characters, and THEN worry about slurping in the Diku-derived Smaug castles etc. and working with them. :)
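The "dumb chat server" starting point might look something like this minimal sketch (Python rather than Smaug's C, purely for brevity; the port and structure are assumptions): every line a client sends is echoed to all other connected clients.

    import socket
    import threading

    clients = []              # connected client sockets
    lock = threading.Lock()

    def handle(conn):
        try:
            for line in conn.makefile("r"):    # one message per line
                with lock:                     # broadcast to everyone else
                    for other in clients:
                        if other is not conn:
                            other.sendall(line.encode())
        finally:
            with lock:
                clients.remove(conn)
            conn.close()

    def main(port=4000):
        srv = socket.create_server(("", port))
        while True:
            conn, _ = srv.accept()
            with lock:
                clients.append(conn)
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

    if __name__ == "__main__":
        main()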
This is a bit off the cuff; I'm sure there are dissenting opinions. :) Good luck!
This is a text based game, right? In that case, with current hardware, it seems all you would have to worry about is not accidentally creating an O(n**2) algorithm. Even that probably wouldn't be too bad with 50 users.

How to Convince Programming Team to Let Go of Old Ways?

This is more of a business-oriented programming question that I can't seem to figure out how to resolve. I work with a team of programmers who have been working with BASIC for over 20 years. I was brought in to help write the same software in .NET, only with updates and modern practices. The problem is that I can't seem to get any of the other 3 team members (all BASIC programmers, though one does .NET now as well) to understand how to correctly do a relational database. Here's the thing they won't understand:
We basically have a transaction that keeps track of a customer's tag information. We need to be able to track current transactions and past transactions. In the old system, a flat-file database was used that had one table containing records with the customer's basic current transaction, and another table containing all the previous transactions of the customer along with important money information. To prevent redundancy, they would overwrite the current transaction with the history transactions (the history file was updated first, then the current one). It's totally unnecessary, since you only need one transaction table, but neither my supervisor nor my other two co-workers can seem to understand this. How exactly can I convince them to see the light, so that we won't have to do ridiculous amounts of work and end up hitting the database too many times? Thanks for the input!
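For concreteness, the single-table design the asker is arguing for might look like this hedged sqlite3 sketch (Python; the column names are illustrative, not from the post): current and historical transactions live in one table, and "current" is simply the latest row per customer.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transactions (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            tag_info    TEXT,
            amount      REAL,
            created_at  TEXT NOT NULL   -- ISO timestamp
        );
        CREATE INDEX idx_tx_customer
            ON transactions (customer_id, created_at);
    """)

    # "Current" transaction = newest row per customer; no second table
    # and no overwrite dance between history and current files.
    current = conn.execute("""
        SELECT * FROM transactions
        WHERE customer_id = ?
        ORDER BY created_at DESC
        LIMIT 1
    """, (42,)).fetchone()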
Firstly, I must admit it's not absolutely clear to me from your description what the data structures and logic flows in the existing system actually are. This suggests that perhaps you are not making yourself clear to your co-workers either, so one of your priorities must be to be able to explain, verbally or preferably in writing and diagrams, the current situation and the proposed replacement. Please take this as an observation rather than a criticism of your question.
Secondly, I do find it quite remarkable that programmers with 20 years' experience do not understand relational databases and transactions. Flat-file coding went out of the mainstream a very long time ago - I first handled relational databases in a commercial setting back in 1988, and they were pretty commonplace by the mid-90s. What sector and product type are you working on? It sounds possible that you might be dealing with some sort of embedded or otherwise 'unusual' system, in which case you need to make sure you don't have a communication issue and aren't overlooking a large elephant that hasn't been pointed out to you - you wouldn't be the first 'consultant' brought into a team who has been set up in some manner by not being fed the appropriate information. That said, such archaic shops do still exist - one of my current clients' systems interfaces to a flat-file based system coded in COBOL, and yes, it is hell to manage ;-)
Finally, if you are completely sure of your ground and you are faced with a team who won't take your recommendations on board - and demonstration code is a good idea if you can spare the time - then you'll probably have to accept the decision gracefully and move on. In that position I would attempt to abstract the issue away: can the database updates be moved into stored procedures, for example, so the code to update both tables lives in the SP and can be modified at a later date to move to your schema without a corresponding application change? Make sure your arguments are well documented and recorded so you can revisit them should the opportunity arise.
You will not be the first coder who's had to implement a sub-optimal solution because of office politics. Use it as a learning experience for your own personal development in handling such situations, and console yourself with the thought that you'll get paid for the additional work. Often the deciding factor in such arguments is not the logic but the 'weight of reputation' you bring to the table - having just been brought in, it sounds like you don't have much of that sort of leverage with your team, so you may have to earn it by excelling at implementing what they do agree to before your opinion carries sufficient weight in subsequent cases - you need to be modded up first!
Sometimes you can't.
If you read some XP books, they often say that one of your biggest hurdles will be convincing your team to abandon what they have always done.
Generally they will recommend letting people who can't adapt go to other projects (Or just letting them go).
Code reviews might help in your case. Mandatory code reviews of every line of code is not unheard of.
Sometimes the best argument is an example. I'd write a prototype (or a replacement, if it's not too much work). With an example to examine, it will be easier to see the pros and cons of a relational database.
As an aside, flat-file databases have their places since they are so much easier to "administer" than a true relational database. Keep an open mind. ;-)
I think you may have to lead by example - when people see that the "new" way is less work they will adopt it (as long as you don't rub their noses in it).
I would also ask yourself whether the old design is actually causing a problem or whether it is just aesthetically annoying. It's important to pick your battles - if the old design isn't causing a performance problem or making the system hard to maintain you may want to leave the old design alone.
Finally, if you do leave the old design in place, try and abstract the interface between your new code and the old database so if you do persuade your co-workers to improve the design later you can drop the new schema in without having to change anything else.
It is difficult to extract a whole lot except general frustration from the original question.
Yes, there are a lot of techniques and habits long-timers pick up over time that can be useless and even costly in light of technology changes. Some things that made sense when processing power, memory, and even disk was expensive can be foolish attempts at optimization now. It is also very much the case that people accumulate bad habits and bad programming patterns over time.
You have to be careful though.
Sometimes there are good reasons for the things those old timers do. Sadly, they may not even be able to verbalize the "why" - if they even know why anymore.
I see a lot of this sort of frustration when newbies come into an enterprise software development shop. It can be bad even when the environment is all fairly modern technology and tools. If most of your experience is in writing small-community desktop and Web applications a lot of what you "know" may be wrong.
Often there are requirements for transaction journaling at a level above what your DBMS may do. Quite often it can be necessary to go beyond DB transaction semantics in order to ensure time-sequence correctness, once-and-only-once updating, resiliency, and non-repudiation.
And this doesn't even begin to address the issues involved in enterprise or inter-enterprise scalability. When you begin to approach half a million complex transactions a day you will find that RDBMS technology fails you. Because relational databases are not designed to handle high transaction volumes you must often break with standard paradigms for normalization and updating. Conventional RDBMS locking techniques can destroy scalability no matter how much hardware you throw at the problem.
It is easy to dismiss all of it as stodginess or general wrong-headedness - even incompetence. But be careful because this isn't always the case.
And by the way: there are other models besides the RDBMS, and the alternative to an RDBMS is not necessarily "flat files" - contrary to the experience of most coders today. There are transactional hierarchical DBMSs that can handle much higher throughput than an RDBMS. IMS is still very much alive in large IBM shops, for example. Other vendors offer similar software for different platforms.
Of course in a 4-man shop maybe none of this applies.
Sign them up for some decent training, and then it's up to you to convince them that with new technologies a lot more is possible (or at least easier!).
But I think the most important thing here is that professional, certified trainers teach them the basics first. They will be more impressed by that instead of just one of their colleagues telling them: "hey, why not use this?"
Related post here.
The following may not apply in your situation, but you make very little mention of technical details, so I thought I'd mention it...
Sometimes, if the access patterns for current data are very different from those for historical data (I'm making this example up, but say current data is accessed thousands of times per second, touches a small subset of columns, and all of it fits in less than 1 GB, whereas historical data runs to thousands of GBs, is accessed only hundreds of times per day, and access is to all columns), then what your co-workers are doing would make perfect sense as a performance optimization. By separating the current data (albeit redundantly) you can optimize the indices and data structures in that table for the higher-frequency access patterns, which you could not do in the historical table.
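A hedged sketch of that split (same illustrative sqlite3 columns as the earlier sketch): a narrow, hot current table alongside a wide, cold history table, each tuned to its own access pattern.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Narrow, hot table: one row per customer, optimized for
        -- thousands of point lookups per second.
        CREATE TABLE current_tx (
            customer_id INTEGER PRIMARY KEY,
            tag_info    TEXT,
            amount      REAL
        );

        -- Wide, cold table: full history with all columns, scanned
        -- only a few hundred times per day.
        CREATE TABLE history_tx (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            tag_info    TEXT,
            amount      REAL,
            created_at  TEXT NOT NULL
        );
    """)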
Not everything that is "academically" or "technically" correct from a purely relational perspective makes sense in an actual practical situation.
