How to provide exclusive copies of a big data repository to many developers?

Here's a situation I am facing right now at work:
We currently have 300GB+ of production data, and it grows substantially every day. It lives in a MongoDB cluster.
The data science team is working on a few algorithms that need access to all of this data at once and may update it in place, so they have replicated the data into a dev environment to use until they are sure their code works.
If multiple devs run their algorithms at the same time, some or all of them may get unexpected output because the other algorithms are also updating the data.
This problem would be easily solved if everyone had their own copy of the data!
However, given the volume of data, it's not feasible for me to provide each of them (8 developers right now) with an exclusive copy every day. Even if I automate the process, we would have to wait for each copy to complete over the wire.
I am hoping for a future-proof approach, considering we'll be dealing with TBs of data quite soon.
I assume many organizations face such issues, and I wonder how other folks approach a case like this.
I'd highly appreciate any pointers, leads, or solutions.
Thanks

You can try using snapshots of the replicated data so that each developer has their own "copy" of the data. See the Snapshots definition, and ask your cloud provider whether it can provide writable snapshots.
Note that snapshots are created almost instantly, and at the moment of creation they require almost no storage space, because the technology stores pointers to the data rather than the data itself. Unfortunately, each snapshot can grow up to the size of the original volume, because any change to the data triggers a physical copy of the affected data: the technology behind this is usually CoW (copy-on-write). So there is a serious danger that uncontrolled snapshots can "eat" all your free storage space.
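To make the pointer idea concrete, here is a toy model in Python. It is only a sketch of the principle, not how any particular storage system or cloud provider implements snapshots; the `Volume` and `Snapshot` classes and the block contents are made up for illustration.

```python
class Volume:
    """Toy block store: maps block number -> bytes (hypothetical)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)


class Snapshot:
    """Writable view over a base volume that stores only changed blocks."""

    def __init__(self, base):
        self.base = base      # pointer to the shared data, no copy is made
        self.changed = {}     # blocks written through this snapshot only

    def read(self, n):
        # Prefer the private copy; otherwise fall through to the shared base.
        return self.changed.get(n, self.base.blocks[n])

    def write(self, n, data):
        # The first write to a block is what starts consuming extra space.
        self.changed[n] = data

    def extra_blocks(self):
        return len(self.changed)


base = Volume({0: b"alice", 1: b"bob", 2: b"carol"})
dev_a = Snapshot(base)   # created instantly, uses no extra blocks yet
dev_b = Snapshot(base)

dev_a.write(1, b"bob-updated")   # only dev_a sees this change
```

Each developer's snapshot costs space only for the blocks they modify, which is also why an algorithm that rewrites most of the data makes its snapshot grow towards the full volume size.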


what is the best way to store preprocessed data in machine learning pipeline?

In my case, the raw data is stored in a NoSQL database. Before training an ML model, I need to preprocess that raw data. Once I have preprocessed it, what is the best way to keep the preprocessed data?
1. Keep it in memory
2. Keep it in another table in the NoSQL database
3. Can you recommend other options?
Depends on your use case, size of the data, tech stack and machine learning framework / library. Truth be told, without knowledge of your data and requirements, no-one on SO will be able to give you a complete answer.
In terms of passing data to the model/ running the model, load it in memory. Look at batching your data into the model if you hit memory limits. Or use an AWS EMR cluster!
For the question on storing the data, I’ll use the previous answer’s example of Spark and try to give some general rules.
If the processed data is “Big” and regularly accessed (e.g. once a month/week/day), then store it in a distributed manner and load it into memory when running the model.
For Spark, the best bet is to write it as partitioned Parquet files or to a Hive data warehouse.
The key thing about those two is that they are distributed. Spark will create N Parquet files containing all your data. When it comes to reading the dataset into memory (before running your model), it can read from many files at once, saving a lot of time. TensorFlow does a similar thing with the TFRecord format.
If your NoSQL database is distributed, then you can potentially use that.
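The "many files, read at once" idea can be sketched without Spark. The example below is a stand-in, not Spark or Parquet: it uses plain CSV shards and a thread pool from the Python standard library, and the function names, file naming, and sample rows are all assumptions made for illustration.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_shards(rows, out_dir, n_shards):
    """Split rows round-robin into n_shards CSV files and return their paths."""
    out_dir = Path(out_dir)
    paths = [out_dir / f"part-{i:05d}.csv" for i in range(n_shards)]
    for i, path in enumerate(paths):
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows[i::n_shards])
    return paths


def read_shard(path):
    """Load one shard back into typed (int, float) tuples."""
    with open(path, newline="") as f:
        return [(int(key), float(value)) for key, value in csv.reader(f)]


def read_all(paths):
    """Read every shard concurrently and concatenate the results."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(read_shard, paths))
    return [row for part in parts for row in part]


# Hypothetical preprocessed data: (id, feature) pairs.
rows = [(i, i * 0.5) for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(rows, tmp, n_shards=3)
    loaded = read_all(shard_paths)
```

With a real columnar format and a cluster the shards would sit on different machines, which is where the time saving comes from; the row order after loading is per shard, so sort if order matters.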
If it won’t be regularly accessed and is “small”, then just run the code from scratch & load into memory.
If the processing takes no time at all and it’s not used for other work, then there’s no point storing it. It’s a waste of time. Don’t even think about it. Just focus on your model, get the data in memory and get running.
If the data won’t be regularly accessed but is “Big”, then it's time to think hard!
You need to think carefully about the trade-off between processing time and data storage capability.
How much will it cost to store this data?
How often is it needed?
Is it business critical?
When someone asks for this, is it always a “needed to be done yesterday” request?
Etc.
---
The Spark framework is a good solution for what you want to do. Learn more about it here: spark. Spark for machine learning: here.

Is R+bigmemory package sufficient for column-oriented data management?

I have a collection of financial time series of various sorts. Most of my analysis is either column- or row-oriented; I very rarely have to run any sort of complex query. Also, I am (by now) doing almost all analysis in R.
Because of this, I am seriously considering not deploying any sort of RDBMS and instead managing the data in R directly (saving RDS files). This would save me the pain of installing and administering a DB, and would probably improve data loading speeds as well.
Is there any reason I should consider otherwise? Do you know anyone who manages their data this way? I know this is vague, but I am looking for opinions, not answers.
If working in R is your comfort zone, I'd keep your data management there as well, even if your analyses or runs take longer.
I've had a similar decision lately:
Should I go in the direction of learning and applying a new language/dialect/system to shave some milliseconds off execution time?
or...
Should I go forth with the same stodgy old tools I have used, even if they will run slower at execution time?
Is the product of your runs for you only? If so, I'd stick with data management in R only, even if production runs are slower.
If you were designing something for a bank, a cell phone service, or a similar transactional environment, I'd recommend finding the super solution.
But if your R production is for you, I'd stay in R.
Consider the opportunity cost. Learning a new language/ecosystem - and something like PostgreSQL surely qualifies - will soak up far more time than you likely think. Those skills may be valuable, but will they generate a return on time invested that is as high as the return you would get from additional time spent on your existing analysis?
If it's for personal use and there is no pressing performance issue, stick with R. Given that it's generally easier to do foolish things with text and RDS files than it is with a fully-fledged DB, just make sure you back up everything. From being a huge skeptic about cloud-based storage I have over the past half-year become a huge convert and all but my most sensitive information is now stored there. I use Dropbox, which maintains previous versions of data if you do mess up badly.
Being able to check a document or script from the cafe on the corner on your smartphone is nice.
There is a column-by-column management package, colbycol, on CRAN, designed to provide DB-like functions for large datasets. I assume the author must have conducted the same sort of analysis.

Data Removal Standards

I want to write an application that removes data from a hard drive. Are there any standards that I need to adhere to which will ensure that my software removes at least the bare minimum, or should I just use off the shelf software? If so any advice?
I think any "standard" you may encounter won't be any less science fiction or science mysticism than anything you come up with yourself. Basically, as long as you physically overwrite the data (even just once), there's no commercial forensic service that - even in the face of any amount of money you throw at them - will claim to be able to recover your data.
(Any "overwrite 35 times with rotating bit patterns" advice may have been true for coarsely spaced magnetic tapes in the 1970s, but it is entirely irrelevant for contemporary hard disks).
The far more important problem you have to solve is how to overwrite data physically. This is essentially impossible through any sort of application or even OS programming, and you'll have to find a way to talk to the hardware properly and get a reliable confirmation that the location you intended to write to has indeed been written to, and that there aren't any relocations of the clusters in question to other parts of the disk that might leak the data.
So in essence this is a very low-level question that'll probably have you poring over your hard disk manufacturer's manuals quite a bit if you want a genuine solution.
Please define "data removal". Is this scrubbing in order to make undeletion impossible, or simply deletion of data?
It is common to write over a file several times with a random bit pattern if one wants to make sure it cannot be recovered. Due to the analog nature of the magnetic bit patterns, it might be possible to recover overwritten data in some circumstances.
In any case, a normal file system delete operation can be reverted in most cases. When you delete a file (using a normal file system delete operation), you remove the file allocation table entry, not the data.
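The difference between deleting and overwriting can be sketched at the file level in Python. This is a minimal illustration only, and the function name is made up: it asks the OS to write random bytes over the file's contents before removing the directory entry, but it cannot guarantee that the same physical sectors are hit. File system journaling, SSD wear levelling, and sector relocation can all leave old copies behind.

```python
import os
import tempfile


def overwrite_and_delete(path, chunk_size=1024 * 1024):
    """Overwrite a file once with random bytes, flush to disk, then unlink it.

    Returns the number of bytes that were overwritten.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:       # open in place, without truncating
        remaining = size
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())           # ask the OS to push the data to the device
    os.remove(path)                    # this step alone only drops the entry
    return size


# Demo on a throwaway temporary file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"secret" * 100)
overwritten = overwrite_and_delete(demo_path)
```

Calling only `os.remove` corresponds to the plain delete described above: the entry goes away and the data stays on disk until something else reuses the space.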
There are standards... see http://en.wikipedia.org/wiki/Data_erasure
You don't give any details, so it is hard to tell whether they apply to your situation. Deleting a file with the OS's built-in file deletion can almost always be reverted. OTOH, formatting a drive (NOT quick format) is usually OK, except when you deal with sensitive data (like data from clients, patients, finance etc. or some security-relevant stuff). In that case, use the above-mentioned standards, which usually apply different amounts/rounds/patterns of overwriting to make it nearly impossible to revert the deletion. In really, really sensitive cases you first use the best of these methods, then format the drive, then use that method again, and then destroy the drive physically (which in fact means real destruction, not only removing the electronics or similar!).
The best way to avoid all this hassle is to plan for this kind of thing and to use strong, proven full-disk encryption (with a key NOT stored on the drive electronics or media!). This way you can simply format the drive (NOT quick) and then sell it, for example, since strongly encrypted data looks like "random data" and is (if implemented correctly) absolutely useless without the key(s).

What cache strategy do I need in this case?

I have what I consider to be a fairly simple application. A service returns some data based on another piece of data. A simple example, given a state name, the service returns the capital city.
All the data resides in a SQL Server 2008 database. The majority of this "static" data will rarely change. It will occasionally need to be updated and, when it does, I have no problem restarting the application to refresh the cache, if implemented.
Some data, which is more "dynamic", will be kept in the same database. This data includes contacts, statistics, etc. and will change more frequently (anywhere from hourly to daily to weekly). This data will be linked to the static data above via foreign keys (just like a SQL JOIN).
My question is: what exactly am I trying to implement here, and how do I get started doing it? I know the static data will be cached, but I don't know where to start with that. I tried searching but came up with so much stuff that I'm not sure where to begin. Recommendations for tutorials would also be appreciated.
You don't need to cache anything until you have a performance problem. Once you have a noticeable problem and have measured your application tiers to determine that your database is in fact a bottleneck (which it rarely is), start looking into caching data. It is always a trade-off: memory vs CPU vs real-time data availability. There is no reason to make your application more complicated than it needs to be just because.
An extremely simple 'win' here (I assume you're using WCF here) would be to use the declarative attribute-based caching mechanism built into the framework. It's easy to set up and manage, but you need to analyze your usage scenarios to make sure it's applied at the right locations to really benefit from it. This article is a good starting point.
Beyond that, I'd recommend looking into one of the many WCF books that deal with higher-level concepts like caching and try to figure out if their implementation patterns are applicable to your design.
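If measurement does show the database to be the bottleneck, the simplest strategy for the "static" data in the question is to load it once at startup and refresh it by restarting. The sketch below shows that idea in Python, with `sqlite3` standing in for SQL Server; the `state_capitals` table, the `StaticDataCache` class, and the sample rows are hypothetical, and this is not the WCF attribute-based mechanism mentioned above.

```python
import sqlite3


class StaticDataCache:
    """Loads rarely-changing lookup data once; restart the app to refresh."""

    def __init__(self, conn):
        rows = conn.execute(
            "SELECT state, capital FROM state_capitals"
        ).fetchall()
        self._capitals = dict(rows)   # state -> capital, held in memory

    def capital_of(self, state):
        # Served from memory; no database round trip per request.
        return self._capitals.get(state)


# Stand-in database with a couple of sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_capitals (state TEXT PRIMARY KEY, capital TEXT)"
)
conn.executemany(
    "INSERT INTO state_capitals VALUES (?, ?)",
    [("Texas", "Austin"), ("Ohio", "Columbus")],
)

cache = StaticDataCache(conn)

# Later changes to the table are not seen until the cache is rebuilt.
conn.execute("DELETE FROM state_capitals")
```

The more "dynamic" data (contacts, statistics) would still be queried directly, or cached with an expiry time, since a restart-only refresh does not suit data that changes hourly.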

How to Convince Programming Team to Let Go of Old Ways?

This is more of a business-oriented programming question that I can't seem to figure out how to resolve. I work with a team of programmers who have been working with BASIC for over 20 years. I was brought in to help write the same software in .NET, only with updates and modern practices. The problem is that I can't seem to get any of the other 3 team members (all BASIC programmers, though one does .NET now as well) to understand how to design a relational database correctly. Here's the thing they won't understand:
We basically have a transaction that keeps track of a customer's tag information. We need to be able to track current transactions and past transactions. In the old system, a flat-file database was used that had one table containing records with the basic current transaction of the customer, and another table containing all the previous transactions of the customer along with important money information. To prevent redundancy, they would overwrite the current transaction with the history transactions (the history file was updated first, then the current one). It's totally unnecessary, since you only need one transaction table, but neither my supervisor nor either of my other two co-workers seems to understand this. How exactly can I convince them to see the light so that we won't have to do ridiculous amounts of work and end up hitting the database too many times? Thanks for the input!
Firstly, I must admit it's not absolutely clear to me from your description what the data structures and logic flows in the existing system actually are. This implies to me that perhaps you are not making yourself clear to your co-workers either, so one of your priorities must be to be able to explain, either verbally or preferably in writing and diagrams, the current situation and the proposed replacement. Please take this as an observation rather than any criticism of your question.
Secondly, I do find it quite remarkable that programmers with 20 years' experience do not understand relational databases and transactions. Flat-file coding went out of the mainstream a very long time ago - I first handled relational databases in a commercial setting back in 1988 and they were pretty commonplace by the mid-90s. What sector and product type are you working on? It sounds possible to me that you might be dealing with some sort of embedded or otherwise 'unusual' system, in which case you do need to make sure that you don't have some sort of communication issue and aren't overlooking a large elephant that hasn't been pointed out to you - you wouldn't be the first 'consultant' brought into a team who has been set up in some manner by not being fed the appropriate information. That said, such archaic shops do still exist - one of my current client's systems interfaces to a flat-file based system coded in COBOL, and yes, it is hell to manage ;-)
Finally, if you are completely sure of your ground and you are faced with a team who won't take on board your recommendations (and demonstration code is a good idea if you can spare the time), then you'll probably have to accept the decision gracefully and move on. In this position I would attempt to abstract out the issue: can the database updates be moved into stored procedures, for example, so that the code to update both tables is in the SP and can be modified at a later date to move to your schema without a corresponding application change? Make sure your arguments are well documented and recorded so you can revisit them later should the opportunity arise.
You will not be the first coder who's had to implement a sub-optimal solution because of office politics - use it as a learning experience for your own personal development about handling such situations, and console yourself with the thought that you'll get paid for the additional work. Often the deciding factor in such arguments is not the logic, but the 'weight of reputation' you yourself bring to the table - it sounds like, having been brought in, you don't have much of that sort of leverage with your team, so you may have to work on gaining a reputation by excelling at implementing what they do agree to do before you have sufficient reputation in subsequent cases - you need to be modded up first!
Sometimes you can't.
If you read some XP books, they often say that one of your biggest hurdles will be convincing your team to abandon what they have always done.
Generally they will recommend letting people who can't adapt go to other projects (Or just letting them go).
Code reviews might help in your case. Mandatory code reviews of every line of code are not unheard of.
Sometimes the best argument is an example. I'd write a prototype (or a replacement if it's not too much work). With an example to examine, it will be easier to see the pros and cons of a relational database.
As an aside, flat-file databases have their places since they are so much easier to "administer" than a true relational database. Keep an open mind. ;-)
I think you may have to lead by example - when people see that the "new" way is less work they will adopt it (as long as you don't rub their noses in it).
I would also ask yourself whether the old design is actually causing a problem or whether it is just aesthetically annoying. It's important to pick your battles - if the old design isn't causing a performance problem or making the system hard to maintain you may want to leave the old design alone.
Finally, if you do leave the old design in place, try and abstract the interface between your new code and the old database so if you do persuade your co-workers to improve the design later you can drop the new schema in without having to change anything else.
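One way to abstract that interface is to have the application depend on a small storage interface, with the two-table layout and the single-table layout as interchangeable implementations. The sketch below uses Python and `sqlite3` purely for illustration (the project in the question is .NET); every class, table, and column name is hypothetical, and the two-table behaviour is only one reading of the layout described in the question.

```python
import sqlite3
from abc import ABC, abstractmethod


class TransactionStore(ABC):
    """What the application code depends on, regardless of schema."""

    @abstractmethod
    def record(self, customer, amount): ...

    @abstractmethod
    def current(self, customer): ...

    @abstractmethod
    def history(self, customer): ...


class TwoTableStore(TransactionStore):
    """Legacy layout: a history table plus one 'current' row per customer."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE txn_history "
                     "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
        conn.execute("CREATE TABLE txn_current "
                     "(customer TEXT PRIMARY KEY, amount REAL)")

    def record(self, customer, amount):
        # History is written first, then the current row is overwritten.
        self.conn.execute("INSERT INTO txn_history (customer, amount) "
                          "VALUES (?, ?)", (customer, amount))
        self.conn.execute("INSERT OR REPLACE INTO txn_current "
                          "(customer, amount) VALUES (?, ?)",
                          (customer, amount))

    def current(self, customer):
        row = self.conn.execute("SELECT amount FROM txn_current "
                                "WHERE customer = ?", (customer,)).fetchone()
        return row[0] if row else None

    def history(self, customer):
        rows = self.conn.execute("SELECT amount FROM txn_history "
                                 "WHERE customer = ? ORDER BY id", (customer,))
        return [r[0] for r in rows]


class SingleTableStore(TransactionStore):
    """Proposed layout: one table; 'current' is simply the latest row."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE txn "
                     "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

    def record(self, customer, amount):
        self.conn.execute("INSERT INTO txn (customer, amount) VALUES (?, ?)",
                          (customer, amount))

    def current(self, customer):
        row = self.conn.execute("SELECT amount FROM txn WHERE customer = ? "
                                "ORDER BY id DESC LIMIT 1",
                                (customer,)).fetchone()
        return row[0] if row else None

    def history(self, customer):
        rows = self.conn.execute("SELECT amount FROM txn WHERE customer = ? "
                                 "ORDER BY id", (customer,))
        return [r[0] for r in rows]


def demo(store):
    """Application code: identical for either schema."""
    store.record("acme", 10.0)
    store.record("acme", 25.0)
    return store.current("acme"), store.history("acme")


legacy = demo(TwoTableStore(sqlite3.connect(":memory:")))
proposed = demo(SingleTableStore(sqlite3.connect(":memory:")))
```

Because `demo` never mentions a table name, swapping the schema later means replacing one class rather than touching application code, which is the same effect the stored-procedure suggestion aims for.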
It is difficult to extract a whole lot except general frustration from the original question.
Yes, there are a lot of techniques and habits long-timers pick up over time that can be useless and even costly in light of technology changes. Some things that made sense when processing power, memory, and even disk were expensive can be foolish attempts at optimization now. It is also very much the case that people accumulate bad habits and bad programming patterns over time.
You have to be careful though.
Sometimes there are good reasons for the things those old timers do. Sadly, they may not even be able to verbalize the "why" - if they even know why anymore.
I see a lot of this sort of frustration when newbies come into an enterprise software development shop. It can be bad even when the environment is all fairly modern technology and tools. If most of your experience is in writing small-community desktop and Web applications a lot of what you "know" may be wrong.
Often there are requirements for transaction journaling at a level above what your DBMS may do. Quite often it can be necessary to go beyond DB transaction semantics in order to ensure time-sequence correctness, once-and-only-once updating, resiliency, and non-repudiation.
And this doesn't even begin to address the issues involved in enterprise or inter-enterprise scalability. When you begin to approach half a million complex transactions a day you will find that RDBMS technology fails you. Because relational databases are not designed to handle high transaction volumes you must often break with standard paradigms for normalization and updating. Conventional RDBMS locking techniques can destroy scalability no matter how much hardware you throw at the problem.
It is easy to dismiss all of it as stodginess or general wrong-headedness - even incompetence. But be careful because this isn't always the case.
And by the way: there are other models besides the RDBMS, and the alternative to an RDBMS is not necessarily "flat files" - contrary to the experience of most coders today. There are transactional hierarchical DBMSs that can handle much higher throughput than an RDBMS. IMS is still very much alive in large IBM shops, for example. Other vendors offer similar software for different platforms.
Of course in a 4-man shop maybe none of this applies.
Sign them up for some decent training, and then it's up to you to convince them that with new technologies a lot more is possible (or at least easier!).
But I think the most important thing here is that professional, certified trainers teach them the basics first. They will be more impressed by that instead of just one of their colleagues telling them: "hey, why not use this?"
Related post here.
The following may not apply in your situation, but you make very little mention of technical details, so I thought I'd mention it...
Sometimes the access patterns for current data are very different from those for historical data. I'm making this example up, but say that current data is accessed 1000s of times per second, touches only a small subset of columns, and fits in less than 1 GB, whereas historical data takes up 1000s of GBs, is accessed only 100s of times per day, and is read across all columns.
Then what your co-workers are doing would make perfect sense as a performance optimization. By separating out the current data (albeit redundantly), you can optimize the indices and data structures in that table for the higher-frequency access patterns in a way that you could not in the historical table.
Not everything that is "academically", or "technically" correct from a purely relational perspective makes sense when applied in an actual practical situation.
