What to call staging tables when company uses dev/staging/production environments for software development

I'm setting up a data warehouse in my company. In my experience the initial tables you insert data into, before you transform it into something user friendly, are called staging tables.
However, the tech team here uses dev, staging and production environments for software development. I've asked, and seeing something called *_staging in a production environment would be really confusing to them.
This naming clash must be a common one, so I'm wondering: is it normal practice to just put up with it, or is there a standard alternative name?

In traditional data warehouse design these are usually called "staging tables", although sometimes called "raw tables". You might mitigate the confusion by using the common contraction "stg" instead of "staging".
Increasingly, and especially in big data projects, the term "data lake" is used to refer to the repository that stores copies of source system data. A data lake is essentially persistent staging with some direct access for analytics allowed.
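A minimal sketch of the naming idea, assuming Python's built-in sqlite3 module and hypothetical table names (`stg_orders` for the landing table, `orders` for the user-friendly one). It is only an illustration of the "stg" prefix in use, not a recommended warehouse design:

```python
import sqlite3

def load_orders(conn, raw_rows):
    """Land source rows untouched in stg_orders, then build the clean orders table."""
    conn.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id TEXT, amount TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL)")
    # Staging step: store exactly what the source system sent, as text.
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", raw_rows)
    # Transform step: trim, cast, and skip rows that have no amount.
    conn.execute(
        """
        INSERT OR REPLACE INTO orders (order_id, amount)
        SELECT CAST(order_id AS INTEGER), CAST(TRIM(amount) AS REAL)
        FROM stg_orders
        WHERE TRIM(amount) <> ''
        """
    )
    conn.commit()
    return conn.execute("SELECT order_id, amount FROM orders ORDER BY order_id").fetchall()

# Hypothetical source rows: one clean, one with an empty amount, one integer-like.
conn = sqlite3.connect(":memory:")
rows = load_orders(conn, [("1", " 19.90 "), ("2", ""), ("3", "5")])
print(rows)
```

With this prefix, a table such as `stg_orders` can live in the production warehouse without reading as "the staging environment".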

Related

CMS - How to work with multiple environments? Do I really need them?

I've never worked with a CMS and simply wanted to play with one. As I come from .NET roots, I was thinking about choosing Orchard Core CMS.
Let's imagine a very simple scenario: together with my colleague I'd like to create a blog. As I'm used to working with web-based systems and business applications, for me it's kind of normal to work with a code repository, have multiple environments (dev/test/stage/prod), implement CI/CD, and adjust the database via migrations or scripts.
Now the question is: do I need all of this when working on our blog with a CMS?
To be more specific, I can ask a few questions:
Should I create the blog using the CMS locally (on my PC), create a few articles and then deploy it to the web, or should I create the blog on the internet and add articles in the prod environment directly?
How do I synchronize databases between environments (dev / prod)?
I can add that, as I do not expect many visitors on the website, I was thinking of using Orchard Core CMS together with SQLite. I also expect to customize code, add new modules, extend existing ones etc., not only add content (articles). You can take that into consideration when answering the question.
So basically my question is: what should the workflow be for a person who wants to create, administer and maintain a CMS (let it be a blog), as a single person or as a team?
Should I work and create content locally, then publish it and somehow synchronize both application and database (the database is my main question mark, also in the context of how to do that properly using SQLite)?
Or should all the changes, code and content, simply be managed directly on a server, let's call it the production environment?
Excuse me if the question is silly and hard to understand, but I'm looking for any advice, as I really didn't find any good examples or information about this, or maybe I'm searching in totally the wrong direction.
Thanks in advance.
Great question, not at all silly ;)
When dealing with a CMS, you need to think about the data/content in very different terms from the code/modules, despite the fact that the boundary between them is not always completely obvious.
For Orchard, the recommendation is not to install modules in production, but to have a dev - staging - production type of environment: install new modules on a dev environment, test them in staging, and then deploy to production when it's safe to do so. Depending on the scale of the project, the staging may be skipped for a more agile dev to prod setting but the idea remains the same, and is not very different from any modular application.
Then you have the activation and configuration of the settings of the modules you deploy. Because in a CMS like Orchard, those settings are considered data and stored in the database, they should be handled like content. This includes metadata such as the very shape of the content of your site: content types are data.
Data is typically not deployed like code is, with staging and prod environments (although it can be, to a degree; more on that in a moment). One reason for this is that a CMS will often feature user-provided data, such as reviews, ratings, comments or usage stats. Synchronizing all that in both directions is very impractical. Another, even more important reason is that the very point of using a CMS is to let non-technical owners of the site manage content themselves in a fast and direct manner.
The difference between code and data is also visible in the way you secure their changes: for code, usual source control is still the rule, whereas for the content, you'll set up database backups.
Also important to mention is the structure of the database. You typically don't have to worry about this until you write your own modules: Orchard comes with a rich data migration feature that makes sure the database structure gets updated with the code that uses it. So don't worry about that, the database will just update itself as you deploy code to production.
Finally, I must mention that some CMS sites do need to be able to stage contents and test it before exposing it to end-users. There are variations of that: in some cases, being able to draft and preview content items is enough. Orchard supports that out of the box: any content type can be marked draftable. When that is not enough, there is an optional feature called Deployments that enables rich content deployment workflows that can be repeated, scheduled and validated. An important point concerning that module is that the deployment only applies to the subset of the site's content you decide it should apply to (and excludes, obviously, stuff like user-provided content).
So in summary, treat code and modules as something you deploy in a one-way fashion from the dev box all the way to production, with ordinary source control and deployment methods, and treat data depending on the scenario, from simple direct in production database instances with a good backup policy, to drafts stored in production, and then all the way to complex content deployment rules.

Sharing stored procedures across multiple apps

Team A has an enterprise app that uses ADO.NET for data access and executes stored procedures. The data access is encapsulated in its own project (let's call it DAL.dll).
Team B is creating another unrelated app that's reusing the stored procedures in the enterprise app. This app is currently using the MS application block for data access. The issue we run into is that whenever Team A makes any change to the input/output params in the stored procedures, there is a runtime error in Team B's app, and this app needs to be updated to accommodate the additional params (or params that were removed). Most of these go unnoticed until a user complains. At the very least, we would like to have the app throw a compilation error so that the build process warns us of the changes made.
One way to do this is to have Team B's project add a reference to the DAL.dll.
I'd like to know if there are any other cleaner ways of solving the issue. We are ready to replace Team B's MS Data application block to use a different technology (Entity Framework?) if necessary.
Among the other answers, I'd strongly suggest getting those stored procedures into source control, in a Database Project. You then may be able to use the features of your source control system to do several things:
Lock some of the code so that it cannot be changed
Give you notifications if the code is changed
Warn you if the stored procedures change in a way that would prevent them from being called
Branch the stored procedures so that each team can have their own version of changed code, while keeping the unchanged stored procedures common. You of course will need to separate the different versions in the database.
I agree with the other posters on this thread that you should not share stored procedures across different .NET DLLs; that is just a recipe for disaster. I would also shy away from ORMs like Entity Framework if you are doing anything at all complicated with your database schema, because ORMs excel at getting a simple object model translated from your .NET application classes into SQL tables and SPs, but traditionally do poorly at optimizing them for performance on the database side. There will be people who claim otherwise, and they may have a valid point if you are an expert in wrangling an ORM to do what you want like they are, but chances are you are not and it will cause you headaches in the long run.
A shared data access layer might work, but conceptually you are then just changing the implementation of the dependency from some code that a DBA wrote to some code that a .NET programmer wrote. Yes, you can use integration tests to achieve better verifiability, but the same case could be made for SQL with tools like Red Gate's SQL Test. I would shy away from this approach if the two applications are already experiencing some sort of pain from sharing SPs. That is an indication that the dependency should just be done away with.
If it were up to me, I'd just make a new schema for Team B's app. You can read more about schemas in SQL Server here: MSDN Schema description for 2008 R2. You can think of them as namespaces for SQL Server but with some additional bells and whistles like permission and access control. Separating out your different applications into separate schemas on the same shared database will probably make for the most flexible implementation in the long run.
unrelated app that's reusing the stored procedures in the enterprise app
If these two applications are really unrelated, why are they sharing procedures, or even the same database? I know this is a long read, but I recommend reading this: A Better Path to Enterprise Architectures
The partitioning concept in there relates to the bounded context in Domain-Driven Design:
Multiple models are in play on any large project. Yet when code based on distinct models is combined, software becomes buggy, unreliable, and difficult to understand. Communication among team members becomes confusing. It is often unclear in what context a model should not be applied.
Therefore: Explicitly define the context within which a model applies. Explicitly set boundaries in terms of team organization, usage within specific parts of the application, and physical manifestations such as code bases and database schemas. Keep the model strictly consistent within these bounds, but don’t be distracted or confused by issues outside.
It is expected that you end up with problems when you don't explicitly deal with this. You're lucky you're seeing early failures, as this can turn into problems that are much harder to find in the long run.
Analyze the problem again with the above in mind. Consider if you're missing some explicit context where this common functionality should live.
My question is: which team owns the stored procedures and the shared database? Usually, as a matter of good architecture and design, you should not have two different apps sharing the same database or procedures.
A better way to share data or functionality between two different applications is through a service or API, so the team that owns the functionality is responsible for maintaining it.
Also, good communication between both teams is highly recommended.
Depending on the owner of the DAL project, you could host web services and share the API. That way, you separate the Data Access Layer from the business logic, which allows anyone to use the same DAL without having to publish it to each different location.
From my point of view, it looks like both Team A and Team B should share the same core model and look at Multitier architecture as a possible solution.
It sounds like it would make sense to create a shared DAL that both applications can share.
I would add unit tests (or really integration tests) to make sure the DAL is compatible with the apps after changes. That way your tests would fail if incompatible changes have been made
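One way such a compatibility test could look, as a minimal sketch: the procedure name `GetCustomer` and both parameter lists below are hypothetical. In a real test the current signature would be read from the database catalog (for example `INFORMATION_SCHEMA.PARAMETERS` in SQL Server) instead of being hard-coded.

```python
def diff_parameters(expected, actual):
    """Compare the parameter names an app passes with those the procedure declares."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing_in_procedure": sorted(expected_set - actual_set),
        "unexpected_in_procedure": sorted(actual_set - expected_set),
    }

# Hypothetical contract: what Team B's app passes to a procedure called GetCustomer.
team_b_expects = ["@CustomerId", "@IncludeOrders"]
# Hypothetical current signature after a change by Team A.
current_signature = ["@CustomerId", "@Region"]

report = diff_parameters(team_b_expects, current_signature)
print(report)

# A build step could fail when either list in the report is non-empty.
is_compatible = not any(report.values())
```

Run as part of the build, a check like this turns the runtime error into a failure that is seen before users complain.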
"I'd like to know if there are any other cleaner ways of solving the issue."
The cleanest way is for Team B to sit down with Team A and encapsulate the relevant business logic into a shared API. It doesn't matter so much how you implement that API; what does matter is that the API's interface is documented and versioned so everyone knows what to expect.
One reasonable mechanism for this in a .NET environment is to use Microsoft's WebAPI.
In short, the question of "how do we share a stored procedure?" is most likely looking at the wrong level of abstraction.

Best Practice for maintaining a TSQL database creation script for a web application

We have an ASP.NET web application and need to maintain the database creation and initialization script.
Are there any industry best practices that people know of for maintaining database creation and initialization scripts? I can think of two main approaches.
Maintain a tsql creation script directly by hand.
Maintain a master database and create the script that is then checked into source safe.
Also the script should be able to be tracked through source control, i.e. table order should be controllable.
If possible, it should also include the ability to track initialisation data, either in the same or a separate script.
Currently we generate the script from management studio but the order of the tables seems to be random.
And the more automated the solution the better.
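The random table order mentioned above can be made stable by sorting tables by their foreign-key dependencies. The following is a rough sketch only: the dependency map is hypothetical and would have to be extracted from the real schema, and the alphabetical tie-break is just one possible convention.

```python
import heapq

def creation_order(dependencies):
    """Return table names so that referenced tables come before referencing ones.

    dependencies maps each table to the tables its foreign keys point to.
    Ties are broken alphabetically, so the output is the same on every run.
    """
    tables = set(dependencies)
    for refs in dependencies.values():
        tables.update(refs)
    # Ignore self-references; they do not constrain creation order.
    remaining = {t: set(dependencies.get(t, ())) - {t} for t in tables}
    ready = [t for t, refs in remaining.items() if not refs]
    heapq.heapify(ready)
    ordered = []
    while ready:
        table = heapq.heappop(ready)
        ordered.append(table)
        for other, refs in remaining.items():
            if table in refs:
                refs.remove(table)
                if not refs:
                    heapq.heappush(ready, other)
    if len(ordered) != len(tables):
        raise ValueError("circular foreign keys; create those constraints separately")
    return ordered

# Hypothetical schema: OrderLine references Order and Product, Order references Customer.
fk_dependencies = {
    "OrderLine": ["Order", "Product"],
    "Order": ["Customer"],
    "Customer": [],
    "Product": [],
}
order = creation_order(fk_dependencies)
print(order)
```

Because the order depends only on the schema, two scripts generated from the same database produce a clean diff in source control.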
The problem is not maintaining the script, nor maintaining a 'master' copy of the database. The real problem is upgrading existing database(s). You make your modifications in the developer environment, they are then propagated to the test environment, and finally pushed into the production environment. While at the developer and test environment stages it is possible to start from scratch, in production you always have to upgrade the existing deployment.
In my experience the best practice is to use upgrade scripts. This practice is useful even with a single deployed site, but it becomes invaluable with multiple locations that may be at different versions. Even with one single operational site it is still useful to be able to test the upgrade repeatedly (starting from backups of the current version), keep the changes in source control, and have a well formalized and peer-reviewed change procedure (the upgrade script). And upgrade scripts can be tailored to the specific needs of the operational site, like handling a large table with special care, or dealing with encrypted data, or any of the myriad details that diff-based tools neglect or ignore. The main disadvantage is that the scripts have to be written, which requires real T-SQL knowledge (forget all the 'designers' in your favorite management tool).
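A minimal sketch of the upgrade-script idea, using Python's built-in sqlite3 module only so that it runs anywhere; the answer is about T-SQL, and the `schema_version` table and the two scripts below are assumptions made for illustration:

```python
import sqlite3

# Hypothetical upgrade scripts, keyed by the schema version they produce.
UPGRADES = {
    1: "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)",
    2: "ALTER TABLE customer ADD COLUMN email TEXT",
}

def upgrade(conn, upgrades=UPGRADES):
    """Apply every upgrade newer than the version recorded in the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]
    for version in sorted(upgrades):
        if version > current:
            conn.execute(upgrades[version])
            # Record the step so a later run skips it.
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            conn.commit()
            current = version
    return current

conn = sqlite3.connect(":memory:")
first_run = upgrade(conn)   # applies versions 1 and 2
second_run = upgrade(conn)  # nothing left to apply
print(first_run, second_run)
```

The same runner brings a fresh database and an existing deployment at any older version to the same state, which is what makes repeated upgrade tests from a backup possible.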
You might want to check out RedGate SQL Source Control.
Are you looking for Visual Studio Database Projects?
I use database projects to store all database objects (tables, views, functions, keys, triggers, indexes across schemas) and keep versioning in TFS. You can build the database to ensure that everything is valid. You can deploy to a fresh database, or do a schema comparison with an existing database.
I also keep all reference and setup data in post deployment scripts which are automatically run after deployment.
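The key property of a post-deployment data script is that it can be re-run after every deployment without duplicating rows. A small sketch of that property, assuming SQLite (through Python's sqlite3 module) in place of SQL Server and a hypothetical `order_status` lookup table:

```python
import sqlite3

# Hypothetical lookup data kept in source control next to the schema.
ORDER_STATUSES = [(1, "New"), (2, "Shipped"), (3, "Cancelled")]

def seed_reference_data(conn):
    """Insert or overwrite lookup rows so the script is safe to run repeatedly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS order_status (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # INSERT OR REPLACE keys on the primary key, so re-runs never add duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO order_status (id, name) VALUES (?, ?)",
        ORDER_STATUSES,
    )
    conn.commit()
    return conn.execute("SELECT id, name FROM order_status ORDER BY id").fetchall()

conn = sqlite3.connect(":memory:")
after_first_deploy = seed_reference_data(conn)
after_second_deploy = seed_reference_data(conn)
print(after_first_deploy == after_second_deploy)
```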

Good way to make changes to production database / source code

I'm interested to find out what would be the good way to make changes to production database and source code in web application (ASP.NET, SQL Server 2008).
A little bit more details, we develop on local machines, and then we need to transfer the code and database changes to production (pretty much standard story).
At the moment we do it in the evening: we change the database directly from Management Studio on the production server, and then just overwrite the existing ASP.NET code (copy/paste).
You're talking about Release management. What you're asking about is a big subject with a LOT of different answers. The best solution for you is not something we can tell you. There are trade offs to consider.
For example, what you're describing is a very basic release management process that would be considered an "immature" process.... It does not take into account rollback plans, versioning, separation of concerns, proper testing, or any of a hundred other factors that a "mature" release management process involves.
A mature process is very good, but if you don't have the resources, it's not feasible.
To get to the point, I don't think your question can be answered fully here. I'd suggest starting to research "change management", "release management", "Application Lifecycle Management", and "Application Development Lifecycle". I'll have a few good starter links for you in a minute.
Just a forewarning, though: you are asking a question that's going to open your eyes and your world in ways you probably haven't considered. There are things like automated builds to consider, and tools to do it for you (high priced, free, and everything in between).
http://en.wikipedia.org/wiki/Release_management
http://en.wikipedia.org/wiki/Application_lifecycle_management
A few simple options for JUST what you're asking about can be found here:
http://msdn.microsoft.com/en-us/library/7hd4c0x3(VS.80).aspx
Also, since you talked about source code without mentioning which source control you're using, I need to say... if you're not already using source control, you need to. You'll wonder how you ever lived without it once you start using it.
Depends on whether it's the first deployment of a new app, or an update to the app.
For small updates, record all your database changes as sql scripts. You must strictly enforce that all changes to development are applied as sql scripts. Put the scripts in source control. Deploy the update by running the scripts on production.
For new apps you may have thousands of scripts. You can't run them individually. Consolidating them into a master script takes too much time. (although you still want to check EVERY script into source control). In this case you reach a milestone in development then FREEZE the development database, and declare it a baseline. Use the database tools to generate a master script(s). Deploy production by running this script(s). Manually create data scripts for your lookup tables to keep it separate from junk dev data.
Avoid a database copy. Avoid changing by hand through the GUI. Scripts are the way. How you go about collecting the scripts, consolidating to master scripts, generating the scripts, etc is another story.
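For the small-update case, the recorded scripts have to run in a fixed order. One possible convention, which is an assumption here and not something the answer prescribes, is a numeric filename prefix. The sketch below uses hypothetical filenames and SQLite via Python's sqlite3 module:

```python
import re
import sqlite3

def in_sequence(names):
    """Sort script names by their leading number, so 2_ runs before 10_."""
    def sequence(name):
        match = re.match(r"(\d+)", name)
        if match is None:
            raise ValueError(f"script name has no numeric prefix: {name}")
        return int(match.group(1))
    return sorted(names, key=sequence)

def deploy(conn, scripts):
    """Run each script once, in sequence order; return the names in the order applied."""
    applied = []
    for name in in_sequence(scripts):
        conn.executescript(scripts[name])
        applied.append(name)
    return applied

# Hypothetical change scripts as they might sit in source control.
scripts = {
    "10_add_email.sql": "ALTER TABLE customer ADD COLUMN email TEXT;",
    "2_create_customer.sql": "CREATE TABLE customer (id INTEGER PRIMARY KEY);",
}
conn = sqlite3.connect(":memory:")
applied = deploy(conn, scripts)
print(applied)
```

A plain alphabetical sort would run `10_add_email.sql` first and fail, which is why the prefix is compared as a number.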

Should we have separate database instance for each developer?

What is the best way for developing a database based application? We can have two approaches.
One common database for all the developers.
Separate database for all the developers.
What are the pros and cons of each? And which one is better way?
Edit: More than one developer is supposed to update the database, and we already have SqlExpress 2005 on each developer machine.
Edit: Most of us are suggesting a common database. However, suppose one of the devs has modified the code and the database schema. He has not committed the code changes, but the schema changes have gone to the common database. Will that not possibly break the other developers' code?
Both -
I like a single database that changes are tested on before going live, or going to a 'formal' test environment. This is your developers' sanity check; it stays up to date with the live system and it makes sure they always consider each other's changes. The rule should be that changes don't go on here if they might break something else.
A database per developer is great (even essential) when more than one developer is making updates. It allows them all the development flexibility they want without breaking things for other developers.
The key is to have a process for moving database changes from development through to your live system, and stick to your process.
Shared database
Simpler
Less cases of "It works on my machine".
Forces integration
Issues are found quickly (fail fast)
Individual databases
Never affect other developers, but this is also a bad thing, in continuous integration
We use a shared development database and it works out nicely. Our schema rarely changes in a way that makes it backwards incompatible, but occasionally a design change will occur before we go live, and we simply ask the other developers to update.
We do have separate development application (web) servers, but they share the same database. Our developers do have the option to use their own database, as they know how to set this up, and will do that on occasion, but only temporarily. The norm, for us, is to share the database.
Thought I'd throw this out there, but why not let every developer host their own instance of SQL Server Developer on their desktops and then have a shared server for each of the other environments (development, QA, and prod)? I think even the basic MSDN that comes with Visual Studio Pro (if you opt for it) includes a license for SQL Server Developer.
The developer can work on their desktop without impacting the others and then you can have them move the code to the next shared environment as you see fit (at will, with daily/weekly builds, etc.).
EDIT:
I should add that the desktop instance allows developers to do things that the DBAs often restrict on shared environments. This includes database creation, backup/restore, profiler, etc. These things are not essential, but they allow the developer to become so much more productive while reducing the demands they make on your DBAs.
The shared environment is completely necessary for testing - I would not recommend going from desktop to production. But you can add so much by allowing the developers to have 100% control over a given database environment (including isolation from others) with a relatively minor cost.
Depends on your development, testing and maintenance cycles. Also on the size and location of the development team (and of course organization). If you support several versions of the database you might need even more environments.
In real world I found the following approach rather satisfying:
single central database/application for testing purposes, gets all the changes by various developers periodically merged into it
local copies for development (so you are free to drop and reload the whole database)
upgrade scripts are maintained for any changes to schema, auxiliary and sample data sets
Here are some further points:
If two developers (two teams) are working on changes that can affect each other then they should complete their tasks independently and then integrate/merge and test. For this it is much better to have separate development environments (unless they have to work together in which case I consider them to be a part of the same team; still they can work on their own copies of the database and share it if necessary)
If they work on the changes that do not influence each other they could work on the main server. Or on their own local copies of the database.
So, developing on the local copy has all the benefits with no risk in a general case (when you support multiple versions of the system and maintain upgrade scripts anyway).
Still it is great if you can share test cases so ability to dump/restore the database easily and quickly is a big plus.
EDIT:
All of the above assume that having a copy on the local machine of the whole system for testing purposes is feasible (size, performance, licenses, etc).
I would opt for solution #1 : One common database for all the developers.
Pros
Less expensive for the infrastructure;
Only one dump is required when it's time to refresh the development database;
Everyone develops with the same data, so it closely represents the production environment;
Cons
If one developer performs a bad operation, this could impact a larger amount of developers.
As for solution #2: One independent database for each of the developers;
Pros
This could be useful for new features developments, when development requires isolation;
Cons
More expensive for the company (infrastructure, licences...);
Multiplication of problems caused by isolated development environments (works in the developer's environment, but is not integrated);
Multiplication of dumps by the DBAs of the same copy from the production environment.
Considering the above, I would recommend, depending on your company size:
One database for development;
One database for testing the integration;
One database for acceptance tests;
One for new feature development that will perhaps require integration tests.
If your company doesn't require integration tests, then go with acceptance tests, this step is crucial before going to production.
One per developer plus a continuous integration and build server to run unit and integration tests. That gives you the best of both worlds.
Having all developers modify a single dev database quickly becomes less productive once the amount of database change reaches a certain level because it forces a developer to deploy changes to the shared database before he is ready to check-in, which means other parts of the code line may break unnecessarily.
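As a rough illustration of the "one per developer plus build server" setup, each test can build its own throwaway database from the checked-in schema. This sketch assumes SQLite in memory through Python's sqlite3 module, and the `SCHEMA` string stands in for whatever creation script the project keeps in source control:

```python
import sqlite3
from contextlib import contextmanager

# Hypothetical creation script, normally read from a file in source control.
SCHEMA = "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);"

@contextmanager
def private_database(schema=SCHEMA):
    """Yield a fresh database that no other developer or test can touch."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        yield conn
    finally:
        conn.close()

def count_after_insert():
    """Example test body: insert one row and count rows in the private copy."""
    with private_database() as conn:
        conn.execute("INSERT INTO customer (name) VALUES (?)", ("Ada",))
        return conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

first = count_after_insert()
second = count_after_insert()  # starts from an empty database again
print(first, second)
```

The same function can run unchanged on a developer machine and on the integration server, so an unfinished schema change never reaches anyone else before check-in.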
Simple answer:
Have one development database, and if the developers want their own, they can just run their own instance on their own machines. Just be sure to test/publish on the shared.
We do both:
We use code generation where I'm at and our database is generated as well. So we have an instance on each developer's box where the database is generated. Then we use the scripts that are generated to apply the changes to a central test database. If that goes well we apply the changes to the production database during a release.
What's nice with this approach is that when our "source of truth" is checked in to source control, all the database changes are automatically distributed to the other developers when they rebase and regenerate. It works well for us.
The best way is a single database on the Test/QA server and one database (probably on the developer's local computer) for each developer (so 10 developers work with 10 + 1 databases).
This is the same approach as for general development: each developer has their own copy of the source code on a local machine.
Also, the multiple-database approach simplifies keeping the database schema in version control. We keep database creation scripts in SVN.
We are using the approach, described here:
http://www.sqlaccessories.com/Howto/Version_Control.aspx
You might also want to look at Refactoring Databases. Aside from discussing database changes, he includes discussions on going from development to production in a way that reduces risk.
Why on earth would you want a separate database for all developers?
Have one common database for all, that way the table structure is consistent and the sql statements are as well.
The biggest problems with developers having their own databases are:
First, it is unlikely to be the size of the real production database (if you take all the databases we need to work with here, they would take up several hundred gigabytes of space, and I don't have that available on my machine). This causes bad code to be written that will never work on a large database for performance reasons. SQL code should never be written against a data set significantly smaller than the one on prod.
Second, developers who use their own database create problems when they spend a long time developing something and then find out only after they merge with a real database that it affects something else. You find this stuff much faster when you share the environment, so in the end there is less wasted development time.
Third, developers working on related things need to know about the changes you are making, because they will affect their own changes.
When you know you are going to affect others, I think you tend to be more careful about what you do, which is a plus in my book.
Now the shared database server should have what we call a scratch database, a place where people can create and test table changes. If they are doing something that might need to drop and recreate a table (which should be a rare case!), they can test the process first by copying the table to the scratch database, running their process there, and then changing to the real database when they are sure it works. We also often copy a backup table to the scratch database before testing a particular change, so we can easily recreate the old data if it goes bad.
I see no advantages at all to using individual databases.
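The scratch-database rehearsal described in this answer could look roughly like the following. It is a sketch under assumptions: SQLite's `ATTACH DATABASE` (through Python's sqlite3 module) stands in for a separate scratch database on a shared server, and the `customer` table and the `UPDATE` statement are hypothetical.

```python
import sqlite3

def rehearse_change(conn, table, change_sql):
    """Copy a table into an attached scratch database and run a change there first.

    table must be a trusted identifier; it is placed directly into the SQL text.
    """
    conn.commit()  # finish any open transaction before attaching
    conn.execute("ATTACH DATABASE ':memory:' AS scratch")
    try:
        conn.execute(f"CREATE TABLE scratch.{table} AS SELECT * FROM main.{table}")
        conn.execute(change_sql)
        return conn.execute(f"SELECT * FROM scratch.{table} ORDER BY 1").fetchall()
    finally:
        conn.commit()  # a database cannot be detached inside a transaction
        conn.execute("DETACH DATABASE scratch")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "ada"), (2, "bob")])

# Try the change on the scratch copy only.
rehearsed = rehearse_change(conn, "customer", "UPDATE scratch.customer SET name = UPPER(name)")
untouched = conn.execute("SELECT * FROM customer ORDER BY id").fetchall()
print(rehearsed, untouched)
```

Only after the rehearsed result looks right would the same statement be pointed at the real table.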

Resources