I'm trying to figure out the best way to populate the staging database in a
data warehouse. I will have a number of databases (identical schema, SQL
Server 2005 Standard). Ideally I'd set up each as a publisher, with the
same publication. There will be a single subscriber database (SQL Server
2005 Enterprise) that will subscribe to each of the publisher databases.
Data in the publisher databases will be modified. The subscriber database
will only be updated by its subscriptions, and therefore does not need to
send changes back to any of the publishers. Publisher databases don't need
to update each other. Replication will be occurring over the internet
(although VPN could be used).
I'm not clear on what kind of replication I should be using for this. Can I do it with replication? What about incremental fields?
Replication can definitely handle this. You don't have to do anything beyond the bog-standard setup unless there's overlap between the different publishers' tables. That is, if you have pub_a and pub_b as publishers that both have a table tbl_a, then you either have to publish them to different tables at the subscriber (the destination table is defined in your call to sp_addarticle), or you have to guarantee that the data from the disparate publishers will never collide.
In the latter case, you also need to be careful about what you supply for the @pre_creation_cmd parameter in your call to sp_addarticle. The default is to drop the table at the subscriber, which means that the last publisher added to the mix would win and the rest would be broken. You'll need to specify 'drop' for the first publisher added and 'none' for the rest. Good luck!
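A rough sketch of those two calls (the publication names are placeholders; tbl_a is the shared table from the example above):

    -- On the first publisher added: let the snapshot drop/recreate the
    -- destination table at the subscriber.
    EXEC sp_addarticle
        @publication = N'pub_a',
        @article = N'tbl_a',
        @source_owner = N'dbo',
        @source_object = N'tbl_a',
        @destination_table = N'tbl_a',
        @pre_creation_cmd = N'drop';

    -- On every publisher added after that: leave the existing table alone.
    EXEC sp_addarticle
        @publication = N'pub_b',
        @article = N'tbl_a',
        @source_owner = N'dbo',
        @source_object = N'tbl_a',
        @destination_table = N'tbl_a',
        @pre_creation_cmd = N'none';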
I believe that this would be possible, but you'd set it up the opposite way around from what you've specified: you'd make the central database the publisher, and you'd use Merge Replication.
Merge Replication includes an option to allow dynamic filters - so what you'd want to do is set the filters up so that each subscriber only receives the rows it originated, probably by adding a column to some of your tables to hold the HOST_NAME() of the server where the row originated. You shouldn't need to do this for every table, because once you've filtered one table, you can have cascading filters that filter rows out of additional tables using joins.
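A hedged sketch of that setup (publication, table, and column names below are assumptions, not from the question): the parent article gets a parameterized row filter on HOST_NAME(), and a join filter cascades it to a child table:

    -- Filter the parent table to rows that originated at the subscriber's host.
    EXEC sp_addmergearticle
        @publication = N'central_pub',
        @article = N'orders',
        @source_owner = N'dbo',
        @source_object = N'orders',
        @subset_filterclause = N'origin_host = HOST_NAME()';

    -- Cascade the filter: child rows follow their parent rows automatically.
    EXEC sp_addmergefilter
        @publication = N'central_pub',
        @article = N'order_lines',
        @filtername = N'order_lines_by_order',
        @join_articlename = N'orders',
        @join_filterclause = N'[order_lines].[order_id] = [orders].[order_id]',
        @join_unique_key = 1;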
As to "incremental fields" - I assume you're talking here about IDENTITY columns? Luckily, these have also been thought about - basically, the publisher manages the IDENTITY range, and hands out smaller ranges (of 1000 values, by default) to each subscriber.
Caveat - these are the general principles, but I haven't tried this kind of setup myself. I'd recommend trying it in a "toy" database first to make sure you can get it working.
Looking to see if it's possible to have a Rails app hit multiple dbs dynamically. To be more precise:
I have an app that can operate in different regions.
Each request that comes in will identify the region.
In mysql, one region corresponds to exactly one db.
The dbs are identical in terms of schema, meaning the AR models are all the same; it's just that, depending on the request, I want the model object to be retrieved from/updated in one of the per-region dbs.
All of the data is isolated to that particular db. There is never any crossover, nor any need to query multiple dbs at the same time.
One way to avoid multiple dbs would be to add a "region" column to all the models/tables (I don't really like that).
Another would simply be to fire up different app instances for different regions. Again, I don't really want to do that given all the config overhead (cloud servers, nginx, etc.).
Any ideas?
I found that Rails 6.1 introduced the notion of horizontal sharding, which was exactly what I needed. I also found this article useful:
https://www.freshworks.com/horizontal-sharding-in-a-multi-tenant-app-with-rails-61-blog/
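For anyone landing here, a minimal sketch of that setup (the region names, database config names, and the Customer model are placeholders):

    # config/database.yml - one config block per region (hypothetical names)
    production:
      primary:
        adapter: mysql2
        database: app_us
      primary_eu:
        adapter: mysql2
        database: app_eu

    # app/models/application_record.rb
    class ApplicationRecord < ActiveRecord::Base
      self.abstract_class = true

      connects_to shards: {
        default: { writing: :primary },
        eu:      { writing: :primary_eu }
      }
    end

    # e.g. in a controller: switch shards per request based on the region
    ActiveRecord::Base.connected_to(role: :writing, shard: :eu) do
      # every AR query in this block hits the EU database
      Customer.find(params[:id])
    end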
I am working on an asset tracking system that also manages the concept of "projects". The users of this application perform maintenance activities on their customer's assets, so they need an action log where actions on an asset start life as a task in a project. For example, "Fix broken frame" might be a task where an action would have something like "Used parts a, b, and c to fix the frame" with a completed time and the employee who performed the action.
The conceptual data model for the application starts with a Customer that has multiple locations and each location has multiple assets. Each asset should have an associated action log so it is easy to view previous actions applied to that asset.
To me, that should all go in one table based upon the logical ownership of that data. Customer owns Locations which own Assets which own Actions.
I believe I should have a second table for projects, as this data is tangential to the Customer/Location/Asset data. However, because I have read so much about how it should all be one table, I'm not sure whether this delineation is justified or whether it only exists because I've modeled the data incorrectly, unable to get past the 3NF modeling I've used for my entire career.
Single-table design doesn't forbid you from creating multiple tables. Rather, it encourages you to use a single table per microservice (meaning: store correlated data that you want to access together in the same table).
Let's look at some anecdotes from experts:
Rick Houlihan tweeted over a year ago:
Using a table per entity in DynamoDB is like deploying a new server for each table in RDBMS. Nobody does that. As soon as you segregate items across tables you can no longer group them on a GSI. Instead you must query each table to get related items. This is slow and expensive.
Alex DeBrie responded to a tweet last August:
Think of it as one table per service, not across your whole architecture. Each service should own its own table, just like with other databases. The key around single table is more about not requiring a table per entity like in an RDBMS.
Based on this, you should ask yourself:
How related is the data?
If you'd build using a relational database, would you store it in separate databases?
Are those actually 2 separate micro services, or is it part of the same micro service?
...
Based on the answers to those (and similar) questions, you can argue either for keeping everything in one table or for splitting it across two tables.
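For concreteness, a hedged sketch of the "store together what you access together" idea using boto3 (the table name and key conventions are assumptions, not from the question):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("AssetTracking")  # hypothetical table name

    # One item collection per asset: the asset's metadata and its action
    # log share a partition key, so a single Query returns both.
    table.put_item(Item={
        "PK": "ASSET#frame-123",
        "SK": "METADATA",
        "customerId": "cust-1",
        "locationId": "loc-7",
    })
    table.put_item(Item={
        "PK": "ASSET#frame-123",
        "SK": "ACTION#2023-01-15T10:00:00Z",
        "description": "Used parts a, b, and c to fix the frame",
        "employee": "emp-42",
    })

    # Fetch the asset and its entire action log in one request.
    resp = table.query(KeyConditionExpression=Key("PK").eq("ASSET#frame-123"))

Projects, being tangential, could then live in a second table owned by the same service without violating the principle.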
I have already implemented two pretty fast ways of paging a large MS SQL Server database table containing at least 1,000,000 records, but have failed to determine the pros and cons of either method; advice on either would be greatly appreciated:
The first is to run the SQL query and return only the Primary Key values of the filtered records, specifying a TOP clause of maybe 100-1000. These can then be placed into a session variable on the web server and paged through accordingly by supplying a subset of Primary Key values back to the server.
One positive of this is that the filtering of the records only occurs once, when the user initially performs the search. It also gives the ability to page through the entire record set one item at a time if need be, such as in previews of upcoming records. This method also helps with further filtering of already filtered records, as filterable options (common attributes, title, directors, etc.) can be determined by supplying the list of Primary Key values back to SQL Server.
The second option is to perform both the filtering and the paging in SQL Server, supplying variables such as 'records per page' and 'page number', etc.
The benefit of this is that there is no need to clog up the web server with user sessions that will undoubtedly be at least 1000+ bytes each, which would only cause problems in the long run as the number of site users increases. The downsides are ultimately the flip side of what I listed as positives for the first option, such as losing the ability to determine filtering options for the whole record set when only a single page or subset of Primary Key values is available on the web server.
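For reference, my option two boils down to the ROW_NUMBER() pattern (works on SQL Server 2005 and later; the table and column names below are made up):

    -- Hypothetical schema; @Search, @PageNumber and @PageSize come from the app.
    WITH Filtered AS (
        SELECT  ID, Title, Director,
                ROW_NUMBER() OVER (ORDER BY Title, ID) AS RowNum
        FROM    dbo.Movies
        WHERE   Title LIKE @Search + '%'
    )
    SELECT  ID, Title, Director
    FROM    Filtered
    WHERE   RowNum BETWEEN (@PageNumber - 1) * @PageSize + 1
                       AND  @PageNumber * @PageSize
    ORDER BY RowNum;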
What are everyone's thoughts given the above, especially with regards to storing the Primary Key values for the results in a session variable, memory, alternate options, etc?
I'm the kind of person who thinks that database time is more valuable than web server time, but that's just my approach.
In your case, how do you retrieve the data? Do you use DataSets/DataTables, or strongly typed containers? Why not use LINQ or another filtering technique on the web server side, or even the client side? (You could send all the records to the user and filter them with JavaScript.)
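A minimal sketch of that web-server-side approach (the Movie type and the already-filtered list are assumptions):

    // Hypothetical: the filtered results already live on the web server
    // (e.g. in session); page through them with LINQ instead of SQL.
    using System.Collections.Generic;
    using System.Linq;

    public class Movie
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public string Director { get; set; }
    }

    public static class Paging
    {
        public static List<Movie> GetPage(List<Movie> filtered, int page, int pageSize)
        {
            return filtered
                .OrderBy(m => m.Title)
                .Skip((page - 1) * pageSize)   // LINQ equivalent of SQL paging
                .Take(pageSize)
                .ToList();
        }
    }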
I am designing a simple messaging service using ASP.NET MVC / Windows Azure Table Storage. I have two kinds of entities - messages and message threads. Relation between them is simple - each thread can have multiple messages but the message can only be assigned to one thread.
Table storage is not a relational DB, so representing relations is always a bit tricky. I need to decide between 2 approaches:
Having one big table for threads and one for messages, with threadId as the partition key of the message entity so that messages are partitioned by thread.
Dynamically creating a special table for each message thread, with threadId as the name of the table.
I tend to prefer the second because it fits better into the architecture of the rest of the service, but it would obviously result in a large number of tables being created in the storage account.
Do you think this may be a problem?
You could also consider having just one table that stores both Thread and Message entities. This would give you transaction support, and you could use Lucifure's hybrid approach on this table.
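A sketch of that one-table layout, written against the current Azure.Data.Tables SDK for illustration (the table name and key conventions are assumptions):

    // Thread and message entities share a PartitionKey (the thread id), so
    // they can be written atomically in one entity-group transaction.
    using Azure.Data.Tables;

    var table = new TableClient("<connection string>", "Messaging");
    table.CreateIfNotExists();

    var thread = new TableEntity("thread-123", "THREAD") { ["Subject"] = "Hello" };
    var message = new TableEntity("thread-123", "MSG-00001") { ["Body"] = "First!" };

    table.SubmitTransaction(new[]
    {
        new TableTransactionAction(TableTransactionActionType.Add, thread),
        new TableTransactionAction(TableTransactionActionType.Add, message),
    });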
Creating a large number of tables may be an issue, depending on how you want to manage them. The underlying REST API for listing tables works like a query for table entities: it only returns the first 1,000 tables, after which you have to use a continuation token. None of the storage explorers I've seen let you query tables by name; they simply list the first 1,000. If you end up with 20,000 threads, it could take you a while to get to the table you want.
One way you could mitigate this is to put your message tables in their own storage account. This way your storage account with all of your other tables won't get crowded out by the dynamic tables that you will be creating and possibly deleting.
Deleting is actually one of the ways in which using a separate table for each thread would be easier. To delete all of the related messages you simply have to delete one table rather than iterating over each message and deleting it.
Everything else, however, will be more complicated than keeping all of the messages in one table. If this is core functionality for your app and you can dedicate enough time to developing it this way, one table per thread is probably a good idea. Otherwise, the easy way to do things is one big table.
You may consider a hybrid approach to keep the number of tables to a manageable level, depending on your scalability needs.
My experience has been that date-based partitioning at the table level is a very effective approach and can be leveraged across the board.
For example, you could partition tables based on date with a granularity of a day or a month. A table name like “Thread201202” would then hold all threads started in February 2012.
Your thread id would implicitly include the “201202” and be something like “201202-myid01”, although you would not need to store that prefix explicitly in the partition key, since it is implied by the table name.
Aged threads could then be easily disposed of by deleting tables more than, say, a year old.
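A tiny sketch of that naming convention (all names here are assumed):

    // Hypothetical: derive the table name from the month a thread started.
    using System;
    using Azure.Data.Tables;

    static string TableNameFor(DateTime threadStart) =>
        "Thread" + threadStart.ToString("yyyyMM");       // e.g. "Thread201202"

    // Disposing of aged threads is then one table delete per month.
    var service = new TableServiceClient("<connection string>");
    service.DeleteTable(TableNameFor(DateTime.UtcNow.AddYears(-1)));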
I have worked on a timesheet application in MVC 2 for internal use in our company. Now other small companies have shown interest in the application. I hadn't considered this use of the application, but it got me interested in what it might imply.
I believe I could make it work for several clients by modifying the database (SQL Server accessed by an Entity Framework model). But I have read some people advocating multiple databases (one for each client).
Intuitively, this feels like a good idea, since I wouldn't risk having the data of various clients mixed up in the same database (which shouldn't happen of course, but what if it did...). But how would a multiple database solution be implemented specifically?
I.e. with a single database I could just have a client register and all the data needed would be added by the application the same way it is now when there's just one client (my own company).
But with a multiple database solution, how would I create a new database programmatically when a user registers? Please note that I have done all database stuff using Linq to Sql, and I am not very familiar with regular SQL programming...
I would really appreciate a clear detailed explanation of how this could be done (as well as input on whether it is a good idea or if a single database would be better for some reason).
EDIT:
I have also seen discussions about the single-database alternative, suggesting that you would then add a ClientId to each table... But wouldn't that be hard to maintain in the code? I would have to add "where" conditions to a lot of linq queries, I assume... And I assume having a ClientId on each table would mean that each table would need a many-to-one relationship to the Client table? Wouldn't that be a very complex database structure?
As it is right now (without the Client table) I have the following tables (1 -> * designates one to many relationship):
Customer 1 -> * Project 1 -> * Task 1 -> * TimeSegment 1 -> * Employee
Also, Customer has a one to many relationship directly with TimeSegment, for convenience to simplify some queries.
This has worked very well so far. Wouldn't it be possible to simply have a Client table (or UserCompany or whatever one might call it) with a one-to-many relationship to the Customer table? Wouldn't the data integrity be sufficient for the other tables, since the rest is handled by the relationships?
As far as whether to use a single database or multiple databases, it really all depends on the use cases. More databases means more management needs, potentially more disk space needs, etc. There are a lot more things to consider here than just how to create the database, such as how you will automate the backup process, etc. I personally would use one database with a good authentication system that filters the data to the appropriate client.
As to creating a database, check out this blog post. It describes how to use SMO (SQL Management Objects) in C#/.NET to create a database. SMO is a really neat tool, and you'll definitely want to familiarize yourself with it.
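The core of it is only a few lines; a hedged sketch (the instance and database names are placeholders):

    // Create a per-client database with SMO.
    using Microsoft.SqlServer.Management.Smo;

    var server = new Server(@".\SQLEXPRESS");           // target SQL Server instance
    var clientDb = new Database(server, "Client_Acme"); // one database per client
    clientDb.Create();                                  // issues CREATE DATABASE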
To deal with the follow-up question: yes, a single, top-level relationship between clients and customers should be enough to limit new customers to their appropriate data.
Without any real knowledge of your application I can't say how complex adding that table will be, but assuming your data layer is up to snuff, I would assume you'd really only need to filter the Customers class by the current client, and then get the rest of your data based on the customers that are available - something like the sketch below.
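A minimal sketch of that filtering, assuming a Linq to Sql data context and a currentClientId pulled from your authentication system (both hypothetical):

    // Scope the root query by client; everything else hangs off Customers.
    var customers =
        from c in db.Customers
        where c.ClientId == currentClientId
        select c;

    // Child data follows the relationships, e.g. all projects for this client.
    var projects = customers.SelectMany(c => c.Projects);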
Did that make any sense?
See my answer here; it applies to your case as well: c# database architecture