Rails how to have models hit a different database dynamically - ruby-on-rails-7

Looking to see if it's possible to have a Rails app hit multiple dbs dynamically. To be more precise:
I have an app that can operate in different regions.
Each request that comes in will identify the region.
In MySQL, one region corresponds to exactly one db.
The dbs are identical in terms of schema, so the AR models are all the same. It's just that, depending on the request, I want the model object to be retrieved/updated from one of the per-region dbs.
All of the data is isolated to that particular db. There is never any crossover, nor any need to query multiple dbs at the same time.
One way to avoid multiple dbs is to add a "region" column to all the models/tables (don't really like that).
Another way to do this would simply be to fire up different instances for different regions. Again, don't really want to do that given all the config overhead (cloud servers, nginx, etc, etc).
Any ideas?

I found that Rails 6.1 introduced the notion of horizontal sharding. That was what I needed. And I found this article useful:
https://www.freshworks.com/horizontal-sharding-in-a-multi-tenant-app-with-rails-61-blog/
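For anyone who wants the shape of the idea without reading the article: below is a minimal sketch in Python (not Rails) of per-request routing to one of several identically structured databases. In Rails 6.1+ the switch itself is done with `connects_to shards: ...` and `ActiveRecord::Base.connected_to(shard: ...)`; everything in the sketch (the region names, `connected_to`, `handle_request`, the `users` table, in-memory SQLite standing in for MySQL) is a hypothetical stand-in for illustration.

```python
import sqlite3
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical regions; each one gets its own database with the same schema.
REGIONS = ("us", "eu")
_connections = {region: sqlite3.connect(":memory:") for region in REGIONS}
for conn in _connections.values():
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Holds the region selected for the request currently being handled.
_current_region = ContextVar("current_region")


@contextmanager
def connected_to(region):
    """Route every query inside the block to the given region's database."""
    if region not in _connections:
        raise KeyError(f"unknown region: {region}")
    token = _current_region.set(region)
    try:
        yield
    finally:
        _current_region.reset(token)


def db():
    """Return the connection for the active region (fails if none is set)."""
    return _connections[_current_region.get()]


def create_user(name):
    cursor = db().execute("INSERT INTO users (name) VALUES (?)", (name,))
    db().commit()
    return cursor.lastrowid


def user_names():
    return [row[0] for row in db().execute("SELECT name FROM users ORDER BY id")]


def handle_request(region, name):
    """Stand-in for a controller action: pick the shard once, then use models."""
    with connected_to(region):
        create_user(name)
        return user_names()
```

The model-level functions never mention a region; only the request wrapper does, which is what keeps the data for each region isolated without a "region" column.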

Related

Apache Ignite Data Segregation

I have an application that creates persistent caches on a fixed region (MYAPP_REGION) with fixed cached names (MyApp.Data.Class1, MyApp.Data.Class2, ...etc.)
I am deploying 2 instances of this application for 2 different customers, but they use the same ignite clusters.
What is the correct way to separate the data between the instances: do I change the cache names to be per customer, or is a region per customer enough?
In an RDBMS scenario, we would create 2 different databases, so I am wondering how we would achieve the same thing when using Ignite as the storage solution.
Well, as you have mentioned, there are a variety of options. If it's only a logical division and you are OK with resource sharing, just like with a regular RDBMS, then use multiple caches/tables or different SQL schemas. Keep in mind the desired data distribution and the number of caches/tables per customer. For example, if you have 3 nodes and 3 customers with about the same amount of data, most likely you'd want a custom affinity function to collocate each customer's data on a single node, but that's a somewhat different question.
If you want more physical division, for example, if one of the customers needs more resources or special features like native persistence, then it's better to follow the different regions approach which might end up having separate clusters though.
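As a rough illustration of the purely logical option, here is a small Python sketch that derives per-customer cache names from the fixed names in the question. It does not talk to Ignite at all; the helper names and the customer ids used with it are hypothetical, and the "customer id as prefix" convention is just one possible choice.

```python
# Base cache names taken from the question.
BASE_CACHES = ("MyApp.Data.Class1", "MyApp.Data.Class2")


def customer_cache_name(customer_id, base_name):
    """Prefix a base cache name with the customer id so names never collide."""
    if not customer_id.isalnum():
        raise ValueError(f"customer id must be alphanumeric: {customer_id!r}")
    return f"{customer_id}.{base_name}"


def caches_for(customer_id):
    """All cache names that one application instance should create and use."""
    return [customer_cache_name(customer_id, base) for base in BASE_CACHES]
```

Each deployed instance would be configured with its own customer id and only ever open the names returned by `caches_for`.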

At what point do you need more than one table in dynamodb?

I am working on an asset tracking system that also manages the concept of "projects". The users of this application perform maintenance activities on their customer's assets, so they need an action log where actions on an asset start life as a task in a project. For example, "Fix broken frame" might be a task where an action would have something like "Used parts a, b, and c to fix the frame" with a completed time and the employee who performed the action.
The conceptual data model for the application starts with a Customer that has multiple locations and each location has multiple assets. Each asset should have an associated action log so it is easy to view previous actions applied to that asset.
To me, that should all go in one table based upon the logical ownership of that data. Customer owns Locations which own Assets which own Actions.
I believe I should have a second table for projects as this data is tangential to the Customer/Location/Asset data. However, because I read so much about how it should all be one table, I'm not sure if this delineation only exists because I've modeled the data incorrectly because I can't get over the 3NF modeling that I've used for my entire career.
Single table design doesn't forbid you from creating multiple tables. Instead, it encourages you to use a single table per microservice (meaning: store correlated data, which you want to access together, in the same table).
Let's look at some anecdotes from experts:
Rick Houlihan tweeted over a year ago
Using a table per entity in DynamoDB is like deploying a new server for each table in RDBMS. Nobody does that. As soon as you segregate items across tables you can no longer group them on a GSI. Instead you must query each table to get related items. This is slow and expensive.
Alex DeBrie responded to a tweet last August
Think of it as one table per service, not across your whole architecture. Each service should own its own table, just like with other databases. The key around single table is more about not requiring a table per entity like in an RDBMS.
Based on this, you should ask yourself ...
How related is the data?
If you'd build using a relational database, would you store it in separate databases?
Are those actually 2 separate micro services, or is it part of the same micro service?
...
Based on the answers to those (and similar) questions you can argue to either keep it in one table, or to split it across 2 tables.
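To make the "store what you access together in one table" point concrete, here is a minimal in-memory Python sketch of how the Customer/Location/Asset/Action hierarchy could share one partition via composite sort keys. It is not the DynamoDB API: `SingleTable`, the `PK`/`SK` layout, and the ids and date are all assumptions for illustration, and `query` only imitates a key-condition query with a `begins_with` on the sort key.

```python
from dataclasses import dataclass, field


@dataclass
class SingleTable:
    """Toy stand-in for one table with a partition key (PK) and sort key (SK)."""
    items: list = field(default_factory=list)

    def put(self, pk, sk, **attrs):
        self.items.append({"PK": pk, "SK": sk, **attrs})

    def query(self, pk, sk_prefix=""):
        """Return items in one partition whose sort key starts with sk_prefix."""
        matches = [
            item for item in self.items
            if item["PK"] == pk and item["SK"].startswith(sk_prefix)
        ]
        return sorted(matches, key=lambda item: item["SK"])


table = SingleTable()
# Hypothetical ids; the hierarchy is encoded in the sort key.
table.put("CUSTOMER#c1", "LOCATION#l1", name="Depot")
table.put("CUSTOMER#c1", "LOCATION#l1#ASSET#a1", name="Frame")
table.put(
    "CUSTOMER#c1",
    "LOCATION#l1#ASSET#a1#ACTION#2024-01-05",
    note="Used parts a, b, and c to fix the frame",
)
table.put("CUSTOMER#c1", "LOCATION#l2", name="Yard")

# One query returns an asset together with its action log.
asset_with_log = table.query("CUSTOMER#c1", "LOCATION#l1#ASSET#a1")
```

Whether projects belong in the same table then comes down to whether you ever need them back in the same query as these items. Note that a plain prefix match would also pick up ids that merely share a prefix (such as `l1` and `l10`), so real key designs usually use fixed-width ids or a trailing delimiter.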

How to handle large quantities of inter-related data

Currently I have a DB structure in MySQL with a few dozen tables with various foreign key linkages among them. All of the data is in files that I'm going to load in, so I'm hoping I can port the design over to a storage system that works with Drupal 7, since I can simply set up something (using the Feeds module?) to get my data in a way that Drupal 7 likes. The ultimate purpose of this is lots of manual human revision: linking together entries in tables with relations and possibly revising some field data that looks wrong. So the whole goal is to build the human interface for viewing and editing (particularly adding) relations in Drupal 7. The question is, what is the proper way for the data to be stored so that I have to write as little module code as possible?
It seems to me that I would choose one of 3 modules to accomplish this task:
Relation
Entity Reference
Data
Relation and Entity Reference would allow me to store all of my data in nodes (entities?) in Drupal 7, so Drupal would have "native support" for handling all of the stuff. However, I expect there to be 100s of millions to low billions of nodes with perhaps up to ~3 relations to other nodes in each. How efficiently does Relation or Entity Reference handle this when referencing foreign data (and perhaps grabbing referenced data from that reference, and so on) with Views and the like? Can they support a node having a null reference, since many will be null until a user can set them (so I'd need a way to also have a view to find nodes with particular null references)?
Data is another possibility, but it's in alpha and I wonder about its stability and efficiency. It also seems to me that having all of my data stored in an external MySQL database instead of in Drupal nodes defeats the entire purpose of using Drupal in the first place. Is my feeling on this correct?
I'm having a difficult time nailing down what I would need to manage my content, which seems odd considering Drupal 7 is a CMS. I have to be missing something here, but I'm not sure what it is. What is the most mature module(s?) for handling/interfacing with this large quantity of inter-related data and being able to go through and have a user mostly setting up and managing the links (so "foreign keys") between "tables", along with perhaps field data review and revision? Are there any that would suffice?
Relation and Entity Reference would allow me to store all of my data
in nodes (entities?) in Drupal 7, so Drupal would have "native
support" for handling all of the stuff. However, I expect there to be
100s of millions to low billions of nodes with perhaps up to ~3
relations to other nodes in each. How efficiently does Relation or
Entity Reference handle this when referencing foreign data (and
perhaps grabbing referenced data from that reference, and so on) with
Views and the like?
If you need the data in order to display it, another model will not help you. You might need some pagination, or to give your query a maximum depth of recursion so that it loads only the entities you are interested in. I guess using something that is already supported will save you a lot of work.
Without benchmarks you can hardly find your bottlenecks. So go the easiest way, which seems to be Entity Reference, and optimize as needed. You can create some test data to find limitations early. There will certainly be ways to optimize the requests later.
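A rough sketch of what "maximum depth" and "pagination" mean here, in plain Python rather than Drupal code. The `entities` dictionary, its `refs` field, and all three helper names are hypothetical; a `None` entry models a reference that nobody has filled in yet.

```python
def load_with_references(entities, root_id, max_depth):
    """Collect the root and everything it references, up to max_depth hops."""
    loaded = {}
    frontier = [root_id]
    for _ in range(max_depth + 1):
        next_frontier = []
        for entity_id in frontier:
            if entity_id in loaded or entity_id not in entities:
                continue  # already loaded (cycle) or dangling reference
            loaded[entity_id] = entities[entity_id]
            next_frontier.extend(
                ref for ref in entities[entity_id]["refs"] if ref is not None
            )
        frontier = next_frontier
    return loaded


def with_null_reference(entities):
    """Ids of entities that still have at least one unset reference."""
    return sorted(
        entity_id for entity_id, entity in entities.items()
        if any(ref is None for ref in entity["refs"])
    )


def paginate(ids, page, per_page):
    """Return one zero-indexed page of ids so a listing never loads everything."""
    return ids[page * per_page:(page + 1) * per_page]


# Tiny hypothetical data set: 1 -> 2 -> 3 -> 1 (a cycle), 4 has no reference set.
entities = {
    1: {"refs": [2, None]},
    2: {"refs": [3]},
    3: {"refs": [1]},
    4: {"refs": [None]},
}
```

Generating a few million rows of test data in this shape and timing the equivalent Views queries is the kind of early benchmark suggested above.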

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that, for any given date, generates a graph in memory. I could then run whatever stat I wanted against that graph. Much easier to understand and, in general, much more performant too, except for the graph-generation part.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4J and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-to and valid-from timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, to add temporal information to the edges so that they represent the structure of the graph for a given date; you can then traverse the graph as it was structured back then.
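The filtering rule from the question ("follow only the newest edge between two nodes created before the reference date") can be sketched in plain Python over rows shaped like the MySQL table. This is not Neo4j or Cypher code; the function names and sample rows are hypothetical, and treating each row as a directed edge from `contact_id` to `other_contact_id` is an assumption.

```python
from datetime import date


def graph_as_of(rows, as_of):
    """Keep, per directed pair, the newest edge recorded strictly before as_of.

    rows are (contact_id, other_contact_id, strength, recorded_at) tuples.
    Returns {(contact_id, other_contact_id): strength}.
    """
    latest = {}
    for source, target, strength, recorded_at in rows:
        if recorded_at >= as_of:
            continue  # edge did not exist yet at the reference date
        key = (source, target)
        if key not in latest or recorded_at > latest[key][1]:
            latest[key] = (strength, recorded_at)
    return {key: strength for key, (strength, _) in latest.items()}


def two_way_connections(graph, min_strength):
    """Count pairs connected in both directions at min_strength or better."""
    return sum(
        1
        for (source, target), strength in graph.items()
        if source < target  # count each pair once
        and strength >= min_strength
        and graph.get((target, source), 0) >= min_strength
    )


# Hypothetical sample data.
rows = [
    (1, 2, 2, date(2011, 1, 5)),
    (1, 2, 4, date(2011, 1, 20)),
    (2, 1, 3, date(2011, 1, 10)),
    (1, 3, 5, date(2011, 2, 15)),
    (3, 1, 5, date(2011, 1, 8)),
]

end_of_january = graph_as_of(rows, date(2011, 2, 1))
```

With the edges stored once and filtered per query like this, no per-date copy of the graph is needed; the same idea carries over to a timestamp property on relationships.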
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes are also different (e.g. changing properties), you could either duplicate them, effectively creating different subgraphs, or create a connected list of history nodes on each node that contain just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it in memory for the query, and closes it after you get your answer. You could also configure a proxy server, and send 2 parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing and doing some analytics over data sets, it can definitely answer your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).

Multiple Publishers, Single Subscriber (Data Warehouse)

I'm trying to figure out the best way to populate the staging database in a
data warehouse. I will have a number of databases (identical schema, SQL
Server 2005 Standard). Ideally I'd set up each as a publisher, with the
same publication. There will be a single subscriber database (SQL Server
2005 Enterprise) that will subscribe to each of the publisher databases.
Data in the publisher databases will be modified. The subscriber database
will only be updated by its subscriptions, and therefore does not need to
send changes back to any of the publishers. Publisher databases don't need
to update each other. Replication will be occurring over the internet
(although VPN could be used).
I'm not clear on what kind of replication I should be using for this.
Can I do it with replication? What about incremental fields?
Replication can definitely handle this. You don't have to do anything aside from the bog-standard setup unless there's any overlap between the different publishers' tables. That is, if you have pub_a and pub_b as publishers that both have a table tbl_a, then you either have to publish them to different tables at the subscriber (the destination table is defined in your call to sp_addarticle) or you have to guarantee that the data between disparate publishers will never collide. In the latter case, you also need to be careful about what you supply for the @pre_creation_cmd parameter in your call to sp_addarticle. The default is to drop the table at the subscriber, which means that the last publisher added to the mix would win and the rest would be broken. You'll need to specify 'drop' for the first added publisher and 'none' for the rest. Good luck!
I believe that this would be possible, but you'd set it up the opposite way around from what you've specified. You'd set the central database as the publisher, and you'd use Merge Replication.
Merge Replication includes an option to allow dynamic filters - so what you'd want to do is set the filters up so that each subscriber only receives the rows that it originated - probably by adding a column to some of your tables to include the HOST_NAME() of the server where the row originated. You shouldn't need to do this to every table, because once you've filtered one table, you can have cascading filters that filter out rows from additional tables using joins.
As to "incremental fields" - I assume you're talking here about IDENTITY columns? Luckily, these have also been thought about - basically, the publisher manages the IDENTITY range, and hands out smaller ranges (of 1000 values, by default) to each subscriber.
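To show why handing out ranges avoids collisions, here is a simplified Python simulation of the idea. It is not SQL Server's actual range-management algorithm; the class names are hypothetical, and the only detail taken from the answer is the default range size of 1000.

```python
class IdentityRangeAllocator:
    """Publisher side: hands out consecutive, non-overlapping id ranges."""

    def __init__(self, start=1, range_size=1000):
        self._next_start = start
        self.range_size = range_size

    def allocate(self):
        first = self._next_start
        self._next_start += self.range_size
        return range(first, first + self.range_size)


class Subscriber:
    """Subscriber side: draws ids from its range, asks for more when exhausted."""

    def __init__(self, allocator):
        self._allocator = allocator
        self._ids = iter(allocator.allocate())

    def next_id(self):
        try:
            return next(self._ids)
        except StopIteration:
            # Current range is used up; request a fresh one from the publisher.
            self._ids = iter(self._allocator.allocate())
            return next(self._ids)
```

Because every range is handed out exactly once, two subscribers can insert rows independently and their identity values still never overlap when the rows are merged.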
Caveat - these are the general principles, but I haven't tried this kind of setup myself before. I'd recommend that you try it in a "toy" database first, and try to get it working.
