Silo data for users into a particular physical location - azure-cosmosdb

We are designing a system which has to hold data for people globally, including countries with very strict data protection policies, specifically where data about their citizens must physically reside in that country.
Now, we could engineer a mechanism for silo-ing/querying the data so that it is pulled from a particular location, but as the system will be Azure-based, we were hoping that Cosmos DB's partitioning feature might be an option.
Based on the information available to date on partitioning, it seems possible to assign a location-specific partition for some data, but it's not very clear. Any search on partitioning in general goes on about high availability and low latency - good things, but not what I'm looking for.
To this end, can location-specific data be assigned in Cosmos DB as part of its feature set, or is this something that has to be engineered on top?

For data sovereignty, you must engineer a data access layer across multiple Cosmos DB accounts. Cosmos DB by default will replicate your data across all regions within your account (which is not what you need).
While not specifically for this scenario, you can see a description of how to build such a layer here: https://learn.microsoft.com/en-us/azure/cosmos-db/multi-region-writers
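As an illustrative (not official) sketch of what such a layer might look like in Python with the azure-cosmos SDK: the account endpoints, keys, database/container names, and the region-to-account mapping below are assumptions for illustration only.

```python
# Minimal sketch: route each user's data to a Cosmos DB account pinned to their
# home jurisdiction, so that citizens' data physically stays in the required country.
# Endpoints, keys, and the region->account mapping are illustrative assumptions.
from azure.cosmos import CosmosClient

REGIONAL_ACCOUNTS = {
    # region code -> (account endpoint, account key); one account per jurisdiction
    "DE": ("https://myapp-germany.documents.azure.com:443/", "<germany-key>"),
    "RU": ("https://myapp-russia.documents.azure.com:443/", "<russia-key>"),
    "default": ("https://myapp-global.documents.azure.com:443/", "<global-key>"),
}

_clients = {}

def _container_for(region: str):
    """Lazily create a client per regional account and return its data container."""
    endpoint, key = REGIONAL_ACCOUNTS.get(region, REGIONAL_ACCOUNTS["default"])
    if region not in _clients:
        _clients[region] = CosmosClient(endpoint, credential=key)
    db = _clients[region].get_database_client("appdb")
    return db.get_container_client("userdata")

def save_user_document(user_region: str, doc: dict):
    # All reads/writes for a user go to the account hosted in their jurisdiction
    _container_for(user_region).upsert_item(doc)
```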

Related

Apache Ignite Data Segregation

I have an application that creates persistent caches in a fixed data region (MYAPP_REGION) with fixed cache names (MyApp.Data.Class1, MyApp.Data.Class2, etc.).
I am deploying 2 instances of this application for 2 different customers, but they use the same Ignite cluster.
What is the correct way to segregate the data between the instances: should I change the cache names to be per customer, or is a region per customer enough?
In an RDBMS scenario, we would create 2 different databases, so I am wondering how we would achieve the same thing when using Ignite as the storage solution.
Well, as you have mentioned, there are a variety of options. If it's only logical division and you are OK with resource sharing, just like with a regular RDBMS, then use multiple caches/tables or different SQL schemas. Keep in mind the desired data distribution and the number of caches/tables per customer. For example, if you have 3 nodes and 3 customers with about the same amount of data, most likely you'd want to use a custom affinity function to make each customer's data collocated on a single node, but that's a slightly different question.
If you want more physical division, for example if one of the customers needs more resources or special features like native persistence, then it's better to follow the separate-regions approach, though that might end up requiring separate clusters.
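As a rough illustration of the logical-division option (separate caches per customer on a shared cluster), a thin-client sketch with pyignite might look like the following. The cache naming convention is just an assumption, and data regions themselves are configured in the server-side node configuration rather than from a client.

```python
# Sketch of the "multiple caches per customer" approach on a shared Ignite cluster.
# Cache names are illustrative; data regions are defined server-side, not here.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

def cache_for(customer: str, entity: str):
    # e.g. "CustomerA.MyApp.Data.Class1" vs "CustomerB.MyApp.Data.Class1"
    return client.get_or_create_cache(f"{customer}.MyApp.Data.{entity}")

cache_a = cache_for("CustomerA", "Class1")
cache_b = cache_for("CustomerB", "Class1")

cache_a.put(1, "record for customer A")
cache_b.put(1, "record for customer B")   # same key, fully separate data
print(cache_a.get(1), "|", cache_b.get(1))
```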

Storing data on edges of GraphDB

It's being proposed that we store data about a relationship between two vertices on the edge between them. The idea is that these two vertices are related, and there are user-level pieces of information that we want to store in the graph. The best example I can think of would be a Book and a Reader, where the Reader can store cliff notes on the edge for retrieval later on.
Is this common practice? It seems to me that we should minimize the amount of data living on edges and that the vast majority of GraphDB data should be derived data, rather than using it as an actual data store. Given that it's in memory, what happens when it goes down? (We're using Neptune, so there are technically backups.)
Sorry if the question is a bit vague, but I'm not sure how else to ask. I've googled around looking for best practices, and it's all pretty generic material related to the concepts and theories of graph databases.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?
Without much additional detail it is hard to provide exact modeling advice, but in general one of the advantages of using a graph database is that edges are first-class citizens and can carry properties. A common use case for this would be something like PERSON - purchases -> Product, where you might have a purchase_date property on the purchases edge to represent the date of the purchase, as someone might buy the same thing multiple times.
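As a small illustrative sketch of that PERSON - purchases -> Product pattern using gremlinpython (the Neptune endpoint, vertex lookups, and property values below are placeholders, not guidance specific to your model):

```python
# Sketch: store the purchase date directly on the "purchases" edge.
# The endpoint and the vertex lookups are illustrative assumptions.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

conn = DriverRemoteConnection("wss://<your-neptune-endpoint>:8182/gremlin", "g")
g = traversal().withRemote(conn)

# One person can buy the same product many times; each purchase is its own edge
g.addE("purchases") \
    .from_(__.V().has("person", "name", "Alice")) \
    .to(__.V().has("product", "sku", "B-123")) \
    .property("purchase_date", "2024-05-01") \
    .next()

# Later: find all dates on which Alice bought that product
dates = g.V().has("person", "name", "Alice") \
    .outE("purchases").where(__.inV().has("product", "sku", "B-123")) \
    .values("purchase_date").toList()
conn.close()
```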
I am not sure exactly what you mean by "a vast majority of GraphDB data be derived data": you can use graphs to derive and infer data/relationships based on the connections, but they fully support storing data as well.
Given that it's in memory, what happens when it goes down? - Amazon Neptune (and most other DBs) uses a buffer cache to store some data in memory, but that data is also persisted to disk, so if the instance goes down, there is no problem recovering it from durable storage.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it? - Just as with any database, I would not recommend exposing the Gremlin API directly to consumers, as doing so comes with a whole host of potential security risks. Generally, the underlying data store of any application should be transparent to the users. They should be interacting with an interface like REST/GraphQL that is designed to answer business-related questions, without knowing or caring that there is a graph database backing those requests.
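As a sketch of that layering (Flask is used purely as an example framework; the route, the notes_for_reader helper, and the sample data are invented for illustration):

```python
# Sketch: consumers call a business-oriented REST endpoint; any Gremlin traversal
# stays behind the API boundary. Framework, route, and helper are illustrative.
from flask import Flask, jsonify

app = Flask(__name__)

def notes_for_reader(reader_id: str, book_id: str) -> list[str]:
    # In a real service this would run a Gremlin traversal against Neptune,
    # reading the notes stored on the reader->book edge. Stubbed here.
    return ["Chapter 1 summary", "Favourite quote on p. 42"]

@app.route("/readers/<reader_id>/books/<book_id>/notes")
def get_notes(reader_id: str, book_id: str):
    # The caller never sees Gremlin, vertices, or edges - just a business answer.
    return jsonify(notes_for_reader(reader_id, book_id))

if __name__ == "__main__":
    app.run()
```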

Which of the Azure Cosmos DB types of database should I use to log simple events from my mobile application?

I would like to set up event logging for my application. Simple information such as date (YYYYMMDD), activity and appVersion. Later I would like to query this to give me some simple information such as how many times a certain activity occurred for each month.
From what I see, there are a few different types of database in Cosmos, such as NoSQL and Cassandra.
Which would be the most suitable to meet my simple needs?
You can use the Cosmos DB SQL API for storing this data. It has rich querying capabilities and also has great support for aggregate functions.
One thing you would need to keep in mind is your data partitioning strategy: design your container's partition key accordingly. Considering you're going to do data aggregation on a monthly basis, I would recommend creating a partition key based on year and month so that data for a given month (and year) stays in a single logical partition. However, please note that a logical partition can only contain 10 GB of data (including indexes), so you may have to rethink your partitioning strategy if you expect the data to go above that.
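If it helps, here is a rough sketch of that layout in Python with the azure-cosmos SDK; the account endpoint and key, the container name, and the yearMonth property are assumptions for illustration.

```python
# Sketch: partition event documents by a yearMonth key so each month's data lands
# in one logical partition and monthly aggregates stay single-partition queries.
# Endpoint, key, and property names are illustrative assumptions.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("telemetry")
events = db.create_container_if_not_exists(
    id="events", partition_key=PartitionKey(path="/yearMonth")
)

# One event logged from the mobile app
events.create_item({
    "id": "evt-0001",
    "yearMonth": "202405",          # derived from the YYYYMMDD date
    "date": "20240517",
    "activity": "openedReport",
    "appVersion": "1.4.2",
})

# "How many times did each activity occur in May 2024?"
query = (
    "SELECT c.activity, COUNT(1) AS occurrences FROM c "
    "WHERE c.yearMonth = @ym GROUP BY c.activity"
)
for row in events.query_items(query,
                              parameters=[{"name": "@ym", "value": "202405"}],
                              partition_key="202405"):
    print(row)
```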
A cheaper alternative for you would be Azure Table Storage; however, it doesn't have such rich querying capabilities and has no aggregation capability. With some code (running in Azure Functions, for example), you can aggregate the data yourself.

How to implement content-based authorization in ASP.NET and SQL Server?

I am developing a data application for governments, and I have a situation in which I need to make data shared with all users on one page, but with different privilege levels that control authorization based on locations, not just simple admin, viewer, or editor roles. For example, I have a Locations table that contains regions, cities, and districts in a hierarchical pattern, and all data displayed on the page is affected by the selected location: a user who is authorized for a city can see only data related to that city, and users can be authorized for multiple cities and multiple datasets within the page. If we maintained the user inside each record, we would need the number of records multiplied by the authorized locations, all multiplied by the authorized datasets, which can grow without bound. So what's the best practice for storing those user roles for each single data record related to a specific location?
I'd have a look at row-level security in the first instance. It allows you to set up security policies that make different rows available to different groups/users.
Microsoft's documentation for this starts here: https://learn.microsoft.com/en-us/sql/relational-databases/security/row-level-security
And there's also a good tutorial on Pluralsight.
I'd post some examples but it's too broad a subject to answer here.
One point to note - there are performance limitations with row-level security.
If security isn't as much of a concern, then a mapping table and some clever joins are a popular way to go, and also much faster than row-level security, depending on your data size.
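To make the mapping-table idea concrete, here is a minimal sketch, shown via Python/pyodbc for brevity; the table and column names (UserLocationAccess, CityData, LocationId) and the connection string are invented purely for illustration, and the same join works equally well from ASP.NET / ADO.NET.

```python
# Sketch of the mapping-table approach: UserLocationAccess maps users to the
# locations they may see, and every data query joins through it.
# Table/column names and the connection string are invented for illustration.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=GovData;Trusted_Connection=yes;"
)

def city_data_for_user(user_id: int):
    sql = """
        SELECT d.*
        FROM   CityData d
        JOIN   UserLocationAccess a
               ON a.LocationId = d.LocationId   -- the user sees only mapped locations
        WHERE  a.UserId = ?
    """
    cur = conn.cursor()
    cur.execute(sql, user_id)
    rows = cur.fetchall()
    cur.close()
    return rows
```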

Does Riak (open source) support some form of multi-site replication?

The site is unclear on this one, only saying
Masterless multi-site replication
Does this imply that there is some master-master or master-slave system to replicate to another site?
What are the other options to back up a single-server or multi-server Riak DB to another site?
We only provide multi-site replication in the enterprise product. It is a separate feature not present in the open-source code. As the description notes, it is not a master-slave system - this allows nodes to be down at either end.
Riak is partition tolerant because it's eventually consistent (AP in the CAP theorem); however, just having nodes in two data centers doesn't give you all the benefits of full replication. You may not have any copies of a particular piece of data in one data center just because you have nodes there. If a data center went down, or there was a routing issue on the net, the data would eventually become consistent once the data center became available again, but during the outage the full set of data would not be in both places.
For example, the default bucket property for r (the read quorum) is n_val/2 + 1 - this means that if you are configured for 3 replicas (n_val), at least 2 nodes have to respond. So even if the data center that was still up had a node with a copy of a piece of data, the read wouldn't be considered valid, because the other two replicas were in the data center that was down.
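To make that arithmetic concrete, here is a tiny illustrative sketch (plain Python, no Riak client involved):

```python
# Illustration of the default read quorum: r = n_val/2 + 1 (integer division).
n_val = 3                  # number of replicas per key
r = n_val // 2 + 1         # default read quorum -> 2 when n_val = 3

def read_is_valid(replicas_responding: int) -> bool:
    """A read only succeeds if at least r of the n_val replicas respond."""
    return replicas_responding >= r

print(read_is_valid(3))    # True  - whole cluster reachable
print(read_is_valid(1))    # False - only one copy is reachable in the surviving data center
```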
For information on backing up a Riak cluster see: http://wiki.basho.com/Backups.html
If you have specific questions, please feel free to contact us on the riak-users mailing list:
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Masterless means exactly that. There is no one master node and therefore no slave nodes in the system.
Riak partitions your data among whatever servers (Basho people call them nodes) you give it and then it replicates, by default, each node's data to 2 other nodes. In essence, if your nodes are in separate data centres, then your data is replicated to multiple sites automatically.
There are a few extra details I've left out, like virtual nodes, and I'm willing to expand on those if you need it. The gist of my answer, though, is that servers in multiple data centres, added to the system and managed by Riak, will give you multi-site replication.
