How Do I Purge Data From Corda?

One of the business requirements I have been presented with is a potential process to purge customer-related data (e.g. under GDPR).
This is a hosted solution where I have admin access on all the nodes.
Is there a way to delete states from Corda?
Can it be done without breaking potential links/references?
i.e. without "corrupting" the database or causing lots of errors when people walk the chain history, etc.

At the current stage, we don't support data deletion. You can manually delete the data from the database. However, if you ever need the data for a future transaction, your node will throw an error.
This question was also answered here: Is Corda support state deletion scenario?

Related

DDD: persisting domain objects into two databases. How many repositories should I use?

I need to persist my domain objects into two different databases. This use case is purely write-only. I don't need to read back from the databases.
Following Domain Driven Design, I typically create a repository for each aggregate root.
I see two alternatives. I can create one single repository for my aggregate root (AR), and implement it so that it persists the domain object into the two databases.
The second alternative is to create two repositories, one for each database.
From a domain driven design perspective, which alternative is correct?
My requirement is that it must persist the AR in both databases - all or nothing. So if the first one goes through and the second fails, I would need to remove the AR from the first one.
If you had a transaction manager that were to span across those two databases, you would use that manager to automatically roll back all of the transactions if one of them fails. A transaction manager like that would necessarily add overhead to your writes, as it would have to ensure that all transactions succeeded, and while doing so, maintain a lock on the tables being written to.
If you consider what the transaction manager is doing, it is effectively writing to one database and ensuring that write is successful, writing to the next, and then committing those transactions. You could implement the same type of process using a two-phase commit process. Unfortunately, this can be complicated because the process of keeping two databases in sync is inherently complex.
You would use a process manager or saga to manage the process of ensuring that the databases are consistent:
1. Write to the first database and leave the record in a PENDING status (not visible to user reads).
2. Make a request to the second database to write the record in a PENDING status.
3. Make a request to the first database to move the record to a VALID status (visible to user reads).
4. Make a request to the second database to move the record to a VALID status.
The issue with this approach is that the process can fail at any point. In this case, you would need to account for those failures. For example,
You can have a process that comes through and finds records in PENDING status that are older than X minutes and continues pushing them through the workflow.
You can have a process that cleans up any PENDING records after X minutes and purges them from the database.
Ideally, you are using something like a queue based workflow that allows you to fire and forget these commands and a saga or process manager to identify and react to failures.
The second alternative is to create two repositories, one for each database.
Based on the above, hopefully you can understand why this is the correct option.
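For illustration, here is a minimal sketch of that PENDING/VALID workflow in Python. The connections, the records table and its columns are all hypothetical; in practice the steps would be driven from a queue, and a separate recovery job would resume or purge stalled PENDING records as described above.

```python
# Minimal sketch of the PENDING -> VALID workflow described above.
# db_a and db_b are assumed to be sqlite3-style DB-API connections to the two
# databases; the "records" table and its columns are hypothetical.
import uuid


def write_pending(conn, record_id, payload):
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO records (id, payload, status) VALUES (?, ?, 'PENDING')",
            (record_id, payload),
        )


def mark_valid(conn, record_id):
    with conn:
        conn.execute("UPDATE records SET status = 'VALID' WHERE id = ?", (record_id,))


def save_everywhere(db_a, db_b, payload):
    record_id = str(uuid.uuid4())
    write_pending(db_a, record_id, payload)  # step 1: PENDING in database A
    write_pending(db_b, record_id, payload)  # step 2: PENDING in database B
    mark_valid(db_a, record_id)              # step 3: VALID in database A (now readable)
    mark_valid(db_b, record_id)              # step 4: VALID in database B
    return record_id

# A separate recovery job would periodically look for PENDING records older than
# X minutes and either push them through the remaining steps or purge them.
```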
If you don't need to read back, why not build some sort of command log?
The log acts as a queue: you write the operation to it, and two processes pull new commands from it, each one updating its own database. If you can accept that in the worst case the two DBs may hold different versions of the data, with the guarantee that they will eventually be consistent, this seems much easier to me than transactions spanning two different DBs.
I'm not sure how much of a DDD use case this is: if you don't need to read back, you have no state to manage, so there is no need for entities/aggregates.
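A rough sketch of the command-log idea, with an in-memory log and one consumer per database standing in for real infrastructure (a durable log such as a table or a message broker):

```python
# Rough sketch: a shared command log with one consumer per database.
# The log and both "databases" are in-memory stand-ins; a real setup would use
# a durable log (a table, Kafka, a cloud queue, ...) and separate processes.
from dataclasses import dataclass, field


@dataclass
class CommandLog:
    commands: list = field(default_factory=list)

    def append(self, command):
        self.commands.append(command)


@dataclass
class Consumer:
    """Applies commands to one database, tracking its own position in the log."""
    database: dict
    offset: int = 0

    def catch_up(self, log):
        while self.offset < len(log.commands):
            op, key, value = log.commands[self.offset]
            if op == "upsert":
                self.database[key] = value
            self.offset += 1


log = CommandLog()
consumer_a, consumer_b = Consumer(database={}), Consumer(database={})

log.append(("upsert", "order-1", {"total": 42}))
consumer_a.catch_up(log)  # database A is now up to date
consumer_b.catch_up(log)  # database B may lag, but replaying the log makes it consistent
```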

Does Firebase Realtime Database guarantee FCFS order when serving requests?

This is a rather straightforward question.
Does Firebase Realtime Database guarantee to follow a 'first come, first served' rule when handling requests?
When there is a write request immediately followed by a read request, will the read request fetch the updated data?
When there is a write-read-write sequence of requests, is the read request guaranteed to fetch the data written by the first write?
Suppose there is a write request that could not be performed (due to some connection issues). As Firebase can work even offline, that change will be saved locally. Now, from somewhere else, another write request was made and completed successfully. After this, if the first device comes online, does it modify the values (since its write arrived latest), or not (since it was initiated prior to the latest changes)?
There are a lot of questions in your post, and many of them depend on how you implement the functionality. So it's not nearly as straightforward as you may think.
The best I can do is explain a bit of how the database works in the scenarios you mention. If you run into more questions from there, I recommend implementing the use-case and posting back with an MCVE for each specific question.
Writes from a single client are handled in the order in which that client makes them.
But writes from different clients are handled with a last-write-wins logic. If your use-case requires something else, include a client-side timestamp in the write and use security rules to reject writes that are older than the current state.
Firebase synchronizes state to the listeners, and not necessarily all (write) events that led to this state. So it is possible (and fairly common) for listeners to not get all state changes that happened, for example if multiple changes to the same state happened while they were offline.
A read of data on a client that this client itself has changed will always see the state including its own changes.
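To illustrate the timestamp idea from above: the sketch below uses the Firebase Admin SDK for Python and a transaction that only overwrites the stored value when the incoming update carries a newer client-side timestamp. The path, field names and key file are made up, and for untrusted clients the answer's suggestion of enforcing the same check in security rules is where this belongs.

```python
# Sketch: only apply an update if its client-side timestamp is newer than the
# one already stored. The path, field names and key file are hypothetical.
import time

import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://example-db.firebaseio.com"})

incoming = {"text": "hello", "updatedAt": int(time.time() * 1000)}


def apply_if_newer(current):
    # current is whatever is stored at the reference right now (or None)
    if current is not None and current.get("updatedAt", 0) >= incoming["updatedAt"]:
        return current  # keep the existing, newer state
    return incoming     # otherwise this write wins


db.reference("messages/msg-1").transaction(apply_if_newer)
```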

About the pattern to overcome the one-update-per-second-per-entity limit on Google Datastore

I read this document and, among several very relevant topics, some are key to a scalability problem I am facing.
Basically, the document states that it is possible to overcome the one-update-per-second rate per entity, which is what drove me to Redis in a use case that would not otherwise have demanded it.
"a (google) software engineer in the Datastore team had mentioned a technique to obtain much higher throughput than one update per second on an entity group"
"The basic idea of Job Aggregation is to use a single thread to process a batch of updates. Because there is only one thread and only one transaction open on the entity group, there are no transaction failures due to concurrent updates. You can find similar ideas in other storage products such as VoltDb and Redis."
This is very useful to me, but I don't have any clue how this works.
Could just creating a service and serialising (pull queue) upserts to a specific Kind solve the issue? How could Datastore be sure that no other thread would suddenly begin to upsert?
Thanks
It is important to keep in mind that Job Aggregation is not part of Datastore. As the documentation says, you need to use a single batch of updates. You can take a look at Batch operations here to see how to upsert multiple entities.
About your second question, Datastore is not responsible for ensuring that no other thread begins to upsert; you must ensure that this does not happen in order to get better performance.
In Datastore best practices there are other best practices that Google recommends to get better performance.
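As a sketch of the aggregation idea with the Cloud Datastore Python client: a single worker drains whatever queue you use (the lease_updates helper below is a placeholder, not a real API), merges the queued updates, and writes them in one batch, so only that one worker ever touches the entity group.

```python
# Sketch of single-threaded job aggregation against Datastore.
# lease_updates() is a placeholder for pulling a batch of queued updates
# (for example from a pull queue or Pub/Sub); it is not a real API.
from google.cloud import datastore

client = datastore.Client()


def lease_updates():
    # Placeholder: return (entity_id, properties) pairs pulled from a queue.
    return [("counter-1", {"count": 10}), ("counter-2", {"count": 3})]


def aggregation_worker():
    entities = []
    for entity_id, props in lease_updates():
        entity = datastore.Entity(key=client.key("Counter", entity_id))  # hypothetical kind
        entity.update(props)
        entities.append(entity)
    # One batched upsert from the single worker thread, so there are no
    # concurrent transactions competing on the same entity group.
    client.put_multi(entities)


aggregation_worker()
```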

Using Slave database as primary while Master gets updated?

I currently have a problem where the performance of my database is impacted by several million rows of updates that run (they take more or less 3 days, so we usually run them over a weekend).
However, since the site is live, search performance is impacted. A query to pull 1.3 million records and page through them, which normally takes 3 seconds, sometimes exceeds SQL Server's default timeout values. This obviously creates a user experience no one wants (or can afford) to have happen.
My question now: if I set up replication from the Master to a Slave on the same server, would I be able to point the website to the Slave and avoid that performance impact? Or would it just be duplicating the same problem, since the Master will push any updates through to the Slave in any case?
I don't think replication is going to help you here, it is only going to make things on the source system worse IMHO.
Is it possible you can make a static copy of the data for the users running queries while the updates are going on? For reporting solutions that don't need to be up-to-the-minute, I've done this in several cases using two schemas - one that holds the static versions of the tables for querying, and one where the work is being done; when the work is done, switch. I go into this methodology in a little more detail here: What is the best way to refresh a rollup table under load?
Perhaps another thought is to make your updates more efficient, such that they don't take 3 days? Do you only do this on long weekends?
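One way the two-schema switch is often done in SQL Server, sketched here with pyodbc (the connection string, schema and table names are hypothetical, and this is not necessarily the exact method from the linked post):

```python
# Sketch of rebuilding a static reporting copy in a staging schema and then
# switching it into place. Connection string, schema and table names are made up.
import pyodbc

conn = pyodbc.connect("DSN=ReportingDb")  # hypothetical DSN
cur = conn.cursor()

# 1. Rebuild the copy in the staging schema while users keep querying live.Orders.
cur.execute("TRUNCATE TABLE staging.Orders")
cur.execute("INSERT INTO staging.Orders SELECT * FROM dbo.Orders")
conn.commit()

# 2. Swap the schemas so readers switch to the fresh copy in one short transaction.
cur.execute("ALTER SCHEMA old TRANSFER live.Orders")      # move the current copy out
cur.execute("ALTER SCHEMA live TRANSFER staging.Orders")  # move the fresh copy in
cur.execute("ALTER SCHEMA staging TRANSFER old.Orders")   # recycle the old copy
conn.commit()

cur.close()
conn.close()
```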
The question is, what is the nature of your site? Do users use it just to search, or does it do CRUD operations? If it is just for search and report generation, then I agree with @Aaron. You can have a separate database just for reporting purposes; you can even use Log Shipping to automatically update your reporting database at very brief intervals.
Is it possible that users can change data at the same time that records are being updated by your update process? In that case, you will have to update your primary database using your update job, and then update the primary database again with the changes users made through the slave database.

How to implement locking across a server farm?

Are there well-known best practices for synchronizing tasks across a server farm? For example if I have a forum based website running on a server farm, and there are two moderators trying to do some action which requires writing to multiple tables in the database, and the requests of those moderators are being handled by different servers in the server farm, how can one implement some locking functionality to ensure that they can't take that action on the same item at the same time?
So far, I'm thinking about using a table in the database to sync, e.g. check for the id of the item in the table; if it doesn't exist, insert it and proceed, otherwise return. A shared cache could probably also be used for this, but I'm not using one at the moment.
Any other way?
By the way, I'm using MySQL as my database back-end.
Your question implies data level concurrency control -- in that case, use the RDBMS's concurrency control mechanisms.
That will not help you if later you wish to control application level actions which do not necessarily map one to one to a data entity (e.g. table record access). The general solution there is a reverse-proxy server that understands application level semantics and serializes accordingly if necessary. (That will negatively impact availability.)
It probably wouldn't hurt to read up on CAP theorem, as well!
You may want to investigate a distributed locking service such as Zookeeper. It's a reimplementation of a Google service that provides very high speed distributed resource locking coordination for applications. I don't know how easy it would be to incorporate into a web app, though.
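If you do go the ZooKeeper route, a distributed lock with the kazoo Python client looks roughly like this (host, lock path and identifier are placeholders):

```python
# Sketch: serialize a moderator action across the farm with a ZooKeeper lock.
# Host, lock path and identifier are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

lock = zk.Lock("/locks/forum/item-42", identifier="moderator-7")
with lock:  # blocks until this server holds the lock for item-42
    # perform the multi-table write here; other servers wait their turn
    pass

zk.stop()
```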
If all the state is in the (central) database then the database transactions should take care of that for you.
See http://en.wikipedia.org/wiki/Transaction_(database)
It may be irrelevant for you because the question is old, but it still may be useful for others, so I'll post it anyway.
You can use a "SELECT FOR UPDATE" db query on a locking object, so you actually use the db for achieving the lock mechanism.
If you use an ORM, you can also do that. For example, in NHibernate you can do:
session.Lock(Member, LockMode.Upgrade);
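The same SELECT ... FOR UPDATE idea without an ORM, sketched with mysql-connector against a hypothetical item_locks table:

```python
# Sketch: take a row-level lock with SELECT ... FOR UPDATE on a locking row.
# Connection details, table and column names are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="forum")
cur = conn.cursor()

# With autocommit off (the default), this runs inside a transaction. Any other
# session issuing SELECT ... FOR UPDATE on the same row blocks here until we
# commit or roll back.
cur.execute("SELECT id FROM item_locks WHERE item_id = %s FOR UPDATE", (42,))
cur.fetchall()

# ... perform the multi-table moderator action on this same connection ...

conn.commit()  # releases the row lock
cur.close()
conn.close()
```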
Having a table of locks is an OK way to do it; it is simple and works.
You could also have the code as a service on a single server, more of an SOA approach.
You could also use a timestamp field with transactions: if the timestamp has changed since you last got the data, you can revert the transaction. So if someone gets in first, they have priority.
