In the Corda open-source documentation I read the following:
The ORM mapping is specified using the Java Persistence API (JPA) as annotations and is converted to database table rows by the node automatically every time a state is recorded in the node’s local vault as part of a transaction.
Presently the node includes an instance of the H2 database but any database that supports JDBC is a candidate and the node will in the future support a range of database implementations via their JDBC drivers. Much of the node internal state is also persisted there.
Can I replace the H2 database with another SQL database using JDBC?
As I understand it, FinalityFlow is used to record the transaction in the local vault, which uses the H2 database.
If I implement a custom flow to record transactions in a SQL database, do I have to avoid the FinalityFlow call?
Yes, it is possible to run a node with a SQL database other than H2. In fact, support for PostgreSQL and SQL Server has been contributed by the open-source community. See the set-up instructions here. However, be aware that the Corda continuous integration pipeline does not run unit tests or integration tests against these databases, so you use them at your own risk.
Note that in both cases, you configure the node to use the alternative database via the configuration file, and it stores all its data in this alternative database (transactions, states, identities, etc.). You are not expected to access the database directly in a flow to do this, and can rely upon the standard ServiceHub operations and standard flows like FinalityFlow.
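For example, the flow code does not change when the node is backed by PostgreSQL instead of H2. A minimal sketch of a recording flow, assuming Corda 4 APIs (MyState and MyContract are placeholders, not real classes):

import co.paralleluniverse.fibers.Suspendable
import net.corda.core.flows.CollectSignaturesFlow
import net.corda.core.flows.FinalityFlow
import net.corda.core.flows.FlowLogic
import net.corda.core.flows.InitiatingFlow
import net.corda.core.flows.StartableByRPC
import net.corda.core.identity.Party
import net.corda.core.transactions.SignedTransaction
import net.corda.core.transactions.TransactionBuilder

@InitiatingFlow
@StartableByRPC
class RecordStateFlow(private val counterparty: Party) : FlowLogic<SignedTransaction>() {
    @Suspendable
    override fun call(): SignedTransaction {
        // Build and sign the transaction as usual; MyState/MyContract stand in for your own.
        val notary = serviceHub.networkMapCache.notaryIdentities.first()
        val builder = TransactionBuilder(notary)
            .addOutputState(MyState(ourIdentity, counterparty), MyContract.ID)
            .addCommand(MyContract.Commands.Create(), ourIdentity.owningKey, counterparty.owningKey)
        val stx = serviceHub.signInitialTransaction(builder)
        val session = initiateFlow(counterparty)
        val fullySigned = subFlow(CollectSignaturesFlow(stx, listOf(session)))
        // FinalityFlow notarises the transaction and records it in whichever
        // database the node is configured to use; the flow never touches JDBC.
        return subFlow(FinalityFlow(fullySigned, listOf(session)))
    }
}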
I am reading the documentation about Corda Persistence https://docs.corda.net/api-persistence.html and I have several points that are not clear to me.
Am I right that the data is persisted in parallel with the vault storage? I.e., the vault storage is unchanged, and new tables are added to store the data as well.
When we use the cordaRPCClient.vaultQueryBy method, will it work out by itself which to use: the vault, or the data persisted in the custom database tables?
How is the choice made when, for example, only part of the data is available in the tables? Is there any way to tell Corda explicitly that the persisted data should be used for the query?
Here are the answers to your queries:
Yes, you are correct: new tables are created in the node's database corresponding to your QueryableState. All states that need to be persisted this way should implement the QueryableState interface.
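As an illustration, a QueryableState declares a JPA entity and the schema it belongs to. This is a minimal sketch; PolicyState, PolicySchemaV1, and the table/column names are invented for the example:

import net.corda.core.identity.AbstractParty
import net.corda.core.schemas.MappedSchema
import net.corda.core.schemas.PersistentState
import net.corda.core.schemas.QueryableState
import javax.persistence.Column
import javax.persistence.Entity
import javax.persistence.Table

object PolicySchema

object PolicySchemaV1 : MappedSchema(
    schemaFamily = PolicySchema.javaClass,
    version = 1,
    mappedTypes = listOf(PersistentPolicy::class.java)
) {
    @Entity
    @Table(name = "policy_states")
    class PersistentPolicy(
        @Column(name = "policy_id") var policyId: String = ""
    ) : PersistentState()
}

data class PolicyState(
    val policyId: String,
    override val participants: List<AbstractParty>
) : QueryableState {
    // Tells the node which custom tables to populate when this state is recorded.
    override fun supportedSchemas(): Iterable<MappedSchema> = listOf(PolicySchemaV1)
    override fun generateMappedObject(schema: MappedSchema): PersistentState =
        PolicySchemaV1.PersistentPolicy(policyId)
}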
Your states are also stored in the usual binary format, so cordaRPCClient.vaultQueryBy always queries the vault for the ContractState, not the PersistentState. You could, however, query the custom database tables using a JDBC session or JPA.
Which parts of the state need to be persisted is a call you make depending on your requirements. Persisted data can be queried using custom JDBC queries or JPA. The vaultQuery API always works on the ContractState.
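For instance, from inside a flow you can read the custom table through the node's JDBC session, while an RPC client gets ContractStates back from the vault query API. A sketch reusing the hypothetical policy_states table from above (imports elided; the snippets assume flow and RPC context respectively):

// Inside a FlowLogic: serviceHub.jdbcSession() hands back a plain java.sql.Connection.
val query = "SELECT policy_id FROM policy_states WHERE policy_id = ?"
serviceHub.jdbcSession().prepareStatement(query).use { ps ->
    ps.setString(1, "policy-123")
    val rs = ps.executeQuery()
    while (rs.next()) {
        logger.info("Persisted row: ${rs.getString("policy_id")}")
    }
}

// From an RPC client (cordaRPCOps is the proxy): this always returns
// ContractStates, never PersistentStates.
val pages = cordaRPCOps.vaultQueryBy<PolicyState>()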
We have two states stored in the Corda vault (policy and event). A policy can have many events associated with it.
We are attempting to get a joined result (as if we ran SQL with a JOIN statement) via the RPC client, and we can't find a graceful way: we either have to make several vault queries, or use a direct JDBC connection to the underlying database and extract the required data. Neither way looks appealing, and we wonder if there is a good way to extract the data.
Since we cannot use JPA/Hibernate annotations to link objects inside the CorDapp, we just store the policy_id in the event state.
For more complex queries, it is fine and even expected that the user will query the node's database directly using the JDBC connection.
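For example, a client-side join over the custom tables might look like the following. This is only a sketch: the JDBC URL, credentials, and the policy_states/event_states table names are all assumptions about how your schemas were mapped:

import java.sql.DriverManager

fun main() {
    // Connect straight to the node's database (an H2 URL is shown; adjust for your setup).
    DriverManager.getConnection("jdbc:h2:tcp://localhost:12345/node", "sa", "").use { conn ->
        val sql = """
            SELECT p.policy_id, e.event_id
            FROM policy_states p
            JOIN event_states e ON e.policy_id = p.policy_id
        """.trimIndent()
        conn.prepareStatement(sql).use { ps ->
            val rs = ps.executeQuery()
            while (rs.next()) {
                println("${rs.getString("policy_id")} -> ${rs.getString("event_id")}")
            }
        }
    }
}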
I'm thinking about learning JanusGraph to use in my new big project, but I can't understand some things.
JanusGraph can be used like any database and supports "insert", "update", and "delete" operations, so JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes, etc.? It will write these into the database, right?
Should this data be loaded into memory by JanusGraph, or will it be read from Cassandra all the time?
Must the data that JanusGraph reads be loaded into JanusGraph on every query, or will it do selects in the database to retrieve only the data I need?
Is the data retrieved from the database only what I need, or will JanusGraph read all the records in the database all the time?
Should I use JanusGraph in my project in production, or should I wait until it becomes production-ready?
I'm developing a kind of social network that needs to store friendships, posts, comments, and user blocks, and do some Elasticsearch too. In this case, which database backend should I use?
JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes, etc.? It will write these into the database, right?
JanusGraph will write the data into whatever storage backend you configure it to use. This includes Cassandra. It writes this data into the underlying database using the data model roughly outlined here.
Should this data be loaded into memory by JanusGraph, or will it be read from Cassandra all the time?
Is the data retrieved from the database only what I need, or will JanusGraph read all the records in the database all the time?
JanusGraph will only load into memory the vertices and edges that you touch during a query/traversal. So if you do something like:
graph.traversal().V().hasLabel("My Amazing Label");
Janus will read and load into memory only the vertices with that label. So you don't need to worry about initializing a graph connection and then waiting for the entire graph to be serialised into memory before you can query. Janus is a lazy reader.
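The same applies to more selective traversals; only the elements the traversal walks over are pulled from the backend (the property and label names below are made up):

// Loads only "alice", her outgoing "friend" edges, and at most 10 neighbours.
graph.traversal().V().has("name", "alice").out("friend").limit(10).toList()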
Should I use Janus in my project in production, or should I wait until it becomes production-ready?
That is entirely up to you and your use case. Janus is already being used in production, as can be seen here at the bottom of the page. Janus was forked from, and improves on, TitanDB, which is also used in several production systems. So if you are wondering "is it ready?", I would say yes: it's clearly ready given its existing uses.
what database backend should I use?
Again, that's entirely up to you. I use Cassandra because it can scale horizontally and I find it easier to work with. It also seems to suit all different sizes of data.
I have toyed with Google Bigtable and that seems very powerful as well. However, it's only really suited for VERY big data, and it's also cloud-only, whereas Cassandra can be hosted locally very easily.
I have not used Janus with HBase or BerkeleyDB so I can't comment there.
It's very simple to change between backends, though (all you need to do is adjust some configs and check that your dependencies are in place), so during development feel free to play around with the backends. You only really need to commit to a backend when you go to production or are more sure of each backend.
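As a sketch of how little switching costs, the backend is just configuration. The hostname and file path below are placeholders:

import org.janusgraph.core.JanusGraphFactory

// Open a graph backed by Cassandra through the CQL adapter.
val cassandraGraph = JanusGraphFactory.build()
    .set("storage.backend", "cql")
    .set("storage.hostname", "127.0.0.1")
    .open()

// Or keep the settings in a properties file and load them from there.
val graphFromFile = JanusGraphFactory.open("conf/janusgraph-cql.properties")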
When considering which storage backend to use for a new project, it's important to consider which tradeoffs you'd like to make. In my personal projects, I've enjoyed using NoSQL graph databases due to the following advantages over relational DBs:
Not needing to migrate schemas increases productivity when rapidly iterating on a new project
Traversing a heavily normalized data-model is not as expensive as with JOINs in an RDBMS
Most include in-memory configurations which are great for experimenting & testing (see the sketch after this list).
Support for multi-machine clusters and Partition Tolerance.
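On the in-memory point: JanusGraph itself ships an inmemory backend that needs no external store at all, which is handy for tests. A minimal sketch:

import org.janusgraph.core.JanusGraphFactory

// Purely in-memory graph; everything is gone when the JVM exits.
val graph = JanusGraphFactory.build()
    .set("storage.backend", "inmemory")
    .open()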
Here are sample JanusGraph and Neo4j backends written in Kotlin:
https://github.com/pm-dev/janusgraph-exploration
https://github.com/pm-dev/neo4j-exploration
The main advantage of JanusGraph is the flexibility of plugging in whichever storage backend you'd like.
I saw in the tests that Spring Cloud Data Flow uses a HashMap to store the StreamDefinition. Is it possible to override the configuration of DataFlowServerConfiguration so that streams and tasks are stored in memory, for example in the same HashMap? If so, how?
I don't think it would be a trivial change. The server needs a backend to store its metadata. By default it actually uses H2 in memory, and it relies on the Spring Data JPA abstraction to give users the chance to select their RDBMS.
Storing on a different storage engine would require not only replacing all the *Repository definitions in several configuration modules, but we also do some pre-population of data. It would become a bit hard to maintain this over time.
Is there a reason why a traditional RDBMS is not suitable here? Or, if you want in-memory storage, why not just go with the ephemeral approach of H2?
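If the ephemeral H2 route works for you, it needs no code change at all; the server is pointed at its database through the standard Spring Boot datasource properties. A sketch (the URL and credentials are illustrative):

spring.datasource.url=jdbc:h2:mem:dataflow;DB_CLOSE_ON_EXIT=FALSE
spring.datasource.driverClassName=org.h2.Driver
spring.datasource.username=sa
spring.datasource.password=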
I have a multi-threaded Linux C++ application that needs a high-performance reference data lookup facility. I have been looking at using an in-memory SQLite database for this, but I can't see a way to get it to scale in my multi-threaded environment.
The default threading mode (serialized) seems to suffer from a single coarse-grained lock even when all transactions are read-only. Moreover, I don't believe I can use multi-thread mode, because I can't create multiple connections to a single in-memory database (every call to sqlite3_open(":memory:", &db) creates a separate in-memory database).
So what I want to know is: is there something I've missed in the documentation, and is it possible to have multiple threads share access to the same in-memory database from my C++ application?
Alternatively, is there some alternative to SQLite that I could be considering?
Yes!
See the following, extracted from the documentation at:
http://www.sqlite.org/inmemorydb.html
But it's not a direct connection to the DB memory; instead it goes through the shared cache, so it's a workaround.
In-memory Databases And Shared Cache
In-memory databases are allowed to use shared cache if they are opened using a URI filename. If the unadorned ":memory:" name is used to specify the in-memory database, then that database always has a private cache and is only visible to the database connection that originally opened it. However, the same in-memory database can be opened by two or more database connections as follows:
rc = sqlite3_open("file::memory:?cache=shared", &db);
Or,
ATTACH DATABASE 'file::memory:?cache=shared' AS aux1;
This allows separate database connections to share the same in-memory database. Of course, all database connections sharing the in-memory database need to be in the same process. The database is automatically deleted and memory is reclaimed when the last connection to the database closes.
If two or more distinct but shareable in-memory databases are needed in a single process, then the mode=memory query parameter can be used with a URI filename to create a named in-memory database:
rc = sqlite3_open("file:memdb1?mode=memory&cache=shared", &db);
Or,
ATTACH DATABASE 'file:memdb1?mode=memory&cache=shared' AS aux1;
When an in-memory database is named in this way, it will only share its cache with another connection that uses exactly the same name.
No, with SQLite you cannot access the same in-memory database from different threads. That's by design. More info in the SQLite documentation.