Is it possible to use my own data store (for example, Apache Ignite or another in-memory store) with Spring Cloud Data Flow and Spring Cloud Stream?

I saw that in its tests Spring Cloud Data Flow stores the stream definitions in a HashMap. Is it possible to override the DataFlowServerConfiguration so that streams and tasks are stored in memory (for example, in that same kind of HashMap)? If so, how?

I don't think it would be a trivial change. The server needs a backend to store its metadata. By default it actually uses an in-memory H2 database, and it relies on the Spring Data JPA abstraction to let users select their own RDBMS.
Storing it on a different storage engine would require not only replacing all the *Repository definitions across several configuration modules, but also reworking some pre-population of data that we do. It would become a bit hard to maintain over time.
Is there a reason why a traditional RDBMS is not suitable here? Or, if you want in-memory, why not just go with the ephemeral approach of H2?
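For completeness, pointing the server at your own RDBMS is done with the standard Spring Boot datasource properties rather than by swapping repositories; a rough sketch (URL, credentials and driver class below are placeholders, not values prescribed by Data Flow):
spring.datasource.url=jdbc:mysql://localhost:3306/dataflow
spring.datasource.username=scdf
spring.datasource.password=secret
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
# leave these unset and the server falls back to the embedded, ephemeral H2 instance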

Related

Scalable delete-all from Google Cloud Datastore

I'm trying to implement a complete backup/restore function for my Google App Engine / Datastore solution. I'm using the recommended export/import approach (https://cloud.google.com/datastore/docs/export-import-entities) for periodic backup and for restore.
One thing I cannot wrap my head around is how to restore to an empty datastore. The import function won't clear the datastore before importing, so I have to implement a total wipe of the datastore myself. (And a way to clear the datastore might also be a good thing for test purposes etc.)
The datastore admin is not an option since it's being phased out.
The recommended way, according to the Google documentation, is to use bulk delete: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-datastore-bulk-delete.
The problem with this method is that I would have to launch one Dataflow job for each namespace/kind combination. I have a multi-tenant solution with one namespace per tenant and around 20 kinds per namespace, so with e.g. 100 tenants that would mean 2,000 Dataflow jobs to wipe the datastore, while the default quota is 25 simultaneous jobs... Yes, I can contact Google to get a higher quota, but the difference in numbers suggests that I'm doing it wrong.
So, any suggestions on how to wipe my entire datastore? I'm hoping for a scalable solution (that won't exceed request timeout limits etc) where I don't have to write hundreds of lines of code...
One possibility is to create a simple first-generation Python 2.7 GAE application (or just a service) in that project and use the ndb library (typically more efficient than the generic Datastore APIs) to implement on-demand selective or total datastore wiping as desired, along the lines described in How to delete all the entries from google datastore?
This solution deletes all entries in all namespaces.
By using ndb.metadata, no model classes are needed.
And by using ndb.delete_multi_async it will be able to handle a reasonably large datastore before hitting a request time limit.
from google.appengine.api import namespace_manager
from google.appengine.ext import ndb
...
def clearDb():
    # Walk every namespace and, within it, every kind, deleting entities by key.
    for namespace in ndb.metadata.get_namespaces():
        namespace_manager.set_namespace(namespace)
        for kind in ndb.metadata.get_kinds():
            # keys_only avoids fetching entity payloads we are about to delete
            keys = [k for k in ndb.Query(kind=kind).iter(keys_only=True)]
            ndb.delete_multi_async(keys)
The solution is a combination of the answers:
GAE, delete NDB namespace
https://stackoverflow.com/a/46802370/10612548
Refer to the latter for tips on how to improve it as time limits are hit and how to avoid instance explosion.

Multiple micro-services backed by the same DynamoDB table?

Is it an anti pattern if multiple micro-services read/write to/from the same DynamoDB table?
Please note that:
We have a fixed schema which will not change anytime soon.
All reads/writes will go through REST services via API Gateway + Lambda.
The micro-services are deployed on Kubernetes or Lambda.
When you write microservices it is advised not to share databases. That guideline exists because it allows easy scaling and gives each service its own say over its own set of data. It also lets each service decide how its data is kept and change that at will, which gives services flexibility.
Even if your schema is not changing, you can be sure that one service being throttled will not impact the others.
If all your CRUD calls are channeled through a REST service, you do have a service layer in front of the database, and yes, you can do that within microservices guidelines.
I still would not call the approach an anti-pattern. It's just that collective experience says it's better not to have several services talk to one database, for some of the reasons I mentioned above.
I have a different opinion here and I'll call this approach an anti-pattern. A few of the key principles being violated:
Each microservice should have its own bounded context; if all of them share the same data set, their business boundaries are blurred.
The DB becomes a single point of failure: if the DB goes down, all your services go down.
If you try to scale a single service and spawn multiple instances, it will impact DB performance and eventually affect the performance of the other microservices.
My suggested solution:
First analyze whether you really have a case for MSA; if your data is very tightly coupled and interdependent, you don't need this architecture.
Explore the CQRS pattern: you might want different DBs for reads and writes and synchronize them via events (see the sketch below).
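To make the CQRS suggestion concrete, here is a minimal, self-contained sketch; the service and event names are made up for illustration, and a real system would use a message broker or DynamoDB Streams rather than an in-process bus. Each side owns its own store, and the read side is kept in sync through events.
# Minimal CQRS-style sketch: write model and read model each own their data
# and are synchronized through events (here, an in-process list of handlers).
class EventBus:
    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        for handler in self.handlers:
            handler(event)

class OrderWriteService:
    """Owns the write-side store; other services never touch it directly."""
    def __init__(self, bus):
        self.orders = {}   # stand-in for this service's own table
        self.bus = bus

    def place_order(self, order_id, amount):
        self.orders[order_id] = amount
        self.bus.publish({"type": "OrderPlaced", "id": order_id, "amount": amount})

class OrderReadService:
    """Keeps its own read-optimized projection, updated from events."""
    def __init__(self, bus):
        self.totals_by_order = {}
        bus.subscribe(self.on_event)

    def on_event(self, event):
        if event["type"] == "OrderPlaced":
            self.totals_by_order[event["id"]] = event["amount"]

bus = EventBus()
writer = OrderWriteService(bus)
reader = OrderReadService(bus)
writer.place_order("o-1", 42)
print(reader.totals_by_order)   # {'o-1': 42}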
Hope this helps!!

Should I use JanusGraph as main database to store all my data for a new project?

I'm thinking about learning JanusGraph to use in my new big project, but I can't understand some things.
Janus can be used like any database and supports insert, update and delete operations, so JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes etc.? It writes them into the database, right?
Should these data be loaded into memory by Janus, or will they be read from Cassandra all the time?
Does the data that JanusGraph reads have to be loaded into Janus on every query, or will it do selects in the database to retrieve just the data I need?
Is the data retrieved from the database only what I need, or will Janus read all records in the database all the time?
Should I use JanusGraph in my project in production, or should I wait until it becomes production ready?
I'm developing a kind of social network that needs to store friendships, posts, comments, user blocks and do some Elasticsearch queries too; in this case, which database backend should I use?
Janus will write data into Cassandra or another database to store these data, right?
Where does Janus store the nodes, edges, attributes etc.? It will write these into the database, right?
JanusGraph will write the data into whatever storage backend you configure it to use, and that includes Cassandra. It writes this data into the underlying database using the data model roughly outlined in the JanusGraph data-model documentation.
Should these data be loaded into memory by Janus, or will they be read from Cassandra all the time?
Is the data retrieved from the database only what I need, or will Janus read all records in the database all the time?
JanusGraph will only load into memory the vertices and edges you touch during a query/traversal. So if you do something like:
graph.traversal().V().hasLabel("My Amazing Label");
Janus will read and load into memory only the vertices with that label. So you don't need to worry about initializing a graph connection and then waiting for the entire graph to be serialised into memory before you can query. Janus is a lazy reader.
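As an illustration, the same kind of label-filtered read from Python, assuming a JanusGraph/Gremlin Server listening on localhost:8182 and the gremlinpython package installed (the label is the placeholder from the snippet above):
# Connect to a remote Gremlin Server (e.g. JanusGraph) and run a label-filtered traversal.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

connection = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(connection)

# Only vertices with this label are fetched and materialized on the client.
vertices = g.V().hasLabel("My Amazing Label").toList()
print(len(vertices))

connection.close()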
Should I use Janus in my project in production or should I wait until it becomes production ready?
That is entirely up to you and your use case. Janus is already being used in production, as the list of adopters at the bottom of the project's home page shows. Janus was forked from, and improves on, TitanDB, which is also used in several production systems. So if you're wondering "is it ready?", I would say yes, it is clearly ready given its existing uses.
what database backend should I use?
Again, that's entirely up to you. I use Cassandra because it can scale horizontally and I find it easier to work with. It also seems to suit all different sizes of data.
I have toyed with Google Bigtable and that seems very powerful as well. However, it's only really suited for VERY big data, and it's also only available in the cloud, whereas Cassandra can be hosted locally very easily.
I have not used Janus with HBase or BerkeleyDB so I can't comment there.
It's very simple to change between backends, though (all you need to do is adjust some configs and check that your dependencies are in place), so during development feel free to play around with the backends. You only really need to commit to a backend when you go to production or are more sure of each backend.
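As a rough sketch of what that switch looks like in a JanusGraph properties file (hostnames and paths below are placeholders):
# Cassandra (CQL) storage plus an Elasticsearch mixed-index backend
storage.backend=cql
storage.hostname=127.0.0.1
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
# for local experiments you can swap in an embedded or purely in-memory backend:
# storage.backend=berkeleyje
# storage.directory=/tmp/janusgraph
# storage.backend=inmemory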
When considering which storage backend to use for a new project, it's important to consider what tradeoffs you'd like to make. In my personal projects, I've enjoyed using NoSQL graph databases due to the following advantages over relational DBs:
Not needing to migrate schemas increases productivity when rapidly iterating on a new project.
Traversing a heavily normalized data model is not as expensive as JOINs in an RDBMS.
Most include in-memory configurations which are great for experimenting & testing.
Support for multi-machine clusters and Partition Tolerance.
Here are sample JanusGraph and Neo4j backends written in Kotlin:
https://github.com/pm-dev/janusgraph-exploration
https://github.com/pm-dev/neo4j-exploration
The main advantage of JanusGraph is the flexibility of plugging in whichever storage backend you'd like.

Some generic questions about Neo4j

I'm new to non-PHP web applications and to NoSQL databases. I was looking for a smart solution matching my application's requirements, and I was very surprised when I learned that graph-based databases exist. Well, I found Neo4j very nice and very suitable for my application, but as I've already written, I'm new to this and I have some limitations in understanding how it works. I hope you guys can help me learn.
If I embed Neo4j in a servlet program, then the database access I create is shared among the different threads of that servlet, right? So I need to put database creation in the init() method and the shutdown in destroy(), right? And it will be thread safe (every sentence here ends in a "right?"). But what if I want to create a database shared among the whole application?
I heard that graph databases in general rely on a relational layer at a low level. Is that true for Neo4j? If it is, then what I see is a high-level interface to the real persistence layer, so what is a Connection in this case? Are there techniques like connection pooling, or are these low-level things all managed by Neo4j?
In my application I need to attach some objects to users, plus many other classification things. Each of these objects has a unique id (a String). So if someone asks to view some information about the object having id=QW, I need to load the vertex associated with object.QW. Is this an easy operation for graph databases?
If I need to manage authentication, I receive the pair (usr, pwd) and need to check whether this pair exists in my graph. Is that the same problem as before, or is there some better approach for managing authentication?
thanks
If you're coming from the PHP world, in most cases you're better off running Neo4j in server mode and accessing it either via REST directly or via a client driver like https://github.com/jadell/neo4jphp. If you still want to embed Neo4j in a servlet environment, the GraphDatabaseService is a shared component, maybe stored within the ServletContext. On a per-request (and therefore per-thread) basis you start and commit transactions.
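If you end up talking to the server from something other than PHP, the lookup-by-unique-id case from the question (object.QW) is a single Cypher match; a sketch using the official Python driver, where the label, property name, URI and credentials are placeholders rather than anything prescribed by Neo4j:
# Look up a single node by its unique string id via the official Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "MATCH (o:Stuff {uid: $uid}) RETURN o",  # :Stuff and uid are placeholder names
        uid="QW",
    )
    record = result.single()   # the matching node, or None if it does not exist
    print(record["o"] if record else "not found")

driver.close()
An index (or uniqueness constraint) on that label/property combination keeps this lookup fast.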
Neo4j is a native graph database. The bare-metal persistence layer is optimized for navigating from one node to its neighbors as fast as possible, and it was written by the Neo4j dev team themselves. There are other graph databases out there that reuse other persistence technologies for their underlying storage.
Best thing is to run the Neo4j online course at http://www.neo4j.org/learn/online_course.
For the authentication question, see SecurityRules.
Since Neo4j is a NoSQL graph database, you have to handle generation of unique IDs yourself, for example using a GUID (with 3.x, an auto-incremented property is also supported for a particular label).
Neo4j's default generated id is unique at any given moment, but it can be reallocated to another object once the originally assigned object is deleted.
I am a .NET developer; in my project I used the Neo4j REST API and it works well, so I would suggest you go with that. It is implemented using the async/await programming pattern, so you can hand long-running operations off to the DB and use your web server's resources more effectively.

How to implement locking across a server farm?

Are there well-known best practices for synchronizing tasks across a server farm? For example if I have a forum based website running on a server farm, and there are two moderators trying to do some action which requires writing to multiple tables in the database, and the requests of those moderators are being handled by different servers in the server farm, how can one implement some locking functionality to ensure that they can't take that action on the same item at the same time?
So far, I'm thinking about using a table in the database to synchronize, e.g. check the id of the item in the table; if it doesn't exist, insert it and proceed, otherwise return. A shared cache could probably also be used for this, but I'm not using one at the moment.
Any other way?
By the way, I'm using MySQL as my database back-end.
Your question implies data level concurrency control -- in that case, use the RDBMS's concurrency control mechanisms.
That will not help you if later you wish to control application level actions which do not necessarily map one to one to a data entity (e.g. table record access). The general solution there is a reverse-proxy server that understands application level semantics and serializes accordingly if necessary. (That will negatively impact availability.)
It probably wouldn't hurt to read up on CAP theorem, as well!
You may want to investigate a distributed locking service such as ZooKeeper. It's a reimplementation of a Google service that provides very high-speed distributed resource-locking coordination for applications. I don't know how easy it would be to incorporate into a web app, though.
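For illustration, a minimal sketch of taking such a lock from Python with the kazoo ZooKeeper client (the connection string, lock path and identifier are placeholders; the answer above doesn't prescribe a particular client library):
# Acquire a distributed lock on a specific forum item via ZooKeeper.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

lock = zk.Lock("/locks/forum-item-42", identifier="moderator-a")
with lock:   # blocks until the lock is acquired
    # perform the multi-table update here; other servers wait on the same path
    pass

zk.stop()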
If all the state is in the (central) database then the database transactions should take care of that for you.
See http://en.wikipedia.org/wiki/Transaction_(database)
It may be irrelevant for you because the question is old, but it may still be useful for others, so I'll post it anyway.
You can use a "SELECT FOR UPDATE" db query on a locking object, so you actually use the db to achieve the lock mechanism.
If you use an ORM, you can also do that. For example, in NHibernate you can do:
session.Lock(Member, LockMode.Upgrade);
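Since the question mentions MySQL, here is a rough sketch of the same idea without an ORM, using a dedicated item_locks row as the locking object; the table name, connection details and the mysql-connector-python library are assumptions for the sketch, not part of the original answer:
# Serialize moderator actions on one item by holding a row lock for the
# duration of the transaction (SELECT ... FOR UPDATE).
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="forum")
try:
    conn.start_transaction()
    cur = conn.cursor()
    # Any other transaction issuing the same SELECT ... FOR UPDATE blocks here.
    cur.execute("SELECT item_id FROM item_locks WHERE item_id = %s FOR UPDATE", (42,))
    cur.fetchall()   # consume the result before issuing further statements
    # ... perform the multi-table writes for item 42 ...
    conn.commit()    # releases the row lock
except Exception:
    conn.rollback()
    raise
finally:
    conn.close()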
Having a table of locks is an OK way to do it; it is simple and it works.
You could also have the code as a service on a single server, more of an SOA approach.
You could also use a TimeStamp field together with transactions: if the timestamp has changed since you last read the data, you revert the transaction, so whoever gets in first has priority (a sketch of this follows below).
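The timestamp idea is essentially optimistic locking: make the write conditional on the timestamp being unchanged since you read it, and treat zero affected rows as "someone else got in first". A hedged sketch, where the table and column names are placeholders and the connection is assumed to come from mysql-connector-python as above:
def update_title(conn, item_id, new_title, read_timestamp):
    """Optimistic write: succeeds only if the row's timestamp is unchanged."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE items SET title = %s, updated_at = NOW() "
        "WHERE id = %s AND updated_at = %s",
        (new_title, item_id, read_timestamp),
    )
    if cur.rowcount == 0:
        conn.rollback()   # another writer got in first; report a conflict or retry
        return False
    conn.commit()
    return True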
