Erlang MNESIA - one-way replication to aggregate data - asynchronous

I have multiple Erlang nodes running with an MNESIA schema. Most of my tables are local. One table called "stats" stores the usage statistics for its node. Is it possible to have one-way MNESIA replication to aggregate the stats data from all the nodes into one dedicated node?
From everything I've read, data replication is a full mesh unless the data is marked local. I am trying to find a way to do many-to-one replication.
The table is a bag, and the key holds the node name so that nodes don't trample one another. The stats need not be replicated to all the other nodes; that would just be a flood of unnecessary data.
Thanks!

Related

DolphinDB Python API: Issues with partitionColName when writing data to dfs table with compo domain

It's my understanding that the PartitionedTableAppender class of the DolphinDB Python API can implement concurrent data writes. I'm trying to write data to a DFS table with a COMPO domain where the partitions are determined by the values of "datetime" and "symbol". The data I'd like to write includes records for 150 symbols in one day. This is what I tried:
But it seems only one partitioning column can be specified in partitionColName. Please let me know if I'm going about this the wrong way.
Just specify one partitioning column in this case, even though the table uses a COMPO domain. Based on the given information, it is suggested to set partitionColName to "symbol", so that writes can run concurrently. The script still works if you set it to "datetime", but the data cannot be written concurrently, because the records span only one day and therefore touch only one "datetime" partition.
Refer to the basic operating rules when you are using PartitionedTableAppender:
DolphinDB does not allow multiple writers to write data to one partition at the same time. Therefore, when multiple threads are writing to the same table concurrently, it is recommended to make sure each of them writes to a different partition. Python API provides a convenient way by dividing data by partition automatically.
With DolphinDB server version 1.30 or above, we can write to DFS tables with the PartitionedTableAppender object in Python API. The user needs to first specify a connection pool. The system obtains the partitioning information before assigning them to the connection pool for concurrent writing. A partition can only be written to by one connection pool at one time.
Therefore, only one partitioning column needs to be specified for a table with a COMPO domain. Just pick a highly differentiated partitioning column so that many partitions are created and spread across the connection pool; that way the data can be written to the DFS table concurrently.
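For what it's worth, here is a minimal sketch of that setup with the Python API. The database path dfs://tsdb, table name quotes, and column names are assumptions for illustration; host, port, and credentials are placeholders.

```python
import dolphindb as ddb
import pandas as pd

# Connection pool used for the concurrent writes (placeholder connection details).
pool = ddb.DBConnectionPool("localhost", 8848, 8, "admin", "123456")

# Hypothetical DFS table with a COMPO domain of (datetime, symbol).
# Only "symbol" is passed as partitionColName, as suggested above, so the
# appender can split the data into many partitions and write them in parallel.
appender = ddb.PartitionedTableAppender("dfs://tsdb", "quotes", "symbol", pool)

# One day's records for many symbols (column names must match the table schema).
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-01-03 09:30:00", "2023-01-03 09:30:00"]),
    "symbol": ["AAPL", "MSFT"],
    "price": [125.1, 239.6],
})
appender.append(df)
```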

custom partition in clickhouse

I have several questions about custom partitioning in ClickHouse. Background: I am trying to build a TSDB on top of ClickHouse. We need to support very large batch writes and complicated OLAP reads.
Let's assume we use the standard partitioning by month, and we have 20 nodes in our ClickHouse cluster. I am wondering: will the data from the same month all flow to the same node, or will ClickHouse do some internal balancing and spread the data from the same month across several nodes?
If all the data from the same month is written to the same node, then it will be very bad for our scenario. I will probably consider partitioning by (timestamp, tags), where tags are the different tags that define the data source. Our monitoring system writes data to the TSDB every 30 seconds. Our read pattern is usually a single-table range scan or a join of several tables on a column. Any advice on how I should customize my partitioning strategy?
Since ClickHouse does not support secondary indexes, and we will run selection queries on columns, I think I should put those important columns into the primary key, so my primary key will probably be something like (timestamp, ip, port...). Any advice on this design? Or maybe a good reason why ClickHouse does not support secondary indexes, like bitmap indexes, on other non-primary columns?
In ClickHouse, partitioning and sharding are two independent mechanisms. Partitioning by month means that data from different months will never be merged into the same file on the filesystem; it has nothing to do with data placement between nodes (which is controlled by how exactly you set up your tables and run your INSERT INTO queries).
Partitioning by month or week usually works fine; for choosing a primary key, see the official documentation: https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/#selecting-the-primary-key
There are no fundamental issues with adding secondary indexes; for example, bloom filter index development is in progress: https://github.com/yandex/ClickHouse/pull/4499
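As an illustration only, here is a rough sketch of such a table, sent through the third-party clickhouse-driver Python client; the table name and columns are hypothetical and simply mirror the primary key mentioned in the question.

```python
from clickhouse_driver import Client  # third-party Python client for ClickHouse

client = Client("localhost")  # placeholder host

# Hypothetical metrics table: partitioned by month, with the columns that
# selection queries filter on placed into the sorting key (the primary key).
client.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        timestamp DateTime,
        ip        String,
        port      UInt16,
        tags      String,
        value     Float64
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(timestamp)
    ORDER BY (timestamp, ip, port)
""")
```

Placement of data across the 20 nodes is a separate concern, handled by how the shards (and any Distributed table on top of them) are set up and how the inserts are routed, as noted above.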

How to model "dimension" tables in TiDB?

I would like to designate certain tables as replicated to all TiKV stores such that they are always available to join with locally (thereby reducing expensive distributed joins at the TiDB level). This would allow the TiKV coprocessor to join locally to this table because it's always available
(i.e., replicated to every TiKV). In the OLAP terminology of "dimensions" and "facts", this is a dimension table. In this scenario, I'd like to shard facts and replicate dimensions. It appears that TiDB treats everything as a sharded fact. Can this be done? If not, can it be approximated with some other technique? How amenable is the code base to allowing this type of feature?
At present, TiDB splits each table into regions and does replication at the region level. It's hard to replicate a table to every TiKV server, even if it contains only one region: for example, there may be 100 nodes in the TiKV cluster while the configured number of region replicas is 5.
We don't need to do the join operation in the TiKV coprocessor. We can read each dimension table from TiKV into multiple TiDB nodes and assign each involved TiDB node a portion of the fact table according to the fact table's data distribution. Thus the join operation is done in the TiDB layer.
The technique described above is not implemented yet, but it's already on our roadmap.
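Just to make the user-facing side concrete: since TiDB speaks the MySQL protocol, the join itself is written as ordinary SQL and the engine decides how to execute it. A hedged sketch with hypothetical fact/dimension tables, using PyMySQL:

```python
import pymysql  # TiDB is MySQL-protocol compatible

# Placeholder connection details; 4000 is TiDB's default SQL port.
conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", database="test")

with conn.cursor() as cur:
    # Hypothetical star schema: "sales" as the sharded fact table and
    # "dim_product" as the small dimension table. Where the join runs
    # (TiDB layer vs. coprocessor) is decided by the engine, not the client.
    cur.execute("""
        SELECT d.category, SUM(f.amount) AS total
        FROM sales f
        JOIN dim_product d ON f.product_id = d.id
        GROUP BY d.category
    """)
    for category, total in cur.fetchall():
        print(category, total)
conn.close()
```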

Should bulk data be included in the graph?

I have been using ArangoDB for a while now for smaller system requirements and love it. We have recently been tasked by a client to analyze a large amount of financial data which is currently housed in SQL but I was hoping to more efficiently query the data in ArangoDB.
One of the more simplistic requirements is to roll up GL entry amounts to determine account totals across their general ledger. There are approximately 2,200 accounts in their general ledger, with a maximum depth of approximately 10. The number of GL entries is approximately 150 million, and I was wondering what the most efficient method of aggregating account totals would be.
I plan on using a graph to manage the account hierarchy/structure, but should edges be created for 150 million GL entries, or is it more efficient to traverse the inbound relationships and run subqueries on the GL entry collections to calculate the total amounts?
I would normally just run the tests myself but I am struggling with simply loading the data in my local instance of arango and thought I would get some insight while I work at loading the data.
Thanks in advance!
What is the benefit you're looking to gain by moving the data into a graph model? If it's to build connections between accounts, customers, GLs, and such, then it might be best to go with a hybrid model.
It's possible to build a hierarchical graph style relationship between your accounts and GL's, but then store your GL entries in a flat document collection.
This way you can use AQL-style graph queries to quickly determine the relationships between accounts and GLs. If you need to SUM the entries for a GL, you can have queries that identify the GL._id's and then sum the documents in the flat collection whose foreign keys reference the GL._id they are associated with.
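Here is a hedged sketch of that hybrid approach, assuming a named graph 'accounts', a flat gl_entries collection with a gl_id foreign key and an amount attribute (all names are assumptions), using the python-arango driver:

```python
from arango import ArangoClient  # python-arango driver

# Placeholder database name and credentials.
db = ArangoClient().db("finance", username="root", password="")

# Walk the account hierarchy below one root account, then sum the flat
# gl_entries documents whose foreign key points at each account in the subtree.
aql = """
FOR acct IN 1..10 OUTBOUND @root GRAPH 'accounts'
    LET amounts = (
        FOR e IN gl_entries
            FILTER e.gl_id == acct._id
            RETURN e.amount
    )
    RETURN { account: acct._key, subtotal: SUM(amounts) }
"""
for doc in db.aql.execute(aql, bind_vars={"root": "accounts/1000"}):
    print(doc)
```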
By adding indexes on your foreign keys you will speed up queries, and by using Foxx microservices you can provide a layer of abstraction between a REST-style query and the actual data model you are using. That way, if you find you need to change your database model under the covers, you can update your Foxx microservices and the consumer doesn't need to be aware of those changes.
I can't answer your question on performance, you'll just need to ensure your hardware is appropriately spec'ed.

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that, for any given date, will generate a graph in memory. I could then run whatever stat I wanted against that graph. Much easier to understand and, in general, much more performant as well -- except for the graph-generation part.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4j and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-from and valid-to timestamps to each edge and filter the graph appropriately: if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, adding temporal information to the edges to represent the structure of the graph for a given date; you can then traverse the structure of the graph as it was at that point in time.
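A rough sketch of that kind of time-scoped read (the Contact label, CONNECTED relationship type, and valid_from/valid_to properties are assumptions), using the official neo4j Python driver:

```python
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Relationships in effect on a given reference date, assuming each edge
# carries valid_from / valid_to properties (valid_to is null while current).
query = """
MATCH (a:Contact)-[r:CONNECTED]->(b:Contact)
WHERE r.valid_from <= $as_of
  AND (r.valid_to IS NULL OR r.valid_to > $as_of)
  AND r.strength >= $min_strength
RETURN a.name AS a_name, b.name AS b_name, r.strength AS strength
"""

with driver.session() as session:
    for record in session.run(query, as_of="2011-01-31", min_strength=3):
        print(record["a_name"], record["b_name"], record["strength"])
driver.close()
```

The date comparison assumes the timestamps are stored in a consistently sortable form (e.g. ISO-8601 strings or epoch numbers).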
Note that "reference node" means something different in Neo4j (historically, a built-in root node), so it shouldn't be confused with your reference times.
Using category nodes per day (and linking them, and also aggregating them for higher-level timespans) is a more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indexes that you can easily include in your traversals and graph queries.)
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes themselves also change (e.g. changing properties), you could either duplicate them (effectively creating different subgraphs) or keep a linked list of history nodes on each node that contains just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it in memory for the query, and closes it after you get your answer. You could also configure a proxy server and send two parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer, but if it is simply for storing and doing some analytics over data sets, it can definitely meet your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
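For example, with the 4.x Python driver you can keep one database per snapshot and pick it per session (the database name here is hypothetical; multiple user databases require the Enterprise edition):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

# Administrative commands run against the built-in "system" database.
with driver.session(database="system") as session:
    session.run("CREATE DATABASE graph20110501 IF NOT EXISTS")

# Query that snapshot in isolation from any other per-date database.
with driver.session(database="graph20110501") as session:
    edge_count = session.run("MATCH ()-[r]->() RETURN count(r) AS c").single()["c"]
    print(edge_count)
driver.close()
```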
