I have a Cassandra database with a table containing values loaded from sensors, composed of five columns (user, sensor, observed parameter, value, timestamp).
I would like to represent them in a triplestore so that I can run SPARQL queries to extract knowledge.
Initially I thought of Titan because it can use Cassandra as its backend, but I can't find a way to automatically dump the Cassandra table into a Titan graph.
Is there a way to do this with Titan or another triplestore?
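To make the goal concrete, here is a rough sketch of the kind of mapping I have in mind (not an automatic Titan dump; keyspace, table, and column names are hypothetical, and the triples go into an in-memory rdflib graph):

```python
# Sketch only: read the Cassandra table with the DataStax driver, emit a few
# RDF triples per row with rdflib, then query the graph with SPARQL.
# Keyspace, table, and column names are hypothetical.
from cassandra.cluster import Cluster
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/sensors#")

session = Cluster(["127.0.0.1"]).connect("sensors_ks")

g = Graph()
g.bind("ex", EX)

for row in session.execute(
        "SELECT user_id, sensor, observed_param, value, ts FROM readings"):
    # One URI per observation; the five columns become properties of it.
    obs = URIRef(f"http://example.org/obs/{row.user_id}/{row.sensor}/{row.ts.isoformat()}")
    g.add((obs, EX.user, Literal(row.user_id)))
    g.add((obs, EX.sensor, Literal(row.sensor)))
    g.add((obs, EX.observedParam, Literal(row.observed_param)))
    g.add((obs, EX.value, Literal(row.value, datatype=XSD.double)))
    g.add((obs, EX.timestamp, Literal(row.ts.isoformat(), datatype=XSD.dateTime)))

# Example SPARQL query over the loaded triples: average value per sensor.
q = """
PREFIX ex: <http://example.org/sensors#>
SELECT ?sensor (AVG(?v) AS ?avgValue)
WHERE { ?obs ex:sensor ?sensor ; ex:value ?v }
GROUP BY ?sensor
"""
for sensor, avg_value in g.query(q):
    print(sensor, avg_value)
```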
Is there an operation in the Scan API or the Query API that allows a lookup on a table with a composite key (pk/sk) that varies only in the pk, so as to optimize the Scan operation on the table?
Let me introduce a use case:
Suppose I have a partition key defined by the ID of a project, and within each project I have a huge number of records (sk).
Now, I need to solve the query "return all projects". So I don't have a partition key and I have to perform a scan.
I know that I could create a GSI that solves this problem, but let's assume that this is not the case.
Is there any way to perform a scan that "hops" between partition keys, ignoring the elements under each sk?
In other words, I want to collect the first record of each partition key.
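To make it concrete, the only approach I see today is a paginated Scan that projects just the partition key and deduplicates client-side, which still reads every item of every partition (a sketch with boto3; table and attribute names are hypothetical):

```python
# Sketch with boto3; table and attribute names are hypothetical.
# A paginated Scan that projects only the partition key ("projectId") and
# deduplicates on the client -- it still touches every item of every partition.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("projects")

project_ids = set()
scan_kwargs = {"ProjectionExpression": "projectId"}

while True:
    page = table.scan(**scan_kwargs)
    project_ids.update(item["projectId"] for item in page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(sorted(project_ids))
```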
DynamoDB is a NoSQL database, as you already know. It is optimized for lookups, and practices you are used to from SQL databases or other (lower-scale) databases are not always available in DynamoDB.
The point of a partition key is to store records that belong to the same partition together, sorted by the sort key. The flip side is that records with different partition keys are stored in other locations; the table is not one long list (or tree) of records that you can scan over.
When you design your schema in a NoSQL database, you need to consider the access patterns to that data. If you need a list of all the projects, you need to maintain an index that provides it.
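As one sketch of what "maintain an index" could look like (boto3; the directory layout and all names are hypothetical, and a GSI would be another option): register each project under a constant partition-key value when it is created, so that "return all projects" becomes a single-partition Query instead of a full Scan.

```python
# Sketch with boto3; the "directory" layout and all names are hypothetical.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("projects")

def register_project(project_id):
    # When a project is created, also record it under a sentinel partition key
    # that acts as a directory of all project ids.
    table.put_item(Item={"projectId": "_ALL_PROJECTS", "recordId": project_id})

def list_projects():
    # "Return all projects" is now a single-partition Query, not a table Scan.
    ids, kwargs = [], {"KeyConditionExpression": Key("projectId").eq("_ALL_PROJECTS")}
    while True:
        page = table.query(**kwargs)
        ids.extend(item["recordId"] for item in page["Items"])
        if "LastEvaluatedKey" not in page:
            return ids
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```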
I have several questions about custom partitioning in ClickHouse. Background: I am trying to build a TSDB on top of ClickHouse. We need to support very large batch writes and complicated OLAP reads.
Let's assume we use the standard partitioning by month, and we have 20 nodes in our ClickHouse cluster. Will the data from the same month all flow to the same node, or will ClickHouse do some internal balancing and spread data from the same month across several nodes?
If all the data from the same month is written to the same node, that will be very bad for our scenario. I will probably consider partitioning by (timestamp, tags), where tags are the different tags that define the data source. Our monitoring system will write data to the TSDB every 30 seconds. Our read pattern is usually a single-table range scan or a join of several tables on a column. Any advice on how I should customize my partitioning strategy?
Since ClickHouse does not support secondary indexes, and we will run selection queries on several columns, I think I should put those important columns into the primary key, so my primary key will probably look like (timestamp, ip, port, ...). Any advice on this design, or a good reason why ClickHouse does not support secondary indexes (such as bitmap indexes) on non-primary-key columns?
In ClickHouse, partitioning and sharding are two independent mechanisms. Partitioning by month means that data from different months will never be merged into the same file on the filesystem; it has nothing to do with data placement between nodes (which is controlled by how exactly you set up your tables and run your INSERT INTO queries).
Partitioning by month or week usually works fine. For choosing the primary key, see the official documentation: https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/#selecting-the-primary-key
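As a rough sketch of what that can look like (executed here with clickhouse-driver; the cluster, database, and column names are assumptions, not a tuned recommendation): a local MergeTree table partitioned by month and ordered by the columns you filter on, plus a Distributed table whose sharding key is what spreads rows from the same month across your 20 nodes.

```python
# Sketch with clickhouse-driver; cluster, database, and column names are
# assumptions, not a tuned recommendation for this exact workload.
from clickhouse_driver import Client

client = Client("localhost")

# Local MergeTree table: PARTITION BY only controls how data parts are laid
# out on disk within one node; it does not decide which node the data lands on.
client.execute("""
    CREATE TABLE IF NOT EXISTS metrics_local ON CLUSTER my_cluster (
        ts    DateTime,
        ip    String,
        port  UInt16,
        value Float64
    ) ENGINE = MergeTree()
    PARTITION BY toYYYYMM(ts)
    ORDER BY (ip, port, ts)
""")

# Distributed table: the sharding key (here rand()) is what spreads rows from
# the same month across the nodes when you INSERT INTO it.
client.execute("""
    CREATE TABLE IF NOT EXISTS metrics ON CLUSTER my_cluster AS metrics_local
    ENGINE = Distributed(my_cluster, default, metrics_local, rand())
""")
```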
There are no fundamental issues with adding secondary indexes; for example, bloom filter index development is in progress: https://github.com/yandex/ClickHouse/pull/4499
As a beginner, I want to know the difference between TinkerPop and Titan.
TitanDB is a graph database engine that delegates storage to a pluggable backend (such as Cassandra) and optionally uses an external query index (such as Elasticsearch). In other words, it creates property graph data models, stores them in one of the many supported backend stores, and for optional faster querying relies on products like Elasticsearch for indexing.
TinkerPop is a framework that sits on top of graph databases such as TitanDB; it also supports other graph databases, Neo4j for example. One of the many features of TinkerPop is the Gremlin graph query language (analogous to SQL for a relational database), which interacts with graph databases like Titan to help the user create and query graph data.
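For a taste of Gremlin, here is a small sketch using TinkerPop's Python variant (gremlinpython), assuming a Gremlin Server endpoint at the hypothetical URL below:

```python
# Sketch with gremlinpython; the server URL and property values are hypothetical.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Create two vertices and an edge between them.
alice = g.addV("person").property("name", "alice").next()
sensor = g.addV("sensor").property("name", "s1").next()
g.V(alice).addE("owns").to(__.V(sensor)).iterate()

# "Which sensors does alice own?" -- the graph analogue of a SQL join.
print(g.V().has("person", "name", "alice").out("owns").values("name").toList())

conn.close()
```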
How can I get the past 30 days of data using DynamoDB with a GROUP BY clause (power)?
I have a table named lightpowerinfo with fields like id, lightport, sessionTime, and power.
Amazon DynamoDB is a NoSQL database, which means that it is not possible to write SQL queries against the data. Therefore, there is no concept of a GROUP BY statement.
Instead, you would need to write an application to retrieve the relevant raw data, and then calculate the results you seek.
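A sketch of that approach with boto3 (it assumes sessionTime is stored as an ISO-8601 string, that a Scan is acceptable, and that the goal is the total power per lightport; with a known key layout a Query would be cheaper):

```python
# Sketch with boto3; assumes sessionTime is an ISO-8601 string and that the
# goal is the total power per lightport over the last 30 days.
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("lightpowerinfo")

cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()

# Step 1: retrieve the relevant raw data (paginated Scan with a filter).
items, kwargs = [], {"FilterExpression": Attr("sessionTime").gte(cutoff)}
while True:
    page = table.scan(**kwargs)
    items.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# Step 2: the "GROUP BY" happens in application code.
power_by_port = defaultdict(Decimal)
for item in items:
    power_by_port[item["lightport"]] += item["power"]

print(dict(power_by_port))
```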
I'm looking at using Titan to create a scalable geospatial data store (I'm thinking R-trees). In the documentation there is a GeoShape query, and the docs say that Titan can handle geo data with Lucene or Elasticsearch. However, it seems like this would be very slow, because traversing nodes in Cassandra essentially means doing join queries in Cassandra, which is a really bad idea. I think I might be misunderstanding the data representation.
I read the Titan Data Model doc, and I still don't quite get it. If all the edges are stored in a Cassandra row, then Titan would still have to "join" on a vertex table. One way to solve this would be to make the column value equal to the edge property data, so that you could neatly package the vertex data and the edge data into the row. However, this breaks down when you want to do queries deeper than one node, and we're back to the joining problem again.
So: is Titan emulating join queries in Cassandra, and how performant is it at geo lookups under these conditions?
I think the question conflates edge traversal with geospatial index lookups. These are separate at both the API and implementation levels. The index is not illustrated in the data model pictures.
Let's make this a little more specific. Say I run Titan with ES and Cassandra using Murmur3Partitioner or RandomPartitioner. I declare an ES geospatial index over edges called "place", as documented in the Getting Started page. Looking up edges by geospatial queries, such as the "WITHIN" example in the Getting Started docs, first hits ES. ES returns IDs that Titan can use to look up the associated vertex/edge data in Cassandra quickly, without doing an analog of relational joins.
The cost of these edge lookups by geospatial data should be roughly equivalent to the cost of ES's WITHIN implementation (which I think is delegated to Spatial4j), plus the lookups Titan makes on Cassandra after getting IDs, which should be roughly linear in the number of edges found by ES. This is just back-of-the-envelope estimation, so please take it with a big grain of salt.
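To make the two-phase pattern concrete, here is an illustration of the idea only, not Titan's actual internals (the index, keyspace, table, and column names are made up): the geo predicate is answered by Elasticsearch, which returns IDs, and each ID is then a direct key lookup in Cassandra rather than a join.

```python
# Illustration of the two-phase lookup pattern only -- not Titan's internal API.
# Index, keyspace, table, and column names are made up.
from elasticsearch import Elasticsearch
from cassandra.cluster import Cluster

es = Elasticsearch("http://localhost:9200")
session = Cluster(["127.0.0.1"]).connect("graph_ks")

# Phase 1: the geo predicate ("within 50 km of a point") is answered by ES,
# which returns only the IDs of the matching elements.
resp = es.search(
    index="place_edges",
    query={"geo_distance": {"distance": "50km",
                            "location": {"lat": 37.97, "lon": 23.72}}},
)
edge_ids = [hit["_id"] for hit in resp["hits"]["hits"]]

# Phase 2: each ID is a direct key lookup in Cassandra -- no relational-style
# join -- so this part is roughly linear in the number of IDs ES returned.
stmt = session.prepare("SELECT * FROM edges WHERE edge_id = ?")
edges = [session.execute(stmt, [eid]).one() for eid in edge_ids]
print(len(edges))
```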
After I get place edges by geo matching, if I then want to run arbitrary traversals in the neighborhood of each edge in the set, then I would have a look at rooting a MultiQuery on the head/tail vertices and enabling database-level caching. If the query misses cache or cache is cold/disabled, then Titan will still attempt to retrieve all edges the traversal cares about in a single Cassandra slice per vertex, when possible. If you're concerned about Titan's edge traversal efficiency, then you might find Boutique Graph Data with Titan interesting.
HTH