How to use a previously stored item after a fold() in Gremlin - azure-cosmosdb

First, I'm using the Azure Cosmos DB graph API.
I see this sort of pattern quite a bit:
out('an-edge').fold().coalesce(unfold(),addV('incoming-schedule'))
I want to add an edge immediately after the addV in the coalesce. I've been trying to do it in a simple example:
g.V('any-vertex-id').as('a').out('an-edge').fold().coalesce(unfold(),addV('new-vertex').addE('to-v').from('a'))
"a" no longer seems to exist after fold(), since it's a barrier step. I tried store and aggregate, but I must not understand them properly. Is it possible to get a reference after a fold()? I need it because it may refer to a vertex created by an earlier addV in the same query, whose id I wouldn't have yet.

What is your requirement here? Do you want to create a new vertex and an edge only when out('an-edge') returns nothing?
If that's the case, I would try this:
g.V('any-vertex-id').as('a').coalesce(out('an-edge'), addV('new-vertex').addE('to-v').from(select('a')))
fold() is typically used when you need to aggregate all the output from the preceding step. I don't think that is necessary in this case.
http://tinkerpop.apache.org/docs/current/reference/#fold-step

It looks like I can store and then select from it when adding the edge.
g.V('any-vertex-id').store('a').out('an-edge').fold()
.coalesce(unfold(),addV('new-vertex')
.addE('to-v').from(select('a').unfold()))
Not sure if someone has a better alternative or a better suggestion than store, but this seems to work, at least in my scenario.


How to OR Query for contains in Dynamoose?

I want to search (query) for a bunch of strings in a column in DynamoDB, using Dynamoose: https://github.com/dynamoose/dynamoose
But it returns nothing. Can you tell me whether this type of query is allowed, or whether there is another syntax for it?
Code sample
Cat.query({"breed": {"contains": "Terrier","contains": "husky","contains": "wolf"}}).exec()
I want all of these breeds, so these are OR queries. Please help.
Two major things here.
First, a query in DynamoDB requires an equality condition on a hash key: either the hashKey of the table or the hashKey of an index. So even if you could get this syntax working, the query would fail, because you can't match several values for that key. It must be hashKey = _______, with no OR statements or anything else in that first condition.
Second, just to answer your question: it seems like what you are looking for is the condition.in function. That would change your code to something like:
Cat.query("breed").in(["Terrier", "husky", "wolf"]).exec()
Of course, the code above will not work because of the first point.
If you really want to brute-force this, you can use Model.scan, basically changing query to scan in the syntax. However, scan operations are extremely heavy on the DB at scale. A scan reads every document/item before applying the filter and only then returns the matches, so you get none of the optimization you would normally get. If you only have a handful of items in your table, it might be worth taking the performance hit. In other cases, like exporting or backing up the data, it also makes sense. But if you are able to avoid scan operations, I would. That might require some rethinking of your DB structure, though.
Cat.scan("breed").in(["Terrier", "husky", "wolf"]).exec()
So the code above would work, and I think it is what you are asking for, but keep in mind the performance and cost hit you are taking here.
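A side note on the snippet in the question, as a minimal sketch in plain JavaScript (no Dynamoose or DynamoDB involved): an object literal that repeats the contains key keeps only the last value, so the filter as written could only ever ask for "wolf". The helper matchesAnyBreed and the sample items are hypothetical and only illustrate the OR-of-contains semantics described in the question, applied in memory. Note that it does substring matching, whereas in compares whole values.

```javascript
// Plain JavaScript only; nothing here calls Dynamoose or DynamoDB.

// Repeating a key in an object literal keeps only the last value written.
const filter = { contains: "Terrier", contains: "husky", contains: "wolf" };
const filterKeys = Object.keys(filter); // a single key survives
const survivingValue = filter.contains; // the last value wins

// Hypothetical helper: true when `breed` contains ANY of the needles
// (case-insensitive substring match), i.e. an OR of "contains" checks.
function matchesAnyBreed(breed, needles) {
  const haystack = String(breed).toLowerCase();
  return needles.some((needle) => haystack.includes(needle.toLowerCase()));
}

// Sample items, invented for illustration.
const cats = [
  { name: "A", breed: "Yorkshire Terrier" },
  { name: "B", breed: "Siberian husky" },
  { name: "C", breed: "Persian" },
];

// In-memory equivalent of filtering scanned items by the OR condition.
const matches = cats.filter((cat) =>
  matchesAnyBreed(cat.breed, ["Terrier", "husky", "wolf"])
);
```

If substring matching is what you actually need, the values passed to in would have to be the full breed strings stored in the table.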

DynamoDB top item per partition

We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized, e.g. the moment of the last update record for TSLA may be different than for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol, range_key Moment, and believe we could achieve the first query easily/efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table or the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the one where IsLatestForMachineKey=1, compare the Moment and if the insertion is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
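The write and read paths described above can be sketched in memory. This is only an illustration under stated assumptions: the two Map objects stand in for the stocks-current and stocks-historical tables, recordValue is a hypothetical helper, and the if condition plays the role that a ConditionExpression would play in DynamoDB. Nothing here calls the AWS SDK.

```javascript
// In-memory stand-ins for the two tables; this does not call DynamoDB.
const stocksCurrent = new Map(); // symbol -> latest item
const stocksHistorical = new Map(); // symbol -> items ordered by moment

// Hypothetical helper mirroring "write every item to both tables".
function recordValue(symbol, moment, value) {
  const item = { symbol, moment, value };

  // Historical table: always keep the item, ordered by the range key (moment).
  const history = stocksHistorical.get(symbol) || [];
  history.push(item);
  history.sort((a, b) => a.moment - b.moment);
  stocksHistorical.set(symbol, history);

  // Current table: only overwrite when the incoming item is newer, so
  // out-of-order data cannot replace a more recent value.
  const existing = stocksCurrent.get(symbol);
  if (!existing || existing.moment < moment) {
    stocksCurrent.set(symbol, item);
  }
}

recordValue("TSLA", 100, 10);
recordValue("AMZN", 105, 20);
recordValue("TSLA", 110, 12);
recordValue("TSLA", 90, 9); // arrives late, must not become the latest

// "Scan stocks-current" to get the latest value of every symbol.
const latest = [...stocksCurrent.values()];

// "Query stocks-historical" by hash key to get one symbol's history.
const tslaHistory = stocksHistorical.get("TSLA");
```

The trade-off is that every write touches two places, in exchange for a latest-values read that never has to look at historical rows.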

3 column query in DynamoDB using DynamooseJs

My table is (device, type, value, timestamp), where (device, type, timestamp) makes a unique combination (a candidate for a composite key in a non-DynamoDB DBMS).
My queries can range between any of these three attributes, such as
GET (value)s from (device) with (type) having (timestamp) greater than <some-timestamp>
I'm using dynamoosejs/dynamoose. From most of my searches, I believe I'm supposed to use a combination of the three fields (as a single field: device-type-timestamp) as the id. However, the set function of Schema doesn't let me use the object's other properties (such as this.device), and for various reasons I cannot compute it externally.
The closest I got (id:uuidv4:hashKey, device:string:GlobalSecIndex, type:string:LocalSecIndex, timestamp:Date:LocalSecIndex)
and
(id:uuidv4:rangeKey, device:string:hashKey, type:string:LocalSecIndex, timestamp:Date:LocalSecIndex)
and so on..
However, when using a Query, it becomes difficult to fetch the results for a particular device and type, because the id (hashKey or rangeKey) is not available at that point.
So the question. How would you do it for such kind of table?
And point to be noted, this table is meant to gather content from IoT devices, which is generated every 5 mins by each device on an average.
I'm curious why you are choosing DynamoDB for this task. Advanced queries like this seem to be much better suited for a SQL based database as opposed to a NoSQL database. Due to the advanced nature of SQL queries, this task in my experience is a lot easier in SQL databases. So I would encourage you to think about if DynamoDB is truly the right system for what you are trying to do here.
If you determine it is, you might have to restructure your data a little bit. You could do something like having a property that is device-type and that will be the device and type values combined. Then set that as an index, and query based on that and sort by the timestamp, and filter out the results that are not greater than the value you want.
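As a rough sketch of that restructuring in plain JavaScript (not Dynamoose's API): the combined property is built from device and type, and the query is an in-memory filter standing in for an index lookup plus a timestamp comparison. The names buildDeviceType and queryValues, the "#" separator, and the sample readings are all assumptions made for illustration.

```javascript
// Hypothetical helper: combine device and type into one indexable property.
// The "#" separator is an assumption; it keeps ("a1", "b") and ("a", "1b")
// from producing the same combined value.
function buildDeviceType(device, type) {
  return device + "#" + type;
}

// Sample readings, invented for illustration.
const readings = [
  { device: "d1", type: "temp", timestamp: 1000, value: 20.5 },
  { device: "d1", type: "temp", timestamp: 1300, value: 21.0 },
  { device: "d1", type: "humidity", timestamp: 1300, value: 40 },
  { device: "d2", type: "temp", timestamp: 1300, value: 18.2 },
].map((r) => ({ ...r, deviceType: buildDeviceType(r.device, r.type) }));

// In-memory stand-in for: deviceType = :dt, sorted by timestamp,
// keeping only items newer than `since`.
function queryValues(items, device, type, since) {
  const deviceType = buildDeviceType(device, type);
  return items
    .filter((item) => item.deviceType === deviceType && item.timestamp > since)
    .sort((a, b) => a.timestamp - b.timestamp)
    .map((item) => item.value);
}

const recentTemps = queryValues(readings, "d1", "temp", 1000);
```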
You are correct that currently, Dynamoose does not pass in the entire object into the set function. This is something that personally I'm open to exploring. I'm a member on the GitHub project, and if you would like to submit a PR adding that feature I would be more than happy to help explore that option with you and get that into the codebase.
The other thing you might want to explore is having a DynamoDB stream, that will set that device-type property whenever it gets added to your DynamoDB table. That would abstract that logic out of DynamoDB and your application. I'm not sure if it's necessary for what you are doing to decouple it to that level, but it might be something you want to explore.
Finally, depending on your setup, you could figure out which property will be more unique, device or type, and set up an index on that property. Then just query based on that, and filter out the results of the other property that you don't want. I'm not sure if that is what you are looking for. It will of course work, but I'm not sure how many items you will have in your table, and there are questions about scalability at a certain level. One way to solve some of those scalability questions might be to set the TTL of your items, if you know that the timestamp you are querying for is constant or predictable ahead of time.
Overall there are a lot of ways to achieve what you are looking to do. Without more detail about how many items, what exactly those properties will be doing, the amount of scalability you require, which of those properties will be most unique, etc. it's hard to give a good solution. I would highly encourage you to think about if NoSQL is truly the best way to go. That query you are looking to do seems a LOT more like a SQL query. Not saying it's impossible in DynamoDB, but it will require some thought about how you want to structure your data model, and such.
Considering the opinion of #charlie-fish, I decided to dig into Dynamoose and modify the code to pass the model to the set function of the attribute. However, I discovered that the model is already being passed to the default function of the attribute. So I changed my Schema to the following:
id:hashKey;default: function(model){ return model.device + "" + model.type; }
timestamp:rangeKey
For anyone landing here on this answer, please note that the default and set functions can access the attribute options and the schema instance using this. However, both of those functions should be regular functions rather than arrow functions.
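The reason for that restriction is a general JavaScript rule rather than anything specific to Dynamoose, and can be sketched as follows. The attribute object below is a hypothetical stand-in, not Dynamoose's real attribute definition, and it assumes the library invokes the function with this bound as the note above describes. The separator property is also an assumption made for illustration.

```javascript
// Hypothetical stand-in for an attribute definition; not Dynamoose's real object.
const attribute = {
  separator: "#",

  // Regular function: `this` is the object the function is called on,
  // so it can read `separator`.
  regularDefault: function (model) {
    return model.device + this.separator + model.type;
  },

  // Arrow function: `this` is taken from the enclosing scope,
  // so it never refers to `attribute`.
  arrowSeesAttribute: () => this === attribute,
};

const id = attribute.regularDefault({ device: "d1", type: "temp" });
const arrowBound = attribute.arrowSeesAttribute();
```

Here id is built using this.separator, while the arrow function reports that its this is not the attribute object.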
Keeping this here as an answer, but I won't accept it as the answer to my question for some time, as I want to wait for someone else to come up with a better approach.
I also want to make sure that if a value is passed for id field, it shouldn't be set. For this I can use set to ignore the actual incoming value, which I don't know how, as of yet.

Search query to find documents that have multiple element

I have a few XML documents in MarkLogic which have the following structure:
<abc:doc>
  <abc:doc-meta>
    <abc:meetings>
      <abc:meeting>
      </abc:meeting>
      <abc:meeting>
      </abc:meeting>
    </abc:meetings>
  </abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting> element under the <abc:meetings> element.
I am trying to write a cts:search query to get only documents that have more than one <abc:meeting> element in the document.
Please advise
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The overall simplest solution would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
collection(),
cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, hence could be expensive.
I played with the thought of leveraging cts:near-query(), but that is driven by word positions, so it depends on the actual number of tokens inside a meeting. If that were always an exact number of tokens (unlikely, I'd guess), you could use the minimal-distance option on a double cts:element-query() wrapped in a cts:near-query(). It might help optimize the previous option a little though.
The most performant option I can think of right now involves adding a User-Defined aggregate Function (UDF). Unfortunately, that means compiling C++ code. I happen to have written such a UDF in the past, which you should be able to use as-is after compilation and installation. For details see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
It boils down to how many "a few" is. If it's thousands or fewer, then what grtjn presents above, a cts:search plus an XPath expression, will work fine. If it's more, I'd add the count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "Documents that have a count of meetings of 2 or greater".
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".

GAE -- Queries on sharded properties

I understand the theory of sharding values in Google App Engine, as outlined here:
http://code.google.com/appengine/articles/sharding_counters.html
But what happens when I want to run a query on a value that I've sharded? I can't simply query against the value, because it's been split up randomly among N different counters. Is the solution just to sum these values back up occasionally to update my main entity? I'm curious to see what solutions others have come up with for this problem.
EDIT: I just discovered the Task Queue API, and it looks like it might be a solution to updating the main value in the background. Anyone tried using this in parallel with sharding?
You're right, you can't use the total sum in another datastore query in a single shot, since it's split between the shards. However, you can run an initial query to gather all of the shards, sum them in memory, and then run your original query using that sum.
Beyond that, yes, the task queue is definitely a good approach to doing work like this in the background. Take a look at this talk for ideas:
http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
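The two-step approach (gather and sum the shards, then use the sum in the follow-up query) can be sketched in memory. This is a minimal illustration and does not use the App Engine datastore: the shards array, NUM_SHARDS, the entities list and its threshold field are all assumptions invented for the example.

```javascript
// In-memory stand-in for N counter shards; no datastore calls are made.
const NUM_SHARDS = 5; // assumed shard count for this sketch
const shards = new Array(NUM_SHARDS).fill(0);

// Writes go to a randomly chosen shard to spread contention.
function increment() {
  const index = Math.floor(Math.random() * NUM_SHARDS);
  shards[index] += 1;
}

// Step 1: gather all shards and sum them in memory.
function getTotal() {
  return shards.reduce((sum, count) => sum + count, 0);
}

for (let i = 0; i < 100; i++) {
  increment();
}

// Step 2: run the original query using the summed value.
// Hypothetical entities with a `threshold` field to compare against.
const entities = [
  { name: "a", threshold: 50 },
  { name: "b", threshold: 150 },
];
const total = getTotal();
const matching = entities.filter((entity) => entity.threshold <= total);
```

A background task could run the same summation periodically and store the total on the main entity, so that later queries can filter on it directly.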
