Order by Gremlin (on AWS Neptune) descending puts 0 at the top

I have a Neptune Gremlin query that should order vertices by the number of times they've been saved by other users in descending order. It works perfectly for vertices where the property value is > 0, but for some reason puts the vertices where the property is equal to zero at the top.
When adding the vertex, the property is created without quotes (so not a string), and I am able to sum on the property when I increment it in other scenarios, so the values should all be numbers. Ordering in ascending order also works as expected (zero values come first, and then the ordering is correct).
Has anyone seen this before or knows why it might be happening? I don't want to have to pre-filter out zero values.
The relevant part of my query is the following; it behaves the same way (with the incorrect ordering) as my full query, which just includes some extra results that aren't relevant to this question:
g.V().hasLabel('trip').order().by('numSaves', desc)

I was able to reproduce the issue thanks to the very helpful additional information. The issue seems to be related to the sum step. In the near term, the workaround of using fold().unfold() will work, as it causes a different code path to be taken through the query engine. I will update this answer when more information is available. Another workaround that worked for me is to use a sack to do the "add one". It is not a very elegant query, but it does seem to avoid the order problem.
g.V("some-id").
  property(single, "numSaves",
    sack(assign).by('numSaves').
    sack(sum).by(constant(1)).sack())
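For reference, the fold().unfold() workaround applied to the ordering query from the question would look something like the following (the exact placement of the steps is my assumption; the point is just to force the alternate code path before the order step):
g.V().hasLabel('trip').fold().unfold().order().by('numSaves', desc)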
UPDATED July 29th 2021:
An Amazon Neptune update (1.0.5.0) was just released that contains a fix for this issue.

Related

DynamoDB top item per partition

We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized; e.g. the moment of the last update record for TSLA may be different than for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol and the range_key Moment, and we believe we could achieve the first query easily/efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
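For reference, that single-symbol lookup is just a Query with the sort order reversed and a limit of one; in boto3 terms it would be something like this (the table name Stocks is hypothetical):

import boto3

dynamodb = boto3.client("dynamodb")

# Latest value for one symbol: newest Moment first, stop after one item.
resp = dynamodb.query(
    TableName="Stocks",
    KeyConditionExpression="Symbol = :s",
    ExpressionAttributeValues={":s": {"S": "TSLA"}},
    ScanIndexForward=False,  # descending by the Moment range key
    Limit=1,
)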
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table, or in the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the one where IsLatestForMachineKey=1, compare the Moments, and if the insertion is newer, set the new one to 1 and the old one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems API.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
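A minimal boto3 sketch of that design follows (the table and attribute names come from the description above; the exact ConditionExpression for guarding against out-of-order data is my assumption):

import boto3

dynamodb = boto3.client("dynamodb")

def write_stock(symbol, moment, value):
    # Write every observation to the historical table, and overwrite the
    # "current" table only if this observation is newer than what is stored.
    dynamodb.transact_write_items(
        TransactItems=[
            {
                "Put": {
                    "TableName": "stocks-historical",
                    "Item": {
                        "symbol": {"S": symbol},
                        "moment": {"N": str(moment)},
                        "value": {"N": str(value)},
                    },
                }
            },
            {
                "Put": {
                    "TableName": "stocks-current",
                    "Item": {
                        "symbol": {"S": symbol},
                        "moment": {"N": str(moment)},
                        "value": {"N": str(value)},
                    },
                    # Only overwrite if the stored moment is older (or absent).
                    "ConditionExpression":
                        "attribute_not_exists(symbol) OR moment < :m",
                    "ExpressionAttributeValues": {":m": {"N": str(moment)}},
                }
            },
        ]
    )

# Latest value of every symbol: scan the small "current" table.
latest = dynamodb.scan(TableName="stocks-current")

# All historical values of one symbol: query by hash key only.
history = dynamodb.query(
    TableName="stocks-historical",
    KeyConditionExpression="symbol = :s",
    ExpressionAttributeValues={":s": {"S": "TSLA"}},
)

One caveat: if the condition on stocks-current fails, the whole transaction is canceled, so genuinely out-of-order data would need a fallback plain PutItem to stocks-historical alone.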

how to use a previous stored item after a fold gremlin

First, I'm using Azure Cosmos graph DB.
I see this sort of pattern quite a bit:
out('an-edge').fold().coalesce(unfold(),addV('incoming-schedule'))
I want to add an edge immediately after I do an addV in the coalesce. I've been trying to do it in a simple example:
g.V('any-vertex-id').as('a').out('an-edge').fold().coalesce(unfold(), addV('new-vertex').addE('to-v').from('a'))
"a" seems to no longer exist after a fold() since it's a barrier step. I tried store and aggregate but I must not understand those properly. Is it possible to get a reference after a fold()? I need it because it may reference a previous addV in the query to which I wouldn't have the id yet.
What is your requirement here? Do you want to create a new vertex and edge only when out('an-edge') is not present?
If that's the case, I will try this:
g.V('any-vertex-id').as('a').coalesce(out('an-edge'), addV('new-vertex').addE('to-v').from(select('a')))
The fold() step is typically used when one needs to aggregate on all the output from the preceding step. I don't think that is necessary in this case.
http://tinkerpop.apache.org/docs/current/reference/#fold-step
It looks like I can store() the vertex and then select() from it when adding the edge:
g.V('any-vertex-id').store('a').out('an-edge').fold().
  coalesce(unfold(),
    addV('new-vertex').addE('to-v').from(select('a').unfold()))
Not sure if someone has a better alternative or a better suggestion than store, but this seems to work, at least in my scenario.

Search query to find documents that have multiple elements

I have a few XML documents in MarkLogic which have the structure
<abc:doc>
  <abc:doc-meta>
    <abc:meetings>
      <abc:meeting>
      </abc:meeting>
      <abc:meeting>
      </abc:meeting>
    </abc:meetings>
  </abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting> element under the <abc:meetings> element.
I am trying to write a cts:search query to get only documents that have more than one <abc:meeting> element in the document.
Please advise
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The overall simplest solution would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
  collection(),
  cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, hence could be expensive.
I played with the thought of leveraging cts:near-query(), but that is driven by word positions, so it depends on the actual number of tokens inside a meeting. If that were always an exact number of tokens (unlikely, I'd guess), you could use the minimal-distance option on a double cts:element-query() wrapped in a cts:near-query(). It might help optimize the previous option a little, though.
The most performant option I can think of right now involves adding a User-Defined aggregate Function (UDF). It unfortunately means compiling C++ code. I happen to have written such a UDF in the past that you should be able to use as-is after compilation and installation. For details see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
It boils down to how many "a few" is. If it's thousands or fewer, then what grtjn presents above (a cts:search plus an XPath predicate) will work fine. If it's more, I'd add the count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "documents that have a count of meetings of 2 or greater", as sketched below.
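With that attribute and index in place, the search could look something like this (the attribute name count, and an int attribute range index on it, are my assumptions):
cts:search(
  collection(),
  cts:element-attribute-range-query(
    xs:QName("abc:meetings"),
    xs:QName("count"),
    ">=", 2
  )
)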
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".
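That last query might look something like this (again assuming the attribute is named multiple):
cts:search(
  collection(),
  cts:element-attribute-value-query(
    xs:QName("abc:meetings"),
    xs:QName("multiple"),
    "true"
  )
)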

How many points in an InfluxDB measurement?

Since there is no way to delete points by field values in InfluxDB, I'd like to get a count of all the points, use SELECT INTO to copy everything except the points with unwanted values into a second measurement, and then get a count of the second measurement.
However,
SELECT COUNT(*) FROM measurement1
returns an array of counts for each field and tag, which doesn't tell me how many data points there are in total.
It seems there is currently no way to do this without knowing a name of a column/value that is present in all points.
Although time is always present in all points, it is unfortunately not possible to do count(time) for now, either.
This issue addresses the problem, but it is closed and a bit outdated. Someone should open a new one because the problem is still there.
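In the meantime, if some field is known to be present in every point (here I'm assuming a field named value), counting that one field gives the total number of points:
SELECT COUNT("value") FROM measurement1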
Use this command:
SHOW SERIES CARDINALITY
It works for tags (note that it counts distinct series, not points).

Is a scan query always expensive in DynamoDB or should you use a range key

I've been playing around with Amazon DynamoDB and looking through their examples, but I think I'm still slightly confused by them. I've created the example data on a local DynamoDB instance to get used to querying data etc. The sample data sets up 3 tables: 'Forum' -> 'Thread' -> 'Reply'.
Now if I'm in a specific forum, the Thread table has a ForumName key I can query against to return relevant threads, but would the very top level (displaying the forums) always have to be a scan operation?
From what I can gather, the only way to "select *" in DynamoDB is to use a Scan, and I assume in this instance (where Forum is very high level and might have a relatively small number of rows) that it wouldn't be that expensive. Or are you actually better off creating a hash and range key and using that to query this table? I'm not sure what the range key would be in this instance; maybe just a number, with the query specifying that the value has to be > 0? Or perhaps the date it was created, with the query always using a constant date in the past?
I did try a sample query on the 'Forum' table example data using a ComparisonOperator of 'GE' (greater than or equal) with an attribute value list of 'S'=>'a', but this fails, stating that any conditions on the hash key must be of type EQ, which implies I couldn't do the above, as I would always need to know my 'Name' values upfront.
Maybe I'm still struggling having come from an RDBMS background, especially as there are many forum examples out there.
thanks
I think using Scan to get all the forums is fine. In this case it is efficient, because it will not return anything that you don't need (all of the work that the Scan does is necessary). Also, since the Scan operation is so simple, it is easier to implement and more likely to be efficient.
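For illustration, here is a minimal boto3 Scan of the sample Forum table (paginating, since a single Scan call returns at most 1 MB of data):

import boto3

dynamodb = boto3.client("dynamodb")

forums = []
kwargs = {"TableName": "Forum"}
while True:
    page = dynamodb.scan(**kwargs)
    forums.extend(page["Items"])
    # Follow the pagination token until the table is exhausted.
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]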
