0 preceding and 0 following in Teradata

I have just found out that '0 PRECEDING' and '0 FOLLOWING' are not the same thing in Teradata SQL. Maybe the reason isn't that important to know, but I'm still curious and want to understand the logic. So, does anyone know what the difference between them is?

Based on Standard SQL there's no difference; both are equivalent to CURRENT ROW.
But Teradata seems to handle it a bit differently for cumulative windows: in ROWS BETWEEN UNBOUNDED/n PRECEDING AND 0 PRECEDING, the 0 PRECEDING is treated the same as 1 PRECEDING, which is very strange.
Of course, I can't imagine why anyone would actually use this instead of CURRENT ROW.
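A minimal sketch of the behavior described above (the table and column names are hypothetical):
SELECT txn_date,
       amount,
       SUM(amount) OVER (ORDER BY txn_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_incl,
       SUM(amount) OVER (ORDER BY txn_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) AS running_excl
FROM transactions;
Per Standard SQL the two columns should be identical; in Teradata, running_excl reportedly lags one row behind, as if the frame ended at 1 PRECEDING.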

Related

Order by Gremlin (on AWS Neptune) descending puts 0 at the top

I have a Neptune Gremlin query that should order vertices by the number of times they've been saved by other users in descending order. It works perfectly for vertices where the property value is > 0, but for some reason puts the vertices where the property is equal to zero at the top.
When adding the vertex, the property is created without quotes (so it is not a string), and I am able to sum on the property when I increment it in other scenarios, so the values should all be numbers. Ordering in ascending order also works as expected (zero values come first, and the rest are ordered correctly).
Has anyone seen this before or knows why it might be happening? I don't want to have to pre-filter out zero values.
The relevant part of my query is below; it behaves the same way, with the incorrect ordering, while the full query just adds some output that isn't relevant to this question. I have attached an image of the full query and its results.
g.V().hasLabel('trip').order().by('numSaves', desc)
[Image: full query and results]
I was able to reproduce the issue thanks to the very helpful additional information. The issue seems to be related to the sum step. In the near term, the workaround of using fold().unfold() will work, as it causes a different code path through the query engine to be taken. Another workaround that worked for me is to use a sack to do the "add one" when writing the property. It is not a very elegant query, but it does seem to avoid the ordering problem. I will update this answer when more information is available.
g.V("some-id").
property(single, "numSaves",
sack(assign).by('numSaves').
sack(sum).by(constant(1)).sack())
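For reference, the fold().unfold() workaround mentioned above would look something like this on the read side (a sketch; the fold/unfold pair leaves the results unchanged but forces a different code path through the engine):
g.V().hasLabel('trip').fold().unfold().order().by('numSaves', desc)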
UPDATED July 29th 2021:
An Amazon Neptune update (1.0.5.0) was just released that contains a fix for this issue.

How to OR Query for contains in Dynamoose?

I want to search (query) for a bunch of strings in a column in DynamoDB, using Dynamoose (https://github.com/dynamoose/dynamoose).
But the query returns nothing. Can you tell me whether this type of query is allowed, or whether there is another syntax for it?
Code sample
Cat.query({"breed": {"contains": "Terrier","contains": "husky","contains": "wolf"}}).exec()
I want all of these breeds, so these are OR conditions. Please help.
Two major things here.
First: a Query in DynamoDB requires an equality condition on a hash key, either the hash key of the table or the hash key of an index. So even if you could get this working, the query would fail, since you can't apply multiple equality conditions to that key. It must be hashKey = _______; no OR statements or anything else for that first condition.
Second, to answer your question: it seems like what you are looking for is the condition.in function. That would change your code to look something like this:
Cat.query("breed").in(["Terrier", "husky", "wolf"]).exec()
Of course, the code above will still not work, due to the first point.
If you really want to brute-force this to work, you can use Model.scan, basically changing query to scan in the syntax. However, scan operations are extremely heavy on the DB at scale: DynamoDB looks through every document/item before applying the filter and returning results to you, so you get none of the optimization you would normally get from a query. If you only have a handful of documents/items in your table, it might be worth taking the performance hit, and in cases like exporting or backing up data a scan also makes sense. But if you are able to avoid scan operations, I would; that might require some rethinking of your DB structure, though.
Cat.scan("breed").in(["Terrier", "husky", "wolf"]).exec()
So the code above would work, and I think it is what you are asking for, but keep in mind the performance and cost hit you are taking here.
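If breed happened to be the hash key of the table or of a global secondary index, one alternative to a scan (a sketch, not from the answer above) would be to run one query per value and merge the results on the client:
// Hypothetical: only works if "breed" is a hash key (of the table or an index).
const breeds = ["Terrier", "husky", "wolf"];
const results = await Promise.all(
  breeds.map((breed) => Cat.query("breed").eq(breed).exec())
);
const cats = results.flat(); // merge the three result sets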

Cosmos DB ARRAY_LENGTH performance

I had an issue where a valid query didn't return anything when it should have:
SELECT *
FROM root
WHERE
(ARRAY_LENGTH(root["orderData"]["_attachments"]) > 0
AND root["orderData"]["_status"] = "ARCHIVEDVALIDATED")
OR root["orderData"]["_status"] = "ARCHIVEDREJECTED"
Thanks to the Stack Overflow community, I found out that it was because the query was consuming too many RUs and nothing was returned.
After digging and trying several things, I found that if I remove ARRAY_LENGTH(root["orderData"]["_attachments"]) > 0, my query goes from 13k RU to 600 RU.
I can't seem to find a way to fix this. The hotfix I have found so far is to remove ARRAY_LENGTH(root["orderData"]["_attachments"]) > 0 from the query and filter in memory afterwards (which is not good...).
Am I missing something? How could I fix this?
Thank you!
600 RU is still very, very bad; that is not a solution.
The reason for such bad performance is that your query cannot use indexes, and a full scan can never scale: bad now, it will only get worse as your collection grows.
What you need is to make sure your query can use an index so it examines the smallest possible number of documents. It's hard to propose an exact solution without knowing how your values are distributed across orderData._status and the orderData._attachments lengths, but you should consider:
- Drop the OR. "This or that" queries cannot use an index, and Cosmos DB uses just one index per query. If the orderData._status values are selective enough, you would get much better RU/performance by making two calls and merging the results on the client.
- Precalculate your condition into a separate property and put an index on it (see the sketch after this list). Yes, that duplicates data, but a few extra bytes cost you nothing, while RUs and performance cost you a lot, in money as well as in user experience.
- You can also combine the two, for example by running two queries and storing only the array count. Think about your data and test it out.
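A sketch of the precalculation idea. It assumes a hypothetical boolean property orderData._hasAttachments that your writer sets to the value of ARRAY_LENGTH(orderData._attachments) > 0 on every write, and that is covered by the indexing policy:
SELECT *
FROM root
WHERE root["orderData"]["_hasAttachments"] = true
AND root["orderData"]["_status"] = "ARCHIVEDVALIDATED"
A second query for _status = "ARCHIVEDREJECTED", merged on the client, then replaces the OR branch.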
To figure out the discrepancy in RUs between the two queries, you may want to check the Query Metrics for both queries as per https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-metrics.
You could also try to swap the first two expressions and see if this makes any difference. Basically try this query:
SELECT *
FROM root
WHERE ((root["orderData"]["_status"] = "ARCHIVEDVALIDATED"
    AND ARRAY_LENGTH(root["orderData"]["_attachments"]) > 0)
    OR root["orderData"]["_status"] = "ARCHIVEDREJECTED")

Timestamps and database structure

I'm currently trying to figure out the best way to save entries for my app in a way that I can effectively query them by day. I am stuck between two different approaches.
To simplify my problem, let's say I'm making a journal app. For this app, a journal entry contains {title, timestamp}.
Approach #1:
-Journal
--[user_id]
---journal entry
Approach #2:
-Journal
--[user_id]
---[Unix timestamp of beginning of day]
----journal entry
As of now, I'm leaning towards approach #1, specifically because it could potentially allow me to grab entries within the last 24 hours by querying rather than for a specific day.
At the same time, however, approach #2 would allow me to more easily handle the potentially sparse set of entries with client-side logic, and it would let me avoid using any querying functions. The appeal of this approach is that I could write very simple functions for generating the timestamp for the beginning of any given day and just fetch the journal entries childed to that timestamp.
I really want to go with approach #1, as it seems like the right thing to do and feels like it would give me freedom in many ways that approach #2 would not, but I worry that the queries required for approach #1 are not very straightforward and would be a huge hassle.
If I go with approach #1, is there a specific way to query?
I'm not sure whether using orderByChild('timestamp').startAt(startTimestamp).endAt(endTimestamp) would work the way it intuitively should, as I've tried this approach in the past and it didn't behave correctly.
Please let me know which approach I should go with, and if it's approach 1, how I should be properly querying it.
Thanks.
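For reference, the range query described in the question would look something like this with the Firebase Realtime Database JavaScript SDK (a sketch under approach #1; the path and the numeric timestamp child are assumptions):
// Hypothetical structure: /Journal/[user_id]/[entry_id] = { title, timestamp }
const dayStart = new Date().setHours(0, 0, 0, 0);  // start of today, in ms
const dayEnd = dayStart + 24 * 60 * 60 * 1000 - 1; // end of today, in ms

firebase.database()
  .ref('Journal/' + userId)
  .orderByChild('timestamp')
  .startAt(dayStart)
  .endAt(dayEnd)
  .once('value')
  .then((snapshot) => {
    snapshot.forEach((child) => {
      console.log(child.key, child.val().title);
    });
  });
For this to stay efficient as data grows, the timestamp child would also need an ".indexOn" entry in the security rules.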

SQLite: Downsides of ANALYZE

Does the ANALYZE command have any downsides (other than a slightly larger database)? If not, why is it not executed by default?
There is another downside. The ANALYZE results may cause the query planner to ignore indexes that you really want to use.
For example suppose you have a table with a boolean column "isSpecial". Most of the rows have isSpecial = 0 but there are a few with isSpecial = 1.
When you do a query SELECT * FROM MyTable WHERE isSpecial = 1, in the absence of ANALYZE data the query planner will assume the index on isSpecial is good and will use it. In this case it will happen to be right. If you were to do isSpecial = 0 then it would still use the index, which would be inefficient, so don't do that.
After you have run ANALYZE, the query planner will know that isSpecial has only two values, so the selectivity of the index is bad, and it won't use it, even in the isSpecial = 1 case above. For it to know that the isSpecial values are very unevenly distributed, it would need data that it only gathers when compiled with the SQLITE_ENABLE_STAT4 option. That option is not enabled by default, and it has a big downside of its own: it makes the query plan for a prepared statement depend on its bound values, so SQLite will re-prepare the statement much more often. (Possibly every time it's executed; I don't know the details.)
tl;dr: running ANALYZE makes it almost impossible to use indexes on boolean fields, even when you know they would be helpful.
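A minimal sketch of that effect, assuming a scratch database in which almost every row has isSpecial = 0:
CREATE TABLE MyTable (id INTEGER PRIMARY KEY, isSpecial INTEGER);
CREATE INDEX idx_isSpecial ON MyTable (isSpecial);

-- Before ANALYZE the planner trusts the index:
EXPLAIN QUERY PLAN SELECT * FROM MyTable WHERE isSpecial = 1;
-- typically: SEARCH MyTable USING INDEX idx_isSpecial (isSpecial=?)

ANALYZE;

-- After ANALYZE the statistics say the index has only two distinct
-- values, so the planner may fall back to a full scan:
EXPLAIN QUERY PLAN SELECT * FROM MyTable WHERE isSpecial = 1;
-- typically: SCAN MyTable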
Short answer: it may take more time to calculate than time saved.
Unlike indexes, the statistics gathered by ANALYZE are not kept up to date automatically as data is added or updated. You should rerun ANALYZE any time a significant amount of data has been added or updated.
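One low-effort way to keep the statistics reasonably fresh is SQLite's PRAGMA optimize, which re-runs ANALYZE only when SQLite itself judges it worthwhile:
-- Commonly issued just before closing a database connection:
PRAGMA optimize;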
