Querying an extents directly - azure-data-explorer

Is it possible to query database extents directly without specifying name of a table ?
For example when I fire the command .show database extents , I get a list of extents in the given database. If I pick a specific extent id from the result , or in general any extent id belonging to that database , is there a way to query it without reference to table name?

It's not recommended to take any kind of dependency on an extent ID.
Though your use case isn't clear nor standard, it is possible to run a query as follows:
union *
| where extent_id() == '6810147e-1234-1234-1234-d3649e3d3a83'
| take 10

Related

Using millisecond timestamp as the global secondary index in DynamodDb for range queries?

We have a Dynamodb table Events with about 50 million records that look like this:
{
"id": "1yp3Or0KrPUBIC",
"event_time": 1632934672534,
"attr1" : 1,
"attr2" : 2,
"attr3" : 3,
...
"attrN" : N,
}
The Partition Key=id and there is no Sort Key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.
This setup works fine for fetching by id but now we'd like to efficiently query against event_time and pull ALL attributes for records that match within that range (could be a million or two items). The criteria would be equal to something like WHERE event_date between 1632934671000 and 1632934672000, for example.
Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_date and projecting ALL attributes that could allow a range query? By my understanding of DynamoDB this isn't possible but maybe there's another configuration I'm overlooking.
Thanks in advance.
(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).
Is it possible to create a Global Secondary Index using event_date and projecting ALL attributes that could allow a range query?
Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.
Without changing any existing data or transforming it through an external process?
No. You will need to manipulate the attibutes. Although arbitrary range queries are not its strength, DynamoDB has a time series pattern that can be adapted to your query pattern.
Let's say you query mostly by a limitied number of days. You would add a GSI with yyyy-mm-dd PK (Partition Key). Rows are made unique by a SK (Sort Key) that concatenates the timestamp with the id: event_time#id. PK and SK together are the Index's Composite Primary Key.
GSIPK1 = yyyy-mm-dd # 2022-01-20
GSISK1 = event_time#id # 1642709874551#1yp3Or0KrPUBIC
Querying for a single day needs 1 query operation, for a calendar week range needs 7 operations.
GSI1PK = "2022-01-20" AND GSI1SK > ""
Query a range within a day by adding a SK between condition:
GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874" AND "16427098745"
It seems like one can create a global secondary index at any point.
Below is an excerpt from the Managing Global Secondary Indexes documentation which can be found here https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html
To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.

why filtering on extents_tags() is slow

Why the following command is slow (5 mins)?
mytable | where extent_tags() contains "20210613" | count
I know this is not the best way to get count , I could have used .show table extents and could have simply calculated sum(RowCount) using summarize operator. But I am just testing. Ideally ADX should be able to search tags across extents and get counts , so it is only metadata search and once it finds correct extent, row count is already stored as part of the extent metadata anyways, so why should it take 5 mins? And by the, the extent(s) I am interested in has the following tag:-
drop-by:20210613
ingest-by:20210613
There is a datetime field in the table which I could have used to filter too , which is what adx ideally recommends in general scenarios and I can guess the reason that min and max of every datetime field in the table is stored in every extent of the table -- but then similarly even tag is stored in every extent. So which method is more efficient , filtering on a datetime field if available or tags?
a. you're correct that using .show table T extents where tags contains 'string' | ... would be much more efficient
b. as mentioned in the documentation: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/extenttagsfunction
Filtering on the value of extent_tags() performs best when one of the following string operators is used: has, has_cs, !has, !has_cs.
c. which method is more efficient , filtering on a datetime field if available or tags?
The former, especially when your filter is on a substring, and not on the full content of the tag. Tags are a non-indexed metadata property of shards, and isn't an indexed data column. Also see: https://yonileibowitz.github.io/blog-posts/datetime-columns.html

Tags not showing up while querying an ADX table

If I fire .show table <TableName> extents I do see the list of extents with tags. But if I try to list those tags along with the table data I get blank, so the following query returns blank tags:-
<TableName> | extend tags=extent_tags()
Why is it so ?
are you querying the table directly, or after applying some operators/functions on the raw data? (perhaps, hidden by a function that shares the same name as the table, and therefore 'hides' it)
as the docs mention:
Applying this function to calculated data which is not attached to a data shard returns an empty value

Cosmos db Order by on 'computed field'

I am trying to select data based on a status which is a string. What I want is that status 'draft' comes first, so I tried this:
SELECT *
FROM c
ORDER BY c.status = "draft" ? 0:1
I get an error:
Unsupported ORDER BY clause. ORDER BY item expression could not be mapped to a document path
I checked Microsoft site and I see this:
The ORDER BY clause requires that the indexing policy include an index for the fields being sorted. The Azure Cosmos DB query runtime supports sorting against a property name and not against computed properties.
Which I guess makes what I want to do impossible with queries... How could I achieve this? Using a stored procedure?
Edit:
About stored procedure: actually, I am just thinking about this, that would mean, I need to retrieve all data before ordering, that would be bad as I take max 100 value from my database... IS there any way I can do it so I don t have to retrieve all data first? Thanks
Thanks!
ORDER BY item expression could not be mapped to a document path.
Basically, we are told we can only sort with properties of document, not derived values. c.status = "draft" ? 0:1 is derived value.
My idea:
Two parts of query sql: The first one select c.* from c where c.status ='draft',second one select c.* from c where c.status <> 'draft' order by c.status. Finally, combine them.
Or you could try to use stored procedure you mentioned in your question to process the data from the result of select * from c order by c.status. Put draft data in front of others by if-else condition.

Datastore: Is there plan to add GQLQuery support?

I am using gcloud-python library for a project which needs to serve following use case:
Get a batch of entities with a subset of its properties (projection)
gcloud.datastore.api.get_multi() provides me batch get but not projection
and gcloud.datastore.api.Query() provides me projection but not batch get (like a IN query)
AFAIK, GQLQuery provides both IN query(batch get) and projections. Is there a plan to support GQLQueries in gcloud-python library? OR, is there another way to get batching and projection in single request?
Currently there is no way to request a subset of an entities properties. When you have the list of keys that you need, you should use get_multi().
Projection Query Background
In Datastore, projection queries are simply index scans.
For example, consider you are writing the query SELECT * FROM MyKind ORDER BY myFirstProp, mySecondProp. This query will execute against an index: Index(MyKind, myFirstProp, mySecondProp). This index may look something like:
myFirstProp | mySecondProp | __key__
------------------------------------
a 1 k1
a 2 k2
b 1 k3
For each result in the index, Datastore then looks up the key associated with that index result. If you do a projection query where you project only myFirstProp or mySecondProp or both, Datastore can avoid doing the random access lookup to find the associated entity for each result. This is generally where you get the large performance gain from using projections -- not from the savings of transporting it over the network.
Likewise, if you know the list of keys that you need, you can lookup the key directly -- there is no need to look in an index first.
IN Operator
In Python GQL (not in the similar Cloud Datastore GQL), there is the IN operator, which allows you to write a query that looks something like:
SELECT * FROM MyKind WHERE myFirstProp IN ['a', 'b'].
However, Datastore does not actually support this query natively. Inside the python client, this will get converted into disjunctive normal form:
SELECT * FROM MyKind WHERE myFirstProp = 'a'
UNION
SELECT * FROM MyKind WHERE myFirstProp = 'b'
This means for each value inside your IN, you'll be issuing a separate Datastore query.

Resources