I've used a query like
requests
| getschema
to get a table containing the names and types of all columns in the requests table. How can I get the same result for requests.customDimensions?
It looks like this can be done using buildschema and then transforming the output into a consumable format (thanks Dmitry Matveev for helping me out with that piece 🙂):
requests
| summarize schema=buildschema(customDimensions)
| mvexpand bagexpansion=array schema
| project name=schema[0], type=schema[1]
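To see what each stage contributes, here is the same pipeline annotated with roughly what each step produces, using a hypothetical customDimensions bag such as {"Role": "frontend", "Attempts": 2} (illustrative values, not from a real table):
requests
| summarize schema=buildschema(customDimensions)  // one row holding a bag like {"Role":"string","Attempts":"long"}
| mvexpand bagexpansion=array schema              // one row per property, each as a [name, type] array
| project name=schema[0], type=schema[1]          // final table with a name column and a type column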
If I run this query:
SELECT HEX(BINARY(CONVERT('ßÁÁÁÁÁȵ$€Łß' USING ucs2)));
I get:
00DF00C100C100C100C100C1010C00B5002420AC014100DF
and I suppose that sequence is big-endian (BE), because a text file saved as UTF-16 BE contains the same byte sequence.
How can I get the sequence in UTF-16 LE?
Why do I want LE? Because this query on MS SQL Server:
SELECT CONVERT(varbinary(100), N'ßÁÁÁÁÁȵ$€Łß',0)
returns:
0xDF00C100C100C100C100C1000C01B5002400AC204101DF00
Thanks,
Jaroslav
You need to cast with a little endian character set:
SELECT HEX(BINARY(CONVERT('ßÁÁÁÁÁȵ$€Łß' USING utf16le)));
+----------------------------------------------------------------+
| HEX(BINARY(CONVERT('ßÁÁÁÁÁȵ$€Łß' USING utf16le))) |
+----------------------------------------------------------------+
| DF00C100C100C100C100C1000C01B5002400AC204101DF00 |
+----------------------------------------------------------------+
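As a quick sanity check on the byte order, converting a single character makes the difference easy to see. 'ß' is U+00DF, so the big-endian encoding is 00DF and the little-endian encoding is DF00 (this assumes MySQL 5.6 or later, where the utf16le character set was introduced):
SELECT HEX(CONVERT('ß' USING ucs2))    AS be,  -- 00DF
       HEX(CONVERT('ß' USING utf16le)) AS le;  -- DF00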
I'm using bag_unpack to explode the customDimensions column in the AppInsights traces table and want to "shape" the resultant table. All is fine if there are rows to work with. If there are not, subsequent operations that reference the exploded columns fail. For example (I boiled it down to an isolated repro),
datatable (Date:datetime, JSON:string )
[datetime(1910-06-11), '{"key": "1"}', datetime(1930-01-01), '{"key": "2"}',
datetime(1953-01-01), '{"key": "3"}', datetime(1997-06-25), '{"key": "4"}']
| where Date > datetime(2000-01-01)
| project parsed = parse_json(JSON)
| evaluate bag_unpack(parsed)
| project-rename value = key
// lots more data shaping here
Since the where filters out all rows, there is nothing to unpack. OK, that's fine, but the data-shaping ops (e.g., project-rename) fail, saying:
project-rename: Failed to resolve column reference 'key'
If you change the date in the where to, say, 1900-01-01, then everything works as expected.
Note as well that if you remove the bag_unpack and project-rename some other column, it works fine with no rows. For example,
datatable (Date:datetime, JSON:string )
[datetime(1910-06-11), '{"key": "1"}', datetime(1930-01-01), '{"key": "2"}',
datetime(1953-01-01), '{"key": "3"}', datetime(1997-06-25), '{"key": "4"}']
| where Date > datetime(2000-01-01)
| project-rename value = JSON
I can see how the unpack creates the columns, so if it doesn't run, the columns never get created; but at the same time, why run the project at all if there are no rows?
In theory I could move the where down, but I'm not sure the query planner would recognize that and only do the subsequent project/data shaping on the reduced set of rows (filtered by the where). I've got a lot of rows and typically only need to operate on a few of them.
Pointers on how to work with bag_unpack and empty tables? Or columns that may or may not be there?
You could use the column_ifexists() function: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/columnifexists
For example:
... | project value = column_ifexists("key", "")
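Applied to the repro above, the pipeline then survives an empty input; the second argument to column_ifexists() is the fallback used when the column doesn't exist (a minimal sketch, using an empty string as the fallback):
datatable (Date:datetime, JSON:string )
[datetime(1910-06-11), '{"key": "1"}', datetime(1930-01-01), '{"key": "2"}',
datetime(1953-01-01), '{"key": "3"}', datetime(1997-06-25), '{"key": "4"}']
| where Date > datetime(2000-01-01)
| project parsed = parse_json(JSON)
| evaluate bag_unpack(parsed)
| project value = column_ifexists("key", "")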
I'm fairly new to BigQuery (third day of using it with no training), and I'm just trying to get my head around nested fields etc.
I've looked at the following resources and used the personsData example from the Google BigQuery docs link:
https://cloud.google.com/bigquery/docs/data
https://chartio.com/resources/tutorials/how-to-flatten-data-using-google-bigquerys-legacy-vs-standard-sql/
I'd like to run the below query:
select *
from [dataset.tableid]
where fullname = 'John Doe'
If I run this, I get the following error:
Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place
From reading the above articles, this isn't possible because you need to flatten the results, which from what I can understand just duplicates all the non-repeated fields, e.g.:
fullName | age | gender | children.name | children.age
John Doe | 22  | Male   | John          | 5
John Doe | 22  | Male   | Jane          | 7
One of the above articles suggests that you can still use WHERE clauses by using the FLATTEN function in BigQuery:
select fullname,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place
If I change this to:
select *
FROM (FLATTEN([dataset.tableId], children))
WHERE fullname = 'John Doe'
Then this works fine and gives me what I need. However, if I change it to this:
select *
FROM (FLATTEN([dataset.tableId], citieslived))
WHERE fullname = 'John Doe'
Then I get the following error:
Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_yearsLived
Can someone explain why this works when flattening based on "children" but not "citiesLived", and how to know which fields to use within FLATTEN for more complex datasets with multiple nested fields?
Thank you in advance
Can someone explain why this works when flattening based on "children" but not "citiesLived"
Check the schema of this table again:
Schema
-----------------------------------
|- kind: STRING
|- fullName: STRING (required)
|- age: INTEGER
|- gender: STRING
+- phoneNumber: RECORD
| |- areaCode: INTEGER
| |- number: INTEGER
+- children: RECORD (repeated)
| |- name: STRING
| |- gender: STRING
| |- age: INTEGER
+- citiesLived: RECORD (repeated)
| |- place: STRING
| +- yearsLived: INTEGER (repeated)
As you can see, when you flatten the children repeated record, the only repeated record left for output is citiesLived, and even though it contains yet another repeated field, yearsLived, the two are not independent, so BigQuery Legacy SQL can output the result.
Now, when you flatten by citiesLived, what you get as a result are two repeated fields, children and yearsLived. Those two are independent, so BigQuery Legacy SQL cannot output such a result.
how to know which fields to use within FLATTEN for more complex datasets with multiple nested fields?
To make it work, you should add yet another flattening with (for example) the yearsLived field. Something like below:
FROM (FLATTEN(FLATTEN([dataset.tableId], citieslived), yearsLived))
Adding all those multiple FLATTENs can become cumbersome, so using BigQuery Standard SQL is really the way to go!
See Migrating from Legacy SQL to BigQuery Standard SQL
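For comparison, here's a sketch of the equivalent query in Standard SQL against the schema above (the table name is a placeholder); each UNNEST is an explicit flattening, so there is no ambiguity about independently repeated fields:
#standardSQL
SELECT
  fullName,
  age,
  gender,
  cl.place,
  yl AS yearLived
FROM `project_id.dataset_id.table`,
  UNNEST(citiesLived) AS cl,
  UNNEST(cl.yearsLived) AS yl
WHERE fullName = 'John Doe'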
If you run this query:
SELECT
*
FROM
(FLATTEN((FLATTEN(([project_id:dataset_id.table]), citiesLived.yearsLived)), citiesLived))
It will flatten as expected.
When using Legacy SQL, BigQuery tries to flatten the results for you automatically.
What I have noticed, though, is that if you try to flatten repeated fields that have other repeated fields inside them, you can run into these errors (notice that the fields citiesLived and citiesLived.yearsLived are both repeated).
So one way to solve that is to force the flatten operation on all the repeated fields you want to work with (in the example I showed you, I first flattened yearsLived and then citiesLived) rather than relying on the automatic flattening that Legacy SQL offers.
But what I strongly recommend and encourage you to do is learn the Standard SQL version for BQ, as Elliot suggested in his comment. It might have a steeper learning curve at first, but it will totally pay off in the long run (and you won't run the risk of eventually having to migrate all your legacy queries to Standard SQL, as we had to do in our company).
I have this query in application insights analytics
let total = exceptions
| where timestamp >= ago(7d)
| where problemId contains "Microsoft.ServiceBus"
| summarize sum(itemCount);
let nullContext = exceptions
| where timestamp >= ago(7d)
| where problemId contains "Microsoft.ServiceBus"
| where customDimensions.["SpecificTelemetry.Message"] == "HttpContext.Current is null"
| summarize sum(itemCount);
let result = iff(total == nullContext, "same", "different");
result
but I get this error
Invalid relational operator
I am surprised, as yesterday with the same code (as far as I remember) I was getting a different error saying that both sides of the check need to be scalars. My understanding was that the aggregation, even though it displays a single value (under sum_itemCount), is not a scalar. But I couldn't find a way to transform it, or how to get rid of this error.
Thanks
Couple of issues.
First, the "Invalid relational operator" error is probably due to the empty lines between your let statements. AI Analytics allows you to write several queries in the same window and uses empty lines to separate them, so in order to run all the statements as a single query you need to eliminate the empty lines.
Regarding the error of "Left and right side of the relational operator must be scalars" - the result of the "summarize" operator is a table and not scalar. It can contain a single line/column or multiple of those (think of what happens if you add a "by" clause to the summarize).
To achieve what you want to do you might want to use a single query as follows:
exceptions
| where timestamp >= ago(7d)
| where problemId contains "Microsoft.ServiceBus"
| extend nullContext = customDimensions.["SpecificTelemetry.Message"] == "HttpContext.Current is null"
| summarize sum(itemCount) by nullContext
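Alternatively, if you want to keep the literal same/different output from your original approach, you can wrap each summarize in toscalar(), which converts a single-value tabular result into a scalar that iff() can compare. A sketch of that approach:
let total = toscalar(exceptions
| where timestamp >= ago(7d)
| where problemId contains "Microsoft.ServiceBus"
| summarize sum(itemCount));
let nullContext = toscalar(exceptions
| where timestamp >= ago(7d)
| where problemId contains "Microsoft.ServiceBus"
| where customDimensions.["SpecificTelemetry.Message"] == "HttpContext.Current is null"
| summarize sum(itemCount));
print result = iff(total == nullContext, "same", "different")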
I've got an SQLite table which contains start/stop timestamps. I would like to create a query which returns the total elapsed time from them.
Right now I have a SELECT (e.g. SELECT t,type FROM event WHERE t>0 AND (name='start' OR name='stop') AND eventId=xxx ORDER BY t) which returns a table which looks something like this:
+---+-----+
|t |type |
+---+-----+
| 1|start|
| 20|stop |
|100|start|
|150|stop |
+---+-----+
Producing the total elapsed time in the above example would be accomplished by (20-1)+(150-100) = 69.
One idea I had was this: I could run two separate queries, one for the "start" fields and one for the "stop" fields, on the assumption that they would always line up like this:
+---+---+
|(1)|(2)|
+---+---+
| 1| 20|
|100|150|
+---+---+
(1) SELECT t FROM EVENT where name='start' ORDER BY t
(2) SELECT t FROM EVENT where name='stop' ORDER BY t
Then it would be simple (I think!) to just sum the differences. The only problem is, I don't know if I can join two separate queries like this: I'm familiar with joins that combine every row with every other row and then eliminate those that don't match some criteria. In this case, the criterion is that the row index is the same, but this isn't a database field; it's just the order of the resulting rows in the output of two separate selects, and there isn't any database field I can use to determine it.
Or perhaps there is some other way to do this?
I do not use SQLite but this may work. Let me know.
SELECT SUM(CASE WHEN type = 'stop' THEN t ELSE -t END) FROM event
This assumes the only values in type are start/stop.
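As a quick check against the sample rows above: (-1) + 20 + (-100) + 150 = 69, which matches the expected total. With the filters from the original query it would look something like this (a sketch; use name or type according to the actual column in your schema, and eventId = xxx is the question's placeholder):
SELECT SUM(CASE WHEN name = 'stop' THEN t ELSE -t END) AS elapsed
FROM event
WHERE t > 0 AND (name = 'start' OR name = 'stop') AND eventId = xxx;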