How do you perform the equivalent of the SQL query SELECT SUM(column_name) FROM table_name in Kusto Query Language (KQL) for Azure Data Explorer?
app("your-app").tableName
| summarize sum(columnToSum)
You don't need a "by" clause in your summarize, but you can add one to perform a group by, for example:
app("your-app").tableName
| summarize sum(columnToSum) by columnToGroupBy
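For readers coming from SQL, the SQL side of this correspondence can be checked with a small script. This is only a sketch using Python's built-in sqlite3 module; the sample rows are made up for illustration, and the table and column names simply reuse the placeholders from the queries above.

```python
import sqlite3

# Hypothetical sample data: (columnToGroupBy, columnToSum).
rows = [("a", 10), ("a", 5), ("b", 7)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableName (columnToGroupBy TEXT, columnToSum INTEGER)")
conn.executemany("INSERT INTO tableName VALUES (?, ?)", rows)

# SQL counterpart of: tableName | summarize sum(columnToSum)
total = conn.execute("SELECT SUM(columnToSum) FROM tableName").fetchone()[0]

# SQL counterpart of: tableName | summarize sum(columnToSum) by columnToGroupBy
per_group = conn.execute(
    "SELECT columnToGroupBy, SUM(columnToSum) FROM tableName "
    "GROUP BY columnToGroupBy ORDER BY columnToGroupBy"
).fetchall()

print(total)      # 22
print(per_group)  # [('a', 15), ('b', 7)]
conn.close()
```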
I have data in a table for azure data explorer, let's say the following columns:
Day, non-unique-ID, Message-Content
What I want as an output is a table containing:
Day, Count of records per day, distinct Count of non-unique-ID per day
I know how to get one or the other:
summarize count() by Day
summarize dcount(non-unique-ID) by Day
but I don't know how to get a table containing both of those columns, because summarize will only let me run a single aggregate query per command.
You can use multiple aggregation functions in the same summarize operator. All you have to do is separate them with commas, so this will work:
summarize count(), dcount(non-unique-ID) by Day
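To make the result shape concrete, here is a rough Python sketch of what that single summarize computes: one row per day carrying both aggregates. The sample records are invented for illustration. Note one assumption worth flagging: Kusto's dcount returns an estimate of the distinct count, whereas this sketch counts exactly with a set.

```python
from collections import defaultdict

# Hypothetical records: (Day, non-unique ID, message content).
records = [
    ("2021-01-01", "u1", "hello"),
    ("2021-01-01", "u1", "again"),
    ("2021-01-01", "u2", "hi"),
    ("2021-01-02", "u3", "hey"),
]

def summarize_by_day(rows):
    """Return {day: (record_count, distinct_id_count)} in one pass."""
    counts = defaultdict(int)
    ids = defaultdict(set)
    for day, uid, _message in rows:
        counts[day] += 1      # plays the role of count()
        ids[day].add(uid)     # exact stand-in for dcount(non-unique-ID)
    return {day: (counts[day], len(ids[day])) for day in sorted(counts)}

result = summarize_by_day(records)
print(result)  # {'2021-01-01': (3, 2), '2021-01-02': (1, 1)}
```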
Is there a way to use summarize to group 3 or more columns? I've been able to successfully get data from 1 or 2 columns then group by another column, but it breaks when trying to add a 3rd. This question asks how to add a column, but only regards adding a 2nd, not a 3rd or 4th. Using the sample help cluster on Azure Data Explorer and working with the Covid19 table, ideally I would be able to do this:
Covid19
| summarize by Country, count() Recovered, count() Confirmed, count() Deaths
| order by Country asc
And return results like this
But that query throws an error "Syntax Error. A recognition error occurred. Token: Recovered. Line: 2, Position: 36"
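I had the right basic idea, but the syntax was wrong: the aggregates go before the by clause, and count() counts rows rather than taking a column. To aggregate the values of a column, use a function such as sum, dcount, or max: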
Covid19
| summarize sum(Recovered), sum(Confirmed), sum(Deaths) by Country
| order by Country asc
Another example:
Covid19
| where Timestamp == max_of(Timestamp, Timestamp)
| summarize confirmedCases = max(Confirmed), active = max(Active), recovered = max(Recovered), deaths = max(Deaths) by Country
| order by Country asc
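In this example I'm getting the latest data for each of the selected columns. Note that the where clause compares each row's Timestamp with itself, so it does not actually filter anything; it is the max aggregates that pick out the values, and they match the latest data only for counts that never decrease. Even if the rows had already been filtered down to the latest timestamp, you could not just list the columns: with summarize, every column that is not in the by clause has to be wrapped in an aggregate function, so I used max on each column.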
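The difference between "the latest row per country" and "the maximum of each column per country" is easy to miss, so here is a small Python sketch contrasting the two. The rows are invented and are not taken from the Covid19 table. In KQL, the "latest row" behaviour is what the arg_max aggregate is meant for (for example, summarize arg_max(Timestamp, *) by Country); the functions below are hypothetical illustrations, not Kusto's implementation.

```python
# Hypothetical rows: (Country, Timestamp, Confirmed, Active).
rows = [
    ("A", "2020-05-01", 100, 60),
    ("A", "2020-05-02", 120, 50),  # Active dropped on the later day
    ("B", "2020-05-01", 30, 30),
    ("B", "2020-05-02", 45, 40),
]

def latest_per_country(data):
    """Keep the whole row with the greatest Timestamp for each country."""
    latest = {}
    for row in data:
        country, timestamp = row[0], row[1]
        # ISO-formatted date strings compare correctly as plain strings.
        if country not in latest or timestamp > latest[country][1]:
            latest[country] = row
    return latest

def max_per_column(data):
    """Take the maximum of each numeric column independently per country."""
    best = {}
    for country, _ts, confirmed, active in data:
        prev = best.get(country, (confirmed, active))
        best[country] = (max(prev[0], confirmed), max(prev[1], active))
    return best

latest = latest_per_country(rows)
maxima = max_per_column(rows)
print(latest["A"][2:])  # (120, 50)
print(maxima["A"])      # (120, 60)
```

For country A the two approaches agree on Confirmed but not on Active, because Active went down over time.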
I have created a subset of the pg_table_def table with table_name, col_name, and data_type. I have also added a column, active, with 'Y' as the value for some of the rows. Let us call this table config. It looks like this:
table_name           column_name
interaction_summary  name_id
tag_transaction      name_id
interaction_summary  direct_preference
bulk_sent            email_image_click
crm_dm               web_le_click
Now I want to be able to map the table names from this table to the actual table and fetch values for the corresponding column. name_id will be the key here which will be available in all tables. My output should look like below:
name_id  direct_preference  email_image_click  web_le_click
1        Y                  1                  2
2        N                  1                  2
The solution needs to be dynamic, so that if the table list is extended tomorrow, the new table is picked up automatically. Since I am new to Redshift, any help is appreciated. I am also considering doing the same in R using the dplyr package.
I understood that dynamic queries don't work with Redshift.
My objective was to pull any new table that comes in and use their columns for regression analysis in R.
I got this working by using the listagg function and string concatenation, and then wrote the output to a data frame in R. This data frame holds 'n' select queries, one per row.
Below is the format:
library(dplyr)  # provides tbl() and sql(); an open DBI connection `conn` is assumed

df <- as.data.frame(tbl(conn, sql("
  select 'select ' || col_names || ' from ' || table_name as q1
  from (
    select distinct table_name,
           listagg(col_name, ',') within group (order by col_name)
             over (partition by table_name) as col_names
    from attribute_config
    where active = 'Y'
    order by table_name
  )
  group by 1")))
Once done, I assigned every row of this dataframe to a new dataframe and fetched the output using below:
df1 <- tbl(conn,sql(df[1,]))
I know this is a roundabout solution, but it works! It fetches about 17M records in under 1 second.
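The same "generate one select statement per table from the config rows" idea can be sketched outside Redshift. The example below is a stand-in only: it uses Python's built-in sqlite3 module instead of a Redshift connection, the rows in attribute_config are invented, and the column concatenation that listagg does in the answer is done in Python here.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attribute_config (table_name TEXT, col_name TEXT, active TEXT)"
)
conn.executemany(
    "INSERT INTO attribute_config VALUES (?, ?, ?)",
    [
        ("interaction_summary", "name_id", "Y"),
        ("interaction_summary", "direct_preference", "Y"),
        ("bulk_sent", "email_image_click", "Y"),
        ("crm_dm", "web_le_click", "N"),  # inactive, so it is skipped
    ],
)

# Collect the active columns for each table, already sorted by the query.
columns = defaultdict(list)
for table, column in conn.execute(
    "SELECT table_name, col_name FROM attribute_config "
    "WHERE active = 'Y' ORDER BY table_name, col_name"
):
    columns[table].append(column)

# One SELECT statement per table, columns in alphabetical order.
queries = [f"select {','.join(cols)} from {table}" for table, cols in columns.items()]
for q in queries:
    print(q)
conn.close()
```

Building SQL by string concatenation like this is only reasonable because the identifiers come from a config table you control; it should not be fed untrusted input.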
I have the following query:
customEvents
| summarize count(datepart("Second", timestamp) )
by toint(customMeasurements.Latency)
This is counting the number of seconds past the minute and grouping it by an integer Latency.
How do I add an order by operator to this to order by these columns?
In order to do this you need to alias the columns.
Aliasing columns is performed by prefixing the value with column_alias=.
customEvents
| summarize Count=count(datepart("Second", timestamp) )
by Latency=toint(customMeasurements.Latency)
Then we can reference the columns by their aliases:
customEvents
| summarize Count=count(datepart("Second", timestamp) )
by Latency=toint(customMeasurements.Latency)
| order by Latency asc nulls last
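As a rough illustration of what "asc nulls last" means for the result, here is a Python sketch that counts events per latency and then sorts with missing latencies at the end. The latency values are invented, and the sort key is just one way to express the ordering, not how Kusto implements it.

```python
from collections import Counter

# Hypothetical events: each value is an integer latency, or None when missing.
latencies = [20, 5, None, 20, 5, 5, None]

# Rough equivalent of counting rows per Latency group.
counts = Counter(latencies)

# Rough equivalent of: order by Latency asc nulls last.
# The key sorts on (is_null, value), so None always sorts after real numbers.
ordered = sorted(counts.items(), key=lambda kv: (kv[0] is None, kv[0] or 0))
print(ordered)  # [(5, 3), (20, 2), (None, 2)]
```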
Super new to SQLite but I thought it can't hurt to ask.
I have something like the following table (Not allowed to post images yet) pulling data from multiple tables to calculate the TotalScore:
Name     TotalScore
Course1  15
Course1  12
Course2  9
Course2  10
How the heck do I SELECT only the max value for each course? I've managed to use
ORDER BY TotalScore LIMIT 2
But I may end up with multiple Courses in my final product, so LIMIT 2 etc won't really help me.
Thoughts? Happy to put up the rest of my query if it helps?
You can GROUP the resultset by Name and then use the aggregate function MAX():
SELECT Name, max(TotalScore)
FROM my_table
GROUP BY Name
You will get one row for each distinct course, with the name in column 1 and the maximum TotalScore for this course in column 2.
Further hints
You can only SELECT columns that are either grouped by (Name) or wrapped in aggregate functions (max(TotalScore)). If you need another column (e.g. Description) in the resultset, you can group by more than one column:
...
GROUP BY Name, Description
To filter the resulting rows further, you need to use HAVING instead of WHERE:
SELECT Name, max(TotalScore)
FROM my_table
-- WHERE clause would be here
GROUP BY Name
HAVING max(TotalScore) > 5
WHERE filters the raw table rows, HAVING filters the resulting grouped rows.
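The difference between WHERE and HAVING is easiest to see by running both against the same data. The sketch below uses Python's built-in sqlite3 module with the four rows from the question; the thresholds 13 and 12 are arbitrary values picked so that the two filters give visibly different results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (Name TEXT, TotalScore INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("Course1", 15), ("Course1", 12), ("Course2", 9), ("Course2", 10)],
)

# Maximum score per course.
per_course = conn.execute(
    "SELECT Name, max(TotalScore) FROM my_table GROUP BY Name ORDER BY Name"
).fetchall()

# WHERE drops rows before grouping: only scores below 13 are considered.
where_result = conn.execute(
    "SELECT Name, max(TotalScore) FROM my_table "
    "WHERE TotalScore < 13 GROUP BY Name ORDER BY Name"
).fetchall()

# HAVING drops groups after aggregation: only courses whose maximum exceeds 12.
having_result = conn.execute(
    "SELECT Name, max(TotalScore) FROM my_table "
    "GROUP BY Name HAVING max(TotalScore) > 12 ORDER BY Name"
).fetchall()

print(per_course)     # [('Course1', 15), ('Course2', 10)]
print(where_result)   # [('Course1', 12), ('Course2', 10)]
print(having_result)  # [('Course1', 15)]
conn.close()
```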
Functions like max and sum are "aggregate functions" meaning they aggregate multiple rows together. Normally they aggregate them into one value, like max(totalscore) but you can aggregate them into multiple values with group by. group by says how to group the rows together into aggregates.
select name, max(totalscore)
from scores
group by name;
This groups all the rows with the same name together and then computes max(totalscore) for each name.
sqlite> select name, max(totalscore) from scores group by name;
Course1|15
Course2|10