Kusto summarize 3 or more columns - azure-data-explorer

Is there a way to use summarize to group 3 or more columns? I've been able to successfully get data from 1 or 2 columns and then group by another column, but it breaks when I try to add a 3rd. A similar question asks how to add a column, but it only covers adding a 2nd, not a 3rd or 4th. Using the sample help cluster on Azure Data Explorer and working with the Covid19 table, ideally I would be able to do this:
Covid19
| summarize by Country, count() Recovered, count() Confirmed, count() Deaths
| order by Country asc
And return results like this
But that query throws an error "Syntax Error. A recognition error occurred. Token: Recovered. Line: 2, Position: 36"

I had the right basic idea; you just can't repeat count() inline like that. Use an aggregation function such as sum(), dcount(), or max() instead:
Covid19
| summarize sum(Recovered), sum(Confirmed), sum(Deaths) by Country
| order by Country asc
Another example:
Covid19
| where Timestamp == toscalar(Covid19 | summarize max(Timestamp))
| summarize confirmedCases = max(Confirmed), active = max(Active), recovered = max(Recovered), deaths = max(Deaths) by Country
| order by Country asc
In this example I'm getting the latest data for each of the selected columns. Since the where clause already filters down to the latest timestamp, you might expect to be able to just list the columns, but summarize requires an aggregate function, so I used max() on each column.
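The latest-snapshot-then-aggregate idea is not Kusto-specific; here is a minimal Python analogue, with made-up rows standing in for the Covid19 table (all names and values invented for illustration):

```python
# Hypothetical stand-in data: each row is one per-country report at a timestamp.
rows = [
    {"Country": "A", "Timestamp": 1, "Confirmed": 10, "Recovered": 2},
    {"Country": "A", "Timestamp": 2, "Confirmed": 15, "Recovered": 5},
    {"Country": "B", "Timestamp": 2, "Confirmed": 7,  "Recovered": 1},
]

# Step 1 (the "where" clause): keep only rows at the global latest timestamp.
latest = max(r["Timestamp"] for r in rows)
current = [r for r in rows if r["Timestamp"] == latest]

# Step 2 (the "summarize ... by Country"): aggregate with max() per country.
summary = {}
for r in current:
    c = summary.setdefault(r["Country"], {"Confirmed": 0, "Recovered": 0})
    c["Confirmed"] = max(c["Confirmed"], r["Confirmed"])
    c["Recovered"] = max(c["Recovered"], r["Recovered"])

print(summary)
# → {'A': {'Confirmed': 15, 'Recovered': 5}, 'B': {'Confirmed': 7, 'Recovered': 1}}
```

As in the KQL, the aggregate (max) is a formality once the filter has already isolated one row per country, but the grouping step still requires one.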

Related

KQL multiple aggregates in a summarize statement

I have data in an Azure Data Explorer table with, let's say, the following columns:
Day, non-unique-ID, Message-Content
What I want as an output is a table containing:
Day, Count of records per day, distinct Count of non-unique-ID per day
I know how to get one or the other:
summarize count() by Day
summarize dcount(non-unique-ID) by Day
but I don't know how to get a table containing both of those columns, because summarize will only let me run a single aggregate query per command.
You can use multiple aggregation functions in the same summarize operator, all you have to do is separate them with commas. So this will work:
summarize count(), dcount(non-unique-ID) by Day
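The count-plus-distinct-count aggregation is easy to mirror in plain Python; a small sketch with invented rows (day, non-unique id):

```python
from collections import defaultdict

# Hypothetical rows: (Day, non-unique ID). Names are made up for illustration.
rows = [
    ("Mon", "id1"), ("Mon", "id1"), ("Mon", "id2"),
    ("Tue", "id3"),
]

# Group the IDs by day, as "by Day" would.
per_day = defaultdict(list)
for day, uid in rows:
    per_day[day].append(uid)

# For each day, compute count() and dcount() side by side.
summary = {day: (len(ids), len(set(ids))) for day, ids in per_day.items()}
print(summary)  # → {'Mon': (3, 2), 'Tue': (1, 1)}
```

Both aggregates come out of a single pass over each group, which is exactly what the single summarize operator with comma-separated aggregations does.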

`dplyr` count function for unique items in field

I have searched for this on here a few times, so apologies if this is a duplicate.
I am working with dplyr for the first time, and I am having trouble coming up with what I'd like. If I was doing SQL, the query would look like:
select count(customer_id), sum(sales), (sum(sales) / count(customer_id)), *
from data_table
group by salesperson_id
In words, I want to:
group the data by salesperson
add up the total sales
count the number of unique customers
find the average sales per customer for each sales person.
I don't want to strip away "irrelevant" fields at this point, because they will become relevant in later steps.
I am getting stuck, specifically because the only counting function dplyr provides doesn't take any arguments. What aggregate function should I use to count distinct items in a field?
Responding to the question: What aggregate function should I use to count distinct items in a field?
n_distinct()
See docs here.
A broader example, though a reprex in the original question would help:
library(dplyr)

data_table %>%
  group_by(salesperson_id) %>%
  mutate(
    customers = n_distinct(customer_id),
    sales = sum(sales),
    sales_per_customer = sales / customers
  )
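For comparison, the same group/summarize logic can be sketched in plain Python (the sales tuples and field names below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical sales records: (salesperson_id, customer_id, sale amount)
sales = [
    (1, "a", 100.0), (1, "a", 50.0), (1, "b", 25.0),
    (2, "c", 200.0),
]

# group_by(salesperson_id)
groups = defaultdict(list)
for sp, cust, amount in sales:
    groups[sp].append((cust, amount))

# Per group: n_distinct(customer_id), sum(sales), and their ratio.
report = {}
for sp, recs in groups.items():
    customers = len({c for c, _ in recs})   # n_distinct(customer_id)
    total = sum(a for _, a in recs)         # sum(sales)
    report[sp] = (customers, total, total / customers)

print(report)  # → {1: (2, 175.0, 87.5), 2: (1, 200.0, 200.0)}
```

The distinct count is just "size of the set of customer ids within the group", which is all n_distinct() computes.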

Application Insights order by aggregate

I have the following query:
customEvents
| summarize count(datepart("Second", timestamp) )
by toint(customMeasurements.Latency)
This is counting the number of seconds past the minute and grouping it by an integer Latency.
How do I add an order by operator to this to order by these columns?
In order to do this you need to alias the columns.
Aliasing columns is performed by prefixing the value with column_alias=.
customEvents
| summarize Count=count(datepart("Second", timestamp) )
by Latency=toint(customMeasurements.Latency)
Then we can reference the columns by their aliases:
customEvents
| summarize Count=count(datepart("Second", timestamp) )
by Latency=toint(customMeasurements.Latency)
| order by Latency asc nulls last

Selecting multiple maximum values? In Sqlite?

Super new to SQLite but I thought it can't hurt to ask.
I have something like the following table (Not allowed to post images yet) pulling data from multiple tables to calculate the TotalScore:
Name     TotalScore
Course1  15
Course1  12
Course2  9
Course2  10
How the heck do I SELECT only the max value for each course? I've managed to use
ORDER BY TotalScore LIMIT 2
But I may end up with multiple Courses in my final product, so LIMIT 2 etc won't really help me.
Thoughts? Happy to put up the rest of my query if it helps?
You can GROUP the resultset by Name and then use the aggregate function MAX():
SELECT Name, max(TotalScore)
FROM my_table
GROUP BY Name
You will get one row for each distinct course, with the name in column 1 and the maximum TotalScore for this course in column 2.
Further hints
You can only SELECT columns that are either grouped by (Name) or wrapped in aggregate functions (max(TotalScore)). If you need another column (e.g. Description) in the resultset, you can group by more than one column:
...
GROUP BY Name, Description
To filter the resulting rows further, you need to use HAVING instead of WHERE:
SELECT Name, max(TotalScore)
FROM my_table
-- WHERE clause would be here
GROUP BY Name
HAVING max(TotalScore) > 5
WHERE filters the raw table rows, HAVING filters the resulting grouped rows.
Functions like max and sum are "aggregate functions", meaning they aggregate multiple rows together. Normally they aggregate the whole table into one value, like max(totalscore), but you can aggregate into one value per group with group by. group by says how to group the rows together into aggregates.
select name, max(totalscore)
from scores
group by name;
This groups together all the rows with the same name and then takes max(totalscore) within each group.
sqlite> select name, max(totalscore) from scores group by name;
Course1|15
Course2|10
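The whole example is small enough to verify with Python's built-in sqlite3 module (table and column names taken from the answer above):

```python
import sqlite3

# Recreate the example table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, totalscore INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("Course1", 15), ("Course1", 12), ("Course2", 9), ("Course2", 10)],
)

# Run the max-per-group query from the answer.
rows = conn.execute(
    "SELECT name, max(totalscore) FROM scores GROUP BY name ORDER BY name"
).fetchall()
print(rows)  # → [('Course1', 15), ('Course2', 10)]
```

One row per distinct name comes back, each carrying that group's maximum.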

Join Tables in R or Python [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
I have two tables-Price_list and order_list. The price_list table gives me all the prices that were active with date from all stores by product_id. While order_list gives me the list of orders placed i.e. who placed the order and from which store.
Price_list - date, product_id, store_id, selling_price
order_list - date, product_id, store_id, selling_price, order_id, email, product_order_id (unique key - concatenation of product_id and order_id as there could more than one product in an order)
I want to combine the above two tables in such a way that for each product_order_id i get a list of all prices that were available for the product. Basically i want to see what were the prices available and what did the customer choose. The table below illustrates my query.
product_order_id  Date        product_id  store_id  selling_price  Placed
134323_3545       2016/03/11  134323      6433      2560.00        Yes
134323_3545       2016/03/11  134323      6343      2534.00        No
134323_3545       2016/03/11  134323      1243      2313.00        No
134323_3545       2016/03/11  134323      2424      2354.00        No
145565_9965       2016/03/11  145565      9887      5432.00        No
145565_9965       2016/03/11  145565      7645      5321.00        Yes
I am not able to get around to solving this in R. Although I prefer R for this, I am open to a solution in MySQL or Python. The steps to get this done are: (a) select a product_order_id; (b) for each product_id in that product_order_id, search price_list for all entries on that date; (c) append the matches to a table and add a column specifying which product_order_id the list applies to; (d) repeat for the next product_order_id. Once that dataframe is prepared I can left join the order_list table on product_order_id to get the final dataframe. I have not yet been able to grasp how to do this in R.
After reading about loops and some help, I was able to create a loop that searches for all price entries for each product_id on a day (product_date is a concatenation of date and product_id):
datalist <- list()
for (i in orderlist_test$product_date) {
  dat <- filter(pricelist, product_date == i)
  datalist[[i]] <- dat
}
big_data <- do.call("rbind", datalist)
However, I also want to add another column specifying the order_id or product_order_id for each iteration. So if anyone could help me with how to loop and add another column at the same time, that would help me a lot.
This will retain all the rows for every product_id
library(dplyr)
order_list_joined <- full_join(Price_list, order_list, by = "product_id")
Then, if there is no order_id for a given product_id, we assume no order was placed:
order_list_joined <- order_list_joined %>% mutate(Placed = ifelse(is.na(order_id), "No", "Yes"))
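The join-then-flag idea can also be sketched in plain Python; the tiny price_list/order_list rows below are invented, and keying on (product_id, store_id) is an assumption made for illustration:

```python
# Hypothetical mini tables, mirroring the question's schemas.
price_list = [
    {"product_id": 134323, "store_id": 6433, "selling_price": 2560.0},
    {"product_id": 134323, "store_id": 6343, "selling_price": 2534.0},
]
order_list = [
    {"product_id": 134323, "store_id": 6433, "order_id": "3545"},
]

# Index orders by (product_id, store_id) so each price row can be flagged.
orders = {(o["product_id"], o["store_id"]): o["order_id"] for o in order_list}

# Join: every price row survives; Placed says whether a matching order exists.
joined = []
for p in price_list:
    oid = orders.get((p["product_id"], p["store_id"]))
    joined.append({**p, "order_id": oid, "Placed": "No" if oid is None else "Yes"})

print([r["Placed"] for r in joined])  # → ['Yes', 'No']
```

This is the same shape as the full_join plus mutate(Placed = ifelse(is.na(order_id), ...)) in the R answer: keep all price rows, then derive the flag from whether the order side matched.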
