Get Other columns based on max of one column in Kusto - azure-data-explorer

I am trying to write a Kusto query that finds the record with the max value in one column, grouped by another column, while also returning the remaining column.
Let there be three columns: A (timestamp), B (impvalue: number) and C (anothervalue: string).
I need to get, for each value of C, the record with the max timestamp along with its corresponding B value.
In SQL I know how to do this with a self join. I am new to Kusto; I tried a few combinations of summarize, join and the top operator but wasn't able to make it work.

You can use the arg_max() aggregation function: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/arg-max-aggfunction
For example:
datatable(A: datetime, B: long, C: string)
[
    datetime(2020-08-20 12:00:00), 50, "abc",
    datetime(2020-08-20 12:10:00), 30, "abc",
    datetime(2020-08-20 12:05:00), 100, "abc",
    datetime(2020-08-20 12:00:00), 40, "def",
    datetime(2020-08-20 12:05:00), 120, "def",
    datetime(2020-08-20 12:10:00), 80, "def",
]
| summarize arg_max(A, *) by C
C    A                            B
abc  2020-08-20 12:10:00.0000000  30
def  2020-08-20 12:10:00.0000000  80

This isn't the most elegant solution, but it works:
let X = datatable (a: string, b: int, c: string) [
    "8/24/2021, 12:40:00.042 PM", 50, "abc",
    "8/24/2021, 12:40:10.042 PM", 30, "abc",
    "8/24/2021, 12:40:05.042 PM", 100, "abc",
    "8/24/2021, 12:40:00.042 PM", 40, "def",
    "8/24/2021, 12:40:05.042 PM", 120, "def",
    "8/24/2021, 12:40:10.042 PM", 80, "def"
];
X
// take the max per group; a is a string, so max() compares lexicographically,
// which only works here because every value shares the same date and format
| summarize Answer = max(a) by c
| join X on $left.Answer == $right.a, $left.c == $right.c
| project a, b, c

Related

Sum data from data frame 1 according to matching conditions in data frame 2 within an interval

I have two data frames. The first one holds observations for a specific ID in a time interval with given StartDate and EndDate. The second data frame holds observations for the same specific IDs but on a specific date.
ID <- c(86041, 87371, 98765, 90010)
DateStart <- as.Date((c("2022-02-04", "2022-02-04", "2022-02-08", "2022-02-08")))
DateEnd <- as.Date((c("2022-02-07", "2022-02-10","2022-02-11", "2022-02-11")))
Interaction <- c(122, 73, 105, 82)
df1 <- data.frame(ID, DateStart, DateEnd, Interaction)
ID <- c(86041, 86041, 87371, 87371, 98765, 98765, 90010, 90010)
date <- as.Date(c("2022-02-04", "2022-02-05", "2022-02-06", "2022-02-09", "2022-02-09", "2022-02-11", "2022-02-08", "2022-02-10"))
view <- c(25, 67, 21, 36, 43, 61, 14, 34)
read <- c(13, 37, 29, 15, 37, 51, 9, 25)
df2 <- data.frame(ID, date, view, read)
I want to sum all the events from the second data frame for a specific ID within the interval between StartDate and EndDate and add this as another column in the first data frame.
So I tried writing a function to get the aggregate of "view" for a certain ID in a specific time interval and applying it to df1, but I only get 0 as the return value, and it does not look very elegant:
calc_view <- function(ID, StartDate, EndDate) {
  sum(df2$view[which(df2$ID == ID &
                       df2$date >= StartDate &
                       df2$date <= EndDate)])
}
df1$view <- apply(df1, 1, calc_view, StartDate = df1$StartDate, EndDate = df1$EndDate)
The desired output should present the count of aggregated events for "view" and "read" for a specific ID in the interval between StartDate and EndDate given in df1. So something like this:
ID DateStart DateEnd Interaction view read
1 86041 2022-02-04 2022-02-07 122 92 50
2 87371 2022-02-04 2022-02-10 73 57 44
3 98765 2022-02-08 2022-02-11 105 104 88
4 90010 2022-02-08 2022-02-11 82 48 34
I'm quite new to R and suppose there's a better option, so any help is highly appreciated.
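A minimal sketch of one possible fix (calc_sum is an illustrative helper, not from the thread): apply() coerces each row of df1 to a character vector and passes it as the single first argument, and the named arguments reference df1$StartDate and df1$EndDate, which don't exist (the columns are DateStart and DateEnd), so the filter matches nothing and sum() returns 0. Looping over row indices instead keeps the Date columns intact:
# Illustrative helper: sum one df2 column for a given ID within [start, end]
calc_sum <- function(id, start, end, col) {
  sum(df2[[col]][df2$ID == id & df2$date >= start & df2$date <= end])
}
# Index-based loop so df1's Date columns keep their class
df1$view <- sapply(seq_len(nrow(df1)), function(i)
  calc_sum(df1$ID[i], df1$DateStart[i], df1$DateEnd[i], "view"))
df1$read <- sapply(seq_len(nrow(df1)), function(i)
  calc_sum(df1$ID[i], df1$DateStart[i], df1$DateEnd[i], "read"))
This reproduces the desired output above (e.g. view = 92 and read = 50 for ID 86041).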

Adjusting Time Averages to Show Weekdays or Weekends

I am tracking the date/time when I say goodnight to Alexa. The entries are super weird and unhelpful strings, not dates and times:
Sample data:
January 25, 2021 at 12:03AM
January 25, 2021 at 11:27PM
January 26, 2021 at 11:17PM
Alexa just dumps these unconventional date/time strings into A1-A??? on the first tab.
I am using this formula to show my average bedtime each month:
= QUERY(ARRAYFORMULA(
IF(LEN(A2:A), {
MONTH(REGEXEXTRACT(A2:A, "\D+") & 1),
REGEXEXTRACT(A2:A, "\D+"),
IF(TIMEVALUE(REGEXEXTRACT(A2:A, "\d+:\d+.*")) > 0.5,
TIMEVALUE(REGEXEXTRACT(A2:A, "\d+:\d+.*")),
TIMEVALUE(REGEXEXTRACT(A2:A, "\d+:\d+.*")) + 1)
}, "")),
"Select Col1,Col2 ,avg(Col3) where Col1 is not null
group by Col1, Col2 Order By Col1 asc label Col1 '#', Col2 'Month', avg(Col3)
'Average bedtime'
")
But really, I don't care so much about weekends as I do about weeknights. I'm stumped on how to adjust the formula so that it only shows Sun-Thu nights.
To make it trickier: if I went to bed after midnight on a Thursday (gasp), that should still be included.
Turning to those who have madder skills than me!
Thanks for your help,
Drew
I think this should take into account the edge cases mentioned above:
cell B2:
=arrayformula(
regexreplace(A2:A, "^([\w, ]+) at ([\w: ]+)$", "$1 $2")
)
cell C2:
=arrayformula(
query(
{
text(B2:B - (B2:B - int(B2:B) < timevalue("4:00 AM")), "mmmm"),
text(B2:B - (B2:B - int(B2:B) < timevalue("4:00 AM")), "ddd"),
timevalue(B2:B) + (B2:B - int(B2:B) < timevalue("4:00 AM"))
},
"select Col1, avg(Col3)
where Col2 matches 'Sun|Mon|Tue|Wed|Thu'
group by Col1
pivot Col2",
0
)
)
Format the result cells as Format > Number > Time.
Here is a partial solution.
=ArrayFormula(query(if(A2:A = "",, {
month(REGEXEXTRACT(A2:A, "(.*) at")),
REGEXEXTRACT(A2:A, "\D+"),
REGEXEXTRACT(A2:A, "\d+:.*") + (REGEXEXTRACT(A2:A, "\D\D$")="AM"),
mod(REGEXEXTRACT(A2:A, "(.*) at") + REGEXEXTRACT(A2:A, "at (.*)")-1, 7)
}),
"Select Col1,Col2 ,avg(Col3) where Col4 > 0.5 and Col4 < 5.5
group by Col1, Col2 Order By Col1 asc label Col1 '#', Col2 'Month', avg(Col3)
'Average bedtime'
"))
It does not address the year and end-of-month issues raised by Erik. Any before-noon entries on the first of a month should be included in the previous month.
In Col4, 0.5 corresponds to Sunday noon and 5.5 to Friday noon. Ed

Get Fastest Time Difference on Order and Response in Two DataFrames

I have 2 data frames as shown. The first (df1) has order IDs, user IDs, and the time the user ordered something. In df2, I have the order IDs and the time each order was responded to (timeResponse). What I need is a data frame that takes these two and outputs each order ID and, if it was responded to, the time difference between the order time and the fastest response. Thus, for the first order (order ID 1), there were 3 responses, the first at 2pm, so it would be a 2 hour response.
I'm looking for a way to do this in R.
df1 <- data.frame(
  orderID = c(1, 2, 3, 4, 5),
  userID = c(101, 102, 103, 104, 105),
  timeOrdered = c("1/1/2020 12:00:00 PM", "1/2/20 1:00PM", "1/3/20 12:00 AM",
                  "1/4/20 12:00 AM", "1/5/20 12:00 AM"))
# orderID here refers back to df1$orderID: order 1 received three responses
df2 <- data.frame(
  responseID = c(1, 2, 3, 4, 5),
  orderID = c(1, 1, 1, 4, 5),
  timeResponse = c("1/1/20 2:00 PM", "1/1/20 3:00 PM", "1/1/20 4:00 PM",
                   "1/4/20 2:00 PM", "1/5/20 2:00 PM"))
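A minimal sketch of one possible approach (assuming the dplyr and lubridate packages; firstResponse and responseHours are illustrative names): parse the mixed timestamp formats, take the earliest response per order, then join back to df1 so unanswered orders show NA:
library(dplyr)
library(lubridate)
# Parse both timestamp formats seen in the question
df1$timeOrdered  <- parse_date_time(df1$timeOrdered,
                                    orders = c("mdy IMS p", "mdy IM p"))
df2$timeResponse <- parse_date_time(df2$timeResponse, orders = "mdy IM p")
df2 %>%
  group_by(orderID) %>%
  summarise(firstResponse = min(timeResponse)) %>%  # fastest response per order
  right_join(df1, by = "orderID") %>%               # keep unanswered orders as NA
  mutate(responseHours = as.numeric(
    difftime(firstResponse, timeOrdered, units = "hours")))
For order ID 1 this gives responseHours = 2, matching the description above.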

Converting Monthly time series to daily time series

I am using monthly time series data which is in fact an xts object. My aim is to convert the monthly data to daily data, such that each day in a month has the value of that particular month.
For example:
library("xts")
observation_dates <- as.Date(c("01.12.1993", "01.01.1994",
"01.02.1994", "01.03.1994", "01.04.1994", "01.05.1994",
"01.06.1994", "01.07.1994", "01.08.1994", "01.09.1994",
"01.10.1994", "01.11.1994", "01.12.1994"), format = "%d.%m.%Y")
air_data <- zoo(matrix(c(21, 21, 21, 30, 35.5, 36, 38.5,
33, 37, 37, 30, 24, 21), ncol = 1), observation_dates)
colnames(air_data) = "air_temperature"
The series is as shown above.
I want all 31 days in December 1993 to have a value of 21 (air temperature) so that the average of the month still remains 21, and similarly for the rest of the months as shown.
I have tried using to.period(x, period = "days") but nothing changes.
Does anyone have any idea? Your help would be appreciated.
Thank you so much for your response. However, I was able to solve the problem. The approach I used is similar to the one suggested by Ekatef. In my case I created an empty xts object containing all the dates and converted all the variables in the empty xts to numeric using lapply().
Then I merged the empty xts with the monthly data series using:
merge(x, y, fill = na.locf). Here na.locf carries the last observation in the monthly series forward to all the days of that month, and then does the same for the following months.
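A minimal sketch of that approach (reusing observation_dates from the question; monthly, all_days and daily are illustrative names):
library(xts)
# Monthly series from the question, as an xts object
monthly <- xts(c(21, 21, 21, 30, 35.5, 36, 38.5, 33, 37, 37, 30, 24, 21),
               order.by = observation_dates)
colnames(monthly) <- "air_temperature"
# A zero-column xts indexed by every day in the target range
all_days <- seq(start(monthly), as.Date("1994-12-31"), by = "day")
# fill = na.locf carries each month's value forward through its days
daily <- merge(monthly, xts(, order.by = all_days), fill = na.locf)
Every day in December 1993 then carries the value 21, so the monthly average is preserved.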
The xts package isn't applicable to your problem; according to the help for to.period:
It is not possible to convert a series from a lower periodicity to a
higher periodicity - e.g. weekly to daily or daily to 5 minute bars,
as that would require magic
It seems the approx() function may be the best solution if interpolation is desired:
# emulation of the original monthly dates
observation_dates <- as.Date(c("01.12.1993", "01.01.1994",
"01.02.1994", "01.03.1994", "01.04.1994", "01.05.1994",
"01.06.1994", "01.07.1994", "01.08.1994", "01.09.1994",
"01.10.1994", "01.11.1994", "01.12.1994"), format = "%d.%m.%Y")
t_air <- c(21, 23, 20, 30, 35.5, 36, 38.5, 33, 37, 37, 30, 24, 27)
# target dates
seq_date <- seq(from = as.Date("01.12.1993", format = "%d.%m.%Y"),
to = as.Date("31.12.1994", format = "%d.%m.%Y"), by = 1)
ans <- approx(observation_dates, y = t_air, xout = seq_date)
If only one value for each month should be used, I would solve your problem with two data frames. The first one, obs_data, keeps the observation data with a column of dates in a convenient "year-month" format:
ym_dates <- format(observation_dates, "%Y-%m")
t_air <- c(21, 23, 20, 30, 35.5, 36, 38.5, 33, 37, 37, 30, 24, 27)
obs_data <- data.frame(observation_dates, ym_dates ,t_air)
The second one, res_df, keeps the target dates seq_date at daily resolution. The column air_t is filled with NA first:
res_df <- data.frame(seq_date, ym = format(seq_date, "%Y-%m"),
stringsAsFactors = FALSE, air_t = NA)
Then fill the air_t column with data from obs_data, using the correspondence of years and months as the condition:
dates_to_int <- unique(res_df$ym)
for (i in seq(along.with = dates_to_int)) {
  res_df[which(res_df$ym %in% dates_to_int[i]), "air_t"] <-
    obs_data[which(obs_data$ym_dates %in% dates_to_int[i]), "t_air"]
}
Hope it'll be helpful :)

R: aggregate by date - (every 30min mean) [closed]

I have been struggling with this for a while now:
I have a data frame that contains 5-minute measurements (for around 6 months) of different parameters. I want to aggregate them and get the mean of every parameter every 30 min. Here is a short example:
TIMESTAMP <- c("2015-12-31 0:30", "2015-12-31 0:35","2015-12-31 0:40", "2015-12-31 0:45", "2015-12-31 0:50", "2015-12-31 0:55", "2015-12-31 1:00", "2015-12-31 1:05", "2015-12-31 1:10", "2015-12-31 1:15", "2015-12-31 1:20", "2015-12-31 1:25", "2015-12-31 1:30")
value1 <- c(45, 50, 68, 78, 99, 100, 5, 9, 344, 10, 45, 68, 33)
mymet <- as.data.frame(TIMESTAMP, value1)
mymet$TIMESTAMP <- as.POSIXct(mymet$TIMESTAMP, format = "%Y-%m-%d %H:%M")
halfhour <- aggregate(mymet, list(TIME = cut(mymet$TIMESTAMP, breaks = "30 mins")),
mean, na.rm = TRUE)
What I want to get is the average between 00:35 and 1:00 and call this DATE-1:00AM, however, what I get is: average between 00:30 and 00:55 and this is called DATE-12:30am.
How can I change the function to give me the values that I want?
The trick (I think) is looking at when your first observation starts. If the first observation is 00:35 and you do the 30-minute cut, the intervals should follow the logic you want. Regarding the names of the breaks, it's just a matter of adding 25 minutes to each name, and then you get what you want. Here is an example for 6 months of 2015:
require(lubridate)
require(dplyr)
TIMESTAMP <- seq(ymd_hm('2015-01-01 00:00'),ymd_hm('2015-06-01 23:55'), by = '5 min')
TIMESTAMP <- data.frame(obs=1:length(TIMESTAMP),TS=TIMESTAMP)
TIMESTAMP <- TIMESTAMP[-(1:7),] # drop the first rows so the series starts at 00:35
TIMESTAMP$Breaks <- cut(TIMESTAMP$TS, breaks = "30 mins")
TIMESTAMP$Breaks <- ymd_hms(as.character(TIMESTAMP$Breaks)) + (25*60)
Averages <- TIMESTAMP %>% group_by(Breaks) %>% summarise(MeanObs=mean(obs,na.rm = TRUE))
If you get mymet constructed properly (note that as.data.frame(TIMESTAMP, value1) does not build a two-column data frame; use data.frame(TIMESTAMP, value1), or the structure() call below), you can cut TIMESTAMP into bins (which you can do with cut.POSIXt) so you can aggregate:
mymet$half_hour <- cut(mymet$TIMESTAMP, breaks = "30 min")
aggregate(value1 ~ half_hour, mymet, mean)
## half_hour value1
## 1 2015-12-31 00:30:00 73.33333
## 2 2015-12-31 01:00:00 80.16667
## 3 2015-12-31 01:30:00 33.00000
Data
mymet <- structure(list(TIMESTAMP = structure(c(1451539800, 1451540100,
1451540400, 1451540700, 1451541000, 1451541300, 1451541600, 1451541900,
1451542200, 1451542500, 1451542800, 1451543100, 1451543400), class = c("POSIXct",
"POSIXt"), tzone = ""), value1 = c(45, 50, 68, 78, 99, 100, 5,
9, 344, 10, 45, 68, 33)), .Names = c("TIMESTAMP", "value1"), row.names = c(NA,
-13L), class = "data.frame")
