Find max from first row to current row in Kusto (Timeseries) - azure-data-explorer

I'm trying to find out how to extend a column to show the maximum value from the first row up to the current row (i.e., scanning the table from the row with the earliest timestamp to the latest).
Here's a sample input table:
let T = datatable(Timestamp:datetime, Count:int)
[
datetime(2021-01-01), 1,
datetime(2021-01-02), 1,
datetime(2021-01-03), 2,
datetime(2021-01-04), 1,
datetime(2021-01-05), 1,
datetime(2021-01-06), 3,
datetime(2021-01-07), 1,
datetime(2021-01-08), 2,
];
The desired output is:
Timestamp   Count  MaxToDate
2021-01-01  1      1
2021-01-02  1      1
2021-01-03  2      2
2021-01-04  1      2
2021-01-05  1      2
2021-01-06  3      3
2021-01-07  1      3
2021-01-08  2      3
Thanks!

You can use the scan operator: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/scan-operator
For example:
datatable(Timestamp:datetime, Count:int)
[
datetime(2021-01-01), 1,
datetime(2021-01-02), 1,
datetime(2021-01-03), 2,
datetime(2021-01-04), 1,
datetime(2021-01-05), 1,
datetime(2021-01-06), 3,
datetime(2021-01-07), 1,
datetime(2021-01-08), 2,
]
| order by Timestamp asc
| scan declare (max_to_date:int = 0) with
(
    step s1: true => max_to_date = case(Count > s1.max_to_date,
                                        Count,
                                        s1.max_to_date);
)
Timestamp                    Count  max_to_date
2021-01-01 00:00:00.0000000  1      1
2021-01-02 00:00:00.0000000  1      1
2021-01-03 00:00:00.0000000  2      2
2021-01-04 00:00:00.0000000  1      2
2021-01-05 00:00:00.0000000  1      2
2021-01-06 00:00:00.0000000  3      3
2021-01-07 00:00:00.0000000  1      3
2021-01-08 00:00:00.0000000  2      3
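For readers outside Kusto, the scan step above is just a running maximum. A minimal stdlib-Python sketch of the same computation (illustrative only, not part of the original answer):

```python
from itertools import accumulate

counts = [1, 1, 2, 1, 1, 3, 1, 2]

# Running maximum: each element is the max of all values up to and
# including the current position, mirroring what the scan step carries
# forward in max_to_date.
max_to_date = list(accumulate(counts, max))

print(max_to_date)  # [1, 1, 2, 2, 2, 3, 3, 3]
```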

Related

Fill missing values of dates using the number of the weekday

I have a dataset in long format. Every subject in the dataset was observed five times during the week. I have a column with the number of the day of the week on which the observation was supposed to happen/happened, and another column with the actual dates of the observations. The latter column has some missing values. I would like to use the information in the first column to fill the missing values in the second column. Here is a toy dataset:
df <- data.frame(case = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 day = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 date = as.Date(c("2023-01-02", "2023-01-03", NA, NA, "2023-01-06",
                                  NA, "2021-05-11", "2021-05-12", "2021-05-13", NA)))
df
# case day date
# 1 1 2023-01-02
# 1 2 2023-01-03
# 1 3 <NA>
# 1 4 <NA>
# 1 5 2023-01-06
# 2 1 <NA>
# 2 2 2021-05-11
# 2 3 2021-05-12
# 2 4 2021-05-13
# 2 5 <NA>
And here is the desired output:
# case day date
#1 1 1 2023-01-02
#2 1 2 2023-01-03
#3 1 3 2023-01-04
#4 1 4 2023-01-05
#5 1 5 2023-01-06
#6 2 1 2021-05-10
#7 2 2 2021-05-11
#8 2 3 2021-05-12
#9 2 4 2021-05-13
#10 2 5 2021-05-14
Does this work for you? No linear models are used.
library(tidyverse)
df2 <-
  df %>%
  mutate(
    ref_date = case_when(
      case == 1 ~ as.Date("2023-01-01"),
      case == 2 ~ as.Date("2021-05-09")
    ),
    date2 = as.Date(day, origin = ref_date)
  )
Output:
> df2
case day date ref_date date2
1 1 1 2023-01-02 2023-01-01 2023-01-02
2 1 2 2023-01-03 2023-01-01 2023-01-03
3 1 3 <NA> 2023-01-01 2023-01-04
4 1 4 <NA> 2023-01-01 2023-01-05
5 1 5 2023-01-06 2023-01-01 2023-01-06
6 2 1 <NA> 2021-05-09 2021-05-10
7 2 2 2021-05-11 2021-05-09 2021-05-11
8 2 3 2021-05-12 2021-05-09 2021-05-12
9 2 4 2021-05-13 2021-05-09 2021-05-13
10 2 5 <NA> 2021-05-09 2021-05-14
I concede that G.G.'s answer has the advantage that you don't need to hardcode the reference date.
P.S. here is a pure tidyverse solution without any hardcoding:
df2 <-
  df %>%
  mutate(ref_date = date - day) %>%
  group_by(case) %>%
  fill(ref_date, .direction = "downup") %>%
  ungroup() %>%
  mutate(date2 = as.Date(day, origin = ref_date))
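The idea behind this no-hardcoding version translates to any language: subtract day from each known date to recover a per-case reference date, propagate it within the group, then add day back. A rough stdlib-Python sketch of the same arithmetic (names are illustrative):

```python
from datetime import date, timedelta

# (case, day, date) rows; None marks a missing observation date.
rows = [
    (1, 1, date(2023, 1, 2)), (1, 2, date(2023, 1, 3)),
    (1, 3, None), (1, 4, None), (1, 5, date(2023, 1, 6)),
    (2, 1, None), (2, 2, date(2021, 5, 11)),
    (2, 3, date(2021, 5, 12)), (2, 4, date(2021, 5, 13)), (2, 5, None),
]

# ref_date = date - day; any non-missing row in a case recovers it,
# and then date = ref_date + day reconstructs the missing cells.
ref = {}
for case, day, d in rows:
    if d is not None and case not in ref:
        ref[case] = d - timedelta(days=day)

filled = [(c, day, d if d is not None else ref[c] + timedelta(days=day))
          for c, day, d in rows]
```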
1) Convert case to factor and then use predict with lm to fill in the NA's. No packages are used.
within(df, {
  case <- factor(case)
  date <- .Date(predict(lm(date ~ case/day), data.frame(case, day)))
})
giving
case day date
1 1 1 2023-01-02
2 1 2 2023-01-03
3 1 3 2023-01-04
4 1 4 2023-01-05
5 1 5 2023-01-06
6 2 1 2021-05-10
7 2 2 2021-05-11
8 2 3 2021-05-12
9 2 4 2021-05-13
10 2 5 2021-05-14
2) Find the mean day and date and then use day to appropriately offset each row.
library(dplyr) # version 1.1.0 or later
df %>%
  mutate(date = {
    Mean <- Map(mean, na.omit(pick(date, day)))
    Mean$date + day - Mean$day
  }, .by = case)
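Approach (2) relies on the fact that date is linear in day within each case, so the offset mean(date) - mean(day), computed over the rows where date is known, acts as the implied "day zero". In hypothetical Python terms (one case only, for brevity):

```python
from datetime import date, timedelta

# (day, date) pairs for one case; None marks a missing date.
rows = [(1, date(2023, 1, 2)), (2, date(2023, 1, 3)),
        (3, None), (4, None), (5, date(2023, 1, 6))]

# Average the known (day, date) pairs; mean(date) - mean(day) is the
# implied origin, valid because date increases by exactly 1 per day.
known = [(d, dt) for d, dt in rows if dt is not None]
mean_day = sum(d for d, _ in known) / len(known)
base = date(1970, 1, 1)
mean_date = sum((dt - base).days for _, dt in known) / len(known)

filled = [(d, dt if dt is not None
           else base + timedelta(days=round(mean_date - mean_day + d)))
          for d, dt in rows]
```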

Count observations over rolling 30 day window

I need to create a variable that counts the number of observations that have occurred in the last 30 days for each id.
For example, imagine an observation that occurs on 1/2/2021 (d / m / y) for the id "a". If this observation is the first between 1/1/2021 and 1/2/2021 for the id "a" the variable must give 1. If it is the second, 2, etc.
Here is a larger example:
dat <- tibble::tribble(
~id, ~q, ~date,
"a", 1, "01/01/2021",
"a", 1, "01/01/2021",
"a", 1, "21/01/2021",
"a", 1, "21/01/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"b", 1, "02/02/2021",
"b", 1, "02/02/2021",
"b", 1, "22/02/2021",
"b", 1, "22/02/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021")
dat$date <- lubridate::dmy(dat$date)
The result should be:
id q date newvar
a 1 01/01/2021 1
a 1 01/01/2021 2
a 1 21/01/2021 3
a 1 21/01/2021 4
a 1 12/02/2021 3
a 1 12/02/2021 4
a 1 12/02/2021 5
a 1 12/02/2021 6
b 1 02/02/2021 1
b 1 02/02/2021 2
b 1 22/02/2021 3
b 1 22/02/2021 4
b 1 13/03/2021 3
b 1 13/03/2021 4
b 1 13/03/2021 5
b 1 13/03/2021 6
Thank you very much.
With sapply and between, count the number of observations prior to the current observation that are within 30 days.
library(lubridate)
library(dplyr)
dat %>%
  group_by(id) %>%
  mutate(newvar = sapply(seq(length(date)),
                         function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
# A tibble: 16 x 4
# Groups: id [2]
id q date newvar
<chr> <dbl> <date> <int>
1 a 1 2021-01-01 1
2 a 1 2021-01-01 2
3 a 1 2021-01-21 3
4 a 1 2021-01-21 4
5 a 1 2021-02-12 3
6 a 1 2021-02-12 4
7 a 1 2021-02-12 5
8 a 1 2021-02-12 6
9 b 1 2021-02-02 1
10 b 1 2021-02-02 2
11 b 1 2021-02-22 3
12 b 1 2021-02-22 4
13 b 1 2021-03-13 3
14 b 1 2021-03-13 4
15 b 1 2021-03-13 5
16 b 1 2021-03-13 6
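The counting rule itself is language-independent: for each row, count the rows at or before it with the same id whose date falls within the trailing 30-day window. A plain-Python sketch (data and names are illustrative, id "a" only):

```python
from datetime import date, timedelta

rows = [("a", date(2021, 1, 1)), ("a", date(2021, 1, 1)),
        ("a", date(2021, 1, 21)), ("a", date(2021, 1, 21)),
        ("a", date(2021, 2, 12)), ("a", date(2021, 2, 12)),
        ("a", date(2021, 2, 12)), ("a", date(2021, 2, 12))]

window = timedelta(days=30)
newvar = []
for i, (id_, d) in enumerate(rows):
    # Count rows up to and including i, same id, within the trailing window.
    n = sum(1 for id2, d2 in rows[: i + 1]
            if id2 == id_ and d - window <= d2 <= d)
    newvar.append(n)

print(newvar)  # [1, 2, 3, 4, 3, 4, 5, 6]
```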
Left join dat to itself on the indicated condition, grouping by the rows of the left-hand data frame. We assume that you want a 30-day window ending at the current row; if you want one reaching back a full 30 days (a 31-day window), change 29 to 30. Both give the same result for this data.
library(sqldf)
sqldf("select a.*, count(b.date) as newvar
from dat a left join dat b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid")
giving:
id q date newvar
1 a 1 2021-01-01 1
2 a 1 2021-01-01 2
3 a 1 2021-01-21 3
4 a 1 2021-01-21 4
5 a 1 2021-02-12 3
6 a 1 2021-02-12 4
7 a 1 2021-02-12 5
8 a 1 2021-02-12 6
9 b 1 2021-02-02 1
10 b 1 2021-02-02 2
11 b 1 2021-02-22 3
12 b 1 2021-02-22 4
13 b 1 2021-03-13 3
14 b 1 2021-03-13 4
15 b 1 2021-03-13 5
16 b 1 2021-03-13 6
It can also be written in a pipeline, using [.] to denote the input data frame:
dat %>% {
sqldf("select a.*, count(b.date) as newvar
from [.] a left join [.] b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid")
}
This runs roughly twice as fast as sapply on the data in the question.
library(microbenchmark)
microbenchmark(
sqldf = sqldf("select a.*, count(b.date) as newvar
from dat a left join dat b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid"),
sapply = dat %>%
group_by(id) %>%
mutate(newvar = sapply(seq(length(date)),
function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
)
giving:
Unit: milliseconds
expr min lq mean median uq max neval cld
sqldf 26.2768 26.77340 27.97039 27.0082 27.29515 63.1032 100 a
sapply 42.8800 43.69345 48.53094 44.1089 45.25275 285.4861 100 b

Sequentially update rows by group using data.table

I am fairly new to R. I have a hypothetical dataset containing prescriptions from various different patients and drug types. What I would like to do is to create episodes of drug use, i.e., I would like to see for how long a patient used the drug. The loop mentioned in post sequentially update rows in data.table works for me, but I am not sure how I can make sure that the loop starts over when encountering a new patient identifier or drug type.
These are some rows from the dataset "AllDrugs":
DrugType ID Duration StartPrescr EndPrescr n
1 1 90 5-3-2020 3-6-2020 1
1 2 30 7-1-2020 6-2-2020 1
1 2 30 14-1-2020 12-6-2020 2
1 2 30 21-01-2020 19-6-2020 3
Note: n is a number indicating the prescription by ID and DrugType
This is the current loop:
for (i in 2:nrow(AllDrugs)) {
  if (AllDrugs[i, StartPrescr] >= AllDrugs[i-1, EndPrescr]) {
    AllDrugs[i, EndPrescr := StartPrescr + Duration]
  } else {
    AllDrugs[i, EndPrescr := AllDrugs[i-1, EndPrescr] + Duration]
  }
}
This is what I get:
DrugType ID Duration StartPrescr EndPrescr n
1 1 90 5-3-2020 3-6-2020 1
1 2 30 7-1-2020 3-7-2020 1
1 2 30 14-1-2020 2-8-2020 2
1 2 30 21-01-2020 1-9-2020 3
This is what I want:
DrugType ID Duration StartPrescr EndPrescr n
1 1 90 5-3-2020 3-6-2020 1
1 2 30 7-1-2020 6-2-2020 1
1 2 30 14-1-2020 7-3-2020 2
1 2 30 21-01-2020 6-4-2020 3
How can I shift the prescriptions based on the duration of the prescription by ID and DrugType? Note: this is an example of one drug type, but DrugType could also be 2, or 3 etc.
Does this work for you?
shift_end <- function(en, dur) {
  if (length(en) > 1) for (i in 2:length(en)) en[i] = en[i-1] + dur[i]
  return(en)
}
df[order(ID, DrugType, StartPrescr), EndPrescr := shift_end(EndPrescr, Duration), by = .(ID, DrugType)]
Result:
DrugType ID Duration StartPrescr EndPrescr n
1: 1 1 90 2020-03-05 2020-06-03 1
2: 1 2 30 2020-01-07 2020-02-06 1
3: 1 2 30 2020-01-14 2020-03-07 2
4: 1 2 30 2020-01-21 2020-04-06 3
Data Source:
df <- structure(list(
DrugType = c(1, 1, 1, 1),
ID = c(1, 2, 2, 2),
Duration = c(90, 30, 30, 30),
StartPrescr = structure(c(18326,18268, 18275, 18282), class = "Date"),
EndPrescr = structure(c(18416, 18298, 18425, 18432), class = "Date"),
n = c(1, 1, 2, 3)), row.names = c(NA,-4L),
class = c("data.table", "data.frame")
)
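The per-group recurrence is simple to state in any language: keep the first row's end date, then each subsequent end date is the previous end date plus that row's duration. A hypothetical Python version of the same logic, applied to the ID 2 / DrugType 1 group from the question:

```python
from datetime import date, timedelta

def shift_end(ends, durations):
    """Within one (ID, DrugType) group: keep the first end date, then
    chain each subsequent end as previous end + that row's duration."""
    out = list(ends)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + timedelta(days=durations[i])
    return out

ends = [date(2020, 2, 6), date(2020, 6, 12), date(2020, 6, 19)]
durs = [30, 30, 30]
print(shift_end(ends, durs))
# [datetime.date(2020, 2, 6), datetime.date(2020, 3, 7), datetime.date(2020, 4, 6)]
```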

R function to give group all previous dates when a condition occurs

My code is written in R, where I have a table consisting of 3 variables: a date, an ID and a path. The table is sorted by ID first, then by date. When path is 0, I need to group all previous path numbers for that ID into one line and record the first date (Date_Start) and the date on which Path = 0 occurred (Date_End). This needs to be done per ID.
For example, the second row in the desired result table: path = 0 occurred on 2018-10-08 for ID 5, meaning that all the paths of the previous dates need to be grouped together as path = 1,0,3,0, with Date_Start = 2018-10-05 and Date_End = 2018-10-08.
Source table
Date ID Path
2018-10-05 5 1
2018-10-06 5 0
2018-10-07 5 3
2018-10-08 5 0
2018-10-06 5 4
2018-10-08 7 5
2018-10-07 8 2
2018-10-08 8 1
2018-10-09 8 0
Desired result:
Date_Start Date_End ID Index Path
2018-10-05 2018-10-06 5 1 1,0
2018-10-05 2018-10-08 5 2 1,0,3,0
2018-10-06 2018-10-06 5 3 4
2018-10-08 2018-10-08 7 4 5
2018-10-07 2018-10-09 8 5 2,1,0
Thank you in advance!
Along with ID, we can create another grouping variable that advances wherever Path becomes 0, and get the first and last Date of each group. To gather all the previous Path numbers, we check whether the last value ends with 0 and, if so, also replace that group's Date_Start with the first value.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(group = lag(cumsum(Path == 0), default = 0)) %>%
  group_by(ID, group) %>%
  summarise(Date_Start = first(Date),
            Date_End = last(Date),
            Path = toString(Path)) %>%
  mutate(Path = paste_values(Path),
         Date_Start = replace(Date_Start, endsWith(Path, "0"), first(Date_Start))) %>%
  ungroup %>%
  dplyr::select(-group) %>%
  mutate(Index = row_number())
# A tibble: 5 x 5
# ID Date_Start Date_End Path Index
# <int> <fct> <fct> <chr> <int>
#1 5 2018-10-05 2018-10-06 1, 0 1
#2 5 2018-10-05 2018-10-08 1, 0, 3, 0 2
#3 5 2018-10-06 2018-10-06 4 3
#4 7 2018-10-08 2018-10-08 5 4
#5 8 2018-10-07 2018-10-09 2, 1, 0 5
where I define paste_values function as
paste_values <- function(value) {
  sapply(seq_along(value), function(x) {
    if (endsWith(value[x], "0")) toString(value[1:x])
    else value[x]
  })
}
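The grouping trick here (lag(cumsum(Path == 0))) means: a row's group id is the number of zeros seen strictly before it, so each zero closes its own group and the next row opens a new one. An illustrative Python rendering, using the Path values for ID 5:

```python
paths = [1, 0, 3, 0, 4]  # ID 5 from the question

# Group id = count of zeros strictly before this row; a zero therefore
# belongs to the group it closes, mirroring lag(cumsum(Path == 0)).
groups, zeros = [], 0
for p in paths:
    groups.append(zeros)
    if p == 0:
        zeros += 1

print(groups)  # [0, 0, 1, 1, 2]
```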

Multi-conditional mutate

I have a data frame that requires conditional recoding of a column based on the date listed in certain rows for each subset of IDs. I am trying to figure out how to best achieve this using the mutate function in dplyr. Suggestions and alternate solutions are welcome, but I would like to avoid using for loops.
I know how to write a really verbose and inefficient for loop that would solve this problem, but would like to know how to do it more efficiently.
The sample data frame:
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                 date = as.Date(c("2016-02-01", "2016-02-01", "2016-02-01", "2016-03-21", "2016-03-21", "2016-03-21", "2016-10-05", "2016-10-05", "2016-10-05", "2016-10-05", "2016-03-01", "2016-03-01", "2016-03-01", "2016-04-21", "2016-04-21", "2016-04-21", "2016-11-05", "2016-11-05", "2016-11-05", "2016-11-05")),
                 trial = c(NA, NA, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, NA))
My pseudo code - the second logical argument in the first two case_when statements is where I am stuck.
df %>%
  group_by(ID) %>%
  mutate(results = case_when(
    is.na(trial) & date < date where trial = 1 ~ 0,
    is.na(trial) & date > date where trial = 1 ~ 2,
    trial == trial
  ))
The expected result being:
data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
date = as.Date(c("2016-02-01","2016-02-01","2016-02-01","2016-03-21", "2016-03-21", "2016-03-21", "2016-10-05", "2016-10-05", "2016-10-05", "2016-10-05", "2016-03-01","2016-03-01","2016-03-01","2016-04-21", "2016-04-21", "2016-04-21", "2016-11-05", "2016-11-05", "2016-11-05", "2016-11-05")),
trial = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2)
)
An option would be to group by 'ID' and transform 'trial' by applying the run-length id (rleid) to the 'trial' column:
library(dplyr)
library(data.table)
df %>%
  group_by(ID) %>%
  mutate(trial = rleid(trial) - 1)
# A tibble: 20 x 3
# Groups: ID [2]
# ID date trial
# <dbl> <date> <dbl>
# 1 1 2016-02-01 0
# 2 1 2016-02-01 0
# 3 1 2016-02-01 0
# 4 1 2016-03-21 1
# 5 1 2016-03-21 1
# 6 1 2016-03-21 1
# 7 1 2016-10-05 2
# 8 1 2016-10-05 2
# 9 1 2016-10-05 2
#10 1 2016-10-05 2
#11 2 2016-03-01 0
#12 2 2016-03-01 0
#13 2 2016-03-01 0
#14 2 2016-04-21 1
#15 2 2016-04-21 1
#16 2 2016-04-21 1
#17 2 2016-11-05 2
#18 2 2016-11-05 2
#19 2 2016-11-05 2
#20 2 2016-11-05 2
Or using rle
df %>%
  group_by(ID) %>%
  mutate(trial = with(rle(is.na(trial)),
                      rep(seq_along(values), lengths)) - 1)
Converting your pseudo code to real code, we can use which.max(trial == 1) to get the first occurrence where trial = 1 for each group. This assumes there is at least one entry of 1 in trial for each ID.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(trial = case_when(is.na(trial) & date < date[which.max(trial == 1)] ~ 0,
                           is.na(trial) & date > date[which.max(trial == 1)] ~ 2,
                           TRUE ~ trial))
# ID date trial
# <dbl> <date> <dbl>
# 1 1 2016-02-01 0
# 2 1 2016-02-01 0
# 3 1 2016-02-01 0
# 4 1 2016-03-21 1
# 5 1 2016-03-21 1
# 6 1 2016-03-21 1
# 7 1 2016-10-05 2
# 8 1 2016-10-05 2
# 9 1 2016-10-05 2
#10 1 2016-10-05 2
#11 2 2016-03-01 0
#12 2 2016-03-01 0
#13 2 2016-03-01 0
#14 2 2016-04-21 1
#15 2 2016-04-21 1
#16 2 2016-04-21 1
#17 2 2016-11-05 2
#18 2 2016-11-05 2
#19 2 2016-11-05 2
#20 2 2016-11-05 2
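The rleid idea labels each run of equal consecutive values with an increasing integer; subtracting one in R then yields 0/1/2 for the NA-block, 1-block, NA-block pattern. A rough Python equivalent (zero-based from the start, so no subtraction is needed; None stands in for NA):

```python
def rleid(values):
    """Run-length id: the label increments each time the value changes
    between consecutive elements."""
    ids, run = [], 0
    for i, v in enumerate(values):
        if i > 0 and v != values[i - 1]:
            run += 1
        ids.append(run)
    return ids

# trial column for one ID, condensed: NA, NA, 1, 1, NA
print(rleid([None, None, 1, 1, None]))  # [0, 0, 1, 1, 2]
```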